repeats(1) General Commands Manual repeats(1)
NAME
repeats, repeats.pl, and repeats.py - search for duplicate files
SYNOPSIS
repeats [-a algorithm] [-h(elp)] [-i(nit_rs) bytes] [-l(inks_hard)]
[-p(aranoid)] [-v(erbose)] [-z(ero_include)] [directory...]
repeats.pl [-1(_line_output)] [-a algorithm] [-h(elp)]
[-i(nit_rs) bytes] [-l(inks_hard)] [-m(in_fs) bytes] [-r ramp_factor]
[-v(erbose)] [-z(ero_include)] [file...] [directory...]
repeats.py [-1(_line_output)] [-a algorithm] [-h(elp)]
[-i(nit_rs) bytes] [-l(inks_hard)] [-m(in_fs) bytes] [-r ramp_factor]
[-v(erbose)] [-z(ero_include)] [file...] [directory...]
DESCRIPTION
repeats (written in C and sh), repeats.pl (written in Perl and
utilizing routines from the CryptX module), and repeats.py (written in
Python and utilizing routines from the hashlib module) all search for
duplicate files among the specified files and directories, using a
three-, four-, or five-stage process. This process works as follows:
Initially, all specified files and all files in the specified
directories (and their subdirectories) are listed as potential
duplicates. In the optional first stage, any file that is actually a
hardlink to another listed file is removed, since hardlinks take up no
additional disk space. In the second stage, all files with a unique
filesize are declared unique and are removed from the list. In the
third stage, all files for which the first 16384 bytes (for repeats) or
256 bytes (for repeats.pl and repeats.py), both adjustable with the -i
option, have a unique filehash are declared unique and are removed from
the list. In the fourth stage, all files which have a unique filehash
(computed over the entire file) are declared unique and are removed
from the list. And in the optional fifth stage, all files with matching
filehashes are compared using cmp and are printed to stdout if they
match.
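The overall strategy can be illustrated with the following simplified
Python sketch (an illustration of the size-then-hash filtering idea
only, not the actual implementation; the stage numbers in the comments
follow the description above, and the hardlink and cmp stages are
omitted):

    import collections, hashlib, os

    def group_by(paths, keyfunc):
        # Group paths by key; keep only groups with more than one member.
        groups = collections.defaultdict(list)
        for path in paths:
            groups[keyfunc(path)].append(path)
        return [g for g in groups.values() if len(g) > 1]

    def partial_hash(path, nbytes=16384):
        # Stage 3: hash only the first nbytes of the file.
        with open(path, 'rb') as f:
            return hashlib.blake2b(f.read(nbytes)).hexdigest()

    def full_hash(path):
        # Stage 4: hash the entire file in 64 KiB blocks.
        h = hashlib.blake2b()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(65536), b''):
                h.update(block)
        return h.hexdigest()

    def find_duplicates(paths):
        duplicates = []
        # Stage 2: keep only files whose size is shared by another file.
        for group in group_by(paths, os.path.getsize):
            for subgroup in group_by(group, partial_hash):
                duplicates.extend(group_by(subgroup, full_hash))
        return duplicates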
This process is MUCH less disk- and CPU-intensive than computing full
hashes for all files. It is implemented using a combination of the
filehash, filenode, filesize, rep_hash, rep_node, rep_size, and
tempname utilities. The duff, dupd, fdupes, jdupes, and rdfind
commands utilize similar strategies.
OPTIONS
-1 Print each set of duplicate files on a single line. This option
is available only in repeats.pl and repeats.py.
-a hash_algorithm
Specify which hash algorithm should be used. Choices are
1 (MD5), 2 (SHA1), 3 (SHA224), 4 (SHA256), 5 (SHA384),
6 (SHA512), 7 (BLAKE2B-256), and 8 (BLAKE2B-512). The default
is 8, for BLAKE2B-512 hashes.
-h Print help and quit.
-i initial_readsize
Specify the number of bytes read per file at the beginning of
stage 3. The default is 16384 for repeats, and 256 for
repeats.pl and repeats.py.
-l List files that are actually hardlinks as duplicates. Normally,
only the first hardlink sharing an i-node number is included as
a possible repeat. [This skips stage 1.]
-m minimum_filesize
In repeats.pl and repeats.py, this option can be used to specify
the minimum file size (in bytes) that will be considered when
searching for duplicates. This allows the user to focus only on
larger files. The default value is 1, which includes all files
of non-zero length.
-p Perform a final cmp-based "paranoia" check to absolutely ensure
that listed duplicates are truly duplicates. Using this option
can result in each duplicate being read completely two or three
times, which can substantially increase execution time when
duplicates of large files are present. [This is stage 5 and is
only available in repeats.]
-r ramp_factor
In repeats.pl and repeats.py, stage 3 is run repeatedly in place
of stage 4, with the number of bytes read in each round being
multiplied by the "ramp factor" value. The default value is 4.0.
For example, with the default initial read size of 256 bytes and
a ramp factor of 4.0, successive rounds read 256, 1024, 4096, ...
bytes per file.
-v Verbose output. Write some statistics concerning the number of
potential duplicates found at each stage to stderr.
-z Include zero-length files in the search. If there is more than
one zero-length file, all of those files will be considered
duplicates.
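EXAMPLES
To search the current directory tree for duplicates using the default
BLAKE2B-512 hash:
    repeats
To search two (hypothetical) backup directories with verbose statistics
and a final cmp-based verification:
    repeats -v -p /backup1 /backup2
To search a (hypothetical) archive directory for duplicates of at least
one megabyte, printing each set of duplicates on a single line:
    repeats.pl -1 -m 1048576 /archive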
NOTES
If no directory is specified, the current directory is assumed.
In terms of program history, the repeats utility was written first (in
2004). The repeats.pl utility was written in 2020 to explore new
algorithms, and it currently implements a multi-step stage 3 algorithm
that requires less disk I/O than repeats. The repeats.py utility was
written in 2023 to see how Python performance differed from Perl. Both
scripts run slightly slower than repeats on Linux for most data sets
but faster on Cygwin.
BUGS
It must be noted that it is theoretically possible (though freakishly
improbable) for two different files to be listed as duplicates if the
-p option is not used. If they have the same size and the same
filehash, they will be listed as duplicates. The odds against two
different files (of the same size) being listed as duplicates are
approximately 1.16e77 to 1 for the SHA256 hash. Using arguments similar
to the classic "birthday paradox" (i.e., the probability of two people
sharing the same birthday in a room of only 23 people is greater than
50%), it can be shown that it would take approximately 4.01e38
different files (of exactly the same size) for the probability of at
least one false positive to reach 50%. In other words, it'll probably
never happen. Ever. However, it's not inconceivable. You have been
warned.
For the various hashes, the number of same-sized files required for the
probability of a false positive to reach 50% are as follows:
MD5: 2.17e19 files
SHA1: 1.42e24 files
SHA224: 6.11e33 files
SHA256: 4.01e38 files (default prior to version 1.2.0)
SHA384: 7.39e57 files
SHA512: 1.36e77 files
BLAKE2B-256: 4.01e38 files
BLAKE2B-512: 1.36e77 files (default for versions 1.2.0 and later)
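These thresholds follow from the standard birthday-bound approximation:
for a b-bit hash, the number of same-sized files needed for a 50%
collision probability is roughly sqrt(2 ln 2) * 2^(b/2), or about
1.18 * 2^(b/2). A short illustrative Python check (not part of the
littleutils package):

    import math

    def files_for_50_percent(bits):
        # Birthday approximation: n ~ sqrt(2 * ln(2) * d), where
        # d = 2**bits is the number of possible hash values.
        return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)

    for name, bits in [("MD5", 128), ("SHA1", 160), ("SHA256", 256),
                       ("BLAKE2B-512", 512)]:
        print("%s: %.2e files" % (name, files_for_50_percent(bits)))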
See https://en.wikipedia.org/wiki/Birthday_problem and
https://en.wikipedia.org/wiki/Birthday_attack for more information. If
this extremely remote risk is too much to bear, use repeats with the -p
option.
Also, repeats, repeats.pl, and repeats.py currently lack logic to mark
hardlinks (under the -l option) as duplicates without actually reading
the entire file multiple times. This will be addressed in a future
version of littleutils.
And finally, repeats will malfunction if asked to examine files that
have one or more "tab" (0x09) characters in the filename, as tab
characters are used as delimiters in the temporary working files that
repeats creates. If scanning a data set with embedded tabs in the
filenames, use repeats.pl or repeats.py instead, as they maintain
filenames as separate strings in memory.
SEE ALSO
filehash(1), filenode(1), filesize(1), perl(1), python(1), CryptX(3),
rep_hash(1), rep_node(1), rep_size(1), duff(1), dupd(1), fdupes(1),
jdupes(1), rdfind(1)
COPYRIGHT
Copyright (C) 2004-2024 by Brian Lindholm. This program is free
software; you can use it, redistribute it, and/or modify it under the
terms of the GNU General Public License as published by the Free
Software Foundation; either version 3, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
littleutils 2025 Mar 22 repeats(1)