repeats(1) General Commands Manual repeats(1)
NAME
repeats, repeats.pl, and repeats.py - search for duplicate files
SYNOPSIS
repeats [-a hash_algorithm] [-h(elp)] [-l(inks_hard)] [-m bytes_for_partial] [-p(aranoid)] [-v(erbose)] [-z(ero_include)] [directory...]
repeats.pl [-1(_line_output)] [-a hash_algorithm] [-h(elp)] [-l(inks_hard)] [-m bytes_for_partial] [-r ramp_factor] [-v(erbose)] [-z(ero_include)] [directory...]
repeats.py [-1(_line_output)] [-a hash_algorithm] [-h(elp)] [-l(inks_hard)] [-m bytes_for_partial] [-r ramp_factor] [-v(erbose)] [-z(ero_include)] [directory...]
DESCRIPTION
repeats (written in C and sh), repeats.pl (written in Perl and utilizing routines from the CryptX module), and repeats.py (written in Python and utilizing routines from the digest module) all search for duplicate files in one or more specified directories, using a three-, four-, or five-stage process. The process works as follows:

Initially, all files in the specified directories (and all of their subdirectories) are listed as potential duplicates. In the first stage, all files with a unique filesize are declared unique and are removed from the list. In the optional second stage, any files that are actually hardlinks to another file are removed, since they don't take up any additional disk space. In the third stage, all files for which the first 16384 (for repeats) or 1024 (for repeats.pl and repeats.py) bytes (both adjustable with the -m option) have a unique filehash are declared unique and are removed from the list. In the fourth stage, all files that have a unique filehash (for the entire file) are declared unique and are removed from the list. And in the optional fifth stage, all files with matching filehashes are compared using cmp and are printed to stdout if they match.

This process is MUCH less disk and CPU intensive than creating full hashes for all files. It is implemented using a combination of the filehash, filenode, filesize, rep_hash, rep_node, rep_size, and tempname utilities. The duff, dupd, fdupes, jdupes, and rdfind commands utilize similar strategies.
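The overall strategy can be pictured with a minimal Python sketch. This is illustrative only, not the littleutils implementation: it assumes Python's hashlib with BLAKE2b, uses hypothetical helper names such as find_repeats and group_by, and omits the hardlink, ramp, zero-length, and cmp handling described above.

    #!/usr/bin/env python3
    # Simplified illustration of the size / partial-hash / full-hash strategy.
    import hashlib, os, sys
    from collections import defaultdict

    def group_by(paths, key):
        groups = defaultdict(list)
        for p in paths:
            groups[key(p)].append(p)
        # Keep only groups that still contain potential duplicates.
        return [g for g in groups.values() if len(g) > 1]

    def file_hash(path, nbytes=None):
        h = hashlib.blake2b()
        with open(path, 'rb') as f:
            if nbytes is None:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            else:
                h.update(f.read(nbytes))
        return h.hexdigest()

    def find_repeats(directories, partial=16384):
        files = [os.path.join(root, name)
                 for d in directories
                 for root, _, names in os.walk(d)
                 for name in names]
        files = [f for f in files if os.path.isfile(f) and os.path.getsize(f) > 0]
        # Stage 1: files with a unique size cannot be duplicates.
        candidates = group_by(files, os.path.getsize)
        # Stage 3: files whose first bytes hash uniquely cannot be duplicates.
        candidates = [g2 for g in candidates
                      for g2 in group_by(g, lambda p: file_hash(p, partial))]
        # Stage 4: remaining files are grouped by full-file hash.
        return [g2 for g in candidates
                for g2 in group_by(g, file_hash)]

    if __name__ == '__main__':
        for group in find_repeats(sys.argv[1:] or ['.']):
            print('\n'.join(group), end='\n\n')

Each stage performs only the minimum work needed to prove a file unique, which is why most files are never read in full.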
OPTIONS
-1     Print each set of duplicate files on a single line. This option is available only in repeats.pl and repeats.py.

-a hash_algorithm
       Specify which hash algorithm should be used. Choices are 1 (MD5), 2 (SHA1), 3 (SHA224), 4 (SHA256), 5 (SHA384), 6 (SHA512), 7 (BLAKE2B-256), and 8 (BLAKE2B-512). The default is 8, for BLAKE2B-512 hashes.

-h     Print help and quit.

-l     List files that are actually hardlinks as duplicates. Normally, only the first hardlink sharing an i-node number is included as a possible repeat. [This skips stage 2.]

-m bytes_for_partial
       Specify the number of bytes read per file in stage 3.

-p     Perform a final cmp-based "paranoia" check to absolutely ensure that listed duplicates are truly duplicates. Using this option can result in each duplicate being read completely two or three times, which can substantially increase execution time when duplicates of large files are present. [This is stage 5 and is only available in repeats.]

-r ramp_factor
       In repeats.pl and repeats.py, stage 3 is run repeatedly in place of stage 4, with the number of bytes read in each round being multiplied by the "ramp factor" value (see the sketch after this list). The default value is 4.0.

-v     Verbose output. Write some statistics concerning the number of potential duplicates found at each stage to stderr.

-z     Include zero-length files in the search. If there is more than one zero-length file, all of those files will be considered duplicates.
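The ramped stage 3 used with -r can be pictured with a small sketch. The starting size, stopping rule, and function name below are illustrative assumptions, not the exact repeats.pl/repeats.py behavior: after each round, only files that still share a partial hash with another file are re-read with ramp_factor times as many bytes.

    # Illustrative ramped partial-hash round sizes (assumed behavior):
    # each round reads ramp_factor times more bytes than the previous one,
    # and the final round covers the entire file.
    def ramped_rounds(start_bytes=1024, ramp_factor=4.0, largest_file=10**9):
        nbytes = start_bytes
        while nbytes < largest_file:
            yield nbytes              # e.g. 1024, 4096, 16384, 65536, ...
            nbytes = int(nbytes * ramp_factor)
        yield largest_file            # final round covers the entire file

    print(list(ramped_rounds(largest_file=500_000)))
    # [1024, 4096, 16384, 65536, 262144, 500000]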
NOTES
If no directory is specified, the current directory is assumed.

In terms of program history, the repeats utility was written first (in 2004). The repeats.pl utility was written in 2020 to explore new algorithms, and it currently implements a multi-step stage 3 algorithm that requires less disk I/O than repeats. The repeats.py utility was written in 2023 to see how Python performance differed from Perl. Both run slightly slower than repeats on Linux for most data sets, but faster on Cygwin.
BUGS
It must be noted that it is theoretically possible (though freakishly improbable) for two different files to be listed as duplicates if the -p option is not used. If they have the same size and the same filehash, they will be listed as duplicates. The odds against two different files (of the same size) being listed as duplicates are approximately 1.16e77 to 1 for the SHA256 hash. Using arguments similar to the classic "birthday paradox" (i.e., the probability of two people sharing the same birthday in a room of only 23 people is greater than 50%), it can be shown that it would take approximately 4.01e38 different files (of exactly the same size) to achieve similar odds. In other words, it'll probably never happen. Ever. However, it's not inconceivable. You have been warned.

For the various hashes, the number of same-sized files required for the probability of a false positive to reach 50% is as follows:

MD5:          2.17e19 files
SHA1:         1.42e24 files
SHA224:       6.11e33 files
SHA256:       4.01e38 files (default prior to version 1.2.0)
SHA384:       7.39e57 files
SHA512:       1.36e77 files
BLAKE2B-256:  4.01e38 files
BLAKE2B-512:  1.36e77 files (default for versions 1.2.0 and later)

See https://en.wikipedia.org/wiki/Birthday_problem and https://en.wikipedia.org/wiki/Birthday_attack for more information. If this extremely remote risk is too much to bear, use repeats with the -p option.

Also, repeats, repeats.pl, and repeats.py currently lack logic to mark hardlinks (under the -l option) as duplicates without actually reading the entire file multiple times. This will be addressed in a future version of littleutils.

And finally, repeats will malfunction if asked to examine files that have one or more "tab" (0x09) characters in the filename, as tab characters are used as delimiters in the temporary working files that repeats creates. If scanning a data set with embedded tabs in the filenames, use repeats.pl or repeats.py instead, as they maintain filenames as separate strings in memory.
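For reference, the figures above follow from the standard birthday-bound approximation n ≈ sqrt(2 · ln 2 · 2^b), where b is the hash length in bits. A short Python check (illustrative only, not part of littleutils) reproduces them:

    # Approximate number of same-sized files needed for a 50% chance of a
    # hash collision (birthday bound): n ~= sqrt(2 * ln(2) * 2**bits).
    import math

    hashes = {"MD5": 128, "SHA1": 160, "SHA224": 224, "SHA256": 256,
              "SHA384": 384, "SHA512": 512,
              "BLAKE2B-256": 256, "BLAKE2B-512": 512}

    for name, bits in hashes.items():
        n = math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)
        print(f"{name:12s} {n:.2e} files")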
SEE ALSO
filehash(1), filenode(1), filesize(1), perl(1), python(1), CryptX(3pm), rep_hash(1), rep_node(1), rep_size(1), duff(1), dupd(1), fdupes(1), jdupes(1), rdfind(1)
COPYRIGHT
Copyright (C) 2004-2023 by Brian Lindholm. This program is free software; you can use it, redistribute it, and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
littleutils                      2023 May 03                       repeats(1)