repeats(1) General Commands Manual repeats(1)
NAME
repeats, repeats.pl, and repeats.py - search for duplicate files
SYNOPSIS
repeats [-a hash_algorithm] [-h(elp)] [-l(inks_hard)] [-m bytes_for_partial] [-p(aranoid)] [-v(erbose)] [-z(ero_include)] [directory...]
repeats.pl [-1(_line_output)] [-a hash_algorithm] [-h(elp)] [-l(inks_hard)] [-m bytes_for_partial] [-r ramp_factor] [-v(erbose)] [-z(ero_include)] [directory...]
repeats.py [-1(_line_output)] [-a hash_algorithm] [-h(elp)] [-l(inks_hard)] [-m bytes_for_partial] [-r ramp_factor] [-v(erbose)] [-z(ero_include)] [directory...]
DESCRIPTION
repeats (written in C and sh), repeats.pl (written in Perl and utilizing routines from the CryptX module), and repeats.py (written in Python and utilizing routines from the digest module) all search for duplicate files in one or more specified directories, using a three-, four-, or five-stage process. The process works as follows:

Initially, all files in the specified directories (and all of their subdirectories) are listed as potential duplicates. In the first stage, all files with a unique filesize are declared unique and are removed from the list. In the optional second stage, any files that are actually hardlinks to another file are removed, since they don't take up any additional disk space. In the third stage, all files for which the first 16384 (for repeats) or 1024 (for repeats.pl and repeats.py) bytes (both adjustable with the -m option) have a unique filehash are declared unique and are removed from the list. In the fourth stage, all files that have a unique filehash (for the entire file) are declared unique and are removed from the list. And in the optional fifth stage, all files with matching filehashes are compared using cmp and are printed to stdout if they match.

This process is MUCH less disk and CPU intensive than creating full hashes for all files. It is implemented using a combination of the filehash, filenode, filesize, rep_hash, rep_node, rep_size, and tempname utilities. The duff, dupd, fdupes, jdupes, and rdfind commands utilize similar strategies.
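The overall strategy can be pictured with a minimal Python sketch. This is illustrative only, not the littleutils implementation: it assumes Python's hashlib with BLAKE2b, uses hypothetical helper names such as find_repeats and group_by, and omits the hardlink, ramp, zero-length, and cmp handling described above.

    #!/usr/bin/env python3
    # Simplified illustration of the size / partial-hash / full-hash strategy.
    import hashlib, os, sys
    from collections import defaultdict

    def group_by(paths, key):
        groups = defaultdict(list)
        for p in paths:
            groups[key(p)].append(p)
        # Keep only groups that still contain potential duplicates.
        return [g for g in groups.values() if len(g) > 1]

    def file_hash(path, nbytes=None):
        h = hashlib.blake2b()
        with open(path, 'rb') as f:
            if nbytes is None:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            else:
                h.update(f.read(nbytes))
        return h.hexdigest()

    def find_repeats(directories, partial=16384):
        files = [os.path.join(root, name)
                 for d in directories
                 for root, _, names in os.walk(d)
                 for name in names]
        files = [f for f in files if os.path.isfile(f) and os.path.getsize(f) > 0]
        # Stage 1: files with a unique size cannot be duplicates.
        candidates = group_by(files, os.path.getsize)
        # Stage 3: files whose first bytes hash uniquely cannot be duplicates.
        candidates = [g2 for g in candidates
                      for g2 in group_by(g, lambda p: file_hash(p, partial))]
        # Stage 4: remaining files are grouped by full-file hash.
        return [g2 for g in candidates
                for g2 in group_by(g, file_hash)]

    if __name__ == '__main__':
        for group in find_repeats(sys.argv[1:] or ['.']):
            print('\n'.join(group), end='\n\n')

Each stage performs only the minimum work needed to prove a file unique, which is why most files are never read in full.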
OPTIONS
-1     Print each set of duplicate files on a single line. This option is available only in repeats.pl and repeats.py.

-a hash_algorithm
       Specify which hash algorithm should be used. Choices are 1 (MD5), 2 (SHA1), 3 (SHA224), 4 (SHA256), 5 (SHA384), 6 (SHA512), 7 (BLAKE2B-256), and 8 (BLAKE2B-512). The default is 8, for BLAKE2B-512 hashes.

-h     Print help and quit.

-l     List files that are actually hardlinks as duplicates. Normally, only the first hardlink sharing an i-node number is included as a possible repeat. [This skips stage 2.]

-m bytes_for_partial
       Specify the number of bytes read per file in stage 3.

-p     Perform a final cmp-based "paranoia" check to absolutely ensure that listed duplicates are truly duplicates. Using this option can result in each duplicate being read completely two or three times, which can substantially increase execution time when duplicates of large files are present. [This is stage 5 and is only available in repeats.]

-r ramp_factor
       In repeats.pl and repeats.py, stage 3 is run repeatedly in place of stage 4, with the number of bytes read in each round being multiplied by the "ramp factor" value (see the sketch after this list). The default value is 4.0.

-v     Verbose output. Write some statistics concerning the number of potential duplicates found at each stage to stderr.

-z     Include zero-length files in the search. If there is more than one zero-length file, all of those files will be considered duplicates.
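The ramped stage 3 used with -r can be pictured with a small sketch. The starting size, stopping rule, and function name below are illustrative assumptions, not the exact repeats.pl/repeats.py behavior: after each round, only files that still share a partial hash with another file are re-read with ramp_factor times as many bytes.

    # Illustrative ramped partial-hash round sizes (assumed behavior):
    # each round reads ramp_factor times more bytes than the previous one,
    # and the final round covers the entire file.
    def ramped_rounds(start_bytes=1024, ramp_factor=4.0, largest_file=10**9):
        nbytes = start_bytes
        while nbytes < largest_file:
            yield nbytes              # e.g. 1024, 4096, 16384, 65536, ...
            nbytes = int(nbytes * ramp_factor)
        yield largest_file            # final round covers the entire file

    print(list(ramped_rounds(largest_file=500_000)))
    # [1024, 4096, 16384, 65536, 262144, 500000]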
NOTES
If no directory is specified, the current directory is assumed.

In terms of program history, the repeats utility was written first (in 2004). The repeats.pl utility was written in 2020 to explore new algorithms, and it currently implements a multi-step stage 3 algorithm that requires less disk I/O than repeats. The repeats.py utility was written in 2023 to see how Python performance differed from Perl. Both run slightly slower than repeats on Linux for most data sets, but faster on Cygwin.
BUGS
It must be noted that it is theoretically possible (though freakishly improbable) for two different files to be listed as duplicates if the -p option is not used. If they have the same size and the same filehash, they will be listed as duplicates. The odds against two different files (of the same size) being listed as duplicates are approximately 1.16e77 to 1 for the SHA256 hash. Using arguments similar to the classic "birthday paradox" (i.e., the probability of two people sharing the same birthday in a room of only 23 people is greater than 50%), it can be shown that it would take approximately 4.01e38 different files (of exactly the same size) to achieve similar odds. In other words, it'll probably never happen. Ever. However, it's not inconceivable. You have been warned.

For the various hashes, the number of same-sized files required for the probability of a false positive to reach 50% is as follows:

MD5:          2.17e19 files
SHA1:         1.42e24 files
SHA224:       6.11e33 files
SHA256:       4.01e38 files (default prior to version 1.2.0)
SHA384:       7.39e57 files
SHA512:       1.36e77 files
BLAKE2B-256:  4.01e38 files
BLAKE2B-512:  1.36e77 files (default for versions 1.2.0 and later)

See https://en.wikipedia.org/wiki/Birthday_problem and https://en.wikipedia.org/wiki/Birthday_attack for more information. If this extremely remote risk is too much to bear, use repeats with the -p option.

Also, repeats, repeats.pl, and repeats.py currently lack logic to mark hardlinks (under the -l option) as duplicates without actually reading the entire file multiple times. This will be addressed in a future version of littleutils.

And finally, repeats will malfunction if asked to examine files that have one or more "tab" (0x09) characters in the filename, as tab characters are used as delimiters in the temporary working files that repeats creates. If scanning a data set with embedded tabs in the filenames, use repeats.pl or repeats.py instead, as they maintain filenames as separate strings in memory.
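For reference, the figures above follow from the standard birthday-bound approximation n ≈ sqrt(2 · ln 2 · 2^b), where b is the hash length in bits. A short Python check (illustrative only, not part of littleutils) reproduces them:

    # Approximate number of same-sized files needed for a 50% chance of a
    # hash collision (birthday bound): n ~= sqrt(2 * ln(2) * 2**bits).
    import math

    hashes = {"MD5": 128, "SHA1": 160, "SHA224": 224, "SHA256": 256,
              "SHA384": 384, "SHA512": 512,
              "BLAKE2B-256": 256, "BLAKE2B-512": 512}

    for name, bits in hashes.items():
        n = math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)
        print(f"{name:12s} {n:.2e} files")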
SEE ALSO
filehash(1), filenode(1), filesize(1), perl(1), python(1), CryptX(3pm), rep_hash(1), rep_node(1), rep_size(1), duff(1), dupd(1), fdupes(1), jdupes(1), rdfind(1)
COPYRIGHT
Copyright (C) 2004-2023 by Brian Lindholm. This program is free software; you can use it, redistribute it, and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
littleutils                      2023 May 03                       repeats(1)