repeats(1)                  General Commands Manual                 repeats(1)


NAME

       repeats.pl, repeats.py - search for duplicate files


SYNOPSIS

       repeats.pl [-h(elp)] [-1(_line_output)] [-a algorithm]
       [-i(nit_rs) bytes] [-l(inks_hard)] [-m(in_fs) bytes] [-r ramp_factor]
       [-v(erbose)] [-z(ero_include)] [file...] [directory...]

       repeats.py [-h(elp)] [-1(_line_output)] [-a algorithm]
       [-i(nit_rs) bytes] [-l(inks_hard)] [-m(in_fs) bytes] [-r ramp_factor]
       [-v(erbose)] [-z(ero_include)] [file...] [directory...]


DESCRIPTION

       repeats.pl (written in perl and utilizing routines from the CryptX
       module) and repeats.py (written in python and utilizing routines from
       the hashlib module) both search for duplicate files in one or more
       specified directories, using a three-stage process.  This process works
       as follows:

       Initially, all specified files and all files in the specified
       directories (and their subdirectories) are listed as potential
       duplicates.  In the optional first stage, any files which are actually
       hardlinks to another file are removed, since they don't take up any
       additional disk space.  In the second stage, all files with a unique
       filesize are declared unique and are removed from the list.  In the
       third stage, all files for which the first 256 bytes (adjustable with
       the -i option) have a unique filehash are declared unique and are
       removed from the list, and then the third stage is repeated with
       larger and larger reads until all files have been verified as having a
       match or have been declared unique and removed from the list.

       This process is MUCH less disk and CPU intensive than creating full
       hashes for all files.  The duff, dupd, fdupes, jdupes, and rdfind
       commands utilize similar strategies.
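
       The three-stage process above can be sketched in python as follows.
       This is a simplified, hypothetical illustration (the find_repeats
       helper is not part of repeats.py) that assumes the default behaviors
       for hardlinks and zero-length files:

```python
import os
import hashlib
from collections import defaultdict

def find_repeats(roots, init_rs=256, ramp=4.0, algo="blake2b"):
    # Stage 1 (hardlink pruning): keep one path per (device, inode) pair.
    seen_inodes, candidates = set(), []
    for root in roots:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                if st.st_size == 0:
                    continue  # zero-length files are skipped by default
                if (st.st_dev, st.st_ino) in seen_inodes:
                    continue
                seen_inodes.add((st.st_dev, st.st_ino))
                candidates.append((path, st.st_size))

    # Stage 2: a file with a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for path, size in candidates:
        by_size[size].append(path)
    groups = [(s, p) for s, p in by_size.items() if len(p) > 1]

    # Stage 3: hash a growing prefix of each file, splitting groups by
    # digest, until a group shrinks to one member (unique) or the whole
    # file has been read (confirmed duplicates).
    results = []
    for size, paths in groups:
        readsize, live = init_rs, [paths]
        while live:
            nxt = []
            for group in live:
                by_hash = defaultdict(list)
                for path in group:
                    with open(path, "rb") as f:
                        digest = hashlib.new(algo, f.read(readsize)).digest()
                    by_hash[digest].append(path)
                for sub in by_hash.values():
                    if len(sub) < 2:
                        continue  # unique prefix hash: not a duplicate
                    elif readsize >= size:
                        results.append(sub)  # full file hashed and matched
                    else:
                        nxt.append(sub)  # still ambiguous: read more
            live, readsize = nxt, int(readsize * ramp)
    return results
```

       Because most non-duplicates are eliminated by size or by a short
       prefix hash, only genuine duplicate candidates are ever read in full.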


OPTIONS

       -h     Print help and quit.

       -1     Print each set of duplicate files on a single line.

       -a hash_algorithm
              Specify which hash algorithm should be used.  Choices are
              1 (MD5), 2 (SHA1), 3 (SHA224), 4 (SHA256), 5 (SHA384),
              6 (SHA512), 7 (BLAKE2B-256), and 8 (BLAKE2B-512).  The default
              is 8, for BLAKE2B-512 hashes.

       -i initial_readsize
              Specify the number of bytes read per file at the beginning of
               stage 3.  The default is 256.

       -l     List files that are actually hardlinks as duplicates.  Normally,
              only the first hardlink sharing an i-node number is included as
               a possible repeat.  [This skips stage 1.]

       -m minimum_filesize
              Specify the minimum size of file (in bytes) that will be
              considered for duplicates.  This allows the user to focus only
              on larger files.  The default value is 1, which includes all
              non-zero files.

       -r ramp_factor
              In repeats.pl and repeats.py, stage 3 is run repeatedly with the
               number of bytes read in each round being multiplied by the "ramp
              factor" value.  The default value is 4.0.

       -v     Verbose output.  Write some statistics concerning number of
              potential duplicates found at each stage to stderr.

       -z     Include zero-length files in the search.  If there is more than
              one zero-length file, all of those files will be considered
              duplicates.
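
       With the default -i and -r values, the per-file read size grows
       geometrically between stage-3 rounds.  A short python sketch (the
       read_sizes helper is hypothetical) of the progression for a 1 MiB
       file, assuming rounds stop once the read size reaches the file size:

```python
def read_sizes(init_rs=256, ramp=4.0, filesize=2**20):
    # Successive stage-3 read sizes: init_rs, init_rs*ramp, ...
    sizes, n = [], init_rs
    while n < filesize:
        sizes.append(n)
        n = int(n * ramp)
    return sizes

print(read_sizes())  # → [256, 1024, 4096, 16384, 65536, 262144]
```

       A larger ramp_factor reaches full-file reads in fewer rounds at the
       cost of reading more bytes from files that would have been
       disambiguated by a shorter prefix.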


NOTES

       If no directory is specified, the current directory is assumed.

       In terms of program history, the legacy sh- and C-based repeats utility
       was written first (in 2004).  The repeats.pl utility was written in
       2020 to explore new algorithms and it currently implements a multi-step
       stage 3 algorithm that requires less disk I/O than the legacy repeats
       did.  The repeats.py utility was written in 2023 to see how Python
       performance differed from Perl.  The Python version is slightly faster
       on Linux and slightly slower on Cygwin.


BUGS

       It must be noted that it is theoretically possible (though freakishly
       improbable) for two different files to be listed as duplicates: if
       they have the same size and the same file hash, they will be listed
       as duplicates.  The odds of two different files (of the same size)
       being listed as duplicates are approximately 1.16e77 to 1 for the
       SHA256 hash.  Using arguments similar to the classic "birthday
       paradox" (i.e., the probability of two people sharing the same
       birthday in a room of only 23 people is greater than 50%), it can be
       shown that it would take approximately 4.01e38 different files (of
       exactly the same size) to achieve similar odds.  In other words,
       it'll probably never happen.  Ever.  However, it's not inconceivable.
       You have been warned.

       For the various hashes, the number of same-sized files required for the
       probability of a false positive to reach 50% are as follows:

       MD5:          2.17e19 files
       SHA1:         1.42e24 files
       SHA224:       6.11e33 files
       SHA256:       4.01e38 files  (default prior to version 1.2.0)
       SHA384:       7.39e57 files
       SHA512:       1.36e77 files
       BLAKE2B-256:  4.01e38 files
       BLAKE2B-512:  1.36e77 files  (default for versions 1.2.0 and later)

       See https://en.wikipedia.org/wiki/Birthday_problem and
       https://en.wikipedia.org/wiki/Birthday_attack for more information.
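
       The figures above follow from the standard birthday-bound
       approximation n ~ sqrt(2 * 2^b * ln 2) for a b-bit hash, which can
       be checked with a few lines of python (an illustration only, not
       part of the utilities):

```python
import math

def files_for_half_collision(bits):
    # Number of same-sized files for a ~50% chance of at least one
    # hash collision among uniformly distributed `bits`-bit digests.
    return math.sqrt(2 * (2.0 ** bits) * math.log(2))

for name, bits in [("MD5", 128), ("SHA1", 160), ("SHA256", 256), ("SHA512", 512)]:
    # Matches the table above, e.g. SHA256 -> 4.01e+38 files.
    print(f"{name}: {files_for_half_collision(bits):.2e} files")
```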

       Also, repeats.py and repeats.pl currently lack logic to mark hardlinks
       (under the -l option) as duplicates without actually reading the entire
       file multiple times.  This will be addressed in a future version of
       littleutils.


SEE ALSO

       perl(1), python(1), CryptX(3pm), duff(1), dupd(1), fdupes(1),
       jdupes(1), rdfind(1)


COPYRIGHT

       Copyright (C) 2004-2026 by Brian Lindholm.  This program is free
       software; you can use it, redistribute it, and/or modify it under the
       terms of the GNU General Public License as published by the Free
       Software Foundation; either version 3, or (at your option) any later
       version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

littleutils                       2026 Jan 01                       repeats(1)
