repeats(1)                                                          repeats(1)




NAME

       repeats and repeats.pl - search for duplicate files


SYNOPSIS

       repeats      [-a      hash_algorithm]     [-h(elp)]     [-l(inks_hard)]
       [-m  bytes_for_partial]  [-p(aranoid)]  [-v(erbose)]  [-z(ero_include)]
       [directory...]

       repeats.pl    [-1(_line_output)]    [-a    hash_algorithm]    [-h(elp)]
       [-l(inks_hard)] [-m bytes_for_partial]  [-r  ramp_factor]  [-v(erbose)]
       [-z(ero_include)] [directory...]


DESCRIPTION

       repeats  (written in C and sh) and repeats.pl (written in Perl and uti-
       lizing routines from the CryptX module) both search for duplicate files
       in  one  or more specified directories, using a three-, four-, or five-
       stage process.  This process works as follows:

       Initially, all files in the specified directories  (and  all  of  their
       subdirectories)  are  listed  as  potential  duplicates.   In the first
       stage, all files with a unique filesize are  declared  unique  and  are
       removed from the list.  In the optional second stage, any file that is
       actually a hardlink to another file is removed, since it doesn't take
       up any additional disk space.  In the third stage, all files for which
       the first 65536 (for repeats) or 4096 (for repeats.pl) bytes (both
       adjustable with the -m option) have a unique filehash are declared
       unique and are removed from the list.  In the fourth stage, all files
       with a unique filehash for the entire file are declared unique and are
       removed from the list.  Finally, in the optional fifth stage,
       all  files  with  matching  filehashes  are  compared using cmp and are
       printed to stdout if they match.
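
       As an illustration of this winnowing approach, the following is a
       minimal Perl sketch of stages 1 and 3 (size grouping followed by
       partial hashing).  It is not the actual repeats or repeats.pl
       implementation: it uses the core Digest::SHA module in place of
       CryptX, and the partial_hash helper name is hypothetical.

              #!/usr/bin/perl
              use strict;
              use warnings;
              use File::Find;
              use Digest::SHA;

              # Stage 1: group files by size; a file whose size is unique
              # cannot have a duplicate.  Zero-length files are skipped,
              # matching the default (no -z) behavior.
              my %by_size;
              find(sub {
                  return unless -f $_ && -s _;
                  push @{ $by_size{-s _} }, $File::Find::name;
              }, @ARGV ? @ARGV : '.');
              my @candidates = grep { @$_ > 1 } values %by_size;

              # Stage 3: hash only the first $nbytes of each file; a unique
              # partial hash also rules out duplication.  (SHA-512 stands in
              # for the BLAKE2B-512 default here.)
              sub partial_hash {
                  my ($path, $nbytes) = @_;
                  open my $fh, '<:raw', $path or return '';
                  read $fh, my $buf, $nbytes;
                  close $fh;
                  return Digest::SHA->new(512)->add($buf)->hexdigest;
              }

              for my $group (@candidates) {
                  my %by_hash;
                  push @{ $by_hash{partial_hash($_, 4096)} }, $_ for @$group;
                  for my $set (grep { @$_ > 1 } values %by_hash) {
                      print join("\t", @$set), "\n";   # potential duplicates
                  }
              }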

       This process is MUCH less disk and CPU  intensive  than  creating  full
       hashes  for  all  files.   It is implemented using a combination of the
       filehash, filenode, filesize, rep_hash, rep_node, rep_size,  and  temp-
       name  utilities.   The  duff, dupd, fdupes, jdupes, and rdfind commands
       utilize similar strategies.


OPTIONS

       -1     Print each set of duplicate files on a single line.  This option
              is available only in repeats.pl.

       -a hash_algorithm
              Specify  which  hash  algorithm  should  be  used.   Choices are
              1  (MD5),  2  (SHA1),  3  (SHA224),  4  (SHA256),  5   (SHA384),
              6  (SHA512),  7 (BLAKE2B-256), and 8 (BLAKE2B-512).  The default
              is 8, for BLAKE2B-512 hashes.

       -h     Print help and quit.

       -l     List files that are actually hardlinks as duplicates.  Normally,
              only  the first hardlink sharing an i-node number is included as
              a possible repeat.  [This skips stage 2.]

       -m bytes_for_partial
              Specify the number of bytes read per file in stage 3.

       -p     Perform a final cmp-based "paranoia" check to absolutely  ensure
              that  listed duplicates are truly duplicates.  Using this option
              can result in each duplicate being read completely two or  three
              times,  which  can  substantially  increase  execution time when
              duplicates of large files are present.  [This is stage 5 and  is
              only available in repeats.]

       -r ramp_factor
              In repeats.pl, stage 3 is run repeatedly in place of stage 4,
              with the number of bytes read in each round being multiplied by
              the ramp factor.  The default value is 4.  For example, with
              the default 4096-byte initial read, successive rounds read the
              first 4096, 16384, 65536, ... bytes of each file.

       -v     Verbose output.  Write some statistics concerning the number of
              potential duplicates found at each stage to stderr.

       -z     Include even zero-length files in the search.  If there is  more
              than one zero-length file, all of those files will be considered
              duplicates.
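
       For example, a typical repeats invocation (with hypothetical paths,
       using only the options documented above) might be:

              repeats -v -p -a 4 /home/user/photos /mnt/backup

       which searches both directory trees using SHA256 hashes, writes per-
       stage statistics to stderr, and cmp-verifies every reported match.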


NOTES

       If no directory is specified, the current directory is assumed.

       In terms of program history, the repeats utility was written first  (in
       2004).   The  repeats.pl utility was written later (in 2020) to explore
       new algorithms; it currently implements a multi-step stage 3 algorithm
       that requires less disk I/O than repeats.  It still runs slightly
       slower than repeats on Linux for most data sets but is actually  faster
       on Cygwin.
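
       A rough sketch of that multi-step idea (hypothetical, not the actual
       repeats.pl code; it reuses the partial_hash helper from the sketch in
       DESCRIPTION): re-partition each candidate group on progressively
       longer prefixes, multiplying the read size by the ramp factor each
       round, until the prefix covers the whole file.

              # Hypothetical ramped stage 3: files in @$group share $size.
              sub ramped_partition {
                  my ($group, $size, $start, $ramp) = @_;
                  my @groups = ($group);
                  for (my $n = $start; ; $n *= $ramp) {
                      my @next;
                      for my $g (@groups) {
                          my %part;
                          push @{ $part{partial_hash($_, $n)} }, $_ for @$g;
                          push @next, grep { @$_ > 1 } values %part;
                      }
                      @groups = @next;
                      # Once $n reaches the file size, the "partial" hash
                      # has covered every byte, so survivors are duplicates.
                      last if !@groups || $n >= $size;
                  }
                  return @groups;
              }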


BUGS

       It must be noted that it is theoretically possible (though freakishly
       improbable) for two different files to be listed as duplicates if the
       -p option is not used.  If they have the same size and the same file
       hash, they will be listed as duplicates.  For the SHA256 hash, the
       odds of two different files (of the same size) colliding are about 1
       in 1.16e77.  Using arguments similar to the classic "birthday paradox"
       (i.e., the probability of two people sharing the same birthday in a
       room of only 23 people is greater than 50%), it can be shown that it
       would take approximately 4.01e38 different files (of exactly the same
       size) before the probability of at least one collision reaches 50%.
       In other words, it'll probably never happen.  Ever.  However, it's not
       inconceivable.  You have been warned.

       For each hash algorithm, the number of same-sized files required for
       the probability of a false positive to reach 50% is as follows:

       MD5:          2.17e19 files
       SHA1:         1.42e24 files
       SHA224:       6.11e33 files
       SHA256:       4.01e38 files  (default prior to version 1.2.0)
       SHA384:       7.39e57 files
       SHA512:       1.36e77 files
       BLAKE2B-256:  4.01e38 files
       BLAKE2B-512:  1.36e77 files  (default as of version 1.2.0)
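
       These counts follow from the standard birthday-bound approximation:
       for a b-bit hash with H = 2^b possible values, the probability of at
       least one collision reaches 50% at roughly sqrt(2 * H * ln 2) files.
       A quick Perl check (a sketch, not part of the utilities) reproduces
       the table:

              #!/usr/bin/perl
              # n ~ sqrt(2 * H * ln 2) with H = 2**$bits: the number of
              # same-sized files at which P(collision) reaches 50%.
              for my $bits (128, 160, 224, 256, 384, 512) {
                  printf "%3d-bit hash: %.2e files\n",
                         $bits, sqrt(2 * log(2)) * 2 ** ($bits / 2);
              }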

       See          https://en.wikipedia.org/wiki/Birthday_problem         and
       https://en.wikipedia.org/wiki/Birthday_attack for more information.  If
       this extremely remote risk is too much to bear, use the -p option.

       Also,  repeats  and  repeats.pl  currently lack logic to mark hardlinks
       (under the -l option) as duplicates without actually reading the entire
       file  multiple  times.   This  will be addressed in a future version of
       littleutils.

       And finally, repeats will malfunction if asked to  examine  files  that
       have  one or more "tab" (0x09) characters in the filename, as tab char-
       acters are used as delimiters  in  the  temporary  working  files  that
       repeats  creates.   If  scanning a data set with embedded "tabs" in the
       filenames, use repeats.pl instead, as it maintains file lists  in  mem-
       ory.


SEE ALSO

       filehash(1), filenode(1), filesize(1), perl(1), CryptX(3pm),
       rep_hash(1), rep_node(1), rep_size(1), duff(1), dupd(1), fdupes(1),
       jdupes(1), rdfind(1)


COPYRIGHT

       Copyright  (C) 2004-2020 by Brian Lindholm.  This program is free soft-
       ware; you can use it, redistribute it, and/or modify it under the terms
       of  the  GNU  General  Public License as published by the Free Software
       Foundation; either version 3, or (at your option) any later version.

       This program is distributed in the hope that it  will  be  useful,  but
       WITHOUT  ANY  WARRANTY;  without  even  the  implied  warranty  of MER-
       CHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU  General
       Public License for more details.



littleutils                       2020 Oct 20                       repeats(1)
