rmdupes

2017-05-15 05:55 UTC
  • Xyne

Metadata

Description:

Command-line tool to find and remove duplicate files.

Latest Version:

2017.3.6

Source Code:

src/

Architecture:

  • any

Dependencies:

  • python3

Arch Repositories:

  • [xyne-any]
  • [xyne-i686]
  • [xyne-x86_64]

AUR Page:

rmdupes

Arch Forum Thread:

223750

About

rmdupes is a command-line utility to scan a directory for duplicate files and remove them. The main feature is an option to use a reference directory: all files in the target directory that are duplicates of files in the reference directory will be removed. If no reference directory is given, files in the target directory will be compared against each other.

There is also an option to move files to a backup directory with preserved relative paths.

By default, no files are deleted without confirmation. There are options to perform a dry run, delete without confirmation, and automatically select one file to keep from a set of duplicates (oldest, newest, or first in alphabetical order). There is also an option to supply inclusive and exclusive regular expressions to fine-tune the search, e.g. by file extension or subdirectory.

There is also an option to generate shell scripts to remove the files after inspection. Before you ask why there is such an option, the answer is "because I can".

Algorithm

The algorithm is naïve. All files are first grouped by file size. Files with the same size are then compared using Python's filecmp.cmp function. By default, rmdupes uses deep comparisons (file contents are compared), which guarantees that no false positives will be reported. There is an option (--shallow) to enable shallow comparisons for faster execution.
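
The two passes described above can be sketched roughly as follows. This is a minimal illustration, not the actual rmdupes source: the function name find_duplicate_sets and the grouping strategy are assumptions, and details such as the special handling of hardlinked files are left out.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicate_sets(root, shallow=False):
    """Return lists of paths under root that compare as identical.

    Hypothetical helper for illustration; not part of rmdupes itself.
    """
    # Pass 1: bucket regular files by size. Files of different sizes
    # can never be duplicates, so they are never compared.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size bucket, compare files with filecmp.cmp.
    # shallow=False compares contents; shallow=True trusts os.stat data.
    duplicate_sets = []
    for paths in by_size.values():
        groups = []  # each group holds files equal to its first member
        for path in sorted(paths):
            for group in groups:
                if filecmp.cmp(group[0], path, shallow=shallow):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicate_sets.extend(g for g in groups if len(g) > 1)
    return duplicate_sets
```

Grouping by size first keeps the number of content comparisons small, since only files that could possibly match are ever opened.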

A few quick tests on a large set of files with a lot of duplicates (~4 GB of photos) took about as long as fdupes (same order of magnitude). There are likely faster tools out there but this one is good enough for me and the options are exactly what I need (what a coincidence!).

Usage Examples

With A Reference Directory

Remove all files from foo that are duplicates of files in bar:

rmdupes -r bar foo

bar can be a subdirectory of foo. For example, if you have organized your photos in /home/me/photos and you want to remove leftover copies from other subdirectories in your home directory, use

rmdupes -r /home/me/photos /home/me

Collect all duplicates in a backup directory instead of deleting them:

rmdupes -r /home/me/photos -b /home/me/photos.bak /home/me
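
What "preserved relative paths" means for the backup directory can be illustrated with a short sketch. The helper name move_to_backup is hypothetical, and this version does not handle name collisions in the backup directory (the help message suggests rmdupes appends suffixes in that case).

```python
import os
import shutil


def move_to_backup(path, target_dir, backup_dir):
    """Move path into backup_dir, keeping its path relative to target_dir.

    Hypothetical helper for illustration; collisions are not handled.
    """
    # e.g. target_dir/sub/x.txt -> backup_dir/sub/x.txt
    relative = os.path.relpath(path, target_dir)
    destination = os.path.join(backup_dir, relative)
    # Create any missing subdirectories inside the backup directory.
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.move(path, destination)
    return destination
```

Under this scheme, /home/me/tmp/img.jpg with target /home/me and backup /home/me/photos.bak would end up at /home/me/photos.bak/tmp/img.jpg.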

Without A Reference Directory

Remove all duplicate files in foo with prompts for selecting files and confirming deletions:

rmdupes foo

Same as above but without the deletion confirmation dialogues:

rmdupes --noconfirm foo

Keep the oldest version of a file and automatically remove all others without any confirmation:

rmdupes --noconfirm --keep oldest foo

Check which files would be removed with keep oldest (without deleting them):

rmdupes -n --keep oldest foo

Move all duplicates except for the newest ones to a backup directory:

rmdupes --keep newest -b backup foo
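
The three --keep modes could be modelled as below. This is only a sketch: the function name choose_keeper is hypothetical, and it assumes that "oldest" and "newest" refer to the modification time and that "first" means the alphabetically first path.

```python
import os


def choose_keeper(paths, keep):
    """Pick the one file to keep from a set of duplicate paths.

    Hypothetical helper; assumes age is judged by modification time.
    """
    if keep == "oldest":
        return min(paths, key=os.path.getmtime)
    if keep == "newest":
        return max(paths, key=os.path.getmtime)
    if keep == "first":
        # First in alphabetical order of the path string.
        return min(paths)
    raise ValueError(f"unknown keep mode: {keep!r}")
```

Every other file in the set would then be deleted, or moved to the backup directory when one is given.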

Help Message

$ rmdupes --help

usage: rmdupes [-h] [-r <reference directory> [<reference directory> ...]]
               [-i] [-b <backup directory>] [--restore] [--symlink {abs,rel}]
               [--hardlink] [--copy] [-l] [-n] [--display {script,json}]
               [--noconfirm] [--keep {oldest,newest,first}]
               [-f <i|e><regex> [<i|e><regex> ...]] [--shallow]
               <target directory> [<target directory> ...]

Prune duplicate files.

positional arguments:
  <target directory>    Directories to scan for duplicates.

optional arguments:
  -h, --help            show this help message and exit
  -r <reference directory> [<reference directory> ...], --refdir <reference directory> [<reference directory> ...]
                        Directories of reference files. The target directory
                        will be scanned for duplicates of these files.
  -i, --invert          Invert file selection in the target directories.
                        Instead of selecting duplicates for removal, this will
                        select non-duplicates. This may be useful when using a
                        reference directory to limit files in a target
                        directory to a subset of files in the reference
                        directory.
  -b <backup directory>, --bakdir <backup directory>
                        Move duplicates to a backup directory instead of
                        deleting them. Relative paths are preserved.
  --restore             Attempt to restore files to the target directory/-ies
                        from the backup directory. This will also restore
                        files which were affixed with suffixes. Perform a dry
                        run with this option to check that it does what you
                        want before using it. Use filters to restrict selected
                        files if necessary.
  --symlink {abs,rel}   Create absolute or relative symlinks when deleting
                        files. This does nothing with --invert.
  --hardlink            Try to create hardlinks when deleting files. This may
                        be combined with --symlinks to create symlinks when
                        hardlinks are not possible. This does nothing with
                        --invert.
  --copy                When using a reference diretory and a backup
                        directory, this will copy duplicates of the reference
                        files in the target directory to the backup directory
                        while preserving their relative subpaths. This can be
                        useful to copy a subset of a file hierarchy.
  -l, --list            List duplicates and exit.
  -n, --dryrun          Dry run. List actions on STDOUT.
  --display {script,json}
                        Display dryrun output in the chosen format. Implies
                        --dryrun.
  --noconfirm           Do not prompt for confirmation before deleting files.
  --keep {oldest,newest,first}
                        Automatically select the file to keep in a set of
                        duplicates.
  -f <i|e><regex> [<i|e><regex> ...]
                        Regular expression filters: prefix with "i" for
                        inclusive, "e" for exclusive. The patterns are applied
                        in order. The last one that matches determines if the
                        file is included or excluded. For example, to exclude
                        everything in a directory named "foo" except for a
                        subdirectory named "bar", use "-f e^foo/ i^foo/bar/".
  --shallow             Compare files by os.stat only. See Python's filecmp
                        library for details. This is faster than the default
                        mode which compares files by content, but may result
                        in false positives. If unsure, try a dry run first.
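
The "last matching filter wins" rule from the -f description can be sketched as follows. The function name is_included is hypothetical. Two points are assumptions rather than documented behaviour: that a file matching no filter is included by default, and that patterns are searched against the path relative to the target directory.

```python
import re


def is_included(path, filters, default=True):
    """Apply "i<regex>" / "e<regex>" filters in order; the last match wins.

    Hypothetical helper; the default for unmatched paths is an assumption.
    """
    included = default
    for spec in filters:
        mode, pattern = spec[0], spec[1:]
        if mode not in ("i", "e"):
            raise ValueError(f"filter must start with 'i' or 'e': {spec!r}")
        if re.search(pattern, path):
            # A later matching filter overrides any earlier decision.
            included = (mode == "i")
    return included


# The example from the help text: exclude foo/ except for foo/bar/.
filters = ["e^foo/", "i^foo/bar/"]
print(is_included("foo/a.txt", filters))      # excluded by e^foo/
print(is_included("foo/bar/a.txt", filters))  # re-included by i^foo/bar/
```

Because order matters, swapping the two filters would exclude everything under foo, including foo/bar.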

CHANGELOG

2017-03-06

  • Added --copy option.
  • Fixed handling of filters if first is inclusive.

2017-03-05

  • Code cleanup.
  • New options: --link, --symlink, --restore, --invert, --display.
  • Restrict scans to real files.
  • Hardlinked files are not considered duplicates.
  • Removed --script option. Use "--display script" instead.
  • JSON data output with --display option.

2017-03-04

  • Fixed bug with --keep oldest and --keep newest (they were switched).
  • Added sanity checks to prevent false positives when scanning nested target directories.
  • Added --list option to list duplicates.

Contact

echo xyne.archlinux.ca | sed 's/\./@/'