Linux: find duplicate files by hash, list them, and remove duplicates safely

Finding identical files (same bytes) on Linux is a content problem: compare checksums, not just names. Searching for duplicates by name only catches clashing basenames; two different photos both called IMG_0001.jpg are not necessarily byte-identical. This page focuses on hash-based deduplication, then briefly covers the name-only case.

Commands below were run with GNU findutils, GNU coreutils (sha256sum, sort), and Bash 5.2.37 on Ubuntu 25.04 (kernel 6.14.0-37-generic). On BSD or macOS, find has no -printf and sha256sum may be missing, so adjust the commands that rely on them.
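One possible adjustment for systems without sha256sum is to pick the hashing command at run time. The sketch below assumes `shasum -a 256` (the Perl script shipped with macOS) as the fallback, which prints the same hash, two spaces, path layout. The `hasher` array and the `hash_tree` function are illustrative names, not part of the original commands.

```shell
#!/usr/bin/env bash
# Pick a SHA-256 tool: GNU sha256sum if present, else shasum -a 256 (assumed fallback).
if command -v sha256sum >/dev/null 2>&1; then
  hasher=(sha256sum)
else
  hasher=(shasum -a 256)
fi

# hash_tree DIR: print "hash  path" for every regular file under DIR.
hash_tree() {
  find "$1" -type f -exec "${hasher[@]}" {} +
}
```

With this in place, `hash_tree "$root"` can stand in for the `find … -exec sha256sum {} +` stage of the pipelines on this page.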

List duplicate files (same content) under a directory

find emits every regular file, sha256sum prints each file's hash and path, and awk groups the paths that share a hash:

bash


root=/path/to/scan
find "$root" -type f -exec sha256sum {} + | awk '
{
  h=substr($0,1,64)
  p=substr($0,67)
  paths[h]=paths[h] p ORS
  c[h]++
}
END {
  for (h in c)
    if (c[h] > 1)
      printf "---- %s (%d copies)\n%s", h, c[h], paths[h]
}'

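GNU sha256sum prints hash␠␠path (two spaces after the 64 hex digits). The substr($0,67) slice keeps paths with spaces intact. One caveat: when a file name contains a backslash or a newline, GNU sha256sum starts the line with a backslash and escapes those characters, so the fixed offsets no longer line up and such names need separate handling.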
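To sanity-check the slice on a path with a space, the pipeline can be tried on a throwaway tree first. The following is a minimal sketch: `list_dupes` is an illustrative wrapper around the command above, and the file names are made up for the demo.

```shell
#!/usr/bin/env bash
# list_dupes DIR: the grouping pipeline from above, wrapped in a function.
list_dupes() {
  find "$1" -type f -exec sha256sum {} + | awk '
  {
    h = substr($0, 1, 64)          # digest: first 64 characters
    p = substr($0, 67)             # path: everything after the two separator characters
    paths[h] = paths[h] p ORS
    c[h]++
  }
  END {
    for (h in c)
      if (c[h] > 1)
        printf "---- %s (%d copies)\n%s", h, c[h], paths[h]
  }'
}

# Throwaway tree: two identical files (one with a space in its name) and one unique file.
demo=$(mktemp -d)
printf 'same bytes\n'  > "$demo/photo copy.jpg"
printf 'same bytes\n'  > "$demo/photo.jpg"
printf 'other bytes\n' > "$demo/notes.txt"

list_dupes "$demo"
```

The run should print one group header marked "(2 copies)" followed by the two photo paths, and nothing for notes.txt. Remove the tree afterwards with rm -rf "$demo".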

Remove duplicate files on Linux: keep one copy, delete the rest (interactive)

Never parse ls. Hash first, rewrite each line as hash<TAB>path (GNU sha256sum: two spaces after the digest, path starts at column 67), sort so equal sums are adjacent, then walk the groups in Bash:

bash


#!/usr/bin/env bash
set -euo pipefail
root=${1:?usage: $0 /path/to/scan}

prev_sum=
group=()

flush_group() {
  ((${#group[@]} <= 1)) && { group=(); return 0; }
  local keep="${group[0]}"
  printf '---- %d identical files, keeping:\n%s\n' "${#group[@]}" "$keep" >&2
  local f ans
  for f in "${group[@]:1}"; do
    read -r -p "delete \"$f\"? [y/N] " ans || true
    if [[ ${ans:-N} == [yY]* ]]; then
      rm -f -- "$f" && printf 'deleted: %s\n' "$f" >&2
    fi
  done
  group=()
}

# Feed the list on fd 3 so the prompt in flush_group still reads answers from stdin (the terminal).
while IFS=$'\t' read -r sum path <&3; do
  [[ -z $sum ]] && continue
  if [[ -n ${prev_sum:-} && $sum != "$prev_sum" ]]; then
    flush_group
  fi
  prev_sum=$sum
  group+=("$path")
done 3< <(
  find "$root" -type f -exec sha256sum {} + |
    awk '{ print substr($0,1,64) "\t" substr($0,67) }' |
    sort -t $'\t' -k1,1
)
flush_group

This flow always keeps the first path in each sorted group (alphabetical by path within a hash). Change the sort keys if you prefer a different survivor, for example so that the newest file wins.
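If the newest copy should be the one that survives, the sort key can include the modification time. The following is a sketch, assuming GNU `stat -c %Y` (mtime in epoch seconds) and paths without tabs or newlines; `hash_newest_first` is an illustrative name. It emits the same hash<TAB>path lines the loop above reads, ordered so the most recently modified file comes first in each group.

```shell
#!/usr/bin/env bash
# hash_newest_first DIR: print hash<TAB>path, newest mtime first within each hash.
hash_newest_first() {
  local line sum path mtime
  find "$1" -type f -exec sha256sum {} + |
    while IFS= read -r line; do
      sum=${line:0:64}                             # 64 hex digits
      path=${line:66}                              # skip the two separator characters
      mtime=$(stat -c %Y -- "$path") || continue   # skip files that vanished meanwhile
      printf '%s\t%s\t%s\n' "$sum" "$mtime" "$path"
    done |
    sort -t $'\t' -k1,1 -k2,2nr |                  # group by hash, then newest first
    cut -f1,3                                      # drop the mtime column again
}
```

To use it, the find, awk, and sort commands inside the `<( … )` of the script could be replaced with `hash_newest_first "$root"`; the rest of the loop stays the same.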

Dry-run: print `rm` lines for every duplicate except the first

After sort -k1,1, emit rm only when the hash repeats:

bash


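find "$root" -type f -exec sha256sum {} + | sort -k1,1 | awk '
function shquote(s,    n, a, i, out) {   # \047 is the single-quote character
  n = split(s, a, "\047"); out = a[1]
  for (i = 2; i <= n; i++) out = out "\047\\\047\047" a[i]
  return "\047" out "\047"
}
{
  h=substr($0,1,64); p=substr($0,67)
  if (h==prev && prev!="") print "rm -f -- " shquote(p)
  prev=h
}'

Review the lines, then pipe them to sh or bash only if you accept every target. Taking p as everything after the hash keeps the whole path together, but that alone does not make the output safe to execute: the shell would split an unquoted path at spaces. shquote therefore wraps each path in single quotes and escapes embedded single quotes, so names with spaces or shell metacharacters survive the second pass. Names containing newlines are still not handled.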
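One way to make the review step concrete is to write the plan to a file, read it, and only then execute it. The sketch below is a Bash-only variant that quotes each path with `printf %q`, so the plan has to be run with bash rather than sh. `emit_rm_plan` and the plan file name are illustrative, and paths containing newlines are assumed not to occur.

```shell
#!/usr/bin/env bash
# emit_rm_plan DIR: print one "rm -f -- <quoted path>" line per duplicate,
# skipping the first file of every hash group.
emit_rm_plan() {
  local prev='' line sum path
  find "$1" -type f -exec sha256sum {} + | sort -k1,1 |
    while IFS= read -r line; do
      sum=${line:0:64}
      path=${line:66}
      if [[ $sum == "$prev" ]]; then
        printf 'rm -f -- %q\n' "$path"    # %q escapes spaces, quotes, and globs for bash
      fi
      prev=$sum
    done
}
```

A typical session might be `emit_rm_plan /path/to/scan > rm-plan.sh`, then reading rm-plan.sh in a pager, then `bash rm-plan.sh` once every target looks right. Nothing is deleted until that last step.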

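For a packaged alternative, fdupes -r DIR and jdupes -r DIR list duplicate groups without deleting anything; deletion only happens with -d. Note that -n is not a dry-run switch (in fdupes it is --noempty, which skips zero-length files). Read man jdupes before running jdupes -d on real data.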

Faster checks: size first, then hash

On huge trees, identical files can be found faster by grouping candidates by size first and running sha256sum only within equal-size buckets. find -printf '%s\t%p\n' feeds awk cleanly, and files with a unique size never need to be opened or read.
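The size-first idea can be sketched as a two-stage pipeline. This assumes GNU find (-printf), GNU xargs (-r and -d), and paths without newlines; `list_dupes_by_size_then_hash` is an illustrative name. The first awk stage passes along only paths whose size is shared by at least one other file, and the second stage is the same hash grouping used earlier on this page.

```shell
#!/usr/bin/env bash
# list_dupes_by_size_then_hash DIR: hash only files whose size is shared with another file.
list_dupes_by_size_then_hash() {
  find "$1" -type f -printf '%s\t%p\n' |
    awk -F'\t' '
    {
      size = $1
      path = substr($0, length(size) + 2)     # keep tabs inside the path, if any
      if (++n[size] == 1) {
        first[size] = path                    # hold back until a second file has this size
      } else {
        if (n[size] == 2) print first[size]   # release the held-back path once
        print path
      }
    }' |
    xargs -r -d '\n' sha256sum -- |
    awk '
    {
      h = substr($0, 1, 64); p = substr($0, 67)
      paths[h] = paths[h] p ORS
      c[h]++
    }
    END {
      for (h in c)
        if (c[h] > 1)
          printf "---- %s (%d copies)\n%s", h, c[h], paths[h]
    }'
}
```

The trade-off is one held-back path per distinct size in awk's memory, in exchange for never reading files that cannot have a twin.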

Find duplicate files by name (not content)

Same basename, possibly different bytes:

bash


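find "$root" -type f -printf '%f\t%p\n' | sort | awk -F'\t' '
# First repeat of a basename: print a header and the path held back from the previous line.
$1==prev { if (!shown) { print "---- " prev; print prevpath; shown=1 } print $2; next }
{ prev=$1; prevpath=$2; shown=0 }
'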

Use this when you care about naming collisions, not byte-identical copies.

Finding repeated lines inside one text file is a different problem: use sort file | uniq -d or count lines with awk. This page is about duplicate files on disk.
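For completeness, both approaches to repeated lines fit in a line each. This is a minimal sketch with illustrative function names: the sort-based one prints repeated lines in sorted order, while the awk counter keeps the order of the file and needs no sort.

```shell
#!/usr/bin/env bash
# dup_lines_sorted FILE: each repeated line once, in sorted order.
dup_lines_sorted() {
  sort -- "$1" | uniq -d
}

# dup_lines_in_order FILE: each repeated line once, printed at its second
# occurrence, so the order of the file is preserved.
dup_lines_in_order() {
  awk 'seen[$0]++ == 1' "$1"
}
```

For a file containing the lines b, a, b, c, a, b, the first function prints a then b, and the second prints b then a.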

Related

See the awk tutorial for heavier grouping logic.
See "check if script is already running" if you wrap long dedupe jobs with a lock.

Summary

Finding duplicate files with confidence means running find … -type f -exec sha256sum {} + (or md5sum) and grouping on the hash column. Removal should default to listing and prompting, or to a dry-run before rm. Identical-file detection is content-driven; a by-name search is only a basename report and does not prove that files match. Combine size filters with hashes on large trees, and prefer find -exec … + over parsing ls output in any Bash duplicate-finding script you ship. For duplicate lines inside one file, use sort | uniq -d instead of hashing paths.
