How to find and remove duplicate files using shell script in Linux

 

In my last article I shared multiple commands and methods to check if the node is connected to the internet with a shell script in Linux and Unix. In this article I will show some sample shell scripts to find and remove duplicate files.

 

Script 1: Find duplicate files using shell script

The script template is taken from the Bash Cookbook. I have modified the script to prompt the user before removing any duplicate file, so the user can decide which of the duplicates to delete. For example, /etc/hosts and /etc/hosts.bak may have identical content. The script cannot tell which of the two is important, so without a prompt it might delete /etc/hosts, treating it as the duplicate.
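To see why the file names alone do not help, here is a minimal sketch that checks whether two files have identical content. The helper name same_content and the demo files are made up for illustration, and I am assuming sha512sum from GNU coreutils is available.

```shell
#!/bin/bash
# Sketch: two files are duplicates when their content hashes match,
# whatever they are called. same_content is a hypothetical helper.
same_content() {
  local sum1 sum2
  sum1=$(sha512sum < "$1" | awk '{print $1}')
  sum2=$(sha512sum < "$2" | awk '{print $1}')
  [[ "$sum1" == "$sum2" ]]
}

# Demo with throwaway copies instead of the real /etc/hosts
demo=$(mktemp -d)
echo "127.0.0.1 localhost" > "$demo/hosts"
cp "$demo/hosts" "$demo/hosts.bak"
if same_content "$demo/hosts" "$demo/hosts.bak"; then
  echo "hosts and hosts.bak have identical content"
fi
rm -rf "$demo"
```

Both names pass the check equally, which is why the script asks which one to delete.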

Below is the sample script to remove duplicate files with a prompt before deleting the file:

#!/bin/bash
# Find duplicate files using shell script
# Remove duplicate files interactively
TMP_FILE=$(mktemp /tmp/temp_file.XXX)
DUP_FILE=$(mktemp /tmp/dup_file.XXX)

function add_file() {
  # Record the hash ($1) and filename ($2) of a file seen for the first time
  echo "$1" "$2" >> "$TMP_FILE"
}

function red () {
   # print colored output
   /bin/echo -e "\e[01;31m$1\e[0m" 1>&2
}

function del_file() {
    # Delete the duplicate file and report it only if rm succeeded
    if rm -f -- "$1" 2>/dev/null; then
        red "File \"$1\" deleted"
    fi
}

function srch_file() {

  # Store the filename in this variable
  local NEW="$1"
  local SUM ELEMENT_SUM ELEMENT_FILENAME

  # Not a regular file: print "0" so the caller skips it
  if [ ! -f "$NEW" ]; then
     echo "0"
     return
  fi

  # Get the hash value of the input file once, before the loop
  SUM=$(sha512sum "$NEW" | awk '{print $1}')

  # Read the temporary file line by line. Each line holds a hash followed by a filename.
  # If the temporary file is empty, the loop body never runs.
  while read -r ELEMENT_SUM ELEMENT_FILENAME; do
    # If the hash values are the same, we have found a duplicate file
    if [ "$SUM" == "$ELEMENT_SUM" ]; then
       echo "$ELEMENT_FILENAME" > "$DUP_FILE"
       return 1
    fi
  done < "$TMP_FILE"

  # If no duplicate is found, print the hash value so the caller can record it
  echo "$SUM"
}


function begin_search_and_deduplication {

  # Split the space-separated input into a list of directories
  local DIRS
  read -ra DIRS <<< "$1"

  # Read NUL-delimited filenames on file descriptor 3, so that names with
  # spaces survive and stdin stays free for the prompt below
  while IFS= read -r -d '' FILE <&3; do

     # This stores the output of the srch_file function
     RET=$(srch_file "$FILE")
     if [[ -z "$RET" ]]; then
         # Empty output means a duplicate was found
         FILE1=$(cat "$DUP_FILE")
         echo "$FILE1 is a duplicate of $FILE"
         while true; do
            read -rp "Which file you wish to delete? $FILE1 (or) $FILE: " ANS
            if [ "$ANS" == "$FILE1" ]; then
               del_file "$FILE1"
               break
            elif [ "$ANS" == "$FILE" ]; then
               del_file "$FILE"
               break
            fi
         done
     elif [[ "$RET" != "0" ]]; then
         # The hash is new, so its entry is added to the temporary file
         add_file "$RET" "$FILE"
     fi
  done 3< <(find "${DIRS[@]}" -type f -print0)
}

# Read the space-separated list of directories to search for duplicate files
echo "Enter directory name to search: "
echo "Press [ENTER] when ready"
echo

read -r DIR

begin_search_and_deduplication "${DIR}"
# Delete the temporary files once done
rm -f "$TMP_FILE" "$DUP_FILE"

When you execute the script, it prompts for the list of directories in which you wish to find and remove duplicate files in Linux. I have created a few files with duplicate content for the sake of this article.

# /tmp/remove_duplicate_files.sh
Enter directory name to search:
Press [ENTER] when ready

/dir1 /dir2 /dir3 <-- This is my input (search duplicate files in these directories)
/dir1/file1 is a duplicate of /dir1/file2
Which file you wish to delete? /dir1/file1 (or) /dir1/file2: /dir1/file2
File "/dir1/file2" deleted
/dir1/file1 is a duplicate of /dir2/file101
Which file you wish to delete? /dir1/file1 (or) /dir2/file101: /dir2/file101
File "/dir2/file101" deleted

As you can see, the script waits for user input whenever it finds a duplicate file in the provided directories, and then proceeds based on the answer.

 

How it works?

  • I have added comments before most sections of the script to help you understand how it removes duplicate files.
  • For each file found, the script calculates the SHA-512 hash of its contents with sha512sum.
  • Using this hash, we can compare it against the list of hashes already computed.
  • If the hash matches, we have seen the contents of this file before, so one of the two files can be deleted.
  • If the hash is new, we record the entry and move on to calculating the hash of the next file until all files have been hashed.
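The hash-and-compare logic can also be sketched more compactly with an associative array in place of the temporary file. This is only a minimal sketch under my own assumptions: bash 4 or later, GNU find and sort, and a helper name list_duplicates that is not part of the script above. It only reports duplicates and deletes nothing.

```shell
#!/bin/bash
# Sketch: report duplicates by content hash (assumes bash 4+ and GNU tools).
# list_duplicates is a hypothetical helper, not part of the script above.
list_duplicates() {
  local dir="$1" file sum
  local -A seen=()   # hash -> first file seen with that hash

  # Sort the NUL-delimited file list so the "first seen" copy is predictable
  while IFS= read -r -d '' file; do
    sum=$(sha512sum < "$file" | awk '{print $1}')
    if [[ -n "${seen[$sum]:-}" ]]; then
      # The hash is already recorded: same contents as an earlier file
      echo "${seen[$sum]} is a duplicate of $file"
    else
      # New hash: record it and move on to the next file
      seen[$sum]="$file"
    fi
  done < <(find "$dir" -type f -print0 | sort -z)
}

# Demo in a throwaway directory
demo=$(mktemp -d)
echo "hello" > "$demo/a"
echo "hello" > "$demo/b"
echo "world" > "$demo/c"
list_duplicates "$demo"
rm -rf "$demo"
```

Keeping the hashes in memory avoids re-reading a temporary file for every new file, at the cost of requiring a newer bash.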

 

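Now, if you wish the script to find and remove duplicate files automatically, replace the prompt loop (the while true ... done block) in the above script with a single del_file call, so that duplicates are removed as soon as they are found. Calling del_file "$FILE" deletes the copy that was just found and keeps the first copy, whose entry in the temporary file stays valid for later comparisons.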

 

Script 2: Remove duplicate files using shell script

Here we will use awk to find duplicate files. This script finds the copies of the same file in a directory and removes all but one copy of each file.

#!/bin/bash
# Filename: remove_duplicate.sh
# Description: Find and remove duplicate files and
# keep one sample of each file.
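ls -lS --time-style=long-iso | awk 'BEGIN {
  # Skip the "total" line, then read the first file entry
  getline; getline;
  name1=$8; size=$5
}
{
  name2=$8;
  if (size==$5)
{
  # Read each checksum into a variable so that $0 stays intact, and close
  # the command so that it can run again for the same file later
  csum1=""; csum2="";
  cmd="md5sum " name1; if ((cmd | getline out) > 0) { split(out, f, " "); csum1=f[1] }; close(cmd);
  cmd="md5sum " name2; if ((cmd | getline out) > 0) { split(out, f, " "); csum2=f[1] }; close(cmd);
  if ( csum1!="" && csum1==csum2 )
  {
     print name1; print name2
  }
};

size=$5; name1=name2;
}' | sort -u > duplicate_files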

cat duplicate_files | xargs -I {} md5sum {} | sort | uniq -w 32 | awk '{ print $2 }' | sort -u > unique_files

echo Removing..
comm -23 duplicate_files unique_files | tee /dev/stderr | xargs -r rm
echo Removed duplicates files successfully.

You must navigate to the directory where you wish to find and remove duplicate files and then execute the script. Here I want to find duplicate files inside /dir1, so I will cd /dir1 and then execute the script without any arguments.

[root@centos-8 dir1]# /tmp/remove_duplicate.sh
Removing..
file1_copy
file2_copy
Removed duplicates files successfully.

List the files under /dir1 and make sure no duplicate files exist here. The script also creates the duplicate_files and unique_files lists, which are left for your reference.

[root@centos-8 dir1]# ls -l
total 20
-rw-r--r-- 1 root root 34 Jan  9 07:01 duplicate_files
-rw-r--r-- 1 root root 16 Jan  9 06:56 duplicate_sample
-rw-r--r-- 1 root root  5 Jan  9 07:00 file1
-rw-r--r-- 1 root root  6 Jan  9 07:00 file2
-rw-r--r-- 1 root root 12 Jan  9 07:01 unique_files

 

How it works?

ls -lS lists the details of the files in the current folder sorted by file size, largest first. The --time-style=long-iso option tells ls to print dates in the ISO format, which keeps the filename in the eighth column. awk reads the output of ls -lS and compares columns across adjacent rows of the input text to find duplicate files.
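To see the two columns that the awk program relies on, the following minimal sketch prints only the size and the name of each entry. The helper name list_by_size is made up for illustration, and I am assuming GNU ls and file names without spaces.

```shell
#!/bin/bash
# Sketch: show the size ($5) and name ($8) columns used by the awk program.
# list_by_size is a hypothetical helper; assumes GNU ls and names without spaces.
list_by_size() {
  # NR > 1 skips the leading "total" line printed by ls -l
  ls -lS --time-style=long-iso "$1" | awk 'NR > 1 { print $5, $8 }'
}

# Demo in a throwaway directory
demo=$(mktemp -d)
printf '12345' > "$demo/big"
printf '12' > "$demo/small"
list_by_size "$demo"
rm -rf "$demo"
```

Because the list is sorted by size, files with equal sizes end up on adjacent lines, which is what makes the pairwise comparison in awk possible.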

The logic behind the code is as follows:

  1. We list the files sorted by size, so files of the same size will be adjacent. The first step in finding identical files is to find ones with the same size. Next, we calculate the checksum of the files. If the checksums match, the files are duplicates and all but one copy in each group is removed.
  2. The BEGIN{} block of awk is executed before the main processing. It skips the "total" line, reads the first file entry, and initializes the variables. The bulk of the processing takes place in the {} block, when awk reads and processes the rest of the ls output. An END{} block, if present, would be executed after all input has been read; this script does not need one.
  3. In the BEGIN block, we read the first file entry and store the name and size (which are the eighth and fifth columns). When awk enters the {} block, the rest of the lines are read, one by one. This block compares the size obtained from the current line and the previously stored size in the size variable. If they are equal, it means that the two files are duplicates by size and must be further checked by md5sum.
  4. Once the line is read, the entire line is in $0 and each column is available in $1, $2, ..., $n. Here, we read the md5sum checksum of files into the csum1 and csum2 variables. The name1 and name2 variables store the consecutive filenames. If the checksums of two files are the same, they are confirmed to be duplicates and are printed.
  5. We calculate the md5sum value of the duplicates and print one file from each group of duplicates by finding unique lines, comparing md5sum from each line using -w 32 (the first 32 characters in the md5sum output; usually, the md5sum output consists of a 32-character hash followed by the filename). One sample from each group of duplicates is written to unique_files.
  6. Now, we need to remove the files listed in duplicate_files, excluding the files listed in unique_files. The comm command prints files in duplicate_files but not in unique_files.
  7. comm only processes sorted input. Therefore, sort -u is used to sort duplicate_files and unique_files before they are compared.
  8. The tee command is used to pass filenames to the rm command as well as print. The tee command sends its input to both stdout and a file. We can also print text to the terminal by redirecting to stderr. /dev/stderr is the device corresponding to stderr (standard error). By redirecting to a stderr device file, text sent to stdin will be printed in the terminal as standard error.
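Steps 5 to 7 can be tried in isolation, without deleting anything. The following is a minimal sketch, assuming GNU coreutils; the helper name one_per_group and the demo files are my own and not part of the script. The duplicate_files list is written by hand here as a stand-in for the output of the awk stage.

```shell
#!/bin/bash
# Sketch of steps 5 to 7 (assumes GNU coreutils; all names here are made up).

# one_per_group: read file names on stdin, print one name per group of
# files with the same md5sum (hypothetical helper)
one_per_group() {
  xargs -I {} md5sum {} | sort | uniq -w 32 | awk '{ print $2 }' | sort -u
}

demo=$(mktemp -d)
cd "$demo" || exit 1
echo "one" > file1;  cp file1 file1_copy
echo "two!" > file2; cp file2 file2_copy

# Stand-in for the sorted list that the awk stage would produce
printf '%s\n' file1 file1_copy file2 file2_copy | sort -u > duplicate_files

# Step 5: keep one sample from each group of duplicates
one_per_group < duplicate_files > unique_files

# Steps 6 and 7: names listed in duplicate_files but not in unique_files
comm -23 duplicate_files unique_files

cd / && rm -rf "$demo"
```

With these demo files the last command should list file1_copy and file2_copy, which are the names that the real script would hand over to rm.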

 

References:
Linux Shell Scripting Cookbook
Bash cookbook
