In my last article I shared multiple commands and methods to check whether a node is connected to the internet using a shell script in Linux and Unix. In this article I will share some sample shell scripts to find and remove duplicate files.
Script 1: Find duplicate files using shell script
The script template is taken from the Bash Cookbook. I have modified the script to prompt the user before removing any duplicate file, so the user can decide which of the duplicate files he/she wishes to delete. For example, /etc/hosts and /etc/hosts.bak may contain exactly the same content, so the script cannot know which file is important and might delete /etc/hosts, considering it a duplicate.
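To see why the script cannot tell the original from the copy, note that two files with identical content always produce the same checksum. A quick demonstration (the copy under /tmp is created only for this illustration):
# Create a throwaway copy of /etc/hosts for the demonstration
cp /etc/hosts /tmp/hosts.bak
# Both files print the exact same SHA-512 hash, so a purely
# content-based check cannot decide which one to keep
sha512sum /etc/hosts /tmp/hosts.bak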
Below is the sample script to remove duplicate files with a prompt before deleting the file:
#!/bin/bash
# Find duplicate files using shell script
# Remove duplicate files interactively
TMP_FILE=$(mktemp /tmp/temp_file.XXX)
DUP_FILE=$(mktemp /tmp/dup_file.XXX)
function add_file() {
# Record the file's hash and name in the temporary file
echo "$1" "$2" >> $TMP_FILE
}
function red () {
# print colored output
/bin/echo -e "\e[01;31m$1\e[0m" 1>&2
}
function del_file() {
# Delete the duplicate file
rm -f "$1" 2>/dev/null
[[ $? == 0 ]] && red "File \"$1\" deleted"
}
function srch_file() {
# Store the filename in this variable
local NEW="$1"
# Initialize the checksum variable before computing the hash of the file
local SUM="0"
# If the file exists and the temporary file is still empty (this is the first file scanned)
if [ -f "$NEW" ] && [ ! -s "$TMP_FILE" ];then
# Print the hash value of the file; the caller stores it in RET and later uses it to check for duplicates
echo $(sha512sum ${NEW} | awk -F' ' '{print $1}')
# Return from the function here
return
fi
# If the size of the temporary file is non-zero, read it line by line in a loop. Each line is stored in the ELEMENT variable
while read ELEMENT; do
# Get the hash value of input file
SUM=$(sha512sum ${NEW} | awk -F' ' '{print $1}')
# Get the hash value of file collected from temporary file
ELEMENT_SUM=$(echo $ELEMENT | awk -F' ' '{print $1}')
ELEMENT_FILENAME=$(echo $ELEMENT | awk -F' ' '{print $2}')
# if the hash value is same means we have found a duplicate file
if [ "${SUM}" == "${ELEMENT_SUM}" ];then
echo $ELEMENT_FILENAME > $DUP_FILE
return 1
else
continue
fi
done<$TMP_FILE
# If duplicate file is not found then just print the hash value of the file
echo "${SUM}"
}
function begin_search_and_deduplication {
local DIR_TO_SRCH="$1"
for FILE in `find ${DIR_TO_SRCH} -type f`; do
# this stores the return value from srch_file function
RET=`srch_file ${FILE}`
if [[ "${RET}" == "" ]];then
FILE1=`cat $DUP_FILE`
echo "$FILE1 is a duplicate of $FILE"
while true; do
read -rp "Which file do you wish to delete? $FILE1 (or) $FILE: " ANS
if [ "$ANS" == "$FILE1" ];then
del_file $FILE1
break
elif [ "$ANS" == "$FILE" ];then
del_file $FILE
break
fi
done
continue
elif [[ "${RET}" == "0" ]];then
continue
elif [[ "${RET}" == "1" ]];then
continue
else
# If the file hash is not found, its entry is added to the temporary file
add_file "${RET}" ${FILE}
continue
fi
done
}
# This will read the user input to collect the list of directories to search for duplicate files
echo "Enter directory name to search: "
echo "Press [ENTER] when ready"
echo
read DIR
begin_search_and_deduplication "${DIR}"
# Delete the temporary files once done
rm -f $TMP_FILE
rm -f $DUP_FILE
You can execute the script, which will prompt for the list of directories you wish to check for duplicate files in Linux. I have created a few files with duplicate content for the sake of this article.
# /tmp/remove_duplicate_files.sh
Enter directory name to search:
Press [ENTER] when ready

/dir1 /dir2 /dir3    <-- This is my input (search duplicate files in these directories)
/dir1/file1 is a duplicate of /dir1/file2
Which file do you wish to delete? /dir1/file1 (or) /dir1/file2: /dir1/file2
File "/dir1/file2" deleted
/dir1/file1 is a duplicate of /dir2/file101
Which file do you wish to delete? /dir1/file1 (or) /dir2/file101: /dir2/file101
File "/dir2/file101" deleted
As you can see, the script waits for user input whenever a duplicate file is found in the provided directories. Based on the user input it then proceeds further.
How it works?
- I have added comments before most of the sections, which should help you understand how the script works to remove duplicate files.
- For every file we compute a hash and compare it against the list of hashes already computed.
- If the hash matches, we have seen the contents of this file before and so we can delete it.
- If the hash is new, we record the entry and move on to calculating the hash of the next file, until all files have been hashed (a minimal sketch of this approach is shown after this list).
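As a minimal sketch of this idea (not the script above, and assuming bash 4+ for associative arrays), the list of hashes can be kept in an associative array that maps each hash to the first file seen with that content; duplicates are only reported here, not deleted:
#!/bin/bash
# Minimal sketch of the hash-comparison approach
declare -A SEEN

while IFS= read -r -d '' FILE; do
    HASH=$(sha512sum "$FILE" | awk '{print $1}')
    if [[ -n "${SEEN[$HASH]}" ]]; then
        # Hash already recorded: identical contents were seen before
        echo "$FILE is a duplicate of ${SEEN[$HASH]}"
    else
        # New hash: remember which file carries this content
        SEEN[$HASH]="$FILE"
    fi
done < <(find "$1" -type f -print0)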
Now if you wish the script to automatically find and remove duplicate files, you can remove the interactive prompt block (the while true loop) from the above script and just use del_file $FILE1 so it will directly remove duplicate files (if found).
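For reference, with the prompt removed, the duplicate-handling block inside begin_search_and_deduplication would look roughly like this (a sketch only, keeping the same variables and functions as the script above):
if [[ "${RET}" == "" ]];then
    FILE1=`cat $DUP_FILE`
    echo "$FILE1 is a duplicate of $FILE"
    # Delete the duplicate directly instead of prompting the user
    del_file $FILE1
    continue
elif [[ "${RET}" == "0" ]] || [[ "${RET}" == "1" ]];then
    continue
else
    # New hash, add its entry to the temporary file
    add_file "${RET}" ${FILE}
    continue
fi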
Script 2: Remove duplicate files using shell script
Here we will use awk to find duplicate files. This script will find the copies of the same file in a directory and remove all except one copy of the file.
#!/bin/bash
# Filename: remove_duplicate.sh
# Description: Find and remove duplicate files and
# keep one sample of each file.
ls -lS --time-style=long-iso | awk 'BEGIN {
getline; getline;
name1=$8; size=$5
}
{
name2=$8;
# Save the size of the current file before getline overwrites the fields
size2=$5;
if (size==size2)
{
# getline replaces $0 with the md5sum output, so $1 holds the checksum
"md5sum "name1 | getline; csum1=$1;
"md5sum "name2 | getline; csum2=$1;
if ( csum1==csum2 )
{
print name1; print name2
}
};
size=size2; name1=name2;
}' | sort -u > duplicate_files
cat duplicate_files | xargs -I {} md5sum {} | sort | uniq -w 32 | awk '{ print $2 }' | sort -u > unique_files
echo Removing..
comm duplicate_files unique_files -3 | tee /dev/stderr | xargs rm
echo Removed duplicate files successfully.
You must navigate into the directory where you wish to find and remove duplicate files and then execute the script. Here I want to find duplicate files inside /dir1, so I will cd /dir1 and then execute the script without any argument.
[root@centos-8 dir1]# /tmp/remove_duplicate.sh
Removing..
file1_copy
file2_copy
Removed duplicate files successfully.
List the files under /dir1 and make sure no duplicate files exist there. Some more files were created and left for your reference.
[root@centos-8 dir1]# ls -l
total 20
-rw-r--r-- 1 root root 34 Jan 9 07:01 duplicate_files
-rw-r--r-- 1 root root 16 Jan 9 06:56 duplicate_sample
-rw-r--r-- 1 root root 5 Jan 9 07:00 file1
-rw-r--r-- 1 root root 6 Jan 9 07:00 file2
-rw-r--r-- 1 root root 12 Jan 9 07:01 unique_files
How it works?
ls -lS lists the details of the files in the current folder, sorted by file size. The --time-style=long-iso option tells ls to print dates in the ISO format. awk reads the output of ls -lS and performs comparisons on columns and rows of the input text to find the duplicate files.
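To see the columns that awk works on, you can run the same ls command yourself. With --time-style=long-iso the size is the fifth field ($5) and the filename is the eighth ($8). The listing below is only illustrative output, not a copy of the article's directory:
# ls -lS --time-style=long-iso
total 20
-rw-r--r-- 1 root root 16 2020-01-09 06:56 duplicate_sample
-rw-r--r-- 1 root root  6 2020-01-09 07:00 file2
-rw-r--r-- 1 root root  5 2020-01-09 07:00 file1
In the first file line, $5 is 16 (the size) and $8 is duplicate_sample (the name); these are exactly the fields the BEGIN block stores in size and name1.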
The logic behind the code to find duplicate files using shell script is as follows:
- We list the files sorted by size, so files of the same size will be adjacent. The first step in finding identical files is to find ones with the same size. Next, we calculate the checksum of the files. If the checksums match, the files are duplicates and all but one copy from each group of duplicates is removed.
- The BEGIN{} block of awk is executed before the main processing. It reads the "total" line and the first file entry, and initializes the variables. The bulk of the processing takes place in the {} block, where awk reads and processes the rest of the ls output. The END{} block statements are executed after all input has been read.
- In the BEGIN block, we read the first line and store the name and size (which are the eighth and fifth columns). When awk enters the {} block, the rest of the lines are read, one by one. This block compares the size obtained from the current line with the previously stored size in the size variable. If they are equal, it means that the two files are duplicates by size and must be further checked by md5sum.
- Once a line is read, the entire line is in $0 and each column is available in $1, $2, ..., $n. Here, we read the md5sum checksum of the files into the csum1 and csum2 variables. The name1 and name2 variables store the consecutive filenames. If the checksums of two files are the same, they are confirmed to be duplicates and are printed.
- We calculate the md5sum value of the duplicates and print one file from each group of duplicates by finding unique lines, comparing the md5sum from each line using -w 32 (the first 32 characters in the md5sum output; usually, the md5sum output consists of a 32-character hash followed by the filename). One sample from each group of duplicates is written to unique_files.
- Now we need to remove the files listed in duplicate_files, excluding the files listed in unique_files. The comm command prints the files present in duplicate_files but not in unique_files (see the worked example after this list).
- comm only processes sorted input. Therefore, sort -u is used to filter duplicate_files and unique_files.
- The tee command is used to pass filenames to the rm command as well as print them. tee sends its input to both stdout and a file. We can also print text to the terminal by redirecting to stderr. /dev/stderr is the device file corresponding to stderr (standard error). By redirecting to the stderr device file, the text sent to it is printed in the terminal as standard error.
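As a worked example of the comm step, assume duplicate_files and unique_files ended up with the contents below (reconstructed to match the sample run above). comm with the -3 option suppresses the lines common to both files, so what remains is exactly the list of extra copies passed to rm:
# cat duplicate_files
file1
file1_copy
file2
file2_copy
# cat unique_files
file1
file2
# comm duplicate_files unique_files -3
file1_copy
file2_copy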
References:
Linux Shell Scripting Cookbook
Bash Cookbook