Finding duplicate files on NAS storage

I have a bad habit of copying things many times while modifying them. When the thing being copied is a large database, we are talking many gigabytes. So here is a script that finds duplicate files across many hard drives and tells you which ones are duplicates. Moving, deleting, and symbolic linking are done manually afterwards.

1- This script is PHP-CLI, so make sure that is installed on your computer
2- This script runs the find command, so make sure it can execute that program
3- You run the script with the path as a parameter, but you will need to edit the script to change the 1GB size I have hard-coded

What this script does:

1- Finds files larger than 1GB (find /hds -size +1G)
2- Stores the files in a database along with their sizes
3- Retrieves the files ordered by size
4- If 2 files have exactly the same size, calculates the MD5 sum of the first MB of each file
5- If the MD5 sums of the first MB match, calculates the MD5 of the whole files
6- If they turn out to be duplicates, prints them to the command line
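
The steps above can be sketched roughly as follows. This is not the PHP script itself, only a minimal Python illustration of the same size, then first MB, then full MD5 idea. It makes a few assumptions: it walks the directory tree with os.walk instead of calling find, it groups files in memory instead of storing them in a database, and the names md5_of, find_duplicates, MIN_SIZE and HEAD_BYTES are made up for this example.

```python
import hashlib
import os
from collections import defaultdict

MIN_SIZE = 1024 ** 3      # assumed threshold: 1 GiB, approximating `find -size +1G`
HEAD_BYTES = 1024 * 1024  # first MB, used for the cheap pre-check


def md5_of(path, limit=None, chunk=1024 * 1024):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as fh:
        while remaining is None or remaining > 0:
            block = fh.read(chunk if remaining is None else min(chunk, remaining))
            if not block:
                break
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def find_duplicates(root, min_size=MIN_SIZE, head_bytes=HEAD_BYTES):
    """Return a list of groups (lists of paths) whose contents are identical."""
    # Steps 1-2: collect files above the size threshold, keyed by size.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            size = os.path.getsize(path)
            if size > min_size:
                by_size[size].append(path)

    duplicates = []
    # Step 3: go through the sizes in order.
    for size in sorted(by_size):
        paths = by_size[size]
        if len(paths) < 2:
            continue  # a unique size cannot be a duplicate
        # Step 4: cheap check, MD5 of the first MB only.
        by_head = defaultdict(list)
        for path in paths:
            by_head[md5_of(path, head_bytes)].append(path)
        for candidates in by_head.values():
            if len(candidates) < 2:
                continue
            # Step 5: expensive check, MD5 of the whole file.
            by_full = defaultdict(list)
            for path in candidates:
                by_full[md5_of(path)].append(path)
            # Step 6: anything still grouped together is a duplicate set.
            duplicates.extend(sorted(g) for g in by_full.values() if len(g) > 1)
    return duplicates
```

Calling find_duplicates("/hds") and printing each returned group would list the duplicate sets. The point of the two-stage hash is that most same-size files differ within the first MB, so the full MD5 over many gigabytes is only computed for the few files that survive the cheap check.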
