There are two popular packages for dealing with duplicate files on Linux.
The first is fslint (apt-get install fslint). Its downside is that it does not sort by file size, and when I have 10,000 duplicate files on my disks I really don't want to wade through them all and make choices one by one, which is what led to the approach below. The second is fdupes, which I have never used, so I can't speak for it. So we will be using fslint with a small script.
First, find the duplicate files. In my case, I only want the ones over 2 MB:
/usr/share/fslint/fslint/findup /hds -size +2048k > /root/dups.txt
Now, this simple little script reads that data into a MySQL table. It is a command-line PHP script; you will need to edit the MySQL username, password, and database, and you also need to tell it the path you used in the command above (I used
"/hds"). The database SQL file is also included; you can import it with phpMyAdmin.
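For reference, a minimal sketch of such an import script could look like the one below. The table name "dups", the credentials, and the assumption that findup writes one file path per line to /root/dups.txt are placeholders of mine, not the attached script; adjust them to your setup.

<?php
// Minimal sketch of an import script like the one described above.
// Assumption: findup wrote one absolute path per line to /root/dups.txt.
$prefix = '/hds'; // the path you scanned with findup
$db = new mysqli('localhost', 'dbuser', 'dbpass', 'dupdb'); // edit these

// The attached SQL file creates the real table; this stand-in is equivalent in spirit.
$db->query('CREATE TABLE IF NOT EXISTS dups (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(1000) NOT NULL,
    size BIGINT UNSIGNED NOT NULL,
    KEY size_idx (size)
)');

$stmt = $db->prepare('INSERT INTO dups (path, size) VALUES (?, ?)');

foreach (file('/root/dups.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    if (strpos($line, $prefix) !== 0) {
        continue; // skip anything that does not look like a scanned path
    }
    $size = filesize($line); // look up the size on the file system
    if ($size === false) {
        continue; // file vanished or is unreadable
    }
    $stmt->bind_param('si', $line, $size);
    $stmt->execute();
}
$db->close();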
Now you can run the script, and it will go through the list and look up each file's size on the file system.
Then you can either walk through the database after sorting by size, or write your own display script (fetch and print, nothing too fancy, like the sketch below). That way you will know where your greatest gains are, and you won't lose a day sifting through those duplicate files.
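For example, a bare-bones fetch-and-print script against the same hypothetical table and placeholder credentials could look like this:

<?php
// Bare-bones "fetch and print" sketch: biggest duplicates first.
$db = new mysqli('localhost', 'dbuser', 'dbpass', 'dupdb'); // edit these
$result = $db->query('SELECT path, size FROM dups ORDER BY size DESC LIMIT 100');
while ($row = $result->fetch_assoc()) {
    printf("%12d  %s\n", $row['size'], $row['path']);
}
$db->close();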
Have fun, and please let me know what you think.