Posted by: StanleyTweedle
« on: March 25, 2009, 12:27:47 »
I'd like to share with the Similarity user community my own technique for effective, efficient duplicate detection and removal, with confidence in every deletion.
Step 00:
Obtain the latest copy of Similarity from www . music-similarity . com .
Obtain the latest copy of the freeware media player "Foobar2000", available at foobar2000 . org
Install the software.
Prepare your media-library folders:
Placing ALL media files in a single "Parent" folder is recommended, from this user's experience. Pulling results from a single location, subfolders or not, is easier to manage than results from multiple Parent locations. Trust me-- put all mp3s, etc., into a single folder. (Subfolders are okay. And when you're finished removing dupes, you can re-arrange your library to its previous state.)
[for example, my path:
Parent Folder - E:\myMusic\
Folder structure example:
E:\myMusic\_artist-1\
E:\myMusic\_artist-2\
etc,
etc.
...totalling about 7,000 mp3s, in hundreds of sub-folders, under a single parent, "myMusic"]
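If you want to confirm that everything really did land under the one parent folder before scanning, a short script can count the audio files per top-level subfolder. This is only a sketch of my own, not part of Similarity or Foobar2000: the function name count_audio_files and the extension list are assumptions, so adjust them to match your library.

```python
from collections import Counter
from pathlib import Path

# Assumed list of audio extensions; edit to match your own library.
AUDIO_EXTENSIONS = {".mp3", ".flac", ".ogg", ".wma", ".m4a"}


def count_audio_files(parent):
    """Count audio files under `parent`, grouped by top-level subfolder.

    Files sitting directly in the parent folder are grouped under ".".
    """
    parent = Path(parent)
    counts = Counter()
    for path in parent.rglob("*"):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            relative = path.relative_to(parent)
            top = relative.parts[0] if len(relative.parts) > 1 else "."
            counts[top] += 1
    return counts
```

For my layout I would call count_audio_files(r"E:\myMusic") and add up the values to get the library total. If a subfolder you expected is missing from the result, it was probably left behind in another location.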
Step 01:
Launch Similarity. The default options are recommended. Enable the experimental algorithm if you wish to have a very accurate report.*(1)
Click the folder icon. Select the single folder under which you placed all of your other mp3 folders. Subfolders are scanned automatically, so DO NOT add anything but the "parent folder". (Correct me, please, if I'm wrong.)
[according to the folder structure illustrated above, I select ONLY "E:\myMusic"]
Step 02:
START THE MEDIA SCAN:
How do I enable Scanning in Similarity?
As soon as the folder is selected and the dialogue window is closed (by clicking the [OK] button), Similarity begins the scanning process. In fact, this is the very method for enabling the scan. **NOTE** If you close Similarity and re-open it, wondering why it's not doing anything-- you must first "delete" any folders from the "add-folder" dialogue, then re-add those folders. (Odd, perhaps, but that's how it works! Hey, be patient-- it's still very "beta" software! Give the man some time to tend to such quirks. ;-)
Step 03:
SIMILARITY COMMENCE ROCKING (, and rolling)
so...Wait...and...wait... (tic, toc, tic, toc...)
Consider This: Similarity is actually "listening" to all of those files, so it knows if there are duplicates (vs just looking at file-names, etc.). Imagine how long it would take for you to do the same!
Scan progress is indicated in the upper-right [number of files scanned] and lower-left [number of duplicates detected].
NOTE: Similarity WILL gradually report each duplicate file as the scan proceeds. Duplicates may be verified, and even deleted, while the scan is still running.
Step 04:
SAVE PLAYLIST
NOTE: This section begins a particular process-- unique to the author of this tutorial. The following procedure is not the recommended action per Similarity, but I find it to be my own preference. I recommend the reader try this method on only a few files, at first, so he or she might make the best personal judgment for further proceeding with this technique.
When Similarity finds approx 100 duplicates (observe the lower-left corner), you may wish to save the playlist. Before saving, sort the playlist items by the "content" column, so that the items marked "100%" are mostly near the top. Save the playlist in an easily accessible folder (the "Save" icon looks like a floppy disk).
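For anyone who likes to see the idea spelled out: sorting by the "content" column just means ordering the reported rows from the highest score down, so the surest duplicates get reviewed first. The sketch below is purely illustrative. The DuplicateRow type, the sort_by_content function and the sample rows are all made up by me; they are not Similarity's own data format.

```python
from typing import NamedTuple


class DuplicateRow(NamedTuple):
    """One reported duplicate (hypothetical structure, for illustration)."""
    path: str
    content_percent: float  # the value shown in the "content" column


def sort_by_content(rows):
    """Return rows ordered highest "content" score first.

    Python's sort is stable, so rows with equal scores keep their
    original relative order.
    """
    return sorted(rows, key=lambda row: row.content_percent, reverse=True)


# Hypothetical sample rows, not real scan output.
rows = [
    DuplicateRow(r"E:\myMusic\_artist-1\track.mp3", 87.5),
    DuplicateRow(r"E:\myMusic\_artist-2\track.mp3", 100.0),
]
for row in sort_by_content(rows):
    print(f"{row.content_percent:5.1f}%  {row.path}")
```

The point of putting the 100% items first is that they need the least listening to confirm, so the early part of the playlist goes quickly.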
Step 05:
OPEN PLAYLIST IN FOOBAR2000
Foobar2000 has a built-in function to perform actions on files, including "delete". By loading the playlist into Foobar2000, you create an environment quite friendly for comparing the duplicates reported by Similarity. Simply go through the playlist and determine which file is "better" (and likewise, which to delete). Hearing both files before choosing raises your confidence that the file you delete really is the one better discarded.
Indeed, Similarity offers the "speaker" icons, which launch the file in focus for previewing its contents. However, using the aforementioned method, it is much easier to perform an "A-B" comparison between files. Furthermore, the comparison becomes less of a chore and more enjoyable, as the playlist can continue in the background without demanding 100% of the user's attention. In other words, let the playlist play: enjoy your tunes, and check only as the duplicates come up.
NOTE: Regardless of playlists in Foobar2000, it is recommended to USE SIMILARITY FOR THE DELETION action. Although Foobar2000 is capable of deleting files from the playlist, Similarity will not function as beautifully if files are modified outside of its own process. Experiment. See what you prefer, but begin by deleting from Similarity, NOT from within Foobar2000, unless you opt to "Move" files, as suggested after the next paragraph.
I find this is much more "bearable" than working only in Similarity. This is NOT to take any value from Similarity itself, but only to suggest a possible practice which may be more enjoyable for the user who has a large library containing many duplicates.
NOTE: Foobar2000, in addition to offering a "File-Operations -> Delete" command, also has a "move" operation. If the user is really paranoid about which files should be deleted, I recommend creating a "Safe_Delete" folder (e.g. E:\SafeDeleteMp3s\ ). Using a safety-net folder, instead of deleting every duplicate, the user might use "File-Operations -> move -> SafeDelete", so that the duplicate mp3s are removed from the primary library and set aside in a sort of "pre-delete" folder (not unlike the recycle bin, but safe from system cleaner utilities which might otherwise automatically empty the recycle bin). If the user is confident about deleting duplicates, then such a "safeDelete" folder is superfluous, and probably a waste of effort. Use your own best judgment for how to handle your own files.
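The same safety-net idea can be sketched in a few lines of script, for anyone who prefers that to the Foobar2000 menu. This is my own illustration and not a feature of either program: the function name move_to_holding is made up, and the choices to keep each file's sub-folder path and to never overwrite an existing file in the holding folder are assumptions I would make for safety.

```python
import shutil
from pathlib import Path


def move_to_holding(files, library_root, holding_root):
    """Move files from the library into a holding folder instead of deleting.

    Each file keeps its path relative to `library_root`, so it can be put
    back later. Files that are missing, outside the library, or that would
    overwrite something already in the holding folder are skipped.
    Returns a list of (source, destination) pairs that were moved.
    """
    library_root = Path(library_root).resolve()
    holding_root = Path(holding_root)
    moved = []
    for name in files:
        source = Path(name).resolve()
        if not source.is_file():
            continue  # already moved or deleted
        try:
            relative = source.relative_to(library_root)
        except ValueError:
            continue  # not inside the library; leave it alone
        destination = holding_root / relative
        if destination.exists():
            continue  # never overwrite an earlier "pre-delete" copy
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(destination))
        moved.append((source, destination))
    return moved
```

With my folders, the call would look like move_to_holding(list_of_dupes, r"E:\myMusic", r"E:\SafeDeleteMp3s"). Once you are satisfied nothing important went missing, the holding folder can be deleted by hand.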
Step 06:
Return to Step 03 and repeat Steps 03 through 05 until all duplicate files are eliminated.
Good luck!
[Footnotes]
*(1) Enabling the "experimental algorithm" does not affect the "regular" scan duration; it only adds time to the end of the job, since the statistics it offers are in addition to those produced by a scan performed without it. In other words, Similarity performs the "experimental" check only AFTER the "regular" scan is finished (in my observation). You'll get the same "regular" results either way, in the same amount of time. If you decide not to wait for the "experimental" check, there is no harm: the "regular" results are unaffected by cancelling an "experimental" scan early.