Yes, this is a problem. Fade duration is another. Take a pair of songs with 0% tag similarity and different filenames, bit rates and, of course, sizes. The lengths differ too, but only by 1 to 4 seconds; one file just has a longer fade-out. (Almost) the whole audio content is identical. FOR ME, these songs are at least 90% similar. Today Similarity "says" this pair is less similar than a classical piece compared with a punk song. Really! In the same situation, but with 100% tag similarity and the same filename, it gives me only about 50% similarity.
At the very least, allow songs to be sorted/filtered/grouped by a compound sort key (e.g. %content + difference between lengths + %tags).
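To show roughly what I mean by a compound sort key, here is a small Python sketch. It is only an illustration: the `Pair` record, its field names and the sample values are made up for this example and are not Similarity's real data model or API.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    # Hypothetical record for one candidate duplicate pair (not Similarity's real data).
    name: str
    content_pct: float    # audio content similarity, 0-100
    tags_pct: float       # tag similarity, 0-100
    length_diff_s: float  # absolute difference between the two lengths, in seconds


def compound_key(p: Pair):
    # Sort by highest %content first, then smallest length difference,
    # then highest %tags. Negating a value reverses its sort direction.
    return (-p.content_pct, p.length_diff_s, -p.tags_pct)


# Made-up sample values, only to show the ordering.
pairs = [
    Pair("A", 95.0, 0.0, 3.0),
    Pair("B", 95.0, 100.0, 1.0),
    Pair("C", 60.0, 100.0, 0.0),
]

ranked = sorted(pairs, key=compound_key)
for p in ranked:
    print(p.name, p.content_pct, p.length_diff_s, p.tags_pct)
```

With a key like this, pairs whose audio is nearly identical come first even when their tags do not match at all, and the length difference only breaks ties.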
Other improvements:
Give the user more freedom. Each person knows best how to classify their own collection.
Allow more info columns to be added to the browse view (and used to sort, filter and group), including calculated data (release an API with special variables).
Add a weighted overall score combining %content and %tags, and maybe other information.
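A weighted overall score could look something like the Python sketch below. Everything in it is an assumption for illustration: the function name, the default weights, and the idea of turning the length difference into a 0-100 "closeness" value that drops to 0 at `max_diff_s` seconds. It is not how Similarity calculates anything today.

```python
def overall_score(content_pct, tags_pct, length_diff_s,
                  w_content=0.7, w_tags=0.2, w_length=0.1,
                  max_diff_s=10.0):
    """Weighted average of three 0-100 values (all weights are hypothetical)."""
    # Length closeness: 100 when the lengths are equal, falling linearly
    # to 0 once the difference reaches max_diff_s seconds.
    length_pct = 100.0 * max(0.0, 1.0 - length_diff_s / max_diff_s)
    total_weight = w_content + w_tags + w_length
    weighted_sum = (w_content * content_pct
                    + w_tags * tags_pct
                    + w_length * length_pct)
    return weighted_sum / total_weight


# Same audio, no matching tags, 3 seconds longer fade-out.
score = overall_score(content_pct=98.0, tags_pct=0.0, length_diff_s=3.0)
print(round(score, 1))
```

With these example weights, a pair with almost identical audio still scores about 75 even with 0% tag similarity. Ideally the user could set the weights, since each collection is different.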
I downloaded about 400,000 songs; now I have ~260,000 and I guess 40% of them are still duplicates. With Similarity 9.360 I could get down to ~190,000 (an optimistic estimate). With these improvements I could get down to 160,000 songs (relatively) safely and in a reasonable amount of time.
I am available for more information and suggestions.
Thank you very much.