Author Topic: false 100% similarity (Read 116922 times)

hsei · « **on:** July 10, 2010, 11:01:56 »

The program seems to scan only about the first minute of each track for similarity. Even if two songs differ in length by minutes, they get a 100% score if they start the same (e.g. a complete live CD vs. its first track).
This may be unavoidable for performance reasons, but it is very dangerous if you rely on automark: one of the two is deleted even though they differ greatly.
Even worse: if there is a crippled track with a piece missing in the middle, it will be ranked as 100% similar, and in the worst case the complete track is deleted and the damaged one remains.
A configurable limit on the allowable track length difference (file size would be another topic) would be very nice. At the very least there should be a warning if track times differ considerably (e.g. by coloring the duration entry red). The loss in performance for that should be negligible.

Admin · « **Reply #1 on:** July 10, 2010, 21:21:52 »

Similarity is designed for scanning music compositions, and yes, it scans only the first minute of a song. We are thinking about how to solve the problem with long durations.

djluckyluciano · « **Reply #2 on:** July 11, 2010, 01:31:41 »

Hi,
I am confused by a 70% similarity between two titles: one is a megamix of 70 minutes,
the other a short 3-minute version of a song...

hsei · « **Reply #3 on:** July 11, 2010, 11:00:18 »

It's not only a problem of long durations: Having two files of e.g. 2 minutes with high similarity score and differing by 10 secs is a strong indication of corruption.
I actually use that for identifying corrupted files but at the moment it has to be done "manually" by looking for significant duration mismatches in high score groups.
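The manual check described above can be sketched in a few lines. This is only an illustration: the `Pair` record, its field names, and the default thresholds are assumptions made for the example, not anything the program exposes.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Hypothetical record for one reported match (not the program's format)."""
    path_a: str
    path_b: str
    score: float       # similarity score in percent, 0-100
    duration_a: float  # track length in seconds
    duration_b: float  # track length in seconds


def suspect_pairs(pairs, min_score=95.0, max_diff_s=5.0):
    """Return high-score pairs whose durations differ by more than max_diff_s.

    Such pairs sound alike where they were compared but differ in length,
    which may point to a truncated or otherwise damaged file.
    """
    return [
        p for p in pairs
        if p.score >= min_score
        and abs(p.duration_a - p.duration_b) > max_diff_s
    ]


if __name__ == "__main__":
    report = [
        Pair("a.mp3", "a_copy.mp3", 100.0, 120.0, 110.0),  # 10 s shorter
        Pair("b.mp3", "b_copy.mp3", 100.0, 121.0, 120.5),  # same length
    ]
    for p in suspect_pairs(report):
        print(p.path_a, p.path_b, abs(p.duration_a - p.duration_b))
```

Both thresholds are arbitrary starting points and would need tuning against a real collection.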

FtMgAl · « **Reply #4 on:** August 06, 2010, 20:28:14 »

The first time I used the program I selected a small folder with about 100 tracks that I knew had no or not more than a couple of duplicates. The program found 22 supposed duplicates. The reason is these were mostly live performances and the first minute contained much applause.

I would suggest adding a criterion that the lengths must match within X%. If two tracks differ in length by more than 25%, I find it hard to believe anyone would consider them similar, but with a 0-100% setting even people who would can have it their way. And, as someone else mentioned, ruling out pairs by track length first could significantly improve speed.

You might also want to consider using the second minute to reduce the false positives on live tracks.
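A minimal sketch of the "within X%" length criterion suggested above. The function name, the choice to measure the difference relative to the longer track, and the 25% default are assumptions for illustration; the program's eventual duration test may work differently.

```python
def lengths_match(duration_a, duration_b, tolerance_pct=25.0):
    """Return True if two track lengths (in seconds) are within tolerance_pct.

    The difference is measured relative to the longer track, so the result
    does not depend on the order of the arguments. A tolerance of 100
    accepts every pair, and 0 requires identical lengths.
    """
    longer = max(duration_a, duration_b)
    if longer <= 0:
        # Degenerate input: only treat the pair as matching if both are equal.
        return duration_a == duration_b
    diff_pct = abs(duration_a - duration_b) / longer * 100.0
    return diff_pct <= tolerance_pct


if __name__ == "__main__":
    print(lengths_match(180.0, 4200.0))        # 3 min song vs. 70 min mix
    print(lengths_match(180.0, 200.0))         # small difference
    print(lengths_match(180.0, 4200.0, 100.0)) # check effectively disabled
```

Because this test needs only the durations, it can run before any audio is decoded, which is where the speed gain mentioned above would come from.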

Admin · « **Reply #5 on:** August 07, 2010, 18:22:47 »

Quote from: FtMgAl on August 06, 2010, 20:28:14

...

A duration test will be added in future versions.

hsei · « **Reply #6 on:** October 24, 2010, 09:29:43 »

The newly introduced duration check helps to get rid of most of the false positives, but a few 100% "precise" pairs with equal length still remain. They can easily be identified by their tag score below 10% and standard score below 50%, but the implication is: you still can't rely on a totally automatic removal of duplicates; you have to look at the lists.
A hint: All false positive pairs I found had durations below 1 min. So it is not a severe bug, but it is one.

TBacker · « **Reply #7 on:** October 30, 2010, 05:09:44 »

Quote from: Admin on July 10, 2010, 21:21:52

Similarity is designed for scanning music compositions, and yes, it scans only the first minute of a song. We are thinking about how to solve the problem with long durations.

I'm a new user, but I am a radio broadcast engineer with a bit of experience writing some audio apps for my job and personal use (VB6/VB.Net).

How about taking 3 or 4 short (30 second) samples across the length of a file? Say a 30 second sample at 0%, 25%, 50%, and 75% of the length of valid audio data (ignore those metadata headers / tails!). You would have to seek in past any silence at the head for the first sample (as the silence can vary even if the cut is the same).

In theory this would produce a "fingerprint" representative of most of the audio without having to scan the whole thing, and more accurate than judging the whole file by one sample.

If this data is compared to a duplicate, and the duplicate is the same audio and length, the data from each of the 4 samples should match up waveform-wise. If the length is different on the duplicate, say an extra interlude on a remix, the last 2 or 3 samples will not be the same as the original.

This would also detect if a file is corrupted half way through - samples 1 and 2 might match, but 3 and 4 are random noise on the bad cut.
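As a rough sketch of where such samples would fall: the code below only computes the time windows and does not decode or compare audio. The function names, the silence threshold, and the idea of passing the leading silence in as a number of seconds are assumptions for illustration. Trailing silence and metadata are not handled.

```python
def first_audible_index(samples, threshold=0.01):
    """Index of the first sample whose magnitude exceeds threshold.

    Returns len(samples) if the whole input is below the threshold.
    """
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return i
    return len(samples)


def excerpt_windows(total_s, lead_silence_s=0.0,
                    fractions=(0.0, 0.25, 0.5, 0.75), excerpt_s=30.0):
    """Return (start, end) times in seconds for each excerpt.

    Fractions are taken over the audio that follows the leading silence,
    and each window is clipped to the end of the track.
    """
    audible = max(total_s - lead_silence_s, 0.0)
    windows = []
    for f in fractions:
        start = lead_silence_s + f * audible
        end = min(start + excerpt_s, total_s)
        if end > start:
            windows.append((start, end))
    return windows


if __name__ == "__main__":
    # A 4:02 track with 2 seconds of silence at the head.
    for start, end in excerpt_windows(242.0, lead_silence_s=2.0):
        print(f"{start:6.1f} s to {end:6.1f} s")
```

The leading silence in seconds could be derived as `first_audible_index(samples) / sample_rate` once the head of the file is decoded. For tracks shorter than about four excerpt lengths the windows overlap, so very short files would need separate handling.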

One last caveat - I don't know how your comparison code works, but if the levels are different between the original and dupe, you would need to compensate (make the highest peak of the sample match, i.e. normalize the low sample to the hotter one) before comparing the waveforms.

Sorry for the long post

Admin · « **Reply #8 on:** October 30, 2010, 16:46:09 »

Thanks for your message. We have already fixed the problem with 100% similarity; the fix will be publicly available in the next version.

About duration: the problem is speed. The more you decode, the more time is needed to analyze a file. We must balance speed against quality.
But thanks for your comment, we'll think about your ideas.

hsei · « **Reply #9 on:** October 31, 2010, 19:49:35 »

a) TBacker's proposal does not necessarily mean more effort: taking e.g. three 20 sec excerpts at the beginning, middle and end results in approximately the same decoding time as a single 60 sec probe, but gives higher reliability. There is a little more trouble at the borders of the excerpts, but dropped samples in one file and drastically different fade-outs (which are missed in the current version) would then most likely show up. This is probably worth the small loss in speed.

b) You can only be sure to detect all corrupted frames if you decode the whole file. That's clearly a matter of balancing speed vs. quality.

c) Normalizing to the highest peak of a sample/excerpt would not be a good idea. The standard approach for comparison in the frequency domain is to normalize to the overall average (or in other words: the component at frequency bin 0).
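Point c) can be illustrated with a small sketch. It assumes the values being compared are non-negative magnitudes (for example spectral magnitudes of an excerpt) and reads "normalize to the overall average" as dividing by their mean. Both function names are made up for the example, and nothing here reflects how the program actually compares files.

```python
def normalize_peak(values):
    """Scale values so that the largest magnitude becomes 1."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak else list(values)


def normalize_mean(values):
    """Scale values so that the average magnitude becomes 1."""
    mean = sum(abs(v) for v in values) / len(values)
    return [v / mean for v in values] if mean else list(values)


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0, 2.0] * 25  # 100 magnitude values
    clicked = list(original)
    clicked[0] = 30.0                     # one outlier, e.g. a click

    # Compare how an untouched value (index 1) is scaled in each case.
    print("peak:", normalize_peak(original)[1], normalize_peak(clicked)[1])
    print("mean:", normalize_mean(original)[1], normalize_mean(clicked)[1])
```

A pure gain change is removed equally well by both functions. The difference shows up with a single outlier: peak normalization rescales every other value by the size of that one spike, while mean normalization shifts them only slightly.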

hsei · « **Reply #10 on:** October 31, 2010, 19:53:18 »

@admin: The last few posts would fit better in the wishlist.

TBacker · « **Reply #11 on:** October 31, 2010, 19:56:20 »

Quote from: hsei on October 31, 2010, 19:49:35

c) Normalizing to the highest peak of a sample/excerpt would not be a good idea. The standard approach for comparison in the frequency domain is to normalize to the overall average (or in other words: the component at frequency bin 0).

I guess my point wasn't clear in this respect. I basically meant that the amplitudes of the two samples should be made to match (in the compare procedure) before frequency analysis to ensure the best accuracy.

Admin · « **Reply #12 on:** November 01, 2010, 17:40:10 »

Thanks for your comments. We are thinking about some modifications to the algorithms, but this is not as simple as it seems.

GIL · « **Reply #13 on:** January 07, 2011, 12:28:58 »

I understand the performance limitation, but I would prefer to be able to set the sample length.
Better to wait than to erase the wrong track.

7b683d4548 · « **Reply #14 on:** January 10, 2011, 13:33:03 »

How about uniquely identifying a song based on its Discogs release, AmpliFIND ID, MusicBrainz unique ID or similar?

Ought to save you lots of scanning.
