Author Topic: Similarity seems not optimized for 300,000+ mp3 files (Read 83175 times)

ektorbarajas · « **on:** January 17, 2016, 23:14:39 »

Hi.

I have a huge amount of mp3 files (300,000+) and usually use the following config for checking duplicates (collection vs new files):
Audio comparison method: precise 95%
Duration check: enable 95%
Skip video files

My collection is stored across several drives: my laptop drive and 2 EHDDs (USB 3.0).
The Similarity cache shows a total of 393,693 files.

And every time I want to search for duplicates I have to deal with the following:
1) When I open the program, it takes between 1 and 3 minutes to load and for the GUI to appear.
2) Then, when I start the comparison: a couple of times I have compared all my files (I have not created groups), and it takes 4 to 5 full days to complete. The first time I launched this comparison, I thought it took longer because the cache was being built, but then I launched the exact same comparison again and it took the same 4 to 5 days. Isn't it supposed to take much less? I had not added or deleted any files; I just launched the same comparison again to check the difference between having to build the cache and having the cache already created.

Are there any plans to further optimize Similarity, to really take advantage of the cache and dramatically reduce the time, and also to process a huge collection more efficiently?

Thanks

AntiBotQuestion · « **Reply #1 on:** February 15, 2016, 08:29:23 »

It requires way less than 300k tracks to be annoying, I tell you. Similarity keeps telling me I've got some twenty hours left, give or take, and it can do that for days. Down to sixteen now, two days after I completed a scan and started a new one, just to test.
My disk is USB2 only; if the bottleneck were read speed, it would not be using 90 percent CPU all the time. For days.

ektorbarajas · « **Reply #2 on:** February 15, 2016, 12:55:09 »

That is the point, I don't know why the CPU usage is high but Similarity is taking ages to do its job.

My 2 EHDD are USB 3.0, but what concerns me is that it appears that the cache is useless (at least for a very high volume of files).

I can't believe that I performed a full scan, added several new files to the cache, and that running the exact same scan (with no newly added files) takes the same amount of time as the initial scan.

ektorbarajas · « **Reply #3 on:** February 23, 2016, 05:12:39 »

Any feedback or comment from the developers?

Admin · « **Reply #4 on:** February 29, 2016, 15:29:24 »

Similarity's algorithms aren't linear, or even better logarithmic; they are quadratic. The complexity of Similarity's content-based algorithm is N^2. When calculated directly, each new file needs to be compared with all previous ones (it can't be looked up through an index as in relational databases, because fingerprints can't be ordered as greater or lower). For example, for 300K files we have the sum of an arithmetic progression: (1 + 300000) * 300000 / 2 = 90000300000 / 2 comparisons. Compare that with 100K files: (1 + 100000) * 100000 / 2 = 10000100000 / 2. You see, comparing 300K files takes 9 times longer than comparing 100K, not just 3 times. Even worse, if your computer (all CPUs and GPUs) can compare 1 million fingerprints per second (a very, very fast computer), processing 300K files takes about 12.5 hours.
To optimize this we added the duration check, which dramatically decreases the comparison count.
And we are already working on a new algorithm that can be used to compare 1 million files and will be linear, but it is still far from completion.
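
The arithmetic above, and the way a duration check can cut the number of comparisons, can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the duration rule used here (the shorter track must be at least 95% as long as the longer one) is a guess at what such a check might do, not Similarity's actual implementation. `pair_count` uses the formula from the post; the number of distinct pairs, N(N-1)/2, differs negligibly at this scale.

```python
import bisect

def pair_count(n: int) -> int:
    """Comparison count using the formula from the post: (1 + n) * n / 2."""
    return (1 + n) * n // 2

def pairs_after_duration_check(durations, ratio=0.95):
    """Count pairs whose shorter duration is at least `ratio` of the longer one.

    Hypothetical model of a duration check: only these pairs would go on
    to the expensive fingerprint comparison.
    """
    ds = sorted(durations)
    total = 0
    for i, d in enumerate(ds):
        # Later items in the sorted list are candidates while their
        # duration does not exceed d / ratio.
        upper = bisect.bisect_right(ds, d / ratio, lo=i + 1)
        total += upper - (i + 1)
    return total

if __name__ == "__main__":
    # Tripling the file count multiplies the comparison count by about 9.
    print(pair_count(300_000) / pair_count(100_000))
    # Of the 6 possible pairs below, only (100, 101) survives the check.
    print(pairs_after_duration_check([100, 101, 200, 300]))
```

Sorting the durations once is cheap compared with the fingerprint comparisons, which is why a pre-filter of this kind pays off even though the worst case (all tracks of similar length) stays quadratic.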

ektorbarajas · « **Reply #5 on:** February 29, 2016, 19:19:33 »

Thanks for the reply. I really appreciate that.

But isn't the cache supposed to minimize the time for a second scan? (One that is exactly the same as the first scan.)

As mentioned before, I did some tests running the exact same comparison: the first time the cache grew because it added new files, but the second time it took the same time as when the cache was created.

Admin · « **Reply #6 on:** February 29, 2016, 20:52:04 »

Yes, indeed the cache helps to skip the file decoding and preprocessing procedure, but the time for this procedure is linear, i.e. it is calculated only once per file.
Here is an example; just pretend we have a more realistic, very fast computer that can prepare 10 caches per second and compare 100K fingerprints per second.

N	Comparison time [(N+1)*N/2 / 100000 / 3600]	Preparing time [N / 10 / 3600]	Preparing time, % of total
10000 files	0,14 hours	0,28 hours	66,66 %
100000 files	13,89 hours	2,78 hours	16,67 %
300000 files	125,00 hours	8,33 hours	6,25 %
1000000 files	1388,89 hours	27,78 hours	1,96 %

You see, for larger numbers of files the importance of caching decreases.
This calculation is idealized, without the duration skip mechanism (i.e. with it disabled).
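
The figures in the table can be recomputed with a few lines of Python, using the rates stated in the post (10 caches prepared per second, 100K fingerprints compared per second). The function name and parameter names are hypothetical; this is a sketch of the estimate, not code from Similarity.

```python
def scan_estimate(n, compares_per_sec=100_000, caches_per_sec=10):
    """Return (compare_hours, prepare_hours, prepare_share_percent) for n files."""
    # Quadratic part: every file is compared with all previous ones.
    compare_hours = (n + 1) * n / 2 / compares_per_sec / 3600
    # Linear part: decoding and preprocessing happen once per file.
    prepare_hours = n / caches_per_sec / 3600
    # Share of the total time that a warm cache could save at best.
    share = 100 * prepare_hours / (compare_hours + prepare_hours)
    return compare_hours, prepare_hours, share

if __name__ == "__main__":
    for n in (10_000, 100_000, 300_000, 1_000_000):
        c, p, s = scan_estimate(n)
        print(f"{n:>9} files  compare {c:8.2f} h  prepare {p:6.2f} h  share {s:5.2f} %")
```

The last column is the upper bound on what the cache can save: at 300,000 files the preparation step is only a few percent of the total, so a second scan with a warm cache takes almost as long as the first.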

ektorbarajas · « **Reply #7 on:** February 29, 2016, 23:08:04 »

Thanks for the explanation!

I look forward to the next update and also to the new algorithm implementation.

Regards

Springdream · « **Reply #8 on:** September 18, 2018, 08:50:34 »

I also often use grouping, with a use case similar to the one described at the beginning:

As soon as you use grouping, not every item needs to be compared with all others; each item in e.g. group 1 only needs to be compared with group 2.
=> It should be linear, shouldn't it?

A further improvement might be: as soon as one match is found (often that means one song is already duplicated), further comparing could be stopped for that item, as it is not necessary to know that there is more than one duplicate...

Also, I notice that the count goes up to the number of items in groups 1+2. Group 2 will be the one that can be deleted with automarked files, so shouldn't the count only go up to the number of items in group 2?!

Best, Fred
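
The grouping idea and the early stop suggested above can be sketched as follows. This is an illustration only: `cross_group_matches`, `new_items`, `collection` and `is_duplicate` are hypothetical names, and nothing here is taken from Similarity itself. It shows that comparing one group against another costs at most len(new_items) * len(collection) comparisons, which is linear in the number of new files for a fixed collection but still proportional to the collection size.

```python
def cross_group_matches(new_items, collection, is_duplicate):
    """Compare each new item only against the collection, stopping at the first hit.

    Returns (matches, comparisons), where matches maps a new item to the
    first collection item it duplicates.
    """
    matches = {}
    comparisons = 0
    for item in new_items:
        for other in collection:
            comparisons += 1
            if is_duplicate(item, other):
                matches[item] = other
                break  # early stop: one duplicate is enough to flag the item
    return matches, comparisons

if __name__ == "__main__":
    # Toy example: integers stand in for fingerprints, equality for similarity.
    found, count = cross_group_matches([1, 2, 3], [3, 2, 9], lambda a, b: a == b)
    print(found, count)
```

The early stop only helps for items that actually have a duplicate; an item with no match is still compared against the whole collection.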

Springdream · « **Reply #9 on:** January 04, 2019, 09:16:10 »

Happy New Year!

I found some time to work with Similarity and I am still suffering from such a progressive slowdown. It gets so slow that it is almost impossible to use.
Use case: 100.000 files (radio recordings), same effect for precise comparison (with or without global optimization) and tag comparison.

Before I get into more details, the good news: clearing the cache helps a lot. I had not cleared the cache for 5 years, with >1.000.000 items, and after clearing it the speed returns to the initial (high) level. Even when the cache is cleared more often and at the end of the comparison.

The CPU load goes down from 100% to 60% but recovers within 30s... or some minutes. At low CPU load the bottleneck is the disk access speed of the data disk (the queue is at maximum).
Then slowly CPU gets to be the bottleneck again.

To me it looks like the cache is not only adding less benefit at larger numbers of files but is really counterproductive.

=> I'd suggest to add an option to keep it off. Or (if you can imagine the root cause of that issue) implement it differently.

Thank you,
Fred

Seader · « **Reply #10 on:** January 30, 2019, 18:24:41 »

Is 60% the lowest the CPU goes down to? That doesn't seem all that bad tbh.

Admin · « **Reply #11 on:** September 11, 2019, 19:08:52 »

Quote from: Springdream on January 04, 2019, 09:16:10

=> I'd suggest to add an option to keep it off. Or (if you can imagine the root cause of that issue) implement it differently.

Sorry for the delay.
Similarity uses cache data to compare the next files with the current one, so it can't be disabled; theoretically you can only prevent it from being saved to the hard disk.
