Author Topic: Similarity seems not optimized for 300,000+ mp3 files  (Read 76581 times)

ektorbarajas

  • Jr. Member
  • **
  • Posts: 45
    • DJ ektorbarajas
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #240 on: February 23, 2016, 05:12:39 »
Any feedback or comments from the developers?

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 664
    • https://www.smilarityapp.com
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #241 on: February 29, 2016, 15:29:24 »
Similarity's algorithms aren't linear or, even better, logarithmic; they are quadratic. The complexity of Similarity's content-based algorithm is N^2. When calculated directly, each new file needs to be compared with all the previous ones (the fingerprints can't be looked up through an index as in a relational database, because they can't be sorted into greater or lower). For example, for 300K files we have the sum of an arithmetic progression: (1 + 300000) * 300000 / 2 = 90000300000 / 2 comparisons. Compare that with 100K files: (1 + 100000) * 100000 / 2 = 10000100000 / 2. As you can see, comparing 300K files takes 9 times longer than comparing 100K files, not just 3 times. Even worse, if your computer (all CPUs and GPUs) can compare 1 million fingerprints per second (a very, very fast computer), processing 300K files takes about 12.5 hours.
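
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting formula quoted in the post, not Similarity's actual code; the function name `pairwise_comparisons` is made up for illustration.

```python
def pairwise_comparisons(n: int) -> int:
    """Comparison count from the post's formula (1 + n) * n / 2."""
    return (1 + n) * n // 2


small = pairwise_comparisons(100_000)
large = pairwise_comparisons(300_000)

print(small)          # 5000050000
print(large)          # 45000150000
print(large / small)  # roughly 9: tripling the files gives about 9x the work
```
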
To optimize this, we added a duration check, which dramatically decreases the comparison count.
We are also already working on a new algorithm that can be used to compare 1 million files and will be linear, but it is still far from completion.
« Last Edit: February 29, 2016, 20:53:20 by Admin »
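
The post does not say how the duration check is implemented. As a rough illustration of why such a check can cut the comparison count, here is a minimal sketch that assumes two files are only worth fingerprint-comparing when their durations differ by at most some tolerance. The function name, the tolerance value, and the sample durations are all hypothetical.

```python
def candidate_pairs(durations, tolerance):
    """Return index pairs whose durations differ by at most `tolerance` seconds.

    Assumed approach: sort by duration, then slide a window so each file is
    paired only with files of similar length instead of with every other file.
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    pairs = []
    start = 0
    for pos, i in enumerate(order):
        # Move the window start forward until it is within tolerance of file i.
        while durations[i] - durations[order[start]] > tolerance:
            start += 1
        for j in order[start:pos]:
            pairs.append((j, i))
    return pairs


# Hypothetical durations in seconds.
durations = [180.0, 181.0, 240.0, 240.5, 600.0]
pairs = candidate_pairs(durations, tolerance=2.0)
print(pairs)  # [(0, 1), (2, 3)]
```

With these sample values only 2 of the 10 possible pairs remain as candidates. How much a real library benefits depends on how the durations are distributed.
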

ektorbarajas

  • Jr. Member
  • **
  • Posts: 45
    • DJ ektorbarajas
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #242 on: February 29, 2016, 19:19:33 »
Thanks for the reply. I really appreciate that.

But isn't the cache supposed to minimize the time for a second scan (one that is exactly the same as the first scan)?

As mentioned before, I ran some tests doing the exact same comparison: the first time, the cache grew because new files were added, but the second time the scan took just as long as when the cache was created.

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 664
    • https://www.smilarityapp.com
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #243 on: February 29, 2016, 20:52:04 »
Yes, the cache does help to skip the file decoding and preprocessing procedure, but the time for this procedure is linear, i.e. it is calculated only once per file.
Here is an example. Let's assume a more realistic, but still very fast, computer that can prepare 10 cache entries per second and compare 100K fingerprints per second.
N             | Comparison time [(N+1)*N/2 / 100000 / 3600] | Preparing time [N / 10 / 3600] | % preparing time
10000 files   | 0.14 hours                                  | 0.28 hours                     | 66.66 %
100000 files  | 13.89 hours                                 | 2.78 hours                     | 16.67 %
300000 files  | 125.00 hours                                | 8.33 hours                     | 6.25 %
1000000 files | 1388.90 hours                               | 27.78 hours                    | 1.96 %
As you can see, the more files there are, the less the caching matters.
This calculation is idealized: it assumes the duration-skip mechanism is disabled.
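
The figures in the table can be reproduced from the rates stated in the post (10 cache entries prepared per second, 100K fingerprint comparisons per second). This is only a sketch of that estimate; the constant and function names are made up for illustration.

```python
COMPARISONS_PER_SECOND = 100_000  # rate assumed in the post
CACHES_PER_SECOND = 10            # rate assumed in the post


def estimate(n: int):
    """Return (comparison hours, preparing hours, preparing share in %)."""
    compare_hours = (n + 1) * n / 2 / COMPARISONS_PER_SECOND / 3600
    prepare_hours = n / CACHES_PER_SECOND / 3600
    share = 100 * prepare_hours / (compare_hours + prepare_hours)
    return compare_hours, prepare_hours, share


for n in (10_000, 100_000, 300_000, 1_000_000):
    compare_hours, prepare_hours, share = estimate(n)
    print(f"{n:>8} files  {compare_hours:8.2f} h  {prepare_hours:6.2f} h  {share:6.2f} %")
```

The preparing time grows linearly with N while the comparison time grows with N^2, which is why the last column shrinks as the library gets larger.
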

ektorbarajas

  • Jr. Member
  • **
  • Posts: 45
    • DJ ektorbarajas
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #244 on: February 29, 2016, 23:08:04 »
Thanks for the explanation!

I look forward to the next update and also to the new algorithm implementation.

Regards

Springdream

  • Jr. Member
  • **
  • Posts: 51
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #245 on: September 18, 2018, 08:50:34 »
I also often use grouping, with a use case similar to the one described at the beginning:

As soon as you use grouping, not every item needs to be compared with all the others; each item in group 1, for example, only needs to be compared with the items in group 2.
=> it should be linear, shouldn't it?

A further improvement might be: as soon as one match is found (often that means the song already exists as a duplicate), further comparison could be stopped for that item, as it is not necessary to know that there is more than one duplicate...

Also, I notice that the count goes up to the number of items in groups 1 + 2. Group 2 will be the one whose automarked files can be deleted, so shouldn't the count only go up to the number of items in group 2?!

Best, Fred
« Last Edit: September 18, 2018, 17:22:01 by Springdream »
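
The two suggestions above (compare only across groups, and stop at the first match) can be sketched as follows. This is not how Similarity works internally; `find_duplicates_across_groups` and the `is_match` callback are hypothetical names, and plain string equality stands in for a fingerprint comparison.

```python
def find_duplicates_across_groups(keep, candidates, is_match):
    """Compare each candidate only against the `keep` group.

    Stops at the first match for a candidate, since one hit is enough to
    mark it as a duplicate. Returns (duplicates, comparisons performed).
    """
    comparisons = 0
    duplicates = []
    for candidate in candidates:
        for reference in keep:
            comparisons += 1
            if is_match(candidate, reference):
                duplicates.append(candidate)
                break  # early exit: further matches add no information
    return duplicates, comparisons


# Hypothetical data: strings stand in for audio fingerprints.
group_1 = ["a", "b", "c"]
group_2 = ["b", "x"]
duplicates, comparisons = find_duplicates_across_groups(
    group_1, group_2, lambda left, right: left == right
)
print(duplicates, comparisons)  # ['b'] 5
```

Note that a candidate with no match is still compared against every item of the other group, so in the worst case the count is len(group_1) * len(group_2) rather than a linear number of comparisons.
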

Springdream

  • Jr. Member
  • **
  • Posts: 51
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #246 on: January 04, 2019, 09:16:10 »
Happy New Year!

I found some time to work with Similarity, and I am still suffering from a progressive slowdown. It gets so slow that it is almost impossible to use.
Use case: 100,000 files (radio recordings); same effect for precise comparison (with or without global optimization) and for tag comparison.

Before I get into more details, the good news: clearing the cache helps a lot. I had not cleared the cache in 5 years, and it held >1,000,000 items; after clearing it, the speed returned to the initial (high) level. It also helps when the cache is cleared more often, and at the end of the comparison.

The CPU load drops from 100% to 60% but recovers within 30 s... or a few minutes. At low CPU load, the bottleneck is the access speed of the data disk (its queue is at maximum).
Then the CPU slowly becomes the bottleneck again.

To me it looks like the cache not only brings less benefit with a larger number of files, but is actually counterproductive.

=> I'd suggest adding an option to keep it off. Or (if you can imagine the root cause of that issue) implementing it differently.

Thank you,
Fred

Seader

  • Jr. Member
  • **
  • Posts: 1
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #247 on: January 30, 2019, 18:24:41 »
Is 60% the lowest the CPU goes down to? That doesn't seem all that bad tbh.

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 664
    • https://www.smilarityapp.com
Re: Similarity seems not optimized for 300,000+ mp3 files
« Reply #248 on: September 11, 2019, 19:08:52 »
=> I'd suggest adding an option to keep it off. Or (if you can imagine the root cause of that issue) implementing it differently.

Sorry for the delay.
Similarity uses the cache data to compare subsequent files with the current one, so it can't be disabled. Theoretically, you could only prevent it from being saved to the hard disk.