Similarity algorithms are not linear, or even better logarithmic; they are quadratic. The complexity of the content-based similarity algorithm is O(N^2). Computed directly, each new file has to be compared with all previous ones (there is no index to look up, as in a relational database, because fingerprints cannot be sorted as greater or lower). For example, 300K files give a sum of an arithmetic progression of (1 + 300,000) * 300,000 / 2 ≈ 4.5 * 10^10 comparisons; compare that with 100K files: (1 + 100,000) * 100,000 / 2 ≈ 5 * 10^9. As you can see, comparing 300K files takes 9 times longer than comparing 100K, not just 3 times. Even worse: if your computer (all CPUs and GPUs) can compare 1 million fingerprint pairs per second (a very, very fast computer), processing 300K files still takes about 12.5 hours.
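As a quick sanity check of these numbers, here is a minimal Python sketch of the naive all-pairs approach; `are_similar` and the other names are illustrative only, not the project's actual API:

```python
# Minimal sketch of the naive all-pairs approach described above.
# `are_similar` is a hypothetical pairwise fingerprint comparison.

def find_similar(fingerprints, are_similar):
    matches = []
    for i, new_fp in enumerate(fingerprints):
        # the i-th file has to be checked against all i previous files
        for prev_fp in fingerprints[:i]:
            if are_similar(prev_fp, new_fp):
                matches.append((prev_fp, new_fp))
    return matches


def pair_count(n):
    # arithmetic-progression estimate used in the text; exact count is n*(n-1)/2
    return (1 + n) * n // 2


print(pair_count(300_000))                        # ~4.5e10 comparisons
print(pair_count(300_000) / pair_count(100_000))  # ~9: 3x the files, 9x the work
print(pair_count(300_000) / 1_000_000 / 3600)     # ~12.5 hours at 1M comparisons/sec
```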
To optimize this, we added a duration check, which dramatically decreases the number of comparisons.
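A rough sketch of how such a pre-filter can work is shown below; the bucketing-by-duration idea, the bucket width, and the helper names are assumptions for illustration, not the project's actual implementation. The point is that only files whose durations fall into the same or a neighboring bucket are ever compared:

```python
from collections import defaultdict

# Hedged sketch: pre-filter by duration so only files of similar length are
# compared. The 5-second bucket width and function names are assumptions.

BUCKET_SECONDS = 5

def find_similar_with_duration_filter(files, are_similar):
    """files: iterable of (duration_seconds, fingerprint) pairs.
    are_similar: hypothetical pairwise fingerprint comparison."""
    buckets = defaultdict(list)
    for duration, fp in files:
        buckets[int(duration // BUCKET_SECONDS)].append(fp)

    matches = []
    for key, group in buckets.items():
        # candidates: this bucket plus the next one, so pairs that straddle
        # a bucket boundary are not missed
        candidates = group + buckets.get(key + 1, [])
        for i, fp_a in enumerate(group):
            for fp_b in candidates[i + 1:]:
                if are_similar(fp_a, fp_b):
                    matches.append((fp_a, fp_b))
    return matches
```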
And we are already working on a new algorithm that can be used to compare 1 million files and will be linear, but it is still far from completion.