Topic: If a song is displayed as the "primary", don't show it as a "candidate match"

thegleep

  • Jr. Member
  • **
  • Posts: 2
Let me try to explain that a little better.

I have two folders I'm scanning - "Music" and "Copied Music".  It is very common to see something like this:

Music\Song1.mp3
...Copied Music\Song1.mp3            100%

Copied Music\Song1.mp3
...Music\Song1.mp3                       100%

The first few times I used the program, I managed to delete *both* because they were both marked as "100%".

For this situation, I'd rather just see:
Music\Song1.mp3
...Copied Music\Song1.mp3             100%



Maybe if I just said "Only show each file once...ever" it might be more clear?

PS: I guess this is really the same issue as this topic:
http://www.music-similarity.com/forum/index.php?action=vthread&forum=2&topic=178

Does anyone see a better way to resolve the problem?

Oleggg10

  • Jr. Member
  • **
  • Posts: 12
Please do not attack the administrator; he has a lot of work. Please do not create duplicate topics.
Everybody knows about this problem, and there are many possible solutions. We will see which one actually appears in future versions.
(Sorry for the clumsy English, using transliteration)

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
In my opinion this is the most critical issue. It destroys the basic functionality and purpose of the program. I have also suffered because of this. Luckily, all my music is on an extra disk and I was able to recover most songs completely intact, so I can still figure out whether any copies of them are left. Normally I would not have deleted them completely from the recycle bin, but my recycle bin was going crazy. That part is not entirely music-similarity's fault...

The only safe way to use this program at the moment is to delete match groups one by one so the list will be reprocessed to purge inverse relationships. It really is important that people understand this.

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 664
    • https://www.smilarityapp.com
lakecityransom,
There is no simple answer as to the best way to show groups.

We have 3 possible solutions:

The input data is 6 files whose similarity scheme is shown in the picture. We don't need percentages here; the files are simply similar somehow. How should we show them?

1) Current implementation - every file forms a group:

1.mp3
       2.mp3
       3.mp3
       4.mp3
2.mp3
       1.mp3
       3.mp3
       6.mp3
3.mp3
       1.mp3
       2.mp3
       5.mp3
4.mp3
       1.mp3
       5.mp3
5.mp3
       3.mp3
       4.mp3
6.mp3
       2.mp3

Pro: Everything is mathematically correct :)
Neg: Too many duplicates, hard to understand
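
For anyone who wants to experiment with this, the listing above can be reproduced with a minimal Python sketch. The seven similar pairs are read off the listing itself (the original picture is not available), and the names `SIMILAR_PAIRS` and `groups_per_file` are made up for illustration; this is not the program's actual code.

```python
from collections import defaultdict

# Assumption: the similarity links implied by the listing above.
SIMILAR_PAIRS = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 6), (3, 5), (4, 5)]


def groups_per_file(pairs):
    """Return {header: sorted matches}; every pair is listed under both files."""
    groups = defaultdict(set)
    for a, b in pairs:
        groups[a].add(b)  # b appears as a match of a ...
        groups[b].add(a)  # ... and a appears again as a match of b
    return {header: sorted(groups[header]) for header in sorted(groups)}


# Print the result in the same indented layout as the listing.
for header, matches in groups_per_file(SIMILAR_PAIRS).items():
    print(f"{header}.mp3")
    for match in matches:
        print(f"       {match}.mp3")
```

Each pair shows up twice (once under each file), which is exactly where the inverse entries come from.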


2) Possible solution #1 - every file forms a group, but each pair is shown only once:

1.mp3
       2.mp3
       3.mp3
       4.mp3
2.mp3
       3.mp3
       6.mp3
3.mp3
       5.mp3
4.mp3
       5.mp3

Pro: Fewer results, and every pair is still shown (with its %)
Neg: Not very well grouped
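
A minimal Python sketch of this "pairs shown only once" display, assuming the same seven similar pairs as in the example: each pair is listed only under the file that comes first in sort order. The names `SIMILAR_PAIRS` and `groups_pairs_once` are hypothetical, not taken from the program.

```python
# Assumption: the similarity links implied by the 6-file example.
SIMILAR_PAIRS = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 6), (3, 5), (4, 5)]


def groups_pairs_once(pairs):
    """Return {header: matches}, listing each pair only under its lower file."""
    # Normalise each pair to (lower, higher) and drop repeats such as (2, 1).
    unique_pairs = sorted({tuple(sorted(pair)) for pair in pairs})
    groups = {}
    for low, high in unique_pairs:
        groups.setdefault(low, []).append(high)
    return groups


for header, matches in groups_pairs_once(SIMILAR_PAIRS).items():
    print(f"{header}.mp3")
    for match in matches:
        print(f"       {match}.mp3")
```

Files 5 and 6 never become headers here, because all of their pairs are already listed under an earlier file.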

3) Possible solution #2 - group everything that is connected at least once:

1.mp3
       2.mp3
       3.mp3
       4.mp3
       5.mp3
       6.mp3

Pro: Best results, everything forms one group
Neg: Not intuitive to display. File 6.mp3 is similar to 2.mp3, so why is it grouped with 1.mp3? Why are the results for 6-2 shown under 1? What must we do if the user deletes 2.mp3? Then 6 must also disappear. What percentage must we show for 3.mp3: its similarity to 1.mp3, to 2.mp3, or to 5.mp3? It is very, very hard to understand how it works, but in users' minds it is the best.
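
Grouping "everything that is connected at least once" amounts to finding connected components of the similarity graph. Here is a minimal Python sketch using a union-find structure, assuming the same seven pairs as in the example; the names `SIMILAR_PAIRS` and `connected_groups` are hypothetical.

```python
# Assumption: the similarity links implied by the 6-file example.
SIMILAR_PAIRS = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 6), (3, 5), (4, 5)]


def connected_groups(pairs):
    """Merge files into groups whenever any chain of similar pairs links them."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller file as representative of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in groups.values())


print(connected_groups(SIMILAR_PAIRS))
```

Note that 6.mp3 lands in the same group as 1.mp3 only through 2.mp3, which is the source of the display questions raised above.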

In the next release we will try to implement "Possible solution #1".

If you have better ideas, please tell us.
« Last Edit: July 25, 2010, 22:02:45 by Admin »

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
Apologies for not posting on the bigger topic in general section.

My complaint about solution #1 is that accidental deletion is still possible in the matches. Consider the following based on your example:

Group #1: 3.mp3 is preferred; 1.mp3, 2.mp3 and 4.mp3 are chosen for deletion.
Group #2: 2.mp3 is preferred; 3.mp3 and 6.mp3 are chosen for deletion.
Both groups are deleted at the same time. 3.mp3 is lost although it was meant to be the kept mp3 for group #1 (and 2.mp3, the kept mp3 for group #2, is lost the same way).
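
To make the conflict concrete, here is a small Python sketch that checks a batch of selections for files that are kept in one group but marked for deletion in another. The data layout (one `(keep, delete_list)` tuple per group) and the name `find_conflicts` are my own assumptions for illustration.

```python
def find_conflicts(choices):
    """Return files that are the 'keep' of one group but deleted by another.

    choices: list of (file_to_keep, [files_to_delete]) tuples, one per group.
    """
    kept = {keep for keep, _ in choices}
    deleted = {name for _, to_delete in choices for name in to_delete}
    return sorted(kept & deleted)


# The two groups from the example above.
choices = [
    ("3.mp3", ["1.mp3", "2.mp3", "4.mp3"]),  # Group #1
    ("2.mp3", ["3.mp3", "6.mp3"]),           # Group #2
]
print(find_conflicts(choices))  # ['2.mp3', '3.mp3']
```

A check like this could at least warn before a batch delete, even if the list itself is not redesigned.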

I would recommend solution #2 as an alternate option, but the drawback of basing similarity on 1 file is indeed a problem. Many files will have a weaker relationship to that file than they would have to another mp3 in the group. This means a lot of false positives. However, you do have controls on the match percentages now, so it can be limited to high % matches. This would not be the perfect solution: the program would have to be run multiple times to keep trimming down results, and the results that were not taken care of will appear again. However, it can be useful for deleting a good amount of duplicates in an easy and secure way on a large file set.

I can see some issues in accidental match deleting in solution 2 maybe? Unless you allow each mp3 to be on the list only once whether it be candidate or candidate match. That is, unless the file purging process is redesigned as explained below:

My suggestion: Batch deletions should be performed like deleting files from 1 file group at a time:

Currently:
--------------------
I delete 1 file group at a time: The deleted files are checked against the remaining list and all inverse relationships and matches under all groupings are purged.

I delete a bunch of files under multiple groups at once: The deleted file list is compared to the remaining file list. If the deleted file list contains inverse relationships among candidates or the issue I described for solution #1, many incorrect deletions will occur.
--------------------

So, if a user selects a bunch of files for deletion, in the code delete them 1 by 1 and compare them to the list. This will erase groupings of same matches or inverse relationships, preventing unintended deletions even if they are selected again further down the list.
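
Here is a rough Python sketch of what I mean, not the program's real code. It assumes the result list is a dict of header file to candidate matches, and that a selection is a `(header, candidate)` pair marking that candidate for deletion; the name `delete_one_by_one` is made up.

```python
def delete_one_by_one(groups, selections):
    """Apply deletions in order, purging the list after each one.

    groups: {header: [candidate, ...]} as shown in the result list.
    selections: [(header, candidate), ...] in list order.
    Returns (deleted_files, skipped_selections).
    """
    remaining = {header: list(cands) for header, cands in groups.items()}
    deleted, skipped = [], []
    for header, candidate in selections:
        # The selection is stale if its group or entry was already purged.
        if candidate not in remaining.get(header, []):
            skipped.append((header, candidate))
            continue
        deleted.append(candidate)
        # Purge the inverse group headed by the deleted file ...
        remaining.pop(candidate, None)
        # ... and remove the deleted file from every other group.
        for other in list(remaining):
            remaining[other] = [c for c in remaining[other] if c != candidate]
            if not remaining[other]:
                del remaining[other]  # a group with no matches left disappears
    return deleted, skipped


groups = {
    "Music\\Song1.mp3": ["Copied Music\\Song1.mp3"],
    "Copied Music\\Song1.mp3": ["Music\\Song1.mp3"],
}
# The user ticks the 100% match in BOTH groups.
selections = [
    ("Music\\Song1.mp3", "Copied Music\\Song1.mp3"),
    ("Copied Music\\Song1.mp3", "Music\\Song1.mp3"),
]
deleted, skipped = delete_one_by_one(groups, selections)
print(deleted)  # ['Copied Music\\Song1.mp3']
print(skipped)  # [('Copied Music\\Song1.mp3', 'Music\\Song1.mp3')]
```

The second selection is ignored because its group was purged by the first deletion, so one copy of the song always survives.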

r0lZ

  • Jr. Member
  • **
  • Posts: 3
IMO, one possible solution to this problem could be to allow the user to specify two different source groups.

Currently, you can specify as many folders as you wish, but they are all included in the same group, and all files are checked against each other.  With two different source groups of folders, you could consider that group 1 must NEVER be touched, and group 2 contains potential dupes of files in group 1.  So, it would be easy to list the files in group 1 as the headers (with no similarity percentages), and beneath them, the files of group 2 only.

File 1 of group 1
- File N1 of group 2 - 100%
- File N2 of group 2 - 80%
File 2 of group 1
- File N3 of group 2 - 90%
etc...

Since File N1 of group 2 is in group 2, it will not be listed as a header file, and similarly, the files of group 1 will never be listed as potential dupes.
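
As a rough illustration of that listing, here is a minimal Python sketch. It assumes the similarity scores are already computed and stored in a dict keyed by `(group 1 file, group 2 file)`; the function name, the `scores` layout and the 0.8 threshold are all hypothetical.

```python
def list_by_source_groups(protected, candidates, scores, threshold=0.8):
    """Build {group-1 header: [(group-2 file, score), ...]}.

    Only group-1 files can be headers and only group-2 files can be matches,
    so no file ever appears in both roles.
    """
    listing = {}
    for master in protected:
        matches = [
            (dupe, scores[(master, dupe)])
            for dupe in candidates
            if scores.get((master, dupe), 0.0) >= threshold
        ]
        if matches:
            # Best match first, as in the example listing.
            listing[master] = sorted(matches, key=lambda m: m[1], reverse=True)
    return listing


scores = {
    ("File 1", "N1"): 1.0,
    ("File 1", "N2"): 0.8,
    ("File 2", "N3"): 0.9,
    ("File 2", "N1"): 0.3,  # below the threshold, not listed
}
for header, matches in list_by_source_groups(
    ["File 1", "File 2"], ["N1", "N2", "N3"], scores
).items():
    print(header)
    for dupe, score in matches:
        print(f"- {dupe} - {score:.0%}")
```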

(That method is supported by Audio Comparer, and I like it.  But AC is not a good product, it is extremely slow, it misses most similar files, and it is not free.)

Of course, it is still necessary to support the current method, to find similarities within a single group.  In that case, perhaps a good improvement would be to highlight the files that have already been listed above in the list in a different color.

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
I hope my post was understandable. At any rate, I cannot use this program to its full capability until it's 100% sure you cannot delete a song you meant to keep. Otherwise, you must go through each group and delete results in each group 1 by 1 to update the results and prevent this issue.  For this reason, I suggest that you should process selections one by one if you select multiple files. If the following groups in the results are purged due to deletion, ignore all selections within those groups. This would allow you to delete multiple files at once without worrying that you selected the same match in match groups after the group you have evaluated.

thegleep

  • Jr. Member
  • **
  • Posts: 2
I like the suggestion from post #6; have "source groups", and only files in the "source group" can be headers for that group.

I know it's a lot more work, but maybe implement *all* the grouping options, and the users can choose which works best for them?

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
The suggestion is good, assuming you already have a preferred list. The problem is, say you got tons of collections and have many dupes as there are many popular songs that are duplicated. However, these songs are not already in your library.

For this reason, it seems to me that that option is like post-exploration of music. I do not see much benefit from it. Sure, you can purge duplicate results from music you know you already have, but still there are going to be a lot of duplicates of music you do not already have.

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 664
    • https://www.smilarityapp.com
lakecityransom,
In the next version we will implement a folder grouping mechanism: you can mark 9 different groups, and only files from different groups are compared, never files in the same group.
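
A minimal Python sketch of the comparison rule described here, assuming each file carries the number of the group its folder was marked with. The function name `cross_group_pairs` and the data layout are hypothetical and only illustrate the rule; they are not the program's implementation.

```python
from itertools import combinations


def cross_group_pairs(file_groups):
    """Return the file pairs to compare: only files from different groups.

    file_groups: {file name: group number}.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(file_groups), 2)
        if file_groups[a] != file_groups[b]
    ]


file_groups = {
    "Music\\Song1.mp3": 1,
    "Music\\Song2.mp3": 1,
    "Copied Music\\Song1.mp3": 2,
}
for pair in cross_group_pairs(file_groups):
    print(pair)
```

The two files in group 1 are never compared with each other, so duplicates inside a single group are not found with this rule.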

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
Well that is better, I just can't see why you would not code it in a manner in which the deletion candidates are purged from the list 1 by 1 and it eliminates deletion candidates further down the list that were inverse matches and such. It would sort the problem out?

As I said, I can use Similarity in its current state if I delete single files and get a refreshed list. It prevents me from deleting the inverse relationships, in other words, deleting both files.

Nonetheless, I applaud your efforts.

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
Sorry I did not notice it was released, but when it comes to a majority of my stuff, I do not have a 'keep' version yet and I have hundreds upon hundreds of folders that have duplicates between them. For this reason a grouping function will not help.

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
The grouping function is nice but it is not very useful when you have many, many folders of files and/or are not sure if you want either group of files, so you cannot reduce the number of folders. You would have to run many compares. This leaves you with non-grouped searching and inverse relationship issues. This is why I feel that the delete function should take out 1 deletion selection at a time from the results and check if subsequent deletion selections still exist on the results list after refreshing the list.

For example, if you choose to keep the "keepme" base group and delete the "deleteme" candidate, then further down the list the inverse base group "deleteme" will disappear along with "keepme" as its candidate. Thus, if you chose to keep "keepme" but accidentally chose to delete "keepme" further down the list, you can rest assured that your second deletion will be omitted. In effect, you ensure that 1 copy of the file must exist. The logic behind this is that you are always choosing to keep 1 file out of every group base. Therefore, if the list is reprocessed 1 deletion at a time, it must reflect that you intend to keep that 1 file, because all inverse group bases that hold "keepme" as a candidate disappear.

Again, this can already be done by users if they simply delete one file at a time. Inverse relationships disappear as the list is updated. The problem is selecting a bunch of files deletes them all. Single file deletion is too slow, however.

Maybe I am crazy, but this seems foolproof? Screenshot to explain:

[Screenshot not preserved]

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 664
    • https://www.smilarityapp.com
lakecityransom,
Thanks for your comment.
In the future we will add group priorities to the "rearrange" and "automark" dialogs.

lakecityransom

  • Jr. Member
  • **
  • Posts: 25
Hi again, I made a mistake in the example. The right example should not choose 2 in one of the group bases and choose it for deletion instead. I'm guessing you understood though.

The thing I'm trying to get across is that if you simply have a bunch of songs to look through, the grouping function is not very helpful. You have no preferred group of folders and there are many folders. Maybe, for example, you took 10 people's collections and put them all together. There will be many duplicates, but the grouping function will not work well unless you run it 10 times.

The priority of groups will help when using it though.