Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests suite #43

Open
wants to merge 59 commits into
base: main
Choose a base branch
from
Open

Add tests suite #43

wants to merge 59 commits into from

Conversation

Serene-Arc
Copy link
Contributor

I thought it would be good to have this draft PR open so that we can discuss bugs that the tests reveal as I find them.

First one is a bug I've been trying to nail dain for a while: @Neurrone what is the reason that the path of the items were used to sort instead of the name? There's a comment that specifies that it needs to be that way but not why. I've made a bunch of test cases to check sorting under different naming schemes and the paths mess up, while the names do not (so far).

If you remember the use cases that necessitate using the path, could you please tell them so I can add them to the tests? It's possible that a more complex sorting algorithm will be required but it needs tests before I can start working on a test.

@Serene-Arc
Copy link
Contributor Author

Also, question: this plugin seems to have a different mechanism of matching tracks than the normal beets algorithm, like the methods for importing them are different. Is this intentional? It's very likely that I'm not understanding the whole process but there are issues that this impacts.

Consider the ordering issue; if beets can normally order the tracks according to the returned album when searching (either on MB or Audible), then that might be the solution to the ordering issue for the tracks, but that doesn't happen, and instead there is the natsort that fails in a lot of cases.

@Neurrone
Copy link
Owner

Apologies for the delayed response.

I am trying to remember the reason for this.

I believe it was because I found it unreliable to rely on the track titles and it was usually far more accurate to rely on the file names to natsort them to get the correct order of the tracks. Otherwise, I would have stuck to using the track titles, per default Beets behaviour.

If the ordering of the files on disks that Beets detects doesn't match the data source from Audible, then Beets would try to assign the wrong files to the wrong chapter titles from Audible.

@Serene-Arc
Copy link
Contributor Author

So I did some more testing and reimporting a bunch of books into my library to find those edge cases. Natural sorting works well for a lot of cases but not when bytes are used and not when there are different naming schemes (e.g. Chapter 1-10 and then a chapter named Prologue; the prologue will be placed at the end because P comes after C).

I'm not sure if there's any good way to sort this with a one-size fits all approach given the sheer breadth of possible track namings. I would like to try playing around with the track matching and distance algorithms, with your permission, to try and get that working correctly instead of this system with natsorting. There are a couple reasons for this:

  • I have had issues where the track sorting still fails and it renames chapters inappropriately, so this will have to be addressed anyway at some point
  • The Audible response is the master record of the chapters, and known to be correct. If we can match to that then we should; if we can't then natural sorting or a more complex algorithm can be a fallback.
  • This is somewhat at odds with the rest of beets and their methodology so intrinsically this method will not benefit from any further improvements that beets makes to their core functions.

It's up to you if you'd like to take this approach but right now, the tests I've got don't seem to have an immediate answer to get them working. They fail with the old method and they fail with a natural sort on the titles too.

@Neurrone
Copy link
Owner

I'd be happy to use a better solution.

I used the Natsort method on paths because it seemed to be the easiest way for it to work for most cases. Part of the trade-off was accepting it would never work all the time.

@Serene-Arc
Copy link
Contributor Author

Ignore all the commits here, I need to upload them to github to test them on my system, I'm working out the kinks with a matching method, with some promising results! See how it goes.

@Serene-Arc Serene-Arc marked this pull request as ready for review June 1, 2023 04:27
@Serene-Arc
Copy link
Contributor Author

It is finally complete! I have created a series of algorithms that allow for most books to be matched and which also allows for extension in the future. It is completely configurable and comes with tests that cover the entire chapter matching process along with a couple of other tests that cover the things that are likely to fail.

What do you think @Neurrone?

@Neurrone
Copy link
Owner

Neurrone commented Jun 5, 2023

Thanks for working on this. Given this is a large change, I'll probably only be able to take a look at this in a week or two.

@Serene-Arc
Copy link
Contributor Author

Sure no rush.

Copy link
Owner

@Neurrone Neurrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard work.

Did an initial pass to understand what was done. Here are some initial questions.

scrub:
auto: yes # optional, enabling this is personal preference
```
```yaml
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you re-indent this so that the code is correctly rendered as a code block as part of the list item? Also helps with reducing the size of the diff so that only the significant changes made show up.

beetsplug/audible.py Outdated Show resolved Hide resolved
@@ -104,11 +248,117 @@ def __init__(self):
self.add_media_field("subtitle", subtitle)

def album_distance(self, items, album_info, mapping):
dist = get_distance(data_source=self.data_source, info=album_info, config=self.config)
dist = beets.autotag.hooks.Distance()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between this function and the previous one that was used?

It seems to just ignore the source weight set in config.

return dist

def track_distance(self, item, track_info):
return get_distance(data_source=self.data_source, info=track_info, config=self.config)
dist = beets.autotag.hooks.Distance()
dist.add_string("track_title", item.title, track_info.title)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this is needed? I'd assume that the default distance algorithm already factors the distance between titles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the better way to do it according to the beets source code when constructing a custom distance. I didn't actually do that in the end because it would have taken a while but when I said we could use a customised Levenshtein algorithm, this is where it would be called.



def is_continuous_number_series(numbers: Iterable[Optional[int]]):
return all([n is not None for n in numbers]) and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to accept a possibly blank list of None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This typing means that there is an iterable which may have any combination of None or integers. This is because when the algorithm this code is called in is attempted, any files where there is no source numbering will return None. If this is the case, then it is rejected as there is no way to know where the unsourced file will fit into the numbering.

beetsplug/audible.py Show resolved Hide resolved
dist.add_string("track_title", item.title, track_info.title)
return dist

def attempt_match_trust_source_numbering(self, items: List[Item], album: AlbumInfo) -> Optional[List[Item]]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove the unused album parameter?

Likewise for the other attempt match functions below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. All of the functions need to have the exact same signature to work in the given code. It doesn't call any function explicitly but reads the configuration and then matches it to the appropriate function. Should also make it so it's easy to add new methods as well.

# if len(items) > len(album.tracks):
# # TODO: find a better way to handle this
# # right now just reject this match
# return None
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is commented out, what happens in this situation? None should still be returned but the item would fail to match.

I believe this case is currently being handled correctly in the existing matching algorithm.

beetsplug/audible.py Outdated Show resolved Hide resolved
beetsplug/audible.py Show resolved Hide resolved
"""If the input album has a single item, use that; if the album also has a single item, prefer that."""
if len(items) == 1:
# Prefer a single named book from the remote source
if len(album.tracks) == 1 and album.tracks[0].title != "Chapter 1":
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the extra condition on chapter 1 needed? It seems safe enough to trust it if the audible source has a single track.

Would prefer removing this check if itss not needed to simplify the logic.

matches = sorted_tracks
return matches

def attempt_match_starting_numbers(self, items: List[Item], album: AlbumInfo) -> Optional[List[Item]]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having some trouble understanding what this is trying to do, could you help walk me through this?

In what situations does this algorithm help? I.e, are there cases where this algorithm is better than the next one down the list (natural sorting) which is much simpler?

# magic number here, it's a judgement call
if max(average_title_change) < 4:
# can't assume that the tracks actually match even when there are the same number of items, since lengths
# can be different e.g. an even split into n parts that aren't necessarily chapter-based so just natsort
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situations is the extra logic to find and strip out common prefixes or suffixes needed, vs the previous method that just did a natural sort on the titles?

@Neurrone
Copy link
Owner

Neurrone commented Jul 8, 2023

I've finished looking at the implementation but haven't gone through the tests in detail yet.

Really appreciate the concept of a list of strategies that you introduced, since it does simplify the overall logic and allows customization.

My main concern is the complexity of some of the algorithms relative to how the matching was done previously. The previous implementation used the file input if the number of tracks didn't match the album info from Audible. I suspect that trying to perform matches in these cases (e.g, chapter_levenshtein) is hard to debug and may not work properly.

What do you think of restoring that logic and having a simpler default list? So something like:

  1. source_numbering
  2. Instead of having a special case for matching on a single file, replace it with a strategy that replaces the info from Audible with the input list if the number of chapters don't match
  3. natural_sort

@Serene-Arc
Copy link
Contributor Author

Complexity is required for complex cases. The simple fact is that the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.

The previous implementation used the file input if the number of tracks didn't match the album info from Audible.

To address your point, using the file input doesn't always work. It doesn't work when you don't have source numbering, which is a lot of audiobooks in my experience. Even if the number of tracks do match, that's not actually a guarantee that the actual tracks match. Consider the difference between a book with 20 chapters on audible, and a the actual audiobook which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the audible method is objectively wrong and won't match the actual book.

The goal of the many different measures is that each person can change the ordering based on what audiobooks they're importing at the time. If you're importing a book that you know Levenshtein will work well at, you can make it the primary method that the plugin uses.

Instead of having a special case for matching on a single file, replace it with a strategy that replaces the info from Audible with the input list if the number of chapters don't match

This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored (they're considered in the natural sort) but you don't know which matches to what track from Audible. You don't know if any of them match at all. What the program does is calculate all of the distances from every track in the files to every track from Audible. Then, it matches the closest match with the file and overwrites the title. This isn't actually guaranteed to be a close match. If there's a single letter in common, then the closest match will be that, but it's not guaranteed to be accurate or even relevant.

I suspect that trying to perform matches in these cases (e.g, chapter_levenshtein) is hard to debug and may not work properly.

Your concern over the Levenshtein is a little strange to me. This is the method that was used before. The Levenshtein function is what was used before when you called the track distance function. I just made it explicit so that we could improve on it. It's the last in the list because it's the least effective and most prone to errors. That is the old method.

@Neurrone
Copy link
Owner

Neurrone commented Jul 9, 2023

the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.

I understand, I'm just trying to get a feel for what examples in the wild would work better with these algorithms. I'll admit the inputs that I have are generally quite simple - either they match on a chapter to chapter basis, or they don't.

Even if the number of tracks do match, that's not actually a guarantee that the actual tracks match. Consider the difference between a book with 20 chapters on audible, and a the actual audiobook which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the audible method is objectively wrong and won't match the actual book.

This should already be handled currently. If the number of tracks matches, the previous method wouldn't override the match from Audible and it falls back to the default matching, so you'd see that the matches fail if the lengths are off for example.

This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored

What I think is missing here is the old behaviour where if the folder has 2 files while Audible returns 20 chapters, I want to have the ability to completely ignore the contents from Audible, since I know its not split by chapter. It is meant to prevent the poor results from falling back to the default Levenshtein function.

Your concern over the Levenshtein is a little strange to me. This is the method that was used before. The Levenshtein function is what was used before when you called the track distance function.

I didn't realize that, I'll take a look at the Beets source to understand this better.

@Serene-Arc
Copy link
Contributor Author

The tests that I've included are real world examples. I got a bunch of different audiobooks (hundreds) and started importing them. When I ran into issues, I added a test for the condition and worked to make sure that they passed correctly. That's why the tests try the correct ordering, then reversing, then randomisation, to make sure that they come up with the correct answer every time.

This should already be handled currently. If the number of tracks matches, the previous method wouldn't override the match from Audible and it falls back to the default matching, so you'd see that the matches fail if the lengths are off for example.

I don't think it did this? The old method was to match if the number of tracks were the same, regardless of length. That wasn't part of the calculation.

What I think is missing here is the old behaviour where if the folder has 2 files while Audible returns 20 chapters, I want to have the ability to completely ignore the contents from Audible, since I know its not split by chapter. It is meant to prevent the poor results from falling back to the default Levenshtein function.

You can! This would be the natural sort, source numbering, or chapter numbering algorithm. The order in which the algorithms are listed in the configuration is the order in which they're tried. If you have two files that are numbered Part 1 Part 2 then the natsort function will work, as will the 'starting number'. Actually, my code mostly takes only the metadata for the album as a whole i.e. the data taken from Audible is rarely, if ever, the track data, because that's really hard to match. Instead it's all the album level data, like author, ID, narrator, etc.

The levenshtein algorithm is only called, with the default configuration I've provided if the following conditions are met:

  • All of the chapter titles are very different (have a Levenshtein difference more than 4)
  • There is no contiguous source matching
  • There are no common affixes to the titles that, when removed, result in a number e.g. 'Chapter 1', 'Chapter 2', etc

The Levenshtein option, which is basically the track distance, is the last option that is called because it is the most prone to errors. But again, that is the method that the beets library itself uses when you call the track_distance function.

The reason I made it explicit is because, when using the Levenshtein, there are certain replacements that lead to greater errors, particularly numbers. If the Levenshtein function is replacing a 2 with a 4 for example, that's almost certainly a bigger error than a space to a dash or whatever. My idea was to create a custom Levenshtein function that penalises number replacements more, which would make it more accurate imo, but I haven't actually done that yet because it's already a method of last resort.

@Serene-Arc
Copy link
Contributor Author

I do apologise, I'm quite busy in my semester and it's hard finding the spare time to come back to this. I'm trying to knock some things off.

With regards to the source_weight option, which you mentioned earlier, do you know what it's for? Because looking through the code, I get the impression that there's no reason to have this exposed to the user at all, based on how I think you intend for this plugin to be used.

By that, I mean that the option is meant to rank different sources compared to Musicbrainz. Having it be zero means that it is compared equally to Musicbrainz. But the thing is, there will NEVER be an audiobook match from Musicbrainz; for audiobooks, it should always be zero.

If this plugin is used for a separate library (meaning that a beets instance with this plugin will never be used for music), then there's never a reason for this to be anything other than zero. In that case, I think it should be removed from the configuration and we should permanently set it to zero in the code.

Do you have a reason for it to be exposed to the user through the configuration? And do you expect for this to be used in the same library as music?

@Neurrone
Copy link
Owner

Np.

Agree that source weight shouldn't be exposed, since the goal is to ignore Musicbrainz-

@Serene-Arc
Copy link
Contributor Author

Great, I removed that from the example config in the README. Do you have any other concerns about the new code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants