Add tests suite #43
base: main
Conversation
Also, a question: this plugin seems to have a different mechanism for matching tracks than the normal beets algorithm; the methods for importing them are different. Is this intentional? It's very likely that I'm not understanding the whole process, but there are issues this impacts. Consider the ordering issue: if beets can normally order the tracks according to the returned album when searching (either on MB or Audible), that might be the solution to the track-ordering issue, but that doesn't happen, and instead there is the natsort, which fails in a lot of cases.
Apologies for the delayed response. I am trying to remember the reason for this. I believe it was because relying on the track titles was unreliable, and it was usually far more accurate to natsort the file names to get the correct order of the tracks. Otherwise, I would have stuck to using the track titles, per default beets behaviour. If the ordering of the files on disk that beets detects doesn't match the data source from Audible, beets would try to assign the wrong files to the wrong chapter titles from Audible.
So I did some more testing and reimported a bunch of books into my library to find those edge cases. Natural sorting works well for a lot of cases, but not when bytes are used and not when there are different naming schemes (e.g. Chapter 1-10 and then a chapter named Prologue; the prologue will be placed at the end because P comes after C). I'm not sure there's any good way to sort this with a one-size-fits-all approach given the sheer breadth of possible track naming schemes. I would like to try playing around with the track matching and distance algorithms, with your permission, to try to get that working correctly instead of this system with natsorting. There are a couple of reasons for this:
It's up to you if you'd like to take this approach, but right now the tests I've got don't seem to have an immediate answer to get them working. They fail with the old method and they fail with a natural sort on the titles too.
I'd be happy to use a better solution. I used the natsort method on paths because it seemed to be the easiest way for it to work for most cases. Part of the trade-off was accepting it would never work all the time.
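To make the Prologue example above concrete, here is a small illustration using the natsort package (used here purely for demonstration; the plugin's own sorting may differ in detail):

```python
from natsort import natsorted

titles = ["Chapter 2", "Prologue", "Chapter 10", "Chapter 1"]
print(natsorted(titles))
# ['Chapter 1', 'Chapter 2', 'Chapter 10', 'Prologue']
# The numbered chapters come out in the right order, but "Prologue" sorts
# alphabetically after "Chapter" and ends up last instead of first.
```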
Ignore all the commits here; I need to upload them to GitHub to test them on my system. I'm working out the kinks with a matching method, with some promising results! We'll see how it goes.
It is finally complete! I have created a series of algorithms that allow most books to be matched and that can be extended in the future. It is completely configurable and comes with tests that cover the entire chapter-matching process, along with a couple of other tests that cover the things most likely to fail. What do you think @Neurrone?
Thanks for working on this. Given this is a large change, I'll probably only be able to take a look at this in a week or two.
Sure, no rush.
Thanks for the hard work.
Did an initial pass to understand what was done. Here are some initial questions.
scrub:
    auto: yes # optional, enabling this is personal preference
Could you re-indent this so that the code is correctly rendered as a code block as part of the list item? It also helps reduce the size of the diff so that only the significant changes show up.
@@ -104,11 +248,117 @@ def __init__(self):
        self.add_media_field("subtitle", subtitle)

    def album_distance(self, items, album_info, mapping):
-       dist = get_distance(data_source=self.data_source, info=album_info, config=self.config)
+       dist = beets.autotag.hooks.Distance()
What is the difference between this function and the previous one that was used?
It seems to just ignore the source weight set in config.
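For reference, a paraphrase of what beets' built-in helper does (based on my reading of the beets source; check the installed version for the exact code). The only thing it contributes is the configured source penalty, which a bare Distance() drops:

```python
import beets.autotag.hooks

# Roughly equivalent to beets.plugins.get_distance
def get_distance(config, data_source, info):
    dist = beets.autotag.hooks.Distance()
    if info.data_source == data_source:
        # Penalise matches from this data source by the configured weight.
        dist.add("source", config["source_weight"].as_number())
    return dist
```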
        return dist

    def track_distance(self, item, track_info):
-       return get_distance(data_source=self.data_source, info=track_info, config=self.config)
+       dist = beets.autotag.hooks.Distance()
+       dist.add_string("track_title", item.title, track_info.title)
Curious why this is needed? I'd assume that the default distance algorithm already factors the distance between titles.
This is the better way to do it, according to the beets source code, when constructing a custom distance. I didn't actually do that in the end because it would have taken a while, but when I said we could use a customised Levenshtein algorithm, this is where it would be called.
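As an illustration of that hook point, a hypothetical sketch of swapping the built-in string comparison for a custom penalty (difflib stands in for a custom Levenshtein here; none of this is code from the PR):

```python
import difflib

import beets.autotag.hooks

def track_distance(self, item, track_info):
    dist = beets.autotag.hooks.Distance()
    # Compute our own normalised penalty in [0.0, 1.0] instead of relying on
    # add_string, then feed it to the Distance object directly.
    similarity = difflib.SequenceMatcher(None, item.title, track_info.title).ratio()
    dist.add("track_title", 1.0 - similarity)
    return dist
```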
def is_continuous_number_series(numbers: Iterable[Optional[int]]):
    return all([n is not None for n in numbers]) and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
Why does this need to accept a possibly blank list of `None`?
This typing means the iterable may contain any combination of `None` and integers. When the algorithm that calls this code runs, any file that has no source numbering will produce `None`. If that happens, the match is rejected, because there is no way to know where the unsourced file fits into the numbering.
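A few example calls against the helper as written above, to illustrate the behaviour being described:

```python
is_continuous_number_series([1, 2, 3, 4])   # True: every file is numbered and the numbers run consecutively
is_continuous_number_series([1, 3, 4])      # False: gap in the numbering
is_continuous_number_series([1, None, 3])   # False: a file with no source number makes the ordering unknowable
```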
beetsplug/audible.py (Outdated)
        dist.add_string("track_title", item.title, track_info.title)
        return dist

    def attempt_match_trust_source_numbering(self, items: List[Item], album: AlbumInfo) -> Optional[List[Item]]:
Could you remove the unused `album` parameter?
Likewise for the other attempt match functions below.
No. All of the functions need to have the exact same signature to work in the given code. It doesn't call any function explicitly; it reads the configuration and then dispatches to the appropriate function. This should also make it easy to add new methods.
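A minimal, self-contained sketch of that kind of configuration-driven dispatch (the class, strategy names, and stub behaviour here are assumptions for illustration, not the PR's actual identifiers):

```python
from typing import Callable, Dict, List, Optional

class ChapterMatcherSketch:
    def __init__(self, strategy_order: List[str]):
        # Every strategy shares the same signature so it can be looked up by
        # the name given in the configuration and called interchangeably.
        self.strategies: Dict[str, Callable[[List[str], List[str]], Optional[List[str]]]] = {
            "source_numbering": self.attempt_match_source_numbering,
            "natural_sort": self.attempt_match_natural_sort,
        }
        self.strategy_order = strategy_order

    def attempt_match_source_numbering(self, items: List[str], album_tracks: List[str]) -> Optional[List[str]]:
        return None  # stub: this strategy gives up and defers to the next one

    def attempt_match_natural_sort(self, items: List[str], album_tracks: List[str]) -> Optional[List[str]]:
        return sorted(items)  # stub: pretend a match was found

    def match(self, items: List[str], album_tracks: List[str]) -> Optional[List[str]]:
        # Try each configured strategy in order; the first non-None result wins.
        for name in self.strategy_order:
            strategy = self.strategies.get(name)
            if strategy is None:
                continue
            result = strategy(items, album_tracks)
            if result is not None:
                return result
        return None

matcher = ChapterMatcherSketch(["source_numbering", "natural_sort"])
print(matcher.match(["b.mp3", "a.mp3"], ["Chapter 1", "Chapter 2"]))  # ['a.mp3', 'b.mp3']
```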
        # if len(items) > len(album.tracks):
        #     # TODO: find a better way to handle this
        #     # right now just reject this match
        #     return None
Since this is commented out, what happens in this situation? `None` should still be returned, but the item would fail to match.
I believe this case is currently being handled correctly in the existing matching algorithm.
"""If the input album has a single item, use that; if the album also has a single item, prefer that.""" | ||
if len(items) == 1: | ||
# Prefer a single named book from the remote source | ||
if len(album.tracks) == 1 and album.tracks[0].title != "Chapter 1": |
Is the extra condition on chapter 1 needed? It seems safe enough to trust it if the Audible source has a single track.
Would prefer removing this check if it's not needed, to simplify the logic.
beetsplug/audible.py (Outdated)
        matches = sorted_tracks
        return matches

    def attempt_match_starting_numbers(self, items: List[Item], album: AlbumInfo) -> Optional[List[Item]]:
I'm having some trouble understanding what this is trying to do, could you help walk me through this?
In what situations does this algorithm help? I.e., are there cases where this algorithm is better than the next one down the list (natural sorting), which is much simpler?
        # magic number here, it's a judgement call
        if max(average_title_change) < 4:
            # can't assume that the tracks actually match even when there are the same number of items, since lengths
            # can be different e.g. an even split into n parts that aren't necessarily chapter-based, so just natsort
In what situations is the extra logic to find and strip out common prefixes or suffixes needed, vs the previous method that just did a natural sort on the titles?
I've finished looking at the implementation but haven't gone through the tests in detail yet. I really appreciate the concept of a list of strategies that you introduced, since it does simplify the overall logic and allows customization. My main concern is the complexity of some of the algorithms relative to how the matching was done previously. The previous implementation used the file input if the number of tracks didn't match the album info from Audible. I suspect that trying to perform matches in these cases (e.g., chapter_levenshtein) is hard to debug and may not work properly. What do you think of restoring that logic and having a simpler default list? So something like:
Complexity is required for complex cases. The simple fact is that the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.
To address your point, using the file input doesn't always work. It doesn't work when you don't have source numbering, which is the case for a lot of audiobooks in my experience. Even if the number of tracks does match, that's not actually a guarantee that the tracks themselves match. Consider the difference between a book with 20 chapters on Audible and the actual audiobook, which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the Audible chapter list is objectively wrong and won't match the actual book. The goal of the many different measures is that each person can change the ordering based on what audiobooks they're importing at the time. If you're importing a book that you know Levenshtein will work well on, you can make it the primary method that the plugin uses.
This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored (they're considered in the natural sort), and you don't know which file matches which track from Audible. You don't know if any of them match at all. What the program does is calculate the distance from every track in the files to every track from Audible. It then pairs each file with its closest match and overwrites the title. This isn't actually guaranteed to be a close match: if there's a single letter in common, the closest match will be that, but it's not guaranteed to be accurate or even relevant.
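A hedged sketch of that "closest match wins" behaviour (difflib stands in for the Levenshtein distance; the function and variable names are illustrative only):

```python
import difflib
from typing import Dict, List

def closest_title_matches(file_titles: List[str], audible_titles: List[str]) -> Dict[str, str]:
    # For every local title, pick the Audible chapter title with the highest
    # string similarity. Something is always "closest", even when nothing is
    # actually a good match, which is why this is a method of last resort.
    matches = {}
    for file_title in file_titles:
        matches[file_title] = max(
            audible_titles,
            key=lambda candidate: difflib.SequenceMatcher(None, file_title, candidate).ratio(),
        )
    return matches

# Each file gets paired with whichever title happens to be nearest,
# however weak that similarity is.
print(closest_title_matches(["01 intro.mp3", "02 end.mp3"], ["Prologue", "Epilogue"]))
```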
Your concern over the Levenshtein is a little strange to me, because it is the method that was used before: the Levenshtein function is what was used when you called the track distance function. I just made it explicit so that we could improve on it. It's last in the list because it's the least effective and most prone to errors.
I understand, I'm just trying to get a feel for what examples in the wild would work better with these algorithms. I'll admit the inputs that I have are generally quite simple - either they match on a chapter to chapter basis, or they don't.
This should already be handled. If the number of tracks matches, the previous method wouldn't override the match from Audible and would fall back to the default matching, so you'd see the matches fail if the lengths are off, for example.
What I think is missing here is the old behaviour: if the folder has 2 files while Audible returns 20 chapters, I want the ability to completely ignore the chapter list from Audible, since I know it's not split by chapter. That is meant to prevent the poor results that come from falling back to the default Levenshtein function.
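In the new strategy-list framework, that old behaviour could presumably be re-introduced as one more entry; a hypothetical sketch (the name and exact condition are illustrative, not from the PR):

```python
from typing import List, Optional

def attempt_match_keep_file_order(items: List[str], album_tracks: List[str]) -> Optional[List[str]]:
    # Hypothetical strategy: when the local file count differs from the Audible
    # chapter count (e.g. 2 files vs 20 chapters), ignore Audible's chapter
    # titles entirely and keep the files in their existing order, rather than
    # letting the match fall through to the Levenshtein method of last resort.
    if len(items) != len(album_tracks):
        return list(items)
    return None  # counts match; defer to the other strategies
```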
I didn't realize that; I'll take a look at the beets source to understand this better.
The tests that I've included are real-world examples. I got a bunch of different audiobooks (hundreds) and started importing them. When I ran into issues, I added a test for the condition and worked to make sure it passed. That's why the tests try the correct ordering, then a reversed ordering, then randomised orderings, to make sure they come up with the correct answer every time.
I don't think it did this? The old method was to match if the number of tracks was the same, regardless of length. That wasn't part of the calculation.
You can! This would be the natural sort, source numbering, or chapter numbering algorithm. The order in which the algorithms are listed in the configuration is the order in which they're tried. If you have two files that are numbered
With the default configuration I've provided, the Levenshtein algorithm is only called if the following conditions are met:
The Levenshtein option, which is basically the track distance, is the last option called because it is the most prone to errors. But again, that is the method the beets library itself uses when you call the track distance function.
The reason I made it explicit is that, when using Levenshtein, there are certain replacements that lead to greater errors, particularly numbers. If the Levenshtein function is replacing a 2 with a 4, for example, that's almost certainly a bigger error than a space to a dash or whatever. My idea was to create a custom Levenshtein function that penalises number replacements more, which would make it more accurate imo, but I haven't actually done that yet because it's already a method of last resort.
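For what it's worth, a rough sketch of what such a digit-penalising Levenshtein variant might look like (this is an illustration of the idea only; nothing like it exists in the PR):

```python
def digit_aware_edit_distance(a: str, b: str, digit_penalty: int = 3) -> int:
    # Standard Wagner-Fischer edit distance, except that substituting one digit
    # for a different digit costs extra, so "Part 2" vs "Part 4" scores worse
    # than a harmless punctuation change like "Part-2" vs "Part 2".
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                cost = 0
            elif a[i - 1].isdigit() and b[j - 1].isdigit():
                cost = digit_penalty  # mismatched numbers strongly suggest the wrong chapter
            else:
                cost = 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # substitution
            )
    return table[-1][-1]

print(digit_aware_edit_distance("Part 2", "Part 4"))  # 2: a delete plus an insert beats the heavier substitution
print(digit_aware_edit_distance("Part-2", "Part 2"))  # 1: an ordinary single-character substitution
```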
I do apologise; I'm quite busy this semester and it's hard to find spare time to come back to this. I'm trying to knock some things off. Regarding the source_weight option you mentioned earlier, do you know what it's for? Looking through the code, I get the impression that there's no reason to expose this to the user at all, based on how I think you intend this plugin to be used. The option is meant to rank different sources compared to Musicbrainz; having it be zero means the source is compared equally to Musicbrainz. But there will NEVER be an audiobook match from Musicbrainz, so for audiobooks it should always be zero. If this plugin is used for a separate library (meaning a beets instance with this plugin will never be used for music), then there's never a reason for this to be anything other than zero. In that case, I think it should be removed from the configuration and permanently set to zero in the code. Do you have a reason for it to be exposed to the user through the configuration? And do you expect this to be used in the same library as music?
Np. Agree that source weight shouldn't be exposed, since the goal is to ignore Musicbrainz.
This isn't really a bug with the audible plugin, but something that audnex returns. I don't really know why, but since it could be misleading, this ignores those cases.
Great, I removed that from the example config in the README. Do you have any other concerns about the new code?
I thought it would be good to have this draft PR open so that we can discuss bugs that the tests reveal as I find them.
The first one is a bug I've been trying to nail down for a while: @Neurrone, what is the reason that the paths of the items are used to sort instead of the names? There's a comment that specifies that it needs to be that way, but not why. I've made a bunch of test cases to check sorting under different naming schemes, and the paths mess up while the names do not (so far).
If you remember the use cases that necessitate using the path, could you please share them so I can add them to the tests? It's possible that a more complex sorting algorithm will be required, but it needs tests before I can start working on one.
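As a starting point for collecting those cases, a hypothetical little harness (the helper name and the natsort usage are assumptions) that checks whether sorting by full path or by file name alone reproduces a known-correct order:

```python
from pathlib import Path
from typing import Dict, List

from natsort import natsorted

def check_ordering(paths: List[str], expected: List[str]) -> Dict[str, bool]:
    # Sort the same inputs two ways and report which key, if either,
    # reproduces the expected track order for this naming scheme.
    by_path = natsorted(paths)
    by_name = natsorted(paths, key=lambda p: Path(p).name)
    return {"path_key_correct": by_path == expected, "name_key_correct": by_name == expected}
```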