Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove few xpath redundancies in y-003, add drama exclusion #658

Merged
merged 1 commit into from
Mar 8, 2024

Conversation

vr8hub
Copy link
Contributor

@vr8hub vr8hub commented Mar 8, 2024

I modified the xpath (see first one below) to catch instances of ancestor drama or letter that didn't match any of the rest of the exclusions, and ran it on the corpus. I got a dozen or so hits on various drama, mostly scenes at the very end of a dramatis personae (I didn't know that was a thing), and a handful of others.

I did not get any letters, however, and looking further at it I'm not sure how we would. We do have sections that are letters in our epistolary works, but the things in a letter that would end in lowercase, e.g. dateline, recipient, salutation, valediction, signature, etc., are all going to be contained in something that is already excluded, e.g. salutation, header, footer. If I understand correctly, we would have to have something in the body of the letter that validly ended in a lowercase letter, and I can't think of anything, unless it was in a container that is already excluded, e.g. a table. But, again, if there was such a thing, we don't have any in the corpus (if the below xpath is correct), so it's theoretical at the moment.

I therefore removed the z3998:letter exclusion from the code. It's easy enough to add back if you prefer it to stay.

xpath I used to test for drama/letter:

//p[re:test(., '[a-z]$') and not(@epub:type) and not(following-sibling::*[1]/node()[1][contains(@epub:type, 'z3998:roman')] or following-sibling::*[1][re:test(., '^([0-9]|first|second|third|fourth|fifth|sixth|seventh|eight|ninth|tenth)', 'i')]) and not(@class or contains(@epub:type, 'z3998:salutation')) and not(following-sibling::*[1][name() = 'blockquote' or name() = 'figure' or name() = 'table' or name() = 'footer' or name() = 'ul' or name() = 'ol'] or ancestor::*[(name() = 'blockquote' or name() = 'footer' or name() = 'header' or name() = 'table' or name() = 'ul' or name() = 'ol' or name() = 'li' or name() = 'figure' or name() = 'hgroup')]) and ancestor::*[re:test(@epub:type, '(z3998:drama|z3998:letter)')]]

I ran a second xpath (see below) to see what shook out from p's having a class that didn't match any of the other exclusions. There were a total of ten books that had matches; most of them that had matches had multiple matches. Is 10 out of 960 worth an exclusion? I'll leave that with you; I did not make any change for this in this PR.

xpath I used for the class test:

//p[re:test(., '[a-z]$') and not(@epub:type) and not(following-sibling::*[1]/node()[1][contains(@epub:type, 'z3998:roman')] or following-sibling::*[1][re:test(., '^([0-9]|first|second|third|fourth|fifth|sixth|seventh|eight|ninth|tenth)', 'i')]) and (@class) and not(following-sibling::*[1][name() = 'blockquote' or name() = 'figure' or name() = 'table' or name() = 'footer' or name() = 'ul' or name() = 'ol'] or ancestor::*[name() = 'blockquote' or name() = 'footer' or name() = 'header' or name() = 'table' or name() = 'li' or name() = 'figure' or name() = 'hgroup' or re:test(@epub:type, '(z3998:drama|dedication|halftitlepage)')])]

@vr8hub
Copy link
Contributor Author

vr8hub commented Mar 8, 2024

Also, is the corpus not being updated with text-wrap: pretty; just a no time yet issue (no problem then), or is there a technical reason? It's generating a lot of noise when I do lint testing on the corpus.

@acabal acabal merged commit 5199357 into standardebooks:master Mar 8, 2024
1 check passed
@acabal
Copy link
Member

acabal commented Mar 8, 2024

Thanks! I just haven't had time to update the corpus.

@vr8hub vr8hub deleted the improve_y003 branch March 10, 2024 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants