regex for match words #7
Comments
I've thought about this! When I was first building this for FOIA Feed I was pretty sure it would have to use regex, but then (in English, for my purposes) the results from simple string-matching were so effective that I didn't bother to introduce the complexity of regex. (That complexity is mostly for end-users... I don't think it would be very difficult to do regex matches in the code itself.) At the risk of proliferation, what do you think about a third matchwords file that has regular expressions?
Sounds like a good idea. So people can start off with simple words and move on to more complex patterns as needed.
Just marking my interest in some regex support. I have started a Norwegian FOIAbot [0], and since we have no single term that most FOIA-based stories use the way US stories do, we have to track combinations of words in order to pick up as many stories as possible.
OK, this convinces me, @byeskille, I'll get it into the next release. Do you have any objection to another file that has newline-separated regexes, like the format of the other matchwords files? That seems to me the most straightforward way of doing it.
That should work, I believe.
Off the top of my head I'm a little concerned with how you would match actual newlines, but I think that's edge-case-y enough that we don't need to worry about it. (Plus, like, newlines are used as paragraph breaks and paragraphs are the unit within which matches happen, so I'm not really sure what it would even mean to match newlines.)
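For illustration, here is a minimal sketch of how a newline-separated regex file could be applied paragraph by paragraph, as discussed above. The function names `load_patterns` and `matching_paragraphs` and the sample patterns are hypothetical, not taken from the project's code; the sketch only assumes that each non-empty line is one pattern and that paragraphs are separated by newlines.

```python
import re

def load_patterns(lines):
    """Compile one regex per non-empty line (hypothetical helper)."""
    patterns = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            patterns.append(re.compile(line))
    return patterns

def matching_paragraphs(text, patterns):
    """Return the paragraphs that match at least one pattern.

    Assumes newlines act as paragraph breaks, so no pattern
    ever needs to match across a newline.
    """
    paragraphs = [p for p in text.split("\n") if p.strip()]
    return [p for p in paragraphs if any(rx.search(p) for rx in patterns)]

# Stand-in for the lines of a regex matchwords file (made-up patterns).
sample_lines = [r"\bFOIA\b", "", r"public records? request"]
patterns = load_patterns(sample_lines)

article = (
    "The city denied a public record request.\n"
    "Weather was mild.\n"
    "A FOIA lawsuit followed."
)
hits = matching_paragraphs(article, patterns)
```

In practice the lines would come from something like `open(path).read().splitlines()`. Because every pattern sits on its own line and is tested against a single paragraph, the question of matching literal newlines does not come up in this layout.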
I am not happy with the way match words currently work. The German term "IFG" is not as omnipresent as the English one, and I don't want to annoy people by posting false positives [0]. It would be better to allow regex patterns instead of plain words; then I could prevent matches such as `LIFG`. In the code, you would have to construct a regex [1] and use e.g. `findall` to get matches.
[0] https://twitter.com/IFG_IFG_IFG/status/1014061073559949312
[1] https://docs.python.org/3/library/re.html
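A minimal sketch of the idea, assuming word boundaries (`\b`) are enough to separate "IFG" from "LIFG". The pattern, the helper name `find_ifg`, and the sample sentences are illustrative, not the project's code:

```python
import re

# \b matches only at a word boundary, so "IFG" inside "LIFG" is rejected.
IFG_PATTERN = re.compile(r"\bIFG\b")

def find_ifg(text):
    """Return all standalone occurrences of 'IFG' in the text."""
    return IFG_PATTERN.findall(text)

standalone = find_ifg("Eine Anfrage nach dem IFG wurde gestellt.")
embedded = find_ifg("Die LIFG wurde im Bericht genannt.")
hyphenated = find_ifg("Die IFG-Anfrage blieb unbeantwortet.")
```

Here `standalone` and `hyphenated` each contain one match, while `embedded` is empty, because there is no word boundary between "L" and "I". A plain substring check such as `"IFG" in text` would have flagged all three sentences.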