regex for match words #7

Open
jfilter opened this issue Jul 3, 2018 · 6 comments

@jfilter
Contributor

jfilter commented Jul 3, 2018

I am not happy with how match words currently work. The German term "IFG" is not as omnipresent as its English counterpart, and I don't want to annoy people by posting false positives [0]. It would be better to allow regex patterns instead of plain words; then I could prevent matches such as "LIFG". In the code, you would have to construct a regex [1] and use e.g. findall to get the matches.

[0] https://twitter.com/IFG_IFG_IFG/status/1014061073559949312
[1] https://docs.python.org/3/library/re.html
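
A minimal sketch of the idea, assuming nothing about the bot's actual code: a word-boundary pattern passed to `findall` matches the standalone term but not longer tokens that contain it. The function name and the sample sentences are made up for illustration.

```python
import re

# \b marks a word boundary, so "IFG" only matches as a standalone token.
IFG_PATTERN = re.compile(r"\bIFG\b")

def find_matches(text):
    """Return every standalone occurrence of the term in text."""
    return IFG_PATTERN.findall(text)

# Hypothetical sample sentences, not taken from any real article.
hit = "Die Anfrage nach dem IFG wurde abgelehnt."
miss = "The LIFG was mentioned in the report."

print(find_matches(hit))   # ['IFG']
print(find_matches(miss))  # []
```

A plain substring check such as `"IFG" in text` would report both sentences, which is the false-positive case described above.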

@thisisparker thisisparker changed the title match words Regex for match words Jul 3, 2018
@thisisparker thisisparker changed the title Regex for match words regex for match words Jul 3, 2018
@thisisparker
Contributor

I've thought about this! When I was first building this for FOIA Feed, I was pretty sure it would have to use regex, but (in English, for my purposes) the results from simple string matching were so effective that I didn't bother introducing the complexity of regex. (That complexity is mostly for end users; I don't think it would be very difficult to do regex matches in the code itself.)

At the risk of proliferation, what do you think about a third matchwords file that has regular expressions?

@jfilter
Contributor Author

jfilter commented Jul 4, 2018

Sounds like a good idea. So people can start off with simple words and move on to more complex patterns as needed.

@byeskille

Just registering my interest in some regex support.

I have started a Norwegian FOIAbot [0]. Since we have no single term that most FOIA-based stories use, the way there is in the US, we have to track combinations of words in order to pick up as many stories as possible.

[0] https://twitter.com/InnsynBot
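
One way a single regex could express "a combination of words" is a pair of lookaheads, sketched below. This is only an illustration: the two Norwegian terms are placeholders, not the terms any bot actually tracks, and `mentions_both` is a hypothetical helper name.

```python
import re

# Hypothetical pattern: both terms must appear somewhere in the paragraph,
# in either order. Each lookahead scans the whole text without consuming it.
BOTH_TERMS = re.compile(
    r"(?=.*\binnsyn\b)(?=.*\bdokument\w*)",
    re.IGNORECASE | re.DOTALL,
)

def mentions_both(paragraph):
    """Return True if the paragraph contains both placeholder terms."""
    return BOTH_TERMS.search(paragraph) is not None

print(mentions_both("Avisen ba om innsyn i dokumentene."))  # True
print(mentions_both("Avisen ba om innsyn."))                # False
```

The `\w*` suffix lets the second term match inflected forms, while `\binnsyn\b` only matches the bare word.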

@thisisparker
Contributor

OK, this convinces me, @byeskille, I'll get it into the next release. Do you have any objection to another file that has newline-separated regexes, like the format of the other matchwords files? That seems to me the most straightforward way of doing it.

@byeskille

That should work, I believe.

@thisisparker
Contributor

Off the top of my head I'm a little concerned with how you would match actual newlines, but I think that's edge-case-y enough that we don't need to worry about it. (Plus, like, newlines are used as paragraph breaks and paragraphs are the unit within which matches happen, so I'm not really sure what it would even mean to match newlines.)
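
A rough sketch of how a newline-separated pattern file and paragraph-level matching could fit together. It is not the project's implementation: the function names are hypothetical, the pattern text stands in for the contents of an assumed regex matchwords file, and paragraphs are assumed to be separated by single newlines.

```python
import re

def compile_patterns(text):
    """Compile one regex per non-empty line of text."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]

def matching_paragraphs(article, patterns):
    """Return the paragraphs in which at least one pattern matches."""
    paragraphs = [p for p in article.split("\n") if p.strip()]
    return [p for p in paragraphs if any(rx.search(p) for rx in patterns)]

# Stand-in for the contents of a hypothetical regex matchwords file.
pattern_text = "\n".join([r"\bIFG\b", r"\bInformationsfreiheitsgesetz\b", ""])
patterns = compile_patterns(pattern_text)

# Made-up article text with one paragraph per line.
article = "Das IFG wurde zitiert.\nThe LIFG is unrelated.\nNothing here."
print(matching_paragraphs(article, patterns))  # ['Das IFG wurde zitiert.']
```

Because each pattern only ever sees one paragraph at a time, a pattern in this sketch never encounters a newline, which is consistent with treating newline matching as an edge case. Note that stripping each line would also drop intentional leading or trailing spaces in a pattern.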
