-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Addition of three new predefined recognizers, improved regex for IN_PAN #1323
base: main
Are you sure you want to change the base?
Conversation
Added India PAN (Permanent Account Number) recognizer
refined the regex for better recognition and enhanced the test cases accordingly
Fixed lint error that was missed earlier.
Added test cases , verification and context data
Added negative test cases per review comments.
linted code
update pattern recognizer value per suggestion in review
added PresidioAnalyzerUtils class with generic functions. removed usage of stdnum
added test cases for analyzer_utils.py in prescribed format
added to the count of predefined recognizers
Added India specific predefined pattern recognizer for vehicle registration number
reinstated python 3.9 compatibility, reorganized code
Logic reverted from analyzer_utils to recognizer classfile
added min size check to avoid failures per review comment
Two english language predefine recognizers added viz. ISIN , CFI
1. Improved IN_PAN regex 2. Utility function for LUHN ModN validation 3. New recognizers : IN_GSTIN, CFI_CODE, ISIN_CODE
hi @omri374 , regards |
Hi @devopam, thanks for this! unfortunately the |
Is there a simple alternative to it? |
hi @omri374 , yes I will add it neatly in analyzer_utils so that others can use it as well. We will need to maintain all these metadata whenever world events change as such. I am down with cold & fever :( - that's the delay cause. |
Absolutely no rush! Hope you feel better soon! |
Removed pycountry per feedback on it's license. Built the utility in analyzer_utils.py & removed all references.
hi @omri374 , |
Thanks! I will take a look soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work! Left some comments, would be happy to discuss.
|
||
gstin_country_codes_iso3a = "" | ||
utils = Utils() | ||
countries = utils.get_country_codes(iso_code="ISO3166-1-Alpha-3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to think of way to make this more efficient. Creating the PresidioAnalyzerUtils
object for every recognizer that needs it is inefficient, and would require holding the countries data in memory multiple times.
On one hand, there's the need for this data for some recognizers, on the other hand most recognizers (and users) might not need it. Therefore, we can think of two options to handle this in my view.
- Instantiate the
utils
in theAnalyzerEngine
class and pass it to each recognizer. This would allow it to only be instantiated once. - Make some of these recognizers optional (which aren't loaded in
load_predefined_recognizers
) and therefore reduce the memory footprint.
Long term, I think we should think of a mechanism to specify the countries you expect PII to come from. If someone doesn't expect any Australian license plates, it doesn't make sense to load that recognizer for them.
Happy to get your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @omri374
I am confused with the best approach here frankly. #2 is not a good idea as it will defeat the purpose of 'built-in' recognizers readily available as such from a consumer's perspective imho.
We will need more of such metadata to improve detection beyond pattern matching as we add more industry specific recognizers to give comprehensive solution offering.
So, how best to design this is bit complex !
approach 1 above seems to be a feasible approach to me but I will need fair bit of help here to implement since I don't want to end up breaking things due to my limited knowledge of the core.
Shall we discuss offline ? Please suggest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @devopam. I agree that (1) is a better approach, also since in parallel we're thinking of introducing a country flag to be able to filter out recognizers that aren't needed.
Adding a new input to the __init__
of EntityRecognizer
would not break the API. I would suggest to do that, and instantiate the PresidioAnalyzerUtils
in the recognizer registry, so that it could be passed to all recognizers during instantiation.
How about taking this constructor:
def __init__( |
and adding a new param to it:
def __init__(
self,
supported_entities: List[str],
name: str = None,
supported_language: str = "en",
version: str = "0.0.1",
context: Optional[List[str]] = None,
analyzer_utils: Optional[PresidioAnalyzerUtils] = None
):
self.analyzer_utils = analyzer_utils
We could inject the analyzer_utils during recognizers instantiation:
self.__instantiate_recognizer( |
WDYT?
Happy to discuss offline too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devopam how about we have a quick chat on this? If you're interested, let's chat on LinkedIn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @omri374 ,
Please have a look at the current implementation and advise. Utils is instantiated only in the recognizers using it now viz. in_gstin, isin
presidio-analyzer/presidio_analyzer/predefined_recognizers/isin_recognizer.py
Outdated
Show resolved
Hide resolved
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
interim code with EntityRecognizer enhancement WIP
class instantiation changed for analyzer utils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @devopam, added a few additional comments. This looks much better!
Thanks
@@ -36,13 +52,33 @@ def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str: | |||
text = text.replace(search_string, replacement_string) | |||
return text | |||
|
|||
@staticmethod | |||
def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"): | |
def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ") -> bool: |
return ( | ||
sum(luhn_input[::2]) + sum(sum(divmod(i * 2, n)) for i in luhn_input[1::2]) | ||
) % n == 0 | ||
|
||
@staticmethod | ||
def is_verhoeff_number(input_number: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def is_verhoeff_number(input_number: int): | |
def is_verhoeff_number(input_number: int) -> bool: |
# return full country list for given code | ||
return self.__get_country_master_full_data__(iso_code=iso_code) | ||
|
||
def get_full_country_information(self, lookup_key: str, lookup_index: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please define return type (List[str]
?)
ISO3166-1-Alpha-2,ISO3166-1-Alpha-3, ISO3166-1-Numeric, | ||
International_licence_plate_country_code, Country_code_top_level_domain, | ||
Currency_Name, ISO4217-Alpha-3, ISO4217-Numeric, Capital_City, Dialing_Code | ||
:return: Dictionary object with additional information enriched from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says it returns a dictionary, but it looks like the code returns a list
|
||
self.load() | ||
logger.info("Loaded recognizer: %s", self.name) | ||
self.is_loaded = True | ||
|
||
if analyzer_utils is not None: | ||
self.analyzer_utils = analyzer_utils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class fields should not be optional. I would suggest to remove the if analyzer_utils is not None
and put the value anyway (even if it's None). Otherwise the analyzer_utils
field would not always be part of the class.
supported_language: str = "en", | ||
supported_entity: str = "ISIN_CODE", | ||
replacement_pairs: Optional[List[Tuple[str, str]]] = None, | ||
analyzer_utils=PresidioAnalyzerUtils(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment on other recognizer
:param patterns: List of patterns to be used by this recognizer | ||
:param context: List of context words to increase confidence in detection | ||
:param supported_language: Language this recognizer supports | ||
:param supported_entity: The entity this recognizer can detect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add analyzer_utils to docstring
:param supported_language: Language this recognizer supports | ||
:param supported_entity: The entity this recognizer can detect | ||
:param replacement_pairs: List of tuples with potential replacement values | ||
for different strings to be used during pattern matching. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add analyzer_utils to docstring
from presidio_analyzer.analyzer_utils import PresidioAnalyzerUtils | ||
|
||
|
||
# from presidio_analyzer.analyzer_utils import PresidioAnalyzerUtils as Utils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove
luhn_mod_n_test_set = [ | ||
["27AAACM6094R1ZP", True], | ||
["36AAICA3369H1ZJ", True], | ||
["36AAHAA2262Q1ZF", True], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add negative examples?
Change Description
Added three new recognizers
Improved IN_PAN regex to not get spoofed by 0000 pattern
Describe your changes
Added new recognizers to improve detection of PII elements , one India specific and two global.
Issue reference
This PR fixes issue #XX
Checklist