Addition of three new predefined recognizers, improved regex for IN_PAN #1323

devopam · 2024-03-04T18:00:36Z

Change Description

Added three new recognizers

IN_GSTIN
ISIN_CODE
CFI_CODE

Improved IN_PAN regex to not get spoofed by 0000 pattern

Describe your changes
Added new recognizers to improve detection of PII elements: one India-specific and two global.

Issue reference

This PR fixes issue #XX

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

devopam · 2024-03-04T18:03:15Z

Hi @omri374,
Apologies if I made this a bit heavy by adding three recognizers in one go.
One additional library is used: pycountry (https://pypi.org/project/pycountry/). Please let me know if its license is okay for inclusion in presidio.

regards
Devopam

omri374 · 2024-03-05T12:45:51Z

Hi @devopam, thanks for this! Unfortunately, the pycountry package is licensed under the LGPL, so we cannot take a dependency on it.

omri374 · 2024-03-07T07:44:37Z

Is there a simple alternative to it?

devopam · 2024-03-07T15:29:37Z

Hi @omri374, yes, I will add it neatly in analyzer_utils so that others can use it as well. We will need to maintain this metadata whenever world events change it. I am down with a cold and fever :( which is the cause of the delay.

omri374 · 2024-03-07T17:12:19Z

Absolutely no rush! Hope you feel better soon!

Removed pycountry per feedback on it's license. Built the utility in analyzer_utils.py & removed all references.

devopam · 2024-03-09T21:49:43Z

Hi @omri374,
I have removed pycountry as requested and added the metadata to a country_master.csv file. I also added util functions in analyzer_utils to load the required information (plus one function for future use). However, I am not convinced about the location of country_master.csv myself. Shall we create a new sub-folder at the top level for such data?
Please have a look when you can and let me know your feedback on the overall changes.

omri374 · 2024-03-10T21:38:22Z

Thanks! I will take a look soon.

omri374

Thanks for the great work! Left some comments, would be happy to discuss.

presidio-analyzer/tests/test_recognizer_registry.py

presidio-analyzer/presidio_analyzer/analyzer_utils.py

presidio-analyzer/presidio_analyzer/country_master.csv

omri374 · 2024-03-13T08:14:51Z

presidio-analyzer/presidio_analyzer/predefined_recognizers/in_gstin_recognizer.py

+
+    gstin_country_codes_iso3a = ""
+    utils = Utils()
+    countries = utils.get_country_codes(iso_code="ISO3166-1-Alpha-3")


I'm trying to think of a way to make this more efficient. Creating the PresidioAnalyzerUtils object in every recognizer that needs it is inefficient, and would hold the countries data in memory multiple times.

On one hand, some recognizers need this data; on the other hand, most recognizers (and users) might not. Therefore, I see two options for handling this:

1. Instantiate the utils in the AnalyzerEngine class and pass it to each recognizer. This would allow it to be instantiated only once.

2. Make some of these recognizers optional (not loaded in load_predefined_recognizers) and thereby reduce the memory footprint.

Long term, I think we should think of a mechanism to specify the countries you expect PII to come from. If someone doesn't expect any Australian license plates, it doesn't make sense to load that recognizer for them.

Happy to get your thoughts on this.

devopam · 2024-03-14 (edited)

Hi @omri374,
Frankly, I am unsure of the best approach here. Option 2 is not a good idea, in my opinion, as it defeats the purpose of 'built-in' recognizers being readily available from a consumer's perspective.
We will need more of this kind of metadata to improve detection beyond pattern matching as we add more industry-specific recognizers for a comprehensive offering.
So the best design here is a bit complex!
Option 1 above seems feasible to me, but I will need a fair bit of help to implement it, since I don't want to end up breaking things due to my limited knowledge of the core.
Shall we discuss offline? Please suggest.

omri374 · 2024-03-14

Hi @devopam. I agree that (1) is a better approach, also since in parallel we're thinking of introducing a country flag to be able to filter out recognizers that aren't needed.

Adding a new input to the __init__ of EntityRecognizer would not break the API. I would suggest to do that, and instantiate the PresidioAnalyzerUtils in the recognizer registry, so that it could be passed to all recognizers during instantiation.

How about taking this constructor:

presidio/presidio-analyzer/presidio_analyzer/entity_recognizer.py

Line 35 in ea8d830

def __init__(

and adding a new param to it:

def __init__(
    self,
    supported_entities: List[str],
    name: str = None,
    supported_language: str = "en",
    version: str = "0.0.1",
    context: Optional[List[str]] = None,
    analyzer_utils: Optional[PresidioAnalyzerUtils] = None,
):
    self.analyzer_utils = analyzer_utils

We could inject the analyzer_utils during recognizers instantiation:

presidio/presidio-analyzer/presidio_analyzer/recognizer_registry.py

Line 139 in ea8d830

self.__instantiate_recognizer(

WDYT?
Happy to discuss offline too.
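To make the suggestion above concrete, here is a minimal, self-contained sketch of the injection pattern. It does not use the real Presidio classes: AnalyzerUtils, Recognizer and Registry are hypothetical stand-ins for PresidioAnalyzerUtils, EntityRecognizer and RecognizerRegistry, and the country list is placeholder data.

```python
from typing import List, Optional


class AnalyzerUtils:
    """Hypothetical stand-in for PresidioAnalyzerUtils."""

    def __init__(self) -> None:
        # Placeholder for the (potentially large) country metadata load.
        self.country_codes = ["IND", "USA", "GBR"]


class Recognizer:
    """Hypothetical stand-in for EntityRecognizer with the proposed parameter."""

    def __init__(
        self,
        supported_entities: List[str],
        analyzer_utils: Optional[AnalyzerUtils] = None,
    ) -> None:
        self.supported_entities = supported_entities
        # Always assigned, so the attribute exists even when no utils are passed.
        self.analyzer_utils = analyzer_utils


class Registry:
    """Hypothetical stand-in for RecognizerRegistry."""

    def __init__(self) -> None:
        # Created once here and shared by every recognizer the registry builds.
        self._utils = AnalyzerUtils()
        self.recognizers: List[Recognizer] = []

    def load(self, entity_names: List[str]) -> None:
        for name in entity_names:
            self.recognizers.append(
                Recognizer([name], analyzer_utils=self._utils)
            )


registry = Registry()
registry.load(["IN_GSTIN", "ISIN_CODE"])
```

Because the new parameter is optional and defaults to None, existing callers that construct a recognizer without it keep working, which is why the change would not break the API.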

omri374 · 2024-03-22

@devopam how about we have a quick chat on this? If you're interested, let's chat on LinkedIn

devopam · 2024-04-17

Hi @omri374,
Please have a look at the current implementation and advise. Utils is now instantiated only in the recognizers that use it, namely in_gstin and isin.

presidio-analyzer/tests/test_isin_recognizer.py

presidio-analyzer/presidio_analyzer/predefined_recognizers/isin_recognizer.py

omri374 · 2024-03-13T08:32:34Z

/azp run

azure-pipelines · 2024-03-13T08:32:48Z

Azure Pipelines successfully started running 1 pipeline(s).

interim code with EntityRecognizer enhancement WIP

class instantiation changed for analyzer utils

omri374

Hi @devopam, added a few additional comments. This looks much better!
Thanks

omri374 · 2024-04-23T16:28:17Z

presidio-analyzer/presidio_analyzer/analyzer_utils.py

@@ -36,13 +52,33 @@ def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
            text = text.replace(search_string, replacement_string)
        return text

+    @staticmethod
+    def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):


Suggested change

def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):

def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ") -> bool:
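For readers unfamiliar with the check being annotated here, the following is a rough sketch of a Luhn mod N validation with the suggested bool return type. Only the final sum expression is visible in this PR's diff; the reversed index lookup and the guard for empty or out-of-alphabet input are assumptions, and luhn_mod_n_valid is a hypothetical name rather than the PR's function.

```python
def luhn_mod_n_valid(
    input_str: str, alphabet: str = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
) -> bool:
    """Return True if input_str passes a Luhn mod N check over the alphabet."""
    # Assumption: reject empty input and characters outside the alphabet
    # instead of raising; the PR's own behavior here is not shown.
    if not input_str or any(ch not in alphabet for ch in input_str):
        return False

    n = len(alphabet)
    # Map characters to code points, starting from the check character.
    values = [alphabet.index(ch) for ch in reversed(input_str)]

    # Even positions (from the right) count as-is; odd positions are doubled
    # and reduced by adding the base-n "digits" of the doubled value.
    total = sum(values[::2]) + sum(sum(divmod(v * 2, n)) for v in values[1::2])
    return total % n == 0


print(luhn_mod_n_valid("27AAACM6094R1ZP"))  # True
print(luhn_mod_n_valid("27AAACM6094R1ZQ"))  # False (check character altered)
```

The first value is taken from the PR's own test set; the second differs only in its last character.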

omri374 · 2024-04-23T16:28:34Z

presidio-analyzer/presidio_analyzer/analyzer_utils.py

+        return (
+            sum(luhn_input[::2]) + sum(sum(divmod(i * 2, n)) for i in luhn_input[1::2])
+        ) % n == 0
+
    @staticmethod
    def is_verhoeff_number(input_number: int):


Suggested change

def is_verhoeff_number(input_number: int):

def is_verhoeff_number(input_number: int) -> bool:
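As context for the annotated signature, this is a sketch of a Verhoeff validation that returns a bool. The body of the PR's is_verhoeff_number is not shown in this thread, so the code below is an assumption built from the standard published Verhoeff multiplication and permutation tables; the handling of negative numbers is also an assumption.

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# Position-dependent permutation table (period 8).
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def is_verhoeff_number(input_number: int) -> bool:
    """Return True if the number, including its check digit, is Verhoeff-valid."""
    if input_number < 0:
        # Assumption: negative inputs are treated as invalid.
        return False
    checksum = 0
    # Process digits right to left, starting with the check digit.
    for position, char in enumerate(reversed(str(input_number))):
        checksum = D[checksum][P[position % 8][int(char)]]
    return checksum == 0


print(is_verhoeff_number(2363))  # True
print(is_verhoeff_number(2364))  # False
```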

omri374 · 2024-04-23T16:30:04Z

presidio-analyzer/presidio_analyzer/analyzer_utils.py

+            # return full country list for given code
+            return self.__get_country_master_full_data__(iso_code=iso_code)
+
+    def get_full_country_information(self, lookup_key: str, lookup_index: str):


please define return type (List[str]?)

omri374 · 2024-04-23T16:30:47Z

presidio-analyzer/presidio_analyzer/analyzer_utils.py

+        ISO3166-1-Alpha-2,ISO3166-1-Alpha-3, ISO3166-1-Numeric,
+        International_licence_plate_country_code, Country_code_top_level_domain,
+        Currency_Name, ISO4217-Alpha-3, ISO4217-Numeric, Capital_City, Dialing_Code
+        :return: Dictionary object with additional information enriched from


It says it returns a dictionary, but it looks like the code returns a list
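One way to keep the docstring and the return type in agreement is to annotate the function and describe exactly that type in the :return: line. The sketch below is an assumption about the intended behavior, not the PR's implementation: it reads a tiny in-memory sample instead of country_master.csv, uses only a few of the column names listed in the docstring above, and returns the first matching row as a list of strings.

```python
import csv
import io
from typing import List

# Illustrative sample only; the real country_master.csv has more columns and rows.
SAMPLE_CSV = "\n".join(
    [
        "ISO3166-1-Alpha-2,ISO3166-1-Alpha-3,ISO4217-Alpha-3,Capital_City",
        "IN,IND,INR,New Delhi",
        "JP,JPN,JPY,Tokyo",
    ]
)


def get_full_country_information(lookup_key: str, lookup_index: str) -> List[str]:
    """Look up one country row.

    :param lookup_key: Value to search for, e.g. "IND"
    :param lookup_index: Column to search in, e.g. "ISO3166-1-Alpha-3"
    :return: List of the matching row's values in column order,
        or an empty list when nothing matches
    """
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    for row in reader:
        if row.get(lookup_index) == lookup_key:
            return list(row.values())
    return []


print(get_full_country_information("IND", "ISO3166-1-Alpha-3"))
```

If a dictionary is the more useful shape for callers, the same idea applies in reverse: return dict(row) and annotate the function as Dict[str, str].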

omri374 · 2024-04-23T16:32:22Z

presidio-analyzer/presidio_analyzer/entity_recognizer.py


        self.load()
        logger.info("Loaded recognizer: %s", self.name)
        self.is_loaded = True

+        if analyzer_utils is not None:
+            self.analyzer_utils = analyzer_utils


Class fields should not be optional. I would suggest removing the "if analyzer_utils is not None" check and assigning the value unconditionally (even if it's None). Otherwise the analyzer_utils field would not always exist on the instance.
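The difference is easy to see in isolation. In this small sketch the two class names are hypothetical and stand in for the recognizer before and after the suggested change.

```python
from typing import Optional


class ConditionalField:
    def __init__(self, analyzer_utils: Optional[object] = None) -> None:
        # The attribute is only created when a value is passed in.
        if analyzer_utils is not None:
            self.analyzer_utils = analyzer_utils


class UnconditionalField:
    def __init__(self, analyzer_utils: Optional[object] = None) -> None:
        # The attribute always exists; it is simply None when not provided.
        self.analyzer_utils = analyzer_utils


print(hasattr(ConditionalField(), "analyzer_utils"))    # False
print(hasattr(UnconditionalField(), "analyzer_utils"))  # True
```

With the conditional version, any later access to self.analyzer_utils raises AttributeError when no utils were passed, whereas the unconditional version lets callers test for None.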

omri374 · 2024-04-23T16:48:38Z

presidio-analyzer/presidio_analyzer/predefined_recognizers/isin_recognizer.py

+        supported_language: str = "en",
+        supported_entity: str = "ISIN_CODE",
+        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
+        analyzer_utils=PresidioAnalyzerUtils(),


See comment on other recognizer

omri374 · 2024-04-23T16:48:52Z

presidio-analyzer/presidio_analyzer/predefined_recognizers/isin_recognizer.py

+    :param patterns: List of patterns to be used by this recognizer
+    :param context: List of context words to increase confidence in detection
+    :param supported_language: Language this recognizer supports
+    :param supported_entity: The entity this recognizer can detect


please add analyzer_utils to docstring

omri374 · 2024-04-23T16:49:07Z

presidio-analyzer/presidio_analyzer/predefined_recognizers/in_gstin_recognizer.py

+    :param supported_language: Language this recognizer supports
+    :param supported_entity: The entity this recognizer can detect
+    :param replacement_pairs: List of tuples with potential replacement values
+    for different strings to be used during pattern matching.


please add analyzer_utils to docstring

omri374 · 2024-04-23T16:49:14Z

presidio-analyzer/presidio_analyzer/predefined_recognizers/in_gstin_recognizer.py

+from presidio_analyzer.analyzer_utils import PresidioAnalyzerUtils
+
+
+# from presidio_analyzer.analyzer_utils import PresidioAnalyzerUtils as Utils


please remove

omri374 · 2024-04-23T16:49:53Z

presidio-analyzer/tests/test_analyzer_utils.py

+luhn_mod_n_test_set = [
+    ["27AAACM6094R1ZP", True],
+    ["36AAICA3369H1ZJ", True],
+    ["36AAHAA2262Q1ZF", True],


Can we please add negative examples?
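A possible shape for such negative cases, kept in the same [value, expected] format as the existing list: each invalid value below is one of the valid samples with only its final check character changed. The checker is a local stand-in written to match the expression shown in the diff, not an import from the PR, and the loop is a plain-Python substitute for the project's actual test harness.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def get_luhn_mod_n(input_str: str, alphabet: str = ALPHABET) -> bool:
    # Local stand-in for the utility under test (assumed behavior).
    n = len(alphabet)
    luhn_input = tuple(alphabet.index(ch) for ch in reversed(input_str))
    return (
        sum(luhn_input[::2]) + sum(sum(divmod(i * 2, n)) for i in luhn_input[1::2])
    ) % n == 0


luhn_mod_n_test_set = [
    # Valid values from the existing test set.
    ["27AAACM6094R1ZP", True],
    ["36AAICA3369H1ZJ", True],
    ["36AAHAA2262Q1ZF", True],
    # Negative examples: same values with the check character altered.
    ["27AAACM6094R1ZQ", False],
    ["36AAICA3369H1ZK", False],
    ["36AAHAA2262Q1ZG", False],
]

# Collect every case whose result disagrees with the expectation.
failures = [
    value
    for value, expected in luhn_mod_n_test_set
    if get_luhn_mod_n(value) != expected
]
print(failures)  # []
```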

devopam and others added 30 commits June 26, 2023 16:39

IN_PAN pattern recognizer

818fe90

Added India PAN (Permanent Account Number) recognizer

refined IN_PAN regex

87a1aae

refined the regex for better recognition and enhanced the test cases accordingly

Update recognizer_registry.py

8756c93

Fixed lint error that was missed earlier.

Fixed Lint errors

2f85d5d

Added test cases , verification and context data

Merge branch 'main' of https://github.com/devopam/presidio

1b47061

Added more test cases in test_in_pan_recognizer.py

b0d1ce8

Added negative test cases per review comments.

Merge branch 'main' into main

b3e94ed

Merge branch 'main' into main

838402f

Merge branch 'main' into main

d4ae26d

Merge branch 'main' of https://github.com/devopam/presidio

1e81cfb

added IN_AADHAAR recognizer

88c6c1f

Merge branch 'microsoft:main' into main

b4edab4

Update in_aadhaar_recognizer.py

2d01bd0

linted code

Merge branch 'main' into main

b7c6e65

Update in_aadhaar_recognizer.py

2434bb5

update pattern recognizer value per suggestion in review

added utility function class

b6db593

added PresidioAnalyzerUtils class with generic functions. removed usage of stdnum

Merge branch 'main' into main

2dd5cec

Merge branch 'main' into main

dfb2d26

Create test_analyzer_utils.py

fd28708

added test cases for analyzer_utils.py in prescribed format

Update test_recognizer_registry.py

f0c9737

added to the count of predefined recognizers

Merge branch 'main' into main

a67f19f

Merge branch 'microsoft:main' into main

8383e08

Merge branch 'microsoft:main' into main

37b2f97

added predefined recognizer : IN_VEHICLE_REGISTRATION

57b2294

Added India specific predefined pattern recognizer for vehicle registration number

review comments incorporated

365be21

reinstated python 3.9 compatibility, reorganized code

Merge branch 'main' into main

3cdec15

Merge branch 'main' into main

28f8bec

review comments incorporated

bc059ce

Logic reverted from analyzer_utils to recognizer classfile

added null/min vehicle number size

1ffbb8b

added min size check to avoid failures per review comment

Merge branch 'main' into main

b05399f

devopam added 6 commits February 18, 2024 23:51

incorporated review comments

2a4708b

Merge branch 'main' into main

22003a4

Merge branch 'main' into main

3f00fdc

added two predefined recognizers : ISIN, CFI

424174d

Two english language predefine recognizers added viz. ISIN , CFI

added three predefined recognizers, improvements

b0767aa

1. Improved IN_PAN regex 2. Utility function for LUHN ModN validation 3. New recognizers : IN_GSTIN, CFI_CODE, ISIN_CODE

merged main branch conflicts

4133632

removed pycountry

d1f2fc6

Removed pycountry per feedback on it's license. Built the utility in analyzer_utils.py & removed all references.

Merge branch 'main' into main

93a79cf

omri374 reviewed Mar 13, 2024

Merge branch 'main' into main

3053088

devopam and others added 7 commits March 14, 2024 22:48

review feedback incorporation

65a2e70

Merge branch 'main' into main

f4a1541

interim commit - not ready for merging

4391cd4

interim code with EntityRecognizer enhancement WIP

Merge branch 'microsoft:main' into main

acf7331

Merge branch 'microsoft:main' into main

e26da0f

incorporated review suggestions

e040ecc

class instantiation changed for analyzer utils

Merge branch 'main' into main

7d6ee38

omri374 reviewed Apr 23, 2024

interim commit

64407fb
