Error message when using validator #197

Closed
lopierra opened this issue Jul 3, 2024 · 4 comments

lopierra commented Jul 3, 2024

Hi @madanucd - I'm attempting to run the validator on a test dataset:

validate-data -o ./errorlogs ./ABC-DS.csv participant

but I get the following error message:

Traceback (most recent call last):
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Scripts\validate-data", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\cli.py", line 36, in main
    validation_function(args.input_file, args.output)
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation.py", line 20, in validate_participant
    return validate_data(file_path, string_columns, validate_participant_entry, output_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 54, in validate_data
    clean_dataframe_strings(df, string_columns)
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 16, in clean_dataframe_strings
    df[string_columns] = df[string_columns].map(clean_string)
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'map'. Did you mean: 'max'?

Am I doing something wrong, or is it an issue with the validator?

Not urgent - we can discuss next Tuesday at the Data Modeling meeting. Thanks!

@lopierra lopierra added the linkml Issues that require linkml development label Jul 3, 2024

madanucd commented Jul 8, 2024

The map call runs on my local machine, but DataFrame.map is not available in every pandas version. To keep the behavior consistent, we should switch to applymap or apply, which apply a function to DataFrame elements in older versions as well. I will prepare a PR with this change.

@madanucd

Hi Pierrette,

I wanted to bring to your attention that applymap has been deprecated since pandas 2.1.0 in favor of DataFrame.map. You can find more details in the pandas documentation. The map call was working for me because my pandas version is 2.2.0.

We could switch to applymap, which older pandas versions support. However, because it is deprecated, it might stop working in a future pandas release.

Could you please try updating your pandas version? This should resolve the issue.

Thank you!
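
For reference, here is a minimal sketch of how the cleaning step could tolerate both older and newer pandas versions instead of requiring an upgrade. The body of clean_string is an assumption for illustration (the real function is not shown in this thread), and the hasattr check is just one possible approach, not necessarily what the project adopted.

```python
import pandas as pd


def clean_string(value):
    """Hypothetical cleaner: strip whitespace from strings, leave other values as-is."""
    return value.strip() if isinstance(value, str) else value


def clean_dataframe_strings(df, string_columns):
    """Apply clean_string element-wise on whichever method this pandas provides."""
    subset = df[string_columns]
    if hasattr(subset, "map"):
        # pandas >= 2.1.0: DataFrame.map is the element-wise method
        cleaned = subset.map(clean_string)
    else:
        # older pandas: DataFrame.map does not exist, fall back to applymap
        cleaned = subset.applymap(clean_string)
    df[string_columns] = cleaned
    return df


df = pd.DataFrame({"Study Code": [" ABC-DS "], "Participant External ID": [10001]})
clean_dataframe_strings(df, ["Study Code", "Participant External ID"])
```

Pinning a minimum pandas version in the project dependencies would be a simpler alternative if supporting older versions is not a goal.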


lopierra commented Jul 17, 2024

@madanucd I updated pandas and got a bit further with the validator. I ran it on the same file that I sent you before (ABC-DS.csv) and got the expected validation errors, but also got a TypeError. Is this expected? (Maybe due to ABC-DS having IDs that are integers instead of strings?)

validate-data -o ./errorlogs ./ABC-DS.csv participant

Validating participant data from file: ./ABC-DS.csv
Traceback (most recent call last):
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validate_participant.py", line 7, in validate_participant_entry
    instance = Participant(
               ^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pydantic\main.py", line 192, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 7 validation errors for Participant
participantExternalId
  Input should be a valid string [type=string_type, input_value=10001, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
familyId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
fatherId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
motherId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
siblingId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
otherFamilyMemberId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
ageAtLastVitalStatus
  Input should be a finite number [type=finite_number, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/finite_number

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Scripts\validate-data", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\cli.py", line 36, in main
    validation_function(args.input_file, args.output)
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation.py", line 20, in validate_participant
    return validate_data(file_path, string_columns, validate_participant_entry, output_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 55, in validate_data
    valid_count, invalid_count = validate_dataframe(df, validation_function, input_file_name=file_name,
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 20, in validate_dataframe
    validation_results = df.apply(entry_validator, axis=1)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\apply.py", line 916, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validate_participant.py", line 31, in validate_participant_entry
    error_details = (row['Study Code'] + "-" + row['Participant External ID'], e)
                     ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: can only concatenate str (not "int") to str

@madanucd

Hi @lopierra, yes, the error occurred because the Participant External ID is stored as an integer instead of a string in the ABC-DS dataset you are trying to validate. I have fixed this by casting all fields to strings before they are used for logging. The changes can be reviewed in PR #199.
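
As a rough illustration of the kind of change described above (this is a sketch, not the actual code in the PR), the log key can be built with string formatting so that integer IDs no longer raise a TypeError. The helper names error_key and nan_to_none are hypothetical, and the "Family ID" column label is assumed. The NaN handling is an extra suggestion for the empty cells that appear as nan in the validation errors.

```python
import math

import pandas as pd


def nan_to_none(value):
    """Return None for missing values (float NaN), otherwise the value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def error_key(row):
    """Build the log key; f-string formatting converts an int ID to text."""
    return f"{row['Study Code']}-{row['Participant External ID']}"


# A row shaped like the ones df.apply(..., axis=1) passes to the validator
row = pd.Series(
    {
        "Study Code": "ABC-DS",
        "Participant External ID": 10001,  # parsed as int by read_csv
        "Family ID": float("nan"),  # empty cell in the CSV
    }
)

key = error_key(row)
family_id = nan_to_none(row["Family ID"])
```

Alternatively, passing dtype=str for the ID columns to pandas.read_csv would keep the IDs as strings from the start.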
