Remove duplicates from general list #120
Hi, I would like to contribute to this issue. As the issue suggests, I would have to take all the other CSV files as input in the script and validate them against the general CSV, checking for duplicates, right? |
@sreenath-tm Yes! If a journal is in one of the CSV files, e.g. acs, and also in General, then it should be removed from general.csv |
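A minimal pandas sketch of that idea, with inline sample data standing in for the real files (the column layout `Full name;Abbreviation;;` and the sample journal names are assumptions, not taken from the actual lists):

```python
import io
import pandas as pd

# Sample data standing in for general.csv and a specific list (e.g. acs).
# Lists are semicolon-separated with no header row.
general_csv = "Journal of Tests;J. Tests;;\nACS Nano;ACS Nano;;\n"
acs_csv = "ACS Nano;ACS Nano;;\n"

general = pd.read_csv(io.StringIO(general_csv), sep=";", header=None)
acs = pd.read_csv(io.StringIO(acs_csv), sep=";", header=None)

# Keep only rows whose full name (column 0) is absent from the other list.
deduped = general[~general[0].isin(acs[0])]
```

In the real script, `io.StringIO(...)` would be replaced by the file paths, and the filter would be applied once per specific list before writing general.csv back out.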
Hi @Siedlerchr, if possible, can you assign this issue to me? |
What is the approach that we should follow to handle non-ASCII characters like "Académie Royale de Belgique"? |
It should be fine to use Unicode/UTF-8 here. Biblatex supports UTF-8/Unicode. BibTeX supports only ASCII, but that is not a problem, as we have a converter for Unicode <-> LaTeX encoding (latex2unicode) in JabRef for fields. |
I was reading the CSV files one by one and, while handling them in the Python script, I got the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position ...". I was able to overcome the issue by specifying encoding='cp437' while writing and reading. Is that an approach that can be followed, or is there an alternate approach? |
Hm, this looks odd. I just cloned the repo and VSCode tells me that all files should be in UTF-8. Do you have a concrete file where the issue occurs? |
The file "journal_abbreviations_annee-philologique" has a special character on line 31 that matches the criteria. I am using Windows and will definitely try the command out. |
Okay I will take a look here as well |
Yes, Notepad++ works well in this case. While handling the special characters in code, can I use the "cp437" encoding so that all kinds of special characters can be properly parsed? |
It is the year 2023 now and UTF-8 is standard. The journal lists have to be UTF-8. Maybe, you could start to work on a GitHub action checking for UTF-8. See #125. |
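A sketch of what such a UTF-8 check could look like, assuming the lists live in a `journals/` directory (the directory name and the function name are assumptions; an actual GitHub Action would wrap this in a workflow and fail the run when the list is non-empty):

```python
from pathlib import Path

def non_utf8_files(directory: str) -> list:
    """Return the paths of CSV files in `directory` that are not valid UTF-8."""
    bad = []
    for path in Path(directory).glob("*.csv"):
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes,
            # e.g. the 0x81 byte mentioned above.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(str(path))
    return bad
```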
Hello, is there an update on this issue? I am asking because my groupmates and I would consider working on it as part of a university assignment. |
Yes, I am working on this issue and I am almost ready with the script. I will raise the PR as soon as I am done testing the functionalities. |
@sreenath-tm What is the status here? If you already have something, it would be nice if you could create a PR |
Yes, @Siedlerchr, I have the code ready for finding the duplicates and removing them from the general CSV file. I have been testing the code with the CSV files that are present in the repository, and I had a few questions.
This character is causing a problem when I run the script: in cases where the line does not end with ";;", the dataframe gets populated with empty values in the first columns. I have validated the script by removing the trailing ";;" from all the lines that have this issue. I am currently working on handling this scenario so that the script runs without any CSV modification. I can raise a draft PR so that we can discuss this further. What do you suggest? |
Therefore, no full automation can take place. More data means better quality! So, if one entry has A1, B1, C1, D1 and the other has A2, B2, C2 and no D2, and A1 equals A2, B1 equals B2, C1 equals C2, they can be merged into A2, B2, C2, D2. If C2 differs, manual investigation is necessary! If B2 differs, manual investigation is necessary! Maybe these are different journals... It's on a case-by-case basis. Maybe you even need to Google the journals! Note that some lists are generated by scripts based on web data and cannot be altered directly. In that case, a post-processing step needs to be added to the respective scripts.
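The merge rule described above could be sketched roughly like this (the function name and the empty-string convention for missing fields are illustrative, not from the actual script):

```python
def merge_rows(row1, row2):
    """Merge two journal entries field by field.

    Fields present in both rows must be identical; a missing field ("")
    is inherited from the other row. Returns None on any conflict,
    signalling that manual investigation is needed.
    """
    merged = []
    for a, b in zip(row1, row2):
        if a and b and a != b:
            return None  # conflicting values: do not merge automatically
        merged.append(a or b)
    return merged
```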
I don't understand. Pandas should be able to handle empty values. |
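For what it's worth, a minimal sketch of how pandas handles the ragged rows in question: lines without the trailing ";;" yield fewer fields and pandas pads them with NaN, so the frame stays aligned either way (sample data, not from the real lists):

```python
import io
import pandas as pd

# One line with the trailing ";;", one without.
mixed = "A Journal;A. J.;;\nAnother Journal;Anr. J.\n"
df = pd.read_csv(io.StringIO(mixed), sep=";", header=None)

# Drop columns that are entirely empty, then replace remaining NaN with "".
df = df.dropna(axis=1, how="all").fillna("")
```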
Yes, go ahead. As said, maybe other scripts need to be adapted, too. Learning: data cleansing is hard... |
Sure @koppor, I will do a deep dive into the cleansing part, which I had not given much importance. With the existing script I have, these problems occur in the files mentioned above. That is why I feel the other scripts might not need to be edited, because they deal with other CSV files, but I will definitely take a look into those scripts as well and get back to you within a week. I will raise the draft pull request in a few days with the logic that you have suggested. |
@sreenath-tm Thank you! It would be great if there was logic in all scripts enabling automatic data cleanup of the lists. I did not check the source lists themselves. A good start is the index of all lists at https://github.com/JabRef/abbrv.jabref.org/blob/main/journals/README.md. For the three lists where sources are provided, there is an automatic update in place: https://github.com/JabRef/abbrv.jabref.org/blob/main/.github/workflows/refresh-journal-lists.yml. - Maybe you can find sources for the other lists? I doubt that any of these lists is hand-crafted. |
Most of them were added as files around 2016 by the contributors. The main question is what cleaning up would look like, because around 99% of the data contains only the first two fields, so do we want to end each entry with ";;" or not?
The result that I have generated is in the first format (the default format for a CSV generated from a dataframe), and if required I can write a separate script so that all entries that have only two fields follow either the first or the second format. I feel the first format seems more technical, but I cannot deny that for the journals generated by the scripts (refresh-journal) we have to follow the second format. |
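If the ";;"-terminated format is chosen, a normalizer could be a one-liner per line, padding every entry to a fixed number of fields (the function name and the four-field target are assumptions for illustration):

```python
def normalize_line(line: str, n_fields: int = 4) -> str:
    """Pad a semicolon-separated entry to `n_fields` fields,
    so 'Name;Abbrev' becomes 'Name;Abbrev;;'."""
    fields = line.rstrip("\n").split(";")
    fields += [""] * (n_fields - len(fields))
    return ";".join(fields[:n_fields])
```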
Sorry, I had used all the fields A, B, C and D for comparison in the script. So the differences between B1 and B2 should be ignored and only the A1 and A2 comparison should be done, right? |
Fixed with #128. |
Issue #130 is kind of follow up of this issue. Maybe, you have time to investigate? |
Sure @koppor I will definitely take a look into the issue. |
@sreenath-tm Another thing: I noticed that there are entries where the abbreviation is equal to the full name. Does that happen more than once? Can it be fixed? Example: |
@koppor This can be present in other CSV files, too. Do you recommend creating a separate script so that we can remove all such entries from all the CSV files, or are we only concerned with the entries from the general.csv file? |
I would like to see a cleanup script being called after the import has happened. It could take the filename as a parameter? It could also clean up all files (that should not take long). |
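A sketch of such a cleanup pass, dropping entries whose abbreviation equals the full name (the function name is hypothetical, and the `Name;Abbrev;...` field layout is an assumption):

```python
def cleanup(lines):
    """Drop entries whose abbreviation (field 1) is identical to the
    full name (field 0); keep everything else unchanged."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) >= 2 and fields[0] == fields[1]:
            continue  # abbreviation identical to full name: drop entry
        kept.append(line)
    return kept
```

Wired up as a script, this could read the filename from `sys.argv`, run `cleanup` over the file's lines, and write the result back, or iterate over all CSV files in the journals directory.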
Our general list has entries which also appear in other lists (see https://github.com/JabRef/abbrv.jabref.org/tree/main/journals).
Write a script which removes the duplicates from ...general.csv. Python Pandas could be used. See https://github.com/JabRef/abbrv.jabref.org/blob/main/scripts/update_mathscinet.py for an example usage.
This would fix the last bullet point of JabRef/jabref-koppor#48. ("Check status of the general list [...]")