
Remove duplicates from general list #120

Closed · koppor opened this issue Jan 2, 2023 · 28 comments

@koppor (Member) commented Jan 2, 2023

Our general list contains entries that also appear in other lists (see https://github.com/JabRef/abbrv.jabref.org/tree/main/journals).

Write a script that removes the duplicates from ...general.csv. Python pandas could be used; see https://github.com/JabRef/abbrv.jabref.org/blob/main/scripts/update_mathscinet.py for example usage.

This would fix the last bullet point of JabRef/jabref-koppor#48. ("Check status of the general list [...]")
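A minimal sketch of what such a script could do (the file names and the rule of comparing on the full name only are assumptions, not a specification):

```python
# Sketch, not the final script: drop every entry from the general list whose
# full name already appears in another list. File names are assumptions.
from pathlib import Path

import pandas as pd

JOURNALS = Path("journals")
GENERAL = JOURNALS / "journal_abbreviations_general.csv"

def read_list(path: Path) -> pd.DataFrame:
    # The lists use ";" as separator and have no header row.
    return pd.read_csv(path, sep=";", header=None,
                       names=["name", "abbrev", "c", "d"], dtype=str,
                       encoding="utf-8")

general = read_list(GENERAL)
others = pd.concat(read_list(p)
                   for p in JOURNALS.glob("journal_abbreviations_*.csv")
                   if p != GENERAL)

# Keep only the rows whose full name does not occur in any other list.
general[~general["name"].isin(others["name"])].to_csv(
    GENERAL, sep=";", header=False, index=False, encoding="utf-8")
```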

@sreenath-tm (Contributor) commented:

Hi, I would like to contribute to this issue. As the issue suggests, I would take all the other CSV files as input in the script, check them against the general CSV, and look for duplicates, right?

@Siedlerchr (Member) commented Jan 28, 2023

@sreenath-tm Yes! If a journal appears in one of the CSV files, e.g. acs, and also in the general list, it should be removed from general.csv.

@sreenath-tm (Contributor) commented:

Hi @Siedlerchr, could you please assign this issue to me?

@sreenath-tm (Contributor) commented:

What approach should we follow to handle non-ASCII characters like "Académie Royale de Belgique"?

@Siedlerchr (Member) commented Feb 25, 2023

It should be fine to use Unicode/UTF-8 here. Biblatex supports UTF-8/Unicode. BibTeX supports only ASCII, but that is not a problem, as JabRef has a converter between Unicode and LaTeX encoding (latex2unicode) for fields.

@sreenath-tm (Contributor) commented Feb 25, 2023

I was reading the CSV files one by one, and while handling them in the Python script I got the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position ...". I was able to overcome it by specifying encoding='cp437' when reading and writing. Is that an acceptable approach, or is there an alternative?

@Siedlerchr (Member) commented Feb 25, 2023

Hm, this looks odd. I just cloned the repo, and VSCode tells me that all files should be UTF-8. Do you have a concrete file where the issue occurs?
Are you on Windows? Then you might need to enable UTF-8 mode (https://docs.python.org/3/library/os.html#utf8-mode), e.g. with -X utf8 (https://docs.python.org/3/using/cmdline.html#cmdoption-X).
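Alternatively, you can force the encoding per call instead of relying on the interpreter flag; a minimal sketch (the path is a placeholder):

```python
import pandas as pd

# Passing encoding="utf-8" explicitly avoids falling back to the Windows
# locale encoding (e.g. cp1252), which is what raises the UnicodeDecodeError.
df = pd.read_csv(
    "journals/journal_abbreviations_example.csv",  # placeholder path
    sep=";",
    header=None,
    encoding="utf-8",
)
```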

@sreenath-tm (Contributor) commented:

The file "journal_abbreviations_annee-philologique" on line 31 has somewhat of a special character that matches the criteria. I am using windows and will definitely try the command out.

@Siedlerchr (Member) commented:

Okay, I will take a look here as well.

@Siedlerchr (Member) commented:

Yep, I saw that in a hex editor as well. On Windows, Notepad++ is a great tool for checking this.
Try replacing the line with:

Analecta Malacitana: Revista de la Sección de Filología de la Facultad de Filosofía y Letras de la Universidad de Málaga;AMal

git gui shows the diff: [screenshot omitted]

@sreenath-tm (Contributor) commented:

Yes, Notepad++ works well in this case. While handling the special characters in code, can I use the "cp437" encoding so that all kinds of special characters are parsed properly?

@koppor (Member, Author) commented Feb 27, 2023

> Yes, Notepad++ works well in this case. While handling the special characters in code, can I use the "cp437" encoding so that all kinds of special characters are parsed properly?

It is the year 2023 now, and UTF-8 is the standard. The journal lists have to be UTF-8.

Maybe you could start by working on a GitHub Action that checks for UTF-8. See #125.
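A sketch of what that check could do (the journals/ layout is an assumption; the Action would just run this and fail on a non-zero exit code):

```python
# Sketch: exit non-zero if any journal list is not valid UTF-8.
import sys
from pathlib import Path

invalid = []
for path in Path("journals").glob("*.csv"):
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError as err:
        invalid.append(f"{path}: {err}")

if invalid:
    print("\n".join(invalid))
    sys.exit(1)
```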

@nikolaynikolaevn commented:

Hello, is there an update on this issue? I am asking because my groupmates and I would like to consider working on it as part of a university assignment.

@sreenath-tm (Contributor) commented Mar 9, 2023

Yes, I am working on this issue, and the script is almost ready. I will raise the PR as soon as I am done testing the functionality.

@ThiloteE moved this from "Free to take" to "Reserved" in "Candidates for University Projects" on Mar 16, 2023
@Siedlerchr (Member) commented:

@sreenath-tm What is the status here? If you already have something, it would be nice if you could create a PR

@sreenath-tm (Contributor) commented Mar 27, 2023

Yes, @Siedlerchr, the code for finding the duplicates and removing them from the general CSV file is ready. I have been testing it with the CSV files present in the repository and ran into a few points of confusion.

  • The problem I am facing is that the data in these CSV files is quite irregular: JabRef allows the entries to have up to four fields (A;B[;C[;D]], i.e. the last two are optional). During testing I found that, out of around 150k lines, only 25 actually use this full format. In such a case, do I need to handle them separately, or is it fine if I just check the name and the abbreviation?

  • I created a pandas DataFrame and read the data from the CSV files with ";" as the separator. An irregularity I found is that some lines contain only the full name and the abbreviation but still end with ";;", which signifies that the other two fields are empty. This format is not uniform, however. Out of the 20 files present, I found such anomalies only in the following:

general: 899 lines
medicus: 3169 lines
dotless: 87221
dots: 87224

These characters cause a problem when I run the script: for lines that do not end with ";;", the DataFrame gets populated with empty values in the first columns. I validated the script by removing the trailing ";;" from all affected lines. I am currently working on handling this scenario so that the script runs without any CSV modification.

I can raise a draft PR so that we can discuss this further. What do you suggest?

@koppor (Member, Author) commented Mar 28, 2023

> The problem I am facing is that the data in these CSV files is quite irregular: JabRef allows the entries to have up to four fields (A;B[;C[;D]], i.e. the last two are optional). During testing I found that, out of around 150k lines, only 25 actually use this full format. In such a case, do I need to handle them separately, or is it fine if I just check the name and the abbreviation?

Therefore, no full automation can take place.

More data means better quality!

So, if one entry has A1, B1, C1, D1 and the other has A2, B2, C2 and no D2, and A1 equals A2, B1 equals B2, and C1 equals C2, they can be merged into A2, B2, C2, D1 (keeping the D that exists).

If C2 differs, manual investigation is necessary!

If B2 differs, manual investigation is necessary! Maybe these are different journals...

It's a case-by-case decision. Maybe you even need to Google the journals!
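Expressed as a sketch (entries as four-tuples, D possibly missing; illustrative only, not a complete implementation):

```python
# Sketch of the merge rule above: entries are (A, B, C, D) tuples, D may be None.
def merge(entry1, entry2):
    a1, b1, c1, d1 = entry1
    a2, b2, c2, d2 = entry2
    if a1 == a2 and b1 == b2 and c1 == c2:
        return (a2, b2, c2, d1 or d2)  # keep whichever D is present
    return None  # B or C differs: flag for manual investigation

print(merge(("J. X", "JX", "JX", "M"), ("J. X", "JX", "JX", None)))
# -> ('J. X', 'JX', 'JX', 'M')
```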

Note that some lists are generated by scripts from web data and cannot be altered directly. In that case, a post-processing step needs to be added to the respective scripts.

> These characters cause a problem when I run the script: for lines that do not end with ";;", the DataFrame gets populated with empty values in the first columns.
I don't understand; pandas should be able to handle empty values.
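For example, with explicit column names both formats align; the two sample lines are taken from the lists, and the names A to D are just labels:

```python
import io

import pandas as pd

# With explicit column names, pandas pads the short row with NaN
# instead of shifting values into the wrong columns.
sample = (
    "Advances in Chemistry Series;Adv. Chem. Ser.;;\n"
    "ACS Applied Nano Materials;ACS Appl. Nano Mater.\n"
)
df = pd.read_csv(io.StringIO(sample), sep=";", header=None,
                 names=["A", "B", "C", "D"], dtype=str)
print(df)  # A and B are filled for both rows; C and D are NaN
```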


> I can raise a draft PR so that we can discuss this further. What do you suggest?

Yes, go ahead. As said, maybe other scripts need to be adapted, too.

Learning: data cleansing is hard ...

@sreenath-tm (Contributor) commented Apr 1, 2023

Sure @koppor, I will do a deep dive into the cleansing part, which I had not given much importance so far. With my existing script, these problems appear only in the files mentioned above; that is why I think the other scripts might not need to be edited, since they deal with other CSV files. Still, I will take a look at those scripts as well and get back to you within a week. I will raise the draft pull request in a few days with the logic you suggested.

@koppor (Member, Author) commented Apr 2, 2023

@sreenath-tm Thank you! It would be great if there were logic in all scripts enabling automatic data cleanup of the lists. I did not check the source lists themselves. A good start is the index of all lists at https://github.com/JabRef/abbrv.jabref.org/blob/main/journals/README.md. For the three lists where sources are provided, there is an automatic update in place: https://github.com/JabRef/abbrv.jabref.org/blob/main/.github/workflows/refresh-journal-lists.yml. Maybe you can find sources for the other lists? I doubt that any of these lists is hand-crafted.

@sreenath-tm (Contributor) commented Apr 7, 2023

Most of these lists were added as files around 2016 by contributors. The main question is what cleaning up should look like: around 99% of the data contains only the first two fields, so do we want each such entry to end with ";;" or not?

  • Advances in Chemistry Series;Adv. Chem. Ser.;; (the last two fields are absent, but the trailing ";;" signifies that they are empty)
  • ACS Applied Nano Materials;ACS Appl. Nano Mater. (the last two fields are absent and there is no trailing ";;")

The result I have generated uses the first format (the default for a CSV written from a DataFrame). If required, I can write a separate script so that all entries with only two fields follow either the first or the second format. The first format seems more rigorous to me, but for the lists generated by the scripts (refresh-journal-lists) we have to follow the second format.
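If we settle on the first format, a normalization pass could look roughly like this (a sketch; it assumes ";" never occurs inside a field):

```python
# Sketch for the first format: pad every entry to four ";"-separated fields.
from pathlib import Path

def normalize(path: Path) -> None:
    lines = path.read_text(encoding="utf-8").splitlines()
    padded = []
    for line in lines:
        fields = line.split(";")
        fields += [""] * (4 - len(fields))  # "A;B" becomes "A;B;;"
        padded.append(";".join(fields))
    path.write_text("\n".join(padded) + "\n", encoding="utf-8")
```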

@koppor (Member, Author) commented Apr 7, 2023

Update: in the review of the PR, I said that we should ignore differences in B; we should take the entries from the other lists. The aim is to reduce the size of the general list.

And most probably, the other abbreviations are shorter. See the following random example:

[screenshot of an example entry omitted]

@sreenath-tm (Contributor) commented Apr 7, 2023

Sorry, I had used all four fields A, B, C, and D for comparison in the script. So the differences between B1 and B2 should be ignored, and only A1 and A2 should be compared, right?

@koppor (Member, Author) commented Apr 12, 2023

Fixed with #128.

@koppor closed this as completed on Apr 12, 2023
@koppor (Member, Author) commented Apr 12, 2023

Issue #130 is kind of a follow-up to this issue. Maybe you have time to investigate?

@sreenath-tm (Contributor) commented:

Sure @koppor, I will definitely take a look into the issue.

@koppor (Member, Author) commented Apr 26, 2023

@sreenath-tm Another thing: I noticed that there are entries where the abbreviation is equal to the full name. Does that happen more than once? Can it be fixed? Example: Agrokhimiya;Agrokhimiya;.
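A quick sketch to list such entries (the file name and column labels are placeholders):

```python
# Sketch: print the entries whose abbreviation merely repeats the full name.
import pandas as pd

df = pd.read_csv("journals/journal_abbreviations_general.csv", sep=";",
                 header=None, names=["name", "abbrev", "c", "d"], dtype=str)
print(df[df["name"] == df["abbrev"]])
```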

@sreenath-tm (Contributor) commented:

@koppor Such entries can be present in other CSV files as well. Do you recommend creating a separate script so that we can remove all such entries from all the CSV files, or are we only concerned with the entries in the general.csv file?

@koppor (Member, Author) commented Apr 28, 2023

I would like to see a cleanup script being called after the import has happened. It could take the filename as a parameter. It could also clean up all files (that should not take long).
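As a sketch of what I mean (the single cleanup step shown, dropping self-abbreviated entries, is only an example):

```python
# Sketch of the suggested entry point: clean the files given as arguments,
# or all lists when none are given.
import sys
from pathlib import Path

def cleanup(path: Path) -> None:
    lines = path.read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines
            if not (line.count(";") >= 1
                    and line.split(";")[0] == line.split(";")[1])]
    path.write_text("\n".join(kept) + "\n", encoding="utf-8")

if __name__ == "__main__":
    targets = ([Path(arg) for arg in sys.argv[1:]]
               or sorted(Path("journals").glob("*.csv")))
    for target in targets:
        cleanup(target)
```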
