
Remove duplicates from general list #120

Closed · koppor opened this issue Jan 2, 2023 · 28 comments

@koppor (Member) commented Jan 2, 2023

Our general list contains entries that also appear in other lists (see https://github.com/JabRef/abbrv.jabref.org/tree/main/journals).

Write a script that removes the duplicates from ...general.csv. Python pandas could be used; see https://github.com/JabRef/abbrv.jabref.org/blob/main/scripts/update_mathscinet.py for example usage.

This would fix the last bullet point of JabRef/jabref-koppor#48. ("Check status of the general list [...]")
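A minimal sketch of what such a script could do (the file names and the rule of comparing on the full name only are assumptions, not a specification):

```python
# Sketch, not the final script: drop every entry from the general list whose
# full name already appears in another list. File names are assumptions.
from pathlib import Path

import pandas as pd

JOURNALS = Path("journals")
GENERAL = JOURNALS / "journal_abbreviations_general.csv"

def read_list(path: Path) -> pd.DataFrame:
    # The lists use ";" as separator and have no header row.
    return pd.read_csv(path, sep=";", header=None,
                       names=["name", "abbrev", "c", "d"], dtype=str,
                       encoding="utf-8")

general = read_list(GENERAL)
others = pd.concat(read_list(p)
                   for p in JOURNALS.glob("journal_abbreviations_*.csv")
                   if p != GENERAL)

# Keep only the rows whose full name does not occur in any other list.
general[~general["name"].isin(others["name"])].to_csv(
    GENERAL, sep=";", header=False, index=False, encoding="utf-8")
```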

@sreenath-tm (Contributor) commented:

Hi, I would like to contribute to this issue. As the issue suggests, I would take all the other CSV files as input in the script, check them against the general CSV, and look for duplicates, right?

@Siedlerchr (Member) commented Jan 28, 2023

@sreenath-tm Yes! If a journal appears in one of the CSV files, e.g. acs, and also in the general list, it should be removed from general.csv.

@sreenath-tm (Contributor) commented:

Hi @Siedlerchr, could you please assign this issue to me?

@sreenath-tm (Contributor) commented:

What approach should we follow to handle non-ASCII characters like "Académie Royale de Belgique"?

@Siedlerchr (Member) commented Feb 25, 2023

It should be fine to use Unicode/UTF-8 here. Biblatex supports UTF-8/Unicode. BibTeX supports only ASCII, but that is not a problem, as JabRef has a converter between Unicode and LaTeX encoding (latex2unicode) for fields.

@sreenath-tm (Contributor) commented Feb 25, 2023

I was reading the CSV files one by one, and while handling them in the Python script I got the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position ...". I was able to overcome it by specifying encoding='cp437' when reading and writing. Is that an acceptable approach, or is there an alternative?

@Siedlerchr (Member) commented Feb 25, 2023

Hm, this looks odd. I just cloned the repo, and VSCode tells me that all files should be UTF-8. Do you have a concrete file where the issue occurs?
Are you on Windows? Then you might need to enable UTF-8 mode (https://docs.python.org/3/library/os.html#utf8-mode), e.g. with -X utf8 (https://docs.python.org/3/using/cmdline.html#cmdoption-X).
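Alternatively, you can force the encoding per call instead of relying on the interpreter flag; a minimal sketch (the path is a placeholder):

```python
import pandas as pd

# Passing encoding="utf-8" explicitly avoids falling back to the Windows
# locale encoding (e.g. cp1252), which is what raises the UnicodeDecodeError.
df = pd.read_csv(
    "journals/journal_abbreviations_example.csv",  # placeholder path
    sep=";",
    header=None,
    encoding="utf-8",
)
```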

@sreenath-tm (Contributor) commented:

The file "journal_abbreviations_annee-philologique" on line 31 has somewhat of a special character that matches the criteria. I am using windows and will definitely try the command out.

@Siedlerchr (Member) commented:

Okay, I will take a look here as well.

@Siedlerchr (Member) commented:

Yep, I saw that in a hex editor as well. On Windows, Notepad++ is a great tool for checking this.
Try replacing the line with:

Analecta Malacitana: Revista de la Sección de Filología de la Facultad de Filosofía y Letras de la Universidad de Málaga;AMal

git gui shows the diff: [screenshot omitted]

@sreenath-tm (Contributor) commented:

Yes, Notepad++ works well in this case. While handling the special characters in code, can I use the "cp437" encoding so that all kinds of special characters are parsed properly?

@koppor (Member, Author) commented Feb 27, 2023

> Yes, Notepad++ works well in this case. While handling the special characters in code, can I use the "cp437" encoding so that all kinds of special characters are parsed properly?

It is the year 2023 now, and UTF-8 is the standard. The journal lists have to be UTF-8.

Maybe you could start by working on a GitHub Action that checks for UTF-8. See #125.
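A sketch of what that check could do (the journals/ layout is an assumption; the Action would just run this and fail on a non-zero exit code):

```python
# Sketch: exit non-zero if any journal list is not valid UTF-8.
import sys
from pathlib import Path

invalid = []
for path in Path("journals").glob("*.csv"):
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError as err:
        invalid.append(f"{path}: {err}")

if invalid:
    print("\n".join(invalid))
    sys.exit(1)
```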

@nikolaynikolaevn commented:

Hello, is there an update on this issue? I am asking because my groupmates and I would like to consider working on it as part of a university assignment.

@sreenath-tm (Contributor) commented Mar 9, 2023

Yes, I am working on this issue, and the script is almost ready. I will raise the PR as soon as I am done testing the functionality.

@ThiloteE moved this from "Free to take" to "Reserved" in "Candidates for University Projects" on Mar 16, 2023
@Siedlerchr (Member) commented:

@sreenath-tm What is the status here? If you already have something, it would be nice if you could create a PR

@sreenath-tm (Contributor) commented Mar 27, 2023

Yes, @Siedlerchr, the code for finding the duplicates and removing them from the general CSV file is ready. I have been testing it with the CSV files present in the repository and ran into a few points of confusion.

  • The problem I am facing is that the data in these CSV files is quite irregular: JabRef allows the entries to have up to four fields (A;B[;C[;D]], i.e. the last two are optional). During testing I found that, out of around 150k lines, only 25 actually use this full format. In such a case, do I need to handle them separately, or is it fine if I just check the name and the abbreviation?

  • I created a pandas DataFrame and read the data from the CSV files with ";" as the separator. An irregularity I found is that some lines contain only the full name and the abbreviation but still end with ";;", which signifies that the other two fields are empty. This format is not uniform, however. Out of the 20 files present, I found such anomalies only in the following:

general: 899 lines
medicus: 3169 lines
dotless: 87221
dots: 87224

These characters cause a problem when I run the script: for lines that do not end with ";;", the DataFrame gets populated with empty values in the first columns. I validated the script by removing the trailing ";;" from all affected lines. I am currently working on handling this scenario so that the script runs without any CSV modification.

I can raise a draft PR so that we can discuss this further. What do you suggest?

@koppor (Member, Author) commented Mar 28, 2023

> The problem I am facing is that the data in these CSV files is quite irregular: JabRef allows the entries to have up to four fields (A;B[;C[;D]], i.e. the last two are optional). During testing I found that, out of around 150k lines, only 25 actually use this full format. In such a case, do I need to handle them separately, or is it fine if I just check the name and the abbreviation?

Therefore, no full automation can take place.

More data means better quality!

So, if one entry has A1, B1, C1, D1 and the other has A2, B2, C2 and no D2, and A1 equals A2, B1 equals B2, and C1 equals C2, they can be merged into A2, B2, C2, D1 (keeping the D that exists).

If C2 differs, manual investigation is necessary!

If B2 differs, manual investigation is necessary! Maybe these are different journals...

It's a case-by-case decision. Maybe you even need to Google the journals!
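Expressed as a sketch (entries as four-tuples, D possibly missing; illustrative only, not a complete implementation):

```python
# Sketch of the merge rule above: entries are (A, B, C, D) tuples, D may be None.
def merge(entry1, entry2):
    a1, b1, c1, d1 = entry1
    a2, b2, c2, d2 = entry2
    if a1 == a2 and b1 == b2 and c1 == c2:
        return (a2, b2, c2, d1 or d2)  # keep whichever D is present
    return None  # B or C differs: flag for manual investigation

print(merge(("J. X", "JX", "JX", "M"), ("J. X", "JX", "JX", None)))
# -> ('J. X', 'JX', 'JX', 'M')
```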

Note that some lists are generated by scripts from web data and cannot be altered directly. In that case, a post-processing step needs to be added to the respective scripts.

> These characters cause a problem when I run the script: for lines that do not end with ";;", the DataFrame gets populated with empty values in the first columns.
I don't understand; pandas should be able to handle empty values.
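For example, with explicit column names both formats align; the two sample lines are taken from the lists, and the names A to D are just labels:

```python
import io

import pandas as pd

# With explicit column names, pandas pads the short row with NaN
# instead of shifting values into the wrong columns.
sample = (
    "Advances in Chemistry Series;Adv. Chem. Ser.;;\n"
    "ACS Applied Nano Materials;ACS Appl. Nano Mater.\n"
)
df = pd.read_csv(io.StringIO(sample), sep=";", header=None,
                 names=["A", "B", "C", "D"], dtype=str)
print(df)  # A and B are filled for both rows; C and D are NaN
```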


> I can raise a draft PR so that we can discuss this further. What do you suggest?

Yes, go ahead. As said, maybe other scripts need to be adapted, too.

Learning: data cleansing is hard ...

@sreenath-tm (Contributor) commented Apr 1, 2023

Sure @koppor, I will do a deep dive into the cleansing part, which I had not given much importance so far. With my existing script, these problems appear only in the files mentioned above; that is why I think the other scripts might not need to be edited, since they deal with other CSV files. Still, I will take a look at those scripts as well and get back to you within a week. I will raise the draft pull request in a few days with the logic you suggested.

@koppor (Member, Author) commented Apr 2, 2023

@sreenath-tm Thank you! It would be great if there were logic in all scripts enabling automatic data cleanup of the lists. I did not check the source lists themselves. A good start is the index of all lists at https://github.com/JabRef/abbrv.jabref.org/blob/main/journals/README.md. For the three lists where sources are provided, there is an automatic update in place: https://github.com/JabRef/abbrv.jabref.org/blob/main/.github/workflows/refresh-journal-lists.yml. Maybe you can find sources for the other lists? I doubt that any of these lists is hand-crafted.

@sreenath-tm (Contributor) commented Apr 7, 2023

Most of these lists were added as files around 2016 by contributors. The main question is what cleaning up should look like: around 99% of the data contains only the first two fields, so do we want each such entry to end with ";;" or not?

  • Advances in Chemistry Series;Adv. Chem. Ser.;; (the last two fields are absent, but the trailing ";;" signifies that they are empty)
  • ACS Applied Nano Materials;ACS Appl. Nano Mater. (the last two fields are absent and there is no trailing ";;")

The result I have generated uses the first format (the default for a CSV written from a DataFrame). If required, I can write a separate script so that all entries with only two fields follow either the first or the second format. The first format seems more rigorous to me, but for the lists generated by the scripts (refresh-journal-lists) we have to follow the second format.
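If we settle on the first format, a normalization pass could look roughly like this (a sketch; it assumes ";" never occurs inside a field):

```python
# Sketch for the first format: pad every entry to four ";"-separated fields.
from pathlib import Path

def normalize(path: Path) -> None:
    lines = path.read_text(encoding="utf-8").splitlines()
    padded = []
    for line in lines:
        fields = line.split(";")
        fields += [""] * (4 - len(fields))  # "A;B" becomes "A;B;;"
        padded.append(";".join(fields))
    path.write_text("\n".join(padded) + "\n", encoding="utf-8")
```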

@koppor (Member, Author) commented Apr 7, 2023

Update: in the review of the PR, I said that we should ignore differences in B; we should take the entries from the other lists. The aim is to reduce the size of the general list.

And most probably, the other abbreviations are shorter. See the following random example:

[screenshot of an example entry omitted]

@sreenath-tm (Contributor) commented Apr 7, 2023

Sorry, I had used all four fields A, B, C, and D for comparison in the script. So the differences between B1 and B2 should be ignored, and only A1 and A2 should be compared, right?

@koppor (Member, Author) commented Apr 12, 2023

Fixed with #128.

@koppor closed this as completed on Apr 12, 2023
@koppor (Member, Author) commented Apr 12, 2023

Issue #130 is kind of a follow-up to this issue. Maybe you have time to investigate?

@sreenath-tm (Contributor) commented:

Sure @koppor, I will definitely take a look into the issue.

@koppor (Member, Author) commented Apr 26, 2023

@sreenath-tm Another thing: I noticed that there are entries where the abbreviation is equal to the full name. Does that happen more than once? Can it be fixed? Example: Agrokhimiya;Agrokhimiya;.
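A quick sketch to list such entries (the file name and column labels are placeholders):

```python
# Sketch: print the entries whose abbreviation merely repeats the full name.
import pandas as pd

df = pd.read_csv("journals/journal_abbreviations_general.csv", sep=";",
                 header=None, names=["name", "abbrev", "c", "d"], dtype=str)
print(df[df["name"] == df["abbrev"]])
```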

@sreenath-tm (Contributor) commented:

@koppor Such entries can be present in other CSV files as well. Do you recommend creating a separate script so that we can remove all such entries from all the CSV files, or are we only concerned with the entries in the general.csv file?

@koppor (Member, Author) commented Apr 28, 2023

I would like to see a cleanup script being called after the import has happened. It could take the filename as a parameter. It could also clean up all files (that should not take long).
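As a sketch of what I mean (the single cleanup step shown, dropping self-abbreviated entries, is only an example):

```python
# Sketch of the suggested entry point: clean the files given as arguments,
# or all lists when none are given.
import sys
from pathlib import Path

def cleanup(path: Path) -> None:
    lines = path.read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines
            if not (line.count(";") >= 1
                    and line.split(";")[0] == line.split(";")[1])]
    path.write_text("\n".join(kept) + "\n", encoding="utf-8")

if __name__ == "__main__":
    targets = ([Path(arg) for arg in sys.argv[1:]]
               or sorted(Path("journals").glob("*.csv")))
    for target in targets:
        cleanup(target)
```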
