Repository containing code and results for our " " paper.
Abstract
Due to GitHub's limit on file sizes, we are unable to upload the complete results of our study here. They are available on Zenodo in the compressed folder "execution.tgz".
cov_asn_xcand.csv contains the coverage information calculated for each candidate. Each row corresponds to one candidate (a short loading example follows the column list below). The columns are the following:
- "term" : term the candidate applied in.
- "role" : role the candidate applied for.
- "field" : field the candidate applied for.
- "id" : unique ID of the application.
- "total_CV" : total number of unique publications from the candidate's CV.
- "MAG" : raw number of publications from the candidate's CV found in Microsoft Academic Graph.
- "MAG% : percentage of publications from the candidate's CV found in Microsoft Academic Graph.
- "OA" : raw number of publications from the candidate's CV found in OpenAIRE.
- "OA%" : percentage of publications from the candidate's CV found in OpenAIRE.
- "CR" : raw number of publications from the candidate's CV found in CrossRef.
- "CR%" : percentage of publications from the candidate's CV found in CrossRef.
- "MAG+OA" : raw number of publications from the candidate's CV found when Microsoft Academic Graph and OpenAIRE are combined.
- "MAG+OA%" : percentage of publications from the candidate's CV found when Microsoft Academic Graph and OpenAIRE are combined.
- "MAG+CR" : raw number of publications from the candidate's CV found when Microsoft Academic Graph and CrossRef are combined.
- "MAG+CR%" : percentage of the number from the candidate's CV in that year found when Microsoft Academic Graph and CrossRef are combined.
- "OA+CR" : raw number of publications from the candidate's CV found when OpenAIRE and CrossRef are combined.
- "OA+CR%" : percentage of publications from the candidate's CV found when OpenAIRE and CrossRef are combined.
- "Comb" : raw number of publications from the candidate's CV found when MAG, OpenAIRE and CrossRef are combined.
- "Comb%" : percentage of publications from the candidate's CV found when MAG, OpenAIRE and CrossRef are combined.
cov_asn_xdataset.csv contains the coverage information broken down by dataset. Each row corresponds to the percentage of publications found in one dataset for a single candidate (a short summary example follows the column list below). The columns are the following:
- "dataset" : dataset of interest.
- "term" : term the candidate applied in.
- "role" : role the candidate applied for.
- "SA" : scientific area the candidate applied for
- "field" : field the candidate applied for.
- "role&field" : combination of role and field the candidate applied for.
- "coverage%" : percentage of publications from the candidate's CV found in MAG, OpenAIRE, CrossRef, or through the multiple combinations of these sources.
cov_asn_xyear.csv contains the coverage information calculated for each year. Each row corresponds to one year (a short example follows the column list below). The columns are the following:
- "year" : year of publication.
- "total" : total number of unique publications published in that year from all the candidates' CVs.
- "MAG" : raw number of publications published in that year found in Microsoft Academic Graph.
- "MAG%" : percentage of the number of publications published in that year found in Microsoft Academic Graph.
- "OA" : raw number of publications published in that year found in OpenAIRE.
- "OA%" : percentage of publications published in that year found in OpenAIRE.
- "CR" : raw number of publications published in that year found in CrossRef.
- "CR%" : percentage of publications published in that year found in CrossRef.
- "MAG+OA" : raw number of publications published in that year found when Microsoft Academic Graph and OpenAIRE are combined.
- "MAG+OA%" : percentage of publications published in that year found when Microsoft Academic Graph and OpenAIRE are combined.
- "MAG+CR" : raw number of publications published in that year found when Microsoft Academic Graph and CrossRef are combined.
- "MAG+CR%" : percentage of the number of publications published in that year found when Microsoft Academic Graph and CrossRef are combined.
- "OA+CR" : raw number of publications published in that year found when OpenAIRE and CrossRef are combined.
- "OA+CR%" : percentage of publications published in that year found when OpenAIRE and CrossRef are combined.
- "Comb" : raw number of publications published in that year found when MAG, OpenAIRE and CrossRef are combined.
- "Comb%" : percentage of publications published in that year found when MAG, OpenAIRE and CrossRef are combined.
To investigate coverage in our open access datasets of interest, we create a MongoDB database using data from the Microsoft Academic Graph dump, the OpenAIRE Research Graph dump, and the Crossref Public Data File. We then query this database with the publications' metadata extracted from the candidates' CVs to assess whether those publications are present in the datasets of interest.
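For illustration only (this is not the exact code of our scripts), such a lookup might try an exact DOI match first and fall back to a text search on the title; the database, collection, and field names below are assumptions:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["coverage"]  # database name is an assumption

def find_publication(collection, doi=None, title=None):
    """Look up a publication by DOI first, then by a text search on the title."""
    if doi:
        match = collection.find_one({"doi": doi.lower()})
        if match:
            return match
    if title:
        # Relies on the textual index created during the preparation step.
        return collection.find_one({"$text": {"$search": title}})
    return None

# Hypothetical usage with a made-up DOI and title.
pub = find_publication(db["crossref"], doi="10.1000/example", title="An example title")
```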
Due to GitHub's limit on file sizes, we are unable to upload the processed versions of the dumps here. However, we provide download links to the original dumps and instructions to replicate our processing. The processed and ready-to-use dumps are available on Zenodo (link) in the folder "final".
- Download the folder "preparation" from this repo
- Download all the dumps into a folder named "originals" inside "preparation" from the following links:
- Decompress Microsoft Academic Graph's dump in a folder named "mag" inside "preparation"
- Decompress OpenAIRE's dump in a folder named "openaire" inside "preparation"
- Decompress Crossref's dump in a folder named "crossref" inside "preparation"
- Execute coverage_asn_preparation.py: this script cleans and processes all the data in the dumps and stores them as separate JSON files in a folder named "final" inside "preparation". It also imports the dumps as single collections into a MongoDB database and creates the necessary indexes on the collections. If these two additional steps are not of interest to you, comment out the calls to importing_dumps_to_db(output_dir) and create_indexes_in_db() at the end of the Python file (a conceptual sketch of the import step follows this list).
Estimated time: processing one dump took approximately 4h on our machine.
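For orientation, the import performed by importing_dumps_to_db is conceptually similar to the sketch below; the directory layout, database name, and the assumption that each processed JSON file holds a list of records are ours:

```python
import json
import os

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["coverage"]  # database name is an assumption

def import_dump(final_dir, name):
    """Insert every processed JSON file of one dump into a single collection."""
    collection = db[name]
    for filename in os.listdir(os.path.join(final_dir, name)):
        with open(os.path.join(final_dir, name, filename)) as f:
            records = json.load(f)  # assumed: each file holds a list of records
        if records:
            collection.insert_many(records)

for name in ("mag", "openaire", "crossref"):
    import_dump("preparation/final", name)
```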
- Follow the steps in section 3.1 above, OR download the folder named "preparation" from Zenodo (link).
- Execute coverage_asn_preparation.py: if there is a folder named "final" inside "preparation", the script does not process the dumps (i.e., it skips the processing function). It imports the dumps as single collections into a MongoDB database and creates the necessary indexes on the collections (a sketch of the index creation follows the timing notes below). If either of these two steps is not of interest to you, comment out either the importing_dumps_to_db or the create_indexes_in_db function call at the end of the Python file.
Estimated times:
- importing one dump into the database took approximately 1.5h on our machine
- creating a textual index on one collection took from 3 to 4 hours on our machine; ascending/descending indexes took far less time
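With pymongo, the indexes mentioned above could be created along the following lines (the indexed field names are assumptions):

```python
from pymongo import ASCENDING, TEXT, MongoClient

db = MongoClient("mongodb://localhost:27017")["coverage"]  # database name is an assumption

for name in ("mag", "openaire", "crossref"):
    # A text index on the title supports $text title searches (slow to build),
    # while a plain ascending index on the DOI makes exact lookups fast.
    db[name].create_index([("title", TEXT)])
    db[name].create_index([("doi", ASCENDING)])
```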
Once our MongoDB database is set up, we can query it to search for the candidates' publications and assess whether they are present in the different datasets. To set up your database, follow the steps in section 3 above. Due to GitHub's limit on file sizes, we are unable to upload the folder with all the candidates' CVs from which we extract the publications' metadata used to query the database. However, the CVs are available on Zenodo in the folder "execution/cand_cvs".
- Download the folder "execution" from this repo
- Execute coverage_asn_execution.py: this script extracts the publications' metadata from each candidate's CV, searches for the publications in the Microsoft Academic Graph, OpenAIRE, and Crossref collections of the database, and calculates the coverage of the candidate's publications by these datasets. Specifically, it stores the results in several files in the folder named "results" inside the folder "execution":
- In meta_dict.json, it stores the publications' metadata extracted from the CVs.
- In wo_info.json, it stores information about missing CVs, CVs with no publications, publications incorrectly parsed from the PDFs, empty publications, and publications missing both title and DOI.
- For each candidate, it stores the publication data found in the database and the coverage data in a separate JSON file in "results". There, the candidates' JSON files are organized by the term, role, and field they applied for in the 2016-18 NSQ session.
- Finally, it stores the essential results in the three CSV files in the folder "execution" described above: cov_asn_xcand.csv, cov_asn_xdataset.csv, and cov_asn_xyear.csv.
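To make the "combined" figures in those files concrete: a natural way to obtain them (our reading of the column definitions above) is as unions of the per-dataset hits, as in this toy example with made-up publication IDs:

```python
# Publications (keyed, e.g., by DOI) found in each dataset -- toy data.
mag = {"p1", "p2", "p3"}
oa = {"p2", "p4"}
cr = {"p3", "p4", "p5"}
total_cv = 6  # total number of unique publications in the candidate's CV

comb = mag | oa | cr  # "Comb": union of all three sources
print(len(mag | oa), len(mag | cr), len(oa | cr), len(comb))  # 4 5 4 5
print(round(100 * len(comb) / total_cv, 1))  # "Comb%" -> 83.3
```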