arxiv-track-author-preferences

Further analysis and explanations can be found in this summary post.

Summary

Motivation

What do scientific authorship trends tell us about the importance of different scientific fields? Can this information help a young scientist choose their field? Measuring how authors switch between fields tells us how scientists rank the relative importance of those fields.

Methods

Here we take the dataset of all ArXiv papers ever published, and extract the way that authors transition between scientific fields over time.

Metrics

The author count within each field in each year
The author transition counts between fields during each year
The author transition rate between fields for a given year

Example data for the major fields on ArXiv.

Files

arxiv-metadata-extraction.ipynb
- This notebook takes the raw arXiv data (downloaded from https://www.kaggle.com/datasets/Cornell-University/arxiv) and extracts how authors publish across fields over time (see the notebook for details). The results are saved as .json files (e.g. author_counts.json) for quick later analysis.
arxiv-metadata-plotting.ipynb
- This notebook reads the processed .json files and displays the results.
basic_utils.py
- This file contains a few helper functions, and a list of all ArXiv categories.

How to use

Explore how authors transition between your own ArXiv fields

To explore the already extracted data (author counts and transitions over time), for example to see how your own field changes over time, use arxiv-metadata-plotting.ipynb. The notebook should run as-is once you download the repository; you can then change which fields it looks at.
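
If you would rather look at the processed files outside the notebook first, the sketch below prints a short outline of a parsed .json file. The layout of author_counts.json is not described in this README, so the helper assumes nothing about it and only reports types, sizes and one sample entry per level. The names `describe` and `load_and_describe` are hypothetical and not part of basic_utils.py.

```python
import json


def describe(obj, max_depth=3, depth=0):
    """Return outline lines for a parsed JSON value: type, size, one sample entry."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict, {len(obj)} key(s)"]
        if depth < max_depth and obj:
            key = next(iter(obj))  # first key only, as a sample
            lines.append(f"{pad}  e.g. key {key!r}:")
            lines.extend(describe(obj[key], max_depth, depth + 2))
        return lines
    if isinstance(obj, list):
        lines = [f"{pad}list, {len(obj)} item(s)"]
        if depth < max_depth and obj:
            lines.extend(describe(obj[0], max_depth, depth + 1))
        return lines
    return [f"{pad}{type(obj).__name__}: {obj!r}"]


def load_and_describe(path):
    """Load one processed file (for example author_counts.json) and print its outline."""
    with open(path, encoding="utf-8") as handle:
        print("\n".join(describe(json.load(handle))))
```

Calling `load_and_describe("author_counts.json")` from the repository folder should be enough to see how the file is nested before changing the plotting notebook.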

Modify the base data extraction

To double-check the data extraction, or to see exactly which quantities are written to the .json files, use arxiv-metadata-extraction.ipynb. This requires additional downloads, including the raw arXiv data from Kaggle, set up in the folders the notebook expects.
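
As a rough orientation before opening the notebook, here is a minimal sketch of streaming the raw Kaggle snapshot. It assumes the file is JSON Lines (one paper per line) with the fields `authors_parsed`, `categories` and `versions`, and it takes the year from the first version's `created` date. The notebook may use different fields or a different date, and the function name `iter_papers` is hypothetical.

```python
import json
from email.utils import parsedate_to_datetime


def iter_papers(lines):
    """Yield (year, author_names, categories) for each JSON line of the snapshot.

    `lines` can be an open file handle or any iterable of strings.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: each authors_parsed entry starts with [last, first, ...].
        names = [
            " ".join(part for part in (first, last) if part)
            for last, first, *_ in record["authors_parsed"]
        ]
        # Assumption: categories is one space-separated string.
        categories = record["categories"].split()
        # Assumption: the first version's date is an RFC 2822 style string,
        # e.g. "Mon, 2 Apr 2007 19:18:42 GMT".
        year = parsedate_to_datetime(record["versions"][0]["created"]).year
        yield year, names, categories
```

To use it, open the downloaded snapshot file and pass the handle to `iter_papers`; reading line by line avoids loading the whole dataset into memory.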

More explanation of metrics

The main metrics are:

The author count within each field in each year
- Each unique name is assumed to correspond to a single author (an imperfect assumption).
- An author who publishes in multiple categories is counted fractionally in each field, with weight 1/(the number of distinct fields they published in during that year).
The author transition counts between fields during each year
- For each unique author name, I take the difference between that author's categories in the current year and in the previous year.
- An author's losses in some fields and gains in others are counted as transitions from the fields with losses to the fields with gains. The transitions are spread across all loss/gain pairs: each losing field contributes in proportion to its share of the total loss, and each gaining field receives in proportion to its share of the total gain.
The author transition rate between fields for a given year
- This is the total transition count divided by the size (author count) of the source field.
- In other words, it is something like the probability, conditioned on being an author in a specific field, that you transition from that field to another field during that year.
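
The three metrics above can be sketched in a few lines. This is an illustration of the description, not the notebook's code, and it rests on some assumptions: an author's fields for a year are given as a set, the year-over-year difference is taken on the fractional weights, authors who appear in only one of the two years produce no transitions, and the rate is normalized by the source field's size in the previous year. All function names are hypothetical.

```python
from collections import defaultdict


def field_weights(fields):
    """Fractional weight 1/n for each of the n fields an author published in."""
    return {f: 1.0 / len(fields) for f in fields} if fields else {}


def fractional_counts(fields_by_author):
    """Author count per field for one year; input maps name -> set of fields."""
    counts = defaultdict(float)
    for fields in fields_by_author.values():
        for field, weight in field_weights(fields).items():
            counts[field] += weight
    return dict(counts)


def author_transitions(prev_fields, curr_fields):
    """Map (source, target) -> amount moved by one author between two years."""
    prev_w, curr_w = field_weights(prev_fields), field_weights(curr_fields)
    delta = {f: curr_w.get(f, 0.0) - prev_w.get(f, 0.0)
             for f in set(prev_w) | set(curr_w)}
    losses = {f: -d for f, d in delta.items() if d < 0}
    gains = {f: d for f, d in delta.items() if d > 0}
    total_gain = sum(gains.values())
    if not losses or total_gain == 0:
        return {}  # new, departed, or unchanged author (assumption: no transition)
    # Each loss is split across gaining fields in proportion to their gain.
    return {(src, dst): loss * gain / total_gain
            for src, loss in losses.items() for dst, gain in gains.items()}


def yearly_transitions(prev_year, curr_year):
    """Sum per-author transitions; both inputs map name -> set of fields."""
    totals = defaultdict(float)
    for name, prev_fields in prev_year.items():
        moved = author_transitions(prev_fields, curr_year.get(name, set()))
        for pair, amount in moved.items():
            totals[pair] += amount
    return dict(totals)


def transition_rate(transition_count, source_size):
    """Transition count divided by the source field's author count."""
    return transition_count / source_size if source_size else 0.0
```

For example, an author who published only in physics last year and in both physics and cs this year goes from weight 1 in physics to weight 1/2 in each field, which this sketch records as a transition of 1/2 from physics to cs.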

Caveats of the analysis

This analysis isn't perfect. Here are a few potential problems that could be addressed.

ArXiv isn't complete: This is only ArXiv data. Ideally I would want to include more databases, and characterize the paper categories in a more general way.
Non-unique author names: I use each author name as a unique string to identify an author, but this isn't correct. Some people share a name, and I therefore count them as the same person. To limit this, I don't count any author with more than 100 publications. This should help, but it distorts the total author count. In the end it's a tradeoff, and it would be better to use something like ORCID identifiers.
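
The publication cap described above can be sketched as follows. This is a minimal illustration, assuming each paper is available as a list of author-name strings and that the cap applies to the total number of papers carrying a given name; the notebook may count publications differently. The function name `drop_prolific_names` is hypothetical.

```python
from collections import Counter

MAX_PAPERS = 100  # threshold quoted in the caveat above


def drop_prolific_names(papers, max_papers=MAX_PAPERS):
    """Remove names that appear on more than `max_papers` papers.

    `papers` is an iterable of author-name lists, one list per paper.
    Returns the papers with the over-threshold names filtered out.
    """
    papers = [list(authors) for authors in papers]
    # Count each name at most once per paper.
    paper_counts = Counter(name for authors in papers for name in set(authors))
    keep = {name for name, n in paper_counts.items() if n <= max_papers}
    return [[name for name in authors if name in keep] for authors in papers]
```

The cut removes likely name collisions, but it also removes any genuinely prolific author, which is the tradeoff noted above.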

tomhartke/arxiv-track-author-changes
