Further analysis and explanations can be found in this summary post.
What do scientific authorship trends tell us about the importance of different scientific fields? Can this information help a young scientist choose their field? Measuring how authors switch between fields gives an indication of how scientists rank the relative importance of those fields.
Here we take the dataset of all ArXiv papers ever published and extract how authors transition between scientific fields over time. The extracted quantities are:
- The author count within each field in each year
- The author transition counts between fields during each year
- The author transition rate between fields for a given year.
- arxiv-metadata-extraction.ipynb
  - This notebook takes the raw ArXiv data (downloaded from https://www.kaggle.com/datasets/Cornell-University/arxiv) and extracts how authors publish (see the notebook for details). Data are saved as .json files (e.g. author_counts.json) for quick future analysis.
- arxiv-metadata-plotting.ipynb
  - This notebook reads in the processed .json files and displays the results.
- basic_utils.py
  - This file contains a few helper functions and a list of all ArXiv categories.
If you want to play with the data (the already extracted author counts and transitions over time), for example to see how your own field changes over time, just use arxiv-metadata-plotting.ipynb. This notebook should run as-is once you download the repository. You can then change which fields you look at, etc.
If you want to double-check the data extraction and see exactly which quantities are written to the .json files, use arxiv-metadata-extraction.ipynb. Doing this requires you to have a few other things downloaded and set up in various folders.
The main metrics are:
- The author count within each field in each year
  - Each unique name is assumed to correspond to a single author (not perfect).
  - An author who publishes in multiple categories is counted as being fractionally located in each field, with weight 1/(the number of fields in the union of all fields they published in during that year).
- The author transition counts between fields during each year
  - For each unique author name, I take the difference between the author's categories in the current year and in the previous year.
  - For a given author, losses in some categories and gains in others are counted as transitions from the fields with losses to the fields with gains. These transitions are weighted proportionally: each losing field's share of the total loss is spread across the gaining fields according to each one's share of the total gain.
- The author transition rate between fields for a given year
  - This is the total transition count divided by the size of the source field.
  - In other words, it is something like the probability, conditioned on being an author in a specific field, that you transition from that field to another field during that year.
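The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the extraction notebook: the function names, the input layout (a dict mapping author name to the list of categories they published in that year), and the example authors are all hypothetical. Two choices are assumptions on my part: authors who appear in only one of the two years contribute no transitions, and the rate is normalized by the source field's size in the previous year.

```python
from collections import defaultdict


def fractional_weights(fields):
    """Spread one author evenly over the union of fields they published in."""
    unique = set(fields)
    return {f: 1.0 / len(unique) for f in unique}


def field_sizes(author_fields):
    """Fractional author count per field for a single year."""
    sizes = defaultdict(float)
    for fields in author_fields.values():
        for field, weight in fractional_weights(fields).items():
            sizes[field] += weight
    return dict(sizes)


def author_transitions(prev_fields, curr_fields):
    """Transitions for one author, from losing fields to gaining fields."""
    prev_w = fractional_weights(prev_fields)
    curr_w = fractional_weights(curr_fields)
    delta = {
        f: curr_w.get(f, 0.0) - prev_w.get(f, 0.0)
        for f in set(prev_w) | set(curr_w)
    }
    losses = {f: -d for f, d in delta.items() if d < 0}
    gains = {f: d for f, d in delta.items() if d > 0}
    total_gain = sum(gains.values())
    if total_gain == 0:
        return {}
    # Each loss is split across gaining fields by their share of the total gain.
    return {
        (src, dst): loss * gain / total_gain
        for src, loss in losses.items()
        for dst, gain in gains.items()
    }


def transition_rates(prev_year, curr_year):
    """Transition counts divided by the source field's size.

    Assumption: only authors present in both years are considered, and the
    source field size is taken from the previous year.
    """
    counts = defaultdict(float)
    for name in prev_year.keys() & curr_year.keys():
        pairs = author_transitions(prev_year[name], curr_year[name])
        for pair, weight in pairs.items():
            counts[pair] += weight
    sizes = field_sizes(prev_year)
    return {(src, dst): n / sizes[src] for (src, dst), n in counts.items()}


# Hypothetical toy data: two authors over two consecutive years.
prev_year = {"A": ["hep-th"], "B": ["hep-th", "cs.LG"]}
curr_year = {"A": ["cs.LG"], "B": ["cs.LG"]}

print(field_sizes(prev_year)["hep-th"])        # 1.5
print(transition_rates(prev_year, curr_year))  # {('hep-th', 'cs.LG'): 1.0}
```

In the toy data, author A moves fully from hep-th to cs.LG (one transition), and author B drops their half share of hep-th (half a transition). The 1.5 transitions divided by the previous-year hep-th size of 1.5 give a rate of 1.0.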
This isn't perfect. Here are a few potential problems that could be addressed.
- ArXiv isn't complete: This is only ArXiv data. Ideally I would want to include more databases, and characterize the paper categories in a more general way.
- Non-unique author names: I use each author name as a unique string to identify an author, but this isn't correct. Some people share a name, so I count them as the same person. To reduce this effect, I don't count any authors with more than 100 publications. This should help, but it also distorts the total author count. In the end it's a tradeoff, and it would be better to use something like ORCID identifiers.
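The publication cutoff for ambiguous names could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: the helper names and the input layout (one list of author-name strings per paper) are hypothetical. Only the threshold of 100 comes from the description above.

```python
from collections import Counter

MAX_PUBLICATIONS = 100  # threshold stated in the text above


def prolific_names(papers, max_publications=MAX_PUBLICATIONS):
    """Names on more than `max_publications` papers (likely merged people)."""
    counts = Counter(name for authors in papers for name in set(authors))
    return {name for name, n in counts.items() if n > max_publications}


def drop_prolific_authors(papers, max_publications=MAX_PUBLICATIONS):
    """Remove over-threshold names from every paper's author list."""
    excluded = prolific_names(papers, max_publications)
    return [[a for a in authors if a not in excluded] for authors in papers]


# Hypothetical data: "J. Smith" appears on 101 papers, "A. Rare" on one.
papers = [["J. Smith", "A. Rare"]] + [["J. Smith"] for _ in range(100)]
filtered = drop_prolific_authors(papers)
print(filtered[0])  # ['A. Rare']
```

Note that the cutoff removes a name entirely, so a genuinely prolific single author is dropped along with merged namesakes. That is the tradeoff mentioned above.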