Measure how authors change fields on the preprint repository ArXiv to understand how different fields are valued.

tomhartke/arxiv-track-author-changes

arxiv-track-author-preferences

Further analysis and explanations can be found in this summary post.

Summary

Motivation

What do scientific authorship trends tell us about the importance of different scientific fields? Can this information help a young scientist choose their field? Measuring how authors switch between fields gives one view of how scientists rank the relative importance of those fields.

Methods

We take the dataset of all papers ever posted on ArXiv and extract how authors transition between scientific fields over time.

Metrics

  • The author count within each field in each year
  • The author transition counts between fields during each year
  • The author transition rate between fields for a given year.

Example data for the major fields on ArXiv.

Files

  1. arxiv-metadata-extraction.ipynb
  2. arxiv-metadata-plotting.ipynb
    • This file reads in the processed .json files to display information.
  3. basic_utils.py
    • This file contains a few helper functions, and a list of all ArXiv categories.

How to use

Explore how authors transition between your own ArXiv fields

If you want to explore the data (the already extracted author counts and transitions over time), for example to see how your own field changes over time, use arxiv-metadata-plotting.ipynb. This notebook should run as-is once you download the repository, and you can then change which fields you look at.

Modify the base data extraction

If you want to double-check the data extraction, and exactly which quantities are written to the .json files, use arxiv-metadata-extraction.ipynb. This requires a few other things to be downloaded and set up in various folders.

More explanation of metrics

The main metrics are:

  • The author count within each field in each year
    • Each unique name is assumed to correspond to a single author (an imperfect assumption).
    • An author who publishes in multiple categories is counted fractionally in each field, with weight 1/(the number of fields in the union of all fields they published in during that year).
  • The author transition counts between fields during each year
    • For each unique author name, I take the difference between the author's categories in the current year and in the previous year.
    • For a given author, losses in some categories and gains in others are counted as transitions from the fields with losses to the fields with gains. The flow is split proportionally: each losing field contributes according to its share of the total loss, and each gaining field receives according to its share of the total gain.
  • The author transition rate between fields for a given year.
    • This is the total transition count divided by the size of the source field.
    • In other words, it is roughly the probability, conditioned on being an author in a specific field, of transitioning from that field to another field during that year.
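
The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebooks: the function names and the input layout (a dict mapping each author name to the list of fields they published in that year) are hypothetical. It also assumes that only authors present in both years contribute transitions, and that the rate is normalized by the previous year's author count in the source field; the description above does not pin down either choice.

```python
from collections import defaultdict


def field_weights(fields):
    """Fractional membership: weight 1/n in each of the n distinct fields."""
    unique = set(fields)
    return {f: 1.0 / len(unique) for f in unique}


def author_transitions(prev_fields, curr_fields):
    """Split one author's year-to-year change into (source, target) flows."""
    prev_w = field_weights(prev_fields)
    curr_w = field_weights(curr_fields)
    diff = {f: curr_w.get(f, 0.0) - prev_w.get(f, 0.0)
            for f in set(prev_w) | set(curr_w)}
    losses = {f: -d for f, d in diff.items() if d < 0}
    gains = {f: d for f, d in diff.items() if d > 0}
    total_gain = sum(gains.values())
    if total_gain == 0:
        return {}
    # Each loss is spread over the gaining fields in proportion to their gain.
    return {(src, dst): loss * gain / total_gain
            for src, loss in losses.items()
            for dst, gain in gains.items()}


def yearly_metrics(prev_year, curr_year):
    """Return (author counts, transition counts, transition rates)."""
    counts = defaultdict(float)
    for fields in prev_year.values():
        for f, w in field_weights(fields).items():
            counts[f] += w
    transitions = defaultdict(float)
    # Assumption: only names seen in both years produce transitions.
    for author in prev_year.keys() & curr_year.keys():
        flows = author_transitions(prev_year[author], curr_year[author])
        for pair, flow in flows.items():
            transitions[pair] += flow
    # Assumption: normalize by the previous-year size of the source field.
    rates = {(s, d): t / counts[s] for (s, d), t in transitions.items()}
    return dict(counts), dict(transitions), rates


# Toy input with made-up author names.
prev_year = {"A": ["hep-th"], "B": ["hep-th"], "C": ["cs.LG"]}
curr_year = {"A": ["cs.LG"], "B": ["hep-th"], "C": ["cs.LG"]}
counts, transitions, rates = yearly_metrics(prev_year, curr_year)
```

In this toy input, one of the two hep-th authors moves entirely to cs.LG, so the hep-th to cs.LG transition count is 1 and the corresponding rate is 0.5.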

Caveats of the analysis

This isn't perfect. Here are a few potential problems that could be solved.

  • ArXiv isn't complete: This is only ArXiv data. Ideally I would include more databases and characterize the paper categories in a more general way.
  • Non-unique author names: I use each author name as a unique string to identify an author, but this isn't correct. Some people share a name, and I will therefore count them as the same person. To reduce this effect, I don't count any authors with more than 100 publications. This should help, but it also distorts the total author count. In the end it's a tradeoff, and it would be better to use something like ORCID identifiers.
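
The 100-publication cutoff for ambiguous names could look something like the sketch below. The record layout (a list of dicts with an "authors" list) and the function name are hypothetical, chosen only for illustration; the extraction notebook may organize the data differently.

```python
from collections import Counter


def drop_prolific_names(papers, max_papers=100):
    """Remove author names that appear on more than max_papers papers."""
    # Count each name once per paper.
    paper_counts = Counter(
        name for paper in papers for name in set(paper["authors"])
    )
    kept = {name for name, n in paper_counts.items() if n <= max_papers}
    return [
        {**paper, "authors": [a for a in paper["authors"] if a in kept]}
        for paper in papers
    ]


# Toy input with a small cutoff so the effect is visible.
papers = [{"authors": ["X", "Y"]}] * 3 + [{"authors": ["Y"]}]
filtered = drop_prolific_names(papers, max_papers=3)
```

Here "Y" appears on four papers and is dropped, while "X" appears on three and is kept.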
