- Tagline: (i.e: Analyze political botnet activity on Twitter and develop effective counter-measures)
- Date: October 2016
- Category: Applied Research
- Author(s): Graham Mueller
- Brainstorming Phase: Currently collecting data for baseline models.
Develop a model capable of reliably flagging biomedical articles (appearing on bioRxiv or in biomedical scientific publications) that may be at risk of retraction. Such articles would then be carefully reviewed by peers in the community.
Although the retraction of a scientific article in the biomedical literature is still a rare event, it is getting increasingly frequent here and here.
- Retractions reflect error, misconduct, and fraud, which can significantly affect the scientific community and undermine the trust that the public puts in science.
- Detecting articles at risk of retraction could help focus the attention of efforts like Retraction Watch and other post-publication peer review groups.
- In turn, if the detection of problematic articles becomes more effective, the incentive for fraud is greatly diminished and the penalty for errors is increased, which should improve the overall quality and reliability of the biomedical literature.
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).
The Open Access Subset (OA Subset) is the largest collection of articles available for text mining via PMC. Articles in the OA Subset are still protected by copyright in most cases, but are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.
To download a collection in PMC for text mining, you must use the designated services (usually the PMC FTP service).
https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
Once you have downloaded the data, extract the directories and place them in the data folder. Please do not commit the ~45GB worth of xml files to this repo!
From the root directory, you can create the text file which lists the path of each journal article using the following:
find . -iname '*.nxml' > files-all.txt
The spark script uses the path of each article as input, loads the xml and parse out the article-type, which identifies whether an article was retracted or not. To run this script use:
spark-submit scripts/spark_retractions.py
Python 2.7+ lxml
Apache Spark is used for parsing the xml documents. http://spark.apache.org
- Provide a starting point readme file and status of the current project for new researchers. These projects can take months if not longer sometimes to complete, such information will help onboarding faster.
- Guideline on how to edit-add new resources to this project, if there is a specific requirement, mention them. i.e:
- Please create a branch and do a pull-request when adding to this example project.
- Open Issues if something is not clear in the readme, or found linguistic/ grammar mistakes.
- A Comprehensive Analysis of Articles Retracted Between 2004 and 2013 from Biomedical Literature – A Call for Reforms
- Why Has the Number of Scientific Retractions Increased?
- Why and how do journals retract articles? An analysis of Medline retractions 1988-2008
PS: Last few notes:
- Be Nice & Be Respectful.
- Value other people's work, please reference them. Don't just copy & paste what you find elsewhere when it comes to sharing information.
- Give constructive criticism, as in if you see something not working, or wrong, suggest an attempt to tackle resolving the issue.
- Please Ask Questions: This is one big attempt to open up opportunity for everyone being able to contribute, if they can add value towards these research topics.
- Also, keep in mind that most of the researchers that are opening these projects might have full-time work/research. If there is a specific question, try opening an issue, use the given open communication channels rather than direct contact.