Getting overrun by the amount to do but optimistic
- Three high-quality volunteers have joined the project. This is very exciting
- My Shuttleworth colleagues are doing amazing things.
Thank you so much! I have copied in Clyde (@deadlyvices) who is also helping. (Clyde , perhaps you can extract what I write onto the repo)
The goal is to make all readable scientific papers available to everyone in the expectation that solutions to COVID-19 epidemic already exist in the literature. This was true for Ebola - a paper in 1983 found antibodies in Liberia.If we had indexed all the literature on simple terms "Ebola", "Marburg" and collated that with a list of countries "Sierra Leone", "Liberia" then we can inform the world of risks in those countries. We intend to start with the Royal Society - perhaps 20,000 papers and extract all relevant annotations. We work very closely with Wikidata so objects are precisely defined. In the first instance we work with preprint servers (the PDF is MUCH better than what publishers put out. It's single column Unicode normally.
Unfortunately most of the papers are in PDF , so having a good reader is valuable. Again, unfortunately, there is no standard format and some papers are horrible . These problems are less than they used to be but this is just to warn you it's not trivial. Translating OCR into individual PNGs for each character. non-Unicode; PDFs from LaTeX or publisher tools often use the arbitrary codepoints (and I have compiled dictionaries of these). Styles are important for maths and bio and for sectioning, but "bold" is often represented by printing the same character twice or an unusual colourMap. Italics might be created by affine transformations.
I also use PDFBox to extract SVG. Many diagrams contain the most valuable information and so I have tools to extract vectors and turn them into graphics objects (axes, scales, etc) and in the best cases we can turn the PDF directly into a CSV.
I have a toolkit pdf2svg2 (which overrides PDFBox PageDrawer) which is the centre of this. It "works" but it's fragile. Since Publisher-PDF doesn't really have a spec the code reacts to the articles that I process. I think the subclassing is nearly correct but there are bugs.
Probably the most valuable thing to do is to see what AMI (https://github.com/petermr/ami3) does. There are lots of tests but some are broken (it's very hard to build tests for a non-standard). And often upgrades of OS causes tests to fail (e.g. on numeric values, graphics, file order...)
It's mainly Java 7 but not always generic, a little Java 8 streams where I have been learning. The user base is small so I don't mind moving to later Java's . It's had a 25-year history - a small amount of the code was Fortran -> f2c -> Java. Some of the Java libraries were written in 1994 - before Java was released! A lot can be replaced by more modern libraries (e.g. NIO).
The commandline is a nice tool picocli - about 2 years old - and the workflow is modular - with 20+ toplevel commands. So there's an ami-pdf command which uses PDFBox to turn PDFs into a mixture of *.SVG and *.PNG. There are many debugs which could be valuable e.g. timing - if a document has a 100K PNGs or a 100K we need a switch,
The ethos is that we are all in this together. The code is Apache2 . At present I am the only author but we need a contributor agreement so that people are honoured. I ran a non-profit contentmine.org but it doesn't own the code.
There are lots of fun things to do, lots of difficult and tedious ones and we have about 5% of the resource that's actually required.
You have been warned!