This repository has been archived by the owner on Mar 3, 2020. It is now read-only.

bioschemas-scraper

Web scraper to harvest data items using Bioschemas specifications markup. This project is based on Scrapy, a Python framework for crawling web resources.

Dependencies

You will need pip to install the script requirements; the pip documentation describes how to install it on your operating system. The safest way to install the requirements without affecting any other Python project is to use virtualenv.

You will also need to install Scrapy; the installation steps are described in the Scrapy documentation.

Finally, you will need an Elasticsearch instance running so that the crawled records can be saved.
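
Before running the scraper you may want to confirm that Elasticsearch is reachable. The check below is a minimal sketch, assuming a default local instance at http://localhost:9200; it is not part of this repository, so adjust the URL to match your own deployment.

# Quick connectivity check for Elasticsearch (illustrative only).
# Assumes a default local instance at http://localhost:9200.
import json
import urllib.request

ES_URL = "http://localhost:9200"  # hypothetical default; change to match your setup

try:
    with urllib.request.urlopen(ES_URL, timeout=5) as response:
        info = json.load(response)
        print("Elasticsearch is up, version:", info.get("version", {}).get("number", "unknown"))
except OSError as exc:
    print("Could not reach Elasticsearch at", ES_URL, "-", exc)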

Installation

git clone https://github.com/BioSchemas/bioschemas-scraper.git
cd bioschemas-scraper
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

After you finish running the script, deactivate your virtual environment:

deactivate

Configuration

To configure the Elasticsearch connection used by the scraper, modify the last lines of the file bioschemas_scraper/settings.py. By default this scraper is set to crawl the TeSS Events website. If you want to create a new spider for a different website, take a look at bioschemas_scraper/spiders/bioschemas_spider_xml.py. If you want to add additional processing to the crawled records, check the pipelines defined in bioschemas_scraper/pipelines; for now there is only one pipeline, which takes every crawled Bioschemas object and validates it against the Bioschemas Event specification, available as a JSON Schema file at bioschemas_scraper/utils/schemas/Event.json. The validation logic is in bioschemas_scraper/utils/validators.py.
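
As an illustration of what that validation step does, here is a minimal sketch that loads the Event JSON Schema and validates a crawled record with the jsonschema library; the actual code in bioschemas_scraper/utils/validators.py may differ, and the sample record below is invented.

# Sketch of validating a crawled record against the Bioschemas Event JSON Schema.
# Illustrative only: the real logic lives in bioschemas_scraper/utils/validators.py,
# and the sample record is a made-up example.
import json
import jsonschema

with open("bioschemas_scraper/utils/schemas/Event.json") as schema_file:
    event_schema = json.load(schema_file)

record = {"name": "Example training event", "startDate": "2017-06-01"}  # hypothetical crawled item

try:
    jsonschema.validate(instance=record, schema=event_schema)
    print("Record is a valid Bioschemas Event")
except jsonschema.ValidationError as error:
    print("Record failed validation:", error.message)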

Running

In the root of the repo run:

scrapy crawl https://tess.elixir-europe.org/events

Supported formats

  • Microdata
