This repository contains web data scrapers written in the Python and R programming languages. The scripts were developed with Python 3.7 and R 3.6.1.
Summary
The world wide web is full of data that are of great interest to scientists and businesses alike. Firms, public institutions, and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behavior. But how do you efficiently collect data from the Internet; retrieve information from social networks, search engines, and dynamic web pages; tap web services; and, finally, process the collected data with statistical software? This repository answers these questions with effective, working solutions.
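As a minimal illustration of the kind of task these scrapers perform, the sketch below extracts every hyperlink from an HTML page using only the Python standard library. The HTML document here is a made-up stand-in for a page you would normally fetch over the network; the repository's actual scrapers rely on the packages listed in requirements.txt.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Stand-in for a page fetched with urllib.request or a similar client.
html_doc = """
<html><body>
  <a href="https://example.com/data.csv">data</a>
  <a href="https://example.com/about">about</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(html_doc)
print(parser.links)
```

Running this prints the two href values in document order; the same pattern scales to real pages once the HTML is downloaded.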
Please read the requirements.txt file, which lists the packages used in this repository; you can install them all at once with pip install -r requirements.txt.
Helpful commands
Execute the following commands in a command prompt window:
- To see the list of installed Python packages:
> pip list
- To see the list of outdated Python packages:
> pip list --outdated
- To upgrade a particular Python package (substitute [package] with the package name):
> pip install [package] --upgrade
- To automatically generate the requirements.txt file, open a terminal window in the repository and run (see this helpful SO post on the same):
> pip3 freeze > requirements.txt
- To generate the repository navigation structure, open a terminal window in the repository and run (see this SO post):
> tree /f
├───data
├───figures
├───resources
│   └───XPATH_Tutorials
└───scripts
    ├───python
    │   └───scrapy_based_scrapers
    │       ├───tutorial
    │       └───web_crawl_automation
    └───R
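The pip commands listed above can also be driven from Python itself, which is convenient when a script needs to check its own environment. A small sketch, assuming pip is available for the running interpreter:

```python
import json
import subprocess
import sys

# Run "pip list" for the current interpreter and parse its JSON output.
result = subprocess.run(
    [sys.executable, "-m", "pip", "list", "--format=json"],
    capture_output=True,
    text=True,
    check=True,
)
packages = {pkg["name"]: pkg["version"] for pkg in json.loads(result.stdout)}
print(f"{len(packages)} packages installed")
```

Using sys.executable ensures the command inspects the same interpreter the script is running under, which matters when multiple Python installations or virtual environments are present.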
If you'd like to contact me regarding bugs, questions, or general consulting, feel free to drop me a line at [email protected].
If this project helps you save development time, you can buy me a cup of coffee :)