This repository demonstrates two ways of crawling webpages from the New York Times archived news site. The New York Times does provide official APIs (the Times APIs) to respond to request calls, but they turned out not to satisfy my specific needs. Instead, I used two method pairs -- Selenium with Beautiful Soup, and Scrapy with JMESPath -- to collect and parse the raw webpages.
The picture above is from https://developer.nytimes.com/.
The goal is to collect financial news (broadly speaking) from the New York Times, organized by company. A sample webpage looks like this:
The desired fields are the date, headline, and summary of each article; sample data are given in the two txt files microsoft.txt and general-motors.txt.
After successfully installing the Scrapy package, open a terminal and go to the directory where the Scrapy project is stored. Run
$ scrapy crawl gmbot
The spider should start working. See the official reference documentation at https://docs.scrapy.org/en/latest/topics/commands.html.
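For concreteness, here is a minimal sketch of what a spider like gmbot might look like. It is not the repository's actual code: the URL, the JSON payload shape, and the JMESPath expression are illustrative assumptions.

```python
import jmespath
import scrapy


class GMBot(scrapy.Spider):
    # "gmbot" matches the name used in `scrapy crawl gmbot` above.
    name = "gmbot"
    # Placeholder URL -- replace with the actual endpoint that returns
    # the article listing as JSON.
    start_urls = ["https://example.com/nyt-search.json?q=general+motors"]

    def parse(self, response):
        data = response.json()  # assumes a JSON response body
        # Hypothetical field names: each record is assumed to carry
        # pub_date, headline.main, and summary.
        records = jmespath.search(
            "results[].{date: pub_date, headline: headline.main, summary: summary}",
            data,
        )
        for record in records or []:
            yield record  # Scrapy collects each dict as one item
```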
Note that both Selenium and Scrapy are able to handle infinite-scrolling pages. An essential difference is that with Selenium and BeautifulSoup, crawling and parsing are two separate steps that execute one after the other: not until we reach the end of the infinite-scrolling webpage do we start parsing the HTML file and extracting the desired information. As a result, the risk of failure in the middle of a crawling task is high (due to an unstable internet connection, for example), and after a failure we have to start over from the very beginning. Scrapy and JMESPath, by contrast, crawl and parse the webpage piece by piece, and Scrapy automatically reports where it got stuck, so we know exactly where to restart.
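The Selenium pattern described above can be sketched as follows (again, not the repository's actual code): scroll until the page height stops growing, then hand the accumulated HTML to BeautifulSoup. The URL and the two-second wait are illustrative assumptions.

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.nytimes.com/search?query=microsoft")  # placeholder URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom and wait for the next batch of results to load.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # page height stopped growing: we reached the end
    last_height = new_height

# Parsing only begins after the entire page has loaded -- the weakness
# discussed above: a failure mid-crawl loses all progress.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
```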
I hope my code can at least offer a hint toward solving some of your problems. I believe there are many potential improvements to the provided code that would avoid the problems mentioned above. Any advice is more than welcome. Thank you!