This is a simple Python script that scrapes book data from Books to Scrape, a demo website built for practicing web scraping. The scraper collects details about each book, including title, price, rating, availability, and cover image URL, and stores the data in a CSV file.
- Scrapes book details such as titles, prices, ratings, availability, and cover image URLs.
- Supports scraping multiple pages.
- Saves the scraped data to a CSV file (`books_data.csv`).
Follow these steps to get the project up and running:
- Clone the repository:

  ```bash
  git clone https://github.com/RyanGA09/books-to-scrape-scraping-v1.git
  ```

- Navigate to the project directory:

  ```bash
  cd books-to-scrape-scraping-v1
  ```

- Create a virtual environment:

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment:

  - On Linux/macOS:

    ```bash
    source venv/bin/activate
    ```

  - On Windows:

    ```bash
    venv\Scripts\activate
    ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
You can run the scraper using the Python script or interactively via Jupyter Notebook. The script scrapes book data and saves it to a CSV file on your local machine.
Run the script to start scraping book details:

```bash
python WebScraping.py
```

This will:

- Scrape book details (title, price, rating, etc.) from the catalog.
- Save the data to `books_data.csv` in the current directory.

You can also run the scraper interactively via Jupyter Notebook for a more hands-on approach.

- Start Jupyter Notebook:

  ```bash
  jupyter notebook
  ```

- Open `WebScrapingExperiment.ipynb` in the Jupyter interface and run the cells sequentially. The notebook lets you run the scraping code in small chunks, which is helpful for debugging and for learning how scraping works step by step.
Note: If you're using Visual Studio Code (VSCode), PyCharm, or any other external IDE, you can open the notebook file (`WebScrapingExperiment.ipynb`) directly in your IDE and run the code without launching Jupyter in a browser.
- `scrape_books_from_page(url)`: Extracts book data from a single catalog page (the first page or any subsequent page).
- `scrape_multiple_pages(base_url, total_pages)`: Drives the scraper across multiple pages (with pagination support) and merges all the results.
- `save_to_csv(data, filename)`: Saves the scraped data to a CSV file (`books_data.csv`) for further analysis and processing.
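As a rough illustration, here is a minimal sketch of how the page-level scraper and the CSV writer could be implemented with `requests` and `BeautifulSoup`. The CSS selectors reflect the Books to Scrape markup, but the exact field names and helper bodies in `WebScraping.py` may differ:

```python
import csv

import requests
from bs4 import BeautifulSoup


def scrape_books_from_page(url):
    """Extract book records from a single catalogue page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    books = []
    for card in soup.select("article.product_pod"):
        books.append({
            "title": card.h3.a["title"],
            "price": card.select_one("p.price_color").get_text(strip=True),
            # The star rating is encoded as a CSS class, e.g. "star-rating Three".
            "rating": card.select_one("p.star-rating")["class"][1],
            "availability": card.select_one("p.instock.availability").get_text(strip=True),
            # Cover image paths are relative, so resolve them against the page URL.
            "image_url": requests.compat.urljoin(url, card.img["src"]),
        })
    return books


def save_to_csv(data, filename="books_data.csv"):
    """Write the list of book dicts to a CSV file."""
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
```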
- Supports Multiple Pages: In V1, the script already supports scraping multiple pages. You only need to set the number of pages you want to scrape in the `total_pages` variable in the script.
- Wait Time: The script calls `time.sleep(1)` to pause one second between requests, to avoid overloading the server and give it time to respond.
- CSV File: The retrieved data is saved to a CSV file named `books_data.csv`, which lets you view and further analyze the data in spreadsheets such as Excel or Google Sheets.
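Building on the helpers sketched above, the pagination loop behind `scrape_multiple_pages` might look roughly like the following. The `page-{n}.html` URL pattern is how Books to Scrape paginates its catalogue, though the actual loop in `WebScraping.py` may be structured differently:

```python
import time


def scrape_multiple_pages(base_url, total_pages):
    """Scrape `total_pages` catalogue pages and merge the results."""
    all_books = []
    for page in range(1, total_pages + 1):
        # Books to Scrape paginates as catalogue/page-1.html, page-2.html, ...
        url = f"{base_url}/catalogue/page-{page}.html"
        all_books.extend(scrape_books_from_page(url))
        time.sleep(1)  # Pause between requests to avoid overloading the server.
    return all_books


if __name__ == "__main__":
    books = scrape_multiple_pages("http://books.toscrape.com", total_pages=5)
    save_to_csv(books, "books_data.csv")
```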
Check out my article on Medium:
©2024 Ryan Gading Abdullah. All rights reserved.