
Deeds Scraper and Uploader:

This project scrapes deed information from the Charleston County deeds site (https://roddaybook.charlestoncounty.org/) and uploads the data to an AWS S3 bucket. The project is containerized using Docker and uses AWS ECR for storing Docker images. The application is built in Python, using Playwright for web scraping and Boto3 for interacting with AWS services.


Prerequisites:

  • Docker: Ensure Docker is installed and running on your local machine. (Install Docker)

  • AWS CLI: Ensure the AWS CLI is installed and configured with the necessary permissions. (Install AWS CLI)

  • Python: Ensure Python is installed. (Install Python)

AWS Configuration

AWS credentials should be configured using the AWS CLI:

aws configure

This command will prompt you to enter your AWS access key ID, secret access key, region, and output format. The credentials will be stored in ~/.aws/credentials and the configuration in ~/.aws/config.
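Once aws configure completes, you can optionally confirm that Boto3 (the same library this project uses) picks up the credentials. This check is not part of the repository; it is just a quick sanity test:

```python
# Optional sanity check (not part of this repository): confirm Boto3 can
# authenticate with the credentials written by `aws configure`.
import boto3

sts = boto3.client("sts")  # uses the default credential chain (~/.aws/credentials)
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])
```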

Project Structure

```
.
├── setup_aws.sh
├── Dockerfile
├── README.md
├── config.json
├── main.py
├── requirements.txt
├── scraper.py
└── script.sh
```

Configuration File

Create a config.json file in the project root directory with the following content, customizing the date, S3 bucket name, and URL of the website you want to scrape:

```json
{
  "date": "MM/DD/YYYY",
  "bucket_name": "your-s3-bucket-name",
  "url": "Your_url_page"
}
```

Files and Their Purpose

  • scraper.py: Handles scraping data from a website using Playwright.

  • cloud.py: Manages AWS assets, including creating S3 buckets with Boto3 and uploading .pkl files.

  • main.py: The main entry point that coordinates functions from cloud.py and scraper.py (a minimal sketch of this flow appears after this list).

  • Dockerfile: Contains instructions for building the Docker image for running the application in a Docker container.

  • script.sh: Script for creating and pushing the Docker image to the AWS ECR container registry.

  • setup_aws.sh: Script for setting up the AWS credentials required by the scraper when it runs inside the Docker container.
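The real logic lives in scraper.py, cloud.py, and main.py; the sketch below only illustrates the general Playwright-plus-Boto3 flow those files implement, with placeholder data, file names, and bucket name:

```python
# Illustrative flow sketch -- placeholder selectors, names, and bucket;
# the repository's scraper.py and cloud.py contain the real implementation.
import pickle
import boto3
from playwright.sync_api import sync_playwright

def scrape_deeds(url):
    """Open the deeds page with Playwright and return scraped rows."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # The real scraper would submit the search date and parse the
        # results table; here we only grab the page title as a stand-in.
        rows = [{"page_title": page.title()}]
        browser.close()
    return rows

def upload_pickle(data, bucket, key="deed_info.pkl"):
    """Pickle the scraped rows locally, then upload the file to S3."""
    with open(key, "wb") as f:
        pickle.dump(data, f)
    boto3.client("s3").upload_file(key, bucket, key)

if __name__ == "__main__":
    deeds = scrape_deeds("https://roddaybook.charlestoncounty.org/")
    upload_pickle(deeds, "your-s3-bucket-name")
```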

Local execution steps:

-> pip install -r requirements.txt

-> python main.py

After execution, you should see the deed_info.pkl file saved in the repository.

The fields are as follows:

'Record Date', 'Record Time', 'Maker Firm Name', 'Recipient Firm Name', 'Book-Page', 'Orig Book', 'Orig Page'
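For a quick local check of these fields, the pickle file can be loaded back in Python; this assumes deed_info.pkl was written with the standard pickle module (use pandas.read_pickle instead if it is a pandas DataFrame):

```python
# Quick local inspection of the scraped output.
import pickle

with open("deed_info.pkl", "rb") as f:
    deed_info = pickle.load(f)

print(type(deed_info))
print(deed_info)  # expect fields such as 'Record Date', 'Record Time', 'Book-Page', ...
```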

(Screenshot: scraper execution in Python)

(Screenshot: scraped data saved as a .pkl file)

Docker execution steps:

Modify the credentials in setup_aws.sh to point to your AWS access key ID, secret access key, and region.

Build the Docker image (my-deeds-app is the image name):

docker build -t my-deeds-app .

Run the Docker image:

docker run -it --rm my-deeds-app

You can then verify that the S3 bucket contains the .pkl file.
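If you prefer a scripted check over the AWS console, a short Boto3 listing will show the uploaded object (the bucket name below is a placeholder):

```python
# Optional check that the .pkl object landed in the S3 bucket.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="your-s3-bucket-name")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```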

(Screenshot: deed_info.pkl in the S3 bucket)

ECR Docker image steps:

Configure the variables in script.sh to point to your ECR repository name, account ID, image name, and region.

chmod +x script.sh

./script.sh

The script builds the Docker image and pushes it to ECR.

You can then navigate to the AWS ECR console to check the image.
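Alternatively, the pushed image can be verified with Boto3's ECR client (the repository name below is a placeholder):

```python
# Optional check that the Docker image reached the ECR repository.
import boto3

ecr = boto3.client("ecr")
response = ecr.describe_images(repositoryName="your-ecr-repository-name")
for image in response["imageDetails"]:
    print(image.get("imageTags"), image["imagePushedAt"])
```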

(Screenshot: Docker image in AWS ECR)

Testing the pushed Docker image in AWS ECR

(Screenshot: testing the ECR Docker image in AWS)
