
Deeds Scraper and Uploader:

This project scrapes deed information from the Charleston County deeds site (https://roddaybook.charlestoncounty.org/) and uploads the data to an AWS S3 bucket. The project is containerized using Docker and uses AWS ECR for storing Docker images. The application is built in Python, using Playwright for web scraping and Boto3 for interacting with AWS services.


Prerequisites:

  • Docker: Ensure Docker is installed and running on your local machine. (Install Docker)

  • AWS CLI: Ensure the AWS CLI is installed and configured with the necessary permissions. (Install AWS CLI)

  • Python: Ensure Python is installed. (Install Python)

AWS Configuration

AWS credentials should be configured using the AWS CLI:

aws configure

This command will prompt you to enter your AWS access key ID, secret access key, region, and output format. The credentials will be stored in ~/.aws/credentials and the configuration in ~/.aws/config.
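Once aws configure completes, you can optionally confirm that Boto3 (the same library this project uses) picks up the credentials. This check is not part of the repository; it is just a quick sanity test:

```python
# Optional sanity check (not part of this repository): confirm Boto3 can
# authenticate with the credentials written by `aws configure`.
import boto3

sts = boto3.client("sts")  # uses the default credential chain (~/.aws/credentials)
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])
```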

Project Structure

```
.
├── setup_aws.sh
├── Dockerfile
├── README.md
├── config.json
├── main.py
├── requirements.txt
├── scraper.py
└── script.sh
```

Configuration File

Create a config.json file in the project root directory with the following content, customizing the date, S3 bucket name, and URL of the website you want to scrape:

```json
{
  "date": "MM/DD/YYYY",
  "bucket_name": "your-s3-bucket-name",
  "url": "Your_url_page"
}
```

Files and Their Purpose

  • scraper.py: Handles scraping data from a website using Playwright.

  • cloud.py: Manages AWS assets, including creating S3 buckets with Boto3 and uploading .pkl files.

  • main.py: The main entry point that coordinates functions from cloud.py and scraper.py (a minimal sketch of this flow appears after this list).

  • Dockerfile: Contains instructions for building the Docker image for running the application in a Docker container.

  • script.sh: Script for creating and pushing the Docker image to the AWS ECR container registry.

  • setup_aws.sh: Script for setting up the AWS credentials required by the scraper when it runs inside the Docker container.
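The real logic lives in scraper.py, cloud.py, and main.py; the sketch below only illustrates the general Playwright-plus-Boto3 flow those files implement, with placeholder data, file names, and bucket name:

```python
# Illustrative flow sketch -- placeholder selectors, names, and bucket;
# the repository's scraper.py and cloud.py contain the real implementation.
import pickle
import boto3
from playwright.sync_api import sync_playwright

def scrape_deeds(url):
    """Open the deeds page with Playwright and return scraped rows."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # The real scraper would submit the search date and parse the
        # results table; here we only grab the page title as a stand-in.
        rows = [{"page_title": page.title()}]
        browser.close()
    return rows

def upload_pickle(data, bucket, key="deed_info.pkl"):
    """Pickle the scraped rows locally, then upload the file to S3."""
    with open(key, "wb") as f:
        pickle.dump(data, f)
    boto3.client("s3").upload_file(key, bucket, key)

if __name__ == "__main__":
    deeds = scrape_deeds("https://roddaybook.charlestoncounty.org/")
    upload_pickle(deeds, "your-s3-bucket-name")
```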

Local execution steps:

-> pip install -r requirements.txt

-> python main.py

After execution, you should see the deed_info.pkl file saved in the repository.

The fields are as follows:

'Record Date', 'Record Time', 'Maker Firm Name', 'Recipient Firm Name', 'Book-Page', 'Orig Book', 'Orig Page'
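For a quick local check of these fields, the pickle file can be loaded back in Python; this assumes deed_info.pkl was written with the standard pickle module (use pandas.read_pickle instead if it is a pandas DataFrame):

```python
# Quick local inspection of the scraped output.
import pickle

with open("deed_info.pkl", "rb") as f:
    deed_info = pickle.load(f)

print(type(deed_info))
print(deed_info)  # expect fields such as 'Record Date', 'Record Time', 'Book-Page', ...
```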

(Screenshot: scraper execution in Python)

(Screenshot: scraped data saved as a .pkl file)

Docker execution steps:

Modify the credentials in setup_aws.sh to point to your AWS access key ID, secret access key, and region.

Build the Docker image (my-deeds-app is the image name):

docker build -t my-deeds-app .

Run the Docker image:

docker run -it --rm my-deeds-app

You can then verify that the S3 bucket contains the .pkl file.
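If you prefer a scripted check over the AWS console, a short Boto3 listing will show the uploaded object (the bucket name below is a placeholder):

```python
# Optional check that the .pkl object landed in the S3 bucket.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="your-s3-bucket-name")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```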

(Screenshot: deed_info.pkl in the S3 bucket)

ECR Docker image steps:

Configure the variables in script.sh to point to your ECR repository name, account ID, image name, and region.

chmod +x script.sh

./script.sh

The script builds the Docker image and pushes it to ECR.

You can then navigate to the AWS ECR console to check the image.
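Alternatively, the pushed image can be verified with Boto3's ECR client (the repository name below is a placeholder):

```python
# Optional check that the Docker image reached the ECR repository.
import boto3

ecr = boto3.client("ecr")
response = ecr.describe_images(repositoryName="your-ecr-repository-name")
for image in response["imageDetails"]:
    print(image.get("imageTags"), image["imagePushedAt"])
```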

(Screenshot: Docker image in AWS ECR)

Testing the pushed Docker image in AWS ECR

(Screenshot: testing the ECR Docker image in AWS)
