- The software collects scraped data from the internet and processes it to provide a list of internships to the user. The code is closed source.
- The software is divided into two main parts: Web Scraping and Data Processing Pipeline.
- Pipeline:
- 1 - Schedule the web scraping process to run every 24 hours using Amazon EventBridge.
- 2 - Orchestrate all processes with AWS Step Functions, which automates the workflow by connecting the AWS services and the custom application logic.
- 3 - Create a NAT Gateway from an AWS Lambda function so that the VPC can reach the internet. Creating the NAT Gateway only when it is needed and deleting it afterwards saves about 80% of the cost of keeping it running 24/7.
- 4 - Preprocess new search terms and group similar ones, so that the scraper runs on the grouped terms instead of on every search term. This saves about 90% of the scraping time and cost.
- 5 - Web Scraping: scrape the internet for internships using the search terms and save the raw data to an S3 bucket.
- 6 - Data Processing Pipeline: process the scraped data into a list of internships for the user.
- 7 - Email each user a list containing only the internships that match their desired position, using Amazon SES.
- 8 - Delete the NAT Gateway after all processes have completed to save costs.
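
The NAT Gateway creation and deletion steps above could look roughly like the sketch below. This is not the project's actual code (which is closed source). It assumes `ec2` is a boto3 EC2 client (`boto3.client("ec2")`) created inside the Lambda handler, and that the subnet, Elastic IP allocation, and route table IDs are supplied by the caller. All IDs and function names here are hypothetical.

```python
from typing import Any


def create_nat_gateway(ec2: Any, subnet_id: str, allocation_id: str,
                       route_table_id: str) -> str:
    """Create a NAT Gateway and point the private route table at it.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    response = ec2.create_nat_gateway(SubnetId=subnet_id, AllocationId=allocation_id)
    nat_id = response["NatGateway"]["NatGatewayId"]

    # The gateway is not usable immediately; block until it is available.
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # Send all outbound traffic from the private subnet through the gateway.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    return nat_id


def delete_nat_gateway(ec2: Any, nat_id: str, route_table_id: str) -> None:
    """Remove the default route, then delete the gateway so it stops accruing charges."""
    ec2.delete_route(RouteTableId=route_table_id, DestinationCidrBlock="0.0.0.0/0")
    ec2.delete_nat_gateway(NatGatewayId=nat_id)
```

Passing the client in as a parameter keeps the two functions easy to exercise with a stub, and lets the first and last workflow steps share the same NAT Gateway ID through the workflow state.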
- 5.1 - The software uses a list of search terms to scrape various job sites for internships.
- 5.2 - The scraper container image is pulled from the Amazon Elastic Container Registry (ECR) and run on AWS Fargate.
- 5.3 - The scraper reaches the internet through a proxy server.
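
The list of search terms used for scraping comes out of the preprocessing step that groups similar terms. The source does not say how similarity is measured, so the sketch below is only one possible approach: it strips filler words and groups terms greedily with the standard library's `difflib.SequenceMatcher`. The stop-word list, the `0.8` threshold, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Filler words that do not change what is being searched for (assumed list).
STOP_WORDS = {"intern", "internship", "internships"}


def normalize(term: str) -> str:
    """Lowercase a term and drop filler words such as 'internship'."""
    words = [w for w in term.lower().split() if w not in STOP_WORDS]
    return " ".join(words)


def group_search_terms(terms: list[str], threshold: float = 0.8) -> dict[str, list[str]]:
    """Greedily group terms whose normalized forms are similar.

    Returns a mapping from one representative query per group to the
    original terms that the query stands in for.
    """
    groups: dict[str, list[str]] = {}
    for term in terms:
        key = normalize(term)
        for representative in groups:
            # Join an existing group if the term is close enough to its representative.
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                groups[representative].append(term)
                break
        else:
            # No similar group found: this term starts a new one.
            groups[key] = [term]
    return groups
```

With this sketch, "Software Engineer Intern", "software engineer internship", and "Software Engineering Intern" fall into a single group represented by the query `software engineer`, so the scraper would issue one query for all three.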
- 6.1 - Download the raw data from the S3 bucket.
- 6.2 - Cleaning: remove duplicate records and clean the remaining data.
- 6.3 - Categorizing and Reviewing: categorize each internship by search term with a fine-tuned Llama 3 model, then run a final review with GPT-4o.
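
As an illustration of the cleaning step, the sketch below tidies whitespace and drops duplicate listings. The record schema is not described in the source, so the field names `title`, `company`, and `url`, as well as the rule for what counts as a duplicate, are assumptions.

```python
def _clean_text(value: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(value.split())


def clean_listings(raw: list[dict[str, str]]) -> list[dict[str, str]]:
    """Tidy text fields and keep the first occurrence of each listing."""
    seen: set[tuple[str, str, str]] = set()
    cleaned: list[dict[str, str]] = []
    for record in raw:
        tidy = {field: _clean_text(value) for field, value in record.items()}
        # Assumed duplicate rule: same company and title (ignoring case)
        # and the same URL (ignoring a trailing slash).
        key = (
            tidy.get("company", "").lower(),
            tidy.get("title", "").lower(),
            tidy.get("url", "").rstrip("/"),
        )
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(tidy)
    return cleaned
```

Keeping the first occurrence preserves the order in which listings were scraped, which makes the output easier to compare against the raw data.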
- Python, Scrapy, Selenium: used for web scraping.
- AWS Fargate, AWS Lambda, Amazon Elastic Container Registry (ECR): used to containerize and run the web scraping process.
- AWS Step Functions, Amazon EventBridge, Amazon Simple Email Service (SES), Amazon S3: used for scheduling and orchestrating the workflow, storing the scraped data, and sending emails.
- Fine-tuned Llama 3, GPT-4o: used for categorizing and reviewing the internships.
- Amazon Virtual Private Cloud (VPC), NAT Gateway, Elastic IP, Internet Gateway: used for connecting the VPC to the internet.
- ISP Proxy Server: used for connecting to the internet and avoiding IP blocking.
- GitHub Actions: used for CI/CD.
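
To show how the pipeline stages might be wired together in AWS Step Functions, the sketch below builds a linear state machine definition in Amazon States Language as a Python dictionary. The state names, the account ID, the Lambda function names, and the cluster and task definition names are all hypothetical placeholders. Which stages run on Lambda is also an assumption; the source only states that scraping runs on AWS Fargate and that the NAT Gateway is managed from Lambda. A real Fargate task state would additionally need a `NetworkConfiguration` block, omitted here for brevity.

```python
import json


def build_state_machine(steps: list[tuple[str, dict]]) -> dict:
    """Chain (name, state) pairs into a linear Amazon States Language definition."""
    states: dict[str, dict] = {}
    for index, (name, state) in enumerate(steps):
        state = dict(state)  # copy so the caller's dict is not mutated
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]  # hand over to the following step
        else:
            state["End"] = True  # the last step terminates the workflow
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


def lambda_task(function_arn: str) -> dict:
    """Task state that invokes a Lambda function by its ARN."""
    return {"Type": "Task", "Resource": function_arn}


# Hypothetical ARN prefix and resource names, for illustration only.
LAMBDA = "arn:aws:lambda:us-east-1:123456789012:function"

PIPELINE = [
    ("CreateNatGateway", lambda_task(f"{LAMBDA}:create-nat-gateway")),
    ("GroupSearchTerms", lambda_task(f"{LAMBDA}:group-search-terms")),
    ("ScrapeInternships", {
        "Type": "Task",
        # Run the scraper container on Fargate and wait for it to finish.
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "LaunchType": "FARGATE",
            "Cluster": "scraper-cluster",
            "TaskDefinition": "scraper-task",
        },
    }),
    ("ProcessData", lambda_task(f"{LAMBDA}:process-data")),
    ("SendEmails", lambda_task(f"{LAMBDA}:send-emails")),
    ("DeleteNatGateway", lambda_task(f"{LAMBDA}:delete-nat-gateway")),
]

definition = build_state_machine(PIPELINE)

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

The resulting JSON could be used as the definition of a state machine that an EventBridge schedule starts once a day.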