
JobsDreamer

All internships posted on the internet in the last 24 hours, delivered to your email

Technical Report

  • The software collects data scraped from the internet and processes it into a list of internships for the user. The code is closed source.

Overview

  • The software is divided into two main parts: Web Scraping and the Data Processing Pipeline.
  • Pipeline:
    1. Schedule the Web Scraping process to run every 24 hours using AWS EventBridge.
    2. Orchestrate all processes with AWS Step Functions, which automates the workflow by connecting the various AWS services and custom application logic.
    3. Create a NAT Gateway using AWS Lambda to allow the VPC to access the internet. Creating and deleting the NAT Gateway only when it is needed can save 80% of the cost compared to keeping it running 24/7 while idle (see the first sketch after this list).
    4. Preprocess new search terms and group similar ones before scraping, instead of scraping every term, saving 90% of the time and cost (see the second sketch after this list).
    5. Web Scraping: the software scrapes the internet for internships using the search terms and saves the data to an S3 bucket.
    6. Data Processing Pipeline: the software processes the scraped data into a list of internships for the user.
    7. Send the user an email, via AWS SES, containing only the internships that match their desired position.
    8. Delete the NAT Gateway after all processes complete to save costs.
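
A minimal sketch of the NAT Gateway lifecycle Lambda described in steps 3 and 8, assuming boto3 and that the subnet, Elastic IP allocation, and route table IDs are passed in the Lambda event (the event field names below are hypothetical):

```python
# Sketch only, not the production code: create the NAT Gateway before the run
# and tear it down afterwards so it is not billed while idle.
import boto3

ec2 = boto3.client("ec2")

def create_nat_gateway(event, context):
    """Create a NAT Gateway before scraping starts so the VPC can reach the internet."""
    response = ec2.create_nat_gateway(
        SubnetId=event["public_subnet_id"],              # hypothetical event field
        AllocationId=event["elastic_ip_allocation_id"],  # hypothetical event field
    )
    nat_id = response["NatGateway"]["NatGatewayId"]
    # Wait until the gateway is usable, then point the private route table at it.
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
    ec2.create_route(
        RouteTableId=event["private_route_table_id"],    # hypothetical event field
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    return {"nat_gateway_id": nat_id}

def delete_nat_gateway(event, context):
    """Delete the NAT Gateway after the pipeline finishes to stop hourly charges."""
    ec2.delete_route(
        RouteTableId=event["private_route_table_id"],
        DestinationCidrBlock="0.0.0.0/0",
    )
    ec2.delete_nat_gateway(NatGatewayId=event["nat_gateway_id"])
    return {"deleted": event["nat_gateway_id"]}
```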
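The grouping of similar search terms in step 4 is not described in detail; the sketch below shows one assumed approach using simple string similarity from Python's standard library, not the project's actual logic:

```python
# Sketch of grouping similar search terms so only one representative term per
# group is scraped (assumed approach; threshold and normalization are illustrative).
from difflib import SequenceMatcher

def normalize(term: str) -> str:
    return " ".join(term.lower().split())

def group_search_terms(terms, threshold=0.85):
    """Greedily assign each term to the first group whose representative is similar enough."""
    groups = []  # each group is a list of similar terms; groups[i][0] is the representative
    for term in map(normalize, terms):
        for group in groups:
            if SequenceMatcher(None, term, group[0]).ratio() >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

# Only the representative of each group is scraped, cutting redundant queries.
groups = group_search_terms([
    "software engineer intern", "software engineering intern",
    "data science intern", "data scientist intern", "marketing intern",
])
print([g[0] for g in groups])
```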

Web Scraping

  • 5.1 - The software uses a list of search terms to scrape the internet for internships from various job sites.
  • 5.2 - The container image is pulled from Amazon Elastic Container Registry (ECR) and run on AWS Fargate (a launch sketch follows this list).
  • 5.3 - The software accesses the internet through a proxy server.
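
A minimal sketch, assuming boto3 and placeholder resource names (the cluster, task definition, subnet, security group, and proxy URL are all hypothetical), of launching the scraping container built from the ECR image as a Fargate task inside the VPC:

```python
# Sketch only: run one Fargate task from the task definition that references the ECR image.
import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="jobsdreamer-cluster",             # hypothetical cluster name
    launchType="FARGATE",
    taskDefinition="jobsdreamer-scraper:1",    # task definition referencing the ECR image
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],      # private subnet behind the NAT Gateway
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",                  # outbound traffic goes via NAT Gateway / proxy
        }
    },
    overrides={
        "containerOverrides": [{
            "name": "scraper",
            "environment": [{"name": "HTTP_PROXY", "value": "http://proxy.example.com:8080"}],
        }]
    },
)
print(response["tasks"][0]["taskArn"])
```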

Data Processing Pipeline

  • 6.1 - Download the raw data from the AWS S3 bucket.
  • 6.2 - Cleaning: remove duplicate records and clean the data.
  • 6.3 - Categorizing and Reviewing: categorize each internship by its search term with a fine-tuned Llama 3 model, with a final review by GPT-4o (a sketch follows this list).
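
A minimal sketch of steps 6.1-6.3, assuming a JSON Lines file layout, hypothetical bucket, key, and column names, and the official openai Python client; the fine-tuned Llama 3 categorization step itself is not shown:

```python
# Sketch: download the raw scrape from S3, drop duplicates, then ask GPT-4o
# to give a final review of a categorization (prompt wording is illustrative).
import boto3
import pandas as pd
from openai import OpenAI

s3 = boto3.client("s3")
s3.download_file("jobsdreamer-raw", "scrapes/latest.jsonl", "/tmp/latest.jsonl")  # hypothetical bucket/key

# 6.2 Cleaning: load the scraped postings and remove duplicates by title + company + URL.
df = pd.read_json("/tmp/latest.jsonl", lines=True)
df = df.drop_duplicates(subset=["title", "company", "url"]).dropna(subset=["title"])

# 6.3 Reviewing: the fine-tuned Llama 3 model assigns the category (not shown);
# GPT-4o then gives a final yes/no review of that assignment.
client = OpenAI()
row = df.iloc[0]
review = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Is the internship '{row['title']}' correctly categorized under "
                   f"the search term '{row['search_term']}'? Answer yes or no.",
    }],
)
print(review.choices[0].message.content)
```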

Technologies

  • Python, Scrapy, Selenium: used for web scraping.
  • AWS Fargate, AWS Lambda, AWS Elastic Container Registry (ECR): used to containerize and run the web scraping process.
  • AWS Step Functions, AWS EventBridge, AWS Simple Email Service (SES), AWS S3: used for orchestrating the workflow and sending emails (an SES sketch follows this list).
  • Fine-tuned Llama 3, GPT-4o: used for categorizing and reviewing the internships.
  • AWS Virtual Private Cloud (VPC), AWS NAT Gateway, AWS Elastic IP, AWS Internet Gateway: used for connecting the VPC to the internet.
  • ISP Proxy Server: used for connecting to the internet and avoiding IP blocking.
  • GitHub Actions: used for CI/CD.
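
A minimal sketch, assuming boto3 and placeholder addresses, of the SES email step (pipeline step 7) that sends a user only the internships matching their desired position:

```python
# Sketch only: send one digest email per user through AWS SES.
import boto3

ses = boto3.client("ses", region_name="us-east-1")  # hypothetical region

def send_digest(recipient: str, internships: list[dict]) -> None:
    """Send one email listing the internships that match the user's desired position."""
    body = "\n".join(f"- {job['title']} at {job['company']}: {job['url']}" for job in internships)
    ses.send_email(
        Source="digest@jobsdreamer.example",           # hypothetical verified sender address
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": "Your internship matches from the last 24 hours"},
            "Body": {"Text": {"Data": body}},
        },
    )
```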