trinhcaokhoa/AI-LLM-Webscrapper

AI Web Scraper

AI Web Scraper is a Streamlit-based application that scrapes websites with Selenium (headless Chrome) and extracts and cleans the page content with BeautifulSoup. It also handles CAPTCHAs and supports custom parsing of the scraped content using AI models.

Features

  • Scrape website content using Selenium WebDriver (headless Chrome).
  • Intelligent content extraction and cleaning using BeautifulSoup.
  • Handles CAPTCHAs encountered while scraping.
  • Custom content parsing using AI models (e.g., Ollama).
  • Easy-to-use web interface powered by Streamlit.
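
The extraction and cleaning step listed above can be sketched roughly as follows. This is an illustration, not the project's actual code: to stay dependency-free it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, and the names `clean_html` and `_TextExtractor` are hypothetical. The idea is to drop `<script>` and `<style>` contents and keep only the non-empty lines of visible text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of `html`, one non-empty stripped line per row."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in parser.text().splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = "<body><script>var x = 1;</script><h1> Title </h1><p>Hello</p></body>"
    print(clean_html(page))
```

With BeautifulSoup the same effect is usually achieved by removing the script and style elements and then calling `get_text` with a newline separator.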

Technologies Used

  • Python
  • Selenium - For web scraping and CAPTCHA handling.
  • BeautifulSoup - For DOM content extraction and cleaning.
  • Streamlit - For the user interface.
  • Docker - For containerized execution with docker-compose.
  • Ollama - For AI-based content parsing.
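
As a rough sketch of how AI-based parsing with Ollama might look, the snippet below builds a request for Ollama's `/api/generate` REST endpoint using only the standard library. Everything specific here is an assumption rather than the project's actual code: the function names, the prompt wording, the model name `llama3`, and the URL (inside Docker Compose the host is more likely a service name than `localhost`).

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_parse_request(content: str, instruction: str, model: str = "llama3") -> dict:
    """Build the JSON body for a non-streaming /api/generate call."""
    prompt = (
        "Extract information from the text below.\n"
        f"Instruction: {instruction}\n"
        "Return only the matching data, or an empty string if nothing matches.\n\n"
        f"Text:\n{content}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_with_ollama(content: str, instruction: str,
                      model: str = "llama3", url: str = OLLAMA_URL) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_parse_request(content, instruction, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Long pages may exceed a model's context window, so in practice the cleaned text is often split into smaller pieces and each piece is parsed separately.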

Prerequisites

Make sure you have the following installed on your system:

  • Docker
  • Docker Compose
  • Git

Local Setup and Installation

  1. Clone the repository:

    git clone https://github.com/trinhcaokhoa/AI-Web-Scraper.git
    cd AI-Web-Scraper
    
  2. Build and run the project using Docker Compose:

    docker-compose up --build
    

  3. Once the containers are running, open your browser and go to http://localhost:8501.
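
If the page does not load, a quick way to check whether the app is answering on port 8501 is a small probe like the one below. This is a hypothetical helper, not part of the repository; it only tests that something responds with HTTP 200 at the given URL.

```python
import urllib.request


def is_app_up(url: str = "http://localhost:8501", timeout: float = 3.0) -> bool:
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


if __name__ == "__main__":
    print("app reachable" if is_app_up() else "app not reachable yet")
```

The first build can take a while, so a negative result shortly after `docker-compose up --build` may just mean the containers are still starting.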
