Web Data Extractor is a Python tool for analyzing web pages. It captures and processes Content Security Policy (CSP) headers, JavaScript files, images, and iframes. The extracted data is organized into folders named after the hostname of each provided URL and saved as text, JSON, and CSV files.
- CSP Header Analysis: Extracts and processes domains and JavaScript URLs from CSP headers.
- JavaScript Parsing: Fetches and parses JavaScript files to detect domains.
- Resource Detection: Captures URLs of images and iframes.
- Data Organization: Creates folders for each URL and saves data in text files, JSON, and CSV formats.
- Concurrent Processing: Supports concurrent fetching and parsing of JavaScript files for efficient data extraction.
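To make the CSP Header Analysis feature more concrete, here is a minimal sketch of how domains could be pulled out of a `Content-Security-Policy` header value using only the standard library. The function name `extract_csp_domains` and the `HOST_DIRECTIVES` set are hypothetical and are not taken from the project's code; the actual script may parse headers differently.

```python
from typing import List
from urllib.parse import urlparse

# Assumption: besides "*-src" directives, these also carry host sources.
HOST_DIRECTIVES = {"form-action", "frame-ancestors", "base-uri"}


def extract_csp_domains(csp_header: str) -> List[str]:
    """Return the sorted, de-duplicated hostnames listed in a CSP header value."""
    domains = set()
    for directive in csp_header.split(";"):
        tokens = directive.split()
        if not tokens:
            continue
        name, sources = tokens[0].lower(), tokens[1:]
        # Skip directives whose values are not source lists (e.g. sandbox).
        if "-src" not in name and name not in HOST_DIRECTIVES:
            continue
        for source in sources:
            # Skip keywords ('self', 'nonce-...'), bare schemes (data:), and "*".
            if source.startswith("'") or source.endswith(":") or source == "*":
                continue
            # urlparse only fills in the hostname when a "//" authority is present.
            candidate = source if "://" in source else "//" + source
            host = urlparse(candidate).hostname
            if not host:
                continue
            # Reduce wildcard hosts such as *.example.com to example.com.
            if host.startswith("*."):
                host = host[2:]
            domains.add(host)
    return sorted(domains)


sample = (
    "default-src 'self'; "
    "script-src 'self' https://cdn.example.com *.static.example.net; "
    "img-src data: https://images.example.org:443; "
    "sandbox allow-scripts"
)
print(extract_csp_domains(sample))
# ['cdn.example.com', 'images.example.org', 'static.example.net']
```

Restricting the scan to source-list directives keeps tokens such as `allow-scripts` from being mistaken for hostnames.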
Before running the script, ensure you have the following installed:
- Python 3.7 or higher
- Playwright
- Clone the repository:
git clone https://github.com/M-thefl/CSPGuard.git
cd CSPGuard
pip install -r requirements.txt
- Install Playwright:
pip install playwright
playwright install
- Install Required Python Packages:
pip install requests
To use Web Data Extractor, run the script with one or more URLs as arguments. The script will create folders based on the hostnames of the provided URLs and save the extracted data accordingly.
Command Line Usage:
python main.py <URL1> [<URL2> ... <URLN>]
Example:
python main.py https://example.com https://another-example.com
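The mapping from a URL to its output folder can be sketched as follows. This is an illustration only: the helper name `output_dir_for` and the `base` parameter are hypothetical, and the real script may derive or sanitize folder names differently.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_dir_for(url: str, base: Path = Path(".")) -> Path:
    """Return the folder path for a URL, named after the URL's hostname."""
    hostname = urlparse(url).hostname
    if not hostname:
        # A URL without a scheme (e.g. "example.com") has no parsed hostname.
        raise ValueError(f"Cannot determine hostname from URL: {url!r}")
    return base / hostname


print(output_dir_for("https://example.com/some/page?q=1").name)
# example.com
```

Calling `output_dir_for(url).mkdir(parents=True, exist_ok=True)` would then create the folder if it does not exist yet.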
For each URL provided, the script will create a folder with the following structure:
hostname/
├── detected_domains.txt
├── detected_domains.json
├── detected_images.txt
├── detected_iframes.txt
└── detected_data.csv
- detected_domains.txt: A list of detected domains.
- detected_domains.json: A JSON file containing detected domains.
- detected_images.txt: A list of detected image URLs.
- detected_iframes.txt: A list of detected iframe URLs.
- detected_data.csv: A CSV file with all detected data including domains, images, and iframes.
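As a rough sketch of how these five files could be produced, the function below writes the same file names using the standard library. The function name `write_outputs`, the JSON shape (`{"domains": [...]}`), and the CSV columns (`type`, `value`) are assumptions made for illustration; the README does not specify the exact layout the script uses.

```python
import csv
import json
from pathlib import Path
from typing import Iterable


def _write_lines(path: Path, lines: Iterable[str]) -> None:
    """Write one item per line."""
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")


def write_outputs(out_dir, domains, images, iframes) -> None:
    """Write the detected data to text, JSON, and CSV files in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # De-duplicate and sort so repeated runs give stable output.
    domains = sorted(set(domains))
    images = sorted(set(images))
    iframes = sorted(set(iframes))

    _write_lines(out_dir / "detected_domains.txt", domains)
    _write_lines(out_dir / "detected_images.txt", images)
    _write_lines(out_dir / "detected_iframes.txt", iframes)

    # Assumed JSON shape: a single object with a "domains" list.
    with open(out_dir / "detected_domains.json", "w", encoding="utf-8") as fh:
        json.dump({"domains": domains}, fh, indent=2)

    # Assumed CSV layout: one row per item, tagged with its type.
    with open(out_dir / "detected_data.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["type", "value"])
        for kind, values in (("domain", domains), ("image", images), ("iframe", iframes)):
            for value in values:
                writer.writerow([kind, value])
```

A call such as `write_outputs("example.com", domains, images, iframes)` would fill the `example.com` folder with all five files.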
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions or suggestions, feel free to contact me at [email protected]
good luck (; 🌙