CSPGuard

Web Data Extractor is a Python tool for analyzing web pages. It extracts domains and resources from CSP headers and JavaScript files, and organizes them into folders named after each URL's hostname. Outputs include text, JSON, and CSV files.

Web Data Extractor

Overview

Web Data Extractor is a Python tool for analyzing web pages. It captures and processes Content Security Policy (CSP) headers, JavaScript files, images, and iframes. The tool organizes the extracted data into folders named after the hostnames of the provided URLs and saves it as text, JSON, and CSV files.

Features

CSP Header Analysis: Extracts and processes domains and JavaScript URLs from CSP headers.
JavaScript Parsing: Fetches and parses JavaScript files to detect domains.
Resource Detection: Captures URLs of images and iframes.
Data Organization: Creates a folder for each URL's hostname and saves data in text, JSON, and CSV formats.
Concurrent Processing: Supports concurrent fetching and parsing of JavaScript files for efficient data extraction.
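
The CSP Header Analysis feature can be illustrated with a small sketch. This is not the code from main.py; it is a minimal example, assuming the header value is already available as a string, and the function name `extract_csp_domains` is hypothetical. It splits the header into directives, skips keywords and scheme-only sources, and keeps the host names it finds.

```python
from typing import Set
from urllib.parse import urlparse


def extract_csp_domains(csp_header: str) -> Set[str]:
    """Collect host names from the source lists of a CSP header value.

    Hypothetical helper for illustration; not taken from main.py.
    """
    domains = set()
    # Directives are separated by ';', e.g. "script-src 'self' https://cdn.example.com"
    for directive in csp_header.split(";"):
        tokens = directive.split()
        # tokens[0] is the directive name; the rest are source expressions
        for source in tokens[1:]:
            # Skip the bare wildcard, quoted keywords ('self', 'nonce-...'),
            # and scheme-only sources such as "https:" or "data:"
            if source == "*" or source.startswith("'") or source.endswith(":"):
                continue
            # Add "//" when no scheme is present so urlparse treats it as a host
            candidate = source if "://" in source else "//" + source
            host = urlparse(candidate).hostname
            # Requiring a dot filters out non-host tokens (e.g. sandbox flags);
            # as a side effect, single-label hosts like "localhost" are ignored
            if host and "." in host:
                # Drop a leading wildcard label such as "*.example.com"
                domains.add(host[2:] if host.startswith("*.") else host)
    return domains
```

For example, calling it on `"script-src 'self' https://cdn.example.com *.static.example.org"` returns a set containing `cdn.example.com` and `static.example.org`.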

Prerequisites

Before running the script, ensure you have the following installed:

Python 3.7 or higher
Playwright

Installing Dependencies

Clone the repository and install the packages listed in requirements.txt:

git clone https://github.com/M-thefl/CSPGuard.git
cd CSPGuard
pip install -r requirements.txt

Install Playwright and its browser binaries:

pip install playwright
playwright install

Install Required Python Packages:
```
pip install requests
```

Usage

To use Web Data Extractor, run the script with one or more URLs as arguments. The script will create folders based on the hostnames of the provided URLs and save the extracted data accordingly.

Command Line Usage:

python main.py <URL1> [<URL2> ... <URLN>]

Example:

python main.py https://example.com https://another-example.com
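
How a folder name can be derived from a URL is sketched below. This is an illustration only, assuming the folder is named exactly after the URL's hostname as described above; the function name `output_dir_for` and its `base` parameter are hypothetical and not taken from main.py.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_dir_for(url: str, base: Path = Path(".")) -> Path:
    """Create and return the output folder for a URL, named after its hostname.

    Hypothetical helper for illustration; not taken from main.py.
    """
    hostname = urlparse(url).hostname
    if not hostname:
        # A URL without a scheme (e.g. "example.com") has no parsed hostname
        raise ValueError(f"Cannot determine a hostname from {url!r}")
    folder = base / hostname
    # exist_ok=True lets repeated runs reuse the same folder
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

With this sketch, `https://example.com/some/page` and `https://example.com/other` map to the same `example.com` folder, so URLs passed on the command line should include the scheme.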

Output

For each URL provided, the script will create a folder with the following structure:

hostname/
├── detected_domains.txt
├── detected_domains.json
├── detected_images.txt
├── detected_iframes.txt
└── detected_data.csv

detected_domains.txt: A list of detected domains.
detected_domains.json: A JSON file containing detected domains.
detected_images.txt: A list of detected image URLs.
detected_iframes.txt: A list of detected iframe URLs.
detected_data.csv: A CSV file with all detected data including domains, images, and iframes.
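
The file layout above could be produced along these lines. This is a sketch under stated assumptions, not the code from main.py: the function name `save_results`, the JSON shape (`{"domains": [...]}`), and the CSV columns (`type`, `value`) are all hypothetical, since the README does not specify them.

```python
import csv
import json
from pathlib import Path
from typing import Iterable


def save_results(folder: Path, domains: Iterable[str],
                 images: Iterable[str], iframes: Iterable[str]) -> None:
    """Write detected data as text, JSON, and CSV files inside `folder`.

    Hypothetical helper for illustration; file contents are assumed layouts.
    """
    folder.mkdir(parents=True, exist_ok=True)
    # De-duplicate and sort so repeated runs give stable output
    domains = sorted(set(domains))
    images = sorted(set(images))
    iframes = sorted(set(iframes))

    # Plain text lists, one entry per line
    (folder / "detected_domains.txt").write_text("\n".join(domains), encoding="utf-8")
    (folder / "detected_images.txt").write_text("\n".join(images), encoding="utf-8")
    (folder / "detected_iframes.txt").write_text("\n".join(iframes), encoding="utf-8")

    # JSON file; the {"domains": [...]} shape is an assumption
    (folder / "detected_domains.json").write_text(
        json.dumps({"domains": domains}, indent=2), encoding="utf-8"
    )

    # Combined CSV; the "type"/"value" columns are an assumption
    with open(folder / "detected_data.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["type", "value"])
        for kind, values in (("domain", domains), ("image", images), ("iframe", iframes)):
            writer.writerows((kind, value) for value in values)
```

A long-format CSV with one row per item keeps domains, images, and iframes in a single file even when the lists have different lengths.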

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🖋 Contact

If you have any questions or suggestions, feel free to contact me at [email protected]

good luck (; 🌙
for life
fl 🚀
