A simple web crawler, written in Python, that recursively detects broken links on websites.
You can use and extend the tool to easily crawl a website and check for broken links and other errors.
Download pageCrawler.py and run it from the terminal:
python3 pageCrawler.py URL_TO_CRAWL
Check out all the options using
python3 pageCrawler.py --help
Use the one-liner below to immediately use the WebScanner from DockerHub:
docker run --rm emeraldit/webscanner:1.0.1 URL_TO_CRAWL
To run the docker command and send the output to your Slack, execute the image like this:
docker run --rm -e SLACK_BOT_TOKEN='your-slack-token' emeraldit/webscanner:1.0.1 --channel_id SLACK_ID URL_TO_CRAWL
You can also build the Docker image yourself and run the script from it.
Build the image:
docker build --tag emeraldit/webscanner:1.0.1 .
Run the image:
docker run --rm \
--name webscanner.container \
emeraldit/webscanner:1.0.1 \
--help
Or, in PowerShell:
docker run --rm `
--name webscanner.container `
emeraldit/webscanner:1.0.1 `
--help
You can also use the class directly in your own Python code:
from webscanner import WebScanner
crawler = WebScanner(
URL_TO_CRAWL,
prefix=OPTIONAL_PREFIX,
max_depth=2,
test_external_urls=True,
verbose=2,
)
crawler.crawl()
The prefix determines which links are crawled. For example, you can limit the crawl to a specific subdirectory of a domain, such as https://domain.com/dir/. When the prefix is set to this URL, everything above /dir/ is ignored.
The max_depth determines how deep the crawler goes. For example, if it is set to 1, only the links on the initial page are followed, and then the process stops. If it is set to 2, the links on the initial page are followed, and so are the links on each of the pages they lead to.
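The interplay of prefix and max_depth can be sketched as a small breadth-first crawl. This is only an illustration under stated assumptions, not the actual WebScanner implementation: the `SITE` dictionary, the `crawl` function, and its `get_links` parameter are hypothetical names, an in-memory link graph stands in for real HTTP requests, and links outside the prefix are simply skipped rather than tested.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches:
# each key is a page URL, each value is the list of links found on that page.
SITE = {
    "https://domain.com/dir/": ["https://domain.com/dir/a", "https://domain.com/other"],
    "https://domain.com/dir/a": ["https://domain.com/dir/b"],
    "https://domain.com/dir/b": ["https://domain.com/dir/c"],
    "https://domain.com/other": ["https://domain.com/dir/hidden"],
}


def crawl(start_url, prefix=None, max_depth=2, get_links=SITE.get):
    """Return the URLs reached from start_url, in breadth-first order."""
    # Assumption: with no prefix given, the start URL itself acts as the prefix.
    prefix = prefix or start_url
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth); the initial page has depth 0
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # links on pages at the depth limit are not followed
        for link in get_links(url) or []:
            # Skip links already seen and links outside the prefix.
            if link in seen or not link.startswith(prefix):
                continue
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited
```

With max_depth=1 the sketch follows only the links on the initial page; with max_depth=2 it also follows the links found on those pages. Widening the prefix to https://domain.com/ brings https://domain.com/other and the page it links to into scope.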
The WebScanner can send the results of the run directly to your Slack channel.
By providing the --channel_id or -cid flag at execution, you can specify the ID of the Slack channel in which you would like to post the results.
In order to allow the WebScanner to send messages to your Slack, you need to expose the SLACK_BOT_TOKEN in your environment variables. Curious where to find this token? Read further in Slack's documentation.
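To make the role of the token and the channel ID concrete, here is a minimal sketch of how a message could be prepared for Slack's chat.postMessage Web API method using only the Python standard library. It is not taken from the WebScanner source, which may well use a Slack client library instead; the function name `build_slack_request` and the constant `SLACK_POST_URL` are hypothetical.

```python
import json
import os
import urllib.request

# Slack Web API method for posting a message to a channel.
SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(channel_id, text, token=None):
    """Build (but do not send) a request that posts text to a Slack channel."""
    # Fall back to the SLACK_BOT_TOKEN environment variable described above.
    token = token or os.environ.get("SLACK_BOT_TOKEN")
    if not token:
        raise RuntimeError("SLACK_BOT_TOKEN is not set")
    payload = json.dumps({"channel": channel_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would send the message. Keeping construction separate from sending makes it possible to inspect the request without contacting Slack.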
- Clone this repository
[email protected]:F3licity/WebScanner.git
- Start a new virtual environment on the root folder of this project, using Python 3.8, and activate it.
pip3.8 install virtualenv
virtualenv venv38
source venv38/bin/activate
- Install pip and pip-tools.
- Install the pip libraries required:
pip install -r requirements.txt
Make sure to check out the CONTRIBUTING.md about the house rules. Start developing!
To contribute, please also update the documentation.
The packages required to build it are listed in docs-requirements.txt.
Install the documentation requirements:
pip install -r docs-requirements.txt
Build the documentation:
gendocs --config mkgendocs.yml
You can then run mkdocs serve and access it at http://127.0.0.1:8000/