This repository has been archived by the owner on Sep 5, 2024. It is now read-only.

CONFIG.MD

Table of Contents
  1. Crawler Configuration
  2. Exporter
  3. Global Flags

Crawler Configuration

The following sections explain each setting in the crawler configuration:

allowed-domains
  • Description: Whitelist of domains the crawler is allowed to visit.
  • Default Value: [] (empty list)
  • Example: old.reddit.com → visit only URLs on the old.reddit.com domain
body-size
  • Description: Maximum size of the HTTP response body in bytes.
  • Default Value: 0 → unlimited
cache-dir
  • Description: Directory path for caching. Leave empty for no caching.
  • Default Value: "" (empty string)
crypto
  • Description: Enable or disable crypto-related features.
  • Default Value: false
debug
  • Description: Enable or disable debugging mode for GoColly.
  • Default Value: false
disallowed-domains
  • Description: Blacklist of domains the crawler must not visit.
  • Default Value: [] (empty list)
  • Example: reddit.com → the crawler will not visit any reddit.com URLs
disallowed-url-filters
  • Description: List of regular expressions to filter disallowed URLs.
  • Default Value: [] (empty list)
  • Example: http://httpbin\.org/h.+ → skip any URL matching this pattern
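As a sketch of how these two blacklist settings might look together in the .yaml configuration (key names are taken from this document; the surrounding file structure is an assumption):

```yaml
# Hypothetical fragment of the pryingdeep .yaml configuration.
# Skip everything under reddit.com, plus any URL matching the regex.
disallowed-domains:
  - reddit.com
disallowed-url-filters:
  - http://httpbin\.org/h.+   # backslashes stay literal in unquoted YAML scalars
```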
email
  • Description: Enable or disable email-related features.
  • Default Value: false
ignore-robots-txt
  • Description: Enable or disable ignoring the robots.txt file.
  • Default Value: false
limit-delay
  • Description: Delay in seconds between requests.
  • Default Value: 0
limit-random-delay
  • Description: Random delay in seconds added to the fixed delay.
  • Default Value: 0
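Taken together, the two delay settings bound the time between consecutive requests. A hedged YAML sketch (assuming both keys sit at the top level of the crawler configuration):

```yaml
# Wait 2 seconds plus a random 0–3 seconds between requests,
# i.e. an effective delay of 2–5 seconds per request.
limit-delay: 2
limit-random-delay: 3
```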
max-depth
  • Description: Maximum depth for crawling links.
  • Default Value: 0 → unlimited depth
phone
  • Description: List of countries to parse phone numbers from.
  • Default Value: [] (empty list)
  • Example: "RU,NL,DE,US" → parse phone numbers for these countries; any subset of countries is fine, you don't have to list every one
queue-max-size
  • Description: Maximum size of the crawler's queue.
  • Default Value: 50000
queue-threads
  • Description: Number of threads used for crawling.
  • Default Value: 4
tor
  • Description: Run the crawler through a Tor proxy and allow crawling of .onion links.
  • Default Value: false
url-filters
  • Description: List of regular expressions to filter URLs.
  • Default Value: [] (empty list)
  • Examples: http://httpbin\.org/h.+ → visit only URLs matching this pattern; (?:https?://)?(?:www)?(\\S*?\\.onion)\\b → limit crawling to .onion domains only
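Because the .onion pattern contains backslashes, quoting matters when it is written into the .yaml configuration. A sketch (single quotes keep the backslashes literal; how the config parser treats them beyond that is an assumption):

```yaml
# Restrict crawling to .onion addresses only.
url-filters:
  - '(?:https?://)?(?:www)?(\S*?\.onion)\b'
```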
url-revisit
  • Description: Enable or disable revisiting URLs.
  • Default Value: false
urls
  • Description: List of starting URLs for the crawler.
  • Default Value: [] (empty list)
  • Example:
    urls:
      - https://example.com
      - https://example2.com
user-agent
  • Description: User agent string for HTTP requests.
  • Example: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
  • Source: useragents.me
keywords
  • Description: A keyword, a sentence, or a list of keywords to search for.
  • Default Value: []
  • Example: search -k owasp -k hacking -k "Please hack the box!"
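Pulling several of the settings above together, a minimal crawler configuration might look like this (a sketch, not a complete file; the values shown are either the documented defaults or illustrative):

```yaml
# Hypothetical minimal pryingdeep crawler configuration.
urls:
  - https://example.com
allowed-domains:
  - example.com          # stay on this domain only
max-depth: 2             # stop two link-hops from the start URLs
queue-threads: 4         # documented default
queue-max-size: 50000    # documented default
ignore-robots-txt: false
user-agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
```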
---

Exporter

associations
  • Description: Specify the different SQL tables you want to export from the database.
  • Default: all
  • Values:
    • "WP" - WordPress
    • "E" - Email
    • "P" - PhoneNumbers
    • "C" - Crypto
criteria
  • Value: {} - (empty json)
  • Description: Criteria for the exporter.
  • Explanation: If you use the LIKE keyword, the exporter automatically performs an SQL LIKE statement; there is no need to add extra % wildcards inside the criteria.
  • Usage:
    pryingdeep -q 'title=test,"url=LIKE example.com"'
filepath
  • Description: Filepath for the exporter output.
  • Default Value: data.json
limit
  • Description: Limit the export to a certain number of rows; 0 exports every row in the database.
  • Default Value: 0
raw-sql
  • Description: Enable or disable running raw SQL queries.
  • Default Value: false
raw-sql-filepath
  • Default: pkg/querybuilder/queries/select.sql
  • Description: Filepath for the raw SQL queries.
sort-by
  • Description: Field to sort by (a generic ORDER BY).
  • Default Value: status_code
  • Example: url
sort-order
  • Value: asc
  • Description: Sort order for the exporter.
offset
  • Value: 0
  • Description: Number of records to skip during export. Keep in mind: if you want the id to start from 1, set `sort-by` to `id` and `sort-order` to `asc`. Otherwise the filtering may behave unexpectedly, and you can get records starting from id 50 when you asked for an offset of 1.
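Combining the exporter settings above, a sketch of an exporter configuration that pages through results with stable ids (the flat key layout is an assumption; values are illustrative):

```yaml
# Hypothetical exporter fragment of the pryingdeep .yaml configuration.
filepath: data.json
limit: 50        # export 50 rows per run; 0 would export everything
offset: 0        # skip nothing on the first run
sort-by: id      # sort by id so offsets line up with row ids
sort-order: asc
```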

Global Flags

-s, --silent
  • Default: false
  • Description: Use this flag to disable logging and run silently.
-z, --save-config
  • Default: false
  • Description: Use this flag to save chosen options to your .yaml configuration.
-c, --config <path>
  • Value: The path to the .yaml configuration file. Keep the filename as pryingdeep; otherwise the program will break.
  • Description: Use this flag to specify the path to the .yaml configuration.