Merge branch 'r/1.4.0'
bitdruid committed Jul 25, 2024
1 parent a9a2027 commit b19b22a
Showing 11 changed files with 644 additions and 266 deletions.
61 changes: 41 additions & 20 deletions README.md
@@ -11,10 +11,6 @@ Internet-archive is a nice source for several OSINT-information. This tool is a

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

## Installation

### Pip
@@ -32,52 +28,77 @@ Linux recommended: On windows machines, the path length is limited. It can only
```pip install .```
- in a virtual env or use `--break-system-packages`

## Usage infos

- Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
- If you query an explicit file (e.g. a query string like `?query=this` or a page like `login.html`), the `--explicit` argument is recommended, as a wildcard query may lead to an empty result.
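
For example, a minimal sketch of such an explicit query (assuming the installed console command is `waybackup`, as provided by the pip package):

```
# fetch only this exact URL, no wildcard expansion over subdomains or paths
waybackup -u "http://example.com/login.html" -c -e
```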

## Arguments

- `-h`, `--help`: Show the help message and exit.
- `-a`, `--about`: Show information about the tool and exit.

### Required

- **`-u`**, **`--url`**:<br>
The URL of the web page to download. This argument is required.

#### Mode Selection (Choose One)
- **`-c`**, **`--current`**:<br>
Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
- **`-f`**, **`--full`**:<br>
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
- **`-s`**, **`--save`**:<br>
Save a page to the Wayback Machine. (beta)
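
A minimal sketch of the two download modes (command name `waybackup` assumed, as installed by pip):

```
# rebuild of the current site: the latest version of every file
waybackup -u http://example.com -c

# full history: one folder per snapshot timestamp
waybackup -u http://example.com -f
```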

### Optional query parameters

- **`-l`**, **`--list`**:<br>
Only print the snapshots available within the specified range. Does not download the snapshots.
- **`-e`**, **`--explicit`**:<br>
  Only download the explicitly given URL. No wildcard subdomains or paths. Use this, for example, to get root-only snapshots. Recommended for explicit files like `login.html` or queries like `?query=this`.
- **`-o`**, **`--output`**:<br>
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
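
For instance, a sketch of only listing the available snapshots while choosing a custom output folder (one of the modes above is still required):

```
waybackup -u http://example.com -f -l -o ./snapshots
```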

- **Range Selection:**<br>
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
- **`-r`**, **`--range`**:<br>
Specify the range in years for which to search and download snapshots.
- **`--start`**:<br>
Timestamp to start searching.
- **`--end`**:<br>
Timestamp to end searching.
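
Two hedged examples of range selection (command name assumed as above):

```
# snapshots from the last 5 years
waybackup -u http://example.com -f -r 5

# snapshots between January 2019 and June 2020
waybackup -u http://example.com -f --start 20190101 --end 20200601
```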

### Additional behavior manipulation

- **`--csv`** `<path>`:<br>
  Path defaults to output-dir. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains the downloaded files. Named `waybackup_<sanitized_url>.csv` (see the resume example below).

- **`--skip`** `<path>`:<br>
  Path defaults to output-dir. Checks an existing `waybackup_<sanitized_url>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root domain, which keeps the check consistent across queries: if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the paths already covered by the first.
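
A sketch of resuming an interrupted download with these two flags (both paths default to the output folder):

```
# first run: record successful downloads in waybackup_<sanitized_url>.csv
waybackup -u http://example.com -f --csv

# after an interruption: skip everything already recorded in that csv
waybackup -u http://example.com -f --csv --skip
```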

- **`--no-redirect`**:<br>
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

- **`--verbosity`** `<level>`:<br>
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).
<!-- Alternatively set verbosity level to `trace` for a very detailed output. -->

- **`--log`** `<path>`:<br>
  Path defaults to output-dir. Saves a log file with the tool's output, named `waybackup_<sanitized_url>.log`.

- **`--retry`** `<attempts>`:<br>
  Specifies the number of retry attempts for failed downloads.

- **`--workers`** `<count>`:<br>
  Sets the number of simultaneous download workers. Default is 1; up to about 10 is usually safe. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.

- **`--delay`** `<seconds>`:<br>
  Specifies the delay between download requests in seconds. Default is no delay (0).
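
A combined sketch of these tuning flags:

```
# 4 parallel downloads, 2 retries for failed files, 1 second between requests
waybackup -u http://example.com -f --workers 4 --retry 2 --delay 1
```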

<!-- - **`--convert-links`**:<br>
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->

**CDX Query Handling:**
- **`--cdxbackup`** `<path>`:<br>
102 changes: 102 additions & 0 deletions pywaybackup/Arguments.py
@@ -0,0 +1,102 @@

import sys
import os
import argparse

from pywaybackup.helper import url_split, sanitize_filename

from pywaybackup.__version__ import __version__

class Arguments:

def __init__(self):

parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')
parser.add_argument('-d', '--debug', action='store_true', help='Debug mode (always full traceback and creates an error.log)')

required = parser.add_argument_group('required (one exclusive)')
required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
exclusive_required = required.add_mutually_exclusive_group(required=True)
exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot')
exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps')
exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')

optional = parser.add_argument_group('optional query parameters')
optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (opt range in y)')
optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url')
optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')

special = parser.add_argument_group('manipulate behavior')
special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder')
special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder')
special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
special.add_argument('--verbosity', type=str, default="info", metavar="", help='["progress", "json"] for different output or ["trace"] for very detailed output')
special.add_argument('--log', type=str, nargs='?', const=True, metavar='path', help='save a log file - defaults to output folder')
special.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
# special.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
special.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')

cdx = parser.add_argument_group('cdx (one exclusive)')
exclusive_cdx = cdx.add_mutually_exclusive_group()
exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query-result to a file for recurrent use - defaults to output folder')
exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url')

auto = parser.add_argument_group('auto')
auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download')

args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help

# if args.convert_links and not args.current:
# parser.error("--convert-links can only be used with the -c/--current option")

self.args = args

def get_args(self):
return self.args

class Configuration:

@classmethod
def init(cls):

cls.args = Arguments().get_args()
for key, value in vars(cls.args).items():
setattr(Configuration, key, value)

# args now attributes of Configuration // Configuration.output, ...
cls.command = ' '.join(sys.argv[1:])
cls.domain, cls.subdir, cls.filename = url_split(cls.url)

if cls.output is None:
cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
os.makedirs(cls.output, exist_ok=True)

if cls.log is True:
cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log")

if cls.full:
cls.mode = "full"
if cls.current:
cls.mode = "current"

if cls.auto:
cls.skip = cls.output
cls.csv = cls.output
cls.cdxbackup = cls.output
cls.cdxinject = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
else:
if cls.skip is True:
cls.skip = cls.output
if cls.csv is True:
cls.csv = cls.output
if cls.cdxbackup is True:
cls.cdxbackup = cls.output
if cls.cdxinject is True:
cls.cdxinject = cls.output


182 changes: 182 additions & 0 deletions pywaybackup/Converter.py
@@ -0,0 +1,182 @@
import os
import errno
import magic
from pywaybackup.helper import url_split

from pywaybackup.Arguments import Configuration as config
from pywaybackup.Verbosity import Verbosity as vb
import re

class Converter:

@classmethod
def define_root_steps(cls, filepath) -> str:
"""
Define the steps (../) to the root directory.
"""
abs_path = os.path.abspath(filepath)
webroot_path = os.path.abspath(f"{config.output}/{config.domain}/") # webroot is the domain folder in the output
# common path between the two
common_path = os.path.commonpath([abs_path, webroot_path])
# steps up to the common path
rel_path_from_common = os.path.relpath(abs_path, common_path)
steps_up = rel_path_from_common.count(os.path.sep)
if steps_up <= 1: # if the file is in the root of the domain
return "./"
return "../" * steps_up





@classmethod
def links(cls, filepath, status_message=None):
"""
Convert all links in a HTML / CSS / JS file to local paths.
"""


def extract_urls(content) -> list:
"""
Extract all links from a file.
"""

#content = re.sub(r'\s+', '', content)
#content = re.sub(r'\n', '', content)

html_types = ["src", "href", "poster", "data-src"]
css_types = ["url"]
links = []
for html_type in html_types:
# possible formatings of the value: "url", 'url', url
matches = re.findall(f"{html_type}=[\"']?([^\"'>]+)", content)
links += matches
for css_type in css_types:
# possible formats of the value: url(url) url('url') url("url") // ends with )
matches = re.findall(rf"{css_type}\((['\"]?)([^'\"\)]+)\1\)", content)
links += [match[1] for match in matches]
links = list(set(links))
return links


def local_url(original_url, domain, count) -> str:
"""
Convert a given url to a local path.
"""
original_url_domain = url_split(original_url)[0]

# check if the url is external or internal (external is returned as is because no need to convert)
external = False
if original_url_domain != domain:
if "://" in original_url:
external = True
if original_url.startswith("//"):
external = True
if external:
status_message.trace(status="", type=f"{count}/{len(links)}", message="External url")
return original_url

# convert the url to a relative path to the local root (download dir) if it's a valid path, else return the original url
original_url_file = os.path.join(config.output, config.domain, normalize_url(original_url))
if validate_path(original_url_file):
if original_url.startswith("/"): # if only starts with /
original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('/')}"
if original_url.startswith(".//"):
original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('./')}"
if original_url_domain == domain: # if url is like https://domain.com/path/to/file
original_url = f"{cls.define_root_steps(filepath)}{original_url.split(domain)[1].lstrip('/')}"
if original_url.startswith("../"): # if file is already ../ check if it's not too many steps up
original_url = f"{cls.define_root_steps(filepath)}{original_url.split('../')[-1].lstrip('/')}"
else:
status_message.trace(status="", type="", message=f"{count}/{len(links)}: URL is not a valid path")

return original_url





def normalize_url(url) -> str:
"""
Normalize a given URL by removing its protocol, domain and parent directory references.
Example 1:
- input: https://domain.com/path/to/file
- output: /path/to/file
Example 2:
- input: ../path/to/file
- output: /path/to/file
"""
try:
url = "/" + url.split("../")[-1]
except IndexError:
pass
if url.startswith("//"):
url = "/" + url.split("//")[1]
parsed_url = url_split(url)
return f"{parsed_url[1]}/{parsed_url[2]}"


def is_pathname_valid(pathname: str) -> bool:
"""
Check if a given pathname is valid.
"""
if not isinstance(pathname, str) or not pathname:
return False

try:
os.lstat(pathname)
except OSError as exc:
if exc.errno == errno.ENOENT:
return True
elif exc.errno in {errno.ENAMETOOLONG, errno.ERANGE}:
return False
return True

def is_path_creatable(pathname: str) -> bool:
"""
Check if a given path is creatable.
"""
dirname = os.path.dirname(pathname) or os.getcwd()
return os.access(dirname, os.W_OK)

def is_path_exists_or_creatable(pathname: str) -> bool:
"""
Check if a given path exists or is creatable.
"""
return is_pathname_valid(pathname) or is_path_creatable(pathname)

def validate_path(filepath: str) -> bool:
"""
Validate if a given path can exist.
"""
return is_path_exists_or_creatable(filepath)





if os.path.isfile(filepath):
if magic.from_file(filepath, mime=True).split("/")[1] == "javascript":
status_message.trace(status="Error", type="", message="JS-file is not supported")
return
try:
with open(filepath, "r") as file:
domain = config.domain
content = file.read()
links = extract_urls(content)
status_message.store(message=f"\n-----> Convert: [{len(links)}] links in file")
count = 1
for original_link in links:
status_message.trace(status="ORIG", type=f"{count}/{len(links)}", message=original_link)
new_link = local_url(original_link, domain, count)
if new_link != original_link:
status_message.trace(status="CONV", type=f"{count}/{len(links)}", message=new_link)
content = content.replace(original_link, new_link)
count += 1
file = open(filepath, "w")
file.write(content)
file.close()
except UnicodeDecodeError:
status_message.trace(status="Error", type="", message="Could not decode file to convert links")