diff --git a/README.md b/README.md
index 70a907e..ed7cae1 100644
--- a/README.md
+++ b/README.md
@@ -11,10 +11,6 @@ Internet-archive is a nice source for several OSINT-information. This tool is a
This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
-## Info
-
-Linux recommended: On windows machines, the path length is limited. It can only be overcome by editing the registry. Files which exceed the path length will not be downloaded.
-
## Installation
### Pip
@@ -32,6 +28,11 @@ Linux recommended: On windows machines, the path length is limited. It can only
```pip install .```
- in a virtual env or use `--break-system-package`
+## Usage notes
+
+- Linux is recommended: on Windows machines the path length is limited, which can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
+- If you query an explicit file (e.g. one with a query string like `?query=this`, or `login.html`), the `--explicit` argument is recommended, as a wildcard query may return an empty result (see the example below).
+
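+For example, assuming the CLI entry point is `waybackup` (hypothetical domain):
+
+```
+# wildcard query: latest version of every file under the domain
+waybackup -u http://example.com -c
+
+# explicit query: only the given file, no wildcard subdomains or paths
+waybackup -u http://example.com/login.html -c -e
+```
+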
## Arguments
- `-h`, `--help`: Show the help message and exit.
@@ -39,25 +40,35 @@ Linux recommended: On windows machines, the path length is limited. It can only
### Required
-- `-u`, `--url`: The URL of the web page to download. This argument is required.
+- **`-u`**, **`--url`**:
+ The URL of the web page to download. This argument is required.
#### Mode Selection (Choose One)
-- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
-- `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
-- `-s`, `--save`: Save a page to the Wayback Machine. (beta)
+- **`-c`**, **`--current`**:
+ Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (though not any original state, since new and old versions are mixed).
+- **`-f`**, **`--full`**:
+ Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
+- **`-s`**, **`--save`**:
+ Save a page to the Wayback Machine. (beta)
### Optional query parameters
-- `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
-- `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths. Use e.g. to get root-only snapshots.
-- `-o`, `--output`: Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+- **`-l`**, **`--list`**:
+ Only print the snapshots available within the specified range. Does not download the snapshots.
+- **`-e`**, **`--explicit`**:
+ Only download the explicitly given URL, without wildcard subdomains or paths. Use it, for example, to get root-only snapshots. Recommended for explicit files like `login.html` or URLs with a query string like `?query=this`.
+- **`-o`**, **`--output`**:
+ The folder where downloaded files will be saved. Defaults to `waybackup_snapshots` in the current directory.
- **Range Selection:**
-Specify the range in years or a specific timestamp either start, end or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.
-(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
- - `-r`, `--range`: Specify the range in years for which to search and download snapshots.
- - `--start`: Timestamp to start searching.
- - `--end`: Timestamp to end searching.
+ Specify the range either in years or as specific timestamps for start, end, or both. If you specify the `range` argument, the `start` and `end` arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give just a year, or add precision from left to right (see the example below).
+ (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
+ - **`-r`**, **`--range`**:
+ Specify the range in years for which to search and download snapshots.
+ - **`--start`**:
+ Timestamp to start searching.
+ - **`--end`**:
+ Timestamp to end searching.
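+
+ For example (hypothetical domain), assuming the CLI entry point `waybackup`:
+
+ ```
+ # snapshots from the last 5 years
+ waybackup -u http://example.com -c -r 5
+
+ # snapshots between January 2019 and January 2020
+ waybackup -u http://example.com -f --start 20190101 --end 20200101
+ ```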
### Additional behavior manipulation
@@ -65,19 +76,29 @@ Specify the range in years or a specific timestamp either start, end or both. If
Path defaults to output-dir. Saves a CSV file with the json-response for successfull downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, CSV contains downloaded files. Named as `waybackup_.csv`.
- **`--skip`** ``:
-Path defaults to output-dir. Checks for an existing `waybackup_.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
+Path defaults to output-dir. Checks an existing `waybackup_.csv` for URLs to skip downloading. Useful for resuming interrupted downloads. Files are checked by their root domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
- **`--no-redirect`**:
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
- **`--verbosity`** ``:
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).
+
+
+- **`--log`** ``:
+Path defaults to output-dir. Saves a log file with the output of the tool. Named as `waybackup_.log`.
+
+- **`--workers`** ``:
+Sets the number of simultaneous download workers. Default is 1; a safe maximum is about 10. Be cautious: too many workers may lead to refused connections from the Wayback Machine.
- **`--retry`** ``:
Specifies number of retry attempts for failed downloads.
-
-- **`--workers`** ``:
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+
+- **`--delay`** ``:
+Specifies the delay between download requests in seconds. Default is no delay (0). See the combined example below.
+
+
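+For example (hypothetical domain), combining the options above:
+
+```
+# 3 retries per failed download, 2 workers, 1 second delay between requests
+waybackup -u http://example.com -f --retry 3 --workers 2 --delay 1
+```
+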
**CDX Query Handling:**
- **`--cdxbackup`** ``:
diff --git a/pywaybackup/Arguments.py b/pywaybackup/Arguments.py
new file mode 100644
index 0000000..a6a7bab
--- /dev/null
+++ b/pywaybackup/Arguments.py
@@ -0,0 +1,102 @@
+
+import sys
+import os
+import argparse
+
+from pywaybackup.helper import url_split, sanitize_filename
+
+from pywaybackup.__version__ import __version__
+
+class Arguments:
+
+ def __init__(self):
+
+ parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
+ parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')
+        parser.add_argument('-d', '--debug', action='store_true', help='Debug mode (always prints full traceback and creates an error.log)')
+
+ required = parser.add_argument_group('required (one exclusive)')
+ required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
+ exclusive_required = required.add_mutually_exclusive_group(required=True)
+ exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot')
+ exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps')
+ exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')
+
+ optional = parser.add_argument_group('optional query parameters')
+        optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (optional range in years)')
+        optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicitly given url')
+ optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
+ optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
+ optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
+ optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
+
+ special = parser.add_argument_group('manipulate behavior')
+ special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder')
+ special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder')
+ special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
+ special.add_argument('--verbosity', type=str, default="info", metavar="", help='["progress", "json"] for different output or ["trace"] for very detailed output')
+ special.add_argument('--log', type=str, nargs='?', const=True, metavar='path', help='save a log file - defaults to output folder')
+ special.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
+ special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
+ # special.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
+ special.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
+
+ cdx = parser.add_argument_group('cdx (one exclusive)')
+ exclusive_cdx = cdx.add_mutually_exclusive_group()
+        exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query result to a file for recurring use - defaults to output folder')
+ exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url')
+
+ auto = parser.add_argument_group('auto')
+ auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download')
+
+ args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help
+
+ # if args.convert_links and not args.current:
+ # parser.error("--convert-links can only be used with the -c/--current option")
+
+ self.args = args
+
+ def get_args(self):
+ return self.args
+
+class Configuration:
+
+ @classmethod
+ def init(cls):
+
+ cls.args = Arguments().get_args()
+ for key, value in vars(cls.args).items():
+ setattr(Configuration, key, value)
+
+ # args now attributes of Configuration // Configuration.output, ...
+ cls.command = ' '.join(sys.argv[1:])
+ cls.domain, cls.subdir, cls.filename = url_split(cls.url)
+
+ if cls.output is None:
+ cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
+ os.makedirs(cls.output, exist_ok=True)
+
+ if cls.log is True:
+ cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log")
+
+ if cls.full:
+ cls.mode = "full"
+ if cls.current:
+ cls.mode = "current"
+
+ if cls.auto:
+ cls.skip = cls.output
+ cls.csv = cls.output
+ cls.cdxbackup = cls.output
+ cls.cdxinject = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
+ else:
+ if cls.skip is True:
+ cls.skip = cls.output
+ if cls.csv is True:
+ cls.csv = cls.output
+ if cls.cdxbackup is True:
+ cls.cdxbackup = cls.output
+ if cls.cdxinject is True:
+ cls.cdxinject = cls.output
+
+
diff --git a/pywaybackup/Converter.py b/pywaybackup/Converter.py
new file mode 100644
index 0000000..14ace94
--- /dev/null
+++ b/pywaybackup/Converter.py
@@ -0,0 +1,182 @@
+import os
+import errno
+import magic
+from pywaybackup.helper import url_split
+
+from pywaybackup.Arguments import Configuration as config
+from pywaybackup.Verbosity import Verbosity as vb
+import re
+
+class Converter:
+
+ @classmethod
+ def define_root_steps(cls, filepath) -> str:
+ """
+ Define the steps (../) to the root directory.
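+
+        Hypothetical example: a file at <output>/<domain>/a/b/file.html lies two
+        steps below the webroot <output>/<domain>, so the result is "../../".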
+ """
+ abs_path = os.path.abspath(filepath)
+ webroot_path = os.path.abspath(f"{config.output}/{config.domain}/") # webroot is the domain folder in the output
+ # common path between the two
+ common_path = os.path.commonpath([abs_path, webroot_path])
+ # steps up to the common path
+ rel_path_from_common = os.path.relpath(abs_path, common_path)
+ steps_up = rel_path_from_common.count(os.path.sep)
+ if steps_up <= 1: # if the file is in the root of the domain
+ return "./"
+ return "../" * steps_up
+
+
+
+
+
+ @classmethod
+ def links(cls, filepath, status_message=None):
+ """
+ Convert all links in a HTML / CSS / JS file to local paths.
+ """
+
+
+ def extract_urls(content) -> list:
+ """
+ Extract all links from a file.
+ """
+
+ #content = re.sub(r'\s+', '', content)
+ #content = re.sub(r'\n', '', content)
+
+ html_types = ["src", "href", "poster", "data-src"]
+ css_types = ["url"]
+ links = []
+ for html_type in html_types:
+ # possible formatings of the value: "url", 'url', url
+ matches = re.findall(f"{html_type}=[\"']?([^\"'>]+)", content)
+ links += matches
+ for css_type in css_types:
+ # possible formatings of the value: url(url) url('url') url("url") // ends with )
+ matches = re.findall(rf"{css_type}\((['\"]?)([^'\"\)]+)\1\)", content)
+ links += [match[1] for match in matches]
+ links = list(set(links))
+ return links
+
+
+ def local_url(original_url, domain, count) -> str:
+ """
+ Convert a given url to a local path.
+ """
+ original_url_domain = url_split(original_url)[0]
+
+ # check if the url is external or internal (external is returned as is because no need to convert)
+ external = False
+ if original_url_domain != domain:
+ if "://" in original_url:
+ external = True
+ if original_url.startswith("//"):
+ external = True
+ if external:
+ status_message.trace(status="", type=f"{count}/{len(links)}", message="External url")
+ return original_url
+
+ # convert the url to a relative path to the local root (download dir) if it's a valid path, else return the original url
+ original_url_file = os.path.join(config.output, config.domain, normalize_url(original_url))
+ if validate_path(original_url_file):
+ if original_url.startswith("/"): # if only starts with /
+ original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('/')}"
+ if original_url.startswith(".//"):
+ original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('./')}"
+ if original_url_domain == domain: # if url is like https://domain.com/path/to/file
+ original_url = f"{cls.define_root_steps(filepath)}{original_url.split(domain)[1].lstrip('/')}"
+ if original_url.startswith("../"): # if file is already ../ check if it's not too many steps up
+ original_url = f"{cls.define_root_steps(filepath)}{original_url.split('../')[-1].lstrip('/')}"
+ else:
+ status_message.trace(status="", type="", message=f"{count}/{len(links)}: URL is not a valid path")
+
+ return original_url
+
+
+
+
+
+ def normalize_url(url) -> str:
+ """
+            Normalize a given URL by removing its protocol, domain, and parent-directory references.
+
+            Example 1:
+            - input: https://domain.com/path/to/file
+            - output: /path/to/file
+
+            Example 2:
+            - input: ../path/to/file
+            - output: /path/to/file
+ """
+            # str.split(...)[-1] never raises IndexError, so no try/except is needed
+            url = "/" + url.split("../")[-1]
+ if url.startswith("//"):
+ url = "/" + url.split("//")[1]
+ parsed_url = url_split(url)
+ return f"{parsed_url[1]}/{parsed_url[2]}"
+
+
+ def is_pathname_valid(pathname: str) -> bool:
+ """
+ Check if a given pathname is valid.
+ """
+ if not isinstance(pathname, str) or not pathname:
+ return False
+
+ try:
+ os.lstat(pathname)
+ except OSError as exc:
+ if exc.errno == errno.ENOENT:
+ return True
+ elif exc.errno in {errno.ENAMETOOLONG, errno.ERANGE}:
+ return False
+ return True
+
+ def is_path_creatable(pathname: str) -> bool:
+ """
+ Check if a given path is creatable.
+ """
+ dirname = os.path.dirname(pathname) or os.getcwd()
+ return os.access(dirname, os.W_OK)
+
+ def is_path_exists_or_creatable(pathname: str) -> bool:
+ """
+ Check if a given path exists or is creatable.
+ """
+ return is_pathname_valid(pathname) or is_path_creatable(pathname)
+
+ def validate_path(filepath: str) -> bool:
+ """
+ Validate if a given path can exist.
+ """
+ return is_path_exists_or_creatable(filepath)
+
+
+
+
+
+ if os.path.isfile(filepath):
+ if magic.from_file(filepath, mime=True).split("/")[1] == "javascript":
+ status_message.trace(status="Error", type="", message="JS-file is not supported")
+ return
+ try:
+ with open(filepath, "r") as file:
+ domain = config.domain
+ content = file.read()
+ links = extract_urls(content)
+ status_message.store(message=f"\n-----> Convert: [{len(links)}] links in file")
+ count = 1
+ for original_link in links:
+ status_message.trace(status="ORIG", type=f"{count}/{len(links)}", message=original_link)
+ new_link = local_url(original_link, domain, count)
+ if new_link != original_link:
+ status_message.trace(status="CONV", type=f"{count}/{len(links)}", message=new_link)
+ content = content.replace(original_link, new_link)
+ count += 1
+                    # write the converted content back (reopen the file for writing)
+                    with open(filepath, "w") as file:
+                        file.write(content)
+ except UnicodeDecodeError:
+ status_message.trace(status="Error", type="", message="Could not decode file to convert links")
diff --git a/pywaybackup/SnapshotCollection.py b/pywaybackup/SnapshotCollection.py
index 3053571..7131626 100644
--- a/pywaybackup/SnapshotCollection.py
+++ b/pywaybackup/SnapshotCollection.py
@@ -70,10 +70,10 @@ def create_output(cls, url: str, timestamp: str, output: str):
download_dir = os.path.join(output, domain, timestamp, subdir)
download_file = os.path.abspath(os.path.join(download_dir, filename))
return download_file
-
+
@classmethod
- def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
+ def entry_modify(cls, collection_entry: dict, key: str, value: str):
"""
Modify a key-value pair in a snapshot entry of the collection (dict).
diff --git a/pywaybackup/Verbosity.py b/pywaybackup/Verbosity.py
index e7a08f0..4f4fc1d 100644
--- a/pywaybackup/Verbosity.py
+++ b/pywaybackup/Verbosity.py
@@ -2,49 +2,113 @@
import json
from pywaybackup.SnapshotCollection import SnapshotCollection as sc
-
class Verbosity:
+ LEVELS = ["trace", "info"]
+ level = None
+
mode = None
args = None
pbar = None
- new_debug = True
- debug = False
- output = None
- command = None
+ log = None
@classmethod
- def init(cls, v_args: list, debug=False, output=None, command=None):
+ def init(cls, v_args: list, log=None):
cls.args = v_args
- cls.output = output
- cls.command = command
+ cls.log = open(log, "w") if log else None
if cls.args == "progress":
cls.mode = "progress"
elif cls.args == "json":
cls.mode = "json"
- else:
- cls.mode = "standard"
- cls.debug = True if debug else False
+ cls.level = cls.args if cls.args in cls.LEVELS else "info"
@classmethod
def fini(cls):
if cls.mode == "progress":
- if cls.pbar is not None: cls.pbar.close()
+ if cls.pbar is not None:
+ cls.pbar.close()
if cls.mode == "json":
print(json.dumps(sc.SNAPSHOT_COLLECTION, indent=4, sort_keys=True))
+ if cls.log:
+ cls.log.close()
@classmethod
- def write(cls, message: str = None, progress: int = None):
+ def write(cls, status="", type="", message=""):
+ """
+ Write a log line based on the provided status, type, and message.
+
+ Args:
+ status (str): The status of the log line. (e.g. "SUCCESS", "REDIRECT")
+ type (str): The type of the log line. (e.g. "URL", "FILE")
+ message (str): The message to be logged. (e.g. actual url, file path)
+ """
+ logline = cls.generate_logline(status=status, type=type, message=message)
+ if cls.mode != "progress" and cls.mode != "json":
+ if logline:
+ print(logline)
+ if cls.log:
+ cls.log.write(logline + "\n")
+ cls.log.flush()
+
+ @classmethod
+ def progress(cls, progress: int):
if cls.mode == "progress":
if cls.pbar is None and progress == 0:
maxval = sc.count(collection=True)
cls.pbar = tqdm.tqdm(total=maxval, desc="Downloading", unit=" snapshot", ascii="░▒█")
- if cls.pbar is not None and progress is not None and progress > 0 :
+ if cls.pbar is not None and progress is not None and progress > 0:
cls.pbar.update(progress)
cls.pbar.refresh()
- elif cls.mode == "json":
- pass
- else:
- if message:
- print(message)
\ No newline at end of file
+
+ @classmethod
+ def generate_logline(cls, status: str = "", type: str = "", message: str = ""):
+
+ if not status and not type:
+ return message
+
+ status_length = 11
+ type_length = 5
+
+ status = status.ljust(status_length)
+ type = type.ljust(type_length)
+
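+        # e.g. status="SUCCESS", type="HTTP", message="200 - OK" yields:
+        #   "SUCCESS     -> HTTP : 200 - OK"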
+ log_entry = f"{status} -> {type}: {message}"
+
+ return log_entry
+
+class Message(Verbosity):
+ """
+    A message buffer for the Verbosity class.
+
+    Stores messages so they can be stacked and written later.
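+
+    Typical usage (a sketch):
+        msg = Message()
+        msg.store(message="header line")
+        msg.trace(status="ORIG", type="1/3", message="http://...")
+        msg.write()  # emits everything permitted by the current verbosity level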
+ """
+
+ def __init__(self):
+ self.message = {}
+
+ def __str__(self):
+        return str(self.message)
+
+ def store(self, status: str = "", type: str = "", message: str = "", level: str = "info"):
+ if level not in self.message:
+ self.message[level] = []
+ self.message[level].append(super().generate_logline(status, type, message))
+
+ def clear(self):
+ self.message = {}
+
+ def write(self):
+ for level in self.message:
+ if self.check_level(level):
+ for message in self.message[level]:
+ super().write(message=message)
+ self.clear()
+
+ def check_level(self, level: str):
+ return super().LEVELS.index(level) >= super().LEVELS.index(self.level)
+
+ def trace(self, status: str = "", type: str = "", message: str = ""):
+ self.store(status, type, message, "trace")
+
+
\ No newline at end of file
diff --git a/pywaybackup/__version__.py b/pywaybackup/__version__.py
index 9e2406e..d60e0c1 100644
--- a/pywaybackup/__version__.py
+++ b/pywaybackup/__version__.py
@@ -1 +1 @@
-__version__ = "1.3.0"
\ No newline at end of file
+__version__ = "1.4.0"
\ No newline at end of file
diff --git a/pywaybackup/archive.py b/pywaybackup/archive.py
index d908558..e2df8fd 100644
--- a/pywaybackup/archive.py
+++ b/pywaybackup/archive.py
@@ -13,12 +13,15 @@
from socket import timeout
-from pywaybackup.helper import url_get_timestamp, url_split, move_index, sanitize_filename, check_nt
+from pywaybackup.helper import url_get_timestamp, move_index, sanitize_filename, check_nt
from pywaybackup.SnapshotCollection import SnapshotCollection as sc
+from pywaybackup.Arguments import Configuration as config
from pywaybackup.__version__ import __version__
+from pywaybackup.Converter import Converter as convert
+from pywaybackup.Verbosity import Message
from pywaybackup.Verbosity import Verbosity as vb
from pywaybackup.Exception import Exception as ex
@@ -41,40 +44,40 @@ def save_page(url: str):
Returns:
None: The function does not return any value. It only prints messages to the console.
"""
- vb.write("\nSaving page to the Wayback Machine...")
+ vb.write(message="\nSaving page to the Wayback Machine...")
connection = http.client.HTTPSConnection("web.archive.org")
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
connection.request("GET", f"https://web.archive.org/save/{url}", headers=headers)
- vb.write("\n-----> Request sent")
+ vb.write(message="\n-----> Request sent")
response = connection.getresponse()
response_status = response.status
if response_status == 302:
location = response.getheader("Location")
- vb.write("\n-----> Response: 302 (redirect to snapshot)")
+ vb.write(message="\n-----> Response: 302 (redirect to snapshot)")
snapshot_timestamp = datetime.strptime(url_get_timestamp(location), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S')
current_timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
timestamp_difference = (datetime.strptime(current_timestamp, '%Y-%m-%d %H:%M:%S') - datetime.strptime(snapshot_timestamp, '%Y-%m-%d %H:%M:%S')).seconds / 60
timestamp_difference = int(round(timestamp_difference, 0))
if timestamp_difference < 1:
- vb.write("\n-----> New snapshot created")
+ vb.write(message="\n-----> New snapshot created")
elif timestamp_difference > 1:
- vb.write(f"\n-----> Snapshot already exists. (1 hour limit) - wait for {60 - timestamp_difference} minutes")
- vb.write(f"TIMESTAMP SNAPSHOT: {snapshot_timestamp}")
- vb.write(f"TIMESTAMP REQUEST : {current_timestamp}")
- vb.write(f"\nLAST SNAPSHOT BACK: {timestamp_difference} minutes")
+ vb.write(message=f"\n-----> Snapshot already exists. (1 hour limit) - wait for {60 - timestamp_difference} minutes")
+ vb.write(message=f"TIMESTAMP SNAPSHOT: {snapshot_timestamp}")
+ vb.write(message=f"TIMESTAMP REQUEST : {current_timestamp}")
+ vb.write(message=f"\nLAST SNAPSHOT BACK: {timestamp_difference} minutes")
- vb.write(f"\nURL: {location}")
+ vb.write(message=f"\nURL: {location}")
elif response_status == 404:
- vb.write("\n-----> Response: 404 (not found)")
- vb.write(f"\nFAILED -> URL: {url}")
+ vb.write(message="\n-----> Response: 404 (not found)")
+ vb.write(message=f"\nFAILED -> URL: {url}")
else:
- vb.write("\n-----> Response: unexpected")
- vb.write(f"\nFAILED -> URL: {url}")
+ vb.write(message="\n-----> Response: unexpected")
+ vb.write(message=f"\nFAILED -> URL: {url}")
connection.close()
@@ -82,13 +85,13 @@ def save_page(url: str):
def print_list():
- vb.write("")
+ vb.write(message="")
count = sc.count(collection=True)
if count == 0:
- vb.write("\nNo snapshots found")
+ vb.write(message="\nNo snapshots found")
else:
__import__('pprint').pprint(sc.SNAPSHOT_COLLECTION)
- vb.write(f"\n-----> {count} snapshots listed")
+ vb.write(message=f"\n-----> {count} snapshots listed")
@@ -96,22 +99,22 @@ def print_list():
# create filelist
# timestamp format yyyyMMddhhmmss
-def query_list(url: str, range: int, start: int, end: int, explicit: bool, mode: str, cdxbackup: str, cdxinject: str):
-
+def query_list(range: int, start: int, end: int, explicit: bool, mode: str, cdxbackup: str, cdxinject: str):
+
def inject(cdxinject):
if os.path.isfile(cdxinject):
- vb.write("\nInjecting CDX data...")
+ vb.write(message="\nInjecting CDX data...")
cdxResult = open(cdxinject, "r")
cdxResult = cdxResult.read()
linecount = cdxResult.count("\n") - 1
- vb.write(f"\n-----> {linecount} snapshots injected")
+ vb.write(message=f"\n-----> {linecount} snapshots injected")
return cdxResult
else:
- vb.write("\nNo CDX file found to inject - querying snapshots...")
+ vb.write(message="\nNo CDX file found to inject - querying snapshots...")
return False
- def query(url, range, start, end, explicit):
- vb.write("\nQuerying snapshots...")
+ def query(range, start, end, explicit):
+ vb.write(message="\nQuerying snapshots...")
query_range = ""
if not range:
if start: query_range = query_range + f"&from={start}"
@@ -119,40 +122,42 @@ def query(url, range, start, end, explicit):
else:
query_range = "&from=" + str(datetime.now().year - range)
- domain, subdir, filename = url_split(url)
- if domain and not subdir and not filename:
- cdx_url = f"*.{domain}/*" if not explicit else f"{domain}"
- if domain and subdir and not filename:
- cdx_url = f"{domain}/{subdir}/*"
- if domain and subdir and filename:
- cdx_url = f"{domain}/{subdir}/{filename}/*"
- if domain and not subdir and filename:
- cdx_url = f"{domain}/{filename}/*"
-
- vb.write(f"---> {cdx_url}")
+ if config.domain and not config.subdir and not config.filename:
+ cdx_url = f"{config.domain}"
+ if config.domain and config.subdir and not config.filename:
+ cdx_url = f"{config.domain}/{config.subdir}"
+ if config.domain and config.subdir and config.filename:
+ cdx_url = f"{config.domain}/{config.subdir}/{config.filename}"
+ if config.domain and not config.subdir and config.filename:
+ cdx_url = f"{config.domain}/{config.filename}"
+ if not explicit:
+ cdx_url = f"{cdx_url}/*"
+
+ vb.write(message=f"---> {cdx_url}")
cdxQuery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original&filter!=statuscode:200"
try:
cdxResult = requests.get(cdxQuery).text
except requests.exceptions.ConnectionError as e:
- vb.write("\nCONNECTION REFUSED -> could not query cdx server (max retries exceeded)\n")
+ vb.write(message="\nCONNECTION REFUSED -> could not query cdx server (max retries exceeded)\n")
os._exit(1)
if cdxbackup:
os.makedirs(cdxbackup, exist_ok=True)
- with open(os.path.join(cdxbackup, f"waybackup_{sanitize_filename(url)}.cdx"), "w") as file:
+ with open(os.path.join(cdxbackup, f"waybackup_{sanitize_filename(config.url)}.cdx"), "w") as file:
file.write(cdxResult)
- vb.write("\n-----> CDX backup generated")
+ vb.write(message="\n-----> CDX backup generated")
return cdxResult
+ cdxResult = None
if cdxinject:
cdxResult = inject(cdxinject)
if not cdxResult:
- cdxResult = query(url, range, start, end, explicit)
+ cdxResult = query(range, start, end, explicit)
cdxResult = json.loads(cdxResult)
sc.create_list(cdxResult, mode)
- vb.write(f"\n-----> {sc.count(collection=True)} snapshots to utilize")
+ vb.write(message=f"\n-----> {sc.count(collection=True)} snapshots to utilize")
@@ -160,19 +165,20 @@ def query(url, range, start, end, explicit):
# example download: http://web.archive.org/web/20190815104545id_/https://www.google.com/
-def download_list(output, retry, no_redirect, workers, skipset: set = None):
+def download_list(output, retry, no_redirect, delay, workers, skipset: set = None):
"""
Download a list of urls in format: [{"timestamp": "20190815104545", "url": "https://www.google.com/"}]
"""
if sc.count(collection=True) == 0:
- vb.write("\nNothing to download");
+        vb.write(message="\nNothing to download")
return
- vb.write("\nDownloading snapshots...", progress=0)
+    vb.write(message="\nDownloading snapshots...")
+ vb.progress(0)
if workers > 1:
- vb.write(f"\n-----> Simultaneous downloads: {workers}")
+ vb.write(message=f"\n-----> Simultaneous downloads: {workers}")
sc.create_collection()
- vb.write("\n-----> Snapshots prepared")
+ vb.write(message="\n-----> Snapshots prepared")
# create queue with snapshots and skip already downloaded urls
snapshot_queue = queue.Queue()
@@ -182,30 +188,30 @@ def download_list(output, retry, no_redirect, workers, skipset: set = None):
skip_count += 1
continue
snapshot_queue.put(snapshot)
- vb.write(progress=skip_count)
+ vb.progress(skip_count)
if skip_count > 0:
- vb.write(f"\n-----> Skipped snapshots: {skip_count}")
+ vb.write(message=f"\n-----> Skipped snapshots: {skip_count}")
threads = []
worker = 0
for worker in range(workers):
worker += 1
- vb.write(f"\n-----> Starting worker: {worker}")
- thread = threading.Thread(target=download_loop, args=(snapshot_queue, output, worker, retry, no_redirect, skipset))
+ vb.write(message=f"\n-----> Starting worker: {worker}")
+ thread = threading.Thread(target=download_loop, args=(snapshot_queue, output, worker, retry, no_redirect, delay, skipset))
threads.append(thread)
thread.start()
for thread in threads:
thread.join()
successed = sc.count(success=True)
failed = sc.count(fail=True)
- vb.write(f"\nFiles downloaded: {successed}")
- vb.write(f"Not downloaded: {failed}\n")
+ vb.write(message=f"\nFiles downloaded: {successed}")
+ vb.write(message=f"Not downloaded: {failed}\n")
-def download_loop(snapshot_queue, output, worker, retry, no_redirect, skipset=None, attempt=1, connection=None, failed_urls=[]):
+def download_loop(snapshot_queue, output, worker, retry, no_redirect, delay, skipset=None, attempt=1, connection=None, failed_urls=[]):
"""
Download a snapshot of the queue. If a download fails, the function will retry the download.
The "snapshot_collection" dictionary will be updated with the download status and file information.
@@ -213,32 +219,35 @@ def download_loop(snapshot_queue, output, worker, retry, no_redirect, skipset=No
"""
try:
max_attempt = retry if retry > 0 else retry + 1
- if not connection:
- connection = http.client.HTTPSConnection("web.archive.org")
+ connection = connection or http.client.HTTPSConnection("web.archive.org")
if attempt > max_attempt:
connection.close()
- vb.write(f"\n-----> Worker: {worker} - Failed downloads: {len(failed_urls)}")
return
+
while not snapshot_queue.empty():
snapshot = snapshot_queue.get()
- status = f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{sc.SNAPSHOT_COLLECTION.index(snapshot)+1}/{len(sc.SNAPSHOT_COLLECTION)}] - Worker: {worker}"
- download_status = download(output, snapshot, connection, status, no_redirect)
- if not download_status:
- if snapshot not in failed_urls:
- failed_urls.append(snapshot)
+ status_message = Message()
+ status_message.store(message=f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{sc.SNAPSHOT_COLLECTION.index(snapshot)+1}/{len(sc.SNAPSHOT_COLLECTION)}] - Worker: {worker}")
+ download_status = download(output, snapshot, connection, status_message, no_redirect)
+ if not download_status and snapshot not in failed_urls:
+ failed_urls.append(snapshot)
if download_status:
if snapshot in failed_urls:
failed_urls.remove(snapshot)
- vb.write(progress=1)
+ vb.progress(1)
+ if delay > 0:
+ vb.write(message=f"\n-----> Worker: {worker} - Delay: {delay} seconds")
+ time.sleep(delay)
+
if failed_urls:
if not attempt > max_attempt:
attempt += 1
- vb.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
+ vb.write(message=f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
time.sleep(15)
- download_loop(snapshot_queue, output, worker, retry, no_redirect, skipset, attempt, connection, failed_urls)
+ download_loop(snapshot_queue, output, worker, retry, no_redirect, delay, skipset, attempt, connection, failed_urls)
except Exception as e:
ex.exception(f"Worker: {worker} - Exception", e)
- snapshot_queue.put(snapshot) # requeue snapshot if worker crashes
+ snapshot_queue.put(snapshot) # requeue snapshot if worker crashes
@@ -256,51 +265,35 @@ def download(output, snapshot_entry, connection, status_message, no_redirect=Fal
max_retries = 2
sleep_time = 45
headers = {'User-Agent': f'bitdruid-python-wayback-downloader/{__version__}'}
+ success = False
for i in range(max_retries):
try:
- connection.request("GET", encoded_download_url, headers=headers)
- response = connection.getresponse()
- response_data = response.read()
- response_status = response.status
- response_status_message = parse_response_code(response_status)
- sc.snapshot_entry_modify(snapshot_entry, "response", response_status)
- if not no_redirect:
- if response_status == 302:
- status_message = f"{status_message}\n" + \
- f"REDIRECT -> HTTP: {response.status} - {response_status_message}\n" + \
- f" -> FROM: {download_url}"
- redirect_count = 0
- while response_status == 302:
- redirect_count += 1
- if redirect_count > 5:
- break
- connection.request("GET", encoded_download_url, headers=headers)
- response = connection.getresponse()
- response_data = response.read()
- response_status = response.status
- response_status_message = parse_response_code(response_status)
- location = response.getheader("Location")
- if location:
- encoded_download_url = urllib.parse.quote(urljoin(download_url, location), safe=':/')
- status_message = f"{status_message}\n" + \
- f" -> TO: {download_url}"
- sc.snapshot_entry_modify(snapshot_entry, "redirect_timestamp", url_get_timestamp(location))
- sc.snapshot_entry_modify(snapshot_entry, "redirect_url", download_url)
- else:
- break
+ response, response_data, response_status, response_status_message = download_response(connection, encoded_download_url, headers)
+ sc.entry_modify(snapshot_entry, "response", response_status)
+ if not no_redirect and response_status == 302:
+ status_message.store(status="REDIRECT", type="HTTP", message=f"{response.status} - {response_status_message}")
+ status_message.store(status="", type="FROM", message=download_url)
+ for _ in range(5):
+ response, response_data, response_status, response_status_message = download_response(connection, encoded_download_url, headers)
+ location = response.getheader("Location")
+ if location:
+ encoded_download_url = urllib.parse.quote(urljoin(download_url, location), safe=':/')
+ status_message.store(status="", type="TO", message=location)
+ sc.entry_modify(snapshot_entry, "redirect_timestamp", url_get_timestamp(location))
+ sc.entry_modify(snapshot_entry, "redirect_url", download_url)
+ else:
+ break
if response_status == 200:
output_file = sc.create_output(download_url, snapshot_entry["timestamp"], output)
output_path = os.path.dirname(output_file)
# if output_file is too long for windows, skip download
if check_nt() and len(output_file) > 255:
- status_message = f"{status_message}\n" + \
- f"PATH TOO LONG TO SAVE FILE -> HTTP: {response_status} - {response_status_message}\n" + \
- f" -> URL: {download_url}"
- sc.snapshot_entry_modify(snapshot_entry, "file", "PATH TOO LONG TO SAVE FILE")
- vb.write(status_message)
- return True
-
+ status_message.store(status="PATH > 255", type="HTTP", message=f"{response.status} - {response_status_message}")
+ status_message.store(status="", type="URL", message=download_url)
+ sc.entry_modify(snapshot_entry, "file", "PATH TOO LONG TO SAVE FILE")
+ status_message.write()
+ continue
# case if output_path is a file, move file to temporary name, create output_path and move file into output_path
if os.path.isfile(output_path):
move_index(existpath=output_path)
@@ -309,44 +302,54 @@ def download(output, snapshot_entry, connection, status_message, no_redirect=Fal
# case if output_file is a directory, create file as index.html in this directory
if os.path.isdir(output_file):
output_file = move_index(existfile=output_file, filebuffer=response_data)
-
+ # download file if not existing
if not os.path.isfile(output_file):
with open(output_file, 'wb') as file:
if response.getheader('Content-Encoding') == 'gzip':
response_data = gzip.decompress(response_data)
- file.write(response_data)
- else:
- file.write(response_data)
+ file.write(response_data)
+ # check if file is downloaded
if os.path.isfile(output_file):
- status_message = f"{status_message}\n" + \
- f"SUCCESS -> HTTP: {response_status} - {response_status_message}"
+ status_message.store(status="SUCCESS", type="HTTP", message=f"{response.status} - {response_status_message}")
else:
- status_message = f"{status_message}\n" + \
- f"EXISTING -> HTTP: {response_status} - {response_status_message}"
- status_message = f"{status_message}\n" + \
- f" -> URL: {download_url}\n" + \
- f" -> FILE: {output_file}"
- vb.write(status_message)
- sc.snapshot_entry_modify(snapshot_entry, "file", output_file)
- return True
-
+ status_message.store(status="EXISTING", type="HTTP", message=f"{response.status} - {response_status_message}")
+ status_message.store(status="", type="URL", message=download_url)
+ status_message.store(status="", type="FILE", message=output_file)
+ sc.entry_modify(snapshot_entry, "file", output_file)
+ # if convert_links:
+ # convert.links(output_file, status_message)
+ status_message.write()
+ success = True
+ break
else:
- status_message = f"{status_message}\n" + \
- f"UNEXPECTED -> HTTP: {response_status} - {response_status_message}\n" + \
- f" -> URL: {download_url}"
- vb.write(status_message)
- return True
- # exception returns false and appends the url to the failed list
- except http.client.HTTPException as e:
- status_message = f"{status_message}\n" + \
- f"EXCEPTION -> ({i+1}/{max_retries}), append to failed_urls: {download_url}\n" + \
- f" -> {e}"
- vb.write(status_message)
- return False
- except (timeout, ConnectionRefusedError, ConnectionResetError) as e:
+ status_message.store(status="UNEXPECTED", type="HTTP", message=f"{response.status} - {response_status_message}")
+ status_message.store(status="", type="URL", message=download_url)
+ status_message.write()
+ continue
+ # exception handling
+ except (http.client.HTTPException, timeout, ConnectionRefusedError, ConnectionResetError) as e:
download_exception(type, e, i, max_retries, sleep_time, status_message)
- vb.write(f"FAILED -> download, append to failed_urls: {download_url}")
- return False
+ continue
+ if not success:
+ status_message.store(status="FAILED", type="", message=f"append to failed_urls: {download_url}")
+ status_message.write()
+ return success
+
+
+
+
+
+def download_response(connection, encoded_download_url, headers):
+ connection.request("GET", encoded_download_url, headers=headers)
+ response = connection.getresponse()
+ response_data = response.read()
+ response_status = response.status
+ response_status_message = parse_response_code(response_status)
+ return response, response_data, response_status, response_status_message
+
+
+
+
RESPONSE_CODE_DICT = {
200: "OK",
@@ -364,9 +367,8 @@ def download_exception(type, e, i, max_retries, sleep_time, status_message):
Handle exceptions during the download process.
"""
type = e.__class__.__name__.upper()
- status_message = f"{status_message}\n" + \
- f"{type} -> ({i+1}/{max_retries}), reconnect in {sleep_time} seconds...\n"
- vb.write(status_message)
+ status_message.store(status=f"{type}", type=f"({i+1}/{max_retries})", message=f"reconnect in {sleep_time} seconds...")
+ status_message.write()
time.sleep(sleep_time)
def parse_response_code(response_code: int):
@@ -440,7 +442,7 @@ def skip_open(csv_path: str, url: str) -> tuple:
csv_file.close()
return skipset
else:
- vb.write("\nNo CSV-file or content found to load skipable URLs")
+ vb.write(message="\nNo CSV-file or content found to load skipable URLs")
return None
except Exception as e:
ex.exception("Could not open CSV-file", e)
diff --git a/pywaybackup/arguments.py b/pywaybackup/arguments.py
deleted file mode 100644
index 540ab9e..0000000
--- a/pywaybackup/arguments.py
+++ /dev/null
@@ -1,45 +0,0 @@
-import sys
-import argparse
-from pywaybackup.__version__ import __version__
-
-def parse():
-
- parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
- parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')
- parser.add_argument('-d', '--debug', action='store_true', help='Debug mode (Always full traceback and creates an error.log')
-
- required = parser.add_argument_group('required (one exclusive)')
- required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
- exclusive_required = required.add_mutually_exclusive_group(required=True)
- exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot')
- exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps')
- exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')
-
- optional = parser.add_argument_group('optional query parameters')
- optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (opt range in y)')
- optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url')
- optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
- optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
- optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
- optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
-
- special = parser.add_argument_group('manipulate behavior')
- special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder')
- special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder')
- special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
- special.add_argument('--verbosity', type=str, default="standard", metavar="", help='["progress", "json"] Verbosity level')
- special.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
- special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
-
- cdx = parser.add_argument_group('cdx (one exclusive)')
- exclusive_cdx = cdx.add_mutually_exclusive_group()
- exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query-result to a file for recurent use - defaults to output folder')
- exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url')
-
- auto = parser.add_argument_group('auto')
- auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download')
-
- args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help
- command = ' '.join(sys.argv[1:])
-
- return args, command
diff --git a/pywaybackup/helper.py b/pywaybackup/helper.py
index 3186f8e..f3cde24 100644
--- a/pywaybackup/helper.py
+++ b/pywaybackup/helper.py
@@ -1,6 +1,8 @@
import os
+import re
import shutil
+from urllib.parse import urlparse, urljoin
import magic
@@ -21,20 +23,35 @@ def sanitize_filename(input: str) -> str:
input = '.'.join(filter(None, input.split('.')))
return input
+def sanitize_url(input: str) -> str:
+ """
+ Sanitize a url by encoding special characters.
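+
+    e.g. "a=b&c" becomes "a%3db%26c"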
+ """
+ special_chars = [":", "*", "?", "&", "=", "<", ">", "\\", "|"]
+ for char in special_chars:
+ input = input.replace(char, f"%{ord(char):02x}")
+ return input
+
def url_get_timestamp(url):
"""
Extract the timestamp from a wayback machine URL.
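+
+    e.g. "http://web.archive.org/web/20190815104545id_/https://www.google.com/" -> "20190815104545"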
"""
- timestamp = url.split("id_/")[0].split("/")[-1]
+    timestamp = url.split("web.archive.org/web/")[1].split("/")[0]
+    timestamp = timestamp.split("id_")[0]  # strip the "id_" flag if present
return timestamp
def url_split(url, index=False):
"""
Split a URL into domain, subdir, and filename.
+
+ Index:
+ - [0] = domain
+ - [1] = subdir
+ - [2] = filename
"""
- if url.startswith("http"):
+ if "://" in url:
url = url.split("://")[1]
domain = url.split("/")[0]
path = url[len(domain):]
@@ -89,6 +106,4 @@ def check_index_mime(filebuffer: bytes) -> bool:
mime_type = mime.from_buffer(filebuffer)
if mime_type != "text/html":
return False
- return True
-
-
+ return True
\ No newline at end of file
diff --git a/pywaybackup/main.py b/pywaybackup/main.py
index b8e1eec..a524e38 100644
--- a/pywaybackup/main.py
+++ b/pywaybackup/main.py
@@ -2,60 +2,34 @@
import signal
-import pywaybackup.helper as helper
import pywaybackup.archive as archive
-from pywaybackup.arguments import parse
+from pywaybackup.Arguments import Configuration as config
from pywaybackup.Verbosity import Verbosity as vb
from pywaybackup.Exception import Exception as ex
+from pywaybackup.Converter import Converter as convert
def main():
- args, command = parse()
- if args.output is None:
- args.output = os.path.join(os.getcwd(), "waybackup_snapshots")
- os.makedirs(args.output, exist_ok=True)
- else:
- os.makedirs(args.output, exist_ok=True)
-
- ex.init(args.debug, args.output, command)
- vb.init(args.verbosity)
-
- if args.full:
- mode = "full"
- if args.current:
- mode = "current"
-
- if args.auto:
- args.skip = args.output
- args.csv = args.output
- args.cdxbackup = args.output
- args.cdxinject = os.path.join(args.output, f"waybackup_{helper.sanitize_filename(args.url)}.cdx")
- else:
- if args.skip is True:
- args.skip = args.output
- if args.csv is True:
- args.csv = args.output
- if args.cdxbackup is True:
- args.cdxbackup = args.output
- if args.cdxinject is True:
- args.cdxinject = args.output
-
- if args.save:
- archive.save_page(args.url)
+ config.init()
+ ex.init(config.debug, config.output, config.command)
+ vb.init(config.verbosity, config.log)
+ if config.save:
+ archive.save_page(config.url)
else:
try:
- skipset = archive.skip_open(args.skip, args.url) if args.skip else None
- archive.query_list(args.url, args.range, args.start, args.end, args.explicit, mode, args.cdxbackup, args.cdxinject)
- if args.list:
+ skipset = archive.skip_open(config.skip, config.url) if config.skip else None
+ archive.query_list(config.range, config.start, config.end, config.explicit, config.mode, config.cdxbackup, config.cdxinject)
+ if config.list:
archive.print_list()
else:
- archive.download_list(args.output, args.retry, args.no_redirect, args.workers, skipset)
+ archive.download_list(config.output, config.retry, config.no_redirect, config.delay, config.workers, skipset)
except KeyboardInterrupt:
print("\nInterrupted by user\n")
finally:
signal.signal(signal.SIGINT, signal.SIG_IGN)
- archive.csv_close(args.csv, args.url) if args.csv else None
+ archive.csv_close(config.csv, config.url) if config.csv else None
+
vb.fini()
os._exit(0) # kill all threads
diff --git a/test/test_links.js b/test/test_links.js
new file mode 100644
index 0000000..871167e
--- /dev/null
+++ b/test/test_links.js
@@ -0,0 +1,63 @@
+// Example JavaScript File: example.js
+
+// External script with absolute URL
+var externalScript = document.createElement('script');
+externalScript.src = 'http://example.com/js/external-script.js';
+document.head.appendChild(externalScript);
+
+// External script with relative URL
+var localScript = document.createElement('script');
+localScript.src = '/js/local-script.js';
+document.head.appendChild(localScript);
+
+// Inline style with absolute URL in background image
+var element = document.createElement('div');
+element.style.backgroundImage = "url('http://example.com/images/bg.png')";
+document.body.appendChild(element);
+
+// Inline style with relative URL in background image
+element.style.backgroundImage = "url('/images/bg.png')";
+
+// CSS in JavaScript with absolute URL
+var css = "body { background-image: url('http://example.com/images/body-bg.png'); }";
+var style = document.createElement('style');
+style.type = 'text/css';
+style.appendChild(document.createTextNode(css));
+document.head.appendChild(style);
+
+// CSS in JavaScript with relative URL
+var cssLocal = "body { background-image: url('/images/body-bg.png'); }";
+var styleLocal = document.createElement('style');
+styleLocal.type = 'text/css';
+styleLocal.appendChild(document.createTextNode(cssLocal));
+document.head.appendChild(styleLocal);
+
+// Image element with absolute URL
+var imgElement = document.createElement('img');
+imgElement.src = 'http://example.com/images/logo.png';
+document.body.appendChild(imgElement);
+
+// Image element with relative URL
+var imgLocalElement = document.createElement('img');
+imgLocalElement.src = '/images/logo.png';
+document.body.appendChild(imgLocalElement);
+
+// Fetch API call with absolute URL
+fetch('http://example.com/api/data')
+ .then(response => response.json())
+ .then(data => console.log(data));
+
+// Fetch API call with relative URL
+fetch('/api/data')
+ .then(response => response.json())
+ .then(data => console.log(data));
+
+// XMLHttpRequest with absolute URL
+var xhr = new XMLHttpRequest();
+xhr.open('GET', 'http://example.com/api/data', true);
+xhr.send();
+
+// XMLHttpRequest with relative URL
+var xhrLocal = new XMLHttpRequest();
+xhrLocal.open('GET', '/api/data', true);
+xhrLocal.send();