diff --git a/README.md b/README.md index 70a907e..ed7cae1 100644 --- a/README.md +++ b/README.md @@ -11,10 +11,6 @@ Internet-archive is a nice source for several OSINT-information. This tool is a This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range. -## Info - -Linux recommended: On windows machines, the path length is limited. It can only be overcome by editing the registry. Files which exceed the path length will not be downloaded. - ## Installation ### Pip @@ -32,6 +28,11 @@ Linux recommended: On windows machines, the path length is limited. It can only ```pip install .``` - in a virtual env or use `--break-system-packages` +## Usage notes + +- Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded. +- If you query an explicit file (e.g. a URL with a query string such as `?query=this`, or a specific file such as `login.html`), the `--explicit` argument is recommended, as a wildcard query may lead to an empty result. + ## Arguments - `-h`, `--help`: Show the help message and exit. ### Required -- `-u`, `--url`: The URL of the web page to download. This argument is required. +- **`-u`**, **`--url`**:
+ The URL of the web page to download. This argument is required. #### Mode Selection (Choose One) -- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed). -- `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time. -- `-s`, `--save`: Save a page to the Wayback Machine. (beta) +- **`-c`**, **`--current`**:
+ Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any single original state, because new and old versions are mixed). +- **`-f`**, **`--full`**:
+ Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time. +- **`-s`**, **`--save`**:
+ Save a page to the Wayback Machine. (beta) ### Optional query parameters -- `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots. -- `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. -- `-o`, `--output`: Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved. +- **`-l`**, **`--list`**:
+ Only print the snapshots available within the specified range. Does not download the snapshots. +- **`-e`**, **`--explicit`**:
+ Only download the explicitly given URL. No wildcard subdomains or paths. Use this, for example, to get root-only snapshots. This is recommended for explicit files like `login.html` or query strings like `?query=this`. +- **`-o`**, **`--output`**:
+ The folder where downloaded files will be saved. Defaults to `waybackup_snapshots` in the current directory. - **Range Selection:**
-Specify the range in years or a specific timestamp either start, end or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.
-(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112) - - `-r`, `--range`: Specify the range in years for which to search and download snapshots. - - `--start`: Timestamp to start searching. - - `--end`: Timestamp to end searching. + Specify the range in years, or a specific timestamp as start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Timestamp format: YYYYMMDDhhmmss. You can give just a year, or increase specificity by extending the timestamp from the left.
+ (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112) + - **`-r`**, **`--range`**:
+ Specify the range in years for which to search and download snapshots. + - **`--start`**:
+ Timestamp to start searching. + - **`--end`**:
+ Timestamp to end searching. ### Additional behavior manipulation @@ -65,19 +76,29 @@ Specify the range in years or a specific timestamp either start, end or both. If - **`--csv`** `<path>`: Path defaults to output-dir. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains downloaded files. Named as `waybackup_<url>.csv`. - **`--skip`** `<path>`:
-Path defaults to output-dir. Checks for an existing `waybackup_.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path. +Path defaults to output-dir. Checks for an existing `waybackup_<url>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path. - **`--no-redirect`**:
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects. - **`--verbosity`** `<level>`:
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar). + + +- **`--log`** `<path>`:
+Path defaults to output-dir. Saves a log file with the output of the tool. Named as `waybackup_<url>.log`. + +- **`--workers`** `<int>`:
+Sets the number of simultaneous download workers. Default is 1; a safe range is up to about 10. Be cautious: too many workers may lead to refused connections from the Wayback Machine. - **`--retry`** `<int>`:
Specifies the number of retry attempts for failed downloads. - -- **`--workers`** ``:
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine. + +- **`--delay`** `<int>`:
+Specifies the delay between download requests in seconds. Default is no delay (0). + + **CDX Query Handling:** - **`--cdxbackup`** `<path>`:
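For illustration, two hypothetical invocations that combine the arguments above (they assume the `waybackup` console script installed by the package; the URL and values are placeholders only):

```
waybackup -u http://example.com -c
waybackup -u http://example.com -f --start 20190101 --end 20200101 --workers 4 --delay 2 --csv --skip
```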
diff --git a/pywaybackup/Arguments.py b/pywaybackup/Arguments.py new file mode 100644 index 0000000..a6a7bab --- /dev/null +++ b/pywaybackup/Arguments.py @@ -0,0 +1,102 @@ + +import sys +import os +import argparse + +from pywaybackup.helper import url_split, sanitize_filename + +from pywaybackup.__version__ import __version__ + +class Arguments: + + def __init__(self): + + parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)') + parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid') + parser.add_argument('-d', '--debug', action='store_true', help='Debug mode (Always full traceback and creates an error.log') + + required = parser.add_argument_group('required (one exclusive)') + required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download') + exclusive_required = required.add_mutually_exclusive_group(required=True) + exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot') + exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps') + exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine') + + optional = parser.add_argument_group('optional query parameters') + optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (opt range in y)') + optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url') + optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory') + optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search') + optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss') + optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss') + + special = parser.add_argument_group('manipulate behavior') + special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder') + special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder') + special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org') + special.add_argument('--verbosity', type=str, default="info", metavar="", help='["progress", "json"] for different output or ["trace"] for very detailed output') + special.add_argument('--log', type=str, nargs='?', const=True, metavar='path', help='save a log file - defaults to output folder') + special.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)') + special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)') + # special.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. 
Requires -c/--current') + special.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds') + + cdx = parser.add_argument_group('cdx (one exclusive)') + exclusive_cdx = cdx.add_mutually_exclusive_group() + exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query-result to a file for recurent use - defaults to output folder') + exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url') + + auto = parser.add_argument_group('auto') + auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download') + + args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help + + # if args.convert_links and not args.current: + # parser.error("--convert-links can only be used with the -c/--current option") + + self.args = args + + def get_args(self): + return self.args + +class Configuration: + + @classmethod + def init(cls): + + cls.args = Arguments().get_args() + for key, value in vars(cls.args).items(): + setattr(Configuration, key, value) + + # args now attributes of Configuration // Configuration.output, ... + cls.command = ' '.join(sys.argv[1:]) + cls.domain, cls.subdir, cls.filename = url_split(cls.url) + + if cls.output is None: + cls.output = os.path.join(os.getcwd(), "waybackup_snapshots") + os.makedirs(cls.output, exist_ok=True) + + if cls.log is True: + cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log") + + if cls.full: + cls.mode = "full" + if cls.current: + cls.mode = "current" + + if cls.auto: + cls.skip = cls.output + cls.csv = cls.output + cls.cdxbackup = cls.output + cls.cdxinject = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx") + else: + if cls.skip is True: + cls.skip = cls.output + if cls.csv is True: + cls.csv = cls.output + if cls.cdxbackup is True: + cls.cdxbackup = cls.output + if cls.cdxinject is True: + cls.cdxinject = cls.output + + diff --git a/pywaybackup/Converter.py b/pywaybackup/Converter.py new file mode 100644 index 0000000..14ace94 --- /dev/null +++ b/pywaybackup/Converter.py @@ -0,0 +1,182 @@ +import os +import errno +import magic +from pywaybackup.helper import url_split + +from pywaybackup.Arguments import Configuration as config +from pywaybackup.Verbosity import Verbosity as vb +import re + +class Converter: + + @classmethod + def define_root_steps(cls, filepath) -> str: + """ + Define the steps (../) to the root directory. + """ + abs_path = os.path.abspath(filepath) + webroot_path = os.path.abspath(f"{config.output}/{config.domain}/") # webroot is the domain folder in the output + # common path between the two + common_path = os.path.commonpath([abs_path, webroot_path]) + # steps up to the common path + rel_path_from_common = os.path.relpath(abs_path, common_path) + steps_up = rel_path_from_common.count(os.path.sep) + if steps_up <= 1: # if the file is in the root of the domain + return "./" + return "../" * steps_up + + + + + + @classmethod + def links(cls, filepath, status_message=None): + """ + Convert all links in a HTML / CSS / JS file to local paths. + """ + + + def extract_urls(content) -> list: + """ + Extract all links from a file. 
+ """ + + #content = re.sub(r'\s+', '', content) + #content = re.sub(r'\n', '', content) + + html_types = ["src", "href", "poster", "data-src"] + css_types = ["url"] + links = [] + for html_type in html_types: + # possible formatings of the value: "url", 'url', url + matches = re.findall(f"{html_type}=[\"']?([^\"'>]+)", content) + links += matches + for css_type in css_types: + # possible formatings of the value: url(url) url('url') url("url") // ends with ) + matches = re.findall(rf"{css_type}\((['\"]?)([^'\"\)]+)\1\)", content) + links += [match[1] for match in matches] + links = list(set(links)) + return links + + + def local_url(original_url, domain, count) -> str: + """ + Convert a given url to a local path. + """ + original_url_domain = url_split(original_url)[0] + + # check if the url is external or internal (external is returned as is because no need to convert) + external = False + if original_url_domain != domain: + if "://" in original_url: + external = True + if original_url.startswith("//"): + external = True + if external: + status_message.trace(status="", type=f"{count}/{len(links)}", message="External url") + return original_url + + # convert the url to a relative path to the local root (download dir) if it's a valid path, else return the original url + original_url_file = os.path.join(config.output, config.domain, normalize_url(original_url)) + if validate_path(original_url_file): + if original_url.startswith("/"): # if only starts with / + original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('/')}" + if original_url.startswith(".//"): + original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('./')}" + if original_url_domain == domain: # if url is like https://domain.com/path/to/file + original_url = f"{cls.define_root_steps(filepath)}{original_url.split(domain)[1].lstrip('/')}" + if original_url.startswith("../"): # if file is already ../ check if it's not too many steps up + original_url = f"{cls.define_root_steps(filepath)}{original_url.split('../')[-1].lstrip('/')}" + else: + status_message.trace(status="", type="", message=f"{count}/{len(links)}: URL is not a valid path") + + return original_url + + + + + + def normalize_url(url) -> str: + """ + Normalize a given url by removing it's protocol, domain and parent directorie references. + + Example1: + - Example input: https://domain.com/path/to/file + - Example output: /path/to/file + + Example2 + - input: ../path/to/file + - output: /path/to/file + """ + try: + url = "/" + url.split("../")[-1] + except IndexError: + pass + if url.startswith("//"): + url = "/" + url.split("//")[1] + parsed_url = url_split(url) + return f"{parsed_url[1]}/{parsed_url[2]}" + + + def is_pathname_valid(pathname: str) -> bool: + """ + Check if a given pathname is valid. + """ + if not isinstance(pathname, str) or not pathname: + return False + + try: + os.lstat(pathname) + except OSError as exc: + if exc.errno == errno.ENOENT: + return True + elif exc.errno in {errno.ENAMETOOLONG, errno.ERANGE}: + return False + return True + + def is_path_creatable(pathname: str) -> bool: + """ + Check if a given path is creatable. + """ + dirname = os.path.dirname(pathname) or os.getcwd() + return os.access(dirname, os.W_OK) + + def is_path_exists_or_creatable(pathname: str) -> bool: + """ + Check if a given path exists or is creatable. + """ + return is_pathname_valid(pathname) or is_path_creatable(pathname) + + def validate_path(filepath: str) -> bool: + """ + Validate if a given path can exist. 
+ """ + return is_path_exists_or_creatable(filepath) + + + + + + if os.path.isfile(filepath): + if magic.from_file(filepath, mime=True).split("/")[1] == "javascript": + status_message.trace(status="Error", type="", message="JS-file is not supported") + return + try: + with open(filepath, "r") as file: + domain = config.domain + content = file.read() + links = extract_urls(content) + status_message.store(message=f"\n-----> Convert: [{len(links)}] links in file") + count = 1 + for original_link in links: + status_message.trace(status="ORIG", type=f"{count}/{len(links)}", message=original_link) + new_link = local_url(original_link, domain, count) + if new_link != original_link: + status_message.trace(status="CONV", type=f"{count}/{len(links)}", message=new_link) + content = content.replace(original_link, new_link) + count += 1 + file = open(filepath, "w") + file.write(content) + file.close() + except UnicodeDecodeError: + status_message.trace(status="Error", type="", message="Could not decode file to convert links") diff --git a/pywaybackup/SnapshotCollection.py b/pywaybackup/SnapshotCollection.py index 3053571..7131626 100644 --- a/pywaybackup/SnapshotCollection.py +++ b/pywaybackup/SnapshotCollection.py @@ -70,10 +70,10 @@ def create_output(cls, url: str, timestamp: str, output: str): download_dir = os.path.join(output, domain, timestamp, subdir) download_file = os.path.abspath(os.path.join(download_dir, filename)) return download_file - + @classmethod - def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str): + def entry_modify(cls, collection_entry: dict, key: str, value: str): """ Modify a key-value pair in a snapshot entry of the collection (dict). diff --git a/pywaybackup/Verbosity.py b/pywaybackup/Verbosity.py index e7a08f0..4f4fc1d 100644 --- a/pywaybackup/Verbosity.py +++ b/pywaybackup/Verbosity.py @@ -2,49 +2,113 @@ import json from pywaybackup.SnapshotCollection import SnapshotCollection as sc - class Verbosity: + LEVELS = ["trace", "info"] + level = None + mode = None args = None pbar = None - new_debug = True - debug = False - output = None - command = None + log = None @classmethod - def init(cls, v_args: list, debug=False, output=None, command=None): + def init(cls, v_args: list, log=None): cls.args = v_args - cls.output = output - cls.command = command + cls.log = open(log, "w") if log else None if cls.args == "progress": cls.mode = "progress" elif cls.args == "json": cls.mode = "json" - else: - cls.mode = "standard" - cls.debug = True if debug else False + cls.level = cls.args if cls.args in cls.LEVELS else "info" @classmethod def fini(cls): if cls.mode == "progress": - if cls.pbar is not None: cls.pbar.close() + if cls.pbar is not None: + cls.pbar.close() if cls.mode == "json": print(json.dumps(sc.SNAPSHOT_COLLECTION, indent=4, sort_keys=True)) + if cls.log: + cls.log.close() @classmethod - def write(cls, message: str = None, progress: int = None): + def write(cls, status="", type="", message=""): + """ + Write a log line based on the provided status, type, and message. + + Args: + status (str): The status of the log line. (e.g. "SUCCESS", "REDIRECT") + type (str): The type of the log line. (e.g. "URL", "FILE") + message (str): The message to be logged. (e.g. 
actual url, file path) + """ + logline = cls.generate_logline(status=status, type=type, message=message) + if cls.mode != "progress" and cls.mode != "json": + if logline: + print(logline) + if cls.log: + cls.log.write(logline + "\n") + cls.log.flush() + + @classmethod + def progress(cls, progress: int): if cls.mode == "progress": if cls.pbar is None and progress == 0: maxval = sc.count(collection=True) cls.pbar = tqdm.tqdm(total=maxval, desc="Downloading", unit=" snapshot", ascii="░▒█") - if cls.pbar is not None and progress is not None and progress > 0 : + if cls.pbar is not None and progress is not None and progress > 0: cls.pbar.update(progress) cls.pbar.refresh() - elif cls.mode == "json": - pass - else: - if message: - print(message) \ No newline at end of file + + @classmethod + def generate_logline(cls, status: str = "", type: str = "", message: str = ""): + + if not status and not type: + return message + + status_length = 11 + type_length = 5 + + status = status.ljust(status_length) + type = type.ljust(type_length) + + log_entry = f"{status} -> {type}: {message}" + + return log_entry + +class Message(Verbosity): + """ + Message class representing a message-buffer for the Verbosity class. + + If a message should be stored and stacked for later output. + """ + + def __init__(self): + self.message = {} + + def __str__(self): + return self.message + + def store(self, status: str = "", type: str = "", message: str = "", level: str = "info"): + if level not in self.message: + self.message[level] = [] + self.message[level].append(super().generate_logline(status, type, message)) + + def clear(self): + self.message = {} + + def write(self): + for level in self.message: + if self.check_level(level): + for message in self.message[level]: + super().write(message=message) + self.clear() + + def check_level(self, level: str): + return super().LEVELS.index(level) >= super().LEVELS.index(self.level) + + def trace(self, status: str = "", type: str = "", message: str = ""): + self.store(status, type, message, "trace") + + \ No newline at end of file diff --git a/pywaybackup/__version__.py b/pywaybackup/__version__.py index 9e2406e..d60e0c1 100644 --- a/pywaybackup/__version__.py +++ b/pywaybackup/__version__.py @@ -1 +1 @@ -__version__ = "1.3.0" \ No newline at end of file +__version__ = "1.4.0" \ No newline at end of file diff --git a/pywaybackup/archive.py b/pywaybackup/archive.py index d908558..e2df8fd 100644 --- a/pywaybackup/archive.py +++ b/pywaybackup/archive.py @@ -13,12 +13,15 @@ from socket import timeout -from pywaybackup.helper import url_get_timestamp, url_split, move_index, sanitize_filename, check_nt +from pywaybackup.helper import url_get_timestamp, move_index, sanitize_filename, check_nt from pywaybackup.SnapshotCollection import SnapshotCollection as sc +from pywaybackup.Arguments import Configuration as config from pywaybackup.__version__ import __version__ +from pywaybackup.Converter import Converter as convert +from pywaybackup.Verbosity import Message from pywaybackup.Verbosity import Verbosity as vb from pywaybackup.Exception import Exception as ex @@ -41,40 +44,40 @@ def save_page(url: str): Returns: None: The function does not return any value. It only prints messages to the console. 
""" - vb.write("\nSaving page to the Wayback Machine...") + vb.write(message="\nSaving page to the Wayback Machine...") connection = http.client.HTTPSConnection("web.archive.org") headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36' } connection.request("GET", f"https://web.archive.org/save/{url}", headers=headers) - vb.write("\n-----> Request sent") + vb.write(message="\n-----> Request sent") response = connection.getresponse() response_status = response.status if response_status == 302: location = response.getheader("Location") - vb.write("\n-----> Response: 302 (redirect to snapshot)") + vb.write(message="\n-----> Response: 302 (redirect to snapshot)") snapshot_timestamp = datetime.strptime(url_get_timestamp(location), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S') current_timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S') timestamp_difference = (datetime.strptime(current_timestamp, '%Y-%m-%d %H:%M:%S') - datetime.strptime(snapshot_timestamp, '%Y-%m-%d %H:%M:%S')).seconds / 60 timestamp_difference = int(round(timestamp_difference, 0)) if timestamp_difference < 1: - vb.write("\n-----> New snapshot created") + vb.write(message="\n-----> New snapshot created") elif timestamp_difference > 1: - vb.write(f"\n-----> Snapshot already exists. (1 hour limit) - wait for {60 - timestamp_difference} minutes") - vb.write(f"TIMESTAMP SNAPSHOT: {snapshot_timestamp}") - vb.write(f"TIMESTAMP REQUEST : {current_timestamp}") - vb.write(f"\nLAST SNAPSHOT BACK: {timestamp_difference} minutes") + vb.write(message=f"\n-----> Snapshot already exists. (1 hour limit) - wait for {60 - timestamp_difference} minutes") + vb.write(message=f"TIMESTAMP SNAPSHOT: {snapshot_timestamp}") + vb.write(message=f"TIMESTAMP REQUEST : {current_timestamp}") + vb.write(message=f"\nLAST SNAPSHOT BACK: {timestamp_difference} minutes") - vb.write(f"\nURL: {location}") + vb.write(message=f"\nURL: {location}") elif response_status == 404: - vb.write("\n-----> Response: 404 (not found)") - vb.write(f"\nFAILED -> URL: {url}") + vb.write(message="\n-----> Response: 404 (not found)") + vb.write(message=f"\nFAILED -> URL: {url}") else: - vb.write("\n-----> Response: unexpected") - vb.write(f"\nFAILED -> URL: {url}") + vb.write(message="\n-----> Response: unexpected") + vb.write(message=f"\nFAILED -> URL: {url}") connection.close() @@ -82,13 +85,13 @@ def save_page(url: str): def print_list(): - vb.write("") + vb.write(message="") count = sc.count(collection=True) if count == 0: - vb.write("\nNo snapshots found") + vb.write(message="\nNo snapshots found") else: __import__('pprint').pprint(sc.SNAPSHOT_COLLECTION) - vb.write(f"\n-----> {count} snapshots listed") + vb.write(message=f"\n-----> {count} snapshots listed") @@ -96,22 +99,22 @@ def print_list(): # create filelist # timestamp format yyyyMMddhhmmss -def query_list(url: str, range: int, start: int, end: int, explicit: bool, mode: str, cdxbackup: str, cdxinject: str): - +def query_list(range: int, start: int, end: int, explicit: bool, mode: str, cdxbackup: str, cdxinject: str): + def inject(cdxinject): if os.path.isfile(cdxinject): - vb.write("\nInjecting CDX data...") + vb.write(message="\nInjecting CDX data...") cdxResult = open(cdxinject, "r") cdxResult = cdxResult.read() linecount = cdxResult.count("\n") - 1 - vb.write(f"\n-----> {linecount} snapshots injected") + vb.write(message=f"\n-----> {linecount} snapshots injected") return cdxResult else: - vb.write("\nNo CDX file found to 
inject - querying snapshots...") + vb.write(message="\nNo CDX file found to inject - querying snapshots...") return False - def query(url, range, start, end, explicit): - vb.write("\nQuerying snapshots...") + def query(range, start, end, explicit): + vb.write(message="\nQuerying snapshots...") query_range = "" if not range: if start: query_range = query_range + f"&from={start}" @@ -119,40 +122,42 @@ def query(url, range, start, end, explicit): else: query_range = "&from=" + str(datetime.now().year - range) - domain, subdir, filename = url_split(url) - if domain and not subdir and not filename: - cdx_url = f"*.{domain}/*" if not explicit else f"{domain}" - if domain and subdir and not filename: - cdx_url = f"{domain}/{subdir}/*" - if domain and subdir and filename: - cdx_url = f"{domain}/{subdir}/{filename}/*" - if domain and not subdir and filename: - cdx_url = f"{domain}/{filename}/*" - - vb.write(f"---> {cdx_url}") + if config.domain and not config.subdir and not config.filename: + cdx_url = f"{config.domain}" + if config.domain and config.subdir and not config.filename: + cdx_url = f"{config.domain}/{config.subdir}" + if config.domain and config.subdir and config.filename: + cdx_url = f"{config.domain}/{config.subdir}/{config.filename}" + if config.domain and not config.subdir and config.filename: + cdx_url = f"{config.domain}/{config.filename}" + if not explicit: + cdx_url = f"{cdx_url}/*" + + vb.write(message=f"---> {cdx_url}") cdxQuery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original&filter!=statuscode:200" try: cdxResult = requests.get(cdxQuery).text except requests.exceptions.ConnectionError as e: - vb.write("\nCONNECTION REFUSED -> could not query cdx server (max retries exceeded)\n") + vb.write(message="\nCONNECTION REFUSED -> could not query cdx server (max retries exceeded)\n") os._exit(1) if cdxbackup: os.makedirs(cdxbackup, exist_ok=True) - with open(os.path.join(cdxbackup, f"waybackup_{sanitize_filename(url)}.cdx"), "w") as file: + with open(os.path.join(cdxbackup, f"waybackup_{sanitize_filename(config.url)}.cdx"), "w") as file: file.write(cdxResult) - vb.write("\n-----> CDX backup generated") + vb.write(message="\n-----> CDX backup generated") return cdxResult + cdxResult = None if cdxinject: cdxResult = inject(cdxinject) if not cdxResult: - cdxResult = query(url, range, start, end, explicit) + cdxResult = query(range, start, end, explicit) cdxResult = json.loads(cdxResult) sc.create_list(cdxResult, mode) - vb.write(f"\n-----> {sc.count(collection=True)} snapshots to utilize") + vb.write(message=f"\n-----> {sc.count(collection=True)} snapshots to utilize") @@ -160,19 +165,20 @@ def query(url, range, start, end, explicit): # example download: http://web.archive.org/web/20190815104545id_/https://www.google.com/ -def download_list(output, retry, no_redirect, workers, skipset: set = None): +def download_list(output, retry, no_redirect, delay, workers, skipset: set = None): """ Download a list of urls in format: [{"timestamp": "20190815104545", "url": "https://www.google.com/"}] """ if sc.count(collection=True) == 0: - vb.write("\nNothing to download"); + vb.write(message="\nNothing to download"); return - vb.write("\nDownloading snapshots...", progress=0) + vb.write(message="\nDownloading snapshots...",) + vb.progress(0) if workers > 1: - vb.write(f"\n-----> Simultaneous downloads: {workers}") + vb.write(message=f"\n-----> Simultaneous downloads: {workers}") sc.create_collection() - 
vb.write("\n-----> Snapshots prepared") + vb.write(message="\n-----> Snapshots prepared") # create queue with snapshots and skip already downloaded urls snapshot_queue = queue.Queue() @@ -182,30 +188,30 @@ def download_list(output, retry, no_redirect, workers, skipset: set = None): skip_count += 1 continue snapshot_queue.put(snapshot) - vb.write(progress=skip_count) + vb.progress(skip_count) if skip_count > 0: - vb.write(f"\n-----> Skipped snapshots: {skip_count}") + vb.write(message=f"\n-----> Skipped snapshots: {skip_count}") threads = [] worker = 0 for worker in range(workers): worker += 1 - vb.write(f"\n-----> Starting worker: {worker}") - thread = threading.Thread(target=download_loop, args=(snapshot_queue, output, worker, retry, no_redirect, skipset)) + vb.write(message=f"\n-----> Starting worker: {worker}") + thread = threading.Thread(target=download_loop, args=(snapshot_queue, output, worker, retry, no_redirect, delay, skipset)) threads.append(thread) thread.start() for thread in threads: thread.join() successed = sc.count(success=True) failed = sc.count(fail=True) - vb.write(f"\nFiles downloaded: {successed}") - vb.write(f"Not downloaded: {failed}\n") + vb.write(message=f"\nFiles downloaded: {successed}") + vb.write(message=f"Not downloaded: {failed}\n") -def download_loop(snapshot_queue, output, worker, retry, no_redirect, skipset=None, attempt=1, connection=None, failed_urls=[]): +def download_loop(snapshot_queue, output, worker, retry, no_redirect, delay, skipset=None, attempt=1, connection=None, failed_urls=[]): """ Download a snapshot of the queue. If a download fails, the function will retry the download. The "snapshot_collection" dictionary will be updated with the download status and file information. @@ -213,32 +219,35 @@ def download_loop(snapshot_queue, output, worker, retry, no_redirect, skipset=No """ try: max_attempt = retry if retry > 0 else retry + 1 - if not connection: - connection = http.client.HTTPSConnection("web.archive.org") + connection = connection or http.client.HTTPSConnection("web.archive.org") if attempt > max_attempt: connection.close() - vb.write(f"\n-----> Worker: {worker} - Failed downloads: {len(failed_urls)}") return + while not snapshot_queue.empty(): snapshot = snapshot_queue.get() - status = f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{sc.SNAPSHOT_COLLECTION.index(snapshot)+1}/{len(sc.SNAPSHOT_COLLECTION)}] - Worker: {worker}" - download_status = download(output, snapshot, connection, status, no_redirect) - if not download_status: - if snapshot not in failed_urls: - failed_urls.append(snapshot) + status_message = Message() + status_message.store(message=f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{sc.SNAPSHOT_COLLECTION.index(snapshot)+1}/{len(sc.SNAPSHOT_COLLECTION)}] - Worker: {worker}") + download_status = download(output, snapshot, connection, status_message, no_redirect) + if not download_status and snapshot not in failed_urls: + failed_urls.append(snapshot) if download_status: if snapshot in failed_urls: failed_urls.remove(snapshot) - vb.write(progress=1) + vb.progress(1) + if delay > 0: + vb.write(message=f"\n-----> Worker: {worker} - Delay: {delay} seconds") + time.sleep(delay) + if failed_urls: if not attempt > max_attempt: attempt += 1 - vb.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds") + vb.write(message=f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds") time.sleep(15) - download_loop(snapshot_queue, output, worker, retry, no_redirect, skipset, attempt, connection, failed_urls) + 
download_loop(snapshot_queue, output, worker, retry, no_redirect, delay, skipset, attempt, connection, failed_urls) except Exception as e: ex.exception(f"Worker: {worker} - Exception", e) - snapshot_queue.put(snapshot) # requeue snapshot if worker crashes + snapshot_queue.put(snapshot) # requeue snapshot if worker crashes @@ -256,51 +265,35 @@ def download(output, snapshot_entry, connection, status_message, no_redirect=Fal max_retries = 2 sleep_time = 45 headers = {'User-Agent': f'bitdruid-python-wayback-downloader/{__version__}'} + success = False for i in range(max_retries): try: - connection.request("GET", encoded_download_url, headers=headers) - response = connection.getresponse() - response_data = response.read() - response_status = response.status - response_status_message = parse_response_code(response_status) - sc.snapshot_entry_modify(snapshot_entry, "response", response_status) - if not no_redirect: - if response_status == 302: - status_message = f"{status_message}\n" + \ - f"REDIRECT -> HTTP: {response.status} - {response_status_message}\n" + \ - f" -> FROM: {download_url}" - redirect_count = 0 - while response_status == 302: - redirect_count += 1 - if redirect_count > 5: - break - connection.request("GET", encoded_download_url, headers=headers) - response = connection.getresponse() - response_data = response.read() - response_status = response.status - response_status_message = parse_response_code(response_status) - location = response.getheader("Location") - if location: - encoded_download_url = urllib.parse.quote(urljoin(download_url, location), safe=':/') - status_message = f"{status_message}\n" + \ - f" -> TO: {download_url}" - sc.snapshot_entry_modify(snapshot_entry, "redirect_timestamp", url_get_timestamp(location)) - sc.snapshot_entry_modify(snapshot_entry, "redirect_url", download_url) - else: - break + response, response_data, response_status, response_status_message = download_response(connection, encoded_download_url, headers) + sc.entry_modify(snapshot_entry, "response", response_status) + if not no_redirect and response_status == 302: + status_message.store(status="REDIRECT", type="HTTP", message=f"{response.status} - {response_status_message}") + status_message.store(status="", type="FROM", message=download_url) + for _ in range(5): + response, response_data, response_status, response_status_message = download_response(connection, encoded_download_url, headers) + location = response.getheader("Location") + if location: + encoded_download_url = urllib.parse.quote(urljoin(download_url, location), safe=':/') + status_message.store(status="", type="TO", message=location) + sc.entry_modify(snapshot_entry, "redirect_timestamp", url_get_timestamp(location)) + sc.entry_modify(snapshot_entry, "redirect_url", download_url) + else: + break if response_status == 200: output_file = sc.create_output(download_url, snapshot_entry["timestamp"], output) output_path = os.path.dirname(output_file) # if output_file is too long for windows, skip download if check_nt() and len(output_file) > 255: - status_message = f"{status_message}\n" + \ - f"PATH TOO LONG TO SAVE FILE -> HTTP: {response_status} - {response_status_message}\n" + \ - f" -> URL: {download_url}" - sc.snapshot_entry_modify(snapshot_entry, "file", "PATH TOO LONG TO SAVE FILE") - vb.write(status_message) - return True - + status_message.store(status="PATH > 255", type="HTTP", message=f"{response.status} - {response_status_message}") + status_message.store(status="", type="URL", message=download_url) + 
sc.entry_modify(snapshot_entry, "file", "PATH TOO LONG TO SAVE FILE") + status_message.write() + continue # case if output_path is a file, move file to temporary name, create output_path and move file into output_path if os.path.isfile(output_path): move_index(existpath=output_path) @@ -309,44 +302,54 @@ def download(output, snapshot_entry, connection, status_message, no_redirect=Fal # case if output_file is a directory, create file as index.html in this directory if os.path.isdir(output_file): output_file = move_index(existfile=output_file, filebuffer=response_data) - + # download file if not existing if not os.path.isfile(output_file): with open(output_file, 'wb') as file: if response.getheader('Content-Encoding') == 'gzip': response_data = gzip.decompress(response_data) - file.write(response_data) - else: - file.write(response_data) + file.write(response_data) + # check if file is downloaded if os.path.isfile(output_file): - status_message = f"{status_message}\n" + \ - f"SUCCESS -> HTTP: {response_status} - {response_status_message}" + status_message.store(status="SUCCESS", type="HTTP", message=f"{response.status} - {response_status_message}") else: - status_message = f"{status_message}\n" + \ - f"EXISTING -> HTTP: {response_status} - {response_status_message}" - status_message = f"{status_message}\n" + \ - f" -> URL: {download_url}\n" + \ - f" -> FILE: {output_file}" - vb.write(status_message) - sc.snapshot_entry_modify(snapshot_entry, "file", output_file) - return True - + status_message.store(status="EXISTING", type="HTTP", message=f"{response.status} - {response_status_message}") + status_message.store(status="", type="URL", message=download_url) + status_message.store(status="", type="FILE", message=output_file) + sc.entry_modify(snapshot_entry, "file", output_file) + # if convert_links: + # convert.links(output_file, status_message) + status_message.write() + success = True + break else: - status_message = f"{status_message}\n" + \ - f"UNEXPECTED -> HTTP: {response_status} - {response_status_message}\n" + \ - f" -> URL: {download_url}" - vb.write(status_message) - return True - # exception returns false and appends the url to the failed list - except http.client.HTTPException as e: - status_message = f"{status_message}\n" + \ - f"EXCEPTION -> ({i+1}/{max_retries}), append to failed_urls: {download_url}\n" + \ - f" -> {e}" - vb.write(status_message) - return False - except (timeout, ConnectionRefusedError, ConnectionResetError) as e: + status_message.store(status="UNEXPECTED", type="HTTP", message=f"{response.status} - {response_status_message}") + status_message.store(status="", type="URL", message=download_url) + status_message.write() + continue + # exception handling + except (http.client.HTTPException, timeout, ConnectionRefusedError, ConnectionResetError) as e: download_exception(type, e, i, max_retries, sleep_time, status_message) - vb.write(f"FAILED -> download, append to failed_urls: {download_url}") - return False + continue + if not success: + status_message.store(status="FAILED", type="", message=f"append to failed_urls: {download_url}") + status_message.write() + return success + + + + + +def download_response(connection, encoded_download_url, headers): + connection.request("GET", encoded_download_url, headers=headers) + response = connection.getresponse() + response_data = response.read() + response_status = response.status + response_status_message = parse_response_code(response_status) + return response, response_data, response_status, response_status_message + + + 
+ RESPONSE_CODE_DICT = { 200: "OK", @@ -364,9 +367,8 @@ def download_exception(type, e, i, max_retries, sleep_time, status_message): Handle exceptions during the download process. """ type = e.__class__.__name__.upper() - status_message = f"{status_message}\n" + \ - f"{type} -> ({i+1}/{max_retries}), reconnect in {sleep_time} seconds...\n" - vb.write(status_message) + status_message.store(status=f"{type}", type=f"({i+1}/{max_retries})", message=f"reconnect in {sleep_time} seconds...") + status_message.write() time.sleep(sleep_time) def parse_response_code(response_code: int): @@ -440,7 +442,7 @@ def skip_open(csv_path: str, url: str) -> tuple: csv_file.close() return skipset else: - vb.write("\nNo CSV-file or content found to load skipable URLs") + vb.write(message="\nNo CSV-file or content found to load skipable URLs") return None except Exception as e: ex.exception("Could not open CSV-file", e) diff --git a/pywaybackup/arguments.py b/pywaybackup/arguments.py deleted file mode 100644 index 540ab9e..0000000 --- a/pywaybackup/arguments.py +++ /dev/null @@ -1,45 +0,0 @@ -import sys -import argparse -from pywaybackup.__version__ import __version__ - -def parse(): - - parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)') - parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid') - parser.add_argument('-d', '--debug', action='store_true', help='Debug mode (Always full traceback and creates an error.log') - - required = parser.add_argument_group('required (one exclusive)') - required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download') - exclusive_required = required.add_mutually_exclusive_group(required=True) - exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot') - exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps') - exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine') - - optional = parser.add_argument_group('optional query parameters') - optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (opt range in y)') - optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url') - optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory') - optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search') - optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss') - optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss') - - special = parser.add_argument_group('manipulate behavior') - special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder') - special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder') - special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org') - special.add_argument('--verbosity', type=str, default="standard", metavar="", help='["progress", "json"] Verbosity level') - special.add_argument('--retry', type=int, 
default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)') - special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)') - - cdx = parser.add_argument_group('cdx (one exclusive)') - exclusive_cdx = cdx.add_mutually_exclusive_group() - exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query-result to a file for recurent use - defaults to output folder') - exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url') - - auto = parser.add_argument_group('auto') - auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download') - - args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help - command = ' '.join(sys.argv[1:]) - - return args, command diff --git a/pywaybackup/helper.py b/pywaybackup/helper.py index 3186f8e..f3cde24 100644 --- a/pywaybackup/helper.py +++ b/pywaybackup/helper.py @@ -1,6 +1,8 @@ import os +import re import shutil +from urllib.parse import urlparse, urljoin import magic @@ -21,20 +23,35 @@ def sanitize_filename(input: str) -> str: input = '.'.join(filter(None, input.split('.'))) return input +def sanitize_url(input: str) -> str: + """ + Sanitize a url by encoding special characters. + """ + special_chars = [":", "*", "?", "&", "=", "<", ">", "\\", "|"] + for char in special_chars: + input = input.replace(char, f"%{ord(char):02x}") + return input + def url_get_timestamp(url): """ Extract the timestamp from a wayback machine URL. """ - timestamp = url.split("id_/")[0].split("/")[-1] + timestamp = url.split("web.archive.org/web/")[1].split("/")[0] + if "id_" in url: timestamp = timestamp.split("id_")[0] return timestamp def url_split(url, index=False): """ Split a URL into domain, subdir, and filename. 
+ + Index: + - [0] = domain + - [1] = subdir + - [2] = filename """ - if url.startswith("http"): + if "://" in url: url = url.split("://")[1] domain = url.split("/")[0] path = url[len(domain):] @@ -89,6 +106,4 @@ def check_index_mime(filebuffer: bytes) -> bool: mime_type = mime.from_buffer(filebuffer) if mime_type != "text/html": return False - return True - - + return True \ No newline at end of file diff --git a/pywaybackup/main.py b/pywaybackup/main.py index b8e1eec..a524e38 100644 --- a/pywaybackup/main.py +++ b/pywaybackup/main.py @@ -2,60 +2,34 @@ import signal -import pywaybackup.helper as helper import pywaybackup.archive as archive -from pywaybackup.arguments import parse +from pywaybackup.Arguments import Configuration as config from pywaybackup.Verbosity import Verbosity as vb from pywaybackup.Exception import Exception as ex +from pywaybackup.Converter import Converter as convert def main(): - args, command = parse() - if args.output is None: - args.output = os.path.join(os.getcwd(), "waybackup_snapshots") - os.makedirs(args.output, exist_ok=True) - else: - os.makedirs(args.output, exist_ok=True) - - ex.init(args.debug, args.output, command) - vb.init(args.verbosity) - - if args.full: - mode = "full" - if args.current: - mode = "current" - - if args.auto: - args.skip = args.output - args.csv = args.output - args.cdxbackup = args.output - args.cdxinject = os.path.join(args.output, f"waybackup_{helper.sanitize_filename(args.url)}.cdx") - else: - if args.skip is True: - args.skip = args.output - if args.csv is True: - args.csv = args.output - if args.cdxbackup is True: - args.cdxbackup = args.output - if args.cdxinject is True: - args.cdxinject = args.output - - if args.save: - archive.save_page(args.url) + config.init() + ex.init(config.debug, config.output, config.command) + vb.init(config.verbosity, config.log) + if config.save: + archive.save_page(config.url) else: try: - skipset = archive.skip_open(args.skip, args.url) if args.skip else None - archive.query_list(args.url, args.range, args.start, args.end, args.explicit, mode, args.cdxbackup, args.cdxinject) - if args.list: + skipset = archive.skip_open(config.skip, config.url) if config.skip else None + archive.query_list(config.range, config.start, config.end, config.explicit, config.mode, config.cdxbackup, config.cdxinject) + if config.list: archive.print_list() else: - archive.download_list(args.output, args.retry, args.no_redirect, args.workers, skipset) + archive.download_list(config.output, config.retry, config.no_redirect, config.delay, config.workers, skipset) except KeyboardInterrupt: print("\nInterrupted by user\n") finally: signal.signal(signal.SIGINT, signal.SIG_IGN) - archive.csv_close(args.csv, args.url) if args.csv else None + archive.csv_close(config.csv, config.url) if config.csv else None + vb.fini() os._exit(0) # kill all threads diff --git a/test/test_links.js b/test/test_links.js new file mode 100644 index 0000000..871167e --- /dev/null +++ b/test/test_links.js @@ -0,0 +1,63 @@ +// Example JavaScript File: example.js + +// External script with absolute URL +var externalScript = document.createElement('script'); +externalScript.src = 'http://example.com/js/external-script.js'; +document.head.appendChild(externalScript); + +// External script with relative URL +var localScript = document.createElement('script'); +localScript.src = '/js/local-script.js'; +document.head.appendChild(localScript); + +// Inline style with absolute URL in background image +var element = document.createElement('div'); 
+element.style.backgroundImage = "url('http://example.com/images/bg.png')"; +document.body.appendChild(element); + +// Inline style with relative URL in background image +element.style.backgroundImage = "url('/images/bg.png')"; + +// CSS in JavaScript with absolute URL +var css = "body { background-image: url('http://example.com/images/body-bg.png'); }"; +var style = document.createElement('style'); +style.type = 'text/css'; +style.appendChild(document.createTextNode(css)); +document.head.appendChild(style); + +// CSS in JavaScript with relative URL +var cssLocal = "body { background-image: url('/images/body-bg.png'); }"; +var styleLocal = document.createElement('style'); +styleLocal.type = 'text/css'; +styleLocal.appendChild(document.createTextNode(cssLocal)); +document.head.appendChild(styleLocal); + +// Image element with absolute URL +var imgElement = document.createElement('img'); +imgElement.src = 'http://example.com/images/logo.png'; +document.body.appendChild(imgElement); + +// Image element with relative URL +var imgLocalElement = document.createElement('img'); +imgLocalElement.src = '/images/logo.png'; +document.body.appendChild(imgLocalElement); + +// Fetch API call with absolute URL +fetch('http://example.com/api/data') + .then(response => response.json()) + .then(data => console.log(data)); + +// Fetch API call with relative URL +fetch('/api/data') + .then(response => response.json()) + .then(data => console.log(data)); + +// XMLHttpRequest with absolute URL +var xhr = new XMLHttpRequest(); +xhr.open('GET', 'http://example.com/api/data', true); +xhr.send(); + +// XMLHttpRequest with relative URL +var xhrLocal = new XMLHttpRequest(); +xhrLocal.open('GET', '/api/data', true); +xhrLocal.send();
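The `--auto` flag wired up in `Arguments.py` above ties these pieces together: it implies `--csv`, `--skip`, and `--cdxbackup`/`--cdxinject`, all pointed at the output folder, so an interrupted download can be resumed. A hypothetical round trip, again assuming the `waybackup` entry point:

```
# first run is interrupted part-way through
waybackup -u http://example.com -f --auto
# re-running the identical command skips already-downloaded files and
# injects the saved waybackup_<url>.cdx instead of re-querying the CDX server
waybackup -u http://example.com -f --auto
```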