Merge pull request #53 from uga-libraries/48-run-on-linux
Issue 48 run on linux
amhanson9 authored Nov 28, 2023
2 parents 25c96ac + 833fa27 commit 922803a
Showing 157 changed files with 1,711 additions and 1,281 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
configuration.py
_pycache_
*.pyc
tests/.pytest_cache
69 changes: 29 additions & 40 deletions README.md
@@ -1,6 +1,7 @@
# Download Archive-It Web Content for Preservation

# Overview

Downloads WARCs and six metadata reports for crawls saved during a specified time period
from the Archive-It web archiving service to use for creating a preservation copy of web crawls.

@@ -11,69 +12,57 @@ which prepares them for UGA's digital preservation system (ARCHive).

UGA downloads web content using this script on a quarterly basis.

Additional script: linux_unzip.py
Script usage: `python linux_unzip.py aips_directory`
Used to unzip the downloaded WARCs in a Linux environment when there is an error in Windows.
It is a known bug that Windows zip programs sometimes results in errors for gzip.

# Getting Started

## Dependencies

* md5deep (https://github.com/jessek/hashdeep)
* numpy
* pandas
* md5deep (https://github.com/jessek/hashdeep) - used to calculate fixity of the downloaded WARC
* numpy - used in unit tests to indicate blank cells
* pandas - used to work with API data and CSV (log) data
* requests - used to get data via Archive-It APIs
* 7-Zip (Windows only) (https://www.7-zip.org/download.html)

## Installation

Before running the script, create a configuration.py file modeled after the configuration_template.py file
with your Archive-It credentials.
Prior to running the script, create a file named configuration.py, modeled after the configuration_template.py,
and save it to your local copy of this repository.
This defines a place for script output to be saved and includes your Archive-It login credentials.

This script must be run in Linux, due to Windows commonly having unzip errors with gzip.

## Script Arguments

Run the script in the command line: `python warc_download.py date_start date_end`
Run the script in the command line: `python ait_download.py date_start date_end`

* date_start is inclusive: the download will include WARCs stored on date_start.
* date_end is exclusive: the download will not include WARCs stored on date_end.
* Format both dates YYYY-MM-DD
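
The date rules above describe a half-open range: date_start is included and date_end is not. The following is a minimal sketch of that rule, not code taken from ait_download.py. The function names `parse_date` and `in_download_range` are hypothetical and used only for illustration.

```python
from datetime import datetime


def parse_date(text):
    """Parse a YYYY-MM-DD string into a date. Raises ValueError if it does not match."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def in_download_range(stored, date_start, date_end):
    """Return True if stored falls in the half-open range [date_start, date_end)."""
    return date_start <= stored < date_end


# Example quarter: WARCs stored from 2023-08-01 up to, but not including, 2023-11-01.
start = parse_date("2023-08-01")
end = parse_date("2023-11-01")

print(in_download_range(parse_date("2023-08-01"), start, end))  # True (start is inclusive)
print(in_download_range(parse_date("2023-11-01"), start, end))  # False (end is exclusive)
```

With this convention, using the end date of one download as the start date of the next covers every day exactly once.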

## Testing

There are unit tests for all the script functions used by warc_download.py except check_seeds(),
which will be changed soon.
There are unit tests for all the script functions used by ait_download.py and for running the entire script.
The tests for check_seeds() could use more detail, which will be done once the function is updated.
The tests in test_script.py will fail if run at the same time as all other tests in the folder,
because one of the previous tests changes the current directory.
Run test_script.py on its own for an accurate result.

There are no tests for linux_unzip.py, which will be integrated into warc_download.py soon.
The unit tests use UGA Archive-It data.
Any other organization will need to update the expected results with their own data.

# Workflow

Because the script can take days to run, due to the time required to download WARCs, it often gets interrupted.
If this happens, running the script again will cause it to restart the seed that was in-progress when the error happened
and download content for all seeds that had not started yet.
Any seed that already completed, even if it had errors, will not be downloaded again.
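
The restart behavior can be illustrated with a small pandas sketch. It mirrors the filter on the Complete column that ait_download.py uses, but the log rows here are made up for the example and the variable name `remaining` is hypothetical.

```python
import pandas as pd

# Made-up log rows: two seeds finished (one of them with errors), one never finished.
seed_df = pd.DataFrame({
    "Seed_ID": [2027707, 2027776, 2529671],
    "Complete": ["Successfully completed", "WARC_Downloaded_Errors", None],
})

# Only seeds with nothing in the Complete column are processed on a restart.
# A seed that finished with errors still has a value, so it is skipped.
remaining = seed_df[seed_df["Complete"].isnull()]

print(remaining["Seed_ID"].tolist())  # [2529671]
```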

The script output is saved in the script output folder, defined in the configuration file.

1. Uses Archive-It API's, or the seeds_log.csv from an earlier iteration of the script,
to get data about the seeds for this download and make the metadata.csv used by the general-aip.py script.


2. For each seed in the download:
    1. Makes a folder named with the seed id.
    2. Downloads the metadata reports.
    3. Deletes empty metadata reports (there is no metadata of that type in Archive-It).
    4. Redacts login information from the seed report.
    5. Downloads each WARC and verifies the fixity against the MD5 in Archive-It.
    6. Unzips each WARC.
    7. Saves a summary of the errors, if any, to the log (seeds_log.csv).


3. Checks if everything expected was downloaded and makes a log (completeness_check.csv).
1. Verify metadata completeness with the [Archive-It APIs scripts](https://github.com/uga-libraries/web-archive-it-api)
2. Download the WARCs and related metadata with this script: [download workflow documentation](documentation/Workflow_Preservation_Download_Part_2.md)
3. Transform the downloaded content into AIPs with the [General AIP script](https://github.com/uga-libraries/general-aip)
4. Ingest the AIPs into our digital preservation system (ARCHive)

# Author

4. If there were errors during unzipping, run the linux_unzip.py script to unzip them.
Adriane Hanson, Head of Digital Stewardship, University of Georgia

# History

# Author
Adriane Hanson, Head of Digital Stewardship, University of Georgia
UGA Libraries has downloaded all WARCs and the six associated metadata reports for local preservation since 2020.
Originally, the WARCs were stored as zipped (gzip) files, which is how they are downloaded from Archive-It.
The WARCs were unzipped beginning with the August 2022 download, in line with new preservation format procedures.
They often had to be unzipped in Linux due to a Windows bug with gzip.
In November 2023, the entire script was switched to Linux to be more efficient.
52 changes: 29 additions & 23 deletions warc_download.py → ait_download.py
@@ -1,30 +1,36 @@
"""
Purpose: Downloads archived web content (WARCs) and associated metadata for a group of seeds from Archive-It.org using
their APIs and prepares them to be converted into AIPs with the general-aip.py script for long-term preservation.
At UGA, this script is run every three months to download content for all crawls saved that quarter.
"""Download WARCs and associated metadata from Archive-It for long-term preservation.
At UGA, this script is run every three months to download content for all crawls saved that quarter.
The download combines all WARCs saved within a quarter for a seed, even if that seed was crawled multiple times.
It also includes six of the metadata reports:
* Collection
* Collection Scope (not downloaded if no scope rules for the collection)
* Crawl Definition (may be more than one)
* Crawl Job (may be more than one)
* Seed
* Seed Scope (not downloaded if not scope rules for the seed)
Prior to the preservation download, all seed metadata should be entered into Archive-It.
Use the seed_metadata_report.py script to verify all required fields are present.
* Seed Scope (not downloaded if no scope rules for the seed)
Parameters:
There are two date parameters, formatted YYYY-MM-DD, which define which WARCs to include in the download.
date_start : required. WARCs stored on this day will be included.
date_end : required. WARCs stored on this day will NOT be included.
Returns:
One folder for each seed, with the WARCs and metadata reports.
A metadata.csv file needed for the general-aip script to prepare the folders for preservation.
A seeds_log.csv file with information about each workflow step.
A completeness_log.csv file with information about the download's completeness.
"""

# Usage: python warc_download.py date_start date_end
# Usage: python ait_download.py date_start date_end

import os
import pandas as pd
import re
import sys

# Import functions and constant variables from other UGA scripts.
# Configuration is made by the user and could be forgotten. The others are in the script repo.
# Configuration is made by the user and could be forgotten.
try:
    import configuration as c
except ModuleNotFoundError:
@@ -33,8 +39,8 @@
    sys.exit()
import web_functions as fun

# The preservation download is limited to warcs created during a particular time frame.
# UGA downloads every quarter (2/1-4/30, 5/1-7/31, 8/1-10/31, 11/1-1/31)
# Tests to validate the two date arguments, which specify the time frame for WARCs to include in the download.

# Tests that both dates are provided. If not, ends the script.
try:
    date_start, date_end = sys.argv[1:]
@@ -63,7 +69,7 @@
seeds_directory = os.path.join(c.script_output, "preservation_download")

# The script may be run repeatedly if there are interruptions, such as due to API connections.
# If it has run, it will use the existing seeds_log.csv for seed_df and and skip seeds that were already done.
# If it has run, it will use the existing seeds_log.csv for seed_df and skip seeds that were already done.
# Otherwise, it makes seed_df and metadata_csv by getting data from the Archive-It APIs
# and adds the AIP_ID from metadata_csv to be the first column of seed_df.
if os.path.exists(seeds_directory):
@@ -77,36 +83,36 @@
seed_df = pd.merge(seed_df, aip_id_df, how="left")
seed_df.insert(0, "AIP_ID", seed_df.pop('AIP_ID'))

# Starts counter for tracking script progress.
# Some processes are slow, so this shows the script is still working and how much remains.
# Starts a counter for tracking script progress.
# Some processes are slow, so this shows the script is still working and how much work remains.
current_seed = 0
total_seeds = len(seed_df[seed_df["Complete"].isnull()])

# Iterates through information about each seed, downloading metadata and WARC files from Archive-It.
# Filtered for no data in the WARC_Unzip_Errors (last log column) to skip seeds done earlier if this is a restart.
# Filtered for no data in the Complete column to skip seeds done earlier if this is a restart.
for seed in seed_df[seed_df["Complete"].isnull()].itertuples():

    # Updates the current seed number and displays the script progress.
    current_seed += 1
    print(f"\nStarting seed {current_seed} of {total_seeds}.")

    # Row index for the seed being processed in the dataframe, to use for adding logging information.
    # Calculates the row index for the seed being processed in the dataframe, to use for adding log information.
    row_index = seed_df.index[seed_df["Seed_ID"] == seed.Seed_ID].tolist()[0]

    # If the seed already has a folder from an error in a previous iteration of the script,
    # deletes the contents and anything in the seeds_log.csv from the previous iteration so it can be remade.
    # deletes the contents and anything in the seeds_log.csv from the previous iteration, so it can be remade.
    if os.path.exists(str(seed.Seed_ID)):
        fun.reset_seed(seed.Seed_ID, seed_df)

    # Makes a folder for the seed in the AIP directory,
    # Makes a folder for the seed in the seeds directory,
    # and downloads the metadata and WARC files to that seed folder.
    os.mkdir(str(seed.Seed_ID))
    fun.download_metadata(seed, row_index, seed_df)
    fun.download_warcs(seed, row_index, seed_df)

    # Updates the Complete column with a summary of error types or that the seed processed successfully.
    # Updates the Complete column with the error type or that the seed processed successfully.
    fun.add_completeness(row_index, seed_df)

# Verifies the all expected seed folders are present and have all the expected metadata files and WARCs.
# Saves the result as a csv in the folder with the downloaded AIPs.
# Verifies that all expected seed folders are present and contain all the expected metadata files and WARCs.
# Saves the result as a csv in the folder with the downloaded content.
fun.check_seeds(date_end, date_start, seed_df, seeds_directory)
4 changes: 0 additions & 4 deletions configuration_template.py
@@ -12,7 +12,3 @@
inst_page = 'https://partner.archive-it.org/INSERT-NUMBER'
username = 'INSERT-USERNAME'
password = 'INSERT-PASSWORD'

# Path to md5deep64.exe, used for fixity calculations in Windows.
# Use \ in the path or else it will not run.
md5deep = r'INSERT-PATH'
7 changes: 7 additions & 0 deletions documentation/Example_Script_Output/completeness_check.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Seed,AIP,Seed Folder Made,coll.csv,collscope.csv,seed.csv,seedscope.csv,crawldef.csv count,crawljob.csv count,WARC Count Correct,All Expected File Types
2027776,rbrl-377-web-201907-0001,TRUE,TRUE,TRUE,TRUE,FALSE,1,1,TRUE,TRUE
2027707,rbrl-498-web-201907-0001,TRUE,TRUE,TRUE,TRUE,TRUE,1,1,TRUE,TRUE
2529683,magil-ggp-2529683-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
2529676,magil-ggp-2529676-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
2529671,magil-ggp-2529671-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,FALSE,TRUE
2520379,magil-ggp-2520379-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
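
One possible way to review a completeness log like the example above is to load it with pandas and list the seeds where any check did not pass. This is a sketch only, not part of the repository. It embeds two rows from the example so it runs on its own, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Header and two rows copied from the example completeness_check.csv.
csv_text = """Seed,AIP,Seed Folder Made,coll.csv,collscope.csv,seed.csv,seedscope.csv,crawldef.csv count,crawljob.csv count,WARC Count Correct,All Expected File Types
2529676,magil-ggp-2529676-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
2529671,magil-ggp-2529671-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,FALSE,TRUE
"""

# Read every column as text so TRUE/FALSE can be compared without type guessing.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Columns that should be TRUE for every seed. The scope reports are left out
# because they are legitimately absent when there are no scope rules.
required = ["Seed Folder Made", "WARC Count Correct", "All Expected File Types"]
problems = df[(df[required] != "TRUE").any(axis=1)]

print(problems["Seed"].tolist())  # ['2529671']
```

In the example data, seed 2529671 is flagged because its WARC count is not correct, which matches the download error recorded for that seed in seeds_log.csv.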
7 changes: 7 additions & 0 deletions documentation/Example_Script_Output/metadata.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Department,Collection,Folder,AIP_ID,Title,Version
russell,rbrl-498,2027707,rbrl-498-web-201907-0001,Open Records with Deborah Gonzalez,1
russell,rbrl-377,2027776,rbrl-377-web-201907-0001,Southeast ADA Center: Your Regional Resource for the Americans with Disabilities Act (ADA),1
magil,magil-0000,2520379,magil-ggp-2520379-2023-05,Georgia Department of Natural Resources Wildlife Resources Division,1
magil,magil-0000,2529671,magil-ggp-2529671-2023-05,Georgia Real Estate Commission & Appraisers Board,1
magil,magil-0000,2529676,magil-ggp-2529676-2023-05,Georgia State Board of Accountancy,1
magil,magil-0000,2529683,magil-ggp-2529683-2023-05,Georgia State Finance Commission,1
7 changes: 7 additions & 0 deletions documentation/Example_Script_Output/seeds_log.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
AIP_ID,Seed_ID,AIT_Collection,Job_ID,Size_GB,WARCs,WARC_Filenames,Metadata_Report_Errors,Metadata_Report_Empty,Seed_Report_Redaction,WARC_Download_Errors,WARC_Fixity_Errors,WARC_Unzip_Errors,Complete,,,,,,,,,,,,,,,,,,,
rbrl-498-web-201907-0001,2027707,12265,943048,0.007,1,ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz,Successfully downloaded all metadata reports,No empty reports,Successfully redacted,Successfully downloaded ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz,Successfully verified ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz fixity on 2023-11-28 20:33:38.106304,Successfully unzipped ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz,Successfully completed,,,,,,,,,,,,,,,,,,,
rbrl-377-web-201907-0001,2027776,12264,943446,0.096,1,ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz,Successfully downloaded all metadata reports,rbrl-377-web-201907-0001_seedscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz,Successfully verified ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz fixity on 2023-11-28 20:34:09.872859,Successfully unzipped ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz,Successfully completed,,,,,,,,,,,,,,,,,,,
magil-ggp-2520379-2023-05,2520379,15678,1789230,7.434,7,ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2520379-2023-05_seedscope.csv, magil-ggp-2520379-2023-05_collscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz,Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz fixity on TIMESTAMP,Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz,Successfully completed
magil-ggp-2529671-2023-05,2529671,15678,1791478,0.028,1,ARCHIVEIT-15678-TEST-JOB1791478-0-SEED2529671-20230420155417222-00000-mntg8u5v.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2529671-2023-05_seedscope.csv, magil-ggp-2529671-2023-05_collscope.csv,Successfully redacted,API error 404: can't downloaded ARCHIVEIT-15678-TEST-JOB1791478-0-SEED2529671-20230420155417222-00000-mntg8u5v.warc.gz,,,WARC_Downloaded_Errors,,,,,,,,,,,,,,,,,,
magil-ggp-2529683-2023-05,2529683,15678,1791489,0.05,2,ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz|ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2529683-2023-05_seedscope.csv, magil-ggp-2529683-2023-05_collscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz,Successfully verified ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz fixity on 2023-11-28 20:36:06.713040, Successfully verified ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz fixity on 2023-11-28 20:36:23.739950,Successfully unzipped ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz,Successfully completed,,,,,,,,,,,,,,,
magil-ggp-2529676-2023-05,2529676,15678,1791480,0.014,1,ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2529676-2023-05_seedscope.csv, magil-ggp-2529676-2023-05_collscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz,Successfully verified ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz fixity on 2023-11-28 20:36:44.890513,Successfully unzipped ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz,Successfully completed,,,,,,,,,,,,,,,,,,