Merge pull request #53 from uga-libraries/48-run-on-linux

Issue 48 run on linux
uga-libraries · Nov 28, 2023 · 922803a
2 parents 25c96ac + 833fa27
commit 922803a

Showing 157 changed files with 1,711 additions and 1,281 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,3 +2,4 @@
 configuration.py
 _pycache_
 *.pyc
+tests/.pytest_cache
diff --git a/README.md b/README.md
@@ -1,6 +1,7 @@
 # Download Archive-It Web Content for Preservation
 
 # Overview
+
 Downloads WARCs and six metadata reports for crawls saved during a specified time period 
 from the Archive-It web archiving service to use for creating a preservation copy of web crawls.
 
@@ -11,69 +12,57 @@ which prepares them for UGA's digital preservation system (ARCHive).
 
 UGA downloads web content using this script on a quarterly basis.
 
-Additional script: linux_unzip.py
-Script usage: `python linux_unzip.py aips_directory`
-Used to unzip the downloaded WARCs in a Linux environment when there is an error in Windows.
-It is a known bug that Windows zip programs sometimes results in errors for gzip.
-
 # Getting Started
 
 ## Dependencies
 
-* md5deep (https://github.com/jessek/hashdeep)
-* numpy
-* pandas
+* md5deep (https://github.com/jessek/hashdeep) - used to calculate fixity of the downloaded WARC
+* numpy - used in unit tests to indicate blank cells
+* pandas - used to work with API data and CSV (log) data
 * requests - used to get data via Archive-It APIs
-* 7-Zip (Windows only) (https://www.7-zip.org/download.html)
 
 ## Installation
 
-Before running the script, create a configuration.py file modeled after the configuration_template.py file
-with your Archive-It credentials.
+Prior to running the script, create a file named configuration.py, modeled after the configuration_template.py,
+and save it to your local copy of this repository.
+This defines a place for script output to be saved and includes your Archive-It login credentials.
+
+This script must be run in Linux, due to Windows commonly having unzip errors with gzip.
 
 ## Script Arguments
 
-Run the script in the command line: `python warc_download.py date_start date_end`
+Run the script in the command line: `python ait_download.py date_start date_end`
 
    * date_start is inclusive: the download will include WARCs stored on date_start.
    * date_end is exclusive: the download will not include WARCs stored on date_end.
    * Format both dates YYYY-MM-DD
 
 ## Testing
 
-There are unit tests for all the script functions used by warc_download.py except check_seeds(),
-which will be changed soon.
+There are unit tests for all the script functions used by ait_download.py and for running the entire script.
+The tests for check_seeds() could use more detail, which will be done once the function is updated.
+The tests in test_script.py will fail if run at the same time as all other tests in the folder,
+because one of the previous tests changes the current directory. 
+Run test_script.py on its own for an accurate result.
 
-There are no tests for linux_unzip.py, which will be integrated into warc_download.py soon.
+The unit tests use UGA Archive-It data.
+Any other organization will need to update the expected results with their own data.
 
 # Workflow
 
-Because the script can take days to run, due to the time required to download WARCs, it often gets interrupted. 
-If this happens, running the script again will cause it to restart the seed that was in-progress when the error happened 
-and download content for all seeds that had not started yet.
-Any seed that already completed, even if it had errors, will not be downloaded again.
-
-The script output is saved in the script output folder, defined in the configuration file.
-
-1. Uses Archive-It API's, or the seeds_log.csv from an earlier iteration of the script, 
-   to get data about the seeds for this download and make the metadata.csv used by the general-aip.py script. 
-
-
-2. For each seed in the download:
-   1. Makes a folder named with the seed id.
-   2. Downloads the metadata reports.
-   3. Deletes empty metadata reports (there is no metadata of that type in Archive-It).
-   4. Redacts login information from the seed report.
-   5. Downloads each WARC and verifies the fixity against the MD5 in Archive-It.
-   6. Unzips each WARC.      
-   7. Saves a summary of the errors, if any, to the log (seeds_log.csv).
-
-
-3. Checks if everything expected was downloaded and makes a log (completeness_check.csv).
+1. Verify metadata completeness with the [Archive-It APIs scripts](https://github.com/uga-libraries/web-archive-it-api)
+2. Download the WARCs and related metadata with this script: [download workflow documentation](documentation/Workflow_Preservation_Download_Part_2.md)
+3. Transform the downloaded content into AIPs with the [General AIP script](https://github.com/uga-libraries/general-aip)
+4. Ingest the AIPs into our digital preservation system (ARCHive)
 
+# Author
 
-4. If there were errors during unzipping, run the linux_unzip.py script to unzip them.
+Adriane Hanson, Head of Digital Stewardship, University of Georgia
 
+# History
 
-# Author
-Adriane Hanson, Head of Digital Stewardship, University of Georgia
+UGA Libraries has downloaded all WARCs and the six associated metadata reports for local preservation since 2020.
+Originally, the WARCs were stored as zipped (gzip) files, which is how they are downloaded from Archive-It.
+The WARCs were unzipped beginning with the August 2022 download, in line with new preservation format procedures.
+They often had to be unzipped in Linux due to a Windows bug with gzip.
+In November 2023, the entire script was switched to Linux to be more efficient.
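Note on the README's Script Arguments section: the date convention (date_start inclusive, date_end exclusive, both formatted YYYY-MM-DD) can be sketched as follows. This is a minimal illustration, not the validation code in ait_download.py. The function names `parse_date_range` and `in_range`, and the check that the start comes before the end, are assumptions made for the example.

```python
import re
from datetime import date, datetime


def parse_date_range(date_start, date_end):
    """Parse two YYYY-MM-DD strings into dates (hypothetical helper)."""
    parsed = []
    for value in (date_start, date_end):
        # Require zero-padded YYYY-MM-DD before parsing.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            raise ValueError(f"{value} is not formatted YYYY-MM-DD")
        # strptime raises ValueError for impossible dates such as 2023-13-01.
        parsed.append(datetime.strptime(value, "%Y-%m-%d").date())
    start, end = parsed
    # Assumption: an empty or reversed range is treated as an error.
    if start >= end:
        raise ValueError("date_start must be earlier than date_end")
    return start, end


def in_range(stored, start, end):
    """True if a WARC stored on this date falls in the download window."""
    # date_start is inclusive and date_end is exclusive.
    return start <= stored < end


start, end = parse_date_range("2023-02-01", "2023-05-01")
print(in_range(date(2023, 2, 1), start, end))
print(in_range(date(2023, 5, 1), start, end))
```

Under this convention, a quarter that runs through April 30 would be requested with a date_end of May 1.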
diff --git a/warc_download.py b/ait_download.py
@@ -1,30 +1,36 @@
-"""
-Purpose: Downloads archived web content (WARCs) and associated metadata for a group of seeds from Archive-It.org using
-their APIs and prepares them to be converted into AIPs with the general-aip.py script for long-term preservation.
-At UGA, this script is run every three months to download content for all crawls saved that quarter.
+"""Download WARCs and associated metadata from Archive-It for long-term preservation.
 
+At UGA, this script is run every three months to download content for all crawls saved that quarter.
 The download combines all WARCs saved within a quarter for a seed, even if that seed was crawled multiple times.
+
 It also includes six of the metadata reports:
     * Collection
     * Collection Scope (not downloaded if no scope rules for the collection)
     * Crawl Definition (may be more than one)
     * Crawl Job (may be more than one)
     * Seed
-    * Seed Scope (not downloaded if not scope rules for the seed)
-
-Prior to the preservation download, all seed metadata should be entered into Archive-It.
-Use the seed_metadata_report.py script to verify all required fields are present.
+    * Seed Scope (not downloaded if no scope rules for the seed)
+
+Parameters:
+    There are two date parameters, formatted YYYY-MM-DD, which define which WARCs to include in the download.
+    date_start : required. WARCs stored on this day will be included.
+    date_end : required. WARCs stored on this day will NOT be included.
+
+Returns:
+    One folder for each seed, with the WARCs and metadata reports.
+    A metadata.csv file needed for the general-aip script to prepare the folders for preservation.
+    A seeds_log.csv file with information about each workflow step.
+    A completeness_log.csv file with information about the download's completeness.
 """
 
-# Usage: python warc_download.py date_start date_end
+# Usage: python ait_download.py date_start date_end
 
 import os
 import pandas as pd
 import re
 import sys
 
-# Import functions and constant variables from other UGA scripts.
-# Configuration is made by the user and could be forgotten. The others are in the script repo.
+# Configuration is made by the user and could be forgotten.
 try:
     import configuration as c
 except ModuleNotFoundError:
@@ -33,8 +39,8 @@
     sys.exit()
 import web_functions as fun
 
-# The preservation download is limited to warcs created during a particular time frame.
-# UGA downloads every quarter (2/1-4/30, 5/1-7/31, 8/1-10/31, 11/1-1/31)
+# Tests to validate the two date arguments, which specify the time frame for WARCs to include in the download.
+
 # Tests that both dates are provided. If not, ends the script.
 try:
     date_start, date_end = sys.argv[1:]
@@ -63,7 +69,7 @@
 seeds_directory = os.path.join(c.script_output, "preservation_download")
 
 # The script may be run repeatedly if there are interruptions, such as due to API connections.
-# If it has run, it will use the existing seeds_log.csv for seed_df and and skip seeds that were already done.
+# If it has run, it will use the existing seeds_log.csv for seed_df and skip seeds that were already done.
 # Otherwise, it makes seed_df and metadata_csv by getting data from the Archive-It APIs
 # and adds the AIP_ID from metadata_csv as the first column of seed_df.
 if os.path.exists(seeds_directory):
@@ -77,36 +83,36 @@
     seed_df = pd.merge(seed_df, aip_id_df, how="left")
     seed_df.insert(0, "AIP_ID", seed_df.pop('AIP_ID'))
 
-# Starts counter for tracking script progress.
-# Some processes are slow, so this shows the script is still working and how much remains.
+# Starts a counter for tracking script progress.
+# Some processes are slow, so this shows the script is still working and how much work remains.
 current_seed = 0
 total_seeds = len(seed_df[seed_df["Complete"].isnull()])
 
 # Iterates through information about each seed, downloading metadata and WARC files from Archive-It.
-# Filtered for no data in the WARC_Unzip_Errors (last log column) to skip seeds done earlier if this is a restart.
+# Filtered for no data in the Complete column to skip seeds done earlier if this is a restart.
 for seed in seed_df[seed_df["Complete"].isnull()].itertuples():
 
     # Updates the current seed number and displays the script progress.
     current_seed += 1
     print(f"\nStarting seed {current_seed} of {total_seeds}.")
 
-    # Row index for the seed being processed in the dataframe, to use for adding logging information.
+    # Calculates the row index for the seed being processed in the dataframe, to use for adding log information.
     row_index = seed_df.index[seed_df["Seed_ID"] == seed.Seed_ID].tolist()[0]
 
     # If the seed already has a folder from an error in a previous iteration of the script,
-    # deletes the contents and anything in the seeds_log.csv from the previous iteration so it can be remade.
+    # deletes the contents and anything in the seeds_log.csv from the previous iteration, so it can be remade.
     if os.path.exists(str(seed.Seed_ID)):
         fun.reset_seed(seed.Seed_ID, seed_df)
 
-    # Makes a folder for the seed in the AIP directory,
+    # Makes a folder for the seed in the seeds directory,
     # and downloads the metadata and WARC files to that seed folder.
     os.mkdir(str(seed.Seed_ID))
     fun.download_metadata(seed, row_index, seed_df)
     fun.download_warcs(seed, row_index, seed_df)
 
-    # Updates the Complete column with a summary of error types or that the seed processed successfully.
+    # Updates the Complete column with the error type or that the seed processed successfully.
     fun.add_completeness(row_index, seed_df)
 
-# Verifies the all expected seed folders are present and have all the expected metadata files and WARCs.
-# Saves the result as a csv in the folder with the downloaded AIPs.
+# Verifies that all expected seed folders are present and contain all the expected metadata files and WARCs.
+# Saves the result as a csv in the folder with the downloaded content.
 fun.check_seeds(date_end, date_start, seed_df, seeds_directory)
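Note on the restart behavior in ait_download.py: the loop only visits seeds whose Complete column is still empty, so rerunning after an interruption skips finished seeds. The sketch below reproduces that filter on a small in-memory dataframe. The sample rows and the `processed` list are made up for illustration, and the real download functions in web_functions are replaced by a single status update.

```python
import pandas as pd

# Hypothetical stand-in for seeds_log.csv: one seed finished, two not started.
seed_df = pd.DataFrame({
    "Seed_ID": [2027707, 2529671, 2529676],
    "Complete": ["Successfully completed", None, None],
})

# Same filter as the script: rows with nothing in Complete still need work.
remaining = seed_df[seed_df["Complete"].isnull()]
total_seeds = len(remaining)

processed = []
for current_seed, seed in enumerate(remaining.itertuples(), start=1):
    print(f"Starting seed {current_seed} of {total_seeds}.")
    # itertuples() exposes the dataframe row label as seed.Index.
    row_index = seed.Index
    # Placeholder for the metadata and WARC download steps.
    seed_df.loc[row_index, "Complete"] = "Successfully completed"
    processed.append(int(seed.Seed_ID))

print(processed)
```

Because the log is updated as each seed finishes, a second pass over the same dataframe would find no rows left to process.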
diff --git a/configuration_template.py b/configuration_template.py
@@ -12,7 +12,3 @@
 inst_page = 'https://partner.archive-it.org/INSERT-NUMBER'
 username = 'INSERT-USERNAME'
 password = 'INSERT-PASSWORD'
-
-# Path to md5deep64.exe, used for fixity calculations in Windows.
-# Use \ in the path or else it will not run.
-md5deep = r'INSERT-PATH'
diff --git a/documentation/Example_Script_Output/completeness_check.csv b/documentation/Example_Script_Output/completeness_check.csv
@@ -0,0 +1,7 @@
+Seed,AIP,Seed Folder Made,coll.csv,collscope.csv,seed.csv,seedscope.csv,crawldef.csv count,crawljob.csv count,WARC Count Correct,All Expected File Types
+2027776,rbrl-377-web-201907-0001,TRUE,TRUE,TRUE,TRUE,FALSE,1,1,TRUE,TRUE
+2027707,rbrl-498-web-201907-0001,TRUE,TRUE,TRUE,TRUE,TRUE,1,1,TRUE,TRUE
+2529683,magil-ggp-2529683-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
+2529676,magil-ggp-2529676-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
+2529671,magil-ggp-2529671-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,FALSE,TRUE
+2520379,magil-ggp-2520379-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
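Note on completeness_check.csv: a reviewer will usually want the seeds where something expected is missing. The following is a minimal sketch of that review step using only the standard library. It is not part of the repository. The function name `seeds_needing_review` and the choice of which columns to check are assumptions, and the sample rows are copied from the example output above.

```python
import csv
import io

# Three rows copied from the example completeness_check.csv.
SAMPLE = """Seed,AIP,Seed Folder Made,coll.csv,collscope.csv,seed.csv,seedscope.csv,crawldef.csv count,crawljob.csv count,WARC Count Correct,All Expected File Types
2027776,rbrl-377-web-201907-0001,TRUE,TRUE,TRUE,TRUE,FALSE,1,1,TRUE,TRUE
2529671,magil-ggp-2529671-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,FALSE,TRUE
2520379,magil-ggp-2520379-2023-05,TRUE,TRUE,FALSE,TRUE,FALSE,1,1,TRUE,TRUE
"""

# Assumption: these three columns must all be TRUE for a seed to be complete.
# The scope report columns are skipped because those reports may legitimately not exist.
CHECK_COLUMNS = ("Seed Folder Made", "WARC Count Correct", "All Expected File Types")


def seeds_needing_review(csv_file):
    """Return the Seed values for rows where any checked column is not TRUE."""
    reader = csv.DictReader(csv_file)
    return [
        row["Seed"]
        for row in reader
        if any(row[column].strip().upper() != "TRUE" for column in CHECK_COLUMNS)
    ]


flagged = seeds_needing_review(io.StringIO(SAMPLE))
print(flagged)
```

With the sample rows, only seed 2529671 is flagged, which matches the WARC download error recorded for that seed in the example seeds_log.csv.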
diff --git a/documentation/Example_Script_Output/metadata.csv b/documentation/Example_Script_Output/metadata.csv
@@ -0,0 +1,7 @@
+Department,Collection,Folder,AIP_ID,Title,Version
+russell,rbrl-498,2027707,rbrl-498-web-201907-0001,Open Records with Deborah Gonzalez,1
+russell,rbrl-377,2027776,rbrl-377-web-201907-0001,Southeast ADA Center: Your Regional Resource for the Americans with Disabilities Act (ADA),1
+magil,magil-0000,2520379,magil-ggp-2520379-2023-05,Georgia Department of Natural Resources Wildlife Resources Division,1
+magil,magil-0000,2529671,magil-ggp-2529671-2023-05,Georgia Real Estate Commission & Appraisers Board,1
+magil,magil-0000,2529676,magil-ggp-2529676-2023-05,Georgia State Board of Accountancy,1
+magil,magil-0000,2529683,magil-ggp-2529683-2023-05,Georgia State Finance Commission,1
diff --git a/documentation/Example_Script_Output/seeds_log.csv b/documentation/Example_Script_Output/seeds_log.csv
@@ -0,0 +1,7 @@
+AIP_ID,Seed_ID,AIT_Collection,Job_ID,Size_GB,WARCs,WARC_Filenames,Metadata_Report_Errors,Metadata_Report_Empty,Seed_Report_Redaction,WARC_Download_Errors,WARC_Fixity_Errors,WARC_Unzip_Errors,Complete,,,,,,,,,,,,,,,,,,,
+rbrl-498-web-201907-0001,2027707,12265,943048,0.007,1,ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz,Successfully downloaded all metadata reports,No empty reports,Successfully redacted,Successfully downloaded ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz,Successfully verified ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz fixity on 2023-11-28 20:33:38.106304,Successfully unzipped ARCHIVEIT-12265-TEST-JOB943048-SEED2027707-20190709144234143-00000-h3.warc.gz,Successfully completed,,,,,,,,,,,,,,,,,,,
+rbrl-377-web-201907-0001,2027776,12264,943446,0.096,1,ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz,Successfully downloaded all metadata reports,rbrl-377-web-201907-0001_seedscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz,Successfully verified ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz fixity on 2023-11-28 20:34:09.872859,Successfully unzipped ARCHIVEIT-12264-TEST-JOB943446-SEED2027776-20190710131748634-00000-h3.warc.gz,Successfully completed,,,,,,,,,,,,,,,,,,,
+magil-ggp-2520379-2023-05,2520379,15678,1789230,7.434,7,ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz|ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2520379-2023-05_seedscope.csv, magil-ggp-2520379-2023-05_collscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz,Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully 
verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified  ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz fixity on TIMESTAMP, Successfully verified ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz fixity on TIMESTAMP,Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230416150811551-00000-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417184230631-00001-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417185737629-00002-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417193251948-00003-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417201414622-00004-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230417223344837-00005-sl63gmud.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1789230-0-SEED2520379-20230419134031254-00000-dwz98uv7.warc.gz,Successfully completed
+magil-ggp-2529671-2023-05,2529671,15678,1791478,0.028,1,ARCHIVEIT-15678-TEST-JOB1791478-0-SEED2529671-20230420155417222-00000-mntg8u5v.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2529671-2023-05_seedscope.csv, magil-ggp-2529671-2023-05_collscope.csv,Successfully redacted,API error 404: can't downloaded ARCHIVEIT-15678-TEST-JOB1791478-0-SEED2529671-20230420155417222-00000-mntg8u5v.warc.gz,,,WARC_Downloaded_Errors,,,,,,,,,,,,,,,,,,
+magil-ggp-2529683-2023-05,2529683,15678,1791489,0.05,2,ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz|ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2529683-2023-05_seedscope.csv, magil-ggp-2529683-2023-05_collscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz, Successfully downloaded ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz,Successfully verified ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz fixity on 2023-11-28 20:36:06.713040, Successfully verified ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz fixity on 2023-11-28 20:36:23.739950,Successfully unzipped ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420161205384-00000-qix5zv0f.warc.gz, Successfully unzipped ARCHIVEIT-15678-TEST-JOB1791489-0-SEED2529683-20230420230248436-00000-8bk2lsxt.warc.gz,Successfully completed,,,,,,,,,,,,,,,
+magil-ggp-2529676-2023-05,2529676,15678,1791480,0.014,1,ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz,Successfully downloaded all metadata reports,magil-ggp-2529676-2023-05_seedscope.csv, magil-ggp-2529676-2023-05_collscope.csv,Successfully redacted,Successfully downloaded ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz,Successfully verified ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz fixity on 2023-11-28 20:36:44.890513,Successfully unzipped ARCHIVEIT-15678-TEST-JOB1791480-0-SEED2529676-20230420155757131-00000-zrl3k481.warc.gz,Successfully completed,,,,,,,,,,,,,,,,,,