Turf (TIND.io URL Fixer) is a program to download records from the Caltech TIND.io database and check the URLs that may be embedded within the records.
Authors: Michael Hucka
Repository: https://github.com/caltechlibrary/urlup
License: BSD 3-clause license – see the LICENSE file for more information
There are several hundred thousand records in https://caltech.tind.io. Some of the records contain links to other web resources. As a matter of regular maintenance, the links need to be checked periodically for validity, and preferably also updated to point to new destinations if the referenced resources have been relocated.
Turf is a small program that downloads records from https://caltech.tind.io, examines each one looking for URLs, deferences any found, and then finally prints a list of records together with old and new URLs. By default, if not given an explicit search string, Turf will do a search for all records that have one or more URLs in MARC field 856. Alternatively, it can be given a search query on the command line; in that case, the string should be a complete search URL as would be typed into a web browser address bar (or more practically, copied from the browser address bar after performing some exploratory searches in https://caltech.tind.io. Finally, as another alternative, it can read MARC XML input from a file when given the -f
option (/f
on Windows).
The following is probably the simplest and most direct way to install this software on your computer:
sudo pip3 install git+https://github.com/caltechlibrary/turf.git
Alternatively, you can clone this GitHub repository and then run setup.py
:
git clone https://github.com/caltechlibrary/turf.git
cd turf
sudo python3 -m pip install .
Both of these installation approaches should automatically install some Python dependencies that Turf relies upon, namely openpyxl, plac, termcolor and uritools.
Turf is a command-line application. On all systems, the installation should place a new program on your shell's search path called turf
(or turf.exe
on Windows), so that you can start Turf with a simple terminal command:
turf
If that fails because the shell cannot find the command, you should be able to run it using the alternative approach:
python3 -m turf
Turf accepts various command-line arguments. To get information about the available options, use the -h
argument (or /h
on Windows):
turf -h
When run without any arguments, Turf will execute a search in https://caltech.tind.io that looks for records containing URLs in MARC field 856. It will dereference each URL it finds and print to the terminal each record's identifier, the original URL(s), the final URL(s), and any errors encountered. Turf can also accept an explicit search query in the form of a complete search URL as would be typed into a web browser address bar (or more practically, copied from the browser address bar after performing some exploratory searches in caltech.tind.io). The search string should be quoted to prevent the terminal shell from interpreting the punctuation characters in the search string. Here is an example:
turf 'https://caltech.tind.io/search?ln=en&p=856%3A%25&f=&sf=&so=d'
Turf won't write the results to a file unless told to do so using the -o
option (/o
on Windows). It can write the results either in .csv
or .xlsx
format, and it inspects the file name to figure out which format to write. For example, the following will make it produce an Excel file as output:
turf -o results.xlsx
By default, Turf prints a message for every record it processes, so that the user can get a sense of what is happening. When told to save results to a file, however, it does not write every record by default. Instead, by default, it saves only the records that contain URLs and for which the URLs are found to dereference to a different final destination. This behavior can be controlled via two flags, -n
and -a
. If given -n
(/n
on Windows), Turf will write out records with URLs even if the URLs dereference to the same location. If given -a
(/a
on Windows), Turf will write all records even if they don't have any URLs.
The difference between -a
and -n
(/a
and /n
on Windows) is not evident from the default search performed by Turf because it only searches for records with URLs; however, the difference is easier to see when Turf is given a more general search such query such as the following
https://caltech.tind.io/search?action_search=Search&rm=wrd&so=d
which will retrieve all records. The following screencast tries to demonstrate this.
If the URLs to be dereference involve a proxy server (such as EZproxy, a common type of proxy used by academic institutions), it will be necessary to supply login credentials for the proxy component. By default, Turf uses the operating system's keyring/keychain functionality to get a user name and password. If the information does not exist from a previous run of Turf, it will query the user interactively for the user name and password, and (unless the -X
or /X
argument is given) store them in the user's keyring/keychain so that it does not have to ask again in the future. It is also possible to supply the information directly on the command line using the -u
and -p
options (or /u
and /p
on Windows), but this is discouraged because it is insecure on multiuser computer systems.
Finally, the following table summarizes all the command line options available. (Note: on Windows computers, /
must be used instead of -
):
Short | Long form option | Meaning | Default |
---|---|---|---|
-a |
--all |
Save all records, not only those with URLs in MARC field 856 (implies -n ) |
Only write records containing URLs |
-f F |
--file F |
Read MARC XML content from file F | Search caltech.tind.io |
-o R |
--output R |
Save results to file R | Only print results to the terminal |
-s N |
--start-at N |
Start with the Nth record | Start at the first record |
-t M |
--total M |
Stop after processing M records | Process all results found |
-n |
--unchanged |
Include records whose URLs don't change after dereferencing them | Only save records whose URLs change |
-u U |
--user U |
User name for proxy login | Prompt for name |
-p P |
--pswd U |
Password for proxy login | Prompt for password |
-R |
--reset |
Reset proxy name & password | Reuse stored credentials |
-X |
--no-keyring |
Do not read/write the system keyring/keychain | Store proxy credentials |
-q |
--quiet |
Don't print messages while working | Be chatty while working |
-C |
--no-color |
Don't color-code the terminal output | Use colors in the output |
-V |
--version |
Only print program version info and exit | Do other actions instead |
If you find an issue, please submit it in the GitHub issue tracker for this repository. Alternatively, you can send email to the Digital Library Development team at Caltech.
If you like this software, don't forget to give this repo a star on GitHub to show your support!
The vector artwork used as a logo for Turf was created by Milinda Courey and obtained from The Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license.
Turf makes use of numerous open-source packages, without which it would have been effectively impossible to develop Turf with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:
- colorama – makes ANSI escape character sequences work under MS Windows terminals
- ipdb – the IPython debugger
- openpyxl – a library to
read/write Excel
.xlsx
and.xlsm
files - plac – a command line argument parser
- requests – an HTTP library for Python
- setuptools – library for
setup.py
- termcolor – ANSI color formatting for output in terminal
- uritools – RFC 3986 compliant, Unicode-aware, scheme-agnostic replacement for urlparse
- urlup – finds the ultimate destination for URLs after following redirections
Copyright (C) 2018, Caltech. This software is freely distributed under a BSD 3-clause license. Please see the LICENSE file for more information.