Building on experiments with the Internet Archive platform to trial computer-assisted archiving of publishers' full back catalogues
In a previous post, I summarised initial investigations by COPIM’s archiving and preservation team into the possibilities for automated archiving. This followed on from earlier experiments with manual workflows, which highlighted the prohibitive time investment that would be necessary for small and scholar-led publishers to manage archiving in this way. Due to the rich and well-structured nature of metadata within Thoth, and the options available for integrating the Thoth software with archiving platforms, we concluded that a basic level of automated ingest would be both worthwhile and eminently achievable. Three months later, we obtained proof of concept with a bulk upload of over 600 Thoth works to the Internet Archive.
This blog post will explore the steps taken to accomplish this, providing pointers for anyone looking into implementing a similar system themselves, as well as giving some background for publishers interested in joining the Thoth programme to take advantage of this feature. All code used in the process is available on GitHub under an open-source licence, as is standard for the COPIM project. The post will also outline our plans for building on this initial work as we start to develop the Thoth Archiving Network.
During initial investigations, we had successfully uploaded temporary test files to the Internet Archive (IA) using the same method which would form the basis of our proof-of-concept workflow. As briefly discussed in my earlier post, both Thoth and the Internet Archive offer APIs (Application Programming Interfaces) as a simple, standardised way for software programs to interact with their databases. They also both offer open-source software libraries in the Python programming language, packages of “canned code” for performing common tasks which can be utilised when developing new programs instead of writing everything from scratch. This meant we could quickly write a piece of Python software which would do the following (sketched in code after the list):
Given the Thoth ID of a work, obtain its full metadata in an easily-digestible format (using the Thoth Python library)
From the metadata, extract the URL where the PDF of the work’s content can be publicly accessed online
Use this URL to download a copy of the PDF content file
Rearrange the work metadata into the format used by the Internet Archive
Log in to an appropriate IA user account, and send the PDF and the formatted metadata to the Archive to create a new archive copy of the work there (using the Internet Archive Python library).
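A minimal sketch of what this single-work program might look like is given below. It assumes the thothlibrary and internetarchive Python packages; the exact client methods, Thoth field names and the IA identifier scheme shown are illustrative assumptions rather than the project’s actual code.

```python
# Sketch only: field names follow my reading of the Thoth GraphQL schema and
# may differ; the IA identifier naming scheme is hypothetical.
import requests
from thothlibrary import ThothClient
from internetarchive import upload

def archive_work(thoth_work_id: str, ia_access_key: str, ia_secret_key: str):
    # 1. Obtain the work's full metadata from Thoth (via the Thoth Python library)
    thoth = ThothClient()
    work = thoth.work_by_id(thoth_work_id)

    # 2. Extract the public URL of the canonical PDF from the metadata
    pdf_url = next(
        location.fullTextUrl
        for publication in work.publications
        if publication.publicationType == "PDF"
        for location in publication.locations
        if location.canonical
    )

    # 3. Download a copy of the PDF content file
    pdf_filename = f"{thoth_work_id}.pdf"
    with open(pdf_filename, "wb") as pdf_file:
        pdf_file.write(requests.get(pdf_url, timeout=60).content)

    # 4. Rearrange key Thoth metadata elements into IA's metadata format
    ia_metadata = {
        "title": work.fullTitle,
        "creator": [c.fullName for c in work.contributions],
        "date": work.publicationDate,
        "isbn": [p.isbn for p in work.publications if p.isbn],
        "mediatype": "texts",
    }

    # 5. Send the PDF and formatted metadata to the Internet Archive
    upload(
        identifier=f"thoth-{thoth_work_id}",   # hypothetical naming scheme
        files=[pdf_filename],
        metadata=ia_metadata,
        access_key=ia_access_key,
        secret_key=ia_secret_key,
    )
```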
For the proof-of-concept workflow, we decided to perform a one-time upload of real-world files and metadata from publishers’ full back catalogues. Open Book Publishers (OBP) and punctum, as key COPIM partners, elected to participate in the upload, and the development team consulted with them throughout in determining their approach. The first decision was the choice of platform. As mentioned in the earlier post, the original investigations focused mainly on institutional repositories as archiving platforms; similarly, the Thoth Archiving Network aims to bring together a group of institutions who are willing to host open-access works from smaller publishers in their repositories. The Internet Archive is not an institutional repository, but the workflows are similar, and the publishers involved agreed that it would be a good place to have accessible, discoverable archive copies of their works hosted as a first step while waiting for agreements with individual institutions to be finalised.
Next, we discussed exactly which files should be included in the upload, and which IA metadata fields should be filled out with which Thoth metadata elements. We decided to prioritise simplicity in our approach, only uploading the PDF version of a work (even though OBP standardly produces editions in additional digital formats, some of which are closed-access and therefore require a degree of consideration when archiving), and bypassing concerns about metadata loss by simply uploading a full Thoth metadata file alongside the content file. While we also did our best to ensure that major IA metadata fields were appropriately filled out to improve discoverability, we were mindful that third-party interfaces such as IA’s could well change over time, and we should therefore not put disproportionate effort into conforming to them at the expense of making progress elsewhere. This resulted in the creation of a new Thoth Export API output option, using JSON format to be easily readable by both humans and computers, and added another step to the workflow described above:
Using the work’s Thoth ID, download its JSON metadata file from the Thoth Export API, to be included in the eventual upload to IA (this is easy to achieve using basic Python, as the Thoth Export API is uncomplicated).
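As an illustration, this extra step needs nothing more than a plain HTTP request; the endpoint path below (a “json::thoth” specification on the Export API) is my assumption of the URL format and may differ from the live API.

```python
# Sketch only: the Export API path is an assumption.
import requests

THOTH_EXPORT_API = "https://export.thoth.pub"

def download_thoth_json(thoth_work_id: str) -> str:
    """Download the work's JSON metadata file and return the local filename."""
    url = f"{THOTH_EXPORT_API}/specifications/json::thoth/work/{thoth_work_id}"
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    json_filename = f"{thoth_work_id}.json"
    with open(json_filename, "w", encoding="utf-8") as json_file:
        json_file.write(response.text)
    return json_filename
```

The returned filename would then simply be added to the list of files passed to the Internet Archive upload call sketched earlier.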
One notable feature of the Internet Archive is that, as a very large platform geared towards ease of access to content, it has many automatic processes in place for enhancing uploaded files. When a simple PDF is submitted, by default the Archive derives multiple additional files from it, such as a thumbnail image to represent it across the site, a version which can be read in the web browser using the Archive’s own BookReader, and basic text versions enabling screen-reading for visually impaired readers as well as full-text searching (created using OCR). The publishers agreed that these derived formats were beneficial for making works more discoverable and accessible to users, although there were some unexpected effects.
Firstly, one of the derived formats is an EPUB version, a potential concern for publishers such as OBP who produce and sell their own EPUB versions of published works as alternatives to the free PDF versions. However, on inspection, the IA-created EPUB is a very utilitarian document based on the OCR text (with all its inevitable mis-scannings), and acknowledges throughout that it has been automatically generated and may contain errors. It is clearly aimed mainly at users who prefer EPUB readers over PDF viewers for reasons that outweigh the reduction in quality (such as smaller file size), and those who want a well-formatted publication thoughtfully tailored to the EPUB standard will still opt for the official publisher’s version.
A more intriguing issue was that when the Archive recognised an uploaded PDF, with its publisher-provided metadata, as representing a published book with an entry in a catalogue such as WorldCat, it would attempt to enhance it by pulling in metadata from said catalogue. While this could sometimes be useful, correctly identifying and adding details such as OCLC numbers which had been omitted from the Thoth record, it also sometimes overwrote accurate, detailed metadata with poorer-quality information – replacing a full publication date with just a year, appending an “[Author]” tag to an author’s name, or dropping some keywords. This could be avoided by turning off the “derive” option altogether when submitting the work, but this meant we lost the other benefits of derived files as discussed above.
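For reference, the internetarchive library exposes this “derive” switch as the queue_derive flag on its upload function; a minimal illustration with placeholder values:

```python
from internetarchive import upload

upload(
    identifier="thoth-example-work",          # hypothetical identifier
    files=["example.pdf", "example.json"],    # placeholder files
    metadata={"title": "Example", "mediatype": "texts"},
    queue_derive=False,                       # skip IA's automatic file derivation
)
```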
When we contacted the Archive to ask if there was a way to continue creating derived files but prevent source metadata being overwritten, they were responsive and helpful. They acknowledged that this was an issue, explained that it could be resolved by enabling an advanced feature, and suggested that we set up a collection where this feature could be enabled by default for all submissions. We agreed and they created the Thoth Archiving Network collection for us, which worked exactly as planned, while also providing a convenient presence for Thoth on the Archive.
Once we had finalised the appropriate process for submitting a single work to IA, including the fine details of source files, metadata mapping and post-upload processing, it was time to extend this process to handle large numbers of works. At a basic level, this would simply require taking the original Python program and running it multiple times, each with a different Thoth work ID; the logic would be identical on each run, so the submissions would be uniform. We just needed to obtain the appropriate set of work IDs to input to the program. Fortunately, the Thoth API is very flexible, so it was easy to write a supplementary program which would ask it for a list of all work IDs (see the sketch after the list):
by the opted-in publishers
marked as Active (i.e. complete and published)
excluding book chapters (i.e. only standalone parent works)
sorted from least to most recently published (for convenience, to give the collection some coherence and help us to track the progress of the bulk upload)
separated into two sections, one for each publisher (as above).
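A sketch of this supplementary step, querying the Thoth GraphQL API directly, is shown below; the argument and field names follow my reading of the public schema and the publisher UUIDs are placeholders, so treat the details as assumptions.

```python
# Sketch only: query arguments and enum values are assumptions based on the
# public Thoth GraphQL schema; publisher UUIDs are placeholders.
import requests

THOTH_GRAPHQL_API = "https://api.thoth.pub/graphql"

def list_work_ids(publisher_id: str) -> list[str]:
    """Return the IDs of all Active, non-chapter works for one publisher,
    ordered from least to most recently published."""
    query = f"""
    {{
      works(
        publishers: ["{publisher_id}"],
        workStatuses: [ACTIVE],
        workTypes: [MONOGRAPH, EDITED_BOOK, TEXTBOOK],
        order: {{field: PUBLICATION_DATE, direction: ASC}},
        limit: 9999
      ) {{
        workId
      }}
    }}
    """
    response = requests.post(THOTH_GRAPHQL_API, json={"query": query}, timeout=60)
    response.raise_for_status()
    return [work["workId"] for work in response.json()["data"]["works"]]

# One call per publisher keeps the two back catalogues in separate sections
obp_ids = list_work_ids("<OBP-publisher-UUID>")          # placeholder UUID
punctum_ids = list_work_ids("<punctum-publisher-UUID>")  # placeholder UUID
```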
However, another consideration was that as a task gets bigger, it becomes increasingly important to make the program robust. We would be gathering and submitting a large amount of data for a large set of works; on each submission, there were many points at which the attempt might fail. For example, we might have incorrectly set the login credentials for the IA user account; we might start trying to upload a work then discover that necessary information (such as the URL of its PDF) was missing from the Thoth record; we might simply have bad luck and attempt to submit a work at a time when the Archive was already trying to process vast numbers of other submissions, leading it to ask us to try again later. It was therefore important to identify all of these possible points of failure and tell the program how to deal with them (e.g. if asked to try again later, it would understand the request and do just that, rather than giving up and producing an error message).
The final step was to ensure that the program would clearly communicate the results of every upload attempt in a way that could be easily read and referenced by the human running it – because every “automated” process requires at least a small amount of manual handling. In our case, if any upload failed, we wanted to know that this had happened and what had caused it, so that we could investigate the problem and potentially try again. This required writing clear and detailed error messages for the program to write out to a log file at each point where a failure might occur.
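Put together, the failure handling and logging might look something like the sketch below, where archive_work is the hypothetical single-work helper from earlier and the retry and back-off values are purely illustrative.

```python
# Sketch only: retry counts, back-off times and log format are illustrative.
import logging
import time

logging.basicConfig(
    filename="bulk_upload.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def bulk_upload(work_ids, ia_access_key, ia_secret_key, max_retries=3):
    for work_id in work_ids:
        for attempt in range(1, max_retries + 1):
            try:
                archive_work(work_id, ia_access_key, ia_secret_key)
                logging.info("Uploaded %s (attempt %d)", work_id, attempt)
                break
            except StopIteration:
                # No canonical PDF URL in the Thoth record: nothing to retry
                logging.error("No PDF URL in Thoth record for %s; skipping", work_id)
                break
            except Exception as error:
                # Covers transient problems, e.g. the Archive being overloaded
                logging.warning("Attempt %d failed for %s: %s", attempt, work_id, error)
                if attempt == max_retries:
                    logging.error("Giving up on %s after %d attempts", work_id, max_retries)
                else:
                    time.sleep(60 * attempt)  # simple back-off before retrying
```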
When we actually ran the finished program and attempted the bulk upload, only seven works encountered failures, out of a total of 640 works identified as eligible for archiving from the two publishers’ back catalogues, dating from as far back as 2008. Of these, one was a temporary error due to the Archive being overloaded, which was automatically retried and then succeeded; the rest failed due to lacking PDF URLs, or having PDF URLs listed which did not actually link to a PDF. On discussion with the publishers, two of these were found to be legacy print-only publications, therefore exempt from our digital archiving attempt, and the rest just needed quick corrections to their Thoth records before they could be resubmitted, this time successfully. The full upload process took less than eight hours.
At a glance, and based on some spot checks, the bulk of the works’ files and metadata appear to have been uploaded correctly, and they are well-presented in terms of derived images and searchable/filterable details. No checks were performed on the completeness or accuracy of the metadata prior to upload, as this is a workflow in which the publisher assumes full responsibility for the correctness of the Thoth record. As discussed in the previous post, any “additional resources” which are considered part of the work as a whole but not included in the PDF (such as accompanying videos hosted on YouTube) will not have been archived by the process, as it is “one size fits all” rather than considering the curation needs of individual works.
While these are limitations from an archiving perspective, they help to make the workflow almost entirely automatable, a boon for the resource-strapped smaller publisher, providing an acceptable “first line of defence” for those with few or no other archiving or preservation solutions in place. There is also the option for publishers to examine specific works in the Archive and make manual enhancements to them at any point following upload, so even for a work which is known to be particularly complex, the basic automated upload is a helpful first step.
At the time of writing, two months after the upload, the 638 works in the collection had also amassed over 12,000 “views” between them (half by automated web crawlers, and half by real people), proving that the Internet Archive is a valuable platform not only for archiving, but also for dissemination.
Following this successful manually-triggered one-time upload, work is now underway to set up recurring automated uploads of newly-published Thoth works. The ideal would be to automatically submit a work to IA as soon as the publisher marks it as Active within the Thoth system; a similar process could be particularly useful for submissions to distribution platforms, which are more time-sensitive. As an interim step, we are currently targeting recurring periodic “catch-up” uploads, where a modified version of the one-time process would find all works published since the last upload (e.g. in the past week or month), and submit them in a much smaller “bulk” upload.
The main additional component in this workflow would be a GitHub Action. All of the code used in the one-time upload is already publicly hosted on the GitHub platform, and GitHub Actions provide a way to run programs via the GitHub system (rather than on a personal computer), either manually or on a set schedule. The results of Actions are also clearly displayed on the GitHub dashboard and given as email notifications. This would allow the Thoth project manager to quickly identify any failures and take appropriate action, just as was done during the bulk upload.
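A scheduled workflow of this kind might look roughly like the following; the file names, secret names and cron schedule are all illustrative assumptions rather than the project’s actual configuration.

```yaml
# Sketch only: paths, secrets and schedule are placeholders.
name: Catch-up upload to the Internet Archive

on:
  workflow_dispatch:        # allow manual triggering from the GitHub dashboard
  schedule:
    - cron: "0 6 * * 1"     # e.g. every Monday at 06:00 UTC

jobs:
  upload:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python catch_up_upload.py   # hypothetical entry point
        env:
          IA_ACCESS_KEY: ${{ secrets.IA_ACCESS_KEY }}
          IA_SECRET_KEY: ${{ secrets.IA_SECRET_KEY }}
```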
Most importantly, now that the infrastructure is in place for automated upload to a single platform, it will be very easy to extend the code to upload to additional platforms which use similar processes. Our next venture will be fleshing out the test code previously used to connect to the DSpace-based Cambridge University Library “Apollo” repository via the SWORD protocol, and we hope to be in a position to add the library to the nascent Thoth Archiving Network soon. It will be even easier to perform similar bulk IA uploads for other Thoth publishers, and to add them to the recurring upload process; all we need is their Thoth publisher ID, and the code can do the rest!
Header image by Chris Brignola on Unsplash