Options for computer-assisted repository archiving for small and scholar-led presses publishing open access monographs
In a recent post, my colleague Miranda Barnes outlined the challenges of archiving and preservation for small and scholar-led open access publishers, and described the process of manually uploading books to the Loughborough University institutional repository. The conclusion of this manual ingest experiment was that while university repositories offer a potential route for open access archiving of publisher output, the manual workflow is prohibitively time- and resource-intensive, particularly for small and scholar-led presses, which are often stretched in both respects.
Fortunately, many institutional repositories provide routes for uploading files and metadata which allow for the process to be automated, as an alternative to the standard web browser user interface. Different repositories offer different routes, but a large proportion of them are based on the same technologies. By experimenting with a handful of repositories, we were therefore able to investigate workflows which should also be applicable to a much broader spread of institutions.
Many websites which allow users to store and view data, such as repositories, offer access to that data via an API (Application Programming Interface). An API defines a standard set of instructions which another computer program can send in order to perform certain actions, such as reading existing data from the database, or adding new data to it. (COPIM’s own Thoth system has two: a GraphQL API, for accessing raw metadata in its database format, and an Export API, for downloading that metadata in the specific formats required by various platforms.) Once we know what instructions an API expects, we can develop code to generate and send them, and run that same code quickly and easily to trigger multiple similar actions.
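As a small illustration, the snippet below queries Thoth’s GraphQL API for a work’s metadata using Python’s requests library. The query fields shown are indicative rather than exhaustive; the full set is documented by the live schema at api.thoth.pub.

```python
import requests

THOTH_API = "https://api.thoth.pub/graphql"

# Ask Thoth for a work's metadata by DOI. GraphQL lets us specify
# exactly which fields we want returned.
QUERY = """
query {
  workByDoi(doi: "https://doi.org/10.11647/OBP.0098") {
    fullTitle
    doi
    publications {
      publicationType
    }
  }
}
"""

response = requests.post(THOTH_API, json={"query": QUERY})
response.raise_for_status()
work = response.json()["data"]["workByDoi"]
print(work["fullTitle"])
```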
Our desired workflow would therefore be to obtain a book’s metadata from the Thoth API, retrieve all required content files from the URLs specified within it, convert the metadata into the structure expected by the repository API, and then send the files and metadata to the API with an instruction to upload them into the database. The work should then be viewable within the repository in exactly the same way as if it had been manually ingested.
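In outline, the whole process might look something like the sketch below. Every helper named here is hypothetical, standing in for the repository-specific API calls discussed in the rest of this post.

```python
# Hypothetical outline of the automated archiving workflow; each helper
# wraps the API interactions described later in this post.

def archive_work(doi, repository):
    metadata = fetch_thoth_metadata(doi)        # query the Thoth API
    files = [download(url) for url in content_urls(metadata)]
    record = repository.from_thoth(metadata)    # convert the metadata
    deposit = repository.create_record(record)  # send it to the API
    for file in files:
        repository.attach_file(deposit, file)   # upload content files
    repository.publish(deposit)                 # make the record public
```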
The manual ingest experiment deliberately selected books which posed particular challenges for archiving, such as Denis Diderot’s 'Rameau's Nephew' – 'Le Neveu de Rameau': A Multi-Media Bilingual Edition (https://doi.org/10.11647/OBP.0098), a complex work with several “additional resources” in both text and audio formats. These resources, external to the main text, are not currently represented within Thoth’s metadata framework, so details about them (such as links to the content) cannot be stored in the Thoth database. Any automated upload at this stage would therefore omit these files. However, as we are continuously developing Thoth to better fit users’ needs, this is something which can be improved – and this experiment allowed us to identify this desired new feature and raise an issue to track its development.
Another limitation is that while URLs linking to all published digital versions of a book can be stored in Thoth, some of these content files may not be freely retrievable. As discussed in the manual ingest report, both of the books selected are available in PDF, XML, EPUB and MOBI formats, as is standard for Open Book Publishers. However, only the PDF and XML are free to access; the EPUB and MOBI must be purchased, and therefore their URLs lead to a paywall. Files in these formats would therefore need to be uploaded manually by a representative of the publisher with access to the original files (or, alternatively, the publisher would have to reconfigure their paywall to allow access by the automated ingest program).
As the manual ingest experiment was conducted via the Loughborough University Figshare repository, this was also where we began when investigating automated ingest. Although Figshare is proprietary software and its code is unfortunately not open-source, it has a well-documented open API, which can be fully accessed by submitting the credentials of an existing Figshare user account. Instructions can be sent to the API over the internet via HTTP requests, with additional data included in JSON format. The API will return information (such as data from the database, or confirmation of successful uploads) in a similar format. We found that most of the actions required for manual ingest could be performed using the API. These included creating “items”, creating “projects” and “collections”, adding items to projects/collections, setting the appropriate metadata on an item, adding files to an item, and making draft items public. Corrections can also be made to records which have already been created, as the API allows updates to metadata, deletion of files, changing of links between items and collections, and so on.
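As an example, creating a draft item comes down to a single authenticated HTTP request. The sketch below uses Python’s requests library; the token and metadata values are placeholders.

```python
import requests

BASE = "https://api.figshare.com/v2"
# A personal access token, generated in the Figshare account settings.
HEADERS = {"Authorization": "token MY_FIGSHARE_TOKEN"}

# Create a draft item (an "article", in Figshare's terminology).
# Further metadata can be supplied here or added later via PUT.
item = {
    "title": "Denis Diderot's 'Rameau's Nephew': A Multi-Media Bilingual Edition",
    "description": "Deposited automatically from Thoth metadata.",
}

response = requests.post(f"{BASE}/account/articles", headers=HEADERS, json=item)
response.raise_for_status()
print("Draft item created at:", response.json()["location"])
```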
The next task was therefore to write some code which would send the necessary instructions to the API. Figshare provide some example code for interacting with the API in various programming languages, and this can be found at the top of each subsection in the documentation (e.g. the “Curl”/“Java”/“C#”/etc. tabs in the Public articles subsection; the full set of example code can be downloaded as a ZIP file from the “Other” tab). However, this code is automatically generated from the API specification, and needs to be augmented by the programmer in order to be usable. There is also no example code for Rust, the programming language in which Thoth is written. The experimentation therefore involved writing some Rust code from scratch which would check whether a Thoth book record had an equivalent record on Figshare, and then either create a new Figshare record for it or update the existing one, submitting up-to-date metadata from the Thoth database and uploading a sample data file.
Using this code, we were able to successfully interact with the Figshare API, creating and updating basic records which could be viewed in the Figshare user interface. The two main challenges during development were the complexity of the API, and the specifics of the Figshare metadata format. The API does not make use of any standard frameworks, and instructions which are similar in nature (e.g. adding new metadata vs updating existing metadata) often have slight differences or return inconsistent responses, all of which have to be individually taken into account when writing the code. Many actions also require a multi-step process rather than a single instruction; for example, large files have to be uploaded by sending multiple file parts separately, and checking the API’s response each time before continuing.
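To give a flavour of this, the sketch below condenses the multi-part file upload process described in Figshare’s documentation into its main steps, with most error handling omitted and the token again a placeholder.

```python
import hashlib
import os
import requests

BASE = "https://api.figshare.com/v2"
HEADERS = {"Authorization": "token MY_FIGSHARE_TOKEN"}

def upload_file(article_id, path):
    """Upload one file to a draft item using Figshare's multi-part flow."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()  # simplified: reads whole file

    # Step 1: register the file against the item.
    reg = requests.post(f"{BASE}/account/articles/{article_id}/files",
                        headers=HEADERS,
                        json={"name": os.path.basename(path),
                              "size": size, "md5": md5})
    reg.raise_for_status()
    file_info = requests.get(reg.json()["location"], headers=HEADERS).json()

    # Step 2: ask the upload service how to split the file into parts.
    parts = requests.get(file_info["upload_url"]).json()["parts"]

    # Step 3: send each part, checking the response before continuing.
    with open(path, "rb") as f:
        for part in parts:
            f.seek(part["startOffset"])
            data = f.read(part["endOffset"] - part["startOffset"] + 1)
            requests.put(f"{file_info['upload_url']}/{part['partNo']}",
                         data=data).raise_for_status()

    # Step 4: tell the main API that the upload is complete.
    requests.post(f"{BASE}/account/articles/{article_id}/files/{file_info['id']}",
                  headers=HEADERS).raise_for_status()
```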
Meanwhile, there was the question of how best to represent the broad set of metadata available from Thoth within the Figshare format. This is also relevant to manual ingest, but becomes more apparent when working directly with the two databases. Some correlations are straightforward, e.g. the “title” of a book within Thoth equates to the “title” of a book within Figshare. However, whereas Thoth can store many different types of contributor to a work, such as “Author”, “Editor”, “Translator”, and “Music Editor” (all of which apply to Rameau’s Nephew), Figshare only accepts “Authors” – and cannot store many of the details available for them in Thoth, such as biographies, or institutional affiliations. Figshare does allow users to create “custom fields” for storing additional metadata, but it is not clear whether these are easily searchable, so while they would add to the completeness of an archive record, they might not enhance its discoverability.
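A conversion function along the following lines makes the mismatch concrete. The Thoth field names are taken from its GraphQL schema and the Figshare fields from its item format, both slightly simplified here.

```python
# Illustrative mapping from a Thoth work record to a Figshare item.

def thoth_to_figshare(work):
    return {
        "title": work["fullTitle"],
        "description": work.get("longAbstract", ""),
        # Figshare only accepts "authors", so every Thoth contributor
        # (Author, Editor, Translator, Music Editor, ...) is flattened
        # into that single role; biographies and affiliations are lost.
        "authors": [{"name": c["fullName"]} for c in work["contributions"]],
        # Remaining metadata could be kept in custom fields, which add
        # completeness but may not aid discoverability.
        "custom_fields": {"DOI": work["doi"]},
    }
```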
The next repository system we investigated was EPrints, having been given access to a test account by the Library at Bath Spa University. EPrints is another popular option for institutional repositories, but unlike Figshare, it is free and open-source. While this is a closer fit with the ethos of the COPIM project, it also increases the likelihood that institutions will be slow to upgrade the software when new versions come out, since each institution hosts and maintains its own installation rather than relying on a central provider to roll out updates. The EPrints software is also highly customisable, so one institution’s API might be very different from another’s. Nevertheless, if we focus on developing a workflow for the standard API, and bear in mind that we may need to support legacy versions, we should be able to cover the majority of repositories that use EPrints.
One major advantage of EPrints is that it uses the SWORD protocol (Simple Web-service Offering Repository Deposit). As the name suggests, this is a standardised format for interacting with APIs which was designed with institutional repositories in mind. Since it is a technical standard, there is a lot of support for institutions and developers who want to use it. In particular, the SWORD team offer a wide range of open-source software libraries: bundles of pre-written code which can be integrated into new programs, making development much simpler. The set of programming languages covered by these libraries does not include Rust, but does include Python. Python libraries are relatively easy to connect to Rust code; Thoth itself also has a Python library for accessing its API, so it would not take much work to write a small Python program to transfer data from Thoth to EPrints.
The current version of EPrints, version 3.4, uses SWORD version 2.0 by default. In version 2.0, instructions are sent via HTTP requests as for Figshare, but additional details are formatted as XML (in a variation on the standard Atom Publishing Protocol format). The latest version of SWORD, version 3.0, replaces the XML formatting with a JSON-based format. Future versions of EPrints, as well as other repository systems, might move from using SWORD version 2.0 to version 3.0, so we ideally need to support them both. However, using the SWORD libraries would make this very simple: they do the work of creating instructions for the API in the correct format, and dealing with the responses it sends back, so we would not need to worry about the switch from XML to JSON.
We carried out some basic experiments with the SWORD version 2.0 Python library, and successfully used it to connect to the Bath Spa EPrints API and upload some metadata and a small sample file. These became visible within the EPrints user interface in the same way as records which had been created manually. Attempts to upload the full PDF file of Rameau’s Nephew unfortunately failed, seemingly due to its size (around 50MB, whereas the sample file was smaller than 1MB); this would need to be resolved before we could finalise a workflow for automated deposit of standard-sized book files. Although we did not have access to any repository using SWORD version 3.0, we briefly tested its Python library and were easily able to create (but not send) API instructions containing appropriately-formatted metadata.
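For illustration, a minimal deposit using the sword2 library might look like the following, with the service document URL and credentials as placeholders. The library takes care of constructing the Atom-formatted instructions.

```python
from sword2 import Connection, Entry

# Connect using the repository's service document URL and the
# credentials of a depositing user (placeholders shown here).
conn = Connection("https://repository.example.ac.uk/sword-app/servicedocument",
                  user_name="depositor", user_pass="password")
conn.get_service_document()

# Build a metadata entry; the library handles the XML for us.
entry = Entry(title="Denis Diderot's 'Rameau's Nephew'",
              dcterms_abstract="A multi-media bilingual edition.")

# Deposit into the first collection of the first workspace listed in
# the service document, leaving the record as an editable draft.
collection = conn.sd.workspaces[0][1][0]
receipt = conn.create(col_iri=collection.href,
                      metadata_entry=entry,
                      in_progress=True)
print("Deposited at:", receipt.location)
```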
DSpace is another commonly used repository system, and, again, it is free and open-source. We were able to carry out some tests on the Cambridge University Library “Apollo” repository, which currently uses DSpace version 6, although they are working towards moving to version 7. Both of these DSpace versions offer API access via a number of different methods, one of which is SWORD version 2.0. Having already successfully tested the SWORD version 2.0 Python library on the Bath Spa EPrints repository, we quickly confirmed that the library also worked for connecting to Apollo and uploading basic metadata. Cambridge University Library have agreed to contact us when they have a test site set up for DSpace version 7, so that we can check whether this will require any changes.
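Because SWORD abstracts over the underlying repository software, the deposit sketch above should carry over to DSpace largely unchanged; in principle only the connection details differ (the URL below is again a placeholder, as each institution configures its own endpoint).

```python
# Reusing the sword2 code above against a DSpace repository: only the
# service document URL and credentials change.
conn = Connection("https://repository.example.ac.uk/swordv2/servicedocument",
                  user_name="depositor", user_pass="password")
```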
Finally, we investigated the possibility of using the Internet Archive as an automated archiving solution. The Internet Archive is not an institutional repository system, but a publicly accessible online archive of all kinds of digital and digitised material, registered as a non-profit in the USA. It is well-known, well-funded and widely used, and allows anyone to create an account and add materials. This could make it a useful alternative for small and scholar-led presses who do not have close institutional connections.
Similar to Figshare, the Internet Archive has developed its own bespoke API, using HTTP requests and JSON content in a custom format rather than adopting standards such as SWORD. However, they also provide their own Python library allowing users to easily interact with it. After signing up for a test account, it was simple to use this library to upload sample files and metadata, all of which can be freely accessed and redownloaded by anyone on the open internet.
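Using the internetarchive library, an upload can be as short as the sketch below; the item identifier, filename and credentials are placeholders.

```python
from internetarchive import upload

# Create a new Internet Archive item and upload one file plus metadata.
# The identifier must be unique across the whole archive; the keys are
# the account's S3-like API credentials.
responses = upload("obp-0098-rameaus-nephew",
                   files=["rameaus-nephew.pdf"],
                   metadata={"title": "Denis Diderot's 'Rameau's Nephew'",
                             "mediatype": "texts",
                             "creator": "Denis Diderot"},
                   access_key="MY_ACCESS_KEY",
                   secret_key="MY_SECRET_KEY")
print(responses[0].status_code)  # 200 indicates success
```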
Thanks to the provision of API access and software libraries, it is relatively straightforward to set up automated deposit to a number of institutional repositories. While more complex works may need some level of manual intervention to ensure they are correctly represented within an archive, the bulk of small and scholar-led presses’ output is low-hanging fruit in this respect. Thoth data is well-structured and simple to systematically convert into the formats required by different APIs, so once set up, an automated workflow can be reliably repeated for large numbers of works. The main question, as for manual ingest, is how best to enter the rich metadata available into a system which may not be well-designed for storing it. Nevertheless, when many small publishers have few resources to devote to archiving and preservation, an imperfect but frictionless workflow is better than no workflow at all.
Furthermore, when viewed through the lens of automated deposit, archiving is not actually so different from dissemination of a work on publication. Both processes require, via some means, the submission of metadata and/or content files to an online platform; COPIM’s dissemination and distribution team is already investigating ways to ease the burden on small and scholar-led presses by automating aspects of the dissemination process. Where distribution platforms offer API access, the workflow might be almost identical to the archiving workflow. These experiments will therefore form the basis of further work where archiving is centred, alongside distribution, within the publishing process – not left as an afterthought.