Exploring the capabilities of DSpace repository software
Last year, Cambridge University Library (CUL) officially became part of the Thoth Open Archiving Network (TOAN), going live with a new repository instance tailored for hosting works from Thoth publishers. The repository, like the main University of Cambridge institutional repository Apollo, is based on the widely-used open-source software DSpace. TOAN was fortunate to have extensive support from seasoned DSpace users at CUL throughout the integration process. This blog post goes behind the scenes to explore how this node of the network got up and running.
The integration with CUL, our third proof of concept following connections with Figshare and Internet Archive, was an opportunity to consolidate and fine-tune the overall TOAN workflow. Our blog post on linking to Internet Archive discussed many of the exploratory details which have now become standard across all of our network nodes. A standardised workflow is easy to conceptualise, manage, and extend to additional platforms.
The findings of all of our integrations so far have fed into the development of the Thoth Dissemination Service. This open-source software system is now fully featured, handling not only archiving but also distribution. A generic process based on GitHub Actions and the Thoth API’s Python client runs automatically on a specified schedule, searching Thoth works by their publisher and publication date to find those which are newly ready to disseminate. The platform-specific details, e.g. whether the connection is via an API or some other method, what metadata is required, and in what format, are encapsulated within separate modules at the core of the program.
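To make the selection step concrete, here is a minimal sketch of filtering works by publisher and publication date. It does not use the Thoth API client; the record layout (work_id, publisher_id, publication_date) and the function name are assumptions made purely for illustration.

```python
from datetime import date


def find_works_to_disseminate(works, publisher_ids, start, end):
    """Return works from the given publishers published within [start, end]."""
    selected = []
    for work in works:
        published = work.get("publication_date")
        if published is None:
            continue  # no publication date yet, so nothing to disseminate
        if work["publisher_id"] not in publisher_ids:
            continue  # publisher has not signed up for this platform
        if start <= date.fromisoformat(published) <= end:
            selected.append(work)
    return selected


# Hypothetical sample records standing in for the results of an API query.
sample_works = [
    {"work_id": "w1", "publisher_id": "p1", "publication_date": "2024-03-01"},
    {"work_id": "w2", "publisher_id": "p2", "publication_date": "2024-03-02"},
    {"work_id": "w3", "publisher_id": "p1", "publication_date": "2024-01-15"},
    {"work_id": "w4", "publisher_id": "p1", "publication_date": None},
]

new_works = find_works_to_disseminate(
    sample_works, {"p1"}, date(2024, 2, 25), date(2024, 3, 3)
)
print([work["work_id"] for work in new_works])
```

A scheduled job could call a function like this once per run, with the date window covering the period since the previous run.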
For all of our archiving platforms, the Service disseminates, at minimum, a PDF of the work itself, a set of metadata organised for platform-specific display and discovery, and a JSON file containing the full work metadata as recorded in Thoth. It notifies the team if any errors are encountered during submission, and it updates the existing Thoth metadata with the URLs of the newly-created archive versions. This automatic scheduled dissemination has been running smoothly for many months now, requiring only minor manual intervention when an upload fails or a new publisher signs up. Taking the time up front to establish a robust workflow saves many hours of repetitive manual tasks.
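The upload-and-report pattern described here can be sketched roughly as follows. This is not the Service's actual code: the run_dissemination function, the fake_upload stand-in, and the example.org URLs are all hypothetical, and a real run would go on to send the collected errors to the team and write the new URLs back to Thoth.

```python
def run_dissemination(works, upload):
    """Upload each work; return ({work_id: archive_url}, [error messages])."""
    locations, errors = {}, []
    for work in works:
        try:
            locations[work["work_id"]] = upload(work)
        except Exception as exc:
            # Record the failure and carry on, so that one bad work
            # does not block the rest of the batch.
            errors.append(f"{work['work_id']}: {exc}")
    return locations, errors


def fake_upload(work):
    """Hypothetical stand-in for a platform-specific upload module."""
    if not work.get("pdf_url"):
        raise ValueError("no PDF available")
    return f"https://archive.example.org/items/{work['work_id']}"


locations, errors = run_dissemination(
    [
        {"work_id": "w1", "pdf_url": "https://example.org/w1.pdf"},
        {"work_id": "w2"},
    ],
    fake_upload,
)
print(locations)
print(errors)
```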
As with our other TOAN integrations, the details of the partnership with Cambridge University Library were determined after careful discussion between stakeholders. We initially expected that dissemination would be direct to the main Apollo repository. However, the alternative suggestion of CUL setting up a separate repository instance for the project turned out to have many advantages. It would provide a test case for CUL to explore newer features and more advanced catalogue options without risking the stability of the central repository, as well as giving the potential for further experiments with other datasets.
The new pilot repository would be based on DSpace, allowing the team to utilise the expertise developed in their work on Apollo, but starting from scratch. This provided insights into the ease of setting up a new instance, as well as bringing into focus the amount of customisation work which had gone into Apollo over the years. As previously discussed, open-source repository software often allows many optional and tweakable features, making it suitable for a diverse set of end users, but with the drawback that work done to integrate with one implementation of the software might not translate to another. We had the privilege of being able to configure the pilot repository to showcase the Thoth works to best effect, while being mindful that more compromises might be required when connecting with legacy repositories.
However, we were pleased to discover that the technical core of the integration, from the TOAN side, was reasonably seamless. The SWORD v2 Python client, with which we had previously experimented to connect to EPrints and DSpace repositories, made it simple to write a module for uploading to the DSpace API. No significant changes were needed when switching from testing against Apollo to using the pilot repository, nor when the repository underwent an upgrade from DSpace v5 to v7. We encountered only a few minor bugs in the client itself, which we were able to work around; these largely arose from the age of both the client and the SWORD v2 protocol. While we will want to maintain our SWORD module for other use cases, we are aiming to move to using the newer DSpace REST API in due course.
Connecting to an API means being able to transfer data into your target platform’s database. But this is only useful if the data can then be extracted again in a meaningful way. Many of our discussions in the experimental phase revolved around how best to tag the metadata provided from Thoth so that it could be easily retrieved and displayed in the CUL repository web interface. Again, DSpace provides support for data to be stored and tagged in many different ways, giving it flexibility and broad appeal, but resulting in a lack of common standards.
There is a set of defaults available in the DSpace configuration, which we used as our base since we were developing a brand new instance. However, these themselves use the Dublin Core metadata terms, which also prioritise flexibility over prescriptivism. For example, the term dc.contributor is defined as representing “An entity responsible for making contributions to the resource”, with the comment “Typically, the name of a Contributor should be used to indicate the entity”. Individual implementers are given the freedom to decide whether, e.g., this name should be in Given-Name Family-Name format, or in Family-Name, Given-Name format, or whether the name can be supplemented with (or even omitted in favour of) a unique identifier such as an ORCID. Any discrepancy between what format an individual DSpace server is configured to expect and what a client such as TOAN provides can hinder the representation and therefore discoverability of the supplied information.
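As a small illustration of how much room this leaves, the sketch below renders the same contributor in each of the formats mentioned. The function name, the style labels, the sample name, and the all-zero ORCID placeholder are assumptions for the example, not part of Dublin Core, DSpace, or TOAN.

```python
def format_contributor(given, family, style="given-family", orcid=None):
    """Render a contributor name in one of two common orderings."""
    if style == "given-family":
        name = f"{given} {family}"
    elif style == "family-given":
        name = f"{family}, {given}"
    else:
        raise ValueError(f"unknown name style: {style}")
    # Optionally supplement the name with a unique identifier.
    return f"{name} ({orcid})" if orcid else name


print(format_contributor("Ada", "Example"))
print(format_contributor("Ada", "Example", style="family-given"))
print(
    format_contributor(
        "Ada",
        "Example",
        style="family-given",
        orcid="https://orcid.org/0000-0000-0000-0000",  # placeholder, not a real ORCID
    )
)
```

All three strings are valid dc.contributor values under the definition quoted above, yet a repository configured to expect one of them may display or index the others poorly.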
This led us to the idea of developing separate metadata ‘profiles’ for different platforms using Dublin Core. TOAN could choose which of these profiles to use when submitting metadata based on how each platform was configured. For example, many DSpace repositories might already be set up to ingest metadata from Jisc Publications Router, which has a publicly available schema, so we developed one profile based on this to ease integration with those platforms. Meanwhile, to enable a user-friendly display including external links to resources such as DOIs, we developed another, more customised profile for this specific project.
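One way to picture the profile idea is as a lookup from platform to a function which builds the metadata. The sketch below is an assumption-laden illustration: the profile names, platform names, work fields, and the choice of Dublin Core fields are all hypothetical, and the "plain" profile does not reproduce the Jisc Publications Router schema.

```python
def plain_profile(work):
    """Hypothetical profile: a bare DOI, as a generic ingest pipeline might expect."""
    return {"dc.title": work["title"], "dc.identifier": work["doi"]}


def display_profile(work):
    """Hypothetical profile: the DOI as a resolvable link for a friendlier display."""
    return {
        "dc.title": work["title"],
        "dc.identifier.uri": f"https://doi.org/{work['doi']}",
    }


PROFILES = {"plain": plain_profile, "display": display_profile}

# Which profile each platform receives (platform names are made up).
PLATFORM_PROFILES = {"legacy-repository": "plain", "pilot-repository": "display"}


def build_metadata(work, platform):
    """Build the metadata for a work using the profile chosen for the platform."""
    return PROFILES[PLATFORM_PROFILES[platform]](work)


work = {"title": "An Example Monograph", "doi": "10.1234/example.0001"}
print(build_metadata(work, "legacy-repository"))
print(build_metadata(work, "pilot-repository"))
```

Keeping the choice of profile in configuration means that adding a platform with an existing profile needs no new code.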
At the CUL pilot repository end, the implementation was relatively straightforward, as it is based on a standard DSpace v7.x repository, with plans underway to upgrade to DSpace v8.x. To take advantage of the different metadata profiles provided by TOAN, the pilot repository incorporates several customisations to allow for the storage and enhanced display of richer metadata. In short, two pieces of work were required: minor modifications to DSpace's standard UI theme to enhance publication display pages, and a metadata parser to map the fields provided by TOAN into the different metadata schemas that DSpace supports.
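The pilot repository's parser is not shown here, but the general idea of sorting incoming fields by schema can be sketched in a few lines. DSpace metadata field names take the form schema.element, optionally followed by .qualifier; everything else below (the function name and the sample values) is assumed for illustration.

```python
from collections import defaultdict


def group_by_schema(fields):
    """Group 'schema.element[.qualifier]' field names by their schema prefix."""
    grouped = defaultdict(dict)
    for name, value in fields.items():
        schema, _, rest = name.partition(".")
        if not rest:
            raise ValueError(f"field name has no schema prefix: {name}")
        grouped[schema][rest] = value
    return dict(grouped)


incoming = {
    "dc.title": "An Example Monograph",
    "dc.contributor.author": "Example, Ada",
    "dcterms.license": "https://creativecommons.org/licenses/by/4.0/",
}
print(group_by_schema(incoming))
```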
Although DSpace is widely used and has a broad range of features, as we had previously discovered, few institutional repositories are used to accepting programmatic connections via their APIs. Frequently, the majority of their content is ingested via manual upload to their web interfaces. This means that even for popular repository software, there may be only a small pool of users interacting with certain parts of the code in practice. It became evident that we were in a niche group when we encountered bugs not just within our implementation of DSpace/SWORD, but within the DSpace code itself.
In some cases, the CUL team were the ones to identify the root of the bug and supply a patch to fix it, benefiting the wider DSpace user community. It’s great that our project has been able to contribute back to the open-source ecosystem in this way, but it’s also telling that bugs have existed and gone unnoticed within the DSpace SWORD v2 code paths for so long. It will be interesting to see whether we encounter similar issues when moving to the REST API!
Once all the bugs had been ironed out and we had agreed on the best format for submitting and presenting metadata, the last step, as for Internet Archive, was to run a bulk back catalogue upload. Because we could largely re-use the script developed for the earlier upload, and because that upload had already highlighted works whose metadata needed correcting, this process ran quite smoothly. We then enabled the automatic dissemination workflow for upcoming new publications; at the time of writing, the CUL pilot repository contains over 800 open access monographs.
The workflow which is now in place for archiving will also form a basis for investigations into preservation. As part of the pilot, CUL is exploring preserving the content in-house as part of the Libraries' wider Digital Preservation Programme. The types of material hosted in this platform provide an exemplary use case of scholarly content that is 'preservation ready', uses open and standard file formats (e.g. PDF and EPUB), and is accompanied by rich, high-quality descriptive metadata.
Now that it has been established, the link between TOAN and the pilot repository has required relatively little effort to maintain. The pilot is scheduled to run initially for three years, alongside the OBF project, until Spring 2026. This will provide valuable data about the resources needed to provide an archiving repository of this kind.
Header photo by Rafael Garcin on Unsplash