OBF Archiving and Preservation: End of Year 1 Reflections

Looking back at Year 1 of the Open Book Futures Project and the activities of the Archiving & Preservation team in Work Package 7

Published on May 31, 2024

At the close of Year 1 of the Open Book Futures Project, we reflect on the progress made and look to the future for Year 2. Detailed below is the progress made in further developing the Thoth Archiving Network, a slight rebrand, the institutional and generalist repositories on board so far, and the challenges, lessons learned, and future direction of the Network. We also talk about the first year of our National Libraries Network, with a brief summary of what we’ve learned so far.

The Thoth Archiving Network

Beginning with the COPIM project (2019 – 2023) and progressing into the Open Book Futures project (2023 – 2026), the Thoth Archiving Network has been created to support the archiving of open access monographs in order to ensure future access to important, but at-risk, scholarly material. Our initial vision was to create a push-button deposit tool for the small and scholar-led presses who might otherwise be excluded from existing archiving solutions due to various resource deficits. Participating institutional repositories would provide archiving locations for these publishers, who, as we found, often did not have any archiving or preservation solution in place at all.

The fledgling network is up and running with some new approaches, e.g. the use of generalist repositories, but with the same ambition to help small and scholar-led OA monograph presses secure their work for the future. As we will detail, our approaches have necessarily had to change somewhat due to the realities these challenges have presented. However, as Thoth and the archiving network progress and mature, we will keep the needs and requirements of the small and scholar-led press at the heart of what we do.

Now open: the Thoth Open Archiving Network

We have recently reappraised the name and aims of the network, previously known as TAN, or the Thoth Archiving Network, and have decided to do a slight rebrand! The archiving network always focused on open access monographs, so in that respect it was always ‘open’. One realization we came to at the end of Year 1 of Open Book Futures, however, was that we wanted openness to apply not only to the books we were archiving. While the concept is still in its infancy, our ambition is to develop the archiving network to be completely open in its practices, so that it could even be audited from the outside to assure that the works we deposit in our participating repositories are safe, uncorrupted, and in the locations we promised. We are working with the Digital Preservation Coalition (DPC) on how we might implement the creation and management of checksums, store these within Thoth, and thus allow our entire process to become auditable in a transparent and dependable way. There will still be some testing to do once checksums are implemented on the various repository platforms we’ll be using, but beyond serving open access monographs, we hope to operate as fully open infrastructure, too. This new branding has yet to be implemented across all participating platforms but will be rolled out over the coming months.
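
To illustrate the kind of audit this would make possible, here is a minimal sketch in Python. It assumes SHA-256 as the checksum algorithm and uses hypothetical function names; the actual algorithm, and how checksums will be stored within Thoth, have not yet been decided.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copy(path, recorded_checksum):
    """Check an archived copy against the checksum recorded at deposit time.

    `recorded_checksum` stands in for a value that would be held in Thoth;
    this is an assumption made for illustration only.
    """
    return sha256_of_file(path) == recorded_checksum
```

Anyone holding the recorded checksum could run the same comparison against a copy downloaded from a participating repository, which is what would make the process auditable from the outside.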

Our institutional repository partners so far

On board from the beginning as part of the COPIM project and Open Book Futures, the Loughborough University repository is one of our main partners. At Loughborough, the repository is an instance of Figshare. You can read about our initial workflow experimentations in Figshare during the COPIM project here:

Our other current UK institutional repository is at The University of Cambridge. Cambridge University Libraries have agreed to pilot their participation in the Thoth Open Archiving Network for the initial duration of the Open Book Futures project (2023 – 2026) with expectations that it will hopefully continue once further participation details are ironed out over the course of the project in years two and three. We presently have the full catalogues of two separate open access monograph publishers on Cambridge University Library’s dedicated DSpace instance.

While nothing has yet been officially agreed, we are also in conversation with the library system at the University of California, as well as the Lancaster University library, which is the lead institution on the Open Book Futures project. We hope to finalise these discussions over the course of the coming months.

A new approach: generalist repositories

Following discussion during several early strategy meetings within Open Book Futures, we considered what might be broadly necessary for an archiving network such as TOAN. How many institutional repositories? And what about including readily available generalist repositories? While we haven’t answered the first question yet, as we are still assessing the size of demand within the communities of publishers who would use the Thoth Open Archiving Network, the second question’s answer was a clear ‘yes’.

The Internet Archive (IA) allows anyone to upload to collections within the IA’s Wayback Machine at archive.org with a free account, which can be registered within a few minutes. As mentioned above, our initial automated archiving workflow testing within the Internet Archive was reported on within the final months of the COPIM project. This automated deposit testing was made possible by open APIs (Application Programming Interfaces) in Thoth and the IA and by the availability of open-source software libraries, allowing for interoperability and quick development of the necessary code. After successful initial deposits, Thoth has now been running automated monthly uploads of any new books published by the presses being archived by the network: https://archive.org/details/thoth-archiving-network.

So far there are over 750 open access monographs archived within the Thoth Open Archiving Network with more archived books forthcoming from additional presses within Year 2 of the Open Book Futures Project.

Zenodo

Zenodo is another generalist repository that allows free deposit with an account, either via manual upload through the web interface, or by using Zenodo’s open API to accept files for deposit. Our Thoth developer, Ross Higman, investigated the API and metadata requirements and performed sandbox testing to determine how the OA monographs would display in the user environment once published. The results were successful, and we are preparing to implement the automated upload process shortly, beginning with the back catalogues of two OA monograph publishers previously tested on other platforms. Once this has been performed, deposit will open up to other Thoth subscribers.
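
As a rough illustration of what meeting those metadata requirements can involve, the sketch below builds a deposit payload for a book using field names from Zenodo's public deposit API documentation. The input dictionary and its keys are hypothetical and do not reflect Thoth's actual data model or the code used in our testing; no network request is made.

```python
def build_zenodo_metadata(book):
    """Map a simplified book record onto a Zenodo deposit metadata payload.

    `book` is a hypothetical, flattened record used for illustration only.
    """
    return {
        "metadata": {
            "upload_type": "publication",
            "publication_type": "book",
            "title": book["title"],
            # Zenodo expects one object per creator.
            "creators": [{"name": name} for name in book["authors"]],
            "description": book["abstract"],
            "publication_date": book["publication_date"],  # ISO 8601 date
            "access_right": "open",
            "license": book["licence_id"],
        }
    }
```

A payload like this would be sent to a deposit created in Zenodo's sandbox environment first, so that the resulting record page can be checked before anything is published.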

Development work

While our aim is ease and automation for publishers, the reality behind creating these simplified processes is a fair amount of development work that must first be done: initially scoping the requirements, exploratory development, and then building the tools and workflows necessary to facilitate streamlined depositing. Our Thoth developer has been busy working out suitable approaches for testing and deploying on the above platforms, and there will be future work necessary after some initial testing with other archiving and repository platforms, such as EPrints and Samvera.

Thoth Open Metadata is a non-profit, open metadata management and dissemination platform using open-source software to implement open workflows that rely on fully open data. Thoth as an infrastructure, also created with the COPIM project, was the natural home for the archiving network we were working to create, as there was already the knowledge and skill within the team to advance the development of the network and the dissemination of content and metadata.

Sustainability

As the Thoth Open Archiving Network has evolved, we have grown beyond the push-button deposit tool that was our initial ambition. The realities of creating a complex service rather than a single tool, and the integrations necessary to engage with various platforms, have meant that our increasing knowledge and understanding comes at the cost of human effort and service sustainability.

The initial infusion of funding from the COPIM project meant that the development of Thoth and other deliverables was made possible. The additional funding from Open Book Futures now allows us to expand and accelerate the uptake of those proof-of-concept services and the work we can do. That said, the longevity of Thoth and the archiving network will also depend on financial stability and a self-sustaining business model. This is being made possible by Thoth’s subscription models for publishers, ranging from Thoth Free, which provides free facilities to create, manage, and export metadata in a variety of industry-standard formats, to two additional levels of set tier subscription services labeled Thoth Plus. Further options are available for bespoke services. The subscription tiers are set per book and deliberately kept to be as affordable as possible. Thoth Plus A, the basic paid tier for Thoth services, includes streamlined DOI registration, distribution to key content and metadata aggregator platforms, and dissemination to two generalist archiving locations. The extended tier, Thoth Plus B, includes more advanced distribution and robust archiving solutions. While our archiving model is in its early stages, the exact locations of the archived work are not presently fixed, as we are utilising the partner repositories presently part of TOAN. In the future, as our archiving model matures, we will strive to implement a strategic allocation of archived OA monographs in specific locations based on the conditions set by the repository locations, the needs of the publication/publisher content, and the tier level of publisher subscription. 
In addition to the dissemination of OA monographs to a number of archiving locations, we hope to offer subscribed publishers additional services such as the parsing and subsequent archiving of citation / reference URLs, the depositing of additional materials, and the creation of checksums. We believe all of these will further strengthen the robustness of the archived content and thus make TOAN a valuable service built on open data and open practices.

We also want to highlight that we have not forgotten about the small or scholar-led press with no financial income who may not be able to subscribe to these services (also see Ramalho et al., 2024). Our overall aim is still to support and enable small presses to distribute and archive their work with services that are on par with, if not exceeding, the services that larger, better-supported publishers are able to afford. That said, at this point in Thoth and TOAN’s development, we must make sure our offering is sufficiently robust and sustainable. Therefore, we are presently unable to offer all services for free to everyone right from the start, but publishers may contact Thoth directly to request assistance based on financial need. In the future, Thoth hopes to operate a “moving wall” of service offerings, where certain offerings may become discounted or free as the model further matures towards becoming fully sustainable.

Content policies and the role of the institutional repository

We were sometimes met with surprise when suggesting this role for institutional repositories, as the traditional role is solely to serve the authors employed at any given institution. The content policies, which are often set by the University Library or the Research Office (or both), in most cases define the IR as an online venue to showcase the work of their own academics. This makes sense from the HEI’s perspective, as the repository is a service created within and funded by the university where it is located. While the traditional institutional repository content policy is a notable obstacle, it is not insurmountable. In fact, the traditional role of the institutional repository has been evolving for some time. Along with new additions of institutional publishing and the open curation of conference contributions that have emerged in recent years, university libraries are beginning to see the importance of supporting open access scholarship more widely. Universities in the UK such as the University of York, the University of Sheffield, and others are seeking new ways to support the open access ecosystem, in order to help facilitate the expansion of open access research culture.

The Thoth Open Archiving Network is an open infrastructure that allows participating institutions and institutional repositories to contribute to ensuring long-term access to essential but at-risk contributions to scholarly knowledge. As we already know, large publishers do not experience barriers to existing archiving and preservation pathways, whereas small, scholar-led, and even some medium-sized open access monograph presses experience barriers in available resources, and can also experience barriers due to the eligibility criteria of those pathways. The number of OA monographs published by this subset of publishers every year is not high, and participating institutional repositories can work in collaboration with TOAN to establish their own criteria in terms of how much they would accept each year, research and subject areas, and other conditions.

Other challenges we are aware of are related to university IT systems and the necessary layers of security that exist to protect university systems, data, and operations. Some institutions will only allow user accounts for active students or members of staff, which may prevent accounts being created for external parties to deposit within IR workflows. This was a point raised in our COPIM workshop with repository managers from UKCoRR that took place in November 2022. We are aware that some repository systems will require more cooperation and consideration than others to create effective workflows, but this barrier may not be as impenetrable as it first seems. Should an institution with security measures such as these in place wish to join TOAN, we are happy to work with the necessary teams to develop a practicable solution.

Despite some of these challenges and potential barriers, it is worth noting that quite a few of the repository managers attending this initial workshop were interested in principle in contributing to the network and in keeping tabs on TOAN’s development. We are hopeful that as our model matures, more of these repositories will come on board in the future.

Metadata

Thoth’s capabilities mean that multiple output formats and specifications for more than a dozen platforms, all CC0-licensed, can be generated from any book record in Thoth, and all the metadata is tailor-made for Open Access books. This functionality means that for our testing and uploads to repositories, we had this tool at our disposal. Publishers who are user-subscribers to Thoth will have their metadata (and eventually files) held within the metadata and dissemination system, so they or anyone else can generate the metadata in any of the available output formats for free.

Even with this functionality available, it did not solve all our challenges with the metadata needs of repository platforms. There are different default metadata requirements for each repository platform software. The metadata can be automatically sent to a repository platform along with the content file, but the metadata schema must match the default requirements for the book information to display properly. A detailed discussion of earlier work in testing automated ingest within Figshare, DSpace, and EPrints can be found in this earlier PubPub post.

This metadata development work is important not just for an individual repository that we work with at this stage, but also for future repositories that use the same software. For example, the new DSpace instance at Cambridge for the Thoth Open Archiving Network is DSpace 7. DSpace uses a Qualified Dublin Core (QDC) based metadata schema by default. However, institutions can add to or extend that schema. Because our aim is to develop a metadata schema that would work interoperably with all DSpace instances, as seamlessly as possible, we had discussions about creating two sets of metadata schema: one that is as rich as possible and engages with the desires of Cambridge to incorporate RIOXX-based fields, and a second that is minimal and matches the default metadata schema set by DSpace. With this second, more basic option, even if an institution has extended or added to the default schema, automated deposit should still proceed without errors.
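
The two-schema idea can be sketched as follows. This is an illustration under stated assumptions rather than the mapping we actually use: the input record is a hypothetical simplified dictionary, the `dc.*` names follow DSpace's default Dublin Core registry, and the RIOXX-style field in the usage note is a placeholder for whatever an institution chooses to add.

```python
def to_minimal_dspace(record):
    """Build a minimal metadata set using only default DSpace `dc.*` fields."""
    return {
        "dc.title": [record["title"]],
        "dc.contributor.author": list(record["authors"]),
        "dc.date.issued": [record["publication_date"]],
        "dc.publisher": [record["publisher"]],
        "dc.identifier.isbn": [record["isbn"]],
        "dc.rights.uri": [record["licence_url"]],
    }


def to_rich_dspace(record, extra_fields):
    """Extend the minimal set with institution-specific fields.

    `extra_fields` maps a target field name to a key in `record`; fields
    whose source key is missing from the record are skipped, not sent empty.
    """
    metadata = to_minimal_dspace(record)
    for target_field, source_key in extra_fields.items():
        if source_key in record:
            metadata[target_field] = [record[source_key]]
    return metadata
```

For example, `to_rich_dspace(record, {"rioxxterms.type": "work_type"})` would add one extended field on top of the minimal set. Because the rich set is always a superset of the minimal one, a repository that only supports the defaults can be sent the minimal set without any change to the record itself.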

To counter any potential future issues with metadata requirements changing after content has been deposited and simply to back up a record of the most complete metadata available for a monograph held within Thoth, we also send a JSON file along with the PDF file for the book. This is held as a file alongside the monograph and may be downloaded, as well.
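
A minimal sketch of producing such a companion file is shown below. The function name and the file naming convention (same name as the PDF, with a `.json` extension) are assumptions made for illustration, and `record` stands in for the full metadata exported from Thoth.

```python
import json
from pathlib import Path


def write_metadata_sidecar(pdf_path, record):
    """Write the metadata record as a JSON file alongside the book's PDF."""
    sidecar = Path(pdf_path).with_suffix(".json")
    sidecar.write_text(
        # Sorted keys and indentation keep the file stable and human-readable.
        json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False),
        encoding="utf-8",
    )
    return sidecar
```

Both files would then be deposited together, so the fullest available metadata travels with the book even if a repository's own schema only displays part of it.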

The OBF National Libraries Network

Beginning at the start of Open Book Futures in May 2023, the National Libraries Network was created to bring together National Libraries from around the world to convene around the archiving and preservation concerns related to the open access monograph. The Archiving and Preservation team worked together with British Library colleagues and the Digital Preservation Coalition (DPC) to identify and engage National Libraries who might be willing and able to join the Network.

Initially, the network was composed of the British Library, who is helping to guide the Network’s direction, the National Library of Scotland, and the German National Library, all attending the Network’s first meeting in September 2023. Since then, the Network has grown to 7 members, including the Library of Congress and Qatar National Library, joining for the second meeting in January 2024, the National Library of the Netherlands (KB) joining for Meeting 3 in April 2024, with the Library and Archives Canada on board for Meeting 4, to take place in July/August 2024. We are grateful to the teams of all the National Libraries on board for their time and contributions so far and are looking forward to the future conversations and endeavours that will be made possible by their involvement.

In our first meeting, which took place in September 2023, we introduced the aims of the OBF National Libraries Network to the attending libraries and in turn learned more about the handling of open access monographs by the libraries, including their methods of content ingest, pipelines to preservation and archiving, an introduction to their collection policies, and approaches to handling metadata. These introductory conversations were deeply useful in understanding initial potentials and limitations in the handling of OA monographs, as well as in identifying future themes for subsequent meetings.

Our second meeting in January 2024 considered what the Network aims to deliver for the National Libraries themselves, not just the OBF project, in creating a global forum for National Libraries to discuss and potentially coordinate approaches. A discussion on Legal Deposit took place, considering the impacts of the needs and requirements of Legal Deposit on open access and access in general. Also considered were the differences between content hosting and connection, as well as the benefits and limitations of each. There needs to be a risk analysis with regard to the persistence of the work and what confidence the National Library might have in content remaining preserved at another National Library location (or not). The challenges around open access status and metadata were raised, particularly when the metadata does not include an OA licence, and this is a persistent concern that will be followed up in future meetings and work with the National Libraries over the course of the project. Clear and agreed-upon definitions of what ‘open access’ means are also needed, as one National Library raised the issue that this isn’t always clear due to publishers inaccurately labelling something OA, or a book or chapter only being made ‘open access’ for a temporary period. The tension between programmatic solutions for some of these issues and the persistent need for manual review was also discussed.

In our third meeting, near the end of April 2024, we asked the National Libraries teams to consider what technical requirements and workflows would be theoretically required to create an archiving/preservation network of National Libraries for the archiving/preservation of open access monographs globally. One of the deliverables of the archiving and preservation work package within Open Book Futures is to scope the technical and administrative needs that would be required, theoretically, to create such a network, and what that would look like. So while we are working with the National Libraries to determine how this might operate, there is not an immediate expectation that a formal network serving this purpose will be created within the scope of the project. Attending National Libraries brought members of their extended teams, including specialists in metadata and legal deposit workflows for eBooks, to begin to discuss the current workflows within the libraries. Challenges and opportunities around metadata, differing cataloguing and deposit workflows for different eBook deposit purposes, and the requirements of Legal Deposit in some countries (such as the UK) to ‘lock down’ even open access books due to the current law were all primary points in the discussion. Another point was the challenges of persistent identifiers and ISBN matching, which links to the additional issue of a reliable metadata matching process. Decisions about what content is preserved within the National Libraries and how, and the policies determining these decisions, were also discussed. A primary point of action from the meeting will be to determine the best way of taking forward the technical workflows questionnaire shared with the National Libraries in this meeting. This may be taken forward as interviews within each institution or focus groups with relevant staff.
Our work package colleagues will also be meeting with the PALOMERA Project to coordinate with their parallel efforts and National Libraries survey work in the intervening period between this meeting and our fourth meeting with the Network. More as we progress into Year 2 of Open Book Futures!

Conclusion

While lots of work has been accomplished in this first year, we are really looking forward to Year 2, where much of the growth we have seen so far will inevitably accelerate and expand even further. Thoth is finalising subscriber contracts with its first influx of participating publishers, and we are working on new developments around checksums, file hosting for distribution, collaborative link rot solutions, and new research around PhD thesis and experimental monograph archiving. We can’t wait to share more with you as these strands develop, so watch this space!


Header image by Julian Hochgesang on Unsplash.
