Experimenting with repository workflows for archiving: Manual ingest
Over the course of the last year (2021-2022), colleagues in COPIM’s archiving and preservation team have been considering ways to solve the issues surrounding the archiving and preservation of open access scholarly monographs. Most large publishers and many university presses have existing relationships with digital preservation archives, but small and scholar-led publishers lag behind due to a lack of resources.
One of the potential solutions we have been considering is the university repository as an open access archive for some of these presses. COPIM includes a number of scholar-led presses, such as Mattering Press, meson press, Open Humanities Press, Open Book Publishers and punctum books. Partners on the project also include UCSB Library and Loughborough University Library. In cooperation with Loughborough University Library, we began to run some preliminary repository workflow experiments to see what might be possible, using books from one of the partner publishers.
Loughborough University employs Figshare as its primary institutional repository, so we began with this as a test bed for our experiments.
A tale of two monographs: Open Book Publishers
The first volume from Open Book Publishers (OBP) that we employed in these workflow experiments was Denis Diderot 'Rameau's Nephew' – 'Le Neveu de Rameau': A Multi-Media Bilingual Edition (https://doi.org/10.11647/OBP.0098). We selected this book for the relative complexity of its content. There are images, audio files, and additional texts, as well as several different file formats of the main text. The 13 audio files, offered in both .wav and .mp3, are essential components of the text, important to the understanding of the work as a whole. Because part of our work on the archiving and preservation of digital monographs concerns the varying levels of complexity these books possess, we felt this would be a good selection for our exercises.
The first manual workflow experiment approached the deposit of Rameau’s Nephew and all corresponding materials as if the book had been published by a theoretical press, which we called “Loughborough University Press (LUP).” The theoretical author in this instance is an academic depositing internally to Loughborough University, where the press would be positioned. An internal publishing author will already have login access to the university’s Figshare repository, so the element of access is simplified.
As there are multiple files of varying types that make up this digital book, the initial premise was to choose between the Figshare repository functions of “project” and “collection”, both of which are groups of “items”. An item is a deposit of one or more files onto a single record, possessing a single set of associated metadata. Items can be gathered into either a project or a collection, each of which also has its own metadata for the grouped items. There are benefits and drawbacks to both for our purposes, but in order to determine what these were, we needed to create one of each for both selected monographs.
For Rameau’s Nephew, the following items were created:
An item for each of the file formats for the central text files: PDF, XML, EPUB, and MOBI (4 individual items total)
Items for each of the 13 musical compositions (13 items total, each including a WAV and an MP3 file)
Items for each of the 5 supplementary texts (PDFs, 5 items total)
The items were created individually and then grouped together into a collection, or, as a separate workflow, created within a project container.
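The breakdown above can be sketched in miniature as a data model. This is purely illustrative: the titles and filenames below are hypothetical, and Figshare’s real metadata schema carries many more fields per item.

```python
# A minimal sketch of how the deposit for Rameau's Nephew was broken into
# Figshare "items", each carrying its own files and metadata, before being
# grouped into a project or collection. All names here are illustrative.

def build_items():
    """Return the list of items created for Rameau's Nephew."""
    items = []

    # One item per file format of the central text (4 items).
    for fmt in ["PDF", "XML", "EPUB", "MOBI"]:
        items.append({"title": f"Rameau's Nephew ({fmt})",
                      "files": [f"text.{fmt.lower()}"]})

    # One item per musical composition, each holding a WAV and an MP3 (13 items).
    for n in range(1, 14):
        items.append({"title": f"Composition {n}",
                      "files": [f"composition_{n:02d}.wav",
                                f"composition_{n:02d}.mp3"]})

    # One item per supplementary text (5 items).
    for n in range(1, 6):
        items.append({"title": f"Supplementary text {n}",
                      "files": [f"supplement_{n}.pdf"]})

    return items

items = build_items()
print(len(items))  # 22 items in total, matching the project/collection count
```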
This was done in order to represent full metadata for the component parts of the book and its essential and supplementary material. One finding from our research so far is that when a digital monograph is preserved, it is often only the main text file, usually a PDF or XML, that survives, and not any of the supplementary material, regardless of how “essential” this material may be to understanding the work. This process also allowed us to clearly indicate the connections between all of the materials deposited.
Here are some screenshots of the deposited material:
Overall, for Rameau’s Nephew, there were 22 items in each project/collection, with each item having its own unique set of metadata, as well as a set of metadata accompanying the full project or collection.
Image, Knife, and Gluepot
The second volume used in the workflows was Kathryn M. Rudy’s Image, Knife, and Gluepot: Early Assemblage in Manuscript and Print (https://doi.org/10.11647/OBP.0145). This monograph has fewer supplementary materials, but contains a high number of images. Like Rameau’s Nephew, there are four different file formats available for the book: PDF, XML, EPUB, and MOBI.
OBP also makes HTML versions of these books available on its website, which allows visitors to read the books online without downloading them. However, the HTML versions were not included in our archiving workflow experiments, as HTML is not a downloadable “file” in the same sense as a PDF or EPUB, for instance, and hence not “archivable” in the same way either of these would be within an institutional repository. HTML and webpages are often better handled by web-crawling services such as the Internet Archive’s Wayback Machine, and so they are outside the remit of our current work.
For Image, Knife, and Gluepot, we made items for each of the four file formats of the main monograph text (PDF, XML, EPUB, and MOBI). Each item was published individually with its relevant metadata, and the four items were gathered into the “project” and “collection” containers, respectively. The images below are from the “project” created for Image, Knife, and Gluepot.
The XML Item: Preview issues
Because Figshare allows files to be previewed in-browser on any live record, previews of each image file were available to view in the XML item. However, so were icons or preview images of all the other file types within the item, some of which did not preview well due to their size or type, leading to a slightly confusing presentation of the content. See the screenshot below.
As one can see, some of the images, due to their small size, are blurry, and these are also mixed in with other text files, which generally means the contents of the item are difficult to parse. While previewing in-browser as a function is excellent for more straightforward items (such as a single PDF or EPUB, or a small set of high-quality images), for XML the preview function is less a help than a hindrance.
When reviewing the manual workflow experiments, we considered depositing the Zip file that contained the XML. This option would render as a “file-folder view” of the content, which could be easier to follow: instead of the content being “previewed”, there would simply be a list of the contents and how they are organised within the folder. The files wouldn’t be viewable unless downloaded, but they would be present and decipherable, as the following image shows:
The Zip file option was not fully pursued, but could be a possibility for use in automated workflows in the future.
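To illustrate what a single-Zip deposit might involve, here is a minimal sketch that bundles a hypothetical XML edition into one file and lists its contents, which is roughly the information a repository’s “file-folder view” would display. The directory layout and filenames are assumptions for the sake of the example, not OBP’s actual structure.

```python
import os
import zipfile

def pack_xml_edition(src_dir, zip_path):
    """Bundle a directory into a single Zip for deposit as one file."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the source directory, so the
                # repository's folder view mirrors the original structure.
                zf.write(full, os.path.relpath(full, src_dir))

def list_contents(zip_path):
    """Return the file list a repository 'folder view' would display."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())
```

For example, packing a folder containing `book.xml` and an `images/` subdirectory would yield a Zip whose listed contents preserve that hierarchy, without any in-browser previewing of the individual files.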
The second manual workflow experiment was approached from the angle of an academic author publishing with “Loughborough University Press” from a position external to the university. Theoretically, external users can be invited to a project on Figshare by an internal member of staff, and that external user can deposit material into that project once they accept the invitation and create a Figshare account (if they do not already have one). This function works well in the actual Figshare instance. However, as we have been using the “sandbox”/test area to complete these workflow experiments, the extra layer of security meant that an actual “external” email address couldn’t be invited. (We attempted this, and the expected invitation email was never received.)
We worked around this, for the sake of completing the manual workflow experiment, by inviting another member of Loughborough University staff into the project. The COPIM colleague was then given administrative access to the Loughborough staff member’s account, allowing them to complete the manual deposit on that staff member’s behalf. While this wasn’t exactly “external”, the process was useful, because we came to realise that while items created individually and put into a “collection” could not be added to a “project”, the reverse is possible: if items are created within a project, they can then be added to a collection.
Pros and cons
When weighing up the “project” and “collection” functions for the sole purpose of archiving a monograph, the project function won out, because when a collection is created in Figshare and subsequently published, a DOI is automatically created. While for other purposes, such as creating an online collection of authored/created content, this is ideal, for an already-published monograph it is not. This is because in most cases the original DOI minted by the publisher should be the only DOI for a monograph.
Multiple DOIs will lead to confusion and multiple citations, as well as usage data being obscured. The project function allows for gathering and connecting of monograph materials, making the archived content available in an open access fashion, while not creating an extraneous and unnecessary DOI. The project function also allows for potential collaboration with external members of small and scholar-led presses, or external authors to a university press.
The reality is, however, that Figshare is only one of several main players in repository software used by universities and libraries. The manual deposit workflow has not yet been applied to DSpace or EPrints, due to access issues. The other reality is that manual input itself has some very recognisable pros and cons.
The benefit of manual deposit and manual metadata input for repository-archived monographs and their supplementary components is the ability to create very specific and thorough metadata for the files, as well as to ensure clearly articulated connections between the files, both the monograph text and the supplementary content. However, there are glaring cons to this pathway as well.
One primary issue with the entire process is this: manual deposit takes considerably more time (individual/staff resource) and requires technological resource or expertise. Another major finding from earlier research, which contributed to our first Scoping Report, is that small and scholar-led presses have major deficits of resource: financial, staffing, and at times technological expertise. While the COPIM staff member completing the manual deposit workflows was expertly familiar with the Figshare repository software, this wouldn’t necessarily be the case for every press staff member.
Despite this expert familiarity, the process of depositing both volumes from OBP took approximately three days (though this included creating both the project and collection versions). The reality is clear: the process of depositing digital monographs into willing repositories for archiving needs to be automated. Small and scholar-led presses won’t be able to spare the staff time to complete a manual deposit process for every monograph, particularly if training is needed beforehand.
This would mean less nuance in some ways. Functions like collections or projects, and elaborate individual metadata for all component parts, wouldn’t be possible in the same way via an automated process. But in reality, the time expense of manual deposit would simply be prohibitive to most small presses, meaning archiving in this way simply wouldn’t happen. Because many small and scholar-led presses do not yet have any active preservation policy in place, at least one of the options we present must be as simple, straightforward, and quick as possible.
Summary & concluding thoughts
While ultimately our findings from these manual workflow experiments led us to appreciate the need for automation, there were still some important insights that arose from the work. Grasping the differences in metadata fields (default/existing and custom) within repository systems was one of several. However, many of these understandings have led to more questions: in particular, whether the access copy and the archive copy of a complex/enhanced digital monograph should be the same, how connections between monographs and their associated content might be indicated if they are in different online locations, and how to approach the archiving and preservation of linked content. Many of these questions are still being examined, and will be discussed further in future archiving and preservation work package outcomes.
The key finding is truly the need for some sort of automated option for basic archiving of outputs from small presses that publish open access monographs. In order to figure out what might be possible, the next step was to bring in the help of one of our developer team to perform workflow experiments with automated API deposit. We started with Figshare, as we had access already to the sandbox side of Loughborough University’s repository. Our next blog will address some of the findings from this test workflow experiment.
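As a rough preview of what automated deposit involves, a metadata-only deposit against Figshare’s public v2 API might be sketched as follows. The token, project ID, and metadata fields here are placeholders, the full file-upload exchange (a further multi-step process) is omitted, and Figshare’s API documentation should be treated as the authority on endpoints and fields.

```python
import json
import urllib.request

API_BASE = "https://api.figshare.com/v2"  # Figshare's public API root

def build_article_metadata(title, description, doi=None):
    """Assemble JSON metadata for a new item (an "article" in API terms).

    Only a few common fields are shown; Figshare accepts many more.
    Linking the publisher's existing DOI (here via the resource_doi
    field) is one way to avoid the duplicate-DOI problem noted above.
    """
    metadata = {"title": title, "description": description}
    if doi:
        metadata["resource_doi"] = doi
    return metadata

def create_article_in_project(token, project_id, metadata):
    """Create a metadata-only record inside an existing project.

    Uploading the actual book files is a separate multi-step exchange,
    omitted from this sketch.
    """
    url = f"{API_BASE}/account/projects/{project_id}/articles"
    req = urllib.request.Request(
        url,
        data=json.dumps(metadata).encode(),
        headers={"Authorization": f"token {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A script along these lines, looped over a press’s catalogue, is the kind of thing the automated workflow experiments explore.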
Interesting post. These sorts of issues are increasingly coming up for repository staff around creative work theses. At the moment we are using a single-record approach, so it’s interesting to think about what we could do in the future to better improve archiving and preservation through a project/collection approach. Another issue that comes up for us is representation of digital outputs, e.g. interactive websites. What is the most accurate way to represent and preserve such outputs: standard web archiving, video recording of the website, screenshots? Lots of challenges to contend with as we move away from text-based PDFs. Dimity Flanagan, Manager Scholarly Communications, University of Melbourne.