As part of the documentation for the first book coming out of the Combinatorial Books Pilot Project, we are sharing some considerations around the archiving and preservation of experimental monographs.
This is the eighth blogpost in a series documenting the COPIM/OHP Pilot Project Combinatorial Books: Gathering Flowers. You can find the previous blogposts here, here, here, here, here, here, and here.
In what follows, we share some insights into the still ongoing considerations around the archiving and preservation of Ecological Rewriting: Situated Engagements with The Chernobyl Herbarium. We have also asked our colleague Miranda Barnes from WP7 to share – via two audio contributions embedded in this blogpost – some of her expertise around the archiving and preservation of experimental monographs.
With the first outcome of the Combinatorial Books Pilot Project, the experimental monograph Ecological Rewriting, we have created what – in preservation terms – can be considered a complex monograph. Ecological Rewriting, as a whole, consists of a book ingested, peer-reviewed, and published on the PubPub platform that includes text, hyperlinks between these texts and external websites, images, as well as annotations made by the co-editors of the book with the PubPub annotation function; two HedgeDoc pads that are maintained by the co-authors of Ecological Rewriting (here and here); annotations made by the co-editors and authors with hypothes.is on the PDF version of the source text, The Chernobyl Herbarium; annotations made by the peer-reviewers and co-authors on a draft version of the PubPub book; as well as bi-directional hyperlinks between the different stages and layers the book consists of, which are included in any “final version” of the publication (Adema & Kiesewetter, 2022).
“Each of these innovations presents preservation challenges; their combination creates an even greater challenge: the need to maintain multiple formats and the connections among them” (Verhoff, Hanson, Greenberg, 2022, 81). These challenges are, as our colleagues from COPIM’s archiving and preservation work package, WP7, point out, different to the ones emerging in the preservation of traditional OA monographs where a “typical file package … would include the book (in either one file format, such as PDF, or several, if these are produced) along with a metadata file” (Barnes et al., 2023, 95).
One challenge regarding the preservation of Ecological Rewriting is the volume and variety of formats, file types, and data included. Here, some of the main questions that we have to address are: how to export and archive the different components Ecological Rewriting consists of in a way that is appropriate for each format used; how to preserve the different components of the book as well as essential supplementary material in a way that can track the relationships between these parts (for example, the PubPub book, the HedgeDoc pads, and the annotated PDF version of the source text); and how to make them recognisable as part of the same work (Greenberg, Hanson, Verhoff, 2022). Questions related to the different versions Ecological Rewriting consists of – for example, the texts before and after editing – also need to be addressed: what parts of the publication will change, and which of the different versions will be preserved as a new copy? An additional challenge related to the preservation of Ecological Rewriting is posed by the backlinks we inserted between the book and externally linked content such as the source text PDF or the HedgeDoc pads. As the authors of WP7’s Scoping Report on Archiving and Preserving OA Monographs write: “A ‘link rot’ is a key factor in the disappearance of linked content from digitally preserved monographs …. Digital monographs continue to evolve and become more complex, and in many cases depend on externally linked content particularly for hosting supplementary materials such as video and audio files, therefore broken or outdated links will likely cause a growing problem” (Barnes et al., 2022, 9).
Digital preservation archives such as CLOCKSS and Portico, to which larger publishers in particular but also smaller ones such as Open Humanities Press (OHP) subscribe, “rely on economies of scale with replicable processes” (Verhoff, Hanson, Greenberg, 2022, 81). Complex and dynamic experimental monographs like Ecological Rewriting, by contrast, tend to eschew standardisation and replicability. As our colleagues from WP7 write, “technical methods for effectively archiving complex digital research publications and for creating an integrated collections of content in different formats have not yet been developed” (Barnes et al., 2022, 1). For these reasons, the archiving and preservation of experimental monographs often demands a tailored approach, including the “manual deposit and the manual metadata input of repository-archived monographs and the supplementary components”. Such an approach poses issues of cost, time, and responsibility, especially for smaller community presses with limited time, staffing, and financial resources.
Miranda Barnes, in her audio contributions to this blogpost, stresses that many potential pitfalls in archiving complex publications can be avoided if planning for archiving and preservation begins early in the publication process. Barnes points out that this planning may include thinking about the output file formats that can be generated from the platforms used, and about which of these outputs to keep and preserve during the publishing process and beyond (for example, as complementary materials forming part of a publication).
Following the recommendation to start considering archiving and preservation early in the process, when choosing the open source platforms to work with for Ecological Rewriting, we as co-editors of the book took care to settle on ones from which content can be exported into various formats. By doing so, we wanted to make sure that diverse versions of the book can be preserved and to allow for different options for the book’s preservation in a repository or with another archiving organisation.
PubPub, the platform on which Ecological Rewriting is published, allows various outputs: PDF, Markdown, Word, EPUB, HTML, OpenDocument, Plain Text, JATS XML, and LaTeX. Hypothes.is, the annotation software used for annotating the source text PDF, The Chernobyl Herbarium, as well as for peer-review, can output annotations as HTML, CSV, or JSON. HedgeDoc, the open source Markdown editor used by the authors to discuss the nature of their rewriting, can output Markdown, HTML, or raw HTML. Additionally, all the platforms offer an export option for descriptive and packaging metadata that describe how the different parts of the publication relate to one another. This is important to ensure that Ecological Rewriting can be exported into different preservation environments rather than being inherently tied to PubPub. As Greenberg, Hanson, and Verhoff (2022) point out, metadata should “be expressed as a separate file stored adjacent to or within the publication package. When possible, this should be expressed in a standard format such as e.g. ONIX, JATS, or Dublin Core” (4).
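To illustrate what such a separate metadata file could look like, here is a minimal sketch in Python that writes a Dublin Core record for one component of a publication package. It is not the export any of the platforms produce: the function name build_sidecar, the wrapping "metadata" element, and the example identifiers are assumptions made for illustration. Only the Dublin Core element and term names (title, identifier, format, relation, isPartOf) and their namespaces come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespaces of the Dublin Core element set and the DCMI terms.
DC_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_sidecar(title, identifier, fmt, part_of, related):
    """Return a Dublin Core record (as an XML string) for one component.

    `part_of` names the work the component belongs to; `related` lists
    identifiers of other components it links to. The wrapping <metadata>
    element is a hypothetical container, not part of any standard.
    """
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element("metadata")
    ET.SubElement(root, f"{{{DC_NS}}}title").text = title
    ET.SubElement(root, f"{{{DC_NS}}}identifier").text = identifier
    ET.SubElement(root, f"{{{DC_NS}}}format").text = fmt
    ET.SubElement(root, f"{{{DCTERMS_NS}}}isPartOf").text = part_of
    for target in related:
        ET.SubElement(root, f"{{{DC_NS}}}relation").text = target
    return ET.tostring(root, encoding="unicode")


# Hypothetical identifiers, for illustration only.
record = build_sidecar(
    title="HedgeDoc pad (Markdown export)",
    identifier="hedgedoc-pad-1",
    fmt="text/markdown",
    part_of="Ecological Rewriting",
    related=["pubpub-book", "annotated-source-pdf"],
)
print(record)
```

Storing one such record next to each exported file is one way of keeping the relationships between the parts readable outside of the platform they were created on.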
Throughout the process, Simon Bowie, the developer at COPIM and one of the co-editors of the Combinatorial Books series of which Ecological Rewriting is the first publication, periodically exported the different versions of the different “pubs” the PubPub book consists of – each including a text, images, as well as a selection of metadata – as Markdown and JATS XML files, saving them on a hard drive and in a private Git repository on GitHub for backup. Once all the pubs are published, they will be exported again and deposited – as a “frozen” version of the book – in an institutional repository. The documentation of the Combinatorial Books Pilot Case (of which this blogpost is a part) can similarly be exported as separate pubs and added to the “preservation package” if desired. Additionally, Simon exported the HedgeDoc pads as Markdown files (to be deposited on Humanities Commons) as well as the hypothes.is annotations on the source text PDF in HTML, CSV, and JSON using Jon Udell's Hypothesis export tool.
As Miranda Barnes points out in her audio contribution, the preservation of Ecological Rewriting requires further consideration of several issues: the tension between books existing in different versions and the traditional practice of assigning identifiers such as DOIs and ISBNs to a particular version of record; the management of the different versions of the book and the maintenance of the relationships between them; and the decision about what to keep as part of any final publication. Here, the involved parties – in the case of Ecological Rewriting especially the editors of the publication and the publisher – have to agree upon questions such as: which parts of the publication can still be changed, over what period, and which changes should be preserved? As Verhoff, Hanson, and Greenberg (2022) add: “Efficient versioning criteria combined with workflows that only update the files that have changed can avoid unnecessary redundancy and overuse of storage. If versioned content is eventually triggered by the preservation service, there will need to be a mutual understanding about which version(s) should be made available for access” (85).
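One possible way of implementing a workflow that only updates the files that have changed is to compare checksums between two exports. The following Python sketch shows the idea under stated assumptions: the function names checksum_manifest and changed_files are hypothetical, SHA-256 is our choice of digest, and this is not the workflow used for Ecological Rewriting or by any particular preservation service.

```python
import hashlib
from pathlib import Path


def checksum_manifest(export_dir):
    """Map each file path (relative to export_dir) to its SHA-256 digest."""
    root = Path(export_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(previous, current):
    """Return the files that are new or whose content differs from `previous`."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
```

A manifest taken at each export can be stored alongside the files; comparing the latest manifest with the previous one then yields the list of pubs that need to be deposited again, while unchanged files are left alone.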
Potential preservation challenges also emerge regarding the links that are embedded in the publication. These challenges are technological but also concern legal questions regarding the integration of external content into a publication. Public annotations on hypothes.is are published under a Creative Commons CC0 Public Domain Dedication that allows free copying, modification, distribution, and performance of users’ contributions, even for commercial purposes, all without asking permission. However, annotations made in a private group, such as the one set up for the peer-review of the publication, are the property of the individual user. To link specific passages of the texts on PubPub to the discussions between the peer-reviewer and the author of a text, and so make this process available to readers, we asked the peer-reviewers of Ecological Rewriting for their permission.
Miranda Barnes, in a blogpost about the manual preservation of complex monographs, suggests a potential further step in the preservation of Ecological Rewriting. As she points out, it is important to “consider how to arrange these complex composite works in the archive to ensure they are manageable, discoverable, and eventually accessible. In some cases, it may be practical to keep the entire work in a single Archival Information Package (AIP) and focus on extracting and indexing metadata to reveal the component resources.” In her blogpost, Barnes discusses the preservation of complex monographs with reference to case studies, including the first volume from Open Book Publishers (OBP) that was employed in these workflow experiments, Denis Diderot 'Rameau's Nephew' – 'Le Neveu de Rameau': A Multi-Media Bilingual Edition. The book has been deposited in a university Figshare repository. The specificities of this process might be relevant for the preservation of Ecological Rewriting too, for example, for the way in which it can be preserved on an institutional server. For Denis Diderot 'Rameau's Nephew', a Figshare “item” (the deposit of one or more files, possessing a single set of associated metadata, onto a single record) has been created for each of the file formats of the central text (PDF, XML, EPUB, and MOBI), for each of the 13 musical compositions forming part of the publication (each including a WAV and an MP3 file), and for each of 5 supporting PDF texts. These items were then grouped together in a “collection” or “project” (which also have their own metadata for the grouped items) on Figshare “in order to represent full metadata for the component parts of the book and its essential and supplementary material”.
Barnes highlights that the benefit of this manual process “is the ability to create very specific and thorough metadata for the files, as well as to assure clearly articulated connections between the files, both monograph text and supplementary content”. This is, as she observes, different to current practices, where often only the main text file, usually a PDF or XML file, is preserved.
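The arrangement of items and a collection described above can be modelled as a small data structure. The Python sketch below is only an illustration of that structure: it does not use the Figshare API, and the class names, file names, and metadata keys are hypothetical. Only the counts (four text formats, 13 compositions with a WAV and an MP3 file each, and 5 supporting PDFs) are taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One deposit record: one or more files sharing a single set of metadata."""
    title: str
    files: list
    metadata: dict = field(default_factory=dict)


@dataclass
class Collection:
    """A grouping of items that carries its own metadata."""
    title: str
    items: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def build_collection():
    """Assemble the items described above (all file names are hypothetical)."""
    collection = Collection(title="Rameau's Nephew: all components")
    # One item per file format of the central text.
    for ext in ("pdf", "xml", "epub", "mobi"):
        collection.items.append(
            Item(f"Main text ({ext.upper()})", [f"main-text.{ext}"], {"role": "central text"})
        )
    # One item per musical composition, each holding a WAV and an MP3 file.
    for n in range(1, 14):
        collection.items.append(
            Item(
                f"Composition {n}",
                [f"composition-{n:02d}.wav", f"composition-{n:02d}.mp3"],
                {"role": "musical composition"},
            )
        )
    # One item per supporting PDF text.
    for n in range(1, 6):
        collection.items.append(
            Item(f"Supporting text {n}", [f"supporting-{n}.pdf"], {"role": "supporting text"})
        )
    return collection


collection = build_collection()
print(len(collection.items), "items in the collection")
```

Seen this way, the manual effort becomes tangible: every one of these items needs its own metadata record in addition to the metadata of the collection as a whole.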
Another preservation method, one able to capture website content for web archiving, is emulation, as Barnes underlines in one of her audio contributions. As our colleagues from WP7 explain, emulation can be used for the preservation of certain types of software and platforms “that are used in experimental works [and] may be new or current when the work is created … [but are] likely [to] at some point … become obsolete or unsupported”. Emulation allows the functionality, look, and feel of a digital work to be retained for future use and access. It does so by “way of recreating the file within the environment of its original software. This is done by emulating applications, operating systems, or hardware platforms in order to prevent the loss of original functionality by delivering the same user experience as the original platform” (Barnes et al., 2022, 8-9). Whether emulation is the right way to preserve Ecological Rewriting is yet to be discussed, as PubPub ultimately is, first and foremost, a display platform that does not execute or run anything.
To preserve links to external URLs and to prevent link rot, many publishers archive the websites referenced or linked to within their monographs at the Internet Archive, via its Wayback Machine, at the point of publication (Barnes et al., 2022). This could also be a way to go for Ecological Rewriting, for example regarding the hyperlinks between texts published on PubPub and external websites, or the links between the PubPub texts and the source text PDF and the HedgeDoc pads.
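As a rough sketch of how such link archiving could be prepared, the following Python code collects the external links from a Markdown export and builds, for each of them, a Wayback Machine capture request and an availability query. The function names are hypothetical, the regular expression only covers simple inline Markdown links, and the two URL patterns reflect our understanding of the Internet Archive's public "Save Page Now" and availability endpoints, which should be checked against its current documentation before use. The code deliberately does not send any requests.

```python
import re
from urllib.parse import quote

# Matches simple inline Markdown links of the form [label](https://...).
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\((https?://[^\s)]+)\)")


def external_links(markdown_text):
    """Return the distinct http(s) link targets in order of first appearance."""
    seen = []
    for url in MARKDOWN_LINK.findall(markdown_text):
        if url not in seen:
            seen.append(url)
    return seen


def save_request_url(url):
    """URL that asks the Wayback Machine to capture `url` (assumed pattern)."""
    return "https://web.archive.org/save/" + url


def availability_query_url(url):
    """URL that asks whether a capture of `url` already exists (assumed pattern)."""
    return "https://archive.org/wayback/available?url=" + quote(url, safe="")


# Hypothetical excerpt of an exported pub.
sample = "See the [pad](https://example.org/pad) and the [book](https://example.org/book)."
for link in external_links(sample):
    print(save_request_url(link))
```

Running such a script at the point of publication, and keeping the resulting list with the preservation package, would document which external pages were captured and when.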
As our colleagues from WP7 point out, manual deposit of the kind we have been engaging in for the preservation of Ecological Rewriting takes time and requires technological resources and expertise, while “small and scholar-led presses have major deficits of resource: financial, staffing, and at times technological expertise”. However, as Janneke Adema writes regarding the preservation of Computational Books in the context of another WP6 Pilot Case:
From the viewpoint of a digital preservation archive, their ultimate responsibility when it comes to archiving scholarly publications is with the publisher, so this highlights dependencies between archives and publishers on what should be archived (so archives would start with the baseline and would potentially be doing more). Archiving in this context is what the publisher decides is archivable, which in most cases would just be the published outputs, but sometimes does include supplementary material. While archives do receive supplementary files for some publications, what is archived tends to be solely the files for the publication and the publication's metadata.
Yet authors often have much more data and resources connected to their research that are not part of the published output. In this context it is normally the responsibility of the author to archive these kinds of data sets, to take care of the ‘input files’. Research data management is already becoming an increasingly important role for scholars to take on as part of doing academic research, taking into consideration how to archive the resources their research draws on as well as the various stages the research goes through. But this allocation of responsibility becomes perhaps more complicated in the context of computational publishing. (Adema, 2023).
The publisher of Ecological Rewriting, OHP, has traditionally preserved the outputs of its publications by means of permanent archiving with Stanford University Libraries’ LOCKSS Program, the export of PDFs to the publisher’s server, as well as through hard copies made available in the British Library and other libraries through the UK's legal deposit system. While a printed edition of Ecological Rewriting is planned, neither the preservation of the online publication in its current state as a PDF nor its “translation” into a printed book is adequate to capture the book’s versioned, layered, and interlinked nature.
In the case of Ecological Rewriting, the labour connected to the archiving and preservation of the book as well as of the additional data created around it (the hypothes.is annotations or the HedgeDoc pads, for example) was primarily taken on by the co-editors of the book, specifically the developer Simon Bowie. This is partly a reaction to a shift in responsibilities connected to our wish to capture, as part of the book, the research process, much of which took place on various collaborative open writing and publication platforms (such as the HedgeDoc pads). These were already public (or “published”) and continually updated during the research process. By making various connections back to the previous versions of the book in development, we, so to speak, incorporated the research and writing process into the publication and made it a part of it. Doing so induced a shift of responsibilities as – arguably – the responsibility of the publisher is with the publication, in which we embedded preliminary research data (which usually are not part of a final publication and remain the responsibility of the author(s) to maintain).
Resources such as the Scoping Report of WP7, the various blogposts written by WP7, the Digital Preservation Coalition’s Digital Preservation Handbook, as well as the Guidelines for Preserving New Forms of Scholarship can support this shift of traditional roles. While being far from a more elaborated “system of sustainable workflows” (Barnes et al., 2022, 1), they still can offer possibilities for incorporating some considerations on preservation into existing workflows or adapting existing workflows to – early on in the publishing process – include time for discussing archiving and preservation questions and connected responsibilities, while being adaptive and flexible enough to accommodate possible pathways. Resources such as the Scoping Report of WP7 can offer insights into best practices and suggest possible workflows for archiving and preservation that can at least lessen some of the strains related to the manual preservation of complex monographs for those small presses, editors, and authors who aim to engage in experimental monograph publishing.
However, problems of cost, time, responsibility, and thus also labour, related to the publication of complex monographs persist despite the efforts to build a collective system of sustainable workflows for experimental monograph publishing. Consequently, as Barnes writes, “the reality is clear: the process of depositing digital monographs into willing repositories for archiving needs to be automated. This would mean less nuance in some ways. Functions like collections or projects, and elaborate individual metadata for all component parts, wouldn’t be possible in the same way via an automated process. But in reality, the time expense of manual deposit would simply be prohibitive to most small presses, meaning archiving in this way simply wouldn’t happen. Because many small and scholar-led presses do not yet have any active preservation policy in place, at least one of the options we present must be as simple, straightforward, and quick as possible”.
While such an approach implies a possible loss of the complexity of experimental monographs (for example, by focusing, for the sake of standardisation and scalability, on static outputs rather than on dynamic content), it also points towards a set of questions sitting at the centre of the archiving and preservation of scholarship: What is the purpose of archiving, what is it that we really want to archive, and for whom? As Miranda Barnes mentions in her audio contribution that forms part of this blogpost, and as Adema (2023) also writes: “In this context it is important to [already early on in the publishing process] consider what the baseline is, what the minimum is that publishers and authors can do and what is needed if we want to do more.”
Considerations like these are especially relevant in the context of experimental publishing, and the ephemeral, open-ended, relational, and situated nature of collaborative experimentation (and scholarship as experimentation) itself. Do we have to preserve “everything” and what would more or less deliberate omission entail and effect? What is it that we can preserve from the rewriting of The Chernobyl Herbarium at all? For example, can the “collective effort to develop a more creative register in our own writing and thinking habits as early career researchers” that Gabriela Méndez Cota describes in her introduction to Ecological Rewriting be preserved by storing the book and all its components or does it have to be reperformed, “lived,” in diverse contexts, by different communities, according to their own politics and possibilities – every time anew, beyond our own imagination of what rewriting could be and entail?