Conversations on Archiving and Preserving Computational Books

Published on Apr 28, 2023
Together with our colleagues from COPIM WP7, we sat down to discuss issues of archiving and preservation as they concern computational books, and more specifically the computational prototypes we have created as part of our Computational Books Pilot Project and the tentative publishing workflow we have developed with the help of Open Book Publishers. Below you will find a write-up of some of the things we discussed, together with some further reflections.

Archiving Jupyter Notebooks

For the prototypes we have been working with at COPIM, we have used Jupyter Notebook files to retrieve content from linked open data catalogues. We have looked at Wikibase sources, including Wikidata itself as well as standalone Wikibase instances such as the one at TIB. The notebooks use Python code to pull in open data from these sources (e.g. paintings from the Baroque period) and format it for the required outputs. In our workflow we then use Quarto to render the Jupyter Notebooks as a static publication, outputting them as ePub, HTML, PDF and a range of other formats. That Quarto publication can then be pushed to GitHub Pages, where it is displayed as a static website with options to download the PDF, ePub and other versions. All of this is saved in a Git repository on GitHub, so all the files are open for anyone to see, download, fork, etc.

In relation to the conventional publishing workflow, we originally started off with lots of question marks around the stability and ephemeral nature of computational publications. As we progressed with our prototyping work, however, we found that a lot is actually captured in terms of data and versions of content, perhaps even better than in a normal print or digital workflow. In the computational workflow we have been working with, we have a defined source (to retrieve input from) as well as clear outputs, and both the source and the outputs are captured. When a Jupyter Notebook retrieves content from these remote sources, we know the address it comes from, and the content is copied from the remote source to a local folder that is kept together with the publication, so in this sense a lot is being captured.

What is helpful is that the outputs generated via Quarto are all saved in a single directory, so they are neatly packaged up. Add to this that the folder is stored using the Git version control system, with releases and commits, each recorded with the editor’s comments, a time stamp and a cryptographic ID such as a SHA-1 hash, and it means you can step back through major and minor revisions with great accuracy. Preservation mechanisms can then be pointed to that one folder to preserve whatever is there, or we could send that folder to a platform as an archiving folder (e.g. to Portico, or to repositories). However, this of course only preserves static versions; executables are a more difficult question.
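To illustrate how a single output folder can be checked for integrity before it is handed to an archive, here is a minimal sketch in Python. It is not part of the workflow described above: the function names are hypothetical, and it uses SHA-256 from the standard library rather than Git's own object IDs.

```python
import hashlib
from pathlib import Path


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to the SHA-256 digest of its bytes."""
    manifest = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(folder).as_posix()] = digest
    return manifest


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or modified between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Calling `build_manifest` on the rendered output folder at each release, and comparing two manifests with `changed_files`, would give an archive a simple way to confirm that what it received matches what the publisher sent.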

Storage, Permissions, Responsibilities

What isn’t necessarily downloaded into that folder are larger files. Quarto, for example, downloads copies of images and hosts its own version, so the publication does not depend on the original image link staying alive: the digital object, the image, is hosted on the server together with the publication. Videos, however, are added as embedded links (in principle you can also store these in the directory, but this is not the default), which raises the issue of who is responsible for preserving this kind of external or remote content (in the case of link rot, for example). Open Book Publishers have partly resolved this issue by reorganising the links in their publications as Handles. This means that they can redirect a link where needed, as they have control via the Handle rather than depending on the URL directly. But this still doesn’t answer the question of how to make sure that the third-party hosted data objects that are retrieved are themselves archived. And if content is hosted locally rather than by a third party, there are issues of licensing to consider, and of whether one has the corresponding permission to publish (in different versions), host, and preserve the data objects locally (Open Book Publishers, for example, mostly arranges copyright clearance to be able to host videos themselves).

As we have mostly been working with content from Wikidata for our prototypes, we can look up the copyright status of content and select permissively licensed content only, which is something that can be automated or scripted. In this sense, Wikidata is a very good source to work with as its licenses are open, but other GLAM providers might be less forthcoming with respect to permissions and rights. Open Book Publishers often uses Fair Use exemptions to work around copyright limitations (arguing, for example, that an image is necessary as part of the documentation). With computational books we might be able to do something similar, and say that this is a work in its entirety rather than its individual segments, and that the copyright statement concerns the entirety of the work or what is in the repository.
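A scripted license check of the kind mentioned above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the records are plain dictionaries that have already been retrieved, the `license` field name is hypothetical and does not reflect Wikidata's data model, and the allow-list would need to be set by the publisher.

```python
# Hypothetical allow-list of license labels; adjust to the publisher's own policy.
PERMISSIVE_LICENSES = {"CC0", "CC BY 4.0", "CC BY-SA 4.0", "Public domain"}


def split_by_license(records, allowed=PERMISSIVE_LICENSES):
    """Separate records into those that may be hosted and those needing manual review."""
    usable, needs_review = [], []
    for record in records:
        license_label = record.get("license")  # a missing license means manual review
        if license_label in allowed:
            usable.append(record)
        else:
            needs_review.append(record)
    return usable, needs_review
```

Treating a missing or unknown license as "needs review" rather than as usable keeps the automated step conservative, leaving the judgement calls to a person.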

This of course comes back to the question of whose task it is to take control of the preservation and archiving of the digital object (especially in computational environments). At Open Book Publishers the publisher controls a lot of this, but not all publishers will (be able to) do the same. What is the responsibility of authors and GLAM organisations in these kinds of contexts? And are there any minimal standards we can work towards, such as ensuring that at least the outputs, or the ‘frozen versions’ of computational publications, are preserved?

Dynamic Content, Versioning, and Release Hierarchies

The question of preservation and content retrieval becomes even more complicated when dealing with data that is regularly updated or includes ‘live data’ or dynamic website content, and in cases in which different versions of a publication are regularly released, which is often the case in computational book publishing.1 With respect to versioning and version control, the printed book workflow accommodates this only clumsily, as it is built around ISBNs, with each version needing a different ISBN.

In a computational publishing workflow, sending out a zip folder containing all the bundled content that makes up the publication, including some way to navigate that content, would be easy to do. Developers can specify in Git repositories how to run them (or, in the case of our computational publishing workflow, include Dockerfiles which can reproduce the software environment). What wouldn’t be as easy is to archive links to, or content from, sites not contained in that folder (as discussed previously), or content that is not static but live and constantly changing (e.g. economic data and astrophysics, with their fast-changing data sets). Here it is necessary to take data snapshots at a certain point in time, so as to have a version of record of the data point that has been referenced. This can become complicated, as you might also need to preserve snapshots of the whole database you are retrieving your data from. When you run a Jupyter Notebook you get a snapshot from your data source, so you are pulling the data and preserving it at that static point. We can also automate (script) some of this, as you can see from the Thoth/ScholarLed catalogue prototype and the script we wrote for it. In this script we wanted to make it really clear that we are pulling data fixed at a certain point in time, so the script also produces a small Markdown output that states ‘this is a snapshot of data as of this exact date and time’.
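The idea of stamping a data pull with its retrieval time can be sketched in a few lines of Python. This is not the script written for the Thoth/ScholarLed prototype; the function name, the wording of the note and the example URL are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def snapshot_note(source_url: str, retrieved_at: Optional[datetime] = None) -> str:
    """Return a short Markdown note recording where and when data was pulled."""
    # Default to the current time; pass a timezone-aware datetime to fix it explicitly.
    moment = retrieved_at or datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return (
        "## Data snapshot\n\n"
        f"This is a snapshot of data retrieved from <{source_url}> as of {stamp}.\n"
    )
```

A notebook could write the returned string to a Markdown file next to the retrieved data, for instance with `Path("snapshot.md").write_text(snapshot_note("https://example.org/catalogue"))`, so that the note is rendered and archived together with the publication.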

Preserving publications that rely on dynamic data, or that are regularly updated or versioned, also creates storage issues for repositories, as according to best practice each recorded version should include its previous versions. In practice, repositories don't tend to apply this rigorously because of the implications for storage usage. Archiving platforms and publishers could, however, agree on which major versions will be preserved. Archiving platforms only want the major point releases, and as few as practicable, as the old versions are archived too. This will be an issue for institutional repositories with little storage capacity, so for smaller repositories there needs to be some kind of arrangement that a large computational publication is not updated as often as every month: yearly releases are probably what repositories are more comfortable with. This matters not only for repository storage but also for reader uptake (e.g. mobile use of publications) and for the environmental impact of all this storage.
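If publisher and archive agreed that only major releases are deposited, the selection could in principle be scripted from the repository's release tags. The sketch below assumes a hypothetical tagging scheme of the form `v1.0.0` or `v2.0`; what actually counts as a major release remains a policy decision, not something the code can determine.

```python
import re

# Assumed tag format: optional "v", then MAJOR.MINOR with an optional .PATCH
TAG_PATTERN = re.compile(r"^v?(\d+)\.(\d+)(?:\.(\d+))?$")


def major_releases(tags):
    """Keep only tags of the form X.0 or X.0.0, ordered by major version number."""
    majors = []
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if not match:
            continue  # skip tags that do not look like version numbers
        major, minor, patch = (int(part or 0) for part in match.groups())
        if minor == 0 and patch == 0:
            majors.append((major, tag))
    return [tag for _, tag in sorted(majors)]
```

Given the tags `v1.0.0`, `v1.1.0`, `v1.0.1` and `v2.0`, only `v1.0.0` and `v2.0` would be selected for deposit.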

Method Preservation, Emulation, and Minimum Approaches

This goes back to the question of when a given version counts as a major release (and will therefore also be deemed worthy of being preserved). We need fixed versions for different reasons, among them to have something stable that can be referenced.

With very large outputs, another method of archiving could be to focus not on archiving the output files of a publication (e.g. PDF, ePub, HTML) but on applying the "input+method" approach: archiving both the input files and the method to execute and render them, so that anyone can regenerate the output files.2 Hence the focus of the archiving here is not on the output but on the input.

Emulation is also a possible method to capture more dynamic website content for web archiving, where it lowers the risk of the content becoming obsolete to a current platform or environment.3 Emulation, “as a process, aims to reproduce or ‘emulate’ the original look and feel of the content as it originally appeared, by way of recreating the file within the environment of its original software. This is done by emulating applications, operating systems, or hardware platforms in order to prevent the loss of original functionality by delivering the same user experience as the original platform” (Barnes et al., 2022, 8-9).

Archiving the method might also be a way to start thinking about how to archive some of the interactivity that Computational Books can offer. It would be possible to archive some of this with small datasets, for example datasets of numbers that can be graphed, as here you can keep the whole dataset in the Jupyter Notebook. However, for larger datasets, e.g. when running queries against linked open data sources, you are essentially getting a snapshot of the interaction at the time you run the notebook.

Yet this comes back again to the question of whether the interactivity that Jupyter Notebooks offer necessarily needs to be archived. A more constrained approach would be to focus on the static outputs, without getting too hung up on dynamic adaptation and dynamic content, on the argument that what we want to archive in the end is the scholarly argument and the data that support it. Yet the scholarly argument itself is arguably becoming more dynamic too, moving away from the single authorial voice towards more collaborative and processual forms of research and publishing, which is something that computational publishing platforms, for example, support. Seeing the scholarly argument as that which needs to be stabilised, when it becomes harder to fix the book as object itself, is nevertheless an interesting approach. It exemplifies the need within scholarly research and publishing for forms of stabilisation: what we often see in experimental publishing is that when we start to break down one form of stabilisation, we have to fall back on another form of fixation (Adema, 2021).

This brings us back again to what the purpose of archiving is, and what it actually is that the author and the publisher would like to archive, which might differ per publication and per stakeholder. What parts of the research and publishing process, and what changes and edits in this, are important to the author and what to the publisher (also look at the context of literary authoring, for example)? In this context it is important to consider what the baseline is: what the minimum is that publishers and authors can do, and what is needed if we want to do more. For a publisher such as Open Book Publishers, for example, the minimum that they are trying to archive is the output files. For them a publication is defined by a set of output files, and they then attach an ISBN to this particular set of output files. Open Book Publishers is already a bit more structured in doing this than some publishers, which means that they try to archive as many of the output formats as they can. That would be a minimum approach for them. Anything additional would, for example, consider all the links within the work that might be important to preserve: one would want to be able to archive the links, to create an archive collection of all the links within the publication, which is not something that Open Book Publishers are doing at this stage. Additional archiving might include this, or archiving any other embedded content and resources associated with the work. What also wouldn’t necessarily be part of a minimum approach is archiving the previously discussed method together with the input files. But if authors would like to archive this, publishers could in theory provide them with guidelines on what they would need to deliver to the publisher to archive these protocols and methods.

From the viewpoint of a digital preservation archive, their ultimate responsibility when it comes to archiving scholarly publications is with the publisher, so this highlights dependencies between archives and publishers on what should be archived (so archives would start with the baseline and would potentially be doing more). Archiving in this context is what the publisher decides is archivable, which in most cases would just be the published outputs, but sometimes does include supplementary material. While archives do receive supplementary files for some publications, what is archived tends to be solely the files for the publication and the publication's metadata.

Yet authors often have much more data and resources connected to their research that are not part of the published output. In this context it is normally the responsibility of the author to archive these kinds of data sets, to take care of the ‘input files’. Research data management is already becoming an increasingly important role for scholars to take on as part of doing academic research, taking into consideration how to archive the resources their research draws on as well as the various stages the research goes through. But this allocation of responsibility becomes perhaps more complicated in the context of computational publishing. For example, the input files (the Jupyter Notebooks) could be changed completely while the output file (e.g. ePub) remains the same. In a computational book, versioning is the result of interacting with it rather than of creating a final piece of writing.

Long Term Preservation

We also need to think about this in the context of long term archiving: what aspects of a computational publication do we need to archive long term and what is less important? In the context of computational publishing, archiving certain fixed versions and a description of how to recreate these versions from e.g. Jupyter Notebooks might be more relevant. Repositories tend to keep data for 7-10 years since last access, yet very few repositories have deletion policies, so effectively once data is in a repository, it will be looked after. But discussions about long term preservation are not only about the length of preservation but also about the type of preservation. There needs to be a discussion about whether what is needed is, for example, a usable output in the future or simply a non-corrupted output. Package management becomes especially important in this context. A Jupyter Notebook, for example, is basically a JSON file that contains instructions for how it can be run in a particular environment, by a particular kernel (in our case the Python 3 kernel); when it is run, it creates a version of itself with the output saved in that JSON file. In 20 years’ time, as long as you can read a JSON file you would still have the output, but you might not be able to run the code that was used to create it. The Jupyter Notebook creates a static version of itself, but the ability to run it, interact with it and execute it is what will die.
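The point that saved outputs stay readable even when the code can no longer be run can be shown with a short sketch that uses nothing but a JSON parser. It assumes the common version 4 notebook layout (a top-level `cells` list in which code cells carry an `outputs` list); the function name is hypothetical and only plain-text outputs are handled.

```python
import json


def stored_outputs(notebook_json: str) -> list[str]:
    """Collect the text outputs saved in a notebook without executing any code."""
    notebook = json.loads(notebook_json)
    texts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                text = output.get("text", "")
            else:
                # execute_result and display_data keep text under data["text/plain"]
                text = output.get("data", {}).get("text/plain", "")
            # multi-line strings may be stored as a list of lines
            text = "".join(text) if isinstance(text, list) else text
            if text:
                texts.append(text)
    return texts
```

No kernel, package or environment is needed here, which is exactly why the stored output is likely to outlive the ability to re-execute the notebook.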

All in all, what becomes clear is that especially for computational books, conversations need to be had between publishers, authors, and repositories—often already when starting to design a computational publishing project—about what is preserved (what data, interactions, versions), in what way and how often, and who takes responsibility for what is archived and preserved.

Header image: Photo by Pawel Czerwinski on Unsplash.
