Details of a technical model for computational book publishing using Jupyter Notebook and Quarto
As part of COPIM’s pilot project looking at computational book publishing, we’ve worked on a new technical model and workflow for publishing computational books using a combination of Jupyter Notebook, Quarto, and GitHub Pages. This blogpost outlines the details of this model and further blogposts in this series will examine how we have applied it practically to produce some computational publications.
Our pilot project exploring computational publishing is part of COPIM’s Work Package 6 which researches experimental book publishing. In previous work, we’ve defined ‘computational publishing’ as the publishing of a book which combines human-readable text with computational functionality. Computational books contain advanced computational elements including but not limited to: dynamic media like audio or video; interactive data visualisations; executable code blocks; data repositories.
For this pilot project we have been working with Simon Worthington, a researcher in future publishing based at the Open Science Lab of the German National Library of Science and Technology (Technische Informationsbibliothek). Simon’s research alongside COPIM has considered how computational publishing software traditionally used by STEM-focused researchers could be adapted for use by arts and humanities researchers and artists.
As part of this research, Simon and COPIM have experimented with a model for publishing computational books quickly and easily using a combination of several software tools. These books can contain audio and video objects, 3D models (tested using .obj files and via embedded .stl viewers), datasets from linked open data repositories (tested using SPARQL queries against Wikibase instances), and media like images pulled in via linked open data queries. This allows for publications that link directly with linked open data repositories such as Wikidata and pull data directly from those sources.
Our model for building these computational publications involves linking together several pieces of open source software. This makes the model fairly modular: one piece of software can be replaced with another relatively easily, allowing for a potentially wide range of customisations in the technical publishing workflow.
Jupyter Notebook is a file format for creating notebook interface documents which combine human-readable Markdown text with computational code. A Jupyter Notebook is fundamentally a structured data file in JSON format which contains the various cell elements of the notebook’s content and any outputs that have been produced by running that notebook in an appropriate environment. Jupyter Notebook files can be edited and run in a range of computational environments including JupyterLab, Binder, and Visual Studio Code (or the truly open source version, VSCodium). Jupyter Notebook files can connect to kernels for a range of programming languages but are most frequently used with Python.
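To make the JSON structure concrete, here is a minimal sketch of a notebook built as a plain Python dictionary. The field names follow the nbformat version 4 layout as we understand it; the cell ids and cell contents are placeholders invented for illustration.

```python
import json

# A minimal notebook in the nbformat 4 layout: one Markdown cell and one
# code cell. Cell ids and cell contents are illustrative placeholders.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python",
        }
    },
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": ["# Paintings\n", "Some human-readable text."],
        },
        {
            "cell_type": "code",
            "id": "hello",
            "metadata": {},
            "execution_count": None,  # not yet run
            "outputs": [],            # filled in when the notebook is run
            "source": ["print('hello from the kernel')"],
        },
    ],
}


def summarise(nb):
    """Return (cell type, number of stored outputs) for each cell."""
    return [(cell["cell_type"], len(cell.get("outputs", []))) for cell in nb["cells"]]


# Serialise to the JSON text that would be stored in an .ipynb file.
text = json.dumps(notebook, indent=1)
```

Writing `text` to a file with an `.ipynb` extension should give a notebook that the environments listed above can open. Once the notebook has been run, the `outputs` list of the code cell holds the stored results.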
Quarto is an open source publishing system based on the Pandoc document conversion software. Where Pandoc is able to convert files from one format into another, Quarto uses this to focus on converting input files into output formats suitable for book publication such as HTML, PDF, OpenOffice, ePub, TEI, XML, and more. Given a repository of Markdown and/or Jupyter Notebook files, Quarto’s rendering process will output that content structured into the form of a book complete with chapters, index, references, and bibliographic metadata.
Git and GitHub are used for version control of files and for hosting in a GitHub repository. GitHub Pages also allows for display of the HTML files produced as a Quarto output.
These pieces of software can be installed on the host machine by installing Python, Quarto, and an environment for editing Jupyter Notebook files. Alternatively they can be deployed using Docker to set up containerised environments for Quarto and JupyterLab and there’s an example of this kind of Docker Compose configuration here: https://github.com/NFDI4Culture/cp4c/blob/main/docker-compose.yml
The basic technical workflow involves collecting content in Markdown files and/or running code in Jupyter Notebook files, rendering those files into a book publication using Quarto, and then pushing the files to GitHub, which automatically serves them as a website using GitHub Pages. Every step of this process can be scripted on the Unix command line, so the whole workflow can run unattended on a macOS machine or Linux server in order to refresh data without manual intervention. Below is an example of a Bash script for running the entire process.
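The same steps can also be sketched in Python, the language used in the notebooks themselves. This is a rough outline under stated assumptions rather than our production script: the function names and commit message are hypothetical, and it assumes that `jupyter`, `quarto`, and `git` are on the PATH and that the project directory is already a Git repository with a remote configured.

```python
import subprocess


def build_commands(notebooks, message="Refresh publication"):
    """Return the command lines for one full refresh of the publication."""
    commands = []
    for notebook in notebooks:
        # Re-run each notebook and store its outputs back into the file.
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", str(notebook)]
        )
    # Render the whole project as described in _quarto.yml.
    commands.append(["quarto", "render"])
    # Commit and push so that GitHub Pages picks up the new output.
    commands.append(["git", "add", "--all"])
    commands.append(["git", "commit", "-m", message])
    commands.append(["git", "push"])
    return commands


def run_pipeline(notebooks, project_dir=".", dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_commands(notebooks):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            # Note: git commit exits non-zero when nothing has changed.
            subprocess.run(command, cwd=project_dir, check=True)
```

Calling `run_pipeline(["paintings.ipynb", "video.ipynb"])` prints the commands without running them. Passing `dry_run=False` executes them, and the call could then be scheduled (for example with cron) to refresh the data regularly.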
The most minimal publication requires only a .qmd Markdown file containing some text content and a _quarto.yml file which outlines the structure of the publication and its formatting options. You can then add Jupyter Notebook files as separate chapters. In the example below, _quarto.yml specifies a book structure with a chapter for the homepage (index.qmd) and two chapters of computational content from Jupyter Notebook files (paintings.ipynb and video.ipynb), and specifies output in PDF and ePub formats as well as HTML.
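A configuration matching that description might look like the sketch below, written out from Python to stay in the same language as the notebooks. The book title is a placeholder and `write_config` is a hypothetical helper; the `output-dir: docs` setting corresponds to the ./docs directory used for GitHub Pages.

```python
from pathlib import Path

# Contents of a minimal _quarto.yml for a three-chapter book.
# The title is an illustrative placeholder.
QUARTO_YML = """\
project:
  type: book
  output-dir: docs

book:
  title: "Example computational book"
  chapters:
    - index.qmd
    - paintings.ipynb
    - video.ipynb

format:
  html: default
  pdf: default
  epub: default
"""


def write_config(project_dir):
    """Write _quarto.yml into project_dir and return its path."""
    path = Path(project_dir) / "_quarto.yml"
    path.write_text(QUARTO_YML, encoding="utf-8")
    return path
```

The order of entries under `chapters` sets the order of chapters in the rendered book, with index.qmd acting as the homepage.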
The Jupyter Notebook files can then run custom Python code to perform whatever computational functions are needed. We’ve tried a range of Python code to perform the following non-exhaustive list of functions:
displaying a dataset from a linked open data query using a SPARQL query on Wikidata
displaying results from API calls to Wikidata’s API, to Thoth’s GraphQL API, and to ORCID’s Public API
displaying images from a linked open data query using a SPARQL query on a Wikibase instance
displaying a video in HTML through iframe embedding
displaying a .stl 3D model file in HTML through iframe embedding
displaying a .obj 3D model file in HTML using obj2html to convert it
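As an illustration of the iframe-based items in this list, a notebook cell could build the embedding HTML with a small helper like the one below. This is a sketch under stated assumptions: `iframe_html` is a hypothetical function name, the default dimensions are arbitrary, and any URL passed to it would need to point to a page that permits embedding.

```python
from html import escape


def iframe_html(src, width=640, height=360, title="Embedded media"):
    """Return an HTML iframe element pointing at src."""
    return (
        f'<iframe src="{escape(src, quote=True)}" '
        f'title="{escape(title, quote=True)}" '
        f'width="{int(width)}" height="{int(height)}" '
        "allowfullscreen></iframe>"
    )
```

Inside a notebook, passing the returned string to `IPython.display.HTML` as the last expression in a cell should store the iframe as that cell's output, which then carries through to the rendered HTML. IPython also provides a ready-made `IPython.display.IFrame` class for the same purpose.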
The example below shows a Notebook file which displays some Markdown and then runs code to execute a SPARQL query against Wikidata and formats the results into a list with images.
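The code cell of such a notebook could be sketched roughly as follows. This is not the exact code from our notebooks: the query (five paintings that have an image), the User-Agent string, and the function names are assumptions chosen for illustration, and only the Python standard library is used.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: a few items that are paintings and have an image.
QUERY = """
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P31 wd:Q3305213 ;   # instance of: painting
        wdt:P18 ?image .        # image
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def run_query(query, endpoint=ENDPOINT):
    """Send a SPARQL query and return the parsed JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            # Placeholder User-Agent; replace with one identifying your project.
            "User-Agent": "computational-book-example/0.1",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def to_markdown(results):
    """Format SPARQL JSON results as a Markdown list with images."""
    lines = []
    for row in results["results"]["bindings"]:
        label = row.get("itemLabel", {}).get("value", "untitled")
        item = row["item"]["value"]
        image = row.get("image", {}).get("value")
        line = f"- [{label}]({item})"
        if image:
            line += f" ![{label}]({image})"
        lines.append(line)
    return "\n".join(lines)
```

In a notebook, the final line of the cell could be `IPython.display.Markdown(to_markdown(run_query(QUERY)))`, which stores the formatted list as the cell's output. Pointing `endpoint` at a different Wikibase instance would also require changing the property and item identifiers in the query, since these are specific to Wikidata.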
We run this Notebook in an appropriate environment to ensure that the output cells are added to the file.
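Because the outputs are stored in the notebook's JSON, it is possible to check that a notebook has actually been run before rendering. The helper below is a hypothetical sketch of such a check, not part of our workflow as described: it lists code cells that contain source but have no stored outputs.

```python
import json


def code_cells_without_output(notebook):
    """Return indices of non-empty code cells that have no stored outputs."""
    missing = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # "source" may be a single string or a list of strings.
        has_source = "".join(cell.get("source", [])).strip() != ""
        if has_source and not cell.get("outputs"):
            missing.append(index)
    return missing


def check_file(path):
    """Load an .ipynb file and report its code cells without outputs."""
    with open(path, encoding="utf-8") as handle:
        return code_cells_without_output(json.load(handle))
```

A cell that legitimately produces no output, such as one containing only imports or assignments, is also reported, so the result is a prompt for a manual look rather than a strict test.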
Once all Notebook files have been run and added to the _quarto.yml document structure, we run Quarto’s rendering process (quarto render on the command line) so that Quarto outputs HTML and whatever other file formats have been specified.
Finally, we push the resulting files to GitHub. By specifying in the _quarto.yml file that all outputs are to be stored in a ./docs directory, we can point GitHub Pages to that directory to have it display the HTML as a website with links to download the PDF and ePub versions. Here is an example of a Quarto publication rendered in GitHub Pages that we put together for COPIM’s Experimental Books: Re-imagining Scholarly Publishing conference: https://simonxix.github.io/Experimental_Books_workshop/
By ensuring that this model is modular, we are able to substitute different pieces of software for different results. This is important for interoperability, for ensuring that the model can be extended with additional features, and for ensuring the longevity of the model should any piece of software cease to function in the future. In our context, we may want to make some software changes for ethical reasons. For example, we are concerned that Posit (formerly RStudio), the company that develops Quarto, announced this year that they are partnering with Palantir Technologies Inc. Though Posit have clarified that this ‘partnership’ is a narrow technical integration, we are nonetheless concerned that they would choose to work with Peter Thiel’s big data company known for its work with state surveillance agencies, its contract with the USA’s institutionally racist ICE agency, and its role in the Cambridge Analytica data scandal in the UK. As we develop this model, we’ll be looking to swap out Quarto for an equivalent piece of publication rendering software such as Jupyter Book.
Similarly there are good ethical reasons to reconsider our use of GitHub but fortunately we can host the HTML output of this workflow on any hosting platform including potentially our own dedicated server.
Our next major step is producing a prototype publication using this model. We have already designed a prototype book catalogue which will be discussed in the next blogpost and are currently working on an art catalogue using this model which will draw from linked open data sources. A future blogpost will also explore how this technical model can be adapted to a publisher’s workflow.