
Implementing international metadata standards and requirements in Thoth: an update

Published on May 21, 2024
This Pub is a Comment on
Open Metadata and Libraries

From its inception, Thoth has been conceived as an open dissemination system, with the dedicated aim to enable transparent access to information on metadata formats and the larger context of the scholarly book production lifecycle. Alongside building the actual Thoth platform, our team has focused on making the underlying research into open standards and infrastructures accessible to a wider range of users, including small, scholar-led, and university publishers that would otherwise have difficulty finding this information.

Knowledge-sharing: the Thoth Wiki

Thoth’s work on implementing metadata standards and platform-specific flavours thereof is informed by seminal research into the dissemination landscape and book supply chain (Clarke & Ricci, 2021; Stone et al., 2021) and the wide-ranging experience in open infrastructure building within our team.

The collected body of knowledge has since been distilled into an open Wiki, which is used by the team to document and continuously log updates on progress made.

Fig. 1: Screenshot of the Thoth Wiki landing page. Source: https://github.com/thoth-pub/thoth/wiki

With the Wiki, Thoth provides up-to-date information on the various stakeholders active in the Open Access Book Supply Chain, including:

  • Content Funders (e.g. funding agencies and libraries);

  • Content Creators (e.g. publishers and authors);

  • Content Platforms, including a wide variety of Ebook Aggregators, OA Platforms and Repositories, Consumer Ebook Platforms, Shadow Libraries, and Ebook Distributors;

  • Catalogues, Indices and Knowledge Bases;

  • Topic-specific bibliographies, Citation Indices; and

  • Stakeholders pertaining to the archiving and preservation of OA books, subsumed under the Thoth Archiving Network section.

Metadata standards & requirements

As initially discussed in an earlier blog post (van Gerven Oei, 2021), and based on the recommendations distilled from Thoth’s initial research, the team has focused on prioritising implementing three commonly used metadata formats — MARC, ONIX, and KBART — while also implementing a host of additional useful export options. Below, we provide an update on progress made so far.

Progress on MARC record creation

Over the past 24 months, Thoth has put extensive work into developing automated workflows that follow established practice and documentation (e.g. Library of Congress guidance) around the MARC standard.

Once the initial structure had been implemented, the team engaged with librarian specialists from the UK and the US to further improve the data model. We are indebted to metadata specialists Emma Booth (University of Manchester, UK), colleagues at Jisc (UK), and Jeffrey Edmunds (Pennsylvania State University, USA) for their extensive input. Those conversations made clear that MARC is a rather idiosyncratic standard, designed with and for specialist cataloguers: some fields remain open to context-specific interpretation, while other aspects of implementation are quite rigid. It therefore remains virtually impossible to fully automate this process, as the quality of the output data will in the end always rely on the input provided by publishers.1 That said, the Thoth workflow is capable of producing records of above-average quality, with the caveat that the last 5-10% needed to achieve a perfect result will always depend on an individual publisher’s input.

Conversations with specialists yielded a lot for the team to think about, particularly on the subject of what often tends to be called ‘authoritative’ datasets, i.e. controlled vocabularies. As Jeff Edmunds has noted, Thoth’s MARC records

could benefit from certain enhancements, such as the addition of subject terms from authoritative lists (the Library of Congress Subject Headings, the Getty Art and Architecture Thesaurus, Homosaurus, etc.) and Library of Congress or Dewey classification numbers. (Edmunds, 2023)

What became clear in later conversations with Jeff Edmunds and with Emma Booth is that librarians seem to prefer such ‘authoritative’ datasets as they afford them the benefit of easy classification. However, as has been raised by the Thoth team, this also opens other questions. These concern, for example, the conformity of the legacy subject lists to community-managed, open standards, and the alleged source of authority that these lists are implicitly imbued with, along with the (unintended) consequences of such authority with regard to the exclusion of groups that do not conform to the defined standard. From an international and comparative perspective, there remain specific issues, for example with regard to the universal applicability of regionally relevant vocabularies, such as the US-centric Library of Congress Subject Headings, to publishers catering to other regional contexts such as Latin America or Europe.

With Thoth, we have so far decided to support LCC Subject Headings as an optional data field, while prioritising controlled subject vocabularies that are released under fully open licenses. Hence, Thoth has now fully implemented the Thema subject category classification as a future-proof alternative to BISAC and BIC (the latter of which was deprecated in early 2024).

The team are keen to implement the Metadata Best Practices for Trans and Gender Diverse Resources (cf. The Trans Metadata Collective et al., 2023), including the Homosaurus vocabulary, and, based on recommendations from library experts, are also considering an implementation of the Virtual International Authority File (VIAF) standard in one of our future releases.2

The Thoth team is proud to confirm that the system can now export above-average quality metadata records in MARC21 and MARCXML, and we will continue our collaborations with libraries to further improve ways to facilitate direct data provision to library systems, via these exports and through Thoth’s open API.

ONIX outputs

As described in an earlier blog post,

ONIX is one of the most widely adopted metadata formats currently in use, managed by non-profit EDItEUR. Rather than a specific format, it is an open XML schema, which has led different platforms to implement different flavors of ONIX, with distinct ideas about what a minimally viable record is. (van Gerven Oei, 2021)

Over the past two years, Thoth has further improved the variety of ONIX outputs available, implementing ONIX 3.0 specifications from Project MUSE, OAPEN, JSTOR, Google Books, and OverDrive.

For some of the platform-specific implementations, the process of seeking a given platform’s confirmation of which data would be required in which specific format has itself taken quite some time. Further to that, and as an exercise in showcasing the ONIX standard’s full range of data fields, the team implemented an ONIX 3.0 Thoth export variant that includes the maximum amount of publisher-provided data. Compared to the other, platform-specific versions that tend to limit the amount of data, the ONIX Thoth export flavour, together with Thoth’s JSON export, can be understood as the most complete of all Thoth exports available.

With ONIX 2.1, Thoth has implemented specifications from EBSCO Host and ProQuest Ebrary.

KBART output

Continuing its conversations with the NISO KBART standards committee, Thoth is steadily improving its KBART output and now features a KBART export that implements OCLC’s specifications. As well as being compliant with OCLC, this output can also be used to submit data to ExLibris, ProQuest, and EBSCO’s Knowledge Bases, among many others. Publishers seeking KBART endorsement for their metadata records will also find that the Thoth format now makes this easy to obtain.

As of April 2024, Thoth has also applied to be listed in the KBART registry.

Further outputs: Crossref XML, BibTeX, CSV, JSON

Over the last 18 months, the team has also implemented a Crossref-compliant XML export that publishers can use to register their DOIs with Crossref individually. Alternatively, publishers can also let the automated workflows available via Thoth Plus, which also includes Crossref Sponsorship through Thoth, take care of that DOI registration for them. Quite recently, the team has also implemented the additional metadata required to participate in Crossref’s Crossmark service, which provides information on the “current status of an item of content, including any corrections, retractions, or updates to that record.” (Crossref, 2020) The Crossmark service is also available to publishers participating in Thoth’s Crossref sponsorship. For more details on the added benefits of Thoth Plus and the Thoth Crossref Sponsorship, please refer to our accompanying blog post.

The team felt it was important to also include a BibTeX export option, so as to enable researchers and cataloguers to directly reuse the bibliographic data in research management systems such as Zotero.
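To illustrate what such an export involves, here is a minimal sketch of rendering a book record as a BibTeX `@book` entry. The `BookRecord` class, its field names, and the sample values are hypothetical placeholders for illustration; they are not Thoth's actual schema or export code.

```python
from dataclasses import dataclass


@dataclass
class BookRecord:
    """Hypothetical, simplified book record (not Thoth's actual schema)."""
    key: str
    title: str
    authors: list[str]
    publisher: str
    year: int
    doi: str


def to_bibtex(record: BookRecord) -> str:
    """Render the record as a BibTeX @book entry."""
    fields = {
        "title": record.title,
        # BibTeX separates multiple authors with the word "and".
        "author": " and ".join(record.authors),
        "publisher": record.publisher,
        "year": str(record.year),
        "doi": record.doi,
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@book{{{record.key},\n{body}\n}}"


# Placeholder data, invented for this example.
example = BookRecord(
    key="doe_2024",
    title="An Example Monograph",
    authors=["Jane Doe", "John Roe"],
    publisher="Example Press",
    year=2024,
    doi="10.1234/example.0001",
)
print(to_bibtex(example))
```

An entry in this shape can be imported into reference managers that accept BibTeX input, which is the reuse scenario described above.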

Further exports in JSON and basic CSV are also available.

Expansion of Thoth’s data model towards multilingualism, non-OA fields, PRISM

A number of further developments have emerged out of Thoth’s engagement with international publishers3. The team is grateful for the input provided by publishers such as L’Harmattan Open Access Hungary, and from the exchanges with Latin American publishers that Thoth has begun to engage with (cf. Ramalho et al., 2024).

The team’s conversations with those publishers uncovered a need to expand Thoth’s data model to enable multilingualism through the provision of data fields for title and abstracts in multiple languages, as well as a multilingual provision of the descriptive data on contributors’ institutional affiliations. An implementation of these data model extensions will be forthcoming in a future release of Thoth.

Further feedback received included needs specific to hybrid publishers that have closed-access books alongside OA titles in their catalogue. Currently, Thoth’s data model has been tailored around OA books. For those hybrid publishers to be able to create and manage metadata for both types of books, Thoth would need to extend its data model to include certain fields relevant to closed-access books, such as the ability to record territorial restrictions and publication release dates (embargo periods).

We understand that the provision of such fields might potentially be perceived as problematic by some. Following internal discussions, the team agreed to prioritise the opening of those books’ underlying metadata (as the metadata records themselves can still always be made available via CC0) over an enforced categorical separation between open and closed books, as this would have meant additional barriers for hybrid publishers that are seeking to transition their existing closed model to one providing more open access titles over time.

The Thoth team thus decided to permit metadata management for closed-access books within Thoth, and to work to include relevant metadata fields in the data model in a forthcoming release. Thoth sees the overarching benefit in having the resulting metadata outputs openly available under a CC0 dedication. That said, Thoth will ensure that its added-value dissemination services (referred to as Thoth Plus) are not offered for closed, non-OA books.

Peer review metadata: PRISM integration

As part of Thoth’s close collaboration with OAPEN and the Directory of Open Access Books, the team is now also seeking to include dedicated metadata fields to directly feed into DOAB’s Peer Review Information Service for Monographs (PRISM). This is part of a larger development package to facilitate direct metadata exchange between OAPEN, the Directory of Open Access Books (DOAB), and Thoth.

Locations

Thoth also has the capability to record multiple locations, so as to enable publishers to keep track of the manifold outlets they are submitting their OA book data to.

With the increasing range of platforms that Thoth is able to automatically submit data and content to via the Thoth Plus workflows, more and more platforms are also being added to Thoth metadata. To document the location of published books’ PDF, epub, and html versions, the system currently differentiates between the Publisher’s own website and files and data about the book being hosted on Project MUSE, OAPEN, DOAB, JSTOR, EBSCO Host, EBSCO KB, OCLC KB, ProQuest KB, ProQuest ExLibris, Jisc KB, Google Books, Internet Archive, ScienceOpen, SciELO Books, and Zenodo. An additional free-text field can also hold information on other platforms where a version of a given book can be found.
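As a rough sketch of how such location records could be modelled, the example below pairs a fixed list of platforms with a free-text fallback. The class names, field names, and URLs are assumptions made for illustration, and the platform list is only a subset of those named above; this is not Thoth's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Platform(Enum):
    # Subset of the platforms named in the text; OTHER covers the free-text case.
    PUBLISHER_WEBSITE = "Publisher Website"
    PROJECT_MUSE = "Project MUSE"
    OAPEN = "OAPEN"
    DOAB = "DOAB"
    JSTOR = "JSTOR"
    INTERNET_ARCHIVE = "Internet Archive"
    ZENODO = "Zenodo"
    OTHER = "Other"


@dataclass
class Location:
    """Hypothetical record of where one version of a book is hosted."""
    platform: Platform
    landing_page: Optional[str] = None
    full_text_url: Optional[str] = None
    other_platform_name: Optional[str] = None  # free text, used with Platform.OTHER


def locations_by_platform(locations):
    """Group a book's locations by platform, e.g. to look up usage data per outlet."""
    grouped = {}
    for location in locations:
        grouped.setdefault(location.platform, []).append(location)
    return grouped


# Placeholder URLs, invented for this example.
book_locations = [
    Location(Platform.PUBLISHER_WEBSITE, landing_page="https://example.org/book"),
    Location(Platform.OAPEN, full_text_url="https://example.org/oapen/book.pdf"),
    Location(
        Platform.OTHER,
        landing_page="https://example.org/elsewhere/book",
        other_platform_name="Example Repository",
    ),
]
grouped = locations_by_platform(book_locations)
```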

This is deemed particularly relevant in the context of the further development of OAPEN’s Book Analytics service, as Thoth will be able to point to a large variety of locations, from each of which the analytics service developed by COKI can then collect usage data. More on Thoth’s involvement in the COKI/OAPEN Book Analytics service can be found in our accompanying overview of federated services and platforms that Thoth has been working with.

Linked datasets, persistent identifiers, and references

Right from the beginning, Thoth has sought to enable flexible many-to-many relationships between multiple data points.

This includes contributor information that is shared across publishers, as it is understood that authors and editors might, over time, be publishing books with a number of presses represented on Thoth.

For the purpose of adding contributor-level persistent identifiers, Thoth has implemented ORCID as its PID of choice. To record contributors’ details on a given book record, this data point can be linked to an individual book record, and descriptive contributor information regarding the contributor’s background, affiliation, and position held at the time of publication can then be stored within the book-level record.

Another component in the Thoth data model is institutional data, which is also stored independently from individual book records, and similarly shared across publishers. Each book record can then link to such institutional data points independently from each other, to e.g. link the institution an author or editor is affiliated with. The choice of PID for this institutional data is the Research Organization Registry (ROR) dataset, a joint initiative by California Digital Library, Crossref, and DataCite.
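The relationships described above can be sketched as follows. This is an illustrative model only: the class and field names are assumptions rather than Thoth's actual schema, the titles, presses, and ROR ID are made-up placeholders, and the ORCID iD shown is the one ORCID publishes for its fictitious test researcher, Josiah Carberry. The point is that contributors and institutions exist once and are shared, while the biography, affiliation, and position live on the book-level link.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Contributor:
    """Shared across publishers; identified by an ORCID iD."""
    full_name: str
    orcid: str


@dataclass(frozen=True)
class Institution:
    """Shared across publishers; identified by a ROR ID."""
    name: str
    ror: str


@dataclass
class Affiliation:
    institution: Institution
    position: str  # position held at the time of publication


@dataclass
class Contribution:
    """Book-level link between a work and a shared contributor."""
    contributor: Contributor
    role: str
    biography: str = ""
    affiliations: list[Affiliation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    publisher: str
    contributions: list[Contribution] = field(default_factory=list)


def works_by_contributor(works, contributor):
    """Return every work, from any publisher, that the contributor took part in."""
    return [
        work for work in works
        if any(c.contributor == contributor for c in work.contributions)
    ]


author = Contributor("Josiah Carberry", "0000-0002-1825-0097")
university = Institution("Example University", "https://ror.org/placeholder")

first = Work("First Book", "Press A", [
    Contribution(author, "author", affiliations=[Affiliation(university, "Lecturer")]),
])
second = Work("Second Book", "Press B", [
    Contribution(author, "editor", affiliations=[Affiliation(university, "Professor")]),
])

found = works_by_contributor([first, second], author)
```

Because the position is stored on the link rather than on the contributor, the same person can appear as a lecturer on one book and as a professor on a later one without either record overwriting the other.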

Thoth also has a data field dedicated to recording funder-level data. While this previously relied on Crossref’s Open Funder Registry set of records, the announced integration of Funder Registry data with ROR has enabled Thoth to also use ROR for identifying funding institutions (while continuing to support the recording of Crossref Funder IDs in DOI format).

As mentioned above, Thoth is in the process of adding metadata fields to allow publishers to participate in Crossmark, including publication statuses for withdrawn works and new editions of previously published works. Participating publishers will be able to add a Crossmark button to digital editions of books, allowing the reader to find out if a PDF they are reading has been superseded by an updated edition, for example.

And last, but by no means least, Thoth also provides capabilities to store bibliographic reference / citation data for books and individual chapters. The current implementation provides a free-text field that takes a book’s or chapter’s bibliographic records of literature referenced in the work at hand. Publishers can then register these with Crossref either by utilising Thoth’s Crossref-specific XML export, or via the automated data submission workflows provided through Thoth Plus.

In doing so, Thoth is proud to say that our implemented metadata model and corresponding Thoth services facilitate good metadata practice in the context of open access books by providing publishers with a means to implement persistent identifiers and controlled vocabularies relevant in the context of Open Science / Open Scholarship. This is reflected in Thoth meeting, or even exceeding, all metadata requirements & recommendations provided by e.g. the German working group of university publishers’ Quality Standards for OA Books (Arbeitsgemeinschaft der Universitätsverlage, 2023).

White-label catalogue and website

Over the past 18 months, Thoth’s functionalities have seen substantial improvement in a variety of areas. As has been outlined in the preceding paragraphs, much work has gone into implementing a variety of metadata export formats.

Encouraged by feedback received from a number of international publishers and consortia, an additional body of work has been focusing on developing a white-label catalogue that extends the idea of a white-label website that Open Book Publishers showcased in 2022 (cf. Arias 2022). In this context, “white-label” means that the website and the underlying software serve as a fully-functional template that can freely be re-used by anyone interested in establishing their own publisher website.

In a similar vein, Thoth’s own web presence now also serves as an updated white-label website/catalogue template that can be re-used by any publisher or consortium. This new website was launched in March 2024, and features a full book catalogue that directly implements a live Thoth metadata feed. An overview of all features available on the website is available in this dedicated blog post.

Going forward, Thoth is planning to offer this white-label template as a website/catalogue hosting solution to publishers and consortia who might be interested in running their own website and shared catalogue leveraging the full potential of Thoth’s open metadata.

Outlook: steps towards implementing the Thoth Traffic Light System

As part of our longer-term development trajectory, Thoth will be seeking to implement an automated guidance mechanism to support publishers with their metadata work. This mechanism, which is tentatively labelled the ‘Traffic Light System’, will seek to integrate the variety of platform-specific requirements regarding which kinds of data are mandated by which platforms.

To give an example, for a publisher to be able to register books with Google Books, their corresponding ONIX 3.0 metadata record needs to have at least one BIC, BISAC, or LCC subject classification code listed in the metadata. Similarly, JSTOR mandates the provision of at least one BISAC subject code in its ONIX 3.0 variant to allow ingest of corresponding book and metadata files. And in a similar fashion, following its move to Thema classification codes (cf. Snijder, 2024) after BIC was deprecated in February 2024, OAPEN now mandates the provision of at least one Thema subject code for uploads to OAPEN and DOAB. Other platforms have different specifications, and in Thoth, we are seeking to integrate most of them in the guidance mechanism of the Traffic Light System.

And this is where Thoth’s Traffic Light System (TLS) comes into play: implemented within the Thoth metadata management backend, the TLS will provide a live colour-coded indication of a given publisher’s metadata quality, to highlight which platform’s specifications are already being met, and to indicate which metadata fields might still benefit from further input to meet each checked platform’s minimum requirements.
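A minimal sketch of such a check, assuming only the three subject-code requirements described in this post (Google Books, JSTOR, OAPEN), might look as follows. The function name, data shapes, and colour labels are hypothetical, and the subject codes are sample values; real platform specifications involve many more fields than subject codes, and this is not Thoth's implementation.

```python
# For each export, at least one subject code from one of the listed
# schemes must be present (requirements as described in this post).
REQUIRED_SUBJECT_SCHEMES = {
    "ONIX 3.0 Google Books": {"BIC", "BISAC", "LCC"},
    "ONIX 3.0 JSTOR": {"BISAC"},
    "ONIX 3.0 OAPEN": {"THEMA"},
}


def traffic_lights(subjects):
    """Map each export to "green" (requirement met) or "red" (code missing).

    `subjects` is a list of (scheme, code) pairs recorded for one book.
    """
    # Ignore entries whose code is empty or whitespace only.
    present = {scheme.upper() for scheme, code in subjects if code.strip()}
    return {
        export: "green" if present & accepted else "red"
        for export, accepted in REQUIRED_SUBJECT_SCHEMES.items()
    }


# A book with only a Thema code meets OAPEN's requirement but not the others.
thema_only = traffic_lights([("Thema", "JBCC")])

# Adding a BISAC code satisfies Google Books and JSTOR as well.
thema_and_bisac = traffic_lights([("Thema", "JBCC"), ("BISAC", "SOC052000")])
```

Keeping the requirements in a plain lookup table means a new platform rule can be added without changing the checking logic, which suits a setting where specifications differ from platform to platform.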

An early version of this has already been implemented: in Thoth’s newly-established advanced book catalogue, which draws on live Thoth metadata, the “Export Metadata” area displays all available export formats for a given book record. Any specifications for which the export mechanism will fail to create an export file due to missing metadata fields are being marked with an exclamation mark, and a mouse-over will reveal the reason for such a failing export. Publishers can thus quickly amend their existing metadata entries in Thoth via the web interface, and then check back to see if the issue has been resolved.

Screenshot of a book entry in the Thoth metadata catalogue. This displays a list of metadata including title, subtitle, contributor, DOI, landing page URL, license, copyright, publisher, publication date, place of publication, ISBNs, and short and long abstracts. (Note that this is just a subset of all metadata fields available in Thoth.)
On the left of the screen, a dedicated "Export Metadata" area lists the variety of export formats available for each book in Thoth. These include the following specifications: ONIX 3.0 Thoth, ONIX 3.0 Project MUSE, ONIX 3.0 OAPEN, ONIX 3.0 JSTOR, ONIX 3.0 Google Books, ONIX 3.0 Overdrive; ONIX 2.1 EBSCO Host, ONIX 2.1 ProQuest Ebrary; CSV, JSON, OCLC KBART, BibTeX, Crossref DOI deposit (XML), MARC21 record, MARC21 Markup, and MARC21 XML.

Fig. 2: Thoth’s advanced book catalogue, drawing on live Thoth metadata.

An early implementation of the future Thoth Traffic Light System can already be seen in action: on the “Export Metadata” widget (left) of a given book record, exports that are still unavailable due to missing metadata are marked with an exclamation mark, and a mouse-over will reveal the reason for this.

Next to this, Thoth will also work to enhance the overall UX and implement multilingual options for the interface and key documentation to enable the steadily-growing pool of international publishers to flexibly use Thoth according to their local needs.

So, still lots to do.

If you are a publisher who would like to learn more about the various solutions that Thoth Open Metadata has on offer, or a library or open access platform / provider that would like to work with us, do get in touch via [email protected] or visit https://thoth.pub.


Header image by Ricardo Gomez Angel on Unsplash.
