RDF linked data cataloguing at Oslo Public Library

This post was originally published as an article in SCATnews no. 41, the Newsletter of the Standing Committee of the IFLA Cataloguing Section.

The new public library in Oslo will be a digital discovery center where the presentation of the physical collection merges with digital and user-generated content. This calls for new ways of describing both physical and digital content, and for new ways of working with cataloguing in the library. To get there, the library has decided to drop its integrated library system and MARC as its cataloguing format. Instead we will use RDF linked data as the primary cataloguing format, starting in 2015.

For more than 25 years, the present proprietary integrated library system has been the main tool for library staff as well as the core of most end-user services at Oslo Public Library. All software development is in the hands of the system vendors, and the vendors also control the data within the system. 2015 marks the end of this era, as the library switches to the open source library system Koha. But instead of just swapping one integrated library system for another, the idea is to use Koha as one of several modules in an extended system. Koha will handle core functionality like circulation and the patron database, while search and browsing for end-users, as well as the cataloguing module for library staff, will be FRBR-oriented and based on RDF linked data.

The staff at Oslo Public Library has worked with catalogue data as linked data since 2010. The first experiments were about identifying works and expressions represented in the library catalogue and using the FRBR model to link the manifestations in the library collection to those works and expressions. Later the library developed the tool MARC2RDF for harvesting catalogue records from the library catalogue and converting them to RDF linked data, as well as scripts for adding a FRBR-like structure to the catalogue and enriching bibliographic data with information harvested from external online sources. Since 2011 the library has maintained a full linked data version of the catalogue, data.deichman.no, where the RDF data are exposed and made available for querying.

This linked data «shadow catalogue» has been vital for the creation of two new digital end-user services, developed from 2012 to 2013. The first is a service that collects book reviews from Norwegian libraries in one database, describes them with metadata and links them to works and manifestations in the linked data catalogue, so that users can look up reviews by means of the metadata that describe books (e.g. «10 latest reviews of fantasy books for kids» or «10 latest reviews of books about sports»). The second service, the «Active shelf», is a physical device that patrons can use to look up information about books, collected from multiple online sources. It also includes features like «Similar books» and «More books by the same author», which users can browse in a touchscreen interface to discover new books they might be interested in. Both services are fueled by library metadata in linked data form, and neither would have been feasible with ordinary MARC records.

The opportunity to add FRBR functionality has been one of the most immediate advantages of using linked data as metadata format. It has made it possible to connect a book review to all the different editions of a book, and it is what makes «More books by the same author» appear on the Active shelf as a lucid list of unique works, rather than the kind of messy list of multiple editions of the same book we are used to from the online catalogue. This is implemented by scripts that construct identifiers for works and link them to manifestations, based on certain logical rules applied to catalogue metadata. But as long as the linked data are produced by converting MARC records, there will always be a limit to how expressive the data can be: any information that is not contained in, or cannot be derived from, MARC records is lost to us. As we want our future online catalogue search and other end-user services to appear smart and usable, we have decided to move away from MARC cataloguing altogether and use RDF linked data as our core metadata format. Only in this way can we get the kind of expressive and uniform metadata that we want.
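To make the FRBRization step concrete, here is a minimal sketch of how such work identifiers might be constructed, written in Python with the rdflib library. The namespace, the normalization rule, the work_uri helper and the publicationOf property are illustrative assumptions for this example only; the library's actual scripts apply more elaborate matching rules, and a full FRBR chain would interpose an expression between work and manifestation.

```python
# Illustrative FRBRization sketch: derive a stable work URI from a normalized
# author + title key, so that manifestations sharing that key link to one work.
# Namespace, property and matching rule are assumptions, not Deichman's code.
import hashlib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

DEICH = Namespace("http://data.deichman.no/")  # hypothetical base URI for this sketch

def work_uri(author: str, title: str) -> URIRef:
    """Derive a deterministic work identifier from a normalized author/title
    key, so that different editions of the same book map to the same work."""
    key = f"{author.strip().lower()}|{title.strip().lower()}"
    return DEICH["work/" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]]

g = Graph()

# Two manifestations of the same book, as they might come out of a
# MARC2RDF conversion (URIs and fields simplified for the example):
editions = [
    (DEICH["manifestation/m1"], "Hamsun, Knut", "Sult"),
    (DEICH["manifestation/m2"], "Hamsun, Knut", "Sult"),
]

for manifestation, author, title in editions:
    work = work_uri(author, title)  # same key -> same work URI
    g.add((work, DCTERMS.title, Literal(title)))
    # Illustrative shortcut property; full FRBR would go via an expression.
    g.add((manifestation, DEICH.publicationOf, work))

print(g.serialize(format="turtle"))
```

Running this yields two manifestations pointing at one shared work node, which is exactly what lets a book review, or a «More books by the same author» list, attach to the work once rather than to every edition separately.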

A public library is a library for the public. As catalogue data is an important resource for the library’s end-user services, the public library cataloguer should keep in mind that it is the public, the patrons, that the data should serve. This was of course always the case, but traditionally in a more indirect manner: cataloguers produced catalogue data as a tool for librarians, who in turn used them to assist the patrons. This is usually not the case anymore. Patrons search the online catalogue themselves, or they use software applications that draw on catalogue data to help them find what they want or discover new things. Nevertheless, cataloguing rules, classification schemes and metadata formats still tend to be «librarian readable» and a mystery to most patrons. And since part of their function is to organize collections so that each object has its one right place according to one particular set of rules, cataloguing standards and practice tend to accommodate one «preferred» type of information need. Cataloguing for the public should avoid making assumptions about what patrons will be interested in, or about their motivations for being interested in it. It should simply focus on making the data as rich and expressive as possible, so that they can be applied and combined in as many ways as possible.

It is a bit depressing that library cataloguing in 2014 is still pretty much about typing, and that so much of its focus is on strings, names and words. Cataloguing should be about linking resources, not typing; and the identity of the things we describe should be determined by unique identifiers, rather than by the strings that label them. Two decades after the arrival of the World Wide Web and four decades after the relational database, it is rather remarkable that library cataloguing still focuses so much on words and text. Instead of registering a person as the author of a book, we make the record contain an entry for the author’s name. Instead of registering the topic of a document, we add an entry for the topic term. This is not a good way to describe information resources, and conceptually it just seems plain wrong. A person’s name is a property of that person, not of the books he or she has written. The name shouldn’t be part of the description of the book at all, and it definitely shouldn’t be part of the description twice (as is the case now, since we register names both as main or added entries and in statements of responsibility). Instead, the resource that is the book should be linked to other resources representing the person who wrote it and the concept that is its topic; the person’s name and the labels that describe the concept should then be part of the descriptions of those resources. The search index for the book should of course contain indirectly connected text strings, such as the names of persons and the labels of topics, so that people can find the book by searching for them. But a search index is something different from a resource description, and the building of search indexes should be a job for machines, not humans.
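As a small illustration of the difference between typing strings and linking resources, here is a sketch in Python with rdflib, using made-up example.org URIs. The book description contains only links; the name belongs to the person and the label to the concept, and the «search index» is derived mechanically from the graph:

```python
# "Linking, not typing": the book description carries no names or terms,
# only links; labels belong to the linked resources. URIs are illustrative.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF, SKOS

EX = Namespace("http://example.org/")  # made-up namespace for this sketch

g = Graph()
book = EX["book/1"]
person = EX["person/hamsun"]
topic = EX["concept/hunger"]

# The book is linked to resources, not labelled with strings:
g.add((book, DCTERMS.creator, person))
g.add((book, DCTERMS.subject, topic))

# The name is a property of the person, the label of the concept -- each
# stated exactly once, instead of being retyped into every book record:
g.add((person, FOAF.name, Literal("Knut Hamsun")))
g.add((topic, SKOS.prefLabel, Literal("hunger", lang="en")))

# A machine can still build the book's search index by following the links:
index_terms = [
    str(label)
    for _, _, resource in g.triples((book, None, None))
    for label_property in (FOAF.name, SKOS.prefLabel)
    for label in g.objects(resource, label_property)
]
print(index_terms)  # e.g. ['Knut Hamsun', 'hunger'] -- derived, never typed twice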

We are of course fully aware that we are not the only ones in the library world who see the need for new ways of thinking about cataloguing. So why don’t we wait for the library standards for linked data cataloguing that are bound to come sooner or later? Well, first of all there is the suspicion that «sooner» might be slightly less likely than «later». The new Oslo Public Library opens in only four years, and we simply don’t have the time to wait and see what happens in the meantime. We also fear that a standard constructed to accommodate all kinds of libraries, with their different types of collections and their diversity in character and quality of legacy data, will carry a lot more complexity than we really need, possibly at the expense of the value gained from a simpler model. The real gain lies in linking to and making use of resources outside the library and outside the library world. We also doubt whether new library-specific standards are the right way to go at all; after all, specialized ontologies and vocabularies already exist for describing almost everything we need to say something about. Why would we need one unified model to describe everything? And why should we assume that the choices made by specialists within the different fields are inferior to those made by library generalists? Finally, if the emerging standards for library linked data cataloguing do prove useful and valuable in the future, we will have a better starting point for implementing them than most other libraries. RDF linked data are flexible and easy to adjust, and the more expressive they are, the better the starting point.

When the library sector wants to introduce a new way of doing things, it is usually a very thorough and rather slow process. We have to consider the implications for all the different types of libraries with their different types of material. And then we usually feel that we have to be absolutely sure we have thought of and planned for every eventuality and exception that could occur, and that we have got everything just right, before we can start using it. The consequences of such an approach can be that the final result is overcomplicated and hard to understand and use, that we sacrifice valuable functionality as a precaution against threats that later turn out not to be all that pressing, or that development takes so long that the product is already outdated by the time it is ready for use. We might have something to learn from modern software development methods, where the goal is to implement and start using a system as soon as possible. Further development, adjustments and new features are then based on the needs that arise during actual use of the system, rather than on the needs we imagine in advance. This speeds up the process and reduces the risk of choosing solutions that are unnecessary, impractical or too complicated.

One of our strongest incentives for switching formats and changing cataloguing practice is that we are able to. We realize that we are in a very fortunate position at Oslo Public Library. As a public library, our main task is to offer content that is of current interest, rather than conserving an existing historical collection. The library’s content policy states that the library’s focus on content should be based on a «just in time» principle, rather than on a «just in case» principle: Instead of letting the collection we have decide what kind of library we get, we should let the kind of library we want decide what our collection should be like. We can apply a similar way of thinking to our catalogue data: Instead of letting the nature of our old data dictate our choices of formats and cataloguing practice, we can focus on how we ideally would like to register metadata, and then let that dictate our choices of formats and data models. The old data can be converted and adjusted to fit our models as well as possible, and will gradually make up a smaller and less significant part of the catalogue.

About Asgeir Rekkavik

Librarian at Deichmanske bibliotek, Department of Knowledge Organization.