Tutorial: Introduction to Annif and automated subject indexing

Date: 2020-09-21 07:00

Osma Suominen
National Library of Finland
Osma Suominen works as an information systems specialist at the National Library of Finland. He currently works on automated subject indexing, in particular the Annif tool and the Finto AI service, as well as on publishing bibliographic data as Linked Data. He is also one of the creators of the Finto thesaurus and ontology service and leads development of the Skosmos vocabulary browser used in Finto. Osma Suominen earned his doctoral degree at Aalto University, where he did research on semantic portals and the quality of controlled vocabularies within the FinnONTO series of projects.

Koraljka Golub
Linnaeus University
Koraljka Golub is a Professor in Library and Information Science at Linnaeus University. She is the head of Linnaeus University’s iSchool and program coordinator for the Master’s in Digital Humanities. Her research focuses on manual, automatic and collaborative approaches to knowledge organization for the purposes of information retrieval. She has worked on research projects related to automatic subject indexing using thesauri and classification schemes, both subject-specific (Engineering Index) and general (Dewey Decimal Classification, Library of Congress Subject Headings). Established evaluation models for automated subject indexing, as well as their more complex alternatives, have also been a focus of her research.

Annemieke Romein
KNAW Huygens Institute for Dutch History
Annemieke Romein is a post-doctoral researcher at the KNAW Huygens Institute for Dutch History. She is an early modern historian who works at the intersection of political and legal history and digital humanities. Her research focuses on early modern legislation from a comparative perspective. Her current project, ‘A Game of Thrones?’, examines how governance in three early modern republics (Berne, Holland and Gelderland) addressed issues of order. In 2019 she was a Researcher-in-Residence at the KB National Library of the Netherlands, where she worked with Sara Veldhoen and Michel de Gruijter on automatic metadating of individual laws in early modern volumes of ordinances.

Sara Veldhoen
National Library of the Netherlands
Sara Veldhoen works as a research software engineer in the research department of the KB, National Library of the Netherlands. She is an active member of a research group that explores automated metadata generation to assist the people who catalogue publications, with a focus on subject indexing (using Annif) and author indexing. She is also involved in the KB's Researcher-in-Residence programme, where she works together with external researchers on projects they propose, such as that of Annemieke Romein. Sara Veldhoen holds a master's degree in Artificial Intelligence from the University of Amsterdam, where she studied compositionality of language in neural networks.

Manually indexing documents for subject-based access is a very labour-intensive intellectual process. A machine could perform similar subject indexing much faster. In this series of presentations and demonstrations, we will show practical examples of automated subject indexing and discuss how such systems can be evaluated.

In the first part of this presentation, Osma Suominen will introduce the general idea of automated subject indexing using a controlled vocabulary such as a thesaurus or a classification system, as well as the open source automated subject indexing tool Annif, which integrates several different machine learning algorithms for text classification. By combining multiple approaches, Annif can be adapted to different settings. The tool can be used with any vocabulary and, with suitable training data, can analyse documents in many different languages. Annif is both a command line tool and a microservice-style API service which can be integrated with other systems. We will demonstrate how to use Annif to train a model using metadata from an existing bibliographic database and how it can then provide subject suggestions for new, unseen documents.
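As a rough illustration of what integrating the API service with another system might look like, the following Python sketch builds a request for subject suggestions and parses the reply. It assumes an Annif instance exposing a `/v1/projects/<project-id>/suggest` endpoint that accepts form-encoded `text`, `limit` and `threshold` fields and returns a JSON object with a `results` list. The base URL, the project identifier and the helper function names are hypothetical, so check them against the documentation of the Annif version in use.

```python
import json
from urllib import parse, request


def build_suggest_request(base_url, project_id, text, limit=5, threshold=0.0):
    """Build (but do not send) a POST request asking for subject suggestions.

    The endpoint path and field names are assumptions about the Annif REST API.
    """
    url = f"{base_url.rstrip('/')}/projects/{project_id}/suggest"
    body = parse.urlencode(
        {"text": text, "limit": limit, "threshold": threshold}
    ).encode("utf-8")
    return request.Request(url, data=body, method="POST")


def parse_suggestions(payload):
    """Turn a JSON reply into (label, score) pairs, highest score first."""
    results = json.loads(payload)["results"]
    pairs = [(r["label"], r["score"]) for r in results]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def suggest(base_url, project_id, text, limit=5):
    """Send the request to a running service and return parsed suggestions."""
    req = build_suggest_request(base_url, project_id, text, limit=limit)
    with request.urlopen(req) as response:
        return parse_suggestions(response.read().decode("utf-8"))
```

With a service running locally, a call such as `suggest("http://localhost:5000/v1", "my-project", "Some document text")` would return the suggested subject labels with their scores; both the address and the project name here are placeholders.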

In the second part of the presentation, Koraljka Golub will discuss the topic of evaluating automated subject indexing systems. There are many challenges in evaluation, for example the lack of gold standards to compare against, the inherently subjective nature of subject indexing, relatively low inter-indexer consistency in typical settings, and the dominance of out-of-context, laboratory-like evaluation approaches.
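To make the comparison against a gold standard concrete, here is a minimal Python sketch of two commonly used measures: precision, recall and F1 of suggested subjects against gold-standard subjects, and a simple inter-indexer consistency score computed as the overlap between two indexers' subject sets (shared subjects divided by all distinct subjects). The function names and the sample subjects are illustrative assumptions, not taken from the presentation, and the presentation may well cover other evaluation approaches.

```python
def precision_recall_f1(suggested, gold):
    """Compare suggested subjects with gold-standard subjects for one document."""
    suggested, gold = set(suggested), set(gold)
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def indexing_consistency(indexer_a, indexer_b):
    """Shared subjects divided by all distinct subjects used by either indexer."""
    a, b = set(indexer_a), set(indexer_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative data only: subjects suggested by a system and assigned by a human.
suggested = ["libraries", "indexing", "machine learning"]
gold = ["indexing", "machine learning", "thesauri", "automation"]

precision, recall, f1 = precision_recall_f1(suggested, gold)
consistency = indexing_consistency(suggested, gold)
```

Scores like these only measure agreement with one particular set of human-assigned subjects, which is exactly why low inter-indexer consistency and missing gold standards make evaluation difficult.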

In the third part of the presentation, Annemieke Romein and Sara Veldhoen will present a case study of how they have applied Annif in a Digital Humanities research project to categorize early modern legislative texts using a hierarchical subject vocabulary and a pre-trained set.

For practitioners who would like to learn how to use the Annif tool on their own, there is also a follow-up hands-on tutorial. It consists of short prerecorded video presentations, written instructions and practical exercises that introduce and explain various aspects of Annif and its use.