Metadata is an abstraction, a language with a grammar and vocabularies that necessarily emerges in many varied forms across natural languages, cultures, and intellectual domains. This address will recapitulate some of the metaphors that emerged in the community to bridge these abstractions to the problems of information management in the digital world.
Weibel will also explore some of the social-engineering challenges of how a growing global community self-organized and, in the current vernacular, “crowdsourced” what grew into a global standards activity, a research community, and many spin-off activities that underlie much of the organization of digital information on the Internet.
There’s a pretty good chance he’ll tell some stories along the way.
In a massively distributed environment like the Internet, service providers play a critical role in making information findable. While data providers make excellent information available, hubs aggregate their metadata and give it worldwide visibility. However, the metadata being produced and exposed is not always uniformly rich, and often needs to be optimized. In the case of scientific literature in food and agriculture, certain singularities make this even more complex. On the one hand, grey literature is critical, and journal articles are not the only scholarly communication channel that counts. On the other hand, while English is the pivotal language in many other sciences, the diversity of languages used in food and agriculture makes it necessary to consider multilingualism and semantic strategies as ways to increase the accessibility of scientific literature. Service providers have taken different approaches to these challenges, expanding the coverage of document types and adopting semantic technologies as a key instrument to enrich metadata. This panel session aims to discuss the challenges that service providers face in aggregating content from data providers in the food and agricultural sciences. The five panelists will share their experiences from different perspectives: 1. Ag Data Commons, the public, government, scientific research data catalog and repository that helps the agricultural research community share and discover research data funded by the United States Department of Agriculture and meet Federal open access requirements; 2. AGRIS at the Food and Agriculture Organization of the United Nations, with more than 12 million records about publications in up to 90 different languages from 500 data providers; 3. the Beijing agricultural think tank platform, which brings together agricultural policy, development plans, agriculture-related reports, agricultural expert data, and statistical data from around the world; 4. GARDIAN, the Global Agricultural Research Data Innovation & Acceleration Network, the CGIAR flagship data harvester across all CGIAR Centers and beyond; and 5. the Land Portal Library, with 60,000 highly enriched resources related to land governance aggregated from a highly specialized sector.
In the first part of this presentation, Osma Suominen will introduce the general idea of automated subject indexing using a controlled vocabulary such as a thesaurus or a classification system, and the open source automated subject indexing tool Annif, which integrates several different machine learning algorithms for text classification. By combining multiple approaches, Annif can be adapted to different settings. The tool can be used with any vocabulary and, with suitable training data, documents in many different languages may be analysed. Annif is both a command-line tool and a microservice-style API service which can be integrated with other systems. We will demonstrate how to use Annif to train a model using metadata from an existing bibliographic database and how it can then provide subject suggestions for new, unseen documents.
In the second part of the presentation, Koraljka Golub will discuss the topic of evaluating automated subject indexing systems. There are many challenges in evaluation, for example the lack of gold standards to compare against, the inherently subjective nature of subject indexing, relatively low inter-indexer consistency in typical settings, and the dominance of out-of-context, laboratory-style evaluation approaches.
In the third part of the presentation, Annemieke Romein and Sara Veldhoen will present a case study of how they have applied Annif in a Digital Humanities research project to categorize early modern legislative texts using a hierarchical subject vocabulary and a pre-trained set.
For practitioners who would like to learn how to use the Annif tool on their own, there is also a follow-up hands-on tutorial. It consists of short prerecorded video presentations, written instructions, and practical exercises that explain and introduce various aspects of Annif and its use.
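To sketch the microservice-style use of Annif described above: once a model has been trained, Annif exposes a REST API whose suggest endpoint returns ranked subject suggestions for a piece of text. The snippet below builds such a request with the Python standard library. The base URL, port, and the project id `yso-en` are illustrative assumptions, not details from the abstract; consult the Annif documentation for the endpoints of your own installation.

```python
import json
import urllib.parse
import urllib.request

# Root of a locally running Annif API service (assumed default port).
ANNIF_BASE = "http://localhost:5000/v1"

def build_suggest_request(project_id, text, limit=10):
    """Build a POST request for Annif's suggest endpoint for one project."""
    url = f"{ANNIF_BASE}/projects/{urllib.parse.quote(project_id)}/suggest"
    data = urllib.parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")

def suggest(project_id, text, limit=10):
    """Send the request and return the list of suggested subjects
    (each with a URI, label, and score) from the JSON response."""
    with urllib.request.urlopen(build_suggest_request(project_id, text, limit)) as resp:
        return json.load(resp)["results"]
```

A caller would then invoke, for example, `suggest("yso-en", full_text_of_document)` and keep the highest-scoring subjects as indexing suggestions.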
The study is based on the results of the “Wooden Slips Character Dictionary” (簡牘字典系統/ WCD), launched by the Academia Sinica Center for Digital Cultures (ASCDC) as an online system to demonstrate the possibility of the integrative application of different ontologies and vocabularies to deal with linked data for DH research. To achieve this purpose, the study has developed an “integrative Chinese Wooden Slips Ontology.” The main purpose of the ontological design is to support DH scholarship in the research field of ancient Chinese characters and their interpretation, and also to serve as a basic data model for structuring an online retrieval system of Chinese characters across different institutes. The integrative Chinese Wooden Slips Ontology is designed based on the CIDOC-CRM model and contains four different data models of specific fields to enhance the detailed and accurate description of single wooden slips and the information about each written character. The CRM-based data model is extended to enrich the detailed data on each written Chinese character, including temporal information of work production and annotation for the whole wooden slip or a single character. As a result, the CRM classes are extended as nodes to link with the different types of this integrative Chinese Wooden Slips Ontology. Since the ancient Chinese characters are written on fragile materials and easily become damaged or unrecognizable over time, the interpretation of these characters has to rely on the retrieval of both images and their metadata through semantic methods, such as IIIF and Linked Data. Reading, recognizing, and comparing the writing styles of the same or similar written characters is one of the important methods used to interpret characters accurately. IIIF-based retrieval systems can help scholars to conduct such research in a visually comfortable way.
When interpreting the precise meaning of a written character within the whole text, obtaining information about the composition or annotation of an ancient Chinese glyph must depend on the LOD-based retrieval approach. ASCDC’s “Chinese characters and character realization ontology” and the “Web Annotation on Cultural heritage ontology” might offer a new approach to analyzing this ancient Chinese cultural heritage via semantic methods. To extend and enhance the preliminary research results, images of single characters in the WCD system are further made interoperable and retrievable in the union catalog of the “Multi-database Search System for Historical Chinese Characters” via the IIIF API, which is established in cooperation with other international research communities, including the Nara National Research Institute for Cultural Properties, Historiographical Institute of the University of Tokyo, National Institute of Japanese Language, National Institute for Japanese Language and Linguistics, and Institute for Research in Humanities at Kyoto University in Japan. The same Chinese characters from the datasets of different institutes can be displayed in this collective interface, which supports the study of ancient Chinese characters. Links: 1. Wooden Slips Character Dictionary: https://wcd-ihp.ascdc.sinica.edu.tw/woodslip/ 2. Multi-database Search System for Historical Chinese Characters: https://wcd-ihp.ascdc.sinica.edu.tw/union/
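As an illustration of the IIIF-based image retrieval described above: a IIIF Image API request follows the fixed URL pattern {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}, so any conforming client can request a specific region of a character image at a chosen size. The sketch below composes such a URL; the server and identifier are hypothetical, and the defaults follow Image API 3.0 conventions, not the WCD system's actual endpoints.

```python
from urllib.parse import quote

def iiif_image_url(server, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Compose a IIIF Image API URL:
    {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The identifier is percent-encoded as the spec requires."""
    return (f"{server}/{quote(identifier, safe='')}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")
```

For example, `iiif_image_url("https://example.org/iiif", "slip-001", region="100,200,300,400")` would request only that pixel region of the image, which is how a viewer can zoom in on a single written character.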
railML works on the principle of referencing existing, usable standards instead of developing every aspect from scratch; it can therefore be seen as an application of Dublin Core. Since Dublin Core has its background in library science, this makes railML a great example of collaborative open-source work and of cooperation spanning sectors that might otherwise have little in common.