Invited Talk 6: LLM to Annotate Subject Metadata

Starts at: Wed, Nov 8, 2023, 12:00 South Korea Time (Nov 8, 2023, 03:00 UTC)
Finishes at: Wed, Nov 8, 2023, 13:00 South Korea Time (Nov 8, 2023, 04:00 UTC)
Venue: Gyeongha Hall 1
Moderator: Sam Oh

Moderator

Sam Oh

Sungkyunkwan University and DCMI
Sam Oh is a Distinguished Professor for Global Affairs at Sungkyunkwan University in Seoul, Korea, and the current executive director of the DCMI. He chairs the ISO/IEC JTC1/SC34 (Document Description & Processing Languages) and ISO TC46/SC9 (Identification & Description) committees, and he represents the National Library of Korea on the DCMI Governing Board.

His main research interest is metadata and ontology modeling. He has extensive experience consulting for companies and government agencies on the design of metadata and ontologies. He has taught courses on database design, Web database design, XML and metadata schema design, ontology modeling, information architecture, and the design of knowledge management systems.

He received his Ph.D. in Information Science and Technology from Syracuse University, NY, USA in 1995 and worked for the Information School at the University of Washington for 4 years (1994-1998) prior to taking his current post.

Presentations

Utilising a Large Language Model to Annotate Subject Metadata: A Case Study with an Australian National Research Data Catalogue

Recent studies have shown that in-context learning with large language models (LLMs) has matched or even exceeded human performance on various automated annotation tasks. In this research, we explored using a large language model, OpenAI's GPT-3.5, to enrich metadata from an Australian national dataset catalogue, specifically to annotate subject metadata with the Australian and New Zealand Standard Research Classification (ANZSRC) taxonomy. In line with recent related studies using the in-context learning framework, we crafted prompts that combine task instructions with example metadata records whose subject metadata was annotated by humans, and then used GPT-3.5 to produce predictions. Our models displayed outstanding results, achieving an accuracy of over 90% in multiple research divisions. In a few research divisions the models did not perform as well as human annotation; however, further analysis shows that the machine-annotated subject headings can complement human annotation. Overall, our research demonstrates the potential of GPT-3.5 as a powerful tool for automating the annotation of subject metadata with limited training data.
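The prompt construction described in the abstract (task instructions, followed by human-annotated example records, followed by the record to annotate) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual prompt: the `Record` fields, the `build_prompt` helper, the instruction wording, and the sample records and labels are all hypothetical, and the call to the language model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A simplified metadata record (hypothetical fields, not the catalogue's schema)."""
    title: str
    description: str


def build_prompt(instruction, examples, query):
    """Assemble a few-shot (in-context learning) prompt.

    instruction: task description shown once at the top.
    examples:    list of (Record, label) pairs annotated by humans.
    query:       the Record whose subject label the model should predict.
    """
    parts = [instruction.strip(), ""]
    for record, label in examples:
        parts.append(f"Title: {record.title}")
        parts.append(f"Description: {record.description}")
        parts.append(f"Subject: {label}")
        parts.append("")  # blank line between examples
    # The query record ends with an open "Subject:" slot for the model to fill.
    parts.append(f"Title: {query.title}")
    parts.append(f"Description: {query.description}")
    parts.append("Subject:")
    return "\n".join(parts)


if __name__ == "__main__":
    # Illustrative records and labels only; not taken from the study's data.
    instruction = (
        "Assign the most suitable ANZSRC research division "
        "to the dataset described below."
    )
    examples = [
        (Record("Koala genome assembly", "Sequencing reads and annotations."),
         "Biological Sciences"),
        (Record("Seafloor bathymetry survey", "Multibeam sonar depth grids."),
         "Earth Sciences"),
    ]
    query = Record("Soil moisture time series", "Sensor readings from field sites.")
    print(build_prompt(instruction, examples, query))
```

The resulting string would be sent to the model as a single prompt, and the text the model generates after the final "Subject:" would be read as its predicted annotation. How many examples to include, and how they are selected, are design choices this sketch does not address.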

Shiwei Zhang

RMIT University
ORCID
Shiwei Zhang, a research associate at RMIT University in Australia, specializes in the research and development of natural language processing models. His work spans a wide range of applications, including sentiment analysis, irony detection, product-related question answering, and medical NLP. Recently, his research has focused on instruction tuning for large language models, steerable text generation, and in-context learning. He earned his Ph.D. from RMIT University in Australia.