Metadata and Search

Global Corporate Circle DCMI 2003 Workshop

Sunday, September 28, 2003
Seattle, Washington, USA

This report was prepared by Abe Crystal (University of North Carolina) and Paula Land (University of Washington).

A note on the linking conventions in this report—The names of people link to their PowerPoint presentation the first time they are mentioned in the report. The names of companies link to the company's web site.

Introduction

The 2003 Dublin Core™ Conference (http://dc2003.ischool.washington.edu/) took as its basic premise that "Metadata is fundamental to persons, organizations, machines, and an array of enterprises that are increasingly turning to the Web and electronic communication for disseminating and accessing information." One of the reasons metadata is receiving such attention is its role in facilitating information seeking. The pre-conference workshop "Metadata and Search" addressed the challenges of using metadata to help users find information, particularly when using site- or domain-specific search engines. Discussions of this specific problem rapidly grew to encompass numerous related areas, including the costs and benefits of creating metadata, integration and interoperability, methods of metadata creation, quality issues, and information architecture. Here we summarize the workshop, identify key themes and insights, and suggest directions for future developments in practice and research.

Thematic summary

Cost and usefulness of metadata

Participants were skeptical about whether creating large amounts of metadata (particularly for intranets with millions of documents) is a wise investment. Brian DiSilvestro (Verity) made the point that poor-quality metadata can make search worse, and that metadata needs to be kept up to date using dynamic classification tools. Noting that metadata development and application are among the most expensive ways to get users to content, Lou Rosenfeld (LouisRosenfeld.com) argued that documents merit differing levels of metadata, depending on criteria such as authority, strategic value, currency, popularity, and usability. He dubbed this approach "content value tiers," since it calls for placing content into tiers of assessed value. This approach follows Rosenfeld's emphasis on the "Pareto Principle" or "80/20 rule," according to which a key role of the information architect is to identify the relatively small percentage of content that is of the most value to users.
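One way to picture content value tiers is as a weighted score over the criteria Rosenfeld listed. The sketch below is illustrative only: the weights, the 0-to-1 rating scale, the tier cutoffs, the sample documents, and the function names are all assumptions made for the example, not something presented at the workshop.

```python
# Criteria are those named in the workshop; the weights are hypothetical.
WEIGHTS = {
    "authority": 0.25,
    "strategic_value": 0.30,
    "currency": 0.15,
    "popularity": 0.20,
    "usability": 0.10,
}


def value_score(ratings):
    """Weighted sum of per-criterion ratings, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * ratings.get(name, 0.0) for name in WEIGHTS)


def assign_tier(score, high=0.7, low=0.4):
    """Map a score to a tier; the cutoffs are arbitrary placeholders."""
    if score >= high:
        return "tier 1: full manual metadata"
    if score >= low:
        return "tier 2: minimal metadata"
    return "tier 3: full-text search only"


# Hypothetical documents with hand-assigned ratings.
policy_doc = {"authority": 1.0, "strategic_value": 0.9, "currency": 0.8,
              "popularity": 0.6, "usability": 0.7}
old_memo = {"authority": 0.2, "strategic_value": 0.1, "currency": 0.0,
            "popularity": 0.1, "usability": 0.5}

for name, ratings in [("policy_doc", policy_doc), ("old_memo", old_memo)]:
    print(name, "->", assign_tier(value_score(ratings)))
```

In practice the ratings themselves would have to come from somewhere (usage logs, editorial review, document age), and choosing those inputs is likely the harder part of the exercise.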

Estimates of the cost of large-scale metadata creation made it evident how important such a winnowing process is. Rosenfeld calculated that creating metadata for a large organization with one million documents would require roughly 60 employee-years. Mike Doane (SBI and Company) observed that his company typically charges from $195,000 to $275,000 to initially set up a metadata solution for a corporation (which will then face additional ongoing costs).
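To get a feel for what the 60 employee-year figure implies, the arithmetic can be sketched as follows. The report does not say how the estimate was derived; the 2,000 working hours per employee-year and the 20% subset used below are assumptions for illustration.

```python
# Assumption: one employee-year is about 2,000 working hours (50 weeks x 40 hours).
HOURS_PER_EMPLOYEE_YEAR = 2000


def implied_minutes_per_document(num_documents, employee_years):
    """Minutes of effort per document implied by a total effort estimate."""
    total_minutes = employee_years * HOURS_PER_EMPLOYEE_YEAR * 60
    return total_minutes / num_documents


def employee_years_needed(num_documents, minutes_per_document):
    """Total effort, in employee-years, to describe a collection."""
    total_hours = num_documents * minutes_per_document / 60
    return total_hours / HOURS_PER_EMPLOYEE_YEAR


# One million documents in 60 employee-years, as in Rosenfeld's estimate.
per_doc = implied_minutes_per_document(1_000_000, 60)
print(per_doc)  # 7.2

# Describing only a hypothetical top 20% of the collection at the same rate.
print(round(employee_years_needed(200_000, per_doc), 1))
```

Under these assumptions the estimate works out to a little over seven minutes per document, and restricting manual effort to a fifth of the collection cuts the total proportionally.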

These enormous costs have tempted many organizations to consider bypassing a central cataloging operation in favor of resource authors creating metadata directly. But several participants pointed out that this approach is no panacea. Resource authors often create poor-quality metadata records containing incomplete or inaccurate information. It then falls on information specialists to identify and repair these "broken" records. Sandy Hostetetter (Rohm and Haas Company) noted that fixing poor records can be more costly than simply creating records from scratch. In addition to metadata quality, motivation is a fundamental issue. One audience member noted that a multi-year trial of decentralized metadata creation in his organization had yielded virtually no usable records, because authors were simply not interested in metadata. Julie Martin (Boeing Technical Libraries) argued that unmotivated content creators must "feel the pain" of poor information retrieval (IR) before they can be convinced to invest in metadata. The lack of tools for managing and applying metadata places an additional burden on organizations wishing to centralize and standardize metadata use.

In short, challenging business and cultural issues face information architects (IAs) seeking to make widespread use of metadata in their organizations. IAs must be prepared to make a clear business case for their metadata initiatives and promote the visibility of their efforts and impact. As Sean Squires (Washington Mutual, Inc.) put it, they must continually explain "this is what we're doing and why."

Return on investment for metadata

Discussion of the cost and usefulness of metadata led naturally to consideration of the return on investment (ROI) that can be achieved. Mike Doane pointed out that using metadata in an intranet environment to reduce employee time spent finding and verifying files can save, at a conservative estimate, $8,200 per employee. In addition, representatives from financial and governmental agencies raised the point that there are often regulatory and legal reasons for developing and applying metadata to certain sets of content, whether or not they meet criteria such as those in Lou Rosenfeld's value tiers. Helen Josephine (Intel Corporation), presenting a case study of Intel, described proving ROI as one of her greatest challenges, and asked, "What is the measurable business value of knowledge management?" She suggested that, within her organization, ROI will be measured by reduced time spent by employees searching for documents. The driving factor for the IT department is the expense of individual document storage, which could be lessened by better knowledge management. For Sean Squires and Casey Krug from Washington Mutual Bank, the ROI is intangible, but they have been able to demonstrate that their work meets business needs and requirements and improves efficiency.
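As a rough illustration of how such figures might feed a business case, the sketch below combines Doane's two numbers from this report: the $195,000 to $275,000 setup charge and the $8,200 saving per employee. It is deliberately simplistic and rests on assumptions the report does not confirm: that the two figures can be compared directly, that the saving is realized once per employee within the period of interest, and that ongoing costs are ignored.

```python
import math

# Figures quoted in the report.
SETUP_COST_LOW = 195_000
SETUP_COST_HIGH = 275_000
SAVINGS_PER_EMPLOYEE = 8_200


def break_even_employees(setup_cost, savings_per_employee):
    """Smallest headcount whose combined savings cover the setup cost.

    Ignores ongoing maintenance costs, which the report notes do exist.
    """
    return math.ceil(setup_cost / savings_per_employee)


print(break_even_employees(SETUP_COST_LOW, SAVINGS_PER_EMPLOYEE))   # 24
print(break_even_employees(SETUP_COST_HIGH, SAVINGS_PER_EMPLOYEE))  # 34
```

A real ROI model would also need the time period over which the saving accrues and the recurring cost of keeping metadata current, neither of which is given here.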

Metadata creation

Many participants identified metadata creation as a fundamental challenge. Lou Rosenfeld noted the longstanding problem of inter-indexer consistency, and cited evidence that only 10% of human-assigned index terms do not occur in full text. Considering these facts, he was skeptical about whether human-created subject metadata merits investment in many cases. Sandy Hostetetter reported that many end users in her organization aren't interested in creating metadata, leading to poor results when they are asked to do so. Julie Martin from Boeing expressed optimism about the potential of better metadata creation tools, such as a plugin for Microsoft Word her company has been developing internally. But she noted that even excellent tools are of little use until a "content culture" has developed, in which an understanding of the value of metadata and its contribution to information retrieval is pervasive.

Information architecture

Metadata is of little use if it is not integrated into an organization's information architecture. (And as Pete Bell of Endeca Technologies, Inc. noted, the architecture must in turn be integrated with user interfaces and backend technologies.) Several participants, for example, noted the difficulty of identifying labels for content. Sean Squires and Casey Krug observed that label selection can be a highly politicized process.

One insight that emerged is the value of focusing on specific collections or domains in order to enable more "elegant" architectures or more powerful interface techniques (such as faceted access, discussed below). In many cases, there seems to be a tradeoff between implementing metadata (or other aspects of information architecture) for large, general, heterogeneous collections vs. smaller, specific, homogeneous collections. More general implementations permit economies of scale in the use of technology (and perhaps, people). But more specific implementations can offer much better support to users by tying IA and interface design more closely to particular user and task characteristics.

User interfaces

Marti Hearst, as well as Pete Bell from Endeca and Brad Allen from Siderean, made a convincing case for the value of exposing faceted metadata in user interfaces. Having access to an array of facets can allow users to smoothly transition from collection-wide searches that return thousands of records to tightly scoped searches with a reasonable number of hits. "Proximate content" is one way to describe this type of user interface—get someone to the general area, and then let them browse to their specific need.

The tradeoff is the high complexity of these interfaces, but Marti Hearst's empirical studies demonstrate that users found the complexity manageable and the interfaces usable. [See: Faceted Metadata for Image Search and Browsing in Proceedings of ACM CHI 2003 and Finding the Flow in Web Site Search in Communications of the ACM, vol. 45, September 2002.] Endeca's implementation of this technique has been successful in e-commerce applications, while Siderean's software shows promise for Semantic Web applications such as aggregation and syndication.
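The narrowing behavior described above, moving from a broad result set to a tightly scoped one by choosing facet values, can be sketched in a few lines. This is a minimal illustration, not how Endeca's or Siderean's products work; the records, facet names, and function names are invented for the example.

```python
from collections import Counter

# Hypothetical image records, each described by a few facets.
RECORDS = [
    {"title": "Harbor at dawn", "medium": "photograph", "decade": "1990s", "subject": "landscape"},
    {"title": "Mill interior", "medium": "photograph", "decade": "1980s", "subject": "architecture"},
    {"title": "City hall study", "medium": "drawing", "decade": "1990s", "subject": "architecture"},
    {"title": "Coastal cliffs", "medium": "photograph", "decade": "1990s", "subject": "landscape"},
    {"title": "Bridge elevation", "medium": "drawing", "decade": "1980s", "subject": "architecture"},
]


def facet_counts(records, facet):
    """Count how many records carry each value of a facet."""
    return Counter(record[facet] for record in records)


def refine(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [
        record for record in records
        if all(record.get(facet) == value for facet, value in selected.items())
    ]


# Show the available choices, then narrow step by step.
print(facet_counts(RECORDS, "medium"))
photos = refine(RECORDS, medium="photograph")
print(facet_counts(photos, "decade"))
print([r["title"] for r in refine(photos, decade="1990s")])
```

Showing the counts next to each facet value at every step is what lets a user see how much each choice will narrow the results before committing to it.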

Alex Wade (Microsoft SharePoint Portal Server) described some of the SharePoint tools for providing content interfaces such as list management, query-defined views, results sorting, and property mapping. He also described some advanced search features such as keyword "best bets," query term replacement, and query term expansion that are reaching corporate desktops via the Microsoft product.
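The three search features named above can be illustrated generically. The sketch below does not use or imitate the SharePoint API; the lookup tables, paths, and function names are hypothetical and serve only to show what each feature does to a query.

```python
# Hypothetical editorial tables; a real deployment would maintain these centrally.
BEST_BETS = {"benefits": "/hr/benefits-overview"}   # keyword -> hand-picked page
REPLACEMENTS = {"401k": "retirement plan"}          # term -> preferred term
EXPANSIONS = {"laptop": ["notebook", "portable computer"]}  # term -> extra synonyms


def rewrite_query(query):
    """Apply term replacement, then term expansion, to each query term."""
    terms = []
    for term in query.lower().split():
        term = REPLACEMENTS.get(term, term)     # replacement swaps the term
        terms.append(term)
        terms.extend(EXPANSIONS.get(term, []))  # expansion adds alternatives
    return terms


def prepare_search(query):
    """Return the best bets to show first and the terms to send to the index."""
    raw_terms = query.lower().split()
    best_bets = [BEST_BETS[t] for t in raw_terms if t in BEST_BETS]
    return {"best_bets": best_bets, "terms": rewrite_query(query)}


print(prepare_search("Laptop 401k"))
print(prepare_search("benefits"))
```

Best bets bypass ranking entirely by pinning an editor-chosen result to a keyword, while replacement and expansion change what the engine searches for; all three depend on someone maintaining the underlying term lists.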

Looking forward, Marti Hearst argued for more sophisticated applications of faceted views. These views could be based on the user's context, for example their position in the organization or their task. More generally, while it is difficult to improve "web search" in general, there are many possibilities for improving specific sub-problems of search, and metadata will likely play a key role in these domain- and task-specific improvements.

Conclusion

As the summary above illustrates, even a topic as ostensibly narrow as "metadata and search" sparked discussion on a wide range of topics. Participants uncovered shortcomings in current practices, challenged conventional wisdom, and proposed new possibilities for the effective use of metadata. Drawing on their contributions, here are some suggestions for future directions in metadata practice and research:

Return on investment (ROI). Establishing rigorous ROI for metadata use in organizations can help information architects make more widespread use of metadata. But measuring the benefits of better resource discovery, for example, is difficult. Subjective impressions of retrieval systems that make good use of metadata are often highly positive, but these diffuse and intangible benefits are hard to translate into quantitative results. Innovative approaches here could benefit both scientific research and practical implementations. More broadly, finding metrics for productivity improvements by "knowledge workers" (or "information workers") is an important challenge.
"Soft" issues. While technical problems abound, in many organizations the most challenging issues are cultural and political. Getting people or business units to take ownership of problem areas, securing funding, and earning the trust of groups within the organization are critical pieces of many metadata initiatives. Going forward, it appears that practitioners need "best practices" for explaining and promoting the value of their work just as they need guidelines for implementing metadata and search systems.
Pragmatic improvements. You can't have it all—there will never be complete consensus or a perfect system. Most organizations need to build metadata coverage and usage organically, not try to do everything at once. These more pragmatic approaches can overcome the sense of intractability that forestalls overly ambitious initiatives. At the same time, they provide accelerated access to subsidiary (and sometimes unexpected) benefits. For example, even small-scale experimentation with author-generated metadata can provide a better foundation for labeling by giving direct access to users' vocabulary.
New conceptions of "value." A consensus emerged during the workshop that it is important to invest in metadata creation economically, for example, describing low-value documents sparsely but high-value ones in more detail. Thinking through the implications of this approach, it becomes apparent that more nuanced conceptions of "value" will be necessary in many situations. For example, organizations may assign different value to different types of metadata, such as administrative or descriptive. In this case, some objects may have high archival value and need extensive administrative metadata for legal or institutional reasons, but they will rarely be accessed and so don't require descriptive investment.
User-centered metadata. User-centered approaches have revolutionized information retrieval, and continue to spur innovation (as in facet-based search/browse interfaces, for example). But the user's perspective is often lost in metadata debates. How are people actually using metadata? What capabilities does having extensive metadata afford? Familiar examples include "best bets" or filtering to gain better control of search results, but an overall understanding of why users need metadata seems to be lacking. Better knowledge here could enormously clarify the scope and importance of metadata initiatives for both practitioners and researchers.

Considering the breadth and depth of these challenges, we anticipate numerous opportunities for improvements in metadata use in the years to come.

Name:	Metadata and Search: Global Corporate Circle DCMI 2003 Workshop
Charter:	Report of the pre-conference workshop on Metadata and Search at the 2003 Dublin Core Conference in Seattle, 28 September 2003.