Sunday, September 28, 2003
Seattle, Washington, USA
This report was prepared by Abe Crystal (University of North Carolina) and Paula Land (University of Washington).
The 2003 Dublin Core Conference (http://dc2003.ischool.washington.edu/) took as its basic premise that "Metadata is fundamental to persons, organizations, machines, and an array of enterprises that are increasingly turning to the Web and electronic communication for disseminating and accessing information." One of the reasons metadata is receiving such attention is its role in facilitating information seeking. The pre-conference workshop "Metadata and Search" addressed the challenges of using metadata to help users find information, particularly when using site- or domain-specific search engines. Discussions of this specific problem rapidly grew to encompass numerous related areas, including the costs and benefits of creating metadata, integration and interoperability, methods of metadata creation, quality issues, and information architecture. Here we summarize the workshop, identify key themes and insights, and suggest directions for future developments in practice and research.
Participants were skeptical about whether creating large amounts of metadata (particularly for intranets with millions of documents) is a wise investment. Brian DiSilvestro (Verity) made the point that poor quality metadata can make search worse, and that metadata needs to be kept up to date using dynamic classification tools. Noting that metadata development and application is one of the most expensive ways to get users to content, Lou Rosenfeld (LouisRosenfeld.com) argued that documents merit differing levels of metadata, depending on a variety of criteria, such as authority, strategic value, currency, popularity, and usability. He dubbed this approach "content value tiers," since it calls for placing content into tiers of assessed value. This approach follows Rosenfeld's emphasis on the "Pareto Principle" or "80/20 rule," according to which a key role of the information architect is to identify the relatively small percentage of content that is of the most value to users.
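Rosenfeld's "content value tiers" amount to scoring each document against assessment criteria and investing in metadata accordingly. The following sketch is purely illustrative: the scoring scale, weights, and tier cutoffs are our assumptions, not figures from the workshop.

```python
# Hypothetical sketch of "content value tiers": score a document against
# Rosenfeld's criteria, then bucket by total score. All numbers here are
# illustrative assumptions.

CRITERIA = ("authority", "strategic_value", "currency", "popularity", "usability")

def assign_tier(scores: dict) -> int:
    """Return tier 1 (richest metadata) through 3 (minimal metadata)."""
    total = sum(scores.get(c, 0) for c in CRITERIA)  # each criterion scored 0-10
    if total >= 40:
        return 1   # full descriptive metadata, human-reviewed
    if total >= 20:
        return 2   # core fields only (e.g., title, date, owner)
    return 3       # automated extraction, no manual cataloging effort

doc = {"authority": 9, "strategic_value": 8, "currency": 7,
       "popularity": 10, "usability": 8}
assert assign_tier(doc) == 1  # high-value content gets the full treatment
```

The point of the tiers, per the 80/20 rule, is that only the top tier justifies expensive human cataloging; the rest can make do with minimal or automated metadata.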
Estimates of the cost of large-scale metadata creation made it evident how important such a winnowing process is. Rosenfeld calculated that creating metadata for a large organization with one million documents would require roughly 60 employee-years. Mike Doane (SBI and Company) observed that his company typically charges from $195,000 to $275,000 to initially set up a metadata solution for a corporation (which will then face additional ongoing costs).
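A quick back-of-envelope check shows the per-document effort implied by Rosenfeld's estimate. The 2,000 hours per employee-year is our assumption, not a figure from the workshop.

```python
# Implied per-document effort behind "1,000,000 documents = ~60 employee-years".
# The hours-per-year figure is an assumption for illustration.
docs = 1_000_000
employee_years = 60
hours_per_year = 2_000            # ~50 weeks x 40 hours (assumed)

total_hours = employee_years * hours_per_year      # 120,000 hours
minutes_per_doc = total_hours * 60 / docs
print(minutes_per_doc)  # 7.2
```

Roughly seven minutes per document sounds modest, but multiplied across a million documents it becomes an enormous investment, which is exactly why tiering matters.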
These enormous costs have tempted many organizations to consider bypassing a central cataloging operation in favor of resource authors creating metadata directly. But several participants pointed out that this approach is no panacea. Resource authors often create metadata records of poor quality, including incomplete or inaccurate information. It then falls to information specialists to identify and repair these "broken" records. Sandy Hostetetter (Rohm and Haas Company) noted that fixing poor records can be more costly than simply creating records from scratch. In addition to metadata quality, motivation is a fundamental issue. One audience member noted that a multi-year trial of decentralized metadata creation in his organization had yielded virtually no usable records, because authors were simply not interested in metadata. Julie Martin (Boeing Technical Libraries) argued that unmotivated content creators must "feel the pain" of poor information retrieval (IR) before they can be convinced to invest in metadata. The lack of tools for managing and applying metadata adds an additional burden to organizations wishing to centralize and standardize metadata use.
In short, challenging business and cultural issues face information architects (IAs) seeking to make widespread use of metadata in their organizations. IAs must be prepared to make a clear business case for their metadata initiatives and promote the visibility of their efforts and impact. As Sean Squires (Washington Mutual, Inc.) put it, they must continually explain "this is what we're doing and why."
Discussion of the cost and usefulness of metadata led naturally to consideration of the return on investment (ROI) that can be achieved. Mike Doane pointed out that using metadata in an intranet environment to reduce employee time spent finding and verifying files can save, at a conservative estimate, $8,200 per employee. In addition, representatives from financial and governmental agencies raised the point that there are often regulatory and legal reasons for developing and applying metadata to certain sets of content, whether or not they meet criteria such as those in Lou Rosenfeld's value tiers. Helen Josephine (Intel Corporation), presenting a case study of Intel, described proving ROI as one of her greatest challenges, and asked, "What is the measurable business value of knowledge management?" She suggested that, within her organization, ROI will be measured by reduced time spent by employees searching for documents. The driving factor for the IT department is the expense of individual document storage, which could be lessened by better knowledge management. For Sean Squires and Casey Krug of Washington Mutual, the ROI is largely intangible, but they have been able to demonstrate that business needs and requirements were met and that efficiencies were improved.
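Savings estimates like Doane's $8,200 per employee typically come from multiplying time saved searching by loaded labor cost. The parameter values below are our assumptions, chosen only to show one plausible way such a figure can be reproduced; they are not Doane's actual inputs.

```python
# One parameterization that reproduces a figure like Doane's $8,200/employee.
# All three inputs are assumptions for illustration.
hours_saved_per_week = 2.05     # search/verification time eliminated
weeks_per_year = 50
loaded_hourly_cost = 80.0       # salary plus overhead, USD/hour

annual_savings = hours_saved_per_week * weeks_per_year * loaded_hourly_cost
print(round(annual_savings))  # 8200
```

The fragility of such estimates (small changes in the assumed hours or labor cost swing the result widely) is one reason participants found ROI so hard to prove.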
Many participants identified metadata creation as a fundamental challenge. Lou Rosenfeld noted the longstanding problem of inter-indexer consistency, and cited evidence that only 10% of human-assigned index terms do not occur in the full text. Considering these facts, he was skeptical about whether human-created subject metadata merits investment in many cases. Sandy Hostetetter reported that many end users in her organization aren't interested in creating metadata, leading to poor results when they are asked to do so. Julie Martin from Boeing expressed optimism about the potential of better metadata creation tools, such as a plugin for Microsoft Word her company has been developing internally. But she noted that even excellent tools are of little use until a "content culture" has developed, in which understanding of the value of metadata and its contribution to information retrieval is pervasive.
Metadata is of little use if it is not integrated into an organization's information architecture. (And as Pete Bell of Endeca Technologies, Inc. noted, the architecture must in turn be integrated with user interfaces and backend technologies.) Several participants, for example, noted the difficulty of identifying labels for content. Sean Squires and Casey Krug observed that label selection can be a highly politicized process.
One insight that emerged is the value of focusing on specific collections or domains in order to enable more "elegant" architectures or more powerful interface techniques (such as faceted access, discussed below). In many cases, there seems to be a tradeoff between implementing metadata (or other aspects of information architecture) for large, general, heterogeneous collections vs. smaller, specific, homogeneous collections. More general implementations permit economies of scale in the use of technology (and perhaps, people). But more specific implementations can offer much better support to users by tying IA and interface design more closely to particular user and task characteristics.
Marti Hearst, as well as Pete Bell from Endeca and Brad Allen from Siderean, made a convincing case for the value of exposing faceted metadata in user interfaces. Having access to an array of facets can allow users to smoothly transition from collection-wide searches that return thousands of records to tightly scoped searches with a reasonable number of hits. "Proximate content" is one way to describe this type of user interface—get someone to the general area, and then let them browse to their specific need.
The tradeoff is the high complexity of these interfaces, but Marti Hearst's empirical studies demonstrate that users found the complexity manageable and the interfaces usable. [See: "Faceted Metadata for Image Search and Browsing," Proceedings of ACM CHI 2003, and "Finding the Flow in Web Site Search," Communications of the ACM, vol. 45, September 2002.] Endeca's implementation of this technique has been successful in e-commerce applications, while Siderean's software shows promise for Semantic Web applications such as aggregation and syndication.
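The core mechanic of faceted access can be sketched in a few lines: each facet selection filters the result set, and the interface shows counts for the remaining values of every other facet, so users narrow from thousands of hits to a handful without ever hitting a dead end. The records and facet names below are invented for illustration; real systems like Endeca's operate at vastly larger scale with indexed data.

```python
# Minimal sketch of faceted narrowing, in the spirit of the interfaces
# discussed above. Records and facets are invented examples.
from collections import Counter

records = [
    {"type": "report", "year": "2003", "region": "US"},
    {"type": "report", "year": "2002", "region": "EU"},
    {"type": "memo",   "year": "2003", "region": "US"},
    {"type": "memo",   "year": "2003", "region": "EU"},
]

def narrow(records, selections):
    """Keep only records matching every selected facet value."""
    return [r for r in records
            if all(r.get(f) == v for f, v in selections.items())]

def facet_counts(records, facet):
    """Counts shown beside each remaining facet value in the UI."""
    return Counter(r[facet] for r in records)

hits = narrow(records, {"year": "2003"})
assert len(hits) == 3
assert facet_counts(hits, "region") == Counter({"US": 2, "EU": 1})
```

Because every displayed facet value carries a nonzero count, each click is guaranteed to return results, which is what makes the progressive narrowing feel "proximate" rather than hit-or-miss.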
Alex Wade (Microsoft SharePoint Portal Server) described some of SharePoint's tools for providing content interfaces, such as list management, query-defined views, results sorting, and property mapping. He also described advanced search features, such as keyword "best bets," query term replacement, and query term expansion, that are reaching corporate desktops via the Microsoft product.
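The "best bets" pattern is simple enough to sketch generically: editors maintain a mapping from keywords to hand-picked results, which are surfaced ahead of the ranked hit list. This is a generic illustration of the pattern, not Microsoft's implementation.

```python
# Generic sketch of keyword "best bets": curated results promoted above
# the engine's ranked hits. The mapping and merge logic are illustrative.
BEST_BETS = {
    "expenses": ["Travel expense policy", "Expense report form"],
}

def search(query, ranked_hits):
    """Prepend any curated best bets, then the remaining ranked hits."""
    promoted = BEST_BETS.get(query.lower().strip(), [])
    return promoted + [h for h in ranked_hits if h not in promoted]

result = search("Expenses", ["Old expense memo", "Travel expense policy"])
assert result == ["Travel expense policy", "Expense report form",
                  "Old expense memo"]
```

Because the curated entries are maintained by people rather than ranked by the engine, best bets are effectively a small, high-value metadata investment of exactly the kind the value-tier discussion recommends.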
Looking forward, Marti Hearst argued for more sophisticated applications of faceted views. These views could be based on the user's context, for example, their position in the organization or their task. More generally, while it is difficult to improve "web search" in general, there are many possibilities for improving specific sub-problems of search, and metadata will likely play a key role in these domain- and task-specific improvements.
As the summary above illustrates, even a topic as ostensibly narrow as "metadata and search" sparked discussion on a wide range of topics. Participants uncovered shortcomings in current practices, challenged conventional wisdom, and proposed new possibilities for the effective use of metadata. Drawing on their contributions, here are some suggestions for future directions in metadata practice and research:
Considering the breadth and depth of these challenges, we anticipate numerous opportunities for improvements in metadata use in the years to come.