DCMI Note - On Information Factoring in Dublin Metadata Records

Title:

On Information Factoring in Dublin Metadata Records

Creator:
Sperberg-McQueen, C. M.
Date Issued:
1996-04-17
Identifier:
Replaces:
NA
Is Replaced By:
NA
Latest version:
http://dublincore.org/specifications/dublin-core/info-factoring/
Status of document:
This is a DCMI Note.
Description of document:

This document describes some problems in the interpretation of metadata records which contain repeated fields, repeated field groups, or references to other metadata records. The semantics of repeated fields and groups (and, equivalently, of references to other metadata records) are described using sentential logic, and a proposal is made to specify the interpretation of repeating groups using the disjunctive normal form of corresponding logical expressions. From this proposal, requirements for grouping elements and inheritance are derived. The semantic principles involved may be of wider applicability, but all examples are from the so-called `Dublin Core' of metadata elements, described in the paper OCLC/NCSA Metadata Workshop Report, by Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, available on the World Wide Web at http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html.


Dublin Core Architecture Working Group: Discussion Note

Status of this Document

This document, first published in 1996, is being made available as a Dublin Core discussion note as part of the DC-Architecture Working Group's effort to formalise an XML/RDF representation of the Dublin Core. While the document is 5 years old, many of the issues and observations made are worth reconsidering in the light of subsequent work on RDF and XML. The DC Note in its current form is UNPUBLISHED and undergoing minor edits for publication on the dublincore.org site. Contact Dan Brickley (dc-architecture co-chair) if you have any queries regarding this process.

The remainder of this document is unaltered apart from minor edits for XHTML validation.


On Information Factoring in Dublin Metadata Records

C. M. Sperberg-McQueen

17 April 1996


Table of Contents

  • 1 The Problem
  • 2 Semantic Models
    • 2.1 Sentential Logic
    • 2.2 Existential Quantifiers
  • 3 Markup Solutions
    • 3.1 Groups
    • 3.2 Implicit Anding and Oring
    • 3.3 Inheritance
  • 4 Conclusion

1 The Problem

The data elements defined for a metadata record by the `Dublin Core' are all optional and all repeatable, and have no prescribed order. Some (e.g. author, title) relate to the intellectual content of an object (the work), while others (e.g. form) relate to particular realizations or instantiations of that intellectual content. Some (e.g. identifier, terms and conditions) may apply to all forms taken by a given item, or only to some forms and not others.

For example, consider the documentation for the TEI Lite SGML tag set. As a work, it may be described by the following metadata:

Title
TEI Lite: An Introduction to Text Encoding for Interchange
Author
Lou Burnard
Author
C. M. Sperberg-McQueen

It has, however, three realizations with distinct URLs, one for the TEI version:

Form
TEI Lite
Identifier (URL)
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei

and two for two different HTML versions:

Form
HTML
Identifier (URL)
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
Identifier (URL)
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html

How should, could, or must metadata for such items be represented?

At the Warwick meeting, Dan LaLiberte argued that in Dublin it was agreed that a given metadata record should describe only a single realization of an intellectual object; this would help ensure that metadata records are unambiguous. I don't find this explicit in the Dublin conference report, but that report does say explicitly that multiple versions may require multiple records. Redundancy may be controlled by factoring common information (e.g. work-related information) into separate records and `inheriting' it in the records for specific realizations. On this view, the three instantiations of the TEI Lite documentation will each require a separate metadata record.

Reports at the Warwick meeting (April 1996) from users of the Dublin Core, however, make clear that in practice there is a strong desire to put metadata for a given work in a single record, using some mechanism such as repeating groups to describe multiple realizations. This paper, for example, might be represented thus with repeating groups (I use the DTD described in Eric Miller's paper Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/tmp/paper.html):

<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<form>TEI Lite</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
</identifier>
<form>HTML</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
</identifier>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</citation>

The only problem with this method is that it requires a lot of intelligence in the reader or user of the metadata to interpret the meaning of fields which occur more than once. A human may easily realize that the first form (TEI Lite) applies only to the first identifier, and that the second and third identifiers are for objects in the second form (HTML); software will realize it only if suitably instructed. A human will realize, perhaps even without conscious thought, that the two <author> elements both apply, at the same time, to all instantiations of the paper, because there are two authors for the paper, while the two <form> elements each relate to separate and distinct instantiations of the paper. Software is unlikely to realize this critical difference without help.

The association of form and identifier information can be made explicit, using the <instance> element of Eric Miller's DTD:

<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<instance>
  <form>TEI Lite</form>
  <identifier scheme='URL'>
  http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
  </identifier>
</instance>
<instance>
  <form>HTML</form>
  <identifier scheme='URL'>
  http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
  </identifier>
  <identifier scheme='URL'>
  http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
  </identifier>
</instance>
</citation>

This is an improvement, but not a full solution (note, for now, that the two HTML identifiers still require different interpretation from the two <author> elements). If common information is factored out into other records, we may be able to escape some of these logical difficulties, but we need a clear explanation of how information in the local record and the inherited information imported from an external record are to be related: are they always additive, even if the same field appears in both records? Or does a local field `override' the inherited value for that field?

2 Semantic Models

We can get a better grip on the problem if we apply some principles of formal logic. The simplest way to formalize the semantics of a Dublin metadata record, it seems to me, is using sentential logic. Existential quantifiers may also be used, and I describe that possibility briefly, enough to persuade myself that the more complex formalism does not require a more complex syntax for metadata records. Either approach allows us first to express more clearly the types of ambiguity arising from repeated fields or groups, and second to see what sorts of mechanisms might suffice to disambiguate them.

2.1 Sentential Logic

Let us consider first why a simple record like the following seems less problematic than the sample given above:

<citation>
  <title>On the Pulse of Morning
  <author>Maya Angelou
  <publisher>University of Virginia Library Electronic Text Center
  <otherAgent>Transcribed by the University of Virginia Electronic Text Center
  <date>1993
  <object>Poem
  <form>1 ASCII file
  <source>Newspaper stories and oral performance of text at the presidential inauguration of Bill Clinton
  <language>English
</citation>

The key difference, I believe, is that all of the metadata in this record unambiguously applies all the time, while some elements of the previous record apply only in conjunction with certain other elements.

If we express each element as a logical proposition, the simple record has a correspondingly simple logical form. For convenience, let us give each proposition a short name:

  • T = "The item has the title On the Pulse of Morning."
  • A = "The item was written by Maya Angelou."
  • P = "The item was published by the University of Virginia Library Electronic Text Center."
  • D = "The item was published in 1993."
  • etc.

Then the metadata record as a whole can be expressed formulaically: (T & A & P & D & OA & Ob & F & S & L), or "The item has the title On the Pulse of Morning and the item was written by Maya Angelou and ...".

The more complex record has a more complex logical structure. If we name the propositions thus:

  • T = "The item has the title TEI Lite: An Introduction to Text Encoding for Interchange."
  • A1 = "The item was written by Lou Burnard."
  • A2 = "The item was written by C. M. Sperberg-McQueen."
  • F1 = "The item is in TEI Lite form."
  • I1 = "The item has the URL http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei."
  • F2 = "The item is in HTML form."
  • I2 = "The item has the URL http://www-tei.uic.edu/orgs/tei/intros/teiu5.html."
  • I3 = "The item has the URL http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html."

then the record as a whole has the following formulaic interpretation: (T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3)))), which can be paraphrased in English roughly thus:

  • The item is called "TEI Lite: ..."
  • and it was written by Lou Burnard
  • and it was (also) written by C. M. Sperberg-McQueen
  • and it is
    • either in TEI Lite as .../teiu5.tei
    • or in HTML
      • either as .../teiu5.html
      • or as .../teiu5.split.html.

Each individual instance can be described (as Dan LaLiberte points out) with a simple metadata record, which translates into a simple formula:

  • T & A1 & A2 & F1 & I1
  • T & A1 & A2 & F2 & I2
  • T & A1 & A2 & F2 & I3

i.e.

  • TEI Lite, by LB and CMSMcQ, in TEI Lite format, at .../teiu5.tei
  • TEI Lite, by LB and CMSMcQ, in HTML format, at .../teiu5.html
  • TEI Lite, by LB and CMSMcQ, in HTML format, at .../teiu5.split.html

I believe that this simple form of expression, in which the only connector is and, corresponds to the class of metadata records which are unambiguous and easy to interpret. The problem of interpreting complex metadata records (ones with repeating fields or groups) can thus be paraphrased: how do we derive a set of simple and-expressions from the logical expression representing a complex metadata record?

Fortunately, the answer is simple.

If we combine the three simple expressions into a single formula, we get a paraphrase of the metadata record as a whole:

( (T & A1 & A2 & F1 & I1)
| (T & A1 & A2 & F2 & I2)
| (T & A1 & A2 & F2 & I3)
)

which can be paraphrased roughly thus:

(if you have an item in hand described by this metadata record, then one of these three things is true:)

  • _either_ the title is TEI Lite ... _and_ the author(s) are LB _and_ CMSMcQ _and_ the form is TEI Lite _and_ the URL is .../teiu5.tei
  • _or_ the title is TEI Lite ... _and_ the author(s) are LB _and_ CMSMcQ _and_ the form is HTML _and_ the URL is .../teiu5.html
  • _or_ the title is TEI Lite ... _and_ the author(s) are LB _and_ CMSMcQ _and_ the form is HTML _and_ the URL is .../teiu5.split.html

The salient point (and the only interesting or new claim in this entire paper) is that this expression is logically equivalent to the original formula for the example, but unlike the original this one is in _disjunctive normal form_.[1] It is fortunately not hard to generate the disjunctive normal form of arbitrary logical expressions, particularly when (as here) the only operators allowed are `and` and `or`.

We can then describe the semantics of metadata records thus:

  • Each element in a metadata record represents a single logical predicate.
  • A _simple_ record is interpreted as the `and`-ing together (conjunction) of its sub-elements.
  • A _complex_ record is interpreted as a shorthand for the `or`-ing together (disjunction) of several simple records, each represented by one term in the complex record's disjunctive normal form.

We do need, however, a way to make explicit not only the parenthetical groupings in the formula (<instance> does this) but also which propositions in the formula are joined by `and` (&) and which by `or` (|). We can see therefore that proposals calling for a single grouping element (such as that made by Eric Miller in the paper already mentioned, or by myself in informal DTD sketches) will not suffice to solve the problem. We need not one but two distinct types of group.

Miller's <citation> element already serves as an `and`-group, since simple citations are interpreted as the `and`-ing together (formally, the conjunction) of their elements. It will have to be able to nest recursively, however, if we want to handle all cases of shared metadata. And we will need a second grouping element, to serve as an `or`-group. For examples, see the section Groups, below.

2.2 Existential Quantifiers

Some readers may resist the use of sentential logic as a formalism for representing the meanings of metadata records in general, since the meaning of

<citation>
<title>On the pulse of morning</title>
</citation>
is not, in general, merely "The title is On the pulse of morning" but something more like "(There is an object, described by this record, and) the title (_of the object described by this record_) is On the pulse of morning." That is, there is an implied existential quantifier inherent in the existence of a metadata record, and there is an implied argument for each metadata element, viewed as a logical function. Paraphrasing records at this level of detail would make it easier to capture the semantics of work and realization more clearly. Represented in first-order predicate calculus, our example might look like this:
(E w)(E lb)(E cmsmcq)(E i1)(E i2)(E i3)
     ( work(w)
     & title(w,"TEI Lite ...")
     & name(lb,"Lou Burnard")
     & name(cmsmcq,"C. M. Sperberg-McQueen")
     & author(w,lb) & author(w,cmsmcq)
     & instance(w,i1)
     & form(i1,teilite)
     & url(i1,".../teiu5.tei")
     & instance(w,i2)
     & form(i2,html)
     & url(i2,".../teiu5.html")
     & instance(w,i3)
     & form(i3,html)
     & url(i3,".../teiu5.split.html")
     & (i1 != i2) & (i1 != i3)
     )

which we might paraphrase as:

  • there are objects _w_, _lb_, _cmsmcq_, _i1_, _i2_, and _i3_, such that
    • _w_ is a work
    • the title of _w_ is TEI Lite ...
    • the full name of _lb_ is _Lou Burnard_
    • the full name of _cmsmcq_ is _C. M. Sperberg-McQueen_
    • (an) author of _w_ is _lb_
    • (an) author of _w_ is _cmsmcq_
    • _i1_ is an instance of _w_
    • the format of _i1_ is _teilite_
    • the URL of _i1_ is _.../teiu5.tei_
    • _i2_ is an instance of _w_
    • the format of _i2_ is _html_
    • the URL of _i2_ is _.../teiu5.html_
    • _i3_ is an instance of _w_
    • the format of _i3_ is _html_
    • the URL of _i3_ is _.../teiu5.split.html_
    • _i1_ and _i2_ are not the same object
    • _i1_ and _i3_ are not the same object

Note, in passing, that from the <form> elements we can infer that the first instantiation is not identical to the second or third, but the second and third instantiations, both being in HTML, could conceivably be identical. Hence there is no claim that `(i2 != i3)`.

If we are willing to assume that different instantiations are the only possible causes of `or`-groups in metadata records, then we may plausibly believe (a) that complex metadata records can all be described with a single `and`-group, if instantiations are given identifiers (such as the _i1_, _i2_, _i3_ of the example) and the identifiers are used to associate the metadata elements applying to each instantiation, and (b) that Eric Miller's <instance> element suffices, after all, since all instances are implicitly `or`-ed with each other, and nothing else will cause an `or`-group.

I'm reluctant to accept this logic, first because while many (all?) examples of logical complication in metadata records do involve multiple instantiations, I certainly haven't seen any argument that proves this is a logical necessity. Second, tempting though this argument is, I still don't know how to derive the formula just given systematically from the metadata record itself. The formula has three instantiations, and three _form()_ predicates, while the metadata record itself has three <identifier> elements, but only two <form> elements, and only two <instance> elements.

3 Markup Solutions

3.1 Groups

We saw earlier, when we used sentential logic to say what metadata records mean, that we need both a grouping element meaning `and` and one meaning `or`. The `or`-group we must invent. For now, let's call it <or>. The `and`-group we already have, in the <citation> element. The only drawback is that the term _citation_ seems to imply that its contents constitute a complete citation, which will not always be the case. For purposes of illustration, therefore, let's invent a second new grouping element called <and>. If we augment Eric Miller's DTD with <and> and <or>, our example record will look like this (I augment the <and> and <or> elements with identifiers, so I can refer to them in later discussion):

<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<or id=O1>
  <and id=A1>
    <form>TEI Lite</form>
    <identifier scheme='URL'>
    http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
    </identifier>
  </and>
  <and id=A2>
    <form>HTML</form>
    <or id=O2>
      <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
      </identifier>
      <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
      </identifier>
    </or>
  </and>
</or>
</citation>
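
The interpretation rule from section 2.1 can be applied mechanically to a record marked up this way. The following is a minimal sketch in Python, not part of the original proposal: it assumes the record has already been parsed into nested ("and", ...) and ("or", ...) tuples, with the proposition names T, A1, A2, F1, I1, F2, I2, I3 from section 2.1 as leaves. The function name dnf and the tuple representation are illustrative assumptions.

```python
from itertools import product

# Assumed representation (for illustration only): a group is a tuple
# ("and", [children]) or ("or", [children]); a leaf is a string naming
# one proposition, e.g. "F1".


def dnf(node):
    """Return the disjunctive normal form of `node` as a list of terms.

    Each term is a list of proposition names that are and-ed together;
    the terms themselves are or-ed together.
    """
    if isinstance(node, str):
        return [[node]]
    kind, children = node
    child_forms = [dnf(child) for child in children]
    if kind == "or":
        # A disjunction pools the terms of all its children.
        return [term for form in child_forms for term in form]
    if kind == "and":
        # A conjunction distributes over the terms of its children:
        # pick one term from each child and concatenate them.
        return [
            [name for term in combo for name in term]
            for combo in product(*child_forms)
        ]
    raise ValueError(f"unknown group type: {kind!r}")


# The TEI Lite record; the enclosing <citation> acts as the outer and-group.
record = ("and", [
    "T", "A1", "A2",
    ("or", [
        ("and", ["F1", "I1"]),
        ("and", ["F2", ("or", ["I2", "I3"])]),
    ]),
])

for term in dnf(record):
    print(" & ".join(term))
```

Run as written, this prints the three simple records listed in section 2.1, one per line. Distribution can multiply terms quickly in deeply nested records, so a practical implementation might want to bound the size of the result.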

3.2 Implicit Anding and Oring

It might be suggested (I suggested it myself, in the first draft of these notes) that we don't really need the <or> element everywhere it occurs in the example just given. It would be clear enough simply to write

  <and id=A2>
    <form>HTML</form>
    <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
    </identifier>
    <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
    </identifier>
  </and>

since the URL is a characteristic of the realization, not the work, and in general two identifiers in the same scheme will _always_ refer to distinct realizations. They thus might as well be regarded as forming a sort of automatic, implicit `or`-group.

On this view, some elements are implicitly `and`-ed together when they repeat: author, for example. Some elements (e.g. identifier, form) are intrinsically incapable of being `and`-ed together and thus form implicit `or`-groups. Some elements can go either way: multiple titles may all apply, or they might apply each to one particular instantiation of the work (the French title applies to the French version, the English title to the English version of a European regulation, which might need juridically to be treated as a single work, since all national-language versions have equal authority).

On the whole, it seems better _not_ to make too much of this generalization, though it might be a useful heuristic for plausibility checking. Since some elements can go either way, we will need <and> and <or> (or rather, their logical equivalents: I am not proposing actual elements here, just pointing out the need for elements with conjunctive and disjunctive meaning) regardless, and using such elements explicitly seems simpler and less confusing than hard-wiring so much intelligence into software.

It also is worth pointing out that the initial premise of this section is false: two URLs may very easily point to the same object, and it is easy to imagine methods of describing formats which would allow multiple names to be applied to the same format (just as in some programming languages the same data type can be referred to by multiple names).

3.3 Inheritance

There are three ways to treat inheritance of metadata from other records. We can insist that the inherited metadata never include the same elements as are present locally, or we can specify that locally specified elements override inherited elements of the same name, or we can attempt to specify some method of merging the two records so as to keep all the information from both records, by `and`-ing or `or`-ing corresponding elements together. In the first event, it may be overkill to speak of `inheritance'; in the latter, we may be reintroducing all the problems of repeating groups.

If we take the first or the second approach (or even the third approach, as long as we provide a simple rule, such as "All inherited data is `and`-ed together with local data"), we will be able to interpret a reference to external metadata fairly rigorously:

  • Replace the reference (virtually) with the contents of the referenced record (or, with those parts of the referenced record not overridden by local specifications).
  • If the local record and the referenced record are both _simple_ (in the sense given above), then the result is also guaranteed simple.
  • If the local or the referenced record is _complex_, then the same rules of interpretation must be applied as for other complex records.

By making liberal use of references to other records, we can do without <and> and <or> elements. We can demonstrate this by giving a method of transforming records with <and> and <or> into sets of records linked by reference.

For each SGML element in the source record, we do the following:

  • if the SGML element is of type <citation>, then create a new output record (which becomes the `current output record'): first copy the element itself, and then copy each of its children, following these same rules
  • if the element is of type <or>, then
    • copy nothing to the current output record
    • for each subelement of the <or>, create a new output record, copying the subelement to the new output record following these same rules
    • at the end of each subelement's output record, add a reference to the current output record, with relationship-type of `inheritance`
  • if the element is of type <and>, copy the element itself to the current output record, renaming it as a <citation> element; then copy all its children, using these same rules
  • otherwise, copy the element using its current type, then copy all its children using these same rules

Or perhaps it would be clearer to put it this way: We begin by copying the entire <citation> element and all its children into a new record, which we then process as follows:

  • Give the new record a name or URL; remember this name in the variable _N_.
  • If any <and> elements occur as children of the root <citation> element, then delete the entire <and> element and copy it into a new record. Add, as the first element in the new record, a reference to record _N_. Then process the new record as described in this procedure. (It gets its own name, etc.)
  • If any <or> elements occur as children of the root <citation> element, then remove the start- and end-tags from the <or> element, thus promoting its children one level in the document tree.

The sample record from the previous section would turn into the following set of records:

- record C (at `http://www.meta.org/catalog/c`):
<citation>
  <title>TEI Lite: An Introduction to
    Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
</citation>
- record A1 (at `http://www.meta.org/catalog/a1`):
<citation id=A1>
  <relation scheme='URL' type='OtherType' othertype='inherits'>
    http://www.meta.org/catalog/c
  </relation>
  <form>TEI Lite</form>
  <identifier scheme='URL'>
    http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
  </identifier>
</citation>
- record A2 (at `http://www.meta.org/catalog/a2`):
<citation id=A2>
  <relation scheme='URL' type='OtherType' othertype='inherits'>
    http://www.meta.org/catalog/c
  </relation>
  <form>HTML</form>
</citation>
- record H1 (at `http://www.meta.org/catalog/h1`):
<citation>
  <relation scheme='URL' type='OtherType' othertype='inherits'>
    http://www.meta.org/catalog/a2
  </relation>
  <identifier scheme='URL'>
    http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
  </identifier>
</citation>
- record H2 (at `http://www.meta.org/catalog/h2`):
<citation>
  <relation scheme='URL' type='OtherType' othertype='inherits'>
    http://www.meta.org/catalog/a2
  </relation>
  <identifier scheme='URL'>
    http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
  </identifier>
</citation>
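
To see how such a set of linked records might be read back, here is a minimal sketch in Python. It assumes the simple rule quoted above, that all inherited data is and-ed together with local data, and makes no attempt at overriding. The dictionary layout, the short record names, and the function name resolve are illustrative assumptions; URLs are abbreviated as elsewhere in this note.

```python
# Each record holds its local (element, value) pairs and, optionally, the
# name of the record it inherits from. This layout is an assumption made
# for illustration; it is not part of any DTD discussed here.
records = {
    "c": {"inherits": None, "fields": [
        ("title", "TEI Lite ..."),
        ("author", "Lou Burnard"),
        ("author", "C. M. Sperberg-McQueen"),
    ]},
    "a1": {"inherits": "c", "fields": [
        ("form", "TEI Lite"),
        ("identifier", ".../teiu5.tei"),
    ]},
    "a2": {"inherits": "c", "fields": [("form", "HTML")]},
    "h1": {"inherits": "a2", "fields": [("identifier", ".../teiu5.html")]},
    "h2": {"inherits": "a2", "fields": [("identifier", ".../teiu5.split.html")]},
}


def resolve(name, records):
    """Return the simple record for `name` as a list of (element, value) pairs.

    Inherited fields come first, then local ones; everything is and-ed
    together and nothing is overridden.
    """
    chain = []
    seen = set()
    while name is not None:
        if name in seen:
            raise ValueError(f"circular inheritance at {name!r}")
        seen.add(name)
        chain.append(name)
        name = records[name]["inherits"]
    fields = []
    # Walk from the most distant ancestor down to the record itself.
    for ancestor in reversed(chain):
        fields.extend(records[ancestor]["fields"])
    return fields


for leaf in ("a1", "h1", "h2"):
    print(leaf, resolve(leaf, records))
```

Resolving a1, h1, and h2 yields the three simple records of section 2.1. Records c and a2 resolve too, but to descriptions with no identifier.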

More work is needed here, I think, both to specify how to interpret the record when the same element occurs both locally and in the referenced object, and to specify what constitutes the same element.

4 Conclusion

An adequate syntax for multiple versions (realizations) of the same work (intellectual content) requires an explicit semantic interpretation, to avoid hopeless ambiguity. If we provide our syntax with mechanisms for both disjunctive and conjunctive groupings (and-groups and or-groups), we can provide simple rules for interpreting complex records in terms of their disjunctive normal form.

More complex semantic formalisms, using existential quantifiers, may also be defined, but do not require any syntax more elaborate than the simpler semantics.

Notes

[1] A formula in sentential logic is in disjunctive normal form if it is a disjunction (or alternation, or or-group) of one or more terms, and if each term is a conjunction of one or more primitive sentences or their negations. No nested expressions are allowed. For a fuller discussion, any book on formal logic may be consulted, but perhaps the best discussion of disjunctive normal form and the algebraic manipulations used to achieve it may be found in W.V. Quine, Methods of logic 4th ed. (Cambridge, Mass. : Harvard University Press, 1982).