Creators: | Michael Sperberg-McQueen |
Date Issued: | 1996-04-17 |
This Version: | http://dublincore.org/documents/2001/03/19/info-factoring/ |
Latest Version: | https://www.dublincore.org/specifications/dublin-core/info-factoring/ |
Replaces: | |
Status: | note |
Description: | This document describes some problems in the interpretation of metadata records which contain repeated fields, repeated field groups, or references to other metadata records. The semantics of repeated fields and groups (and, equivalently, of references to other metadata records) are described using sentential logic, and a proposal is made to specify the interpretation of repeating groups using the disjunctive normal form of corresponding logical expressions. From this proposal, requirements for grouping elements and inheritance are derived. The semantic principles involved may be of wider applicability, but all examples are from the so-called 'Dublin Core' of metadata elements, described in the paper OCLC/NCSA Metadata Workshop Report, by Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, available on the World Wide Web at http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html. |
This document, first published in 1996, is being made available as a Dublin Core discussion note as part of the DC-Architecture Working Group's effort to formalise an XML/RDF representation of the Dublin Core. While the document is 5 years old, many of the issues and observations made are worth reconsidering in the light of subsequent work on RDF and XML. The DC Note in its current form is UNPUBLISHED and undergoing minor edits for publication on the dublincore.org site. Contact Dan Brickley (dc-architecture co-chair) if you have any queries regarding this process.
The remainder of this document is unaltered apart from minor edits for XHTML validation.
C. M. Sperberg-McQueen
17 April 1996
This document describes some problems in the interpretation of metadata records which contain repeated fields, repeated field groups, or references to other metadata records. The semantics of repeated fields and groups (and, equivalently, of references to other metadata records) are described using sentential logic, and a proposal is made to specify the interpretation of repeating groups using the disjunctive normal form of corresponding logical expressions. From this proposal, requirements for grouping elements and inheritance are derived. The semantic principles involved may be of wider applicability, but all examples are from the so-called 'Dublin Core' of metadata elements, described in the paper OCLC/NCSA Metadata Workshop Report, by Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, available on the World Wide Web at http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html.
The data elements defined for a metadata record by the 'Dublin Core' are all optional and all repeatable, and have no prescribed order. Some (e.g. author, title) relate to the intellectual content of an object (the work), while others (e.g. form) relate to particular realizations or instantiations of that intellectual content. Some (e.g. identifier, terms and conditions) may apply to all forms taken by a given item, or only to some forms and not others.
For example, consider the documentation for the TEI Lite SGML tag set. As a work, it may be described by the following metadata:
How should, could, or must metadata for such items be represented?
At the Warwick meeting, Dan LaLiberte argued that in Dublin it was agreed that a given metadata record should describe only a single realization of an intellectual object; this would help ensure that metadata records are unambiguous. I don't find this explicit in the Dublin conference report, but that report does say explicitly that multiple versions may require multiple records. Redundancy may be controlled by factoring common information (e.g. work-related information) into separate records and 'inheriting' it in the records for specific realizations. On this view, the three instantiations of the TEI Lite documentation will each require a separate metadata record.
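The factoring just described can be sketched as a small data structure. This is only an illustration under stated assumptions: the `inherits` key and the `resolve` helper are hypothetical names invented for this sketch, and nothing in the text above prescribes a particular inheritance mechanism. The field names and URLs are those of the TEI Lite example.

```python
# Hypothetical layout: one record for the work, one per realization.
# The "inherits" key is an assumption made for this sketch only.
WORK = {
    "title": ["TEI Lite: An Introduction to Text Encoding for Interchange"],
    "author": ["Lou Burnard", "C. M. Sperberg-McQueen"],
}

BASE = "http://www-tei.uic.edu/orgs/tei/intros/"

REALIZATIONS = [
    {"inherits": WORK, "form": ["TEI Lite"], "identifier": [BASE + "teiu5.tei"]},
    {"inherits": WORK, "form": ["HTML"], "identifier": [BASE + "teiu5.html"]},
    {"inherits": WORK, "form": ["HTML"], "identifier": [BASE + "teiu5.split.html"]},
]


def resolve(record):
    """Return a flat record: inherited fields first, then the record's own.

    New lists are built at every step, so the shared work record
    is never modified.
    """
    parent = record.get("inherits")
    flat = resolve(parent) if parent else {}
    for field, values in record.items():
        if field != "inherits":
            flat[field] = flat.get(field, []) + list(values)
    return flat


# Each realization resolves to a complete, self-contained record.
for realization in REALIZATIONS:
    flat = resolve(realization)
    print(flat["title"], flat["author"], flat["form"], flat["identifier"])
```

Each resolved record describes exactly one realization, while the work-level information is stored only once.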
Reports at the Warwick meeting (April, 1996) from users of the Dublin Core, however, make clear that in practice, there is a strong desire to put metadata for a given work in a single record, using some mechanism such as repeating groups to describe multiple realizations. This paper, for example, might be represented thus with repeating groups (I use the DTD described by Eric Miller's paper Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/tmp/paper.html):

<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <form>TEI Lite</form>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
  <form>HTML</form>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
</citation>

The only problem with this method is that it requires a lot of intelligence in the reader or user of the metadata to interpret the meaning of fields which occur more than once. A human may easily realize that the first form (TEI Lite) applies only to the first identifier, and that the second and third identifiers are for objects in the second form (HTML); software will realize it only if suitably instructed. A human will realize, perhaps even without conscious thought, that the two
The association of form and identifier information can be made explicit, using the
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <instance>
    <form>TEI Lite</form>
    <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
  </instance>
  <instance>
    <form>HTML</form>
    <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
    <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
  </instance>
</citation>

This is an improvement, but not a full solution (note, for now, that two HTML identifiers still require different interpretation from the two _

<citation>
  <title>On the Pulse of Morning
  <author>Maya Angelou
  <publisher>University of Virginia Library Electronic Text Center
  <otherAgent>Transcribed by the University of Virginia Electronic Text Center
  <date>1993
  <object>Poem
  <form>1 ASCII file
  <source>Newspaper stories and oral performance of text at the presidential inauguration of Bill Clinton
  <language>English
</citation>

The key difference, I believe, is that all of the metadata in this record unambiguously applies all the time, while some elements of the previous record apply only in conjunction with certain other elements.
If we express each element as a logical proposition, the simple record has a correspondingly simple logical form. For convenience, let us give each proposition a short name:
(T & A & P & D & OA & Ob & F & S & L)

or, in words, "The item has the title On the Pulse of Morning and the item was written by Maya Angelou and ...".

The more complex record has a more complex logical structure. If we name the propositions thus:
(T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3))))

which can be paraphrased in English roughly thus:

> - The item is called "TEI Lite: ..."
> - and it was written by Lou Burnard
> - and it was (also) written by C. M. Sperberg-McQueen
> - and it is
>   - either in TEI Lite as .../teiu5.tei
>   - or in HTML
>     - either as .../teiu5.html
>     - or as .../teiu5.split.html.

Each individual instance can be described (as Dan LaLiberte points out) with a simple metadata record, which translates into a simple formula:
T & A1 & A2 & F1 & I1
T & A1 & A2 & F2 & I2
T & A1 & A2 & F2 & I3
I believe that this simple form of expression, in which the only connector is `and`, corresponds to the class of metadata records which are unambiguous and easy to interpret. The problem of interpreting complex metadata records (ones with repeating fields or groups) can thus be paraphrased: how do we derive a set of simple `and`-expressions from the logical expression representing a complex metadata record?
Fortunately, the answer is simple.
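The derivation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original proposal: the `And`, `Or`, and `dnf` names are hypothetical helpers, expressions are assumed to contain only `and` and `or` (no negation), and the proposition names T, A1, A2, F1, F2, I1, I2, I3 are those of the formula for the TEI Lite record. Distributing `and` over `or` yields one simple conjunction per realization.

```python
from itertools import product


def And(*parts):
    """Build a conjunction node (hypothetical helper for this sketch)."""
    return ("and", parts)


def Or(*parts):
    """Build a disjunction node (hypothetical helper for this sketch)."""
    return ("or", parts)


def dnf(expr):
    """Return the disjunctive normal form of an and/or expression.

    The result is a list of conjunctions; each conjunction is a tuple
    of atomic proposition names. Only 'and' and 'or' are supported.
    """
    if isinstance(expr, str):          # an atomic proposition
        return [(expr,)]
    op, parts = expr
    expanded = [dnf(part) for part in parts]
    if op == "or":
        # A disjunction simply pools the terms of its alternatives.
        return [term for terms in expanded for term in terms]
    # A conjunction picks one term from each part and joins them.
    return [sum(choice, ()) for choice in product(*expanded)]


# (T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3))))
record = And("T", "A1", "A2",
             Or(And("F1", "I1"),
                And("F2", Or("I2", "I3"))))

for term in dnf(record):
    print(" & ".join(term))
# T & A1 & A2 & F1 & I1
# T & A1 & A2 & F2 & I2
# T & A1 & A2 & F2 & I3
```

Each printed line is one simple `and`-expression of the kind described by a single-realization record.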
If we combine the three simple expressions into a single formula, we get a paraphrase of the metadata record as a whole:
( (T & A1 & A2 & F1 & I1)
| (T & A1 & A2 & F2 & I2)
| (T & A1 & A2 & F2 & I3) )

which can be paraphrased roughly thus:

> (if you have an item in hand described by this metadata record, then one of these three things is true:)
>
> - _either_ the title is TEI Lite ... _and_ the author(s) are LB _and_ CMSMcQ _and_ the form is TEI Lite _and_ the URL is .../teiu5.tei
> - _or_ the title is TEI Lite ... _and_ the author(s) are LB _and_ CMSMcQ _and_ the form is HTML _and_ the URL is .../teiu5.html
> - _or_ the title is TEI Lite ... _and_ the author(s) are LB _and_ CMSMcQ _and_ the form is HTML _and_ the URL is .../teiu5.split.html

The salient point (and the only interesting or new claim in this entire paper) is that this expression is logically equivalent to the original formula for the example, but unlike the original this one is in _disjunctive normal form_.[1] It is fortunately not hard to generate the disjunctive normal form of arbitrary logical expressions, particularly when (as here) the only operators allowed are `and` and `or`.

We can then describe the semantics of metadata records thus:

- Each element in a metadata record represents a single logical predicate.
- A _simple_ record is interpreted as the `and`-ing together (conjunction) of its sub-elements.
- A _complex_ record is interpreted as a shorthand for the `or`-ing together (disjunction) of several simple records, each represented by one term in the complex record's disjunctive normal form.

We do need, however, a way to make explicit not only the parenthetical groupings in the formula (_
<citation>
  <title>On the pulse of morning</title>
</citation>

is not, in general, merely "The title is On the pulse of morning" but something more like "(There is an object, described by this record, and) the title (_of the object described by this record_) is On the pulse of morning." That is, there is an implied existential quantifier inherent in the existence of a metadata record, and there is an implied argument for each metadata element, viewed as a logical function. Paraphrasing records at this level of detail would make it easier to capture the semantics of work and realization more clearly.

Represented in first-order predicate calculus, our example might look like this:
(E w)(E lb)(E cmsmcq)(E i1)(E i2)(E i3)
( work(w)
& title(w,"TEI Lite ...")
& name(lb,"Lou Burnard")
& name(cmsmcq,"C. M. Sperberg-McQueen")
& author(w,lb)
& author(w,cmsmcq)
& instance(w,i1) & form(i1,teilite) & url(i1,".../teiu5.tei")
& instance(w,i2) & form(i2,html) & url(i2,".../teiu5.html")
& instance(w,i3) & form(i3,html) & url(i3,".../teiu5.split.html")
& (i1 != i2)
& (i1 != i3) )

which we might paraphrase as:

- there are objects _w_, _lb_, _cmsmcq_, _i1_, _i2_, and _i3_, such that
  - _w_ is a work
  - the title of _w_ is TEI Lite ...
  - the full name of _lb_ is _Lou Burnard_
  - the full name of _cmsmcq_ is _C. M. Sperberg-McQueen_
  - (an) author of _w_ is _lb_
  - (an) author of _w_ is _cmsmcq_
  - _i1_ is an instance of _w_
  - the format of _i1_ is _teilite_
  - the URL of _i1_ is _.../teiu5.tei_
  - _i2_ is an instance of _w_
  - the format of _i2_ is _html_
  - the URL of _i2_ is _.../teiu5.html_
  - _i3_ is an instance of _w_
  - the format of _i3_ is _html_
  - the URL of _i3_ is _.../teiu5.split.html_
  - _i1_ and _i2_ are not the same object
  - _i1_ and _i3_ are not the same object

Note, in passing, that from the _