A Syntax for writing a list of labelled values in a text string

Title:

A Syntax for Writing a List of Labelled Values in a Text String

Creator:
Creator:
Date Issued:
1999-04-30
Identifier:
Replaces:
Not Applicable
Is Replaced By:
Not Applicable
Latest version:
Status of document:
This is a DCMI Note.
Description of document: This note describes a method for recording lists of labelled values in a text string, called Dublin Core Structured Values and labelled DCSV. The notation is intended for recording structured information within attribute value strings in markup languages such as HTML and XML. It is likely to be useful for recording complex element values in metadata systems based on the qualified Dublin Core model.

  1. Introduction
  2. Structured Values - the DCSV scheme
  3. Parsing DCSV
  4. Examples
  5. Sample Code for parsing DCSV coded values
  6. Acknowledgments
  7. References

1. Introduction

Element attributes in markup languages, such as HTML [HTML4] and XML [XML], provide an alternative position to the element content for recording data. For some "empty" elements - such as the <IMG> and <META> elements in HTML - attributes are the only place to hold data. In other cases there may be good reasons to prefer element attributes to element content for data. For example, fragments of XML can be included in the <HEAD> of an HTML document, and will be safely ignored by most client software (e.g. browsers) provided the elements have no content. This syntax trick can be used to embed XML-RDF encoded data safely in current versions of HTML [RDF-in-HTML].

Future versions of HTML are expected to overcome these limitations by allowing general XML documents to be included [XHTML]. Nevertheless, there is strong interest in using HTML <META> elements to record data with more structure than normally implied by a plain-text string, in particular to record metadata according to the qualified Dublin Core model [Q-DC-HTML].

However, the use of element attributes for storing data has a number of technical limitations:

  1. an attribute may occur no more than once on any one element
  2. the data must consist of a text-string including no double-quotes (")

These features mean that there is no built-in way to indicate the structure of the data.

Nevertheless, in certain applications, it is highly desirable to be able to encode structured values within a plain-text string. Some generic methods are in common use. Inheriting conventions from natural languages, commas (,) and semi-colons (;) are frequently used as list separators. Similarly, comma-separated-values (CSV) and tab-separated-values (TSV) are common export formats from spreadsheet and database software. Dots (.) and dashes (-) are sometimes used to imply hierarchies, particularly in thesaurus applications.

A number of specific encoding schemes use punctuation characters within the text string to indicate structure. For example, colons (:) terminate protocol labels, and double slashes (//) act as separators for identifiers coded as URIs [URI]. Colons (:) separate specified labels from values within a field, and semi-colons (;) separate fields within a personal description according to vCard [vCard]. Hyphens are among the characters used to separate fields in a date according to ISO 8601 [ISO8601]. For some schemes - vCard and ISO 8601, for example - the punctuation indicates a very formal structure to the value, and is expected to be parsed automatically.

Our intention in this note is to define a generic, self-describing data-structuring method for text-strings, to be used when no other suitable scheme is available. This is based on methods used and found successful elsewhere, vCard in particular, but is more generalised than the preceding standards.

2. Structured Values - the DCSV scheme

To allow the recording of generic Structured Values, we introduce the Dublin Core Structured Values (DCSV) scheme.

We distinguish between two types of substring - labels and values, where a label is the name of the type of a value, and a value is the data itself. Furthermore, we allow a complete value to be disaggregated into a set of components, each of which has its own label and value. A value that is composed of components in this way is called a structured value.

Punctuation characters are used in recording a structured value as follows:

  • colons (:) separate plain-text labels of structured value-components from the values themselves
  • semi-colons (;) separate (optionally labelled) value-components within a list
  • dots (.) indicate hierarchical structure in labels, if required.

The labels and the component values themselves each consist of a text-string. The intention is that the label will be a word or code corresponding to the name of the value-component. Labels may be absent, in which case the entire sub-string delimited by semi-colons (;) or the end of the string comprises the component value.

The following patterns show how structured values may be recorded in strings using DCSV:

"u1; u2; u3"
"cA:v1"
"cA:v1; cB.part1:v2; cB.part2:v3"
"cA:v1; u2; u3"

where u1, u2 and u3 are unlabelled components, cA and cB are the labels of Structured Value components, part1 and part2 are sub-components of cB, and v1, v2 and v3 are values of the components.
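As an illustration of how dotted labels can express hierarchy, the following Python sketch turns a list of (label, value) pairs into nested dictionaries. The function name nest and the dictionary representation are assumptions made for this example only; the note does not prescribe any in-memory form. The sketch assumes every component is labelled, that labels contain no escaped dots, and that no label is used both for a plain value and as the parent of sub-components.

```python
def nest(pairs):
    """Build nested dicts from (label, value) pairs with dotted labels."""
    tree = {}
    for label, value in pairs:
        node = tree
        # Everything before the last dot names a parent component
        *parents, leaf = label.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return tree


# The third pattern above, already separated into labels and values
pairs = [("cA", "v1"), ("cB.part1", "v2"), ("cB.part2", "v3")]
print(nest(pairs))
# prints {'cA': 'v1', 'cB': {'part1': 'v2', 'part2': 'v3'}}
```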

The use of the specific punctuation characters in DCSV coded values means that these characters cannot be used directly within strings which comprise the content (either labels or values) of the components. For DCSV, therefore, when a period, full-stop or dot (.), a colon (:), or a semi-colon (;) is required within the value, the character is escaped using a backslash, appearing as \. \: \; respectively, and the backslash itself is escaped similarly, as \\. This method of escaping special characters largely preserves readability and the ability to enter DCSV coded metadata values easily using a text-editor if required. Software written to process DCSV coded values must make the necessary substitutions.
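The escaping rule can be sketched in Python as follows. The names escape and encode_dcsv are hypothetical helpers, not part of the scheme. The sketch assumes that labels are plain words needing no escaping, so only values are escaped and dots in labels are left to express hierarchy.

```python
# Characters that must be preceded by a backslash inside a value
SPECIALS = "\\.:;"


def escape(text):
    """Prefix backslash, dot, colon and semi-colon with a backslash."""
    return "".join("\\" + ch if ch in SPECIALS else ch for ch in text)


def encode_dcsv(components):
    """Join (label, value) pairs into one DCSV string.

    A label of None or "" produces an unlabelled component.
    """
    parts = []
    for label, value in components:
        if label:
            parts.append(label + ":" + escape(value))
        else:
            parts.append(escape(value))
    return "; ".join(parts)


print(encode_dcsv([("employer", "DSTC"), ("Contact", "Qld. 4072; Australia")]))
# prints employer:DSTC; Contact:Qld\. 4072\; Australia
```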

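Note that the double-quote (") character is a generic case: whatever scheme is in use, it cannot be used directly within a double-quoted HTML or XML attribute value, and must instead be written as a character reference such as &quot;.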

3. Parsing DCSV

A simple method can be used to parse metadata values recorded according to the DCSV scheme. For a single value recorded using the DCSV scheme:

  1. split the text-string into a list of substrings on any unescaped semi-colons (;);
    if no semi-colon is present, there is a single substring
  2. split each substring into its (label,value) on any unescaped colons (:);
    if no colon is present, the label is empty
  3. within each value replace the escaped characters with the actual character required.

A short Perl program which performs this parsing operation is included at the end of this note.
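The three steps can also be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, whitespace around labels and values is trimmed, and a component containing more than one unescaped colon is split at the first one only.

```python
def split_unescaped(text, sep):
    """Split text on sep, ignoring any sep that follows a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair intact for now
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_dcsv(text):
    """Return a list of (label, value) pairs; label is "" when absent."""
    pairs = []
    for component in split_unescaped(text, ";"):      # step 1
        fields = split_unescaped(component, ":")      # step 2
        if len(fields) == 1:
            label, value = "", fields[0]
        else:
            label, value = fields[0], ":".join(fields[1:])
        pairs.append((label.strip(), unescape(value.strip())))  # step 3
    return pairs


print(parse_dcsv("cA:v1; u2; u3"))
# prints [('cA', 'v1'), ('', 'u2'), ('', 'u3')]
```

Scanning character by character, rather than splitting on a regular expression with a look-behind, keeps an escaped backslash followed by a real separator (for example \\;) from being misread as an escaped separator.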

4. Examples

"name.given:Renato; name.family:Iannella; employer:DSTC; 
Contact:Level 7, Gehrmann Labs, The University of Queensland, 
Qld\. 4072, Australia"
"rows:200; cols:450"

The DCSV scheme adds most of the components required for the representation of the qualified DC model in HTML [Q-DC-HTML][Q-DC-RDF], while remaining fully compatible with the HTML 4 standard [HTML4]. It thus supports a recording method for qualified Dublin Core that is compatible with tools which rely on HTML (browsers, metadata harvesters), but with a clear route for migrating relatively rich information into fully structured notations when appropriate. In this context, DCSV is noted as the value of the SCHEME attribute of the HTML <META> element, as shown in the following examples:

<META NAME="DC.Creator" SCHEME="DCSV"
      CONTENT="name.given:Simon; name.family:Cox; employer:CSIRO; height:177 cm">
<META NAME="DC.Language" SCHEME="RFC1766" CONTENT="en-AU">
<META NAME="DC.Contributor" SCHEME="vCard" CONTENT="fn:Simon Cox; org:CSIRO">
<META NAME="DC.Date" SCHEME="ISO8601" CONTENT="1999-04-30">
<META NAME="DC.Relation" SCHEME="URL"
      CONTENT="http://www.foo.bar/explication.html">
<META NAME="DC.Format.media" SCHEME="IMT" CONTENT="image/gif">
<META NAME="DC.Format.size" CONTENT="14 kB">
<META NAME="DC.Format.size" SCHEME="DCSV" CONTENT="rows:200; cols:450">
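
A harvester needs to pick out the elements whose SCHEME is DCSV before parsing their CONTENT as described in section 3. The following Python sketch shows one possible way to do this with the standard library's html.parser module. The class name DCSVMetaCollector is hypothetical, and collapsing line breaks inside a wrapped CONTENT value into single spaces is an assumption of this example, not a rule of the scheme.

```python
from html.parser import HTMLParser


class DCSVMetaCollector(HTMLParser):
    """Collect (NAME, CONTENT) from META elements whose SCHEME is DCSV."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # HTMLParser delivers lower-cased names
        if (attributes.get("scheme") or "").upper() == "DCSV":
            # Collapse line breaks used to wrap a long CONTENT value
            content = " ".join((attributes.get("content") or "").split())
            self.found.append((attributes.get("name"), content))


page = '''<META NAME="DC.Creator" SCHEME="DCSV"
      CONTENT="name.given:Simon; name.family:Cox;
employer:CSIRO; height:177 cm">
<META NAME="DC.Date" SCHEME="ISO8601" CONTENT="1999-04-30">
<META NAME="DC.Format.size" SCHEME="DCSV" CONTENT="rows:200; cols:450">'''

collector = DCSVMetaCollector()
collector.feed(page)
collector.close()
for name, content in collector.found:
    print(name, "->", content)
```

Elements recorded with other schemes, such as the ISO8601 date here, are skipped and can be handed to whatever parser suits that scheme.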

5. Sample Code for parsing DCSV coded values

The following Perl program reads a DCSV coded string entered on stdin, and prints a formatted version of the structured result. This code is provided for demonstration purposes only and contains no error-checking.

#!/usr/local/bin/perl
use strict;
use warnings;

print "Enter string to be parsed:\n";

my $string = join('', <STDIN>);
chomp $string;

print "\nString to be parsed is [$string]\n";

# First escape % characters
$string =~ s/%/"%".unpack('C',"%")."%"/eg;

# Next change \ escaped characters to %d% where d is the
# character's ascii code
$string =~ s/\\(.)/"%".unpack('C',$1)."%"/eg;

print "\nEscaped string is [$string]\n";

# Now split the string into components
my @components = split(/;/, $string);

print "\nComponents:\n";
foreach my $component (@components) {
    my ($name, $value) = split(/:/, $component, 2);

    # if there is no : $value is undefined so copy $name
    # into $value and empty $name
    if (!defined $value) {
        $value = defined $name ? $name : '';
        $name = '';
    }

    # strip whitespace from name string
    $name =~ s/^\s*(\S+)\s*$/$1/;

    # convert % escaped characters back in value string
    $value =~ s/%(\d+)%/pack('C',$1)/eg;

    print "Name [$name] has value [$value]\n";
}


6. Acknowledgments

John Kunze encouraged us to write up this proposal formally. Kim Covil wrote the Perl code.


7. References

[HTML4]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, 1998. HTML 4.0 Specification. http://www.w3.org/TR/REC-html40/
[ISO8601]
M. Wolf and C. Wicksteed, 1997. Date and Time Formats. http://www.w3.org/TR/NOTE-datetime
[Q-DC-HTML]
S. Cox, 1999. Recording qualified Dublin Core metadata in HTML. http://www.agcrc.csiro.au/projects/3018CO/metadata/qdchtml/NOTE-QDCHTML-19991103.html
[Q-DC-RDF]
E. Miller, P. Miller, D. Brickley, 1999. Guidance on expressing the Dublin Core within the Resource Description Framework (RDF). http://www.ukoln.ac.uk/interop-focus/activities/dc/datamodel/
[RDF-in-HTML]
This uses the most compact form of XML-RDF [RDF-syntax], in which all the data occurs as attribute values. In this form several important capabilities are not available, such as multiple (repeated) values. For an example, see Figure 5 in S.J.D. Cox and K.D. Covil, "A web-based geological information system using metadata", Proc. 3rd IEEE META-DATA Conference, http://computer.org/conferen/proceed/meta/1999/papers/7/cox_covil.html
[URI]
T. Berners-Lee, R. Fielding, L. Masinter, 1998. Uniform Resource Identifiers (URI): Generic Syntax. RFC2396. http://info.internet.isi.edu/in-notes/rfc/files/rfc2396.txt
T. Berners-Lee, L. Masinter, and M. McCahill, 1994. Uniform Resource Locators. RFC1738. http://info.internet.isi.edu/in-notes/rfc/files/rfc1738.txt
T. Berners-Lee, 1994. Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web. RFC1630. http://info.internet.isi.edu/in-notes/rfc/files/rfc1630.txt
[vCard]
F. Dawson, T. Howes, 1998. vCard MIME Directory Profile. RFC2426. http://info.internet.isi.edu/in-notes/rfc/files/rfc2426.txt
[XHTML]
Steven Pemberton and many others, 1999. XHTML 1.0: The Extensible HyperText Markup Language. http://www.w3.org/TR/WD-html-in-xml/
See also Dave Raggett, HyperText Markup Language Activity Statement. http://www.w3.org/MarkUp/Activity.html
[XML]
Extensible Markup Language http://www.w3.org/XML/