Title: DCMI DCSV: A syntax for writing a list of labelled values in a text string
Creator:
Creator:
Date Issued: 2000-07-28
Identifier:
Replaces:
Is Replaced By: Not Applicable
Latest version:
Status of document:
Description of document:
We describe a method for recording
lists of labelled values in a text string, called Dublin Core
Structured Values, with the label DCSV. The notation is intended
for structured information within attribute values in markup languages
such as HTML and XML. This is likely to be useful in recording
complex element values in metadata systems based on the qualified
Dublin Core model.
View metadata for this page:
http://purl.org/dc/documents/rec/dcmi-point-20000728.htm.rdf
NOTICE TO IMPLEMENTORS:
The syntax examples included in this document
are provisional, and are currently under review as part
of the DCMI work on coordinated syntax recommendations
for HTML, XML, and RDF. Those recommendations, together with minor
editorial changes to this document, can be expected
in the near future. Note that the use of "="
as a separator is a change from earlier versions of this
specification, which used ":" in the same position.
This change was considered desirable because the ":"
character occurs frequently within strings which are likely
to be used as names and values. Using "=" as a
separator reduces the need to escape characters in the data.
Introduction
It is highly desirable to be able to encode or serialise
structured values within a plain-text string. Some generic methods
are in common use. Inheriting conventions from natural languages,
commas (,) and semi-colons (;) are frequently used as list separators.
Similarly, comma-separated-values (CSV) and tab-separated-values
(TSV) are common export formats from spreadsheet and database
software, with line-feeds separating rows or tuples. Dots
(.) and dashes (-) are sometimes used to imply hierarchies, particularly
in thesaurus applications. The eXtensible Markup Language [XML]
provides a general solution, using tags contained within angle
brackets (<, >) to indicate the structure.
A number of named encoding schemes use punctuation characters
within a text string to indicate specific components. For example,
a colon (:) terminates the protocol label, and slashes (/), question-marks
(?), ampersands (&) and hashes (#) are used to separate other
fields in identifiers coded as URIs [URI].
Colons (:) separate specified labels from values within a field,
and semi-colons (;) separate fields within a personal description
according to a common implementation of vCard [vCard].
Hyphens are used to separate fields in a date according to the
W3C profile of ISO 8601 [W3C-DTF]. For some
schemes - vCard and W3C-DTF, for example - the punctuation indicates
a very formal structure to the value, and is expected to be parsed
automatically.
Element attributes in markup languages, such as HTML [HTML4]
and XML [XML], provide a position for recording
data. For some "empty" elements - such as the <IMG> and <META>
elements in HTML - attributes are the only place to hold
data. In other cases there may be good reasons to store data in
element attributes rather than element content. For example, fragments
of XML can be included in the <HEAD>
of an HTML document, and will be safely ignored by most client
software (e.g. browsers) provided the elements have no content.
This syntax trick can be used to embed XML-RDF encoded data safely
in current versions of HTML [RDF-in-HTML].
Future versions of HTML are expected to overcome these limitations
by allowing general XML documents to be included [XHTML].
Nevertheless, there is strong interest in using HTML <META>
elements to record data with more structure than normally
implied by a plain-text string, in particular to record metadata
according to the qualified Dublin Core model [Q-DC-HTML].
However, the use of element attributes for storing data has some
technical limitations:
- a given attribute may occur no more than once on an element
- values are constrained to a set of types which restricts the
permissible character-strings [HTML4] in
some contexts. Use of XML's angle-bracket delimiters (<, >)
and various other punctuation characters is only valid
in certain cases (i.e. when the content type is CDATA), and
is only generally reliable using escape-mechanisms (i.e. as
character entities). In general, strings containing these
characters are prone to misinterpretation by some user-agents
(e.g. browsers).
Note that there is no intrinsic way to indicate structure within
the values of attributes of HTML elements.
Our intention in this note is to define a compact human-readable
data-structuring method for HTML attribute values of content type
CDATA, avoiding certain punctuation characters which are prone
to cause difficulties in some encoding environments. The notation
should normally be used only when no other suitable scheme is
available. It is based on methods used and found successful elsewhere,
but is more generalised than the preceding standards. It may be
used as the basis of profiles designed to encode particular data
types [Profiles].
Structured Values - the DCSV scheme
To allow the recording of generic Structured Values, we
introduce the Dublin Core Structured Values (DCSV) scheme.
We distinguish between two types of substring - labels
and values, where a label is the name of the type of a
value, and a value is the data itself. Furthermore, we allow a
complete value to be disaggregated into a set of components,
each of which has its own label and value. A value that is composed
of components in this way is called a structured value.
Punctuation characters are used in recording a structured value
as follows:
- equals-signs (=) separate plain-text labels of structured
value-components from the values themselves
- semi-colons (;) separate (optionally labelled) value-components
within a list
- dots (.) indicate hierarchical structure in labels, if required.
The labels and the component values themselves each consist of
a text-string. The intention is that the label will be a word
or code corresponding to the name of the value-component. Labels
may be absent, in which case the entire sub-string delimited by
semi-colons (;) or the end of the string comprises a component
value.
The following patterns show how structured values may be recorded
in strings using DCSV:
"u1; u2; u3"
"cA=v1"
"cA=v1; cB.part1=v2; cB.part2=v3"
"cA=v1; u2; u3"
where u1, u2 and u3 are unlabelled components, cA and cB are the
labels of Structured Value components, part1 and part2 are
sub-components of cB, and v1, v2 and v3 are values of the components.
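As an illustration, the four patterns above can be produced from a list of (label, value) pairs. The following Python sketch is not part of the DCSV definition: the function name to_dcsv and the use of None to mark an absent label are assumptions made for this example, and the values are assumed to contain no characters that need escaping.

```python
def to_dcsv(components):
    """Join (label, value) pairs into a DCSV string.

    A label of None marks an unlabelled component (an assumption of
    this sketch, not something the DCSV scheme prescribes).
    """
    parts = []
    for label, value in components:
        # Labelled components are written as label=value,
        # unlabelled components as the bare value.
        parts.append(value if label is None else f"{label}={value}")
    # Components are separated by semi-colons; the space is for readability.
    return "; ".join(parts)


print(to_dcsv([(None, "u1"), (None, "u2"), (None, "u3")]))
print(to_dcsv([("cA", "v1")]))
print(to_dcsv([("cA", "v1"), ("cB.part1", "v2"), ("cB.part2", "v3")]))
print(to_dcsv([("cA", "v1"), (None, "u2"), (None, "u3")]))
```

Each call prints one of the four pattern strings listed above, in the same order.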
The use of specific punctuation characters in DCSV coded values
means that care must be exercised if these characters are to be
used directly within strings which comprise the content (either
labels or values) of the components. For DCSV, therefore, when
an equals-sign (=) or a semi-colon (;) is required within a
label or value, the character is escaped using a backslash, appearing
as \= or \;. There should be no ambiguity regarding the dot, full-stop
or period (.) within strings: when it is part of a label, a dot
indicates some hierarchy; when part of a value, it has the conventional
meaning for the context. This method of escaping special characters
largely preserves readability and the ability to enter DCSV coded
metadata values easily using a text-editor if required. Software
written to process DCSV coded values must make the necessary substitutions.
Note that in HTML the double-quote (") character can be
used directly within a CDATA attribute value if the full string
is delimited by single-quotes ('), but in XML the double-quote
must be encoded as a character entity in element attributes.
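The backslash substitutions described above might be sketched in Python as follows. The function names dcsv_escape and dcsv_unescape are hypothetical. Escaping the backslash itself is also an assumption: this note only names = and ; as characters to escape, although treating a backslash as escaping any following character matches the behaviour of the sample Perl code at the end of this note.

```python
import re


def dcsv_escape(text):
    """Prefix each '=', ';' and backslash with a backslash.

    Escaping the backslash itself is an assumption of this sketch.
    """
    return re.sub(r"([=;\\])", r"\\\1", text)


def dcsv_unescape(text):
    """Drop the backslash in front of any escaped character."""
    return re.sub(r"\\(.)", r"\1", text)


value = "E=mc2; really"
escaped = dcsv_escape(value)
print(escaped)
print(dcsv_unescape(escaped) == value)
```

Applying dcsv_escape to each label and value before joining them, and dcsv_unescape after splitting, keeps the separators unambiguous.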
As there is no explicit grouping mechanism, DCSV can only be
used to record a list. DCSV is only intended to be used for relatively
simple structured values, probably as an interim approach, pending
more general support for syntaxes such as XML which allow recording
of more complex hierarchical structures. However, it is more compact
than the XML equivalent, and is more easily read and constructed
in some common contexts, such as within HTML <META> elements.
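Although a DCSV string is a flat list, an application may choose to expand dotted labels into a nested structure once the string has been split into components. The Python sketch below shows one way this could be done. The function name nest_labels is hypothetical, and the sketch assumes that every component is labelled, that labels are unique, and that no label is used both as a leaf and as a parent.

```python
def nest_labels(pairs):
    """Expand dotted labels in (label, value) pairs into nested dicts.

    Assumes unique labels and no label that is both a leaf and a
    parent; a conflicting label would raise a TypeError here.
    """
    tree = {}
    for label, value in pairs:
        node = tree
        # Everything before the last dot names an enclosing level.
        *parents, leaf = label.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return tree


pairs = [
    ("name.given", "Renato"),
    ("name.family", "Iannella"),
    ("employer", "DSTC"),
]
print(nest_labels(pairs))
```

Here the two name.* components end up grouped under a single "name" entry, while "employer" stays at the top level.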
Parsing DCSV
A simple method can be used to parse metadata values recorded
according to the DCSV scheme. For a single value recorded using
the DCSV scheme:
- split the text-string into a list of substrings on any unescaped
semi-colons (;);
if no semi-colon is present, there is a single substring
- split each substring into its (label, value) pair at the first unescaped
equals-sign (=);
if no = is present, the label is empty
- within each label and value replace the escaped characters with the
actual character required.
A short Perl program which performs this parsing operation is
included at the end of this note.
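The three steps above can also be sketched in Python. This is an illustrative sketch only, not the reference implementation: the function names are hypothetical, and stripping surrounding whitespace from values as well as labels, and skipping empty components such as the one left by a trailing semi-colon, are assumptions of this example.

```python
import re


def split_unescaped(text, sep, maxsplit=-1):
    """Split text on sep, ignoring any sep that follows a backslash."""
    pieces, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the backslash and the escaped character together.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == sep and maxsplit != 0:
            pieces.append("".join(current))
            current = []
            maxsplit -= 1
        else:
            current.append(ch)
        i += 1
    pieces.append("".join(current))
    return pieces


def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_dcsv(text):
    """Parse a DCSV string into a list of (label, value) pairs."""
    pairs = []
    # Step 1: split on unescaped semi-colons.
    for component in split_unescaped(text, ";"):
        if not component.strip():
            continue  # assumption: ignore empty components
        # Step 2: split at the first unescaped equals-sign.
        parts = split_unescaped(component, "=", maxsplit=1)
        if len(parts) == 2:
            label, value = parts
        else:
            label, value = "", parts[0]  # no '=': the label is empty
        # Step 3: restore the escaped characters.
        pairs.append((unescape(label.strip()), unescape(value.strip())))
    return pairs


print(parse_dcsv("cA=v1; cB.part1=v2; cB.part2=v3"))
print(parse_dcsv(r"note=a\=b\; c; u2"))
```

Scanning character by character, rather than splitting with a regular expression, keeps an escaped backslash followed by a separator from being misread as an escaped separator.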
Examples
"name.given=Renato; name.family=Iannella; employer=DSTC; Contact=Level 7, Gehrmann Labs, The University of Queensland, Qld. 4072, Australia"
"rows=200; cols=450"
The DCSV scheme provides useful support for the representation
of complex values for metadata elements in HTML, while remaining
fully compatible with all commonly used tools (browsers, editors,
metadata harvesters). When used in this way, "DCMIDCSV"
or the name of one of its derivatives can be noted as the value
of the SCHEME attribute of the HTML <META> element, as shown in
the following examples of qualified Dublin Core metadata:
<META NAME="DC.Contributor"
SCHEME="DCMIDCSV"
CONTENT="name.given=Eric; name.family=Miller; employer=OCLC;
height=170 cm">
<META NAME="DC.Format" SCHEME="DCMIDCSV"
CONTENT="rows=200; cols=450">
<META NAME="DC.Coverage.spatial" SCHEME="DCMIBOX"
CONTENT="name=Western Australia; northlimit=-13.5; southlimit=-35.5;
westlimit=112.5; eastlimit=129">
<META NAME="DC.Coverage.spatial" SCHEME="DCMIPOINT"
CONTENT="name=Bridgnorth, Shropshire, U.K.; east=372000;
north=293000; units=m; projection=U.K. National Grid">
<META NAME="DC.Date" SCHEME="DCMIPERIOD"
CONTENT="name=Perth International Arts Festival, 2000;
start=2000-01-26; end=2000-02-20;">
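Elements like those above could be generated programmatically. The Python sketch below is one possible approach, using the standard library function html.escape so that characters such as the double-quote are written as character entities inside the attribute values. The helper name meta_element is hypothetical, and the CONTENT string is assumed to be DCSV-encoded already.

```python
from html import escape


def meta_element(name, scheme, content):
    """Build an HTML <META> element carrying a DCSV-coded value.

    content is assumed to be a finished DCSV string; only HTML-level
    escaping of the attribute values is applied here.
    """
    return '<META NAME="{}" SCHEME="{}" CONTENT="{}">'.format(
        escape(name, quote=True),
        escape(scheme, quote=True),
        escape(content, quote=True),
    )


print(meta_element("DC.Format", "DCMIDCSV", "rows=200; cols=450"))
print(meta_element("DC.Title", "DCMIDCSV", 'title=The "best" guide'))
```

Note that the two levels of escaping are independent: the backslash escapes belong to DCSV, while the character entities belong to HTML and are removed by the HTML parser before any DCSV processing takes place.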
Sample Code for parsing DCSV coded values
The following Perl program reads a DCSV coded string entered on
stdin, and prints a formatted version of the structured result.
This code is provided for demonstration purposes only and contains
no error-checking.
#!/usr/local/bin/perl
use strict;
use warnings;

print "Enter string to be parsed:\n";
my $string = join('', <STDIN>);
# remove the trailing newline so it does not end up in the last value
chomp $string;
print "\nString to be parsed is [$string]\n";

# First escape % characters
$string =~ s/%/"%".unpack('C',"%")."%"/eg;
# Next change \ escaped characters to %d% where d is the character's
# ASCII code
$string =~ s/\\(.)/"%".unpack('C',$1)."%"/eg;
print "\nEscaped string is [$string]\n";

# Now split the string into components
my @components = split(/;/, $string);
print "\nComponents:\n";
foreach my $component (@components) {
    # skip empty components, e.g. those produced by a doubled semi-colon
    next if $component =~ /^\s*$/;
    my ($label, $value) = split(/=/, $component, 2);
    # if there is no = copy contents of $label into $value and
    # empty $label
    if (!defined $value) {
        $value = $label;
        $label = '';
    }
    # strip leading and trailing whitespace from label string
    $label =~ s/^\s+|\s+$//g;
    # convert % escaped characters back in label string
    $label =~ s/%(\d+)%/pack('C',$1)/eg;
    # convert % escaped characters back in value string
    $value =~ s/%(\d+)%/pack('C',$1)/eg;
    print "Label [$label] has value [$value]\n";
}
Acknowledgments
John Kunze encouraged us to write up this proposal formally.
Kim Covil wrote the Perl code. Eric Miller nagged regarding the
overlap with XML. Steve Tolkin convinced us to switch to =.
References
[DCMI]
Dublin Core Metadata Initiative, OCLC, Dublin Ohio.
http://purl.org/dc/
[HTML4]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, 1999, HTML 4.01 Specification
http://www.w3.org/TR/html40/
[Profiles]
DCMI Box - specification of the spatial limits of a place,
and methods for encoding this in a text string http://purl.org/dc/documents/dcmi-box
DCMI Point - a point location in space, and methods for encoding
this in a text string http://purl.org/dc/documents/dcmi-point
DCMI Period - specification of the limits of a time interval,
and methods for encoding this in a text string http://purl.org/dc/documents/dcmi-period
[Q-DC-HTML]
S. Cox, 1999, Recording qualified Dublin Core metadata in HTML
http://www.ned.dem.csiro.au/research/visualisation/metadata/qdchtml/NOTE-QDCHTML-19991103.html
[Q-DC-RDF]
E. Miller, P. Miller, D. Brickley, 1999, Guidance on expressing
the Dublin Core within the Resource Description Framework (RDF)
http://www.ukoln.ac.uk/interop-focus/activities/dc/datamodel/
[RDF-in-HTML]
This uses the most compact form of XML-RDF [RDF-syntax], in which
all the data occurs as attribute values. In this form several
important capabilities are not available, such as multiple (repeated)
values. For an example, see Figure 5 in S.J.D. Cox and K.D. Covil,
"A web-based geological information system using metadata",
Proc. 3rd IEEE META-DATA Conference, http://computer.org/conferen/proceed/meta/1999/papers/7/cox_covil.html
[URI]
T. Berners-Lee, R. Fielding, L. Masinter, 1998, Uniform Resource
Identifiers (URI): Generic Syntax, RFC2396 http://info.internet.isi.edu/in-notes/rfc/files/rfc2396.txt
T. Berners-Lee, L. Masinter, M. McCahill, 1994, Uniform Resource
Locators, RFC1738 http://info.internet.isi.edu/in-notes/rfc/files/rfc1738.txt
T. Berners-Lee, 1994, Universal Resource Identifiers in WWW: A
Unifying Syntax for the Expression of Names and Addresses of Objects
on the Network as used in the World-Wide Web, RFC1630 http://info.internet.isi.edu/in-notes/rfc/files/rfc1630.txt
[vCard]
F. Dawson, T. Howes, vCard MIME Directory Profile RFC2426 http://www.imc.org/rfc2426
[W3C-DTF]
M. Wolf, C. Wicksteed, 1997, Date and Time Formats, http://www.w3.org/TR/NOTE-datetime
[XHTML]
Steven Pemberton and many others, 1999, XHTML 1.0: The Extensible
HyperText Markup Language http://www.w3.org/TR/WD-html-in-xml/
See also Dave Raggett, HyperText Markup Language Activity Statement
http://www.w3.org/MarkUp/Activity.html
[XML]
Extensible Markup Language http://www.w3.org/XML/