Title: A syntax for writing a list of labelled values in a text string
Creator:
Creator:
Date Issued: 1999-04-30
Identifier:
Replaces: Not Applicable
Is Replaced By: Not Applicable
Latest version: Not Applicable
Status of document: This document is a NOTE made
available by the Dublin Core Metadata Initiative Directorate
for discussion only. The publication of a NOTE by the
Dublin Core implies no endorsement of any kind.
Description of document: We describe a method for recording lists of
labelled values in a text string, called Dublin Core Structured
Values, with the label DCSV. The notation is intended for
structured information within attribute value strings in markup languages
such as HTML and XML. This is likely to be useful in recording
complex element values in metadata systems based on the qualified
Dublin Core model.
Document metadata:
- Introduction
- Structured Values - the DCSV scheme
- Parsing DCSV
- Examples
- Sample Code for parsing DCSV coded values
- Acknowledgments
- References
1. Introduction
Element attributes in markup languages, such as HTML [HTML4]
and XML [XML], provide an alternative position
to the element content for recording data. For some "empty"
elements - such as the <IMG> and <META>
elements in HTML - attributes are the only place to
hold data. In other cases there may be good reasons to prefer
element attributes to element content for data. For example, fragments
of XML can be included in the <HEAD> of an HTML
document, and will be safely ignored by most client software (e.g.
browsers) provided the elements have no content. This syntax trick
can be used to embed XML-RDF encoded data safely in current versions
of HTML [RDF-in-HTML].
Future versions of HTML are expected to overcome these limitations
by allowing general XML documents to be included [XHTML].
Nevertheless, there is strong interest in using HTML <META>
elements to record data with more structure than normally
implied by a plain-text string, in particular to record metadata
according to the qualified Dublin Core model [Q-DC-HTML].
However, the use of element attributes for storing data has a
number of technical limitations:
- attributes may occur no more than once
- the data must consist of a text-string including no double-quotes
(")
These features mean that there is no built-in way to indicate
the structure of the data.
Nevertheless, in certain applications, it is highly desirable
to be able to encode structured values within a plain-text string.
Some generic methods are in common use. Inheriting conventions
from natural languages, commas (,) and semi-colons (;) are frequently
used as list separators. Similarly, comma-separated-values (CSV)
and tab-separated-values (TSV) are common export formats from
spreadsheet and database software. Dots (.) and dashes (-) are
sometimes used to imply hierarchies, particularly in thesaurus
applications.
A number of specific encoding schemes use punctuation characters
within the text string to indicate structure. For example, colons
(:) terminate protocol labels, and double slashes (//) act as
separators for identifiers coded as URIs [URI].
Colons (:) separate specified labels from values within a field,
and semi-colons (;) separate fields within a personal description
according to vCard [vCard]. Hyphens are among
the characters used to separate fields in a date according
to ISO8601 [ISO8601]. For some schemes
- vCard and ISO8601, for example - the punctuation indicates a
very formal structure to the value, and is expected to be parsed
automatically.
Our intention in this note is to define a generic, self-describing
data-structuring method for text-strings, to be used when no other
suitable scheme is available. This is based on methods used and
found successful elsewhere, vCard in particular, but is more generalised
than the preceding standards.
2. Structured Values - the DCSV scheme
To allow the recording of generic Structured Values, we
introduce the Dublin Core Structured Values (DCSV) scheme.
We distinguish between two types of substring - labels
and values, where a label is the name of the type of a
value, and a value is the data itself. Furthermore, we allow a
complete value to be disaggregated into a set of components,
each of which has its own label and value. A value that is comprised
of components in this way is called a structured value.
Punctuation characters are used in recording a structured value
as follows:
- colons (:) separate plain-text labels of structured value-components
from the values themselves
- semi-colons (;) separate (optionally labelled) value-components
within a list
- dots (.) indicate hierarchical structure in labels, if required.
The labels and the component values themselves each consist of
a text-string. The intention is that the label will be a word
or code corresponding to the name of the value-component. Labels
may be absent, in which case the entire sub-string delimited by
semi-colons (;) or the end of the string comprises the component
value.
The following patterns show how structured values may be recorded
in strings using DCSV:
"u1; u2; u3"
"cA:v1"
"cA:v1; cB.part1:v2; cB.part2:v3"
"cA:v1; u2; u3"
where u1, u2 and u3 are
unlabelled components, cA and cB are the
labels of Structured Value components, part1 and part2 are
sub-components of cB, and v1, v2 and
v3 are values of the components.
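The dotted labels in the third pattern can be expanded into a nested
structure. The following is a minimal sketch in Python (rather than the
Perl used in Section 5). The function name nest_labels and the choice of
nested dictionaries are assumptions made for illustration and are not
part of the DCSV scheme. It takes (label, value) pairs that have already
been separated, and splits each label on unescaped dots only.

```python
import re

def nest_labels(pairs):
    """Build nested dicts from (label, value) pairs with dotted labels.

    Hypothetical helper: DCSV does not prescribe any in-memory structure.
    """
    tree = {}
    for label, value in pairs:
        # Each part is a run of escaped characters or of characters
        # other than a bare dot or backslash, so "\." never splits.
        raw_parts = re.findall(r"(?:\\.|[^.\\])+", label) or [""]
        # Remove the backslash escapes once the label has been split.
        parts = [re.sub(r"\\(.)", r"\1", p) for p in raw_parts]
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return tree

pairs = [("cA", "v1"), ("cB.part1", "v2"), ("cB.part2", "v3")]
print(nest_labels(pairs))
# {'cA': 'v1', 'cB': {'part1': 'v2', 'part2': 'v3'}}
```

The sketch does not handle a label that is used both as a leaf and as a
parent (for example cB together with cB.part1); an application would
need its own rule for that case.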
The use of the specific punctuation characters in DCSV coded
values means that these characters cannot be used directly within
strings which comprise the content (either labels or values) of
the components. For DCSV, therefore, when a period, full-stop
or dot (.), a colon (:), or a semi-colon (;) is required within
the value, the character is escaped using a backslash, appearing
as \. \: \; respectively, and the backslash itself is
escaped similarly as \\ . This method of escaping special
characters largely preserves readability and the ability to enter
DCSV coded metadata values easily using a text-editor if required.
Software written to process DCSV coded values must make the necessary
substitutions.
Note that the double-quote (") character is a separate, generic case:
it cannot be used directly within HTML or XML element attributes.
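The escaping rule can be illustrated with a short sketch, written in
Python rather than the Perl used in Section 5. The function names
escape_dcsv and build_dcsv are hypothetical, and joining components with
a semi-colon followed by a space is an assumption based on the patterns
shown earlier. Labels are passed through unchanged, on the assumption
that any dots in them are intended as hierarchy separators.

```python
import re

def escape_dcsv(text):
    """Backslash-escape the four characters reserved by DCSV: \\ . : ;"""
    return re.sub(r"([\\.:;])", r"\\\1", text)

def build_dcsv(components):
    """Join (label, value) pairs into one DCSV string.

    An empty label produces an unlabelled component.
    """
    parts = []
    for label, value in components:
        value = escape_dcsv(value)
        parts.append(f"{label}:{value}" if label else value)
    return "; ".join(parts)

print(escape_dcsv("Qld. 4072"))
# Qld\. 4072
print(build_dcsv([("employer", "DSTC"), ("", "ratio 3:1")]))
# employer:DSTC; ratio 3\:1
```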
3. Parsing DCSV
A simple method can be used to parse metadata values recorded
according to the DCSV scheme. For a single value recorded using
the DCSV scheme:
- split the text-string into a list of substrings on any unescaped
semi-colons (;);
if no semi-colon is present, there is a single substring
- split each substring into its (label, value) pair at the first
unescaped colon (:);
if no colon is present, the label is empty
- within each value replace the escaped characters with the
actual character required.
A short Perl program which performs this parsing operation is
included at the end of this note.
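The three steps can also be sketched in Python, as an alternative
illustration to the Perl program in Section 5. The names split_unescaped
and parse_dcsv are hypothetical. Splitting each component at the first
unescaped colon only, trimming surrounding whitespace, and skipping empty
components are assumptions of this sketch rather than requirements stated
in this note. Escapes are left in place in labels so that a later step
can still tell an escaped dot from a hierarchy separator.

```python
import re

def split_unescaped(text, sep, maxsplit=0):
    """Split text on sep, treating backslash-escaped characters as data."""
    pieces, current, splits = [], [], 0
    # Each token is either an escaped pair such as "\;" or a single character.
    for match in re.finditer(r"\\.|.", text, flags=re.S):
        token = match.group(0)
        if token == sep and (maxsplit == 0 or splits < maxsplit):
            pieces.append("".join(current))
            current = []
            splits += 1
        else:
            current.append(token)
    pieces.append("".join(current))
    return pieces

def parse_dcsv(text):
    """Return a list of (label, value) pairs; label is '' when absent."""
    result = []
    for component in split_unescaped(text, ";"):       # step 1
        if not component.strip():
            continue
        fields = split_unescaped(component, ":", maxsplit=1)  # step 2
        if len(fields) == 2:
            label, value = fields
        else:
            label, value = "", fields[0]
        value = re.sub(r"\\(.)", r"\1", value, flags=re.S)    # step 3
        result.append((label.strip(), value.strip()))
    return result

print(parse_dcsv("rows:200; cols:450"))
# [('rows', '200'), ('cols', '450')]
```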
4. Examples
"name.given:Renato; name.family:Iannella; employer:DSTC;
Contact:Level 7, Gehrmann Labs, The University of Queensland,
Qld\. 4072, Australia"
"rows:200; cols:450"
The DCSV scheme adds most of the components required for the
representation of the qualified DC model in HTML [Q-DC-HTML][Q-DC-RDF],
while remaining fully compatible with the HTML 4 [HTML4]
standard. It thus supports a recording method for qualified Dublin
Core, compatible with tools which rely on HTML (browsers, metadata
harvesters), but with a clear route for migrating relatively rich
information into fully structured notations when appropriate.
In this context, DCSV is noted as the value of the SCHEME
attribute of the HTML <META> element, as shown in the following
examples:
<META NAME="DC.Creator" SCHEME="DCSV"
CONTENT="name.given:Simon; name.family:Cox;
employer:CSIRO; height:177 cm">
<META NAME="DC.Language" SCHEME="RFC1766" CONTENT="en-AU">
<META NAME="DC.Contributor" SCHEME="vCard" CONTENT="fn:Simon
Cox; org:CSIRO">
<META NAME="DC.Date" SCHEME="ISO8601" CONTENT="1999-04-30">
<META NAME="DC.Relation" SCHEME="URL"
CONTENT="http://www.foo.bar/explication.html">
<META NAME="DC.Format.media" SCHEME="IMT" CONTENT="image/gif">
<META NAME="DC.Format.size" CONTENT="14 kB">
<META NAME="DC.Format.size" SCHEME="DCSV" CONTENT="rows:200; cols:450">
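A harvester needs to pick out the <META> elements whose SCHEME is DCSV
before applying any DCSV parsing. The following is a minimal sketch in
Python (rather than the Perl used in Section 5) using the standard
library html.parser module. The class name DCSVMetaCollector is
hypothetical, and comparing the scheme name without regard to case is an
assumption of this sketch.

```python
from html.parser import HTMLParser

class DCSVMetaCollector(HTMLParser):
    """Collect (NAME, CONTENT) from META elements with SCHEME="DCSV"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("scheme") or "").upper() == "DCSV":
            self.found.append(
                (attributes.get("name"), attributes.get("content"))
            )

html = """<META NAME="DC.Format.size" CONTENT="14 kB">
<META NAME="DC.Format.size" SCHEME="DCSV" CONTENT="rows:200; cols:450">"""

collector = DCSVMetaCollector()
collector.feed(html)
print(collector.found)
# [('DC.Format.size', 'rows:200; cols:450')]
```

Each collected CONTENT string can then be handed to whatever DCSV parser
the application uses.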
5. Sample Code for parsing DCSV coded values
The following Perl program reads a DCSV coded string entered
on stdin, and prints a formatted version of the structured result.
This code is provided for demonstration purposes only and contains
no error-checking.
#!/usr/local/bin/perl
use strict;
use warnings;
print "Enter string to be parsed:\n";
my $string = join('', <STDIN>);
chomp $string;
print "\nString to be parsed is [$string]\n";
# Protect literal % characters and \ escaped characters in a single
# pass, rewriting each as %d% where d is the character's ascii code
$string =~ s/\\(.)|(%)/'%' . ord(defined $1 ? $1 : $2) . '%'/seg;
print "\nEscaped string is [$string]\n";
# Now split the string into components
my @components = split(/;/, $string);
print "\nComponents:\n";
foreach my $component (@components) {
    my ($name, $value) = split(/:/, $component, 2);
    # if there is no : $value is undefined, so copy $name
    # into $value and empty $name
    if (!defined $value) {
        $value = defined $name ? $name : '';
        $name = '';
    }
    # strip leading and trailing whitespace from name string
    $name =~ s/^\s+|\s+$//g;
    # convert % escaped characters back in value string
    $value =~ s/%(\d+)%/chr($1)/eg;
    print "Name [$name] has value [$value]\n";
}
6. Acknowledgments
John Kunze encouraged us to write up this proposal formally.
Kim Covil wrote the Perl code.
7. References
- [HTML4]
- Dave Raggett, Arnaud Le Hors, Ian Jacobs, 1998, HTML 4.0
Specification http://www.w3.org/TR/REC-html40/
- [ISO8601]
- M. Wolf and C. Wicksteed, 1997, Date and Time Formats,
http://www.w3.org/TR/NOTE-datetime
- [Q-DC-HTML]
- S. Cox 1999 Recording qualified Dublin Core metadata in
HTML http://www.agcrc.csiro.au/projects/3018CO/metadata/qdchtml/NOTE-QDCHTML-19991103.html
- [Q-DC-RDF]
- E. Miller, P. Miller, D. Brickley, 1999. Guidance on expressing
the Dublin Core within the Resource Description Framework (RDF)
http://www.ukoln.ac.uk/interop-focus/activities/dc/datamodel/
- [RDF-in-HTML]
- This uses the most compact form of XML-RDF [RDF-syntax], in
which all the data occurs as attribute values. In this form
several important capabilities are not available, such as multiple
(repeated) values. For an example, see Figure 5 in S.J.D. Cox
and K.D. Covil, "A web-based geological information system using
metadata", Proc. 3rd IEEE META-DATA Conference, http://computer.org/conferen/proceed/meta/1999/papers/7/cox_covil.html
- [URI]
- T. Berners-Lee, R. Fielding, L. Masinter, 1998 Uniform Resource
Identifiers (URI): Generic Syntax RFC2396 http://info.internet.isi.edu/in-notes/rfc/files/rfc2396.txt
T. Berners-Lee, L. Masinter, and M. McCahill, 1994 Uniform
Resource Locators, RFC1738 http://info.internet.isi.edu/in-notes/rfc/files/rfc1738.txt.
T. Berners-Lee, 1994 Universal Resource Identifiers in WWW:
A Unifying Syntax for the Expression of Names and Addresses
of Objects on the Network as used in the World-Wide Web,
RFC1630 http://info.internet.isi.edu/in-notes/rfc/files/rfc1630.txt.
- [vCard]
- F. Dawson, T. Howes, 1998 vCard MIME Directory Profile
RFC2426 http://info.internet.isi.edu/in-notes/rfc/files/rfc2426.txt
- [XHTML]
- Steven Pemberton and many others, 1999 XHTML 1.0: The Extensible
HyperText Markup Language http://www.w3.org/TR/WD-html-in-xml/
See also Dave Raggett, HyperText Markup Language Activity
Statement http://www.w3.org/MarkUp/Activity.html
- [XML]
- Extensible Markup Language http://www.w3.org/XML/
Last modified 1999-04-30