DCMI DCSV: A syntax for writing a list of labelled values in a text string

Creator: Simon Cox
Creator: Renato Iannella
Contributor: Andy Powell
Contributor: Andrew Wilson
Date Issued: 2005-07-25
Is Replaced By:
Latest version:
Status of document: This is a DCMI Proposed Recommendation. From 2005-07-25 to 2005-10-10, the status of this revision was incorrectly shown as "DCMI Recommendation".
Description of document: This document describes a method for recording lists of labelled values in a text string, called Dublin Core™ Structured Values, with the label DCSV. The notation is intended for structured information within attribute values in Dublin Core™ metadata descriptions.

Table of Contents

  1. Introduction
  2. Structured Values - the DCSV encoding scheme
  3. Parsing DCSV
  4. Sample Code for parsing DCSV coded values
  5. Glossary
  6. Acknowledgments
  7. References

1. Introduction

It is often highly desirable to be able to encode or serialise values within a plain-text string. Some generic methods are in common use. Inheriting conventions from natural languages, commas (,) and semi-colons (;) are frequently used as list separators. Similarly, comma-separated-values (CSV) and tab-separated-values (TSV) are common export formats from spreadsheet and database software, with line-feeds separating rows or tuples. Dots (.) and dashes (-) are sometimes used to imply hierarchies, particularly in thesaurus applications. The eXtensible Markup Language [XML] provides one general solution, using tags contained within angle brackets (<, >) to indicate the structure.

2. Structured Values - the DCSV encoding scheme

To allow the recording of generic Structured Values, we introduce the Dublin Core™ Structured Values (DCSV) encoding scheme.

This document describes a particular method for structuring simple string values within a DCMI description. Here, we distinguish between two types of substring within a value string: componentLabels and componentValues. A componentLabel is the name of the type of a componentValue, and a componentValue is the data itself. Furthermore, we allow a complete value string to be disaggregated into a set of components, each of which has its own componentLabel and componentValue. A value that is composed of components in this way is called a structured value.

Punctuation characters are used in recording a structured value as follows:

  • equals-signs (=) separate plain-text componentLabels of structured value-components from the componentValues themselves;
  • semi-colons (;) separate (optionally labelled) value-components within a list;
  • dots (.) indicate hierarchical structure in componentLabels, if required.

The componentLabels and the componentValues themselves each consist of a text-string. The intention is that the componentLabel will be a word or code corresponding to the name of the value-component. componentLabels may be absent, in which case the entire sub-string delimited by semi-colons (;) or the end of the string comprises a componentValue.

The following patterns show how structured values may be recorded in strings using DCSV:

"u1; u2; u3"
"cA=v1; cB.part1=v2; cB.part2=v3"
"cA=v1; u2; u3"

where u1, u2 and u3 are unlabelled components, cA and cB are the componentLabels of Structured Value components, part1 and part2 are sub-components of cB, and v1, v2 and v3 are componentValues of specific components.
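One way to read the dotted componentLabels in the second pattern is as paths into a nested structure. The following Python sketch illustrates that reading only; the function name `nest_components` and the choice to collect unlabelled values under an empty-string key are assumptions made for this example and are not defined by DCSV.

```python
def nest_components(components):
    """Group (label, value) pairs into nested dicts keyed by dotted label parts.

    Unlabelled values (empty label) are collected in a list under the key "".
    This layout is an illustrative assumption, not part of the DCSV scheme.
    """
    tree = {}
    for label, value in components:
        if not label:
            tree.setdefault("", []).append(value)
            continue
        parts = label.split(".")
        node = tree
        # Walk (or create) one dict level per leading label part.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"label {label!r} conflicts with an earlier value")
            node = child
        node[parts[-1]] = value
    return tree


pairs = [("cA", "v1"), ("cB.part1", "v2"), ("cB.part2", "v3")]
print(nest_components(pairs))
# {'cA': 'v1', 'cB': {'part1': 'v2', 'part2': 'v3'}}
```

The flat list of pairs remains the primary form; the nested view is a convenience for consumers that want to treat part1 and part2 as sub-components of cB.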

The use of specific punctuation characters in DCSV coded values means that care must be exercised if these characters are to be used directly within strings which comprise the content (either componentLabels or componentValues) of the components. For DCSV, therefore, when an equals-sign (=) or a semi-colon (;) is required within the componentValue, the characters are escaped using a backslash, appearing as \= and \;. There should be no ambiguity regarding the dot, full-stop or period (.) within strings: when it is part of a componentLabel, a dot indicates some hierarchy; when part of a componentValue, it has the conventional meaning for the context. This method of escaping special characters largely preserves readability and the ability to enter DCSV coded metadata value strings easily using a text-editor if required. Software written to process DCSV coded values must make the necessary substitutions.
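The escaping rule can be sketched from the writer's side as follows. This is a minimal Python illustration, not a reference implementation: the names `escape_dcsv` and `encode_dcsv` are hypothetical, and escaping the backslash itself is an assumption added so that the output can be decoded unambiguously (the text above names only = and ;).

```python
def escape_dcsv(text):
    """Backslash-escape the characters that DCSV treats as punctuation."""
    out = []
    for ch in text:
        # "=" and ";" come from the scheme; escaping "\" is an assumption.
        if ch in "=;\\":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def encode_dcsv(components):
    """Join (label, value) pairs into one DCSV string.

    An empty label produces an unlabelled component. Labels are written
    as given; only values are escaped in this sketch.
    """
    parts = []
    for label, value in components:
        escaped = escape_dcsv(value)
        parts.append(f"{label}={escaped}" if label else escaped)
    return "; ".join(parts)


print(encode_dcsv([("name", "a=b; c"), ("", "plain")]))
# name=a\=b\; c; plain
```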

As there is no explicit grouping mechanism, DCSV can only be used to record a list. DCSV is only intended to be used for relatively simple structured values.

3. Parsing DCSV

A simple method can be used to parse metadata values recorded according to the DCSV scheme. For a single value recorded using the DCSV scheme:

  1. split the value string into a list of substrings on any unescaped semi-colons (;);
    if no semi-colon is present, there is a single substring;
  2. split each substring into its (componentLabel,componentValue) on any unescaped equals-signs (=);
    if no equals (=) is present, the componentLabel is empty;
  3. within each componentValue replace the escaped characters with the actual character required.
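The three steps above can be sketched in Python as shown below. This is a hedged illustration rather than a definitive parser: the function names are hypothetical, and trimming surrounding whitespace, skipping empty components, and splitting only on the first unescaped equals-sign are assumptions made for the example.

```python
def split_unescaped(text, sep, maxsplit=-1):
    """Split text on sep, ignoring any sep that follows a backslash."""
    pieces, current, i, splits = [], [], 0, 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape pair intact; it is resolved in unescape().
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == sep and (maxsplit < 0 or splits < maxsplit):
            pieces.append("".join(current))
            current = []
            splits += 1
        else:
            current.append(ch)
        i += 1
    pieces.append("".join(current))
    return pieces


def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_dcsv(value_string):
    """Return a list of (componentLabel, componentValue) pairs."""
    components = []
    for substring in split_unescaped(value_string, ";"):      # step 1
        if not substring.strip():
            continue  # assumption: ignore empty components
        parts = split_unescaped(substring, "=", maxsplit=1)   # step 2
        if len(parts) == 2:
            label, value = parts
        else:
            label, value = "", parts[0]  # no "=": the label is empty
        # Assumption: surrounding whitespace is not significant.
        components.append((label.strip(), unescape(value.strip())))  # step 3
    return components


print(parse_dcsv("cA=v1; u2; u3"))
# [('cA', 'v1'), ('', 'u2'), ('', 'u3')]
```

Scanning character by character keeps each escape pair together, so an escaped semi-colon or equals-sign is never mistaken for a separator.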

4. Sample Code for parsing DCSV coded values

The following Perl program reads a DCSV coded string entered on stdin, and prints a formatted version of the structured result. This code is provided for demonstration purposes only and contains no error-checking.

use strict;

print "Enter string to be parsed:\n";
my $string = join('', <STDIN>);
chomp $string;
print "\nString to be parsed is [$string]\n";

# First escape % characters
$string =~ s/%/"%".unpack('C',"%")."%"/eg;

# Next change \ escaped characters to %d% where d is the character's ASCII code
$string =~ s/\\(.)/"%".unpack('C',$1)."%"/eg;
print "\nEscaped string is [$string]\n";

# Now split the string into components
my @components = split(/;/, $string);
print "\nComponents:\n";
foreach my $component (@components) {
    my ($label, $value) = split(/=/, $component, 2);
    # if there is no = copy contents of $label into $value and empty $label
    if (!defined $value) {
        $value = $label;
        $label = '';
    }
    # strip leading and trailing whitespace from label string
    $label =~ s/^\s+|\s+$//g;
    # convert % escaped characters back in label string
    $label =~ s/%(\d+)%/pack('C',$1)/eg;
    # convert % escaped characters back in value string
    $value =~ s/%(\d+)%/pack('C',$1)/eg;
    print "Component Label [$label] has Component Value [$value]\n";
}

5. Glossary

This document uses the following terms:

component
A component is one of a set of one or more text strings structured according to the DCSV scheme that together make up a statement.
componentLabel
The name of the type of a componentValue.
componentValue
The value string representing the value specified by the label.
description
A description is made up of one or more statements about one, and only one, resource.
element
Within DCMI, element is typically used as a synonym for property. However, it should be noted that the word element is also commonly used to refer to a structural markup component within an XML document.
element refinement
An element refinement is a property of a resource that shares the meaning of a particular DCMI property but with narrower semantics. Since element refinements are properties, they can be used in metadata descriptions independently of the properties they refine. In DCMI practice, an element refinement refines just one parent DCMI property.
encoding scheme
Encoding scheme is the generic name for vocabulary encoding scheme and syntax encoding scheme.
encoding scheme URI
The generic name for a vocabulary encoding scheme URI or a syntax encoding scheme URI.
marked-up text
A string that contains HTML, XML or other markup (for example TeX) and that is associated with the value of a property.
property
A property is a specific aspect, characteristic, attribute, or relation used to describe resources.
property URI
A property URI is a URI reference that identifies a single property.
qualifier
Qualifier was the generic name used for the terms that are now usually referred to specifically as element refinements or encoding schemes.
related description
A related description is a description of a resource that is related to the resource being described.
resource
A resource is anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, concepts and bound books in a library can also be considered resources.
resource URI
A resource URI is a URI reference that identifies a single resource.
statement
A statement is made up of a property URI (a URI reference that identifies a property), zero or one value URI (a URI reference that identifies a value of the property), zero or one vocabulary encoding scheme URI (a URI reference that identifies the class of the value) and zero or more value representations of the value.
structured value
Structured value is the generic name for the following:
  • A value string that contains machine-parsable component parts (and which has an associated syntax encoding scheme that indicates how the component parts are encoded within the string).
  • Some marked-up text.
  • A related description.
syntax encoding scheme
A syntax encoding scheme indicates that the value string is formatted in accordance with a formal notation, such as "2000-01-01" as the standard expression of a date.
syntax encoding scheme URI
A syntax encoding scheme URI is a URI reference that identifies a syntax encoding scheme. For all DCMI recommended encoding schemes, the URI reference is constructed by concatenating the name of the encoding scheme with the URI.
term
The generic name for a property (i.e. element or element refinement), vocabulary encoding scheme, syntax encoding scheme or concept taken from a controlled vocabulary (concept space).
term URI
The generic name for a URI reference that identifies a term.
value
A value is the physical or conceptual entity that is associated with a property when it is used to describe a resource.
value URI
A value URI is a URI reference that identifies the value of a property.
value representation
A value representation is a surrogate for (i.e. a representation of) the value.
value string
A value string is a simple string that represents the value of a property. In general, a value string should not contain any marked-up text.
vocabulary encoding scheme
A vocabulary encoding scheme is a class that indicates that the value of a property is taken from a controlled vocabulary (or concept-space), such as the Library of Congress Subject Headings.
vocabulary encoding scheme URI
A vocabulary encoding scheme URI is a URI reference that identifies a vocabulary encoding scheme. For all DCMI recommended encoding schemes, the URI reference is constructed by concatenating the name of the encoding scheme with the URI.

6. Acknowledgments

John Kunze encouraged the original authors to write up this proposal formally. Kim Covil wrote the Perl code. Eric Miller nagged regarding the overlap with XML. Steve Tolkin convinced the original authors to switch to =.

7. References

Dublin Core™ Metadata Initiative, OCLC, Dublin Ohio.

A. Powell, M. Nilsson, A. Naeve, P. Johnston, 2004, DCMI Abstract Model

Dave Raggett, Arnaud Le Hors, Ian Jacobs, 1999, HTML 4.01 Specification

DCMI Box - specification of the spatial limits of a place, and methods for encoding this in a text string
DCMI Point - a point location in space, and methods for encoding this in a text string
DCMI Period - specification of the limits of a time interval, and methods for encoding this in a text string

S. Cox, 2000, Recording qualified Dublin Core™ metadata in HTML

This uses the most compact form of XML-RDF [RDF-syntax], in which all the data occurs as attribute values. In this form several important capabilities are not available, such as multiple (repeated) values. For an example, see Figure 5 in S.J.D. Cox and K.D. Covil, "A web-based geological information system using metadata", Proc. 3rd IEEE META-DATA Conference

D. Beckett, 2004, RDF/XML Syntax Specification (Revised)

T. Berners-Lee, R. Fielding, L. Masinter, 1998, Uniform Resource Identifiers (URI): Generic Syntax, RFC 2396
T. Berners-Lee, L. Masinter, M. McCahill, 1994, Uniform Resource Locators, RFC 1738
T. Berners-Lee, 1994, Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web, RFC 1630

F. Dawson, T. Howes, vCard MIME Directory Profile, RFC 2426

M. Wolf, C. Wicksteed, 1997, Date and Time Formats

Steven Pemberton and many others, 1999, XHTML 1.0: The Extensible HyperText Markup Language
See also Dave Raggett, HyperText Markup Language Activity Statement

Extensible Markup Language