Description and Cataloging of XML-Data Schemas

Creators: Andrew Layman
Date Issued: 1999-05-25
This Version: http://dublincore.org/specifications/dublin-core/1999/05/25/dc-xml-data-description/
Latest Version: https://www.dublincore.org/specifications/dublin-core/dc-xml-data-description/
Replaces: https://www.dublincore.org/specifications/dublin-core/dc-xml-data-description/1999-05-25/
Status: note
Description: The Dublin Core Metadata Element Set is a collection of fifteen elements designed to categorize and catalog electronic resources. The elements are sufficiently general that they are suitable for categorizing and describing XML-Data schemas. This paper proposes a schema, based on Dublin Core elements, and then gives guidelines for its application in XML-Data schemas.

    But first, a sample: A trivial schema categorized according to the elements described here might look like:

    <Schema xmlns='urn:schemas-microsoft-com:xml-data'
              xmlns:dt = 'urn:schemas-microsoft-com:datatypes'
              >
     <description>
      <catalogInformation xmlns='urn:schemas-biztalk-org/biztalk/catalog' >
          <title>Schema for messages about a tortilla factory.</title>
          <creator>
              <FreeText>Andrew Layman</FreeText>
              <PersonReference>mailto:andrewl@microsoft.com</PersonReference>
          </creator>
          <subject>
              <SubjectReference>
                  urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118
              </SubjectReference>
              <keyword>Tortilla Manufacturing</keyword>
          </subject>
          <type>
              <ResourceReference>urn:schemas-microsoft-com:xml-data
              </ResourceReference>
          </type>
          <type>
              <ResourceReference>urn:schemas-biztalk-org/biztalk-0.8.xml
              </ResourceReference>
          </type>
      </catalogInformation>
     </description>
    </Schema>
    

    The Schema

    This defines a small set of tags, each based on the corresponding generic Dublin Core element, specialized for the purpose of cataloging schemas.

    <!-- Schema for Schema Catalog, version 1, based on Dublin Core,
    designed 5/13/99 by Andrew Layman. -->
     
    <Schema xmlns='urn:schemas-microsoft-com:xml-data'
              xmlns:dt = 'urn:schemas-microsoft-com:datatypes'
          >
    
      <description>This defines a small set of tags, each based on the corresponding
    generic Dublin Core element, but here specialized for the purpose of cataloging schemas.
    See http://purl.org/dc for more information on Dublin Core.
      </description>
    
      <ElementType name="PersonReference" model="closed" content="textOnly" >
        <description>A URI reference to a person, which may be a natural person,
    a corporation, or any other legal person.</description>
        <datatype dt:type="URI" />
      </ElementType>
    
      <ElementType name="SubjectReference" model="closed" content="textOnly" >
        <description>A URI reference to an identifier from a controlled
    taxonomy.</description>
        <datatype dt:type="URI" />
      </ElementType>
    
      <ElementType name="ResourceReference" model="closed" content="textOnly" >
        <datatype dt:type="URI" />
      </ElementType>
    
      <ElementType name="FreeText" model="open" content="mixed" >
        <description>Mixed text and markup. Must be well-formed if
    marked-up.</description>
        <attribute type="xml:lang" />
      </ElementType>
    
      <ElementType name="keyword" model="closed" content="textOnly" >
        <description>A keyword used for categorization, with a human-language
    meaning but not drawn from a controlled vocabulary identified by a URI. We
    recommend using only lower-case text.</description>
        <attribute type="xml:lang" />
      </ElementType>
    
      <ElementType name="identifier" model="open" content="eltOnly" >
        <description>The formal identifier(s) of the schema. If the schema
    described by this catalog information is not the enclosing document, place the
    URI of the described document here.</description>
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="ResourceReference" />
        </group>
      </ElementType>
    
      <ElementType name="title" model="closed" content="textOnly" >
        <description>The descriptive title of this schema.</description>
        <attribute type="xml:lang" />
      </ElementType>
    
      <ElementType name="creator" model="open" content="eltOnly" >
        <description>The person or organization primarily responsible for
        creating the intellectual content of this schema. </description>
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="PersonReference" />
            <element type="FreeText" />
        </group>
      </ElementType>
    
      <ElementType name="subject" model="open" content="eltOnly" >
        <description>The topic of the schema. Typically, subject will be
        expressed as keywords or phrases that describe the subject or
        content of the schema. The use of controlled vocabularies and
        formal classification schemes is encouraged.</description>
        <group order="one" minOccurs="0" maxOccurs="*">
            <element type="SubjectReference" />
            <element type="keyword" />
        </group>
      </ElementType>
    
      <ElementType name="description" model="open" content="mixed" >
        
        <description> A textual description of the content of the resource,
        including abstracts in the case of document-like objects or content
        descriptions in the case of visual resources.</description>
        <attribute type="xml:lang" />
        
      </ElementType>
    
      <ElementType name="publisher" model="open" content="eltOnly" >
        
        <description>The entity responsible for making the resource
        available in its present form, such as a publishing house, a
        university department, or a corporate entity.</description>
    
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="PersonReference" />
            <element type="FreeText" />
        </group>
        
      </ElementType>
    
      <ElementType name="contributor" model="open" content="eltOnly" >
        
        <description>A person or organization not specified in a Creator
        element who has made significant intellectual contributions to the
        resource but whose contribution is secondary to any person or
        organization specified in a Creator element (for example, editor,
        transcriber, and illustrator).</description>
    
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="PersonReference" />
            <element type="FreeText" />
        </group>
        
      </ElementType>
    
      <ElementType name="type" model="open" content="eltOnly" >
        
        <description>Classification of this schema, not from the
    standpoint of its subject matter, but rather of its characteristics.
    Specifically, if the described schema conforms to certain specifications,
    the URI of those specifications should appear here.</description>
    
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="ResourceReference" />
            <element type="FreeText" />
        </group>
    
      </ElementType>
    
      <ElementType name="format" model="open" content="eltOnly" >
        
        <description>Media or format (e.g. MIME type) of the
    resource.</description>
    
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="ResourceReference" />
            <element type="FreeText" />
        </group>
    
      </ElementType>
    
      <ElementType name="source" model="open" content="eltOnly" >
        
        <description>Information about a second resource from which the
        present resource is derived. While it is generally recommended that
        elements contain information about the present resource only, this
        element may contain a date, creator, format, identifier, or other
        metadata for the second resource when it is considered important for
        discovery of the present resource; recommended best practice is to
        use the Relation element instead. For example, it is possible to
        use a Source date of 1603 in a description of a 1996 film adaptation
        of a Shakespearean play, but it is preferred instead to use Relation
        "IsBasedOn" with a reference to a separate resource whose
        description contains a Date of 1603. Source is not applicable if the
        present resource is in its original form.</description>
    
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="ResourceReference" />
            <element type="FreeText" />
        </group>
    
      </ElementType>
    
      <ElementType name="language" model="closed" content="textOnly" >
        
        <description>The language of the intellectual content of the
        resource. When used, the content of this field must coincide
        with RFC 1766 [Tags for the Identification of Languages,
        http://ds.internic.net/rfc/rfc1766.txt ]; examples include en, de,
        es, fi, fr, ja, th, and zh.</description>
        
      </ElementType>
    
      <ElementType name="rights" model="open" content="eltOnly" >
        
        <description>A rights management statement, an identifier that
        links to a rights management statement, or an identifier that links
        to a service providing information about rights management for the
        resource.</description>
    
        <group order="one" minOccurs="1" maxOccurs="*">
            <element type="ResourceReference" />
            <element type="FreeText" />
        </group>
        
      </ElementType>
    
      <ElementType name="catalogInformation" model="open" content="eltOnly" >
        
        <description>
    
        A small set of tags, each based on the corresponding generic Dublin
    Core element, but here specialized for the purpose of cataloging schemas.
    See http://purl.org/dc for more information on Dublin Core.
        Many tags may be repeated at this level, and also allow multiple
    occurrences of their subelements. The intended usage is that distinct items
    (for example distinct creators) should be expressed with separate elements,
    while alternative forms of reference to the same item (for example,
    several ways of referring to the same creator) should be expressed
    as alternate subelements.
    
        </description>
    
        <group order="seq">
            <element type="identifier" minOccurs="0" maxOccurs="1" />
            <element type="title" minOccurs="0" maxOccurs="*" />
            <element type="creator" minOccurs="1" maxOccurs="*" />
            <element type="subject" minOccurs="1" maxOccurs="*" />
            <element type="description" minOccurs="0" maxOccurs="*" />
            <element type="publisher" minOccurs="0" maxOccurs="*" />
            <element type="contributor" minOccurs="0" maxOccurs="*" />
            <element type="type" minOccurs="1" maxOccurs="*" />
            <element type="format" minOccurs="0" maxOccurs="*" />
            <element type="source" minOccurs="0" maxOccurs="*" />
            <element type="language" minOccurs="0" maxOccurs="*" />
            <element type="rights" minOccurs="0" maxOccurs="*" />
        </group>
        
      </ElementType>
    
    </Schema>
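
    The usage rule stated in the catalogInformation description (separate elements for distinct items, alternate subelements for alternative references to the same item) can be illustrated with a minimal sketch. The names, the mailto address and the example.com domain below are hypothetical and chosen only for illustration.

    ```xml
    <!-- Hypothetical values throughout; only the element names come from the schema. -->
    <catalogInformation xmlns='urn:schemas-biztalk-org/biztalk/catalog'>
        <!-- First creator: two alternative references to the same person. -->
        <creator>
            <FreeText>Jane Example</FreeText>
            <PersonReference>mailto:jane@example.com</PersonReference>
        </creator>
        <!-- Second, distinct creator: expressed as its own creator element. -->
        <creator>
            <FreeText>Example Foods, Inc.</FreeText>
        </creator>
        <subject>
            <keyword>tortilla manufacturing</keyword>
        </subject>
        <type>
            <ResourceReference>urn:schemas-microsoft-com:xml-data</ResourceReference>
        </type>
    </catalogInformation>
    ```

    The subject and type elements appear only because the catalogInformation sequence requires at least one of each.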
    

    How to Catalog a Schema

    To understand how the schema is used, first understand the role of the URI-based reference elements: PersonReference, SubjectReference and ResourceReference. These occur within elements whose content model is very flexible in Dublin Core. For example, the creator element may contain free text, or it may contain a reference to a specific company or individual via some well-known identification system. If the name of the person or company is free text, meaning that it does not come from a controlled identifier system, it goes within a FreeText element. Controlled identifiers go within elements such as PersonReference.

    An example of a controlled identifier is a D-U-N-S number, defined by the Dun and Bradstreet corporation. In BizTalk catalog information, all controlled identifiers are expressed as Uniform Resource Identifiers (URIs). For example, suppose that Dun and Bradstreet assigned a URI beginning with 'urn:www-dnb-com:dunsno' to every number it issues. A creator element might then look like:

    <creator>
        <PersonReference>urn:www-dnb-com:dunsno#123456789012345</PersonReference>
    </creator>
    

    Similarly, subject taxonomies will be defined by many authorities. Each taxonomy should have a corresponding URI namespace, used as follows:

    <subject>
        <SubjectReference>
            urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118
        </SubjectReference>
    </subject>
    

    Subject categorizations also allow keywords from uncontrolled vocabularies, so the following might be seen:

    <subject>
        <SubjectReference>
            urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118
        </SubjectReference>
        <keyword>Tortilla Manufacturing</keyword>
    </subject>
    

    Type categorization identifies a schema as conforming to the XML-Data and BizTalk specifications:

    <type>
         <ResourceReference>urn:schemas-microsoft-com:xml-data
          </ResourceReference>
    </type>
    <type>
          <ResourceReference>urn:schemas-biztalk-org/biztalk-0.8.xml
          </ResourceReference>
    </type>
    

    A simple schema for a tortilla factory, catalogued according to the elements described here, might look like:

    <Schema xmlns='urn:schemas-microsoft-com:xml-data'
              xmlns:dt = 'urn:schemas-microsoft-com:datatypes'
              >
     <description>
      <catalogInformation xmlns='urn:schemas-biztalk-org/biztalk/catalog' >
          <title>Schema for messages about a tortilla factory.</title>
          <creator>
              <FreeText>Andrew Layman</FreeText>
              <PersonReference>mailto:andrewl@microsoft.com</PersonReference>
          </creator>
          <subject>
              <SubjectReference>
                  urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118
              </SubjectReference>
              <keyword>Tortilla Manufacturing</keyword>
          </subject>
          <type>
              <ResourceReference>urn:schemas-microsoft-com:xml-data
              </ResourceReference>
          </type>
          <type>
              <ResourceReference>urn:schemas-biztalk-org/biztalk-0.8.xml
              </ResourceReference>
          </type>
      </catalogInformation>
     </description>
    </Schema>
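
    The sample above uses only title and the three required elements (creator, subject and type). As a rough sketch of how the optional elements might be filled in, the fragment below adds identifier, description, publisher, format, language and rights, in the order the catalogInformation sequence requires. All values (the urn:example-org identifier, the names, the rights text) are hypothetical placeholders, not taken from any real catalog.

    ```xml
    <!-- Hypothetical values; element names and their order follow the schema. -->
    <catalogInformation xmlns='urn:schemas-biztalk-org/biztalk/catalog'>
        <identifier>
            <ResourceReference>urn:example-org:schemas/tortilla-orders</ResourceReference>
        </identifier>
        <title xml:lang="en">Schema for tortilla orders.</title>
        <creator>
            <FreeText>Jane Example</FreeText>
        </creator>
        <subject>
            <SubjectReference>urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118</SubjectReference>
        </subject>
        <description xml:lang="en">Messages exchanged when ordering tortillas.</description>
        <publisher>
            <FreeText>Example Foods, Inc.</FreeText>
        </publisher>
        <type>
            <ResourceReference>urn:schemas-microsoft-com:xml-data</ResourceReference>
        </type>
        <format>
            <FreeText>text/xml</FreeText>
        </format>
        <language>en</language>
        <rights>
            <FreeText>Rights statement goes here.</FreeText>
        </rights>
    </catalogInformation>
    ```

    The identifier element is meant for the case where the catalog information describes a schema other than the enclosing document.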
    

    We do not expect that the majority of such catalog entries will be created by hand. More likely, and much more reliably, they will be created by people using tools. For example, we picture a web page that allows one to categorize a schema by filling in fields for creator, subject, etc. For each field, the user can enter free-text keywords, but can also pick URIs from lists (e.g., selecting "NAICS" as a SubjectReference taxonomy and then picking a specific classification from the supplied list). Similarly, we expect that searching will be mediated by tools that are designed for this task.

    Taxonomies and other Identifier Systems

    BizTalk encourages classifying schemas according to taxonomies created by standards bodies and private companies. Where these taxonomies have been incorporated into URI schemes, meaning that the taxonomy has an established URI, BizTalk will use these. Where taxonomies exist but are not yet incorporated into a URI scheme, BizTalk will supply URI prefixes of the general form "urn:taxonomy-biztalk-org:". The remainder of the URI will identify the specific taxonomy and the particular identifier.

    For example, the US Census Bureau provides the North American Industry Classification System (NAICS), a series of identifying numbers for business categories, but the Census Bureau does not currently define a URI scheme for these numbers. Within BizTalk, these can be referenced in a form like "urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118".
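
    As an illustration of this URI form, the NAICS reference can be read in three parts. The split shown in the comment is one interpretation of the example, not a normative grammar; the text does not define the syntax beyond the prefix.

    ```xml
    <!--
      urn:taxonomy-biztalk-org:          prefix supplied by BizTalk
      www.census.gov/epcd/naics/1997     identifies the taxonomy (assumed reading)
      #3118                              the identifier within that taxonomy (assumed reading)
    -->
    <subject>
        <SubjectReference>urn:taxonomy-biztalk-org:www.census.gov/epcd/naics/1997#3118</SubjectReference>
    </subject>
    ```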

    Similarly, there are several systems for identifying persons and companies. BizTalk encourages use of these systems, either via URIs supplied by the system authority or, when no such URIs yet exist, by means of URIs provided by BizTalk.

    BizTalk will publish an initial list of approved taxonomies and identifier systems, and will work to expand the list over time.


    Dublin Core
    See http://purl.org/dc/elements/1.1/. See also the proposed 1.1 definition at http://www.dstc.edu.au/RDU/DCAC/version11.html.

    RFC2396
    IETF (Internet Engineering Task Force) RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, eds. T. Berners-Lee, R. Fielding, L. Masinter. August 1998.

    NAICS
    See http://www.census.gov/epcd/www/naics.html.