----------------------------------------------------------------------
Date: Thu, 23 Mar 2006 17:21:12 -0000
Reply-To: DCMI Architecture Group <DC-ARCHITECTURE@JISCMAIL.AC.UK>
Sender: DCMI Architecture Group <DC-ARCHITECTURE@JISCMAIL.AC.UK>
From: Andy Powell <andy.powell@EDUSERV.ORG.UK>
Subject: FW: Domains and ranges of DC properties
To: DC-ARCHITECTURE@JISCMAIL.AC.UK

This topic was discussed briefly at the f2f meeting in Madrid - and has
trundled on slowly behind the scenes since then, as part of the DC-RDF
Taskforce activity.

The proposed list of domain and range classes is in the DC-Architecture
wiki at

http://dublincore.org/architecturewiki/DCPropertyDomainsRanges

Given that this discussion is largely about semantics, we agreed at
today's Usage Board teleconf that discussion about this list would move
into the remit of the UB. However, that doesn't mean that we aren't
interested in people's views. So, if you have comments on the above
document, please share them here.

(As an aside, we're going to attempt to bring back DC-RDF Taskforce
discussion onto the main list - since the current use of a sub-list
seems to have fragmented some of our conversations!).

----------------------------------------------------------------------
Date: Thu, 23 Mar 2006 23:44:55 +0000
From: Rachel Heery <r.heery@UKOLN.AC.UK>
Subject: Re: FW: Domains and ranges of DC properties
To: DC-ARCHITECTURE@JISCMAIL.AC.UK

On Thu, 23 Mar 2006, Andy Powell wrote:
> http://dublincore.org/architecturewiki/DCPropertyDomainsRanges
>
> Given that this discussion is largely about semantics, we agreed at
> todays Usage Board teleconf, that discussion about this list would move
> into the remit of the UB. However, that doesn't mean that we aren't
> interested in people's views. So, if you have comments on the above
> document, please share them here.

I think it would be useful to clarify to this list the underlying purpose
of this exercise and the benefits. Or is this being done in a spirit of
'enabling' unknown future benefits??

As I understand it the aim of the exercise is to make precise distinctions
about the "value space" of a property in machine-processable definitions.
And that this is being done to indicate what can be inferred when a
particular property is used in a triple.

Can someone perhaps give some use cases of how this would be beneficial??

My concern would be that the accuracy of the values in manually crafted DC
metadata would not support inference. DC metadata, as I see it, is
intended to provide 'cheap and cheerful' minimal level descriptive
metadata. Would any application want to start making inferences over such
metadata?

On the other hand if the intention is to be able to merge data from
different sources on a 'partial understanding' basis I would be more
convinced of the benefit...

----------------------------------------------------------------------
Date: Fri, 24 Mar 2006 13:24:36 +0100
From: Mikael Nilsson <mini@NADA.KTH.SE>
Subject: Re: FW: Domains and ranges of DC properties
To: DC-ARCHITECTURE@JISCMAIL.AC.UK

On Thu, 2006-03-23 at 23:44 +0000, Rachel Heery wrote:
> I think it would be useful to clarify to this list the underlying purpose
> of this exercise and the benefits.

That's very true; we do need to make clear to ourselves and the world
why this is important.

> Or is this being done in a spirit of
> 'enabling' unknown future benefits??

No, certainly not. The benefits are tangible and somewhat understood by
at least a few...

> As I understand it the aim of the exercise is to make precise distinctions
> about the "value space" of a property in machine-processable definitions.
> And that this is being done to indicate what can be inferred when a
> particular property is used in a triple.

It is not a case of trying to restrict the definitions, but rather to
make clear what the definitions *already* imply. Some of the definitions
are less than clearly formulated, for the benefit of no one... I'm
pretty sure the changes introduced are not meant to be semantic changes
at all.

The origin of these considerations is the DCMI Abstract Model, which
finally took a position on the question of "What is a 'value' anyway?".
We now know the exact distinction DCMI makes between a value and a value
string. Again, I don't believe that the DCAM really introduces anything
new, but it made it impossible to keep escaping the issue of "values" by
blurring distinctions. Some of that blurring is present in the
definitions of the DCMI terms themselves, and we're trying to fix that.

> Can someone perhaps give some use cases of how this would be beneficial??

A reasonable request, again - even though it should be obvious that
clarity and consistency are virtues in their own right.

It seems clear that the main beneficiary at this point in time is the
Semantic Web community. The reason is simply that this community does
the most advanced machine-processing of metadata.

If metadata is crafted and consumed purely by humans, precise ranges and
domains are generally unnecessary, as the meaning is (often) clear given
context, knowledge of the domain, and a good deal of good-will.

This is not always true, mind you. A non-native English speaker who
doesn't know the context might have *huge* problems, for example. And there's
no way a machine can help in this case.

However, when machines generate and consume metadata we cannot assume
them to be that smart. You already know this, of course. But look at the
following:

http://swoogle.umbc.edu/index.php?option=com_frontpage&service=relation&queryType=rel_swd_instance_range_p2c&searchString=http%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2Fcreator 

which is a list of the Classes used for values of dc:creator (not a list
of values).
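
For a rough idea of how such a list might be computed over harvested
RDF, here's a sketch using Python's rdflib; the file name is just a
placeholder for whatever data has been crawled:

    from rdflib import Graph

    g = Graph()
    g.parse("harvested.rdf")   # hypothetical dump of crawled metadata

    # List the distinct classes used for values of dc:creator.
    q = """
        PREFIX dc:  <http://purl.org/dc/elements/1.1/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT DISTINCT ?cls WHERE {
            ?doc dc:creator ?value .
            ?value rdf:type ?cls .
        }
    """
    for row in g.query(q):
        print(row.cls)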

Where's the commonality? To a human, it's clear that most values (except
for some anomalies like foaf:maker) are actually entities like persons
or organizations. It's a fair guess that even the literal values denote
such entities. But how does a piece of software know that these
disparate classes denote the same *kind* of thing? The only way is
through a common definition - namely the definition of dc:creator. And
the way to tell a machine that "values of dc:creator are always Agents"
is to give it a range. That's the whole thing, really.
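
To make that concrete, here's a minimal sketch in Python with rdflib.
Apart from dc:creator and foaf:Person, the names (ex:Agent, the sample
documents) are invented for illustration; it applies the generic RDFS
range rule by hand:

    from rdflib import Graph, Namespace, RDF, RDFS

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    EX = Namespace("http://example.org/")   # hypothetical namespace

    g = Graph()
    g.parse(data="""
        @prefix dc:   <http://purl.org/dc/elements/1.1/> .
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        @prefix ex:   <http://example.org/> .

        # Disparate classes for dc:creator values, as in Swoogle's list:
        ex:doc1 dc:creator ex:alice .   ex:alice a foaf:Person .
        ex:doc2 dc:creator ex:acme .    ex:acme  a ex:Organization .
    """, format="turtle")

    # Suppose DCMI declares "values of dc:creator are always Agents":
    g.add((DC.creator, RDFS.range, EX.Agent))

    # The generic RDFS range rule: if p has range C and (s p v) holds,
    # then v is an instance of C. Applied by hand:
    for prop, cls in list(g.subject_objects(RDFS.range)):
        for s, v in list(g.subject_objects(prop)):
            g.add((v, RDF.type, cls))

    # Both values now share a common type, whatever their own class:
    for v in g.objects(None, DC.creator):
        print(v, (v, RDF.type, EX.Agent) in g)   # ... True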

This information would clearly be useful for Swoogle, as it would be
able to improve its search engine based on this information.

> My concern would be that the accuracy of the values in manually crafted DC
> metadata would not support inference. DC metadata, as I see it, is
> intended to provide 'cheap and cheerful' minimal level descriptive
> metadata. Would any application want to start making inferences over such
> metadata?

All RDF data is potentially subject to inference. You won't know what
people will do with your data, and in the case of RDF, the
machine-semantics (the basis for inference) is specified in the RDF
specs themselves.

In other words: expressing metadata in RDF makes it subject to the
interpretations of *others* according to the rules of RDF. Thus, if
you're producing invalid RDF manually, you're not violating the rules of
DC but the rules of RDF. Maybe RDF shouldn't be used in those cases? Or
software should be used to make the data abide by the RDF rules?

> On the other hand if the intention is to be able to merge data from
> different sources on a 'partial understanding' basis I would be more
> convinced of the benefit...

Well, this is the Swoogle argument. Swoogle is just a single example of
a SW-based service that relies on the semantics of RDF, including
domains and ranges. *Any* generic RDF service would need to rely on that
information, and withholding that information for DC properties, even
though it's there in the definitions for human consumption, wouldn't
make much sense.

I'm not sure the above helped at all, but that's my (incomplete) view of
the situation. Feel free to correct me! :-)

----------------------------------------------------------------------
Date: Fri, 24 Mar 2006 13:05:52 +0000
From: Pete Johnston <p.johnston@UKOLN.AC.UK>
Subject: Re: FW: Domains and ranges of DC properties
To: DC-ARCHITECTURE@JISCMAIL.AC.UK

Rachel Heery wrote:
> I think it would be useful to clarify to this list the underlying purpose
> of this exercise and the benefits. Or is this being done in a spirit of
> 'enabling' unknown future benefits??
> 
> As I understand it the aim of the exercise is to make precise distinctions
> about the "value space" of a property in machine-processable definitions.
> And that this is being done to indicate what can be inferred when a
> particular property is used in a triple.
> 
> Can someone perhaps give some use cases of how this would be beneficial??

Mikael beat me to a reply, but as I've been writing this for the last 
hour or so, I'll send it anyway. I think I'm repeating some of what 
Mikael said, but I've had a stab at a more concrete example of how the 
rdfs:range data might be used.

It seems to me the bottom line is, as Mikael said, that this information 
is _already_ part of the definition of the DCMI-owned properties, but it 
is currently accessible only to a human reader of the term 
description/definition (and even in the human-readable definitions 
sometimes it isn't as clear/unambiguous as it should be - though the 
DCMI Usage Board has gone to some pains to try to address that, I think).

Including the RDFS assertions makes the information which is already 
there _explicit_ and accessible to an _application_.

Why is that useful?

One attempt at a "Use Case":

Problem: to resolve clearly the ambiguity over whether the value of 
dc:creator (etc) is the agent or the name of the agent.

The DCMI definition says the value is the agent. Some metadata providers 
follow this definition; some metadata providers interpret the definition 
as allowing them to use the name of the agent, a literal, as the value. 
(And to make matters worse, DCMI has published specifications which 
license both approaches.)
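
To make the two patterns concrete, here's a sketch in Python with
rdflib (the documents and the ex: names are invented; foaf:name stands
in for whatever name property a provider happens to use):

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix dc:   <http://purl.org/dc/elements/1.1/> .
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        @prefix ex:   <http://example.org/> .

        # Provider A: the value is the agent, a resource with a name.
        ex:report dc:creator ex:jsmith .
        ex:jsmith foaf:name "John Smith" .

        # Provider B: the value is the *name* of the agent, a literal.
        ex:memo dc:creator "John Smith" .
    """, format="turtle")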

To an RDF application, those two data providers are "saying" two 
different things. The data providers think they are "saying" the same 
thing in two different ways.

Consider a service provider consuming metadata and working across both 
those sets of data that they have harvested.

Case 1: The service provider may not even be aware that these two 
patterns are in use. They set up a query to answer the question "Which 
resources were created by the agent called 'John Smith'?" and that query 
only returns half the results because they are catering only for one 
pattern. Not very satisfactory for the user of the service. Cheap, 
certainly, and maybe cheerful for the service provider in the short 
term, but probably not for the user who only sees half the results.
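
Continuing the sketch above, the one-pattern query might look like this
in SPARQL via rdflib:

    # A query that caters only for provider A's pattern:
    q = """
        PREFIX dc:   <http://purl.org/dc/elements/1.1/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?doc WHERE {
            ?doc dc:creator ?agent .
            ?agent foaf:name "John Smith" .
        }
    """
    for row in g.query(q):
        print(row.doc)   # finds ex:report; ex:memo is silently missed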

Case 2: The service provider may be aware from analysing their harvested 
data that both these patterns are in use, but not that DCMI and the data 
providers expect them to be treated as equivalent. The service provider
might choose to reject/ignore one of the two cases. How do they decide? 
Different service providers might make different choices. Again, not 
very satisfactory for the user of those services, who gets contradictory 
results. Still cheap-ish and moderately cheerful for the service 
provider, but the user gets less cheerful as they get different results 
from different services (and eventually vents their spleen on the 
service providers, who then aren't cheerful either).

Case 3: The service provider may be aware that both these patterns are 
in use, and also that DCMI and the data providers expect them to be 
treated as equivalent. So the provider introduces some processing - 
processing specific to this group of properties - which treats the two 
cases as the same or maps one to the other. This introduces a cost to 
the service provider. They can't apply the generic rule, the rule they 
use for all the other RDF data they have harvested: they have to 
introduce "special case" rule for this specific property or set of 
properties. And every new service provider that comes along and wants to 
process the data has to (a) find out that these idiosyncrasies specific 
to this group of properties exist (how do they do that?) and (b) 
implement this special-case rule in their application. And the more 
options left open to the data providers - not just two patterns, but 
three, four, five etc - the more rules the service providers have to 
find out about and handle in their application, and the more complex and 
costly those applications become. OK, the user may now be fairly 
cheerful, but it's neither cheap nor cheerful for our service providers, 
who probably decide that working with DC metadata is just more hassle 
than it's worth!
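
Continuing the earlier sketch, such a special-case rule might look like
this (nothing here is a DCMI-sanctioned mapping - just one possible
normalisation):

    from rdflib import BNode, Literal, Namespace
    from rdflib.namespace import FOAF

    DC = Namespace("http://purl.org/dc/elements/1.1/")

    # Special-case rule for dc:creator only: turn provider B's literal
    # names into anonymous agents carrying the name, so both patterns
    # end up "saying" the same thing.
    for doc, value in list(g.subject_objects(DC.creator)):
        if isinstance(value, Literal):
            agent = BNode()                 # mint an anonymous agent
            g.remove((doc, DC.creator, value))
            g.add((doc, DC.creator, agent))
            g.add((agent, FOAF.name, value))

Every property with this kind of ambiguity needs its own such rule,
which is exactly the cost described above.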

By introducing, say, an rdfs:range constraint for dc:creator that 
says the range is a class of some:Agent, which is disjoint - 
"non-overlapping" - with the class rdfs:Literal, DCMI says unambiguously 
to my two metadata providers and - perhaps more importantly - to their 
two metadata creation applications that the value is the agent, not the 
name of the agent. It allows those two metadata creation applications to 
"say" the same thing when they create the metadata record.

And it says unambiguously to the service provider and to their metadata 
processing application that the value is the agent not the literal name 
of the agent. So when that application processes some harvested data, it 
can detect contradictions between the data and DCMI's description of the 
property.
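
A sketch of that detection step, self-contained this time, so any
harvested graph can be checked against the declared range:

    from rdflib import Graph, Literal, Namespace

    DC = Namespace("http://purl.org/dc/elements/1.1/")

    def report_range_violations(data: Graph) -> None:
        # With a range disjoint from rdfs:Literal, any literal value
        # of dc:creator contradicts DCMI's description of the property.
        for doc, value in data.subject_objects(DC.creator):
            if isinstance(value, Literal):
                print(f"{doc}: dc:creator has literal value {value!r},"
                      f" contradicting the declared range")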

In my case 2 above the best the application could do was say to the 
human administrator of the service "Look, I've got two (three, four, 
five) sorts of thing as values for dc:creator here" - the Swoogle case, 
as Mikael pointed out - leaving the human administrator to go and read 
about dc:creator and try to work out what to do.

With the rdfs:range information, the application can recognise "DCMI 
says the range of dc:creator is some:Agent which is disjoint with 
rdfs:Literal, so all these sorts of thing are also instances of 
some:Agent, that's fine - but ooh, look, Admin-Person, this set of data 
has literal values and contradicts that assertion". Publishing the 
rdfs:range data provides the basis for more clarity and consistency. It 
makes the work of service providers easier, and cheaper, and hopefully 
the end result is more cheerfulness all round! ;-)