blog.cbeer.info

Nov 4, 2009

Teaching PBCore, Questions and Notes

The questions below are loosely based on those raised by participants in the introduction to XML workshop presented at the Association of Moving Image Archivists 2009 conference in St. Louis, MO on 3 November.

In general, tangible examples are crucial to teaching and understanding PBCore. At present, the PBCore examples are haphazard and follow little logical progression; improving them would help the adoption of PBCore. In addition, tools should be created to support new PBCore-based applications, making it easier to distinguish between well-formed XML, valid PBCore, and PBCore that conforms to a community of practice.

- Where are the XML attributes?

After an introduction to XML, which taught the participants about the basic building blocks of XML (elements, entities, and attributes), the lack of attributes in PBCore was confusing. Rather than:

<title type="Program">Jimmy Carter</title>

PBCore requires:

<pbcoreTitle>
<title>Jimmy Carter</title>
<titleType>Program</titleType>
</pbcoreTitle>
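For a sense of what this means in code, here is a minimal Ruby sketch that reads the same title and type from both forms. It assumes the REXML library that ships with Ruby; the constant and helper names are hypothetical, and the attribute form is only the shape participants expected, not valid PBCore 1.x.

```ruby
require "rexml/document"

# Attribute style: what workshop participants expected (NOT valid PBCore 1.x).
ATTRIBUTE_STYLE = '<title type="Program">Jimmy Carter</title>'

# Child-element style: what PBCore 1.x requires.
ELEMENT_STYLE = <<~XML
  <pbcoreTitle>
    <title>Jimmy Carter</title>
    <titleType>Program</titleType>
  </pbcoreTitle>
XML

# Hypothetical helper: returns [title, type] from the attribute-style snippet.
def title_from_attribute(xml)
  root = REXML::Document.new(xml).root
  [root.text, root.attributes["type"]]
end

# Hypothetical helper: returns [title, type] from the PBCore-style snippet.
# The value and its type live in sibling elements, so each must be looked up
# separately; the same pattern repeats for every type, authority, and role.
def title_from_elements(xml)
  root = REXML::Document.new(xml).root
  [root.elements["title"].text, root.elements["titleType"].text]
end

p title_from_attribute(ATTRIBUTE_STYLE)
p title_from_elements(ELEMENT_STYLE)
```

Both helpers return the same pair, but the second has to be rewritten with different element names for each container.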

For a developer, the additional mechanics needed to parse each type, each authority, or each role are annoying copy+paste jobs, and it is clear that even those new to XML come to expect attributes here. Given recent DCMI work to make Dublin Core more relevant to the changing metadata landscape, PBCore seems to have failed to evolve. The reason, as best I can determine, is that the PBCore 1.x schema was based on existing XML exports from a relational database, where this convention is born of the need for a semantically agnostic schema rather than of proper schema design.

- What is PBCore's relation to Dublin Core?

PBCore is introduced as being a derivative or extension of Dublin Core, but for some shared element names, there is no obvious relationship. This should either be clarified in future development or dropped.

- What is the difference between formatPhysical, formatMediaType, and formatGenerations?

These three instantiation-level metadata elements all describe similar things in slightly different ways:

* formatPhysical (or formatDigital, perhaps) describes the carrier format, which may be independent of the content on the carrier
* formatMediaType describes the content present on the carrier
* formatGenerations describes the type of content on the carrier

The PBCore value lists could be clarified to remove some of the current (seemingly) redundant information.

- Why are formatPhysical and formatDigital formatted differently? Or, why wouldn't one use multiple instantiations to express the different formats in which an item is available?

The value list for formatDigital is based on the IANA MIME type registry, while the formatPhysical list is an aggregate of its source elements, which shows in its inconsistent formatting. Could the formatPhysical list become more cohesive and resemble MIME types? The relation between current instantiations is, at best, unclear and not systematic.
The biggest flaw in the current approach is that it is difficult to express the provenance of an instantiation and its relation to the intellectual work. The current situation also breaks the 1:1 correspondence between an instantiation and a carrier/file/etc. Some major restructuring, possibly breaking backwards compatibility, is necessary to correct these issues. In the meantime, I would recommend creating a new instantiation for each instance and using the pbcoreAnnotation field to supply basic provenance information.

- The PBCore outline graph is confusing.

As is, the outline graph mixes XML elements with conceptual groupings, which makes it confusing to someone new to XML or to PBCore. The graphic could easily be revised to use shaded groups, rather than tree nodes, to communicate the content classes.

- The PBCore metadata dictionary picklists provide no definitions or best practices

The metadata dictionary, which may be the most important part of PBCore 1.x, is marginalized on the website. The picklists are offered only as lists and fail to provide appropriate definitions for titleType, descriptionType, etc. Without this guidance, each implementor is forced to make determinations without respect to a community of practice. Taking descriptionType as an example, guidance is needed on when to use the format-specific types (program, series, etc.) vs. the generic type labels (abstract, summary).

- The PBCore website conflates schema rules with best practices

The PBCore website presents recommended best practices and usage guidelines closely integrated with the schema requirements. This placement is confusing: the best practices are important and essential resources, but interleaving them with the schema rules makes PBCore harder to understand.
- A schema-validating XML editor complains when the XML document lacks recommended or optional fields

In particular, oXygen indicates to the user that fields like pbcoreGenre are REQUIRED for conformance to PBCore, while the website leads one to believe this is not the case. In fact, this should not be the case, because genre is very specific to broadcasting/traffic needs and will likely be missing in general usage. This leads me to believe that PBCore should examine the approach the TEI community took with regard to modularization. Proper modularization would provide implementors with a relevant set of metadata elements necessary for use, and perhaps make it easier to integrate PBCore with other metadata schemas (for example, a rights schema or technical metadata standard), leaving PBCore responsible for description and rules for aggregation.

- How do you exchange records? Or, how can I put multiple description documents in the same file?

According to the PBCore schema, a file should contain only one PBCoreDescriptionDocument, which is common XML practice but unfamiliar to those new to XML. Participants were attracted to aggregations as a way to deliver contextually complete documents containing metadata records for relations, etc. Other communities have handled aggregation independently of the metadata standard (say, with Atom or OAI-PMH), which is probably a sounder approach.

- Extensions are hard, confusing.

Yep.

Oct 27, 2009

15 ways to improve PBCore

This is a post describing shortcomings and potential improvements for PBCore, an XML markup for media material interchange. These suggestions try to work within the current confines of PBCore, rather than introducing radical changes (which could bring PBCore more in line with the rest of the XML and linked data worlds). Further, we recognize the strength of PBCore is in descriptive metadata, and these suggestions are primarily to strengthen those components, rather than trying to compete on technical metadata.

Define what all the data dictionary terms mean: “clip”, “element”, “actuality”, “version of”, etc. These need definitions so that the community can apply them consistently. Other communities have come up with these already; we just need to determine which ones apply to which elements. For example, the European Broadcasting Union does a nice job of distributing machine-readable XML definitions for its data dictionary.
Enhance the semantics of relation types by creating an ontology (using RDFS or similar, like the Fedora RELS-EXT ontology); e.g. instead of simply “version of”, allow “derivation of”, “copy of”, “identical to”, etc.
PBCore only has contextual date on individual instantiations, but we want an overall date with types for created/issued/etc (e.g. the date an interview was conducted). A similar issue exists for locations. Both of these are different from pbcoreCoverage — coverage is about the content, rather than the context.
Format of the content: whether it is an interview, a panel discussion, a live event, b-roll, beauty shots, etc. formatGenerations provides a piece of this puzzle, but this is ultimately descriptive metadata, which probably doesn't belong in an instantiation. EBUCore provides for part of this with a controlled vocabulary for editorial formats, but it’s not granular enough (e.g. Discussion/Interview/Debate/Talkshow). Our suggestion is to explore enhancing the genre data dictionary to include archival descriptors like “interview” and “b-roll”, which would solve this in a backwards-compatible way.
Machine parseable rights language; we're embedding the Open Digital Rights Language (ODRL) as a member of pbcoreRightsSummary, but it would be nice to have a common way to express rights (both rights the publisher has, and rights granted by the publisher to the user). An alternate (and perhaps desirable and necessary) solution would be to at least investigate better ways to combine PBCore with established schemas like ODRL, MODS, etc.
A way to identify the primary title and description of an asset, for use in a discovery interface. Existing solutions, like picking titles based on hierarchy, or using a separate metadata document, are flawed.
A formal way to order, prioritize, and relate instantiations within a record (e.g. programs within a series, provenance/hierarchy of digital instances).
A way to label the type of a pbcoreSubject (e.g. person, organization, place, date, etc.), in addition to the existing authority reference.
Authority references should be available in most (if not all) PBCore containers, which could help enable linked data applications. This could be accomplished through new XML attributes, which would be ignored by legacy applications and would perhaps be better in line with other standards.
Better handling of "element" level materials, for archival raw footage and similar. Finished programs are handled decently in the existing PBCore, but the data dictionaries aren't prepared for this level.
Adopt proper RDF relationships for PBCore relations.
Consider adding educational levels and standards. PBCore currently addresses this tangentially with audienceLevel and audienceRating.
Better way to handle metadata about people, whether by enhancing the existing structure, supporting an hCard microformat, or otherwise.
Semantics to deal with thumbnails for discovery interfaces, or how to attach visual representations/facsimiles of a PBCore media instantiation. This is probably a low-priority, nice-to-have change.
Content flags, which include advisory messages about sensitive content, are regularly created for broadcast programs, but PBCore doesn't provide a way to capture these. Perhaps the best way here is to add time-based metadata to the descriptive material (but, then, what do you base the timecode against? See next.)
BONUS: Add timecode information to instantiations and relationships to identify sections of content, in order to support time-based metadata, content flags, etc.

Oct 10, 2009

Fedora + Ruote workflow system

[Figure: a high-level diagram of a workflow system that combines Ruote, ActiveMQ, and Fedora to create a flexible and extendable lightweight workflow system.]

Oct 4, 2009

Fedora, Blacklight, and Ruby on Rails

I've been playing with Blacklight, a catalog interface built on Solr, this weekend with fairly positive results. After some initial frustration trying to figure out the demo data, I switched gears and connected Blacklight to my own Solr data source, populated by a Fedora repository. Two initial kinks here were:

The unique identifier field `id` is hard-coded into Blacklight, while my existing data used the field name `PID`; see CODEBASE-171
The unique identifiers in my repository began with a qualified namespace in the form "org.example.repository", which broke the Ruby on Rails default routing system

My quick fix for the routing issue was to change the formatting requirements for the id field in the router, so my resource map now looks like:


  map.resources(:catalog,
    :only => [:index, :show, :update],
  […]
    :requirements => { :id => /([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+/ }
  )

The regular expression is a copy of the Fedora PID regular expression, except that I've disallowed periods in the identifier name (they are still legal in the namespace, which I imagine is common practice). There is still a fair bit of work to do hooking in object views, but the catalog + discovery portions were quickly and easily done.
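A quick way to sanity-check a pattern like this outside of Rails is to run it against a few sample identifiers. The sketch below is plain Ruby; the `\A` and `\z` anchors are my addition (as I understand it, the router matches a requirement against the whole path segment, whereas a bare Ruby regex matches anywhere in the string), and the constant name, helper name, and sample PIDs are all made up for illustration.

```ruby
# Same pattern as the :requirements entry above, anchored for standalone use.
PID_PATTERN = /\A([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+\z/

# Hypothetical helper: true when the id would satisfy the route requirement.
def routable_pid?(id)
  PID_PATTERN.match?(id)
end

# Illustrative identifiers only; none of these come from a real repository.
[
  "org.example.repository:1234", # periods in the namespace are allowed
  "demo:obj-1_a~b",              # hyphen, underscore, and tilde in the name
  "demo:a%3Ab",                  # percent-encoded octet in the name
  "demo:obj.1",                  # period in the name is rejected
  "nocolon"                      # missing namespace separator
].each do |id|
  puts "#{id} => #{routable_pid?(id)}"
end
```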

Sep 28, 2009

BagIt workflows

Adapted from an email I just wrote, but I think there are some good resources here, so I thought I'd share more widely.

I’ve toyed around with the BagIt standard, and have a demonstrator for a very homogeneous use case (using Ruby, Ruote, and ruby-bagit), but it doesn’t factor into our DAM -> Fedora workflow yet. From my limited implementation, it would certainly be nice to see DAMs beginning to adopt it, if a few issues can be addressed, either by the standard or by convention.

The biggest issue with the BagIt standard at this point is that it is exclusively a framework for transferring a collection of files, but doesn’t yet provide a way to create complex/compound objects out of the contents. The Library of Congress has been using BagIt for their Chronicling America newspaper project (tech notes), but the reconstruction of objects and relationships has been implicit (based on a file naming convention) or done manually. This probably works in the simplest cases, where each BagIt item can be mapped into a compound object with either limited or embedded metadata, but I’m not sure if this could be easily applied to the problem of creating and relating multiple (heterogeneous) complex objects.

Ben O’Steen at Oxford has proposed an extension to add an RDF manifest to the BagIt package to provide this sort of relationship, but I haven’t pursued that further. There has also been some recent development around combining BagIt and OAI-ORE, which might be a better way of approaching the problem using existing standards.

A further wrinkle, at our end, is that our Fedora repository is holding compressed access copies of the content, which cannot be stored in the DAM (because the DAM content model fails to account for proxy objects or similar). I imagine this is going to be a problem with almost all large datastreams, and something infrastructure will have to adapt to.
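To make the "collection of files" point concrete, here is a minimal sketch of what a bag amounts to on disk: a payload directory, a declaration file, and a checksum manifest. It uses only the Ruby standard library rather than ruby-bagit; the helper name is hypothetical, the default version string is an assumption about the draft current at the time of writing, and it ignores name collisions and the optional tag files.

```ruby
require "digest"
require "fileutils"

# Hypothetical helper: copies `files` into a new bag at `bag_dir` and writes
# the two required tag files. The version default is an assumption.
def write_minimal_bag(bag_dir, files, version: "0.96")
  data_dir = File.join(bag_dir, "data")
  FileUtils.mkdir_p(data_dir)

  # One manifest line per payload file: "<checksum>  <path relative to bag>".
  manifest_lines = files.map do |source|
    name = File.basename(source)
    target = File.join(data_dir, name)
    FileUtils.cp(source, target)
    "#{Digest::MD5.file(target).hexdigest}  data/#{name}"
  end

  File.write(File.join(bag_dir, "bagit.txt"),
             "BagIt-Version: #{version}\nTag-File-Character-Encoding: UTF-8\n")
  File.write(File.join(bag_dir, "manifest-md5.txt"), manifest_lines.join("\n") + "\n")
  bag_dir
end
```

Nothing in the result says which payload files belong to the same object or how objects relate to one another, which is the gap the RDF manifest and OAI-ORE proposals try to fill.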