Nov 4, 2009
The questions below are loosely based on those raised by participants in the Introduction to XML workshop presented at the Association of Moving Image Archivists 2009 conference in St. Louis, MO, on 3 November.
In general, tangible examples are crucial to teaching and understanding PBCore. At present, the PBCore examples are haphazard and follow little logical progression; improving them would help the adoption of PBCore. In addition, tools should be created to support new PBCore-based applications, making it easier to distinguish between well-formed XML, valid PBCore, and PBCore that conforms to a community of practice.
- Where are the XML attributes?
After an introduction to XML that taught the participants about the basic building blocks of XML (elements, entities, and attributes), the lack of attributes in PBCore was confusing. Rather than:
<title type="Program">Jimmy Carter</title>
PBCore requires:
<pbcoreTitle>
<title>Jimmy Carter</title>
<titleType>Program</titleType>
</pbcoreTitle>
As a developer, I find the additional mechanics needed to parse each type, each authority, or each role to be annoying copy+paste jobs, but it is clear that even those new to XML expect attributes here. While DCMI has recently worked to make Dublin Core more relevant to the changing metadata landscape, PBCore seems to have failed to evolve.
The reason, as best I can determine, is that the PBCore 1.x schema was based on existing XML exports from a relational database, where that convention is born out of the need for a semantically agnostic schema rather than out of deliberate schema design.
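To make the extra parsing mechanics concrete, here is a minimal sketch using Ruby's REXML library. It reads the same title from both forms shown above. The variable names are my own, and the snippets are the two fragments from this post rather than complete PBCore documents.

```ruby
require 'rexml/document'

# Attribute style: the type travels with the element itself.
attribute_style = REXML::Document.new('<title type="Program">Jimmy Carter</title>')
title = attribute_style.root
attr_result = { type: title.attributes['type'], value: title.text }

# PBCore 1.x style: the type is a sibling element inside a wrapper,
# so the parser has to look up two children and pair them by hand.
pbcore_style = REXML::Document.new(<<~XML)
  <pbcoreTitle>
    <title>Jimmy Carter</title>
    <titleType>Program</titleType>
  </pbcoreTitle>
XML
wrapper = pbcore_style.root
pbcore_result = {
  type: wrapper.elements['titleType'].text,
  value: wrapper.elements['title'].text
}

puts attr_result.inspect
puts pbcore_result.inspect
```

Both hashes carry the same information; the second form simply needs a wrapper lookup that has to be repeated for every typed, sourced, or role-bearing element.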
- What is PBCore's relation to Dublin Core?
PBCore is introduced as being a derivative or extension of Dublin Core, but apart from some shared element names, there is no obvious relationship. This should either be clarified in future development or dropped.
- What is the difference between formatPhysical, formatMediaType, and formatGenerations?
These three instantiation-level metadata elements all address similar concerns in slightly different ways:
* formatPhysical (or formatDigital, perhaps) describes the carrier format, which may be independent of the content on the carrier
* formatMediaType describes the content present on the carrier
* formatGenerations describes the type of content on the carrier
The PBCore value lists could be clarified to remove some of the current (seemingly) redundant information.
- Why are formatPhysical and formatDigital formatted differently? Or, why wouldn't one use multiple instantiations to express the different formats in which an item is available?
The value list for formatDigital is based on the IANA MIME type registry, while the formatPhysical list is the aggregate of the source elements, which is reflected in the inconsistency of formatting. Could the formatPhysical list become more cohesive and resemble MIME types?
The relation between current instantiations is, at best, unclear and not systematic. The biggest flaw in the current approach is that it is difficult to express the provenance of an instantiation and its relation to the intellectual work. The current situation also breaks the 1:1 correspondence between an instantiation and a carrier/file/etc. Some major restructuring, possibly breaking backwards compatibility, is necessary to correct these issues. In the meantime, I would recommend creating a new instantiation for each instance and using the pbcoreAnnotation field to supply basic provenance information.
- The PBCore outline graph is confusing.
As is, the outline graph mixes XML elements with conceptual groupings, which makes it confusing to someone new to XML or to PBCore. The graphic could easily be revised to use shaded groups, rather than tree nodes, to communicate the content classes.
- The PBCore metadata dictionary picklists provide no definitions or best practices
The metadata dictionary, which may be the most important part of PBCore 1.x, is marginalized on the website. The picklists are offered only as lists and fail to provide appropriate definitions for titleType, descriptionType, etc. Without this guidance, each implementor is forced to make determinations without respect to a community of practice. Taking descriptionType as an example, guidance is needed to describe when to use the format-specific types (program, series, etc.) vs. the generic type labels (abstract, summary).
- The PBCore website conflates schema rules with best practices
The PBCore website presents recommended best practices and usage guidelines closely interleaved with the schema requirements. This placement is confusing: the best practices are important, even essential, resources, but mixing them with the schema rules makes PBCore harder to understand.
- A schema-validating XML editor complains when the XML document lacks recommended or optional fields
In particular, oXygen indicates to the user that fields like pbcoreGenre are REQUIRED for conformance to PBCore, while the website leads one to believe this is not the case. In fact, this should not be the case because genre is very specific to broadcasting/traffic needs and will likely be missing in general usage.
This leads me to believe that PBCore should examine the approach the TEI community took with regard to modularization. Proper modularization would provide implementors with a relevant set of metadata elements necessary for use, and perhaps make it easier to integrate PBCore with other metadata schemas (for example, a rights schema or technical metadata standard), leaving PBCore responsible for description and rules for aggregation.
- How do you exchange records? Or, how can I put multiple description documents in the same file?
According to the PBCore schema, a file should contain only one PBCoreDescriptionDocument. One document per file is common XML practice, but it is unknown to those new to XML. Participants were attracted to aggregations as a way to deliver contextually complete documents containing metadata records for relations, etc. Other communities have handled aggregation independently of the metadata standard itself (say, with Atom or OAI-PMH), which is probably a sounder approach.
- Extensions are hard, confusing.
Yep.
Oct 4, 2009
I've been playing with Blacklight, a catalog interface built on Solr, this weekend with fairly positive results. After some initial frustration trying to figure out the demo data, I switched gears and connected Blacklight to my own Solr data source, populated by a Fedora repository.
Two initial kinks here were:
- The unique identifier field `id` is hard-coded into Blacklight, while my existing data used the field name `PID`; see CODEBASE-171
- The unique identifiers in my repository began with a qualified namespace in the form "org.example.repository", which broke the Ruby on Rails default routing system
My quick fix for the routing issue was to change the formatting requirements for the id field in the router, so my resource map now looks like:
    map.resources(:catalog,
      :only => [:index, :show, :update],
      […]
      :requirements => { :id => /([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+/ }
    )
The regular expression is a copy of the Fedora PID regular expression, except that I've disallowed periods in the identifier name. Periods are still legal in the namespace, which I imagine is common practice.
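A quick way to sanity-check the pattern outside of Rails is to try it against a few identifiers in plain Ruby. This is only a sketch: the `\A` and `\z` anchors are my addition so that the whole string must match when the pattern is used on its own, and the constant name, helper name, and sample identifiers are made up for illustration.

```ruby
# Same pattern as the :requirements entry above, wrapped in \A ... \z
# (an assumption for standalone use) so partial matches are rejected.
PID_PATTERN = /\A([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+\z/

# Hypothetical helper: true when the identifier fits the pattern.
def routable_pid?(pid)
  PID_PATTERN.match?(pid)
end

# Made-up sample identifiers, not taken from a real repository.
samples = [
  "org.example.repository:123", # periods in the namespace
  "demo:obj.v2",                # period in the identifier name
  "demo:a%3Ab",                 # percent-encoded character in the name
  "no-namespace"                # missing the namespace separator
]

samples.each { |pid| puts "#{pid} => #{routable_pid?(pid)}" }
```

The first and third samples should pass, while the period in the name and the missing colon should be rejected.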
There is still a fair bit of work hooking in object views, but the catalog + discovery portions were quickly and easily done.