TNS Internal:CollectionAPI/AtomPub

From NSDLWiki

This is a proposal for a collection-oriented interface to the repository, based on the Atom and AtomPub standards. The goal is to make the collections and items of the repository discoverable in a simple manner while using established standards and conventions.

Overview

This Atom/AtomPub based interface is composed of three types of web resources:

  1. AtomPub Service documents
    • Lists all collections housed in the repository, with links to their feeds
    • Groups collections into workspaces
    • Lists the capabilities of the collection, including acceptable member categories and mime types. A vocabulary of categories is proposed to represent type, source, and format of entries.
  2. Atom feed documents
    • Each feed document is a web resource with a resolvable URI.
    • Clients may add or remove entries from feeds using techniques defined in the AtomPub protocol
  3. Atom entries
    • Each entry contains a record or media file as its content.
    • Entries contain a number of links to related resources, such as the atom feed containing member items (in the case of an entry that represents a collection record), out-of-band data an application needs to attach to the record, or the resource being described in record metadata.
    • Each entry has a resolvable URI, and may be edited per the AtomPub protocol
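
As a rough illustration of the first resource type, the sketch below parses a small service document and lists each workspace's collections with their feed URIs and accepted media types. The workspace title, collection titles, and URLs are made up for the example; only the element names and namespaces come from the Atom and AtomPub specifications.

```python
import xml.etree.ElementTree as ET

NS = {"app": "http://www.w3.org/2007/app", "atom": "http://www.w3.org/2005/Atom"}

# Hypothetical service document; titles and URLs are placeholders.
SERVICE_DOC = """\
<service xmlns="http://www.w3.org/2007/app" xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>NSDL</atom:title>
    <collection href="http://example.org/feeds/collections">
      <atom:title>NSDL collections</atom:title>
      <accept>application/atom+xml;type=entry</accept>
    </collection>
    <collection href="http://example.org/feeds/images">
      <atom:title>Image library</atom:title>
      <accept>image/png</accept>
      <accept>image/jpeg</accept>
    </collection>
  </workspace>
</service>
"""

def list_collections(service_xml):
    """Return (workspace title, collection title, feed href, accepted types) tuples."""
    root = ET.fromstring(service_xml)
    rows = []
    for ws in root.findall("app:workspace", NS):
        ws_title = ws.findtext("atom:title", default="", namespaces=NS)
        for coll in ws.findall("app:collection", NS):
            rows.append((
                ws_title,
                coll.findtext("atom:title", default="", namespaces=NS),
                coll.get("href"),
                [a.text for a in coll.findall("app:accept", NS)],
            ))
    return rows

collections = list_collections(SERVICE_DOC)
for row in collections:
    print(row)
```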

Basic model

The entire repository record space is represented as a series of atom feeds and entries. An entry serves as a container for its content, which may be either a metadata record or a (possibly binary) media file. Besides the content, the collection API makes extensive use of atom links to relay information about the content. The basic elements of an entry are as follows:

id
Globally unique identifier required by the atom spec. The server may choose to generate its own id for an entry (if none is given), or use an acceptable value provided by the client. The convention used in the collection api is that it is a uri with the prefix http://ndr.nsdl.org/collections/.
title
Required by the atom spec. For metadata-based entries, this should be the catalogued title. For media files, this should be a file name.
updated
The server will maintain the last updated date for each record
content
The content of the entry. Contains a href link to a web resource containing the content (metadata record or media file)
summary
Required by the atom spec if content is not human readable text or html. This can be anything, but typically should be a description from metadata, or a short description of a media file. This field is solely for the benefit of humans scanning the entries.
category
Optional under the atom spec, but required by the collections API. A record may have multiple categories. The collections API defines categories to reflect the entry type, source, or metadata format. See vocabulary of categories
link
The collections api makes extensive use of atom links. Each link has a rel attribute. The collections api uses standard link relations as follows:
  • rel=via link from a metadata item entry to the collection record entry for the collection(s) of which it is a member
  • rel=related link to other non-record content. Applications can attach their own 'private' data to any entry using this link. There may be any number of such links, and they are differentiated by their title
  • rel=alternate For entries containing metadata records, this contains the URI of the resource being described by the record (typically, this is the URL of some site on the web). There may be only one such link.
  • rel=enclosure link to the media file, or all forms/formats of metadata associated with a record. There must be at least one enclosure link, pointing to the URI of the content of the entry. There may be any number of enclosure links.
  • rel=edit,rel=edit-media defined by the AtomPub spec, used for modifying the content of the atom entry, or the metadata record or media file itself.
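
To make the element list concrete, here is a minimal sketch that parses one hypothetical record entry and groups its links by rel. Everything under example.org is a placeholder, and the entry uses Atom's src attribute on the content element to point at the out-of-line record; the link titles dcs_data and format_nsdl_dc are borrowed from examples elsewhere on this page.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical record entry; every example.org URL is a placeholder.
ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://ndr.nsdl.org/collections/example-record-1</id>
  <title>Example catalogued title</title>
  <updated>2008-01-01T00:00:00Z</updated>
  <summary>Short description taken from the metadata.</summary>
  <category scheme="http://ns.nsdl.org/collections/type" term="record"/>
  <content type="application/xml" src="http://example.org/records/1/nsdl_dc"/>
  <link rel="via" href="http://example.org/entries/collection-a"/>
  <link rel="alternate" href="http://example.org/the-described-resource"/>
  <link rel="enclosure" title="format_nsdl_dc" href="http://example.org/records/1/nsdl_dc"/>
  <link rel="related" title="dcs_data" href="http://example.org/records/1/dcs_data"/>
  <link rel="edit" href="http://example.org/entries/record-1"/>
</entry>
"""

def links_by_rel(entry_xml):
    """Group an entry's links as {rel: [(title, href), ...]}."""
    entry = ET.fromstring(entry_xml)
    grouped = defaultdict(list)
    for link in entry.findall(ATOM + "link"):
        # Atom treats a link without a rel attribute as rel="alternate".
        rel = link.get("rel", "alternate")
        grouped[rel].append((link.get("title"), link.get("href")))
    return dict(grouped)

links = links_by_rel(ENTRY)
for rel, targets in links.items():
    print(rel, targets)
```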

Entries must be in one of three type categories: record, collection, or media.

record entry
A record entry contains a metadata record as its content. This record is considered its "primary" record. It may, optionally, contain any number of enclosure links that contain alternate formats of the record. Clients may or may not understand each format. Since every record is a member of a collection, all record entries must contain a rel=via link to the entry describing its collection metadata.
collection entry
A collection atom entry is a record entry that contains, as its content, a collection metadata record. Every collection has an associated atom feed containing the members of a collection (which is created by the system). A collection atom entry, therefore, contains a rel=related title="feed" link to its item feed. [Josh: Wouldn't it be better to use title="item_feed" or title="items" given that "feed" could mean an RSS URL and apply to multiple links?]
media entry
A media entry contains a file as its content. Like a record entry, it must have a rel=via link to the collection it belongs to.
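
The per-type rules above could be checked mechanically. The following is only a sketch of such a check, under the assumption that the type is read from the first recognized term in the http://ns.nsdl.org/collections/type scheme; the function name and the wording of the messages are invented for the example, and the rel=via requirement is only enforced for record and media entries.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
TYPE_SCHEME = "http://ns.nsdl.org/collections/type"

def check_entry(entry_xml):
    """Return a list of rule violations for one entry (empty if none found)."""
    entry = ET.fromstring(entry_xml)
    kinds = [c.get("term") for c in entry.findall(ATOM + "category")
             if c.get("scheme") == TYPE_SCHEME]
    links = {(l.get("rel"), l.get("title")) for l in entry.findall(ATOM + "link")}
    rels = {rel for rel, _ in links}
    known = [k for k in kinds if k in ("record", "collection", "media")]
    if not known:
        return ["no type category of record, collection, or media"]
    problems = []
    if "enclosure" not in rels:
        problems.append("no rel=enclosure link to the content")
    if known[0] in ("record", "media") and "via" not in rels:
        problems.append("no rel=via link to the collection entry")
    if known[0] == "collection" and ("related", "feed") not in links:
        problems.append('no rel=related title="feed" link to the item feed')
    return problems

# Hypothetical entries; URLs are placeholders.
MEDIA_WITHOUT_VIA = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <category scheme="http://ns.nsdl.org/collections/type" term="media"/>
  <link rel="enclosure" href="http://example.org/media/1.png"/>
</entry>
"""
COLLECTION_OK = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <category scheme="http://ns.nsdl.org/collections/type" term="collection"/>
  <link rel="enclosure" href="http://example.org/records/collection-a"/>
  <link rel="related" title="feed" href="http://example.org/feeds/collection-a"/>
</entry>
"""

media_problems = check_entry(MEDIA_WITHOUT_VIA)
collection_problems = check_entry(COLLECTION_OK)
print(media_problems, collection_problems)
```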

Examples

Sometimes, these concepts are best understood through example. Here are some examples of feeds that comprise this collection API. Look at the feeds in various browsers and feed readers. (note: service documents are not feeds, and are likely to be unrecognized by most readers)

Reading from the NDR

Feeds

Feeds are the primary means for reading data out of the ndr using the collection API. All feeds have the following properties:

Stable URIs
A feed URI never changes, unless it is deleted/removed. If an app wanted to monitor the items in a single collection, it would be appropriate to configure the app with the collection's URI.
Entries ordered by last update date
Following the AtomPub convention for collection feeds, feeds are ordered such that the most recently updated/added entry appears first, and the oldest entry is last.
Pagination for large feeds
Large feeds are broken up into a series of feed documents. When a feed is split this way, the atom pagination convention is used: each feed document contains rel=next, rel=prev, rel=first or rel=last links in the feed preamble. To facilitate use in browsers (live bookmarks), the first feed document may be much shorter than subsequent documents (say, 25 entries).
Access restrictions
Feeds MAY require authentication or authorization in order to read their content.
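
A client reading a paginated feed would follow the rel=next links until none remain. The sketch below shows that loop against an in-memory stand-in for HTTP; the page URLs, the page helper, and the PAGES table are all hypothetical. A real client would replace the fetch argument with a function that performs an HTTP GET.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def page(entry_ids, next_url=None):
    """Build a tiny feed document for the demo."""
    entries = "".join(f"<entry><id>{i}</id></entry>" for i in entry_ids)
    nxt = f'<link rel="next" href="{next_url}"/>' if next_url else ""
    return f'<feed xmlns="http://www.w3.org/2005/Atom">{nxt}{entries}</feed>'

# Stand-in for HTTP: maps hypothetical feed document URLs to their bodies.
PAGES = {
    "http://example.org/feed": page(["c", "b"], "http://example.org/feed?page=2"),
    "http://example.org/feed?page=2": page(["a"]),
}

def iter_entry_ids(first_url, fetch=PAGES.__getitem__):
    """Yield entry ids from every feed document, newest first."""
    url, seen = first_url, set()
    while url and url not in seen:  # the seen set guards against a link cycle
        seen.add(url)
        feed = ET.fromstring(fetch(url))
        for entry in feed.findall(ATOM + "entry"):
            yield entry.findtext(ATOM + "id")
        next_links = [l.get("href") for l in feed.findall(ATOM + "link")
                      if l.get("rel") == "next"]
        url = next_links[0] if next_links else None

ids = list(iter_entry_ids("http://example.org/feed"))
print(ids)
```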

There is no real difference between collection feeds and collection member feeds except that a collection feed contains a link to a service document in which it appears.

Entries

As mentioned earlier, all entries have their own URI, and exist as self-contained web resources outside the context of their feed. This is useful for JMS messages. If an object is updated in the NDR, it shall be possible to retrieve the corresponding entry resource.

Writing to the NDR

Writing to the NDR via the collections API uses techniques specified in the AtomPub protocol, plus a few extensions for some of the more "advanced" features (such as creating "out of band" content).

Adding new items/collections

There are two possible ways to add new items, using AtomPub principles

POST an atom entry to a feed
This atom entry is essentially the same entry that will be added to the feed. The server is allowed to reject or make changes to the posted entry before storing and adding to the feed as follows:
  • id If an id is provided, the server will use it, or return an error. If an id is not provided, the server will create one
  • updated The server will ignore any value provided by the user, and will create its own
  • content Any xml content between the content tags will be used to create a new web resource containing that content. The resulting content element will contain a link to the URL of this content.
  • link As for the content element, any atom links that don't specify a href, but instead contain xml content, will result in a resource being created and linked to. The server may add new links as well. In the case of an item entry, the server will create a link to its collection (and overwrite any link provided by the user to this effect).
  • Other required elements (title, summary, etc) - these should really be defined by the app, but the server could potentially fill them with some sort of reasonable content.

As per AtomPub, the response headers contain the location of the newly created entry.
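
A client-side sketch of the first approach follows. It builds an entry with inline XML content and prepares (but does not send) the POST request. The feed URL and the metadata namespace are placeholders, and id and updated are left out on purpose so that the server supplies them as described above. Sending the request with urllib.request.urlopen would, per AtomPub, be expected to yield a 201 response whose Location header holds the new entry's URI.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def build_entry(title, summary, metadata_xml):
    """Serialize an atom entry whose content element wraps the metadata record."""
    entry = ET.Element(ATOM + "entry")
    ET.SubElement(entry, ATOM + "title").text = title
    ET.SubElement(entry, ATOM + "summary").text = summary
    content = ET.SubElement(entry, ATOM + "content", {"type": "application/xml"})
    content.append(ET.fromstring(metadata_xml))
    # ElementTree picks its own prefixes (ns0, ns1); the namespaces are unchanged.
    return ET.tostring(entry, encoding="utf-8")

def build_post(feed_url, entry_bytes):
    """Prepare the AtomPub POST; nothing is sent until urlopen is called."""
    return urllib.request.Request(
        feed_url,
        data=entry_bytes,
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )

# Hypothetical metadata record and feed URL.
METADATA = '<record xmlns="http://example.org/ns/metadata"><name>Example</name></record>'
req = build_post(
    "http://example.org/feeds/collection-a",
    build_entry("Example title", "Example summary", METADATA),
)
print(req.get_method(), req.full_url)
```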

POST just the content to the feed
AtomPub allows this as well. In this situation, since ONLY the content (file) is POSTed to the feed, the server creates the atom entry. We would need to decide if we want to allow it. This may be useful for media collections (images, etc), where the title, summary, etc are less relevant than for a collection or item. The server would need to generate an entire atom entry itself, and link to the provided content.

Editing items or collections

There are two legitimate operations that fall in this scope: updating the content of an entry, or updating the atom entry itself (such as attaching new content)

Updating the content of an entry
The AtomPub protocol defines a link rel="edit-media" atom link, which points to a web resource that can be updated. As per AtomPub, this resource can be updated with PUT containing the new content. This affects the content element. In the collections API, we also declare that any other atom link may be updated in a similar way. The linked resources may be updated by a PUT. Any such resource may require authentication, and may return any http error code.
Updating the entry itself
The AtomPub protocol defines a link rel="edit" link to a resource representing the individual atom entry. This can be updated with PUT. As this replaces the existing entry with the supplied entry (plus any changes created by the server), applications should pass along all elements in the original entry unless they are explicitly removing an element. It is necessary to use this method if the application wishes to add a NEW rel=related link to an entry.
This method may also be used (as an extension of the AtomPub protocol) to update the content of the entry, or any link. Using the same method described earlier, if xml is found between content or link elements, the server will create or update the linked resource with the given content.
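
The two edit operations differ only in which link is followed and what is sent. A minimal sketch, with placeholder URLs and a placeholder replacement record, and without sending anything:

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical entry; URLs are placeholders.
ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example record</title>
  <link rel="edit" href="http://example.org/entries/record-1"/>
  <link rel="edit-media" href="http://example.org/records/1/nsdl_dc"/>
</entry>
"""

def edit_targets(entry_xml):
    """Map 'edit' and 'edit-media' to their href values, where present."""
    entry = ET.fromstring(entry_xml)
    return {l.get("rel"): l.get("href") for l in entry.findall(ATOM + "link")
            if l.get("rel") in ("edit", "edit-media")}

def build_put(url, body, content_type):
    """Prepare a PUT request; nothing is sent until urlopen is called."""
    return urllib.request.Request(url, data=body, method="PUT",
                                  headers={"Content-Type": content_type})

targets = edit_targets(ENTRY)
# Replace the content (metadata record or media file) via edit-media.
media_put = build_put(targets["edit-media"], b"<record>new metadata</record>",
                      "application/xml")
# Replace the entry itself via edit; the whole entry is sent, not a fragment.
entry_put = build_put(targets["edit"], ENTRY.encode("utf-8"),
                      "application/atom+xml;type=entry")
print(media_put.full_url, entry_put.full_url)
```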

Usage in terms of use cases

Web Feed Ingest

List collections with titles
Given the URL of the feed of NSDL collections, iterate through atom entries. Their title is in the <title> element, and their feed url is in a <link rel="related" title="feed"> element.
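
A sketch of that iteration, run against an inline sample feed rather than a live URL (collection names and URLs are invented):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical collection-of-collections feed; titles and URLs are placeholders.
FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>NSDL collections</title>
  <entry>
    <title>Collection A</title>
    <link rel="related" title="feed" href="http://example.org/feeds/collection-a"/>
    <link rel="related" title="dcs_data" href="http://example.org/data/collection-a"/>
  </entry>
  <entry>
    <title>Collection B</title>
    <link rel="related" title="feed" href="http://example.org/feeds/collection-b"/>
  </entry>
</feed>
"""

def collection_titles_and_feeds(feed_xml):
    """Return (title, member feed url) for every entry in the feed."""
    feed = ET.fromstring(feed_xml)
    result = []
    for entry in feed.findall(ATOM + "entry"):
        # Several rel=related links may exist; the title picks out the item feed.
        feed_url = next(
            (l.get("href") for l in entry.findall(ATOM + "link")
             if l.get("rel") == "related" and l.get("title") == "feed"),
            None,
        )
        result.append((entry.findtext(ATOM + "title"), feed_url))
    return result

listing = collection_titles_and_feeds(FEED)
for title, url in listing:
    print(title, url)
```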
Attach source of records (WFI) to a collection
Explicitly enabling/defining/attaching a source of records is perhaps a need dictated by policy and security. Compare the following procedures:
  1. For any items added to a collection via WFI, simply declare <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest" /> on an ad-hoc basis.
  2. Use the same <category> tag, but before a collection will accept such an entry, the WFI source will need to be listed in the collection's ncs_collect record. Thus, edit the collection record to declare the source, then start adding records at will. Applications that have permission to edit a particular collection would be able to edit *all* records in a collection regardless of source. The source category would merely be a tool in allowing an app to select only its own records.
  3. Use a MetadataProvider handle rather than a human readable string as the source. NCS (or another admin app) would create a new source MetadataProvider for the collection, and assign permissions to the source app. This would assure that one source app would not be able to edit the records of another source app within a collection.
Store app-specific config data with collection or records
App-specific data associated with a collection would be located in that collection's collection record atom entry. The app would add a <link rel="related" title="X"> link, where the title X is chosen by the app. (e.g. see the title=dcs_data link in this entry). For data associated with an item record, the same technique would apply, though the link would appear in the item record's atom entry.
There are two ways to introduce a new <link rel="related" title="X"> link to the atom record. As per the AtomPub protocol, both involve an http PUT to the resource in the entry's <link rel="edit"> link. The payload of this PUT request is the full content of the atom entry plus the new link. The server shall accept two different representations of the new link
  1. A standard atom link, complete with a href url. In this case, the app would have put its content at the given url, and merely references it in the atom link.
  2. The app creates the atom link, but does NOT include a href attribute. Instead, the atom link element contains nested elements containing xml content. The server will create a new web resource, give it the xml content found between the link tags, and save the atom entry with the new atom link pointing to the new url containing the requested content. This behaviour is NOT part of the atomPub spec.
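
The two link representations can be sketched as follows. The function produces the full entry that would form the PUT payload; the link title wfi_config, the config namespace, and all example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical collection entry as retrieved from its edit URI.
ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://ndr.nsdl.org/collections/example-collection</id>
  <title>Collection A</title>
  <link rel="edit" href="http://example.org/entries/collection-a"/>
</entry>
"""

def with_related_link(entry_xml, title, href=None, inline_xml=None):
    """Return the full entry, serialized, with one extra rel=related link."""
    if (href is None) == (inline_xml is None):
        raise ValueError("give exactly one of href or inline_xml")
    entry = ET.fromstring(entry_xml)
    link = ET.SubElement(entry, ATOM + "link", {"rel": "related", "title": title})
    if href is not None:
        # Form 1: the app already stored its data and only references it.
        link.set("href", href)
    else:
        # Form 2 (not part of AtomPub): no href, the data is nested in the link.
        link.append(ET.fromstring(inline_xml))
    return ET.tostring(entry, encoding="utf-8")

by_reference = with_related_link(
    ENTRY, "wfi_config", href="http://example.org/wfi/config.xml")
inline = with_related_link(
    ENTRY, "wfi_config",
    inline_xml='<config xmlns="http://example.org/ns/wfi"><interval>daily</interval></config>')
print(by_reference.decode("utf-8"))
print(inline.decode("utf-8"))
```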
Determine all records in a collection that were supplied by WFI
Iterate through all records in a collection feed, and discard all those that do not have a <category scheme="http://ns.nsdl.org/collections/source"> category for the specified app.
  • We may or may not want to expose a way to filter the results, so the feed contains only the desired entries. This may be easy or difficult depending on how the feeds are implemented.
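
Client-side filtering by source could look like the sketch below. The sample feed and its ids are invented; the scheme URI and the source strings are the ones used on this page.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
SOURCE_SCHEME = "http://ns.nsdl.org/collections/source"

# Hypothetical collection feed with records from two sources.
FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://ndr.nsdl.org/collections/r1</id>
    <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest"/>
  </entry>
  <entry>
    <id>http://ndr.nsdl.org/collections/r2</id>
    <category scheme="http://ns.nsdl.org/collections/source" term="NCS"/>
  </entry>
  <entry>
    <id>http://ndr.nsdl.org/collections/r3</id>
    <category scheme="http://ns.nsdl.org/collections/type" term="record"/>
    <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest"/>
  </entry>
</feed>
"""

def entries_from_source(feed_xml, source):
    """Return the ids of entries whose source category names the given app."""
    feed = ET.fromstring(feed_xml)
    kept = []
    for entry in feed.findall(ATOM + "entry"):
        terms = {c.get("term") for c in entry.findall(ATOM + "category")
                 if c.get("scheme") == SOURCE_SCHEME}
        if source in terms:
            kept.append(entry.findtext(ATOM + "id"))
    return kept

wfi_ids = entries_from_source(FEED, "WebFeedIngest")
print(wfi_ids)
```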

NCS

De-accession collections
To completely purge a collection and its members from the repository
http DELETE to the collection atom entry uri (found in the rel="edit" link). The system will purge all records in the collection feed, then purge the collection atom entry (though not necessarily in that order)
To de-activate a collection but NOT purge its content from the repository
Edit the atom entry for the collection, add the element <app:draft>yes</app:draft>. At that point, we have two possible choices for system response:
  1. The system modifies every member record to add <app:draft>yes</app:draft>. Applications and indexes will then respect this marker, and disregard the entries
  2. The system will make the membership feed inaccessible (perhaps a 403 response) to all clients except those who have permission to see the feed
Store collections in NDR for preservation that are not yet visible
Same use of <app:draft>yes</app:draft> as above. There would be no distinction as to WHY the collection is not visible (e.g. not ready vs de-accessioned but in repository), except perhaps in the collection metadata record.
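
A sketch of setting and respecting the draft marker follows. One assumption to flag: RFC 5023 defines app:draft as a child of app:control, so the code nests it that way, whereas the text above writes the element on its own. The sample feed and ids are invented.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
APP = "{http://www.w3.org/2007/app}"

def mark_draft(entry):
    """Add <app:control><app:draft>yes</app:draft></app:control> to an entry."""
    control = entry.find(APP + "control")
    if control is None:
        control = ET.SubElement(entry, APP + "control")
    draft = control.find(APP + "draft")
    if draft is None:
        draft = ET.SubElement(control, APP + "draft")
    draft.text = "yes"

def is_draft(entry):
    """True if the entry carries app:draft with the value 'yes'."""
    value = entry.findtext(f"{APP}control/{APP}draft", default="no")
    return value.strip() == "yes"

# Hypothetical feed of two collection entries.
feed = ET.fromstring("""\
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>http://ndr.nsdl.org/collections/a</id></entry>
  <entry><id>http://ndr.nsdl.org/collections/b</id></entry>
</feed>
""")
entries = feed.findall(ATOM + "entry")
mark_draft(entries[0])  # de-activate collection "a"

# An application or index that respects the marker skips draft entries.
visible = [e.findtext(ATOM + "id") for e in entries if not is_draft(e)]
print(visible)
```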
Add new collections
Create an entry in a collection of collections feed (e.g. NSDL collections) as per AtomPub. This could be done in two ways:
  1. POST the full collection atom entry to the collection feed uri. The primary collection metadata would be xml nested in the content element. The collection members feed will be added by the server and will appear in the stored atom entry, and the uri of the new entry will be returned to the client as per AtomPub.
  2. POST just the collection metadata to the feed uri. An accept element in the service document for the collection of collections will let the client know that this is possible (e.g. the mimetype of the collection metadata is acceptable). The system will take care of creating the atom entry, and the client would be returned the uri of the new atom entry, as per AtomPub. If the app needs to add additional info, it would need to subsequently edit the atom entry.
Add new records
Same as adding new collections, except POSTing to the individual collection's feed.
Load collections/items from repository
Given a collection-of-collections feed uri (or discovering it through the service document), or an individual collection feed (if items are desired rather than collection records), iterate through the atom feed. NCS may have to (or want to) discard records that do not have NCS as a source, and perhaps pick a single metadata format to present to the user for editing.
Store app-specific out-of-band data associated with collections or items
See the explanation in the Web Feed Ingest example.
Modify collection metadata or app-specific data
As per the AtomPub spec, editing collection metadata is achieved by a PUT of the new metadata document to the edit-media link in the collection's atom entry. If the NCS needs to update non-primary metadata (say, nsdl_dc), it can do a PUT to the uri in the <link rel="enclosure" title="format_nsdl_dc"> link. Any other link (including out of band rel="related" links) can be edited in the same fashion. For creating new non-primary metadata enclosures or out of band related links, see the Web Feed Ingest example above.
Manage non-NSDL collections in repository
Nothing special is needed. Non-nsdl collections would simply exist in some collection-of-collections feed that is NOT the "NSDL collection of collections" feed.

DDS

Discover collections
Within the scope of a collection-of-collections (e.g. discovering new NSDL collections), new collections are discovered merely by detecting new entries in the collection-of-collections feed.
Load records in a given collection
The collection record entry will have an atom link to the feed that contains all records in a collection: <link rel="related" title="feed" type="application/atom+xml;type=feed">
Determine the collections for a given record
All item records have a link back to the parent collection atom entry, using the atom rel="via" link.
Determine the native metadata format for a given record
The native metadata record is in the content element of the atom entry.
Read specific metadata formats from a given record
All resources considered to be metadata records are present in atom rel="enclosure" links, even the primary metadata record. There is a naming convention to use the title parameter to convey the metadata format's name.
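
Reading the native record and the available formats from one entry might look like this sketch. It assumes the title convention is a format_ prefix followed by the format name, which is inferred from the single format_nsdl_dc example on this page, and it reads the native record location from Atom's src attribute on the content element. URLs are placeholders.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical record entry with two metadata formats.
ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example record</title>
  <content type="application/xml" src="http://example.org/records/1/ncs_item"/>
  <link rel="via" href="http://example.org/entries/collection-a"/>
  <link rel="enclosure" title="format_ncs_item" href="http://example.org/records/1/ncs_item"/>
  <link rel="enclosure" title="format_nsdl_dc" href="http://example.org/records/1/nsdl_dc"/>
</entry>
"""

def metadata_formats(entry_xml):
    """Return (native record url, {format name: url}) for one entry."""
    entry = ET.fromstring(entry_xml)
    content = entry.find(ATOM + "content")
    native = content.get("src") if content is not None else None
    formats = {}
    for link in entry.findall(ATOM + "link"):
        title = link.get("title") or ""
        # Assumed convention: title is "format_" plus the format name.
        if link.get("rel") == "enclosure" and title.startswith("format_"):
            formats[title[len("format_"):]] = link.get("href")
    return native, formats

native_url, formats = metadata_formats(ENTRY)
print(native_url, formats)
```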

Collections reporting

How many collections use NCS? OAI harvest? WFI?
If the source is recorded in collection metadata, this is a job for a search index. If not, this can be achieved by iterating through the desired set of records and counting the source categories.
What format(s) are used in a given collection?
If the format is recorded in collection metadata, this is a job for a search index. If not, this can be achieved by iterating through the desired set of records, and counting the format categories. If this is to include all translations and alternate formats, then all the enclosure link titles will need to be counted as well.
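
The counting approach for both questions can be sketched with one helper that tallies the terms of a given category scheme across a feed. The sample feed is invented; the scheme URIs and terms are taken from the Categories section of this page.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ATOM = "{http://www.w3.org/2005/Atom}"
SOURCE_SCHEME = "http://ns.nsdl.org/collections/source"
FORMAT_SCHEME = "http://ns.nsdl.org/collections/format"

def entry(source, fmt):
    """Build one sample entry carrying a source and a format category."""
    return (f'<entry><category scheme="{SOURCE_SCHEME}" term="{source}"/>'
            f'<category scheme="{FORMAT_SCHEME}" term="{fmt}"/></entry>')

# Hypothetical feed of three records.
FEED = ('<feed xmlns="http://www.w3.org/2005/Atom">'
        + entry("NCS", "nsdl_dc") + entry("OAI", "nsdl_dc") + entry("NCS", "ncs_item")
        + "</feed>")

def count_terms(feed_xml, scheme):
    """Count category terms of one scheme across all entries of a feed."""
    feed = ET.fromstring(feed_xml)
    return Counter(
        c.get("term")
        for e in feed.findall(ATOM + "entry")
        for c in e.findall(ATOM + "category")
        if c.get("scheme") == scheme
    )

source_counts = count_terms(FEED, SOURCE_SCHEME)
format_counts = count_terms(FEED, FORMAT_SCHEME)
print(source_counts, format_counts)
```

For a paginated feed, the same counters would need to be accumulated over every feed document.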
How many items are in a collection? Historical data?
Feeds could list the number of items in them, but this proposal does not include that at this time. There is no explicit means of presenting historic data - this proposal focuses entirely on present repository state.

Categories

Scheme - http://ns.nsdl.org/collections/type
Terms - {collection, record, media}
  • record - used when the entry contains a metadata item. Entries categorized using the 'record' term must contain a "rel=via" link to the entry describing the collection. They must have a 'content' element pointing to the primary metadata record, and must also contain a "rel=enclosure" link to this record. Record entries may contain additional metadata formats represented by additional enclosure links. Lastly, record entries must include a "rel=alternate" link containing the URL of the resource being cataloged. Example: feed containing 'record' type.
  • collection - used when the entry contains a collection record. Entries categorized using the 'collection' term must contain a link to the feed containing the collection members, and must contain the elements expected of a 'record' entry. Example: feed containing 'collection' type.
  • media - used when the entry contains some form of primary content or media. Entries categorized using the 'media' term must contain a link to the 'collection' entry describing the collection, and must contain a 'content' element and a 'rel=enclosure' link to the URL of the primary content. Example feed containing 'media' type.
Scheme - http://ns.nsdl.org/collections/source
Terms - string representing the application that created the record
Each app may represent itself as a string (e.g. "NCS", "WebFeedIngest", "OAI") if it needs to know that it was the creator of a given record. Example feed containing multiple sources
Scheme - http://ns.nsdl.org/collections/format
Terms - metadata format string
A string that represents the primary metadata format for a record or collection entry. Not sure where the master list is, but examples include the familiar nsdl_dc, ncs_item, ncs_collect, etc. Example feed containing multiple formats