TNS Internal:CollectionAPI/AtomPub
From NSDLWiki
This is a proposal for a collection-oriented interface to the repository, based on the Atom and AtomPub standards. The goal is to make the collections and items of the repository discoverable in a simple manner while using established standards and conventions.
Overview
This Atom/AtomPub based interface is composed of three types of web resources:
- AtomPub Service documents
- Lists all collections housed in the repository, with links to their feeds
- Groups collections into workspaces
- Lists the capabilities of the collection, including acceptable member categories and mime types. A vocabulary of categories is proposed to represent type, source, and format of entries.
- Atom feed documents
- Each feed document is a web resource with a resolvable URI.
- Clients may add or remove entries from feeds using techniques defined in the AtomPub protocol
- Atom entries
- Each entry contains a record or media file as its content.
- Entries contain a number of links to related resources, such as the atom feed containing member items (in the case of an entry that represents a collection record), out-of-band data an application needs to attach to the record, or the resource being described in record metadata.
- Each entry has a resolvable URI, and may be edited per the AtomPub protocol
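As an illustration of the first resource type, the following is a minimal sketch of how a client might read a service document and list the collection feeds in each workspace. The sample document, its URLs, and the function name `list_collections` are hypothetical; only the AtomPub and Atom namespaces and the type category scheme come from the standards and from this proposal.

```python
import xml.etree.ElementTree as ET

NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
}

# Hypothetical service document. The workspace title follows the examples in
# this proposal; every URL is made up for illustration.
SERVICE_XML = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>NSDL collections</atom:title>
    <collection href="http://example.org/feeds/collection-a">
      <atom:title>Collection A</atom:title>
      <accept>application/atom+xml;type=entry</accept>
      <categories fixed="yes">
        <atom:category scheme="http://ns.nsdl.org/collections/type" term="record"/>
      </categories>
    </collection>
  </workspace>
</service>
"""


def list_collections(service_xml):
    """Return (workspace title, collection title, feed URL, accepted types) tuples."""
    root = ET.fromstring(service_xml)
    rows = []
    for workspace in root.findall("app:workspace", NS):
        ws_title = workspace.findtext("atom:title", default="", namespaces=NS)
        for coll in workspace.findall("app:collection", NS):
            title = coll.findtext("atom:title", default="", namespaces=NS)
            accepts = [a.text.strip() for a in coll.findall("app:accept", NS) if a.text]
            rows.append((ws_title, title, coll.get("href"), accepts))
    return rows


for row in list_collections(SERVICE_XML):
    print(row)
```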
Basic model
The entire repository record space is represented as a series of atom feeds and entries. An entry serves as a container for its content, which may be either a metadata record or a (possibly binary) media file. Besides the content, the collection API makes extensive use of atom links to relay information about the content. The basic elements of an entry are as follows:
- id
- Globally unique identifier required by atom spec. The server may choose to generate its own id for an entry (if none is given), or use an acceptable value provided by the client. The convention used in the collection api is that it is a uri with the prefix http://ndr.nsdl.org/collections/.
- title
- Required by the atom spec. For metadata-based entries, this should be the catalogued title. For media files, this should be a file name.
- updated
- The server will maintain the last updated date for each record
- content
- The content of the entry. Contains an href link to a web resource containing the content (metadata record or media file).
- summary
- Required by the atom spec if content is not human-readable text or html. This can be anything, but typically should be a description from metadata, or a short description of a media file. This field is solely for the benefit of humans scanning the entries.
- category
- Optional under the atom spec, but required by the collections API. A record may have multiple categories. The collections API defines categories to reflect the entry type, source, or metadata format. See vocabulary of categories
- link
- The collections api makes extensive use of atom links. Each link has a rel attribute. The collections api uses standard link relations as follows:
- rel=via: link from a metadata item entry to the collection record entry for the collection(s) of which it is a member.
- rel=related: link to other non-record content. Applications can attach their own 'private' data to any entry using this link. There may be any number of such links, and they are differentiated by their title.
- rel=alternate: for entries containing metadata records, this contains the URI of the resource being described by the record (typically, this is the URL of some site on the web). There may be only one such link.
- rel=enclosure: link to the media file, or all forms/formats of metadata associated with a record. There must be at least one enclosure link, pointing to the URI of the content of the entry. There may be any number of enclosure links.
- rel=edit, rel=edit-media: defined by the AtomPub spec, used for modifying the content of the atom entry, or the metadata record or media file itself.
Entries must be in one of three type categories: record, collection, or media.
- record entry
- A record entry contains a metadata record as its content. This record is considered its "primary" record. It may, optionally, contain any number of enclosure links that contain alternate formats of the record. Clients may or may not understand each format. Since every record is a member of a collection, all record entries must contain a rel=via link to the entry describing its collection metadata.
- collection entry
- A collection atom entry is a record entry that contains, as its content, a collection metadata record. Every collection has an associated atom feed containing the members of a collection (which is created by the system). A collection atom entry, therefore, contains a rel=related title="feed" link to its item feed. [Josh: Wouldn't it be better to use title="item_feed" or title="items" given that "feed" could mean an RSS URL and apply to multiple links?]
- media entry
- A media entry contains a file as its content. Like a record entry, it too must have a rel=via link to the collection it belongs to.
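To make the entry model concrete, here is a small sketch that parses a record entry and groups its links by rel. The sample entry is hypothetical: its URLs, the `format_nsdl_dc` and `dcs_data` link titles as used here, and the use of the Atom `src` attribute on the content element are assumptions about how such an entry might be serialized, not output from the repository.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"
TYPE_SCHEME = "http://ns.nsdl.org/collections/type"

# Hypothetical record entry; all example.org URLs are illustrative only.
ENTRY_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://ndr.nsdl.org/collections/example-record</id>
  <title>Example catalogued title</title>
  <updated>2009-01-01T00:00:00Z</updated>
  <summary>Short description taken from the metadata.</summary>
  <category scheme="http://ns.nsdl.org/collections/type" term="record"/>
  <content type="application/xml" src="http://example.org/records/1/nsdl_dc"/>
  <link rel="via" href="http://example.org/entries/collection-a"/>
  <link rel="alternate" href="http://example.org/described-resource"/>
  <link rel="enclosure" title="format_nsdl_dc" href="http://example.org/records/1/nsdl_dc"/>
  <link rel="related" title="dcs_data" href="http://example.org/records/1/dcs_data"/>
</entry>
"""


def summarize_entry(entry_xml):
    """Return the id, title, type category and links (grouped by rel) of an entry."""
    entry = ET.fromstring(entry_xml)
    links = defaultdict(list)
    for link in entry.findall(ATOM + "link"):
        # The Atom spec treats a link without rel as rel="alternate".
        rel = link.get("rel", "alternate")
        links[rel].append((link.get("title"), link.get("href")))
    types = [c.get("term") for c in entry.findall(ATOM + "category")
             if c.get("scheme") == TYPE_SCHEME]
    return {
        "id": entry.findtext(ATOM + "id"),
        "title": entry.findtext(ATOM + "title"),
        "type": types[0] if types else None,
        "links": dict(links),
    }


summary = summarize_entry(ENTRY_XML)
print(summary["type"], sorted(summary["links"]))
```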
Examples
Sometimes, these concepts are best understood through example. Here are some examples of feeds that comprise this collection API. Look at the feeds in various browsers and feed readers. (note: service documents are not feeds, and are likely to be unrecognized by most readers)
- Service Document (see detailed walkthrough)
- Contains ROOT workspace listing the "collections of collections" hosted in this repository
- "NSDL collections" workspace represents the collections comprising nsdl.org. Note the usage of categories
- "NSDL media" workspace represents a collection of collections containing nsdl.org media. Its only member is the brand images collection, which contains media entries.
- Feed containing collection records (see detailed walkthrough)
- Represents NSDL collection of collections
- Note how each entry contains a link to the collection feed
- Feed containing nsdl_dc metadata records (see detailed walkthrough)
- Note that this feed actually has SIX members, but is split into two feeds. Atom pagination is achieved via navigation links ("rel=next"). Most readers only read the first feed page.
- Just contains nsdl_dc, except one record that contains an additional enclosure containing 'native' nsdl_dc (i.e. the raw, un-transformed nsdl_dc harvested from the source)
- Feed containing metadata records from multiple sources, in different formats (see detailed walkthrough)
- Contains records from NCS in msp2 format, and from Web feed ingest in nsdl_dc format
- Contains "out of band" dcs_data link for app-specific data associated with the record
- Contains raw feed data from WFI as a non-primary enclosure
- Feed containing media: brand images! (see detailed walkthrough)
- Pure image media, with no real cataloging associated with them.
Reading from the NDR
Feeds
Feeds are the primary means for reading data out of the ndr using the collection API. All feeds have the following properties:
- Stable URIs
- A feed URI never changes, unless it is deleted/removed. If an app wanted to monitor the items in a single collection, it would be appropriate to configure the app with the collection's URI.
- Entries ordered by last update date
- As per the atom spec, feeds are ordered such that the most recently updated/added entry appears first, and the oldest entry is last
- Pagination for large feeds
- Large feeds are broken up into a series of feed documents. If the feed has been broken up in this way, the atom pagination convention is used such that each feed document contains rel=next, rel=prev, rel=first or rel=last links in the feed preamble. To facilitate use in browsers (live bookmarks), the first feed document may be much shorter than subsequent ones (say, 25 entries).
- Access restrictions
- Feeds MAY require authentication or authorization in order to read their content.
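The pagination convention can be sketched as a client loop that follows rel=next links until none remain. In this sketch the `fetch` argument is a stand-in for an HTTP GET, and the two in-memory pages (and their URLs) are hypothetical fixtures that mimic a six-member feed split into two documents.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def page(entry_ids, next_url=None):
    """Build a tiny feed document as a string (fixture for this sketch only)."""
    feed = ET.Element(ATOM + "feed")
    if next_url:
        ET.SubElement(feed, ATOM + "link", rel="next", href=next_url)
    for entry_id in entry_ids:
        entry = ET.SubElement(feed, ATOM + "entry")
        ET.SubElement(entry, ATOM + "id").text = entry_id
    return ET.tostring(feed, encoding="unicode")


# Hypothetical paged feed: six members split across two feed documents.
PAGES = {
    "http://example.org/feed": page(["e6", "e5", "e4"], "http://example.org/feed?page=2"),
    "http://example.org/feed?page=2": page(["e3", "e2", "e1"]),
}


def iter_entries(first_url, fetch):
    """Yield every entry of a paged feed by following rel=next links."""
    url, seen = first_url, set()
    while url and url not in seen:  # the seen set guards against link loops
        seen.add(url)
        feed = ET.fromstring(fetch(url))
        yield from feed.findall(ATOM + "entry")
        next_link = feed.find(ATOM + "link[@rel='next']")
        url = next_link.get("href") if next_link is not None else None


ids = [e.findtext(ATOM + "id") for e in iter_entries("http://example.org/feed", PAGES.__getitem__)]
print(ids)
```

A real client would pass a function that performs an authenticated GET where the feed requires it.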
There is no real difference between collection feeds and collection member feeds except that a collection feed contains a link to a service document in which it appears.
Entries
As mentioned earlier, all entries have their own URI, and exist as self-contained web resources outside the context of their feed. This is useful for JMS messages. If an object is updated in the NDR, it shall be possible to retrieve the corresponding entry resource.
Writing to the NDR
Writing to the NDR via the collections API uses techniques specified in the AtomPub protocol, plus a few extensions for some of the more "advanced" features (such as creating "out of band" content).
Adding new items/collections
There are two possible ways to add new items using AtomPub principles:
- POST an atom entry to a feed
- This atom entry is essentially the same entry that will be added to the feed. The server is allowed to reject or make changes to the posted entry before storing and adding to the feed as follows:
- id: If an id is provided, the server will use it, or return an error. If an id is not provided, the server will create one.
- updated: The server will ignore any value provided by the user, and will create its own.
- content: Any xml content between the content tags will be used to create a new web resource containing that content. The resulting content element will contain a link to the URL of this content.
- link: As for the content element, any atom links that don't specify an href, but instead contain xml content, will result in a resource being created and linked to. The server may add new links as well. In the case of an item entry, the server will create a link to its collection (and overwrite any link provided by the user to this effect).
- Other required elements (title, summary, etc.): these should really be defined by the app, but the server could potentially fill them with some sort of reasonable content.
As per AtomPub, the response headers contain the location of the newly created entry.
- POST just the content to the feed
- AtomPub allows this as well. In this situation, since ONLY the content (file) is POSTed to the feed, the server creates the atom entry. We would need to decide if we want to allow it. This may be useful for media collections (images, etc.), where the title, summary, etc. are less relevant than for a collection or item. The server would need to generate an entire atom entry itself, and link to the provided content.
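The first option (POSTing a full atom entry) might look like the following sketch. It only builds the request and does not send it. The function name `build_entry_post`, the feed URL, and the metadata namespace are hypothetical; the entry deliberately omits id and updated, relying on the server behaviour described above.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ATOM = "{" + ATOM_NS + "}"
TYPE_SCHEME = "http://ns.nsdl.org/collections/type"

# Serialize Atom elements with an "atom:" prefix instead of a generated one.
ET.register_namespace("atom", ATOM_NS)


def build_entry_post(feed_url, title, summary, type_term, metadata_xml):
    """Build (but do not send) an AtomPub POST that creates a new entry."""
    entry = ET.Element(ATOM + "entry")
    ET.SubElement(entry, ATOM + "title").text = title
    ET.SubElement(entry, ATOM + "summary").text = summary
    ET.SubElement(entry, ATOM + "category", scheme=TYPE_SCHEME, term=type_term)
    # Inline XML content: per this proposal the server stores it as a new
    # web resource and rewrites the content element to link to it.
    content = ET.SubElement(entry, ATOM + "content", type="application/xml")
    content.append(ET.fromstring(metadata_xml))
    # No id or updated element: the server generates both.
    return urllib.request.Request(
        feed_url,
        data=ET.tostring(entry, encoding="utf-8"),
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


# Hypothetical metadata record and feed URL.
request = build_entry_post(
    "http://example.org/feeds/collection-a",
    "Example catalogued title",
    "Short description",
    "record",
    '<record xmlns="http://example.org/ns/hypothetical"><name>demo</name></record>',
)
print(request.get_method(), request.full_url)
```

When the request is actually sent (for example with `urllib.request.urlopen`), the client would read the `Location` response header to learn the URI of the newly created entry.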
Editing items or collections
There are two legitimate operations that fall in this scope: updating the content of an entry, or updating the atom entry itself (such as attaching new content)
- Updating the content of an entry
- The AtomPub protocol defines a link rel="edit-media" atom link, which points to a web resource that can be updated. As per AtomPub, this resource can be updated with a PUT containing the new content. This affects the content element. In the collections API, we also declare that any other atom link may be updated in a similar way. The linked resources may be updated by a PUT. Any such resource may require authentication, and may return any http error code.
- Updating the entry itself
- The AtomPub protocol defines a link rel="edit" link to a resource representing the individual atom entry. This can be updated with PUT. As this replaces the existing entry with the supplied entry (plus any changes made by the server), applications should pass along all elements in the original entry unless they are explicitly removing an element. It is necessary to use this method if the application wishes to add a NEW rel=related link to an entry.
- This method may also be used (as an extension of the AtomPub protocol) to update the content of the entry, or any link. Using the same method described earlier, if xml is found between content or link elements, the server will create or update the linked resource with the given content.
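Updating the entry itself could be sketched as follows: the client re-sends the whole entry, unchanged, plus one new rel=related link, to the URI in the rel=edit link. The request is built but not sent. The sample entry, its URLs, and the helper name `build_related_link_put` are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical entry as previously retrieved from the server.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://ndr.nsdl.org/collections/example-record</id>
  <title>Example record</title>
  <updated>2009-01-01T00:00:00Z</updated>
  <link rel="edit" href="http://example.org/entries/1"/>
  <link rel="via" href="http://example.org/entries/collection-a"/>
</entry>
"""


def build_related_link_put(entry_xml, title, href):
    """Build (but do not send) a PUT that adds a rel=related link to an entry."""
    entry = ET.fromstring(entry_xml)
    edit = entry.find(ATOM + "link[@rel='edit']")
    if edit is None:
        raise ValueError("entry has no rel=edit link, so it cannot be updated")
    # Every original element is kept, because the PUT replaces the stored entry.
    ET.SubElement(entry, ATOM + "link", rel="related", title=title, href=href)
    return urllib.request.Request(
        edit.get("href"),
        data=ET.tostring(entry, encoding="utf-8"),
        method="PUT",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


request = build_related_link_put(SAMPLE_ENTRY, "dcs_data", "http://example.org/app-data/1")
print(request.get_method(), request.full_url)
```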
Usage in terms of use cases
Web Feed Ingest
- List collections with titles
- Given the URL of the feed of NSDL collections, iterate through atom entries. Their title is in the <title> element, and their feed url is in a <link rel="related" title="feed"> element.
- Attach source of records (WFI) to a collection
- Explicitly enabling/defining/attaching a source of records is perhaps a need dictated by policy and security. Compare the following procedures:
- For any items added to a collection via WFI, simply declare <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest" /> on an ad-hoc basis.
- Use the same <category> tag, but before a collection will accept such an entry, the WFI source will need to be listed in the collection's ncs_collect record. Thus, edit the collection record to declare the source, then start adding records at will. Applications that have permission to edit a particular collection would be able to edit *all* records in a collection regardless of source. The source category would merely be a tool in allowing an app to select only its own records.
- Use a MetadataProvider handle rather than a human readable string as the source. NCS (or another admin app) would create a new source MetadataProvider for the collection, and assign permissions to the source app. This would assure that one source app would not be able to edit the records of another source app within a collection.
- Store app-specific config data with collection or records
- App-specific data associated with a collection would be located in that collection's collection record atom entry. The app would add a <link rel="related" title="X"> link, where the title X is chosen by the app (e.g. see the title=dcs_data link in this entry). For data associated with an item record, the same technique would apply, though the link would appear in the item record's atom entry.
- There are two ways to introduce a new <link rel="related" title="X"> link to the atom record. As per the AtomPub protocol, both involve an http PUT to the resource in the entry's <link rel="edit"> link. The payload of this PUT request is the full content of the atom entry plus the new link. The server shall accept two different representations of the new link:
- A standard atom link, complete with an href url. In this case, the app would have put its content at the given url, and merely references it in the atom link.
- The app creates the atom link, but does NOT include an href attribute. Instead, the atom link element contains nested elements containing xml content. The server will create a new web resource, give it the xml content found between the link tags, and save the atom entry with the new atom link pointing to the new url containing the requested content. This behaviour is NOT part of the AtomPub spec.
- Determine all records in a collection that were supplied by WFI
- Iterate through all records in a collection feed, and discard all those that do not have a <category scheme="http://ns.nsdl.org/collections/source"> category for the specified app.
- We may or may not want to expose a way to filter the results, so the feed contains only the desired entries. This may be easy or difficult depending on how the feeds are implemented.
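Client-side selection of the records supplied by one source, as described in the last use case, can be sketched like this. The sample feed and its entry ids are hypothetical; the scheme URI and the "WebFeedIngest" and "NCS" terms are taken from this proposal.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
SOURCE_SCHEME = "http://ns.nsdl.org/collections/source"

# Hypothetical collection feed with records from two sources.
FEED_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>r3</id>
    <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest"/>
  </entry>
  <entry>
    <id>r2</id>
    <category scheme="http://ns.nsdl.org/collections/source" term="NCS"/>
  </entry>
  <entry>
    <id>r1</id>
    <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest"/>
  </entry>
</feed>
"""


def entries_from_source(feed_xml, source):
    """Return the entries whose source category term equals `source`."""
    feed = ET.fromstring(feed_xml)
    kept = []
    for entry in feed.findall(ATOM + "entry"):
        terms = {c.get("term") for c in entry.findall(ATOM + "category")
                 if c.get("scheme") == SOURCE_SCHEME}
        if source in terms:
            kept.append(entry)
    return kept


wfi_ids = [e.findtext(ATOM + "id") for e in entries_from_source(FEED_XML, "WebFeedIngest")]
print(wfi_ids)
```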
NCS
- De-accession collections
- To completely purge a collection and its members from the repository
- Send an http DELETE to the collection atom entry uri (found in the rel="edit" link). The system will purge all records in the collection feed, then purge the collection atom entry (though not necessarily in that order).
- To de-activate a collection but NOT purge its content from the repository
- Edit the atom entry for the collection, and add the element <app:draft>yes</app:draft>. At that point, we have two possible choices for system response:
- The system modifies every member record to add <app:draft>yes</app:draft>. Applications and indexes will then respect this marker, and disregard the entries.
- The system will make the membership feed inaccessible (perhaps a 403 response) to all clients except those who have permission to see the feed.
- Store collections in NDR for preservation that are not yet visible
- Same use of <app:draft>yes</app:draft> as above. There would be no distinction as to WHY the collection is not visible (e.g. not ready vs de-accessioned but in repository), except perhaps in the collection metadata record.
- Add new collections
- Create an entry in a collection of collections feed (e.g. NSDL collections) as per AtomPub. This could be done in two ways:
- POST the full collection atom entry to the collection feed uri. The primary collection metadata would be xml nested in the content element. The collection members feed will be added by the server and will appear in the stored atom entry, and the uri of the new entry will be returned to the client as per AtomPub.
- POST just the collection metadata to the feed uri. An accept element in the service document for the collection of collections will let the client know that this is possible (e.g. the mimetype of the collection metadata is acceptable). The system will take care of creating the atom entry, and the client would be returned the uri of the new atom entry, as per AtomPub. If the app needs to add additional info, it would need to subsequently edit the atom entry.
- Add new records
- Same as adding new collections, except POSTing to the individual collection's feed.
- Load collections/items from repository
- Given a collection-of-collections feed uri (or discovering it through the service document), or an individual collection feed (if items are desired rather than collection records), iterate through the atom feed. NCS may have to (or want to) discard records that do not have NCS as a source, and perhaps pick a single metadata format to present to the user for editing.
- Store app-specific out-of-band data associated with collections or items
- See the explanation in the Web Feed Ingest example.
- Modify collection metadata or app-specific data
- As per the AtomPub spec, editing collection metadata is achieved by a PUT of the new metadata document to the edit-media link in the collection's atom entry. If the NCS needs to update non-primary metadata (say, nsdl_dc), it can do a PUT to the uri in the <link rel="enclosure" title="format_nsdl_dc"> link. Any other link (including out-of-band rel="related" links) can be edited in the same fashion. For creating new non-primary metadata enclosures or out-of-band related links, see the Web feed ingest example above.
- Manage non-NSDL collections in repository
- Nothing special is needed. Non-nsdl collections would simply exist in some collection-of-collections feed that is NOT the "NSDL collection of collections" feed.
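The de-activation use case above relies on the AtomPub draft marker. A minimal sketch of setting and reading it follows. Note one assumption: in the AtomPub spec the app:draft element sits inside an app:control element, so the sketch nests it there, whereas the text above mentions only app:draft. The sample entry and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
APP = "{http://www.w3.org/2007/app}"

# Hypothetical collection entry.
COLLECTION_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://ndr.nsdl.org/collections/example-collection</id>
  <title>Example collection</title>
  <updated>2009-01-01T00:00:00Z</updated>
</entry>
"""


def set_draft(entry_xml, draft=True):
    """Return the entry XML with app:control/app:draft set to yes or no."""
    entry = ET.fromstring(entry_xml)
    control = entry.find(APP + "control")
    if control is None:
        control = ET.SubElement(entry, APP + "control")
    flag = control.find(APP + "draft")
    if flag is None:
        flag = ET.SubElement(control, APP + "draft")
    flag.text = "yes" if draft else "no"
    return ET.tostring(entry, encoding="unicode")


def is_draft(entry_xml):
    """True if the entry is marked as a draft (and so should be disregarded)."""
    entry = ET.fromstring(entry_xml)
    value = entry.findtext(APP + "control/" + APP + "draft", default="no")
    return value.strip() == "yes"


hidden = set_draft(COLLECTION_ENTRY)
print(is_draft(COLLECTION_ENTRY), is_draft(hidden))
```

The modified entry would then be sent with a PUT to the entry's rel="edit" URI, as in the editing section.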
DDS
- Discover collections
- Within the scope of a collection-of-collections (e.g. discovering new NSDL collections), new collections are discovered merely by detecting new entries in the collection-of-collections feed.
- Load records in a given collection
- The collection record entry will have an atom link to the feed that contains all records in a collection:
<link rel="related" title="feed" type="application/atom+xml;type=feed">
- Determine the collections for a given record
- All item records have a link back to the parent collection atom entry, using the atom rel="via" link.
- Determine the native metadata format for a given record
- The native metadata record is in the content element of the atom entry.
- Read specific metadata formats from a given record
- All resources considered to be metadata records are present in atom rel="enclosure" links, even the primary metadata record. There is a naming convention to use the title parameter to convey the metadata format's name.
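The DDS lookups above reduce to reading links by rel and title. The sketch below shows three such helpers. The sample entries and URLs are hypothetical, and the `format_...` enclosure titles assume the naming seen in the `format_nsdl_dc` example elsewhere in this proposal.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical entries; all URLs are illustrative.
COLLECTION_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Collection A</title>
  <link rel="related" title="feed" type="application/atom+xml;type=feed"
        href="http://example.org/feeds/collection-a"/>
  <link rel="related" title="dcs_data" href="http://example.org/app-data/a"/>
</entry>
"""

RECORD_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Record 1</title>
  <link rel="via" href="http://example.org/entries/collection-a"/>
  <link rel="enclosure" title="format_nsdl_dc" href="http://example.org/records/1/nsdl_dc"/>
  <link rel="enclosure" title="format_msp2" href="http://example.org/records/1/msp2"/>
</entry>
"""


def links_by_rel(entry, rel):
    """Return the link elements of an entry that carry the given rel."""
    return [l for l in entry.findall(ATOM + "link") if l.get("rel") == rel]


def member_feed_url(collection_entry_xml):
    """URL of the feed holding the collection's members, or None."""
    entry = ET.fromstring(collection_entry_xml)
    for link in links_by_rel(entry, "related"):
        if link.get("title") == "feed":
            return link.get("href")
    return None


def parent_collections(record_entry_xml):
    """URIs of the collection entries this record belongs to (rel=via)."""
    return [l.get("href") for l in links_by_rel(ET.fromstring(record_entry_xml), "via")]


def metadata_formats(record_entry_xml):
    """Map each enclosure title (the format name) to the URL of that format."""
    entry = ET.fromstring(record_entry_xml)
    return {l.get("title"): l.get("href") for l in links_by_rel(entry, "enclosure")}


print(member_feed_url(COLLECTION_XML))
print(parent_collections(RECORD_XML))
print(sorted(metadata_formats(RECORD_XML)))
```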
Collections reporting
- How many collections use NCS? OAI harvest? WFI?
- If the source is recorded in collection metadata, this is a job for a search index. If not, this can be achieved by iterating through the desired set of records and counting the source categories.
- What format(s) are used in a given collection?
- If the format is recorded in collection metadata, this is a job for a search index. If not, this can be achieved by iterating through the desired set of records, and counting the format categories. If this is to include all translations and alternate formats, then all the enclosure link titles will need to be counted as well.
- How many items are in a collection? Historical data?
- Feeds could list the number of items in them, but this proposal does not include that at this time. There is no explicit means of presenting historic data - this proposal focuses entirely on present repository state.
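Where the answer is not available from a search index, the counting approach described above can be sketched per feed with a simple tally of category terms. The sample feed is hypothetical; the scheme URIs and the term strings are the ones named in this proposal.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ATOM = "{http://www.w3.org/2005/Atom}"
SOURCE_SCHEME = "http://ns.nsdl.org/collections/source"
FORMAT_SCHEME = "http://ns.nsdl.org/collections/format"

# Hypothetical feed of three records.
REPORT_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <category scheme="http://ns.nsdl.org/collections/source" term="NCS"/>
    <category scheme="http://ns.nsdl.org/collections/format" term="ncs_item"/>
  </entry>
  <entry>
    <category scheme="http://ns.nsdl.org/collections/source" term="NCS"/>
    <category scheme="http://ns.nsdl.org/collections/format" term="nsdl_dc"/>
  </entry>
  <entry>
    <category scheme="http://ns.nsdl.org/collections/source" term="WebFeedIngest"/>
    <category scheme="http://ns.nsdl.org/collections/format" term="nsdl_dc"/>
  </entry>
</feed>
"""


def count_terms(feed_xml, scheme):
    """Count entry category terms for one scheme across a single feed document."""
    feed = ET.fromstring(feed_xml)
    counts = Counter()
    for entry in feed.findall(ATOM + "entry"):
        for category in entry.findall(ATOM + "category"):
            if category.get("scheme") == scheme:
                counts[category.get("term")] += 1
    return counts


print(count_terms(REPORT_FEED, SOURCE_SCHEME))
print(count_terms(REPORT_FEED, FORMAT_SCHEME))
```

For a paged feed, the counters from each feed document would be summed while following the rel=next links.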
Categories
- Scheme - http://ns.nsdl.org/collections/type
- Terms - {collection, record, media}
- record - used when the entry contains a metadata item. Entries categorized using the 'record' term must contain a "rel=via" link to the entry describing the collection. They must have a 'content' element pointing to the primary metadata record, and must also contain a "rel=enclosure" link to this record. Record entries may contain additional metadata formats represented by additional enclosure links. Lastly, record entries must include a "rel=alternate" link containing the URL of the resource being cataloged. Example: feed containing 'record' type.
- collection - used when the entry contains a collection record. Entries categorized using the 'collection' term must contain a link to the feed containing the collection members, and must contain the elements expected of a 'record' entry. Example: feed containing 'collection' type.
- media - used when the entry contains some form of primary content or media. Entries categorized using the 'media' term must contain a link to the 'collection' entry describing the collection, and must contain a 'content' element and a 'rel=enclosure' link to the URL of the primary content. Example feed containing 'media' type.
- Scheme - http://ns.nsdl.org/collections/source
- Terms - string representing the application that created the record
- Each app may represent itself as a string (e.g. "NCS", "WebFeedIngest", "OAI") if it needs to know that it was the creator of a given record. Example feed containing multiple sources
- Scheme - http://ns.nsdl.org/collections/format
- Terms - metadata format string
- A string that represents the primary metadata format for a record or collection entry. Not sure where the master list is, but examples include the familiar nsdl_dc, ncs_item, ncs_collect, etc. Example feed containing multiple formats
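The requirements attached to the three type terms can be expressed as a small checker. This is a sketch of one reading of the rules above, not a definitive validator: it assumes that a 'collection' entry needs everything a 'record' entry needs plus the rel=related title="feed" link, and that an entry carries exactly one type category. The sample entry and the name `validate_entry` are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
TYPE_SCHEME = "http://ns.nsdl.org/collections/type"

# Link relations each type must carry, as read from the category descriptions.
REQUIRED_RELS = {
    "record": {"via", "enclosure", "alternate"},
    "collection": {"via", "enclosure", "alternate"},  # assumption: same as record
    "media": {"via", "enclosure"},
}

# Hypothetical record entry that satisfies the rules.
VALID_RECORD = """<entry xmlns="http://www.w3.org/2005/Atom">
  <category scheme="http://ns.nsdl.org/collections/type" term="record"/>
  <content src="http://example.org/records/1/nsdl_dc"/>
  <link rel="via" href="http://example.org/entries/collection-a"/>
  <link rel="enclosure" href="http://example.org/records/1/nsdl_dc"/>
  <link rel="alternate" href="http://example.org/described-resource"/>
</entry>
"""


def validate_entry(entry_xml):
    """Return a list of rule violations; an empty list means none were found."""
    entry = ET.fromstring(entry_xml)
    types = [c.get("term") for c in entry.findall(ATOM + "category")
             if c.get("scheme") == TYPE_SCHEME]
    if len(types) != 1 or types[0] not in REQUIRED_RELS:
        return ["entry needs exactly one type category: collection, record, or media"]
    links = entry.findall(ATOM + "link")
    rels = {link.get("rel", "alternate") for link in links}
    problems = [f"missing rel={rel} link" for rel in sorted(REQUIRED_RELS[types[0]] - rels)]
    if entry.find(ATOM + "content") is None:
        problems.append("missing content element")
    if types[0] == "collection" and not any(
        link.get("rel") == "related" and link.get("title") == "feed" for link in links
    ):
        problems.append('missing rel=related title="feed" link')
    return problems


print(validate_entry(VALID_RECORD))
```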