TNS Internal:NDR/API/2.0/UnderstandingDDS

From NSDLWiki

Understanding DDS

Terminology

By convention, the following terms are used in this documentation in the ways described.

Library Collection
Used to refer to the broad concept of a Collection in a Digital Library. It is not a data structure of any kind.
collection
Used to refer to the data structure in DDS used to store information (e.g., title, description) about a Library Collection.
Metadata Record
Used to describe the XML representation of an item that conceptually resides within a Library Collection.
Annotation Record
Used to describe the XML representation of an item that extends another item and conceptually resides within a Library Collection.
record
Refers to the data structure in DDS used to store information (e.g., the XML metadata and the format of the XML metadata). This data structure is used as part of the process for managing Library Collections, Metadata Records, and Annotation Records.


Overview

DDS was designed to index information about three types of objects.

  1. Library Collection
  2. Metadata Record
  3. Annotation Record


DDS has two APIs.

  1. Repository Update Service API - adds, updates, and deletes collections and records.
  2. Search Service API - queries the Lucene index and constructs the response.


DDS System Setup

Configuration of Metadata Formats

  • Configure the metadata format used for collections. For dlese, this is "dlese_collect".
  • Pre-configure the known metadata formats to be used by records, for example nsdl_dc, msp2, etc. More can be added later. Metadata formats do not need to be known in advance: any XML format can be specified in PutCollection. All formats are indexed automatically, creating an individual search field for each XPath in the XML. Known metadata formats have additional functionality, such as indexing and search over standard fields (title, description, url, etc.) and custom fields. See the sections on Standard, XPath and Custom Search Fields.
  • Pre-configure the known annotation format(s) to be used by records, for example nsdl_anno.
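
The automatic per-XPath indexing mentioned above can be illustrated with a minimal Python sketch. It is not the DDS implementation: the function name xpath_fields and the field-naming scheme (slash-separated element paths, with @name for attributes) are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def xpath_fields(xml_text: str) -> dict[str, list[str]]:
    """Map each simple XPath in the document to the values found there.

    Hypothetical helper: the real DDS field names may differ.
    Namespaced tags would appear in ElementTree's "{uri}tag" form.
    """
    fields = defaultdict(list)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            # One search field per element path, holding its text values.
            fields[path].append(text)
        for name, value in elem.attrib.items():
            # Attributes get their own field, marked with '@'.
            fields[f"{path}/@{name}"].append(value)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return dict(fields)


sample = (
    "<record><title lang='en'>Ocean Life</title>"
    "<subject>biology</subject><subject>oceans</subject></record>"
)
for field, values in sorted(xpath_fields(sample).items()):
    print(field, values)
```

Repeated elements such as subject accumulate under one field, so a search on that field would match any of the values.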


Questions:

  • Is the configuration process through a UI or through a configuration file? Through a configuration file.
  • What is the process and syntax of the configuration? See the document Configuring Search Fields for XML Frameworks.
  • How is the XPath to putRecord's id configured? XPaths to the IDs may be configured as described above. Others are hard coded internally for a small number of XML frameworks. All others use the ID supplied in the PutRecord request.
  • How does DDS know that an XML Format is an annotation instead of a metadata record? The annotates/isAnnotatedBy relationship is currently hard coded, and dlese_anno is the only annotation format right now. To do: Make an external configuration for this.
  • Can there be more than one annotation XML Format defined? Yes.
  • For annotation format(s), how is the XPath to the id of the record being annotated configured? Currently hard coded. To do: Make this part of the external configuration.
  • What else can be configured? See the DDS Installation and Configuration documentation.
  • How are transforms registered and associated with configured xmlFormats? Transforms can be implemented as XSLT or as a Java class. These are configured in the webapp deployment descriptor for DDS (web.xml). For details, see the section XML format converters in DDS web.xml.

File System Directories

The file system is used as a redundant store of the XML metadata represented in DDS. The metadata passed to putRecord in the recordXml parameter is stored without modification in the appropriate directory. Note: By default DDS stores the metadata in the Lucene index and saves a copy as files on the file system. The files can be used to back up, mirror or restore the repository if needed. A config flag can be set to disable the file copy and store only in Lucene if desired.

The following is an example of the file system structure that is set up for DDS.


  • Set up the root directory that holds all DDS files.
/DDS
  • Set up the following file system structure for collections.
/DDS/dlese_collect             <!-- use the same name as the metadata format for collections -->
/DDS/dlese_collect/collect     <!-- 'collect' is the reserved name for collection of collections -->
  • Set up the following file system structure for records. One directory is set up for each metadata format that is supported.
/DDS/nsdl_dc
/DDS/msp2
/DDS/nsdl_anno
etc.

When putCollection creates a collection, it adds a subdirectory under the directory of the metadata format specified in the putCollection call.

  • Example: If Beyond Penguins, with collectionKey=2010030001T (aggregator handle), is indexed in DDS with metadata format=nsdl_dc, the following directory is created when putCollection is processed. (NOTE: This is not part of the DDS System Setup. This processing happens when putCollection is called.)
/DDS/nsdl_dc/2010030001T

For details see the section titled DDS file system structure in the DDS Data Source Configuration documentation
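
As a rough illustration of the layout described in this section, the Python sketch below derives the expected paths from an xmlFormat, a collectionKey and a record id. The helper names are hypothetical, and the percent-encoding of keys is an assumption inferred from example paths elsewhere on this page (such as 2200%2F2010039485); the actual DDS code may build paths differently.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

DDS_ROOT = PurePosixPath("/DDS")  # base directory used in the examples above


def encode_key(key: str) -> str:
    """Percent-encode a handle so it fits in one path segment (assumed rule)."""
    return quote(key, safe="")


def collection_dir(xml_format: str, collection_key: str) -> PurePosixPath:
    """Directory that holds the records of one collection."""
    return DDS_ROOT / xml_format / encode_key(collection_key)


def record_path(xml_format: str, collection_key: str, record_id: str) -> PurePosixPath:
    """File that holds the recordXml of one record."""
    return collection_dir(xml_format, collection_key) / encode_key(record_id)


def collection_record_path(collection_key: str) -> PurePosixPath:
    """File in the collection of collections that describes one collection."""
    return DDS_ROOT / "dlese_collect" / "collect" / encode_key(collection_key)


print(collection_dir("nsdl_dc", "2010030001T"))
print(record_path("nsdl_dc", "2200/2010039485", "2200/2010039521"))
print(collection_record_path("2200/2010039485"))
```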

Questions:

  • Does the configuration process automatically set up the directory structure, or is this a manual process?

A config parameter is used to define the base directory. From there the directory structure is automatically set up.

Setting up the Collection of Collections

A fake collection is set up in advance of running DDS with the following metadata:

  • collectionKey="collect"
  • xmlFormat="dlese_collect"
  • name=something like Collection of Collections
  • description=something like Collection of Collections for holding all collections that will be added through the putCollection call


NOTE: The /xmlFormat/collectionKey combo is represented in the file system structure above as /DDS/dlese_collect/collect.


NOTE: This setup process has implications for NDR. The current DDS restricts all collections to have one and only one metadata XML Format. I don't believe at this point that this will be a limitation if we surface the NDR concept of a collection, which also holds minimal metadata. But if higher level concepts such as membership in NSDL are to be surfaced seamlessly through DDS, it is not clear how that would happen. This also has implications for other implementations in the NDR, such as Learning Applications.

A collection in DDS can be set up to store meta-metadata, such as data that describes Membership in NSDL. The search application could use this meta-metadata to retrieve the NSDL membership information. It may be easiest to get the information directly from the NDR and configure appropriately; for example, we may want to set up a dedicated DDS that contains only the public NSDL collections from the NDR. Likewise, learning applications can use a DDS meta-metadata collection to store application data, or get it directly from the NDR if appropriate.

Configuring Search

  • configure what should be searched when find is called without specifying search terms (defaults to search all element values in the XML)


Library Collection

Collection Configuration Process

An XML Format must be configured prior to being specified as an xmlFormat in the putCollection call. The xmlFormat on the putCollection call identifies the format that all Metadata Records which are to be associated with this collection will have when they are added to DDS using the putRecord call. See Record Configuration Process for more information.


Collection as a Special Case:

A "dlese_collect" XML Format is configured (hardcoded?) to hold metadata about a Library Collection. The schema for "dlese_collect" within the context of its use by DDS has only a few elements: collectionKey, title, description, and (record) xmlFormat. Is it true that dlese_collect only has collectionKey, title, description, and xmlFormat? Or is this a full collection description with rich metadata about the Library Collection? If it is, how does a dlese_collect record get added, since it isn't part of the putCollection call? I'm guessing it might also have information like create and modify dates. Is there a link I can include to the dlese_collect schema?

The dlese_collect format can contain rich metadata. See the Collection metadata schema. The REST API PutCollection method currently accepts only collectionKey, title, description, and (record) xmlFormat, which it places into a skeleton dlese_collect record under the hood. We may want to extend PutCollection to accept rich metadata too. Currently the full dlese_collect record can be placed in a DDS by copying it directly to the file system (i.e., not via the PutCollection method).

Creating a Collection

putCollection

Parameters:

  • collectionKey - <collection aggregator handle> - must be unique
  • xmlFormat - all Metadata Records for this Library Collection must be put into DDS in this XML Format
  • name - User friendly name of the Library Collection
  • description - Description of the Library Collection


DDS processing of putCollection

  • create a record to represent the collection. NOTE: There is no separate collection data structure. A collection is a record data structure that is stored in a reserved /DDS/xmlFormat/collectionKey location. In the dlese case, this is /DDS/dlese_collect/collect.
    • assemble XML for dlese_collect_metadata from the passed in title, description, xmlFormat, and collectionKey.
    • call putRecord(collectionKey,"collect","dlese_collect",dlese_collect_metadata)
  • Lucene
    • putRecord's processing handles interactions with Lucene, but here is what happens when putRecord runs...
      • What does happen? PutCollection sets up the collection and also indexes the dlese_collect record, as occurs in PutRecord.
  • File storage
    • putCollection creates a directory under the XML Formats directory structure. For example, if xmlFormat="nsdl_dc" for putCollection, the following directory is created...
      • adds /DDS/nsdl_dc/collectionKey directory (ex. /DDS/nsdl_dc/2200%2F2010039485)
        NOTE: When Metadata Records are added to this Library Collection by calling putRecord with collectionKey=collectionKey, the recordXml of the new record will be stored in this location.
    • putRecord is called to create the collection within the DDS system. putRecord handles the processing for creating an entry in the collection of collections directory (pre-configured during DDS System Setup), but here is what happens when putRecord runs...
      • adds file /DDS/dlese_collect/collect/collectionKey with the constructed dlese_collect_metadata as the content of the new file
        (ex. /DDS/dlese_collect/collect/2200%2F2010039485)
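
The relationship between putCollection and putRecord described above can be sketched in Python with a toy in-memory repository. This is only an illustration: the class, the method names and the element names of the skeleton record are hypothetical and do not follow the real dlese_collect schema.

```python
import xml.etree.ElementTree as ET


class InMemoryRepository:
    """Toy stand-in for DDS: maps (xmlFormat, collectionKey, id) to XML."""

    def __init__(self):
        self.records = {}

    def put_record(self, record_id, collection_key, xml_format, record_xml):
        # The real service would also index the record in Lucene and
        # write /DDS/xmlFormat/collectionKey/id on the file system.
        self.records[(xml_format, collection_key, record_id)] = record_xml

    def put_collection(self, collection_key, xml_format, name, description):
        # Assemble a skeleton collection record from the four parameters.
        # Element names here are placeholders, not the dlese_collect schema.
        root = ET.Element("collectionRecord")
        for tag, value in (
            ("key", collection_key),
            ("title", name),
            ("description", description),
            ("xmlFormat", xml_format),
        ):
            ET.SubElement(root, tag).text = value
        skeleton = ET.tostring(root, encoding="unicode")
        # The collection is stored as an ordinary record in the reserved
        # "collect" collection, whose format is "dlese_collect".
        self.put_record(collection_key, "collect", "dlese_collect", skeleton)
        return skeleton


repo = InMemoryRepository()
repo.put_collection("2200/2010039485", "nsdl_dc", "Beyond Penguins", "Example")
print(sorted(repo.records))
```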


NOTE: dlese_collect and collect directories are hardcoded during DDS System Setup.


NOTE: The limitation of one metadata XML Format for a collection has implications for the NDR with respect to annotations. With the current processing in DDS, annotations cannot live in the same collection as the items they annotate, since the annotation will have an XML Format like nsdl_anno and the item will have an XML Format like nsdl_dc. Other potential conflicts also exist that require further exploration.

The responses from the proposed NDR v2 getResourceMetadata and listResourceMetadata methods contain heterogeneous XML for all things known about the resource including one or more metadata XML and zero or more annotation XML elements. One approach would be to place this directly into DDS as a single record using DDS PutRecord. By doing so, DDS would produce a resource centric search automatically. Collection attribution could be handled using the collection metadata in the NDR response. This would give us resource-centric search 'for free' and split the work nicely between NDR and DDS.

Search for a Collection

Questions:

  • Can you search for a collection? Yes
  • Can you limit results to "dlese_collect" XML format? Yes
  • What information can you get back for a collection? Example: DDS Search for 'ocean' in collections. Returns a set of matching collection records.

Metadata Record

Record Configuration Process

The XML Format of the Metadata Record must be configured prior to being specified as an xmlFormat in the putRecord call. See DDS System Setup for pre-configuration process.

Configurable items for a Metadata Record's XML Format:

  • xpath to id in recordXml to use in place of the id parameter
  • xpath to standard fields that allow search on those fields (ex. title, description, bounding box, etc.)
  • xpath to URLs that allow special processing of URLs
  • xpath to other special processing elements (ex. location, etc.)
  • What else?

Questions:

  • Are these all configurable, or are some hardcoded?

In general, all search fields are configurable. See the document Configuring Search Fields for XML Frameworks

Creating a Metadata Record

putRecord

Parameters:

  • id - <metadata handle> An XML Format can be configured with an xpath to an id in the recordXML. I don't foresee NDR using this.
  • collectionKey - matches the <collection aggregator handle> passed in with putCollection
  • xmlFormat - must match the xmlFormat specified with putCollection for the collection where this record is being added
  • recordXml - the Metadata Record XML in the specified format
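
The constraint that a record's xmlFormat must match its collection's xmlFormat could be checked as in the following Python sketch. The function, the exception class and the dictionary of known collections are hypothetical; whether and how DDS itself enforces this rule is not described here.

```python
class FormatMismatchError(ValueError):
    """Raised when a record's format differs from its collection's format."""


def check_put_record(collections, collection_key, xml_format):
    """Validate putRecord parameters against known collections.

    `collections` maps collectionKey -> xmlFormat, as registered by
    earlier putCollection calls (an assumed bookkeeping structure).
    """
    if collection_key not in collections:
        raise KeyError(f"unknown collection: {collection_key}")
    expected = collections[collection_key]
    if xml_format != expected:
        raise FormatMismatchError(
            f"collection {collection_key} expects {expected}, got {xml_format}"
        )


known = {"2200/2010039485": "nsdl_dc"}
check_put_record(known, "2200/2010039485", "nsdl_dc")  # passes silently
try:
    check_put_record(known, "2200/2010039485", "nsdl_anno")
except FormatMismatchError as err:
    print("rejected:", err)
```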


DDS processing of putRecord

  • Special Processing by DDS Code
    • standard fields
    • URLs
    • other special processing (eg. bounding-box location, What else?)
    • processing of existing annotation records in DDS that annotate this record. NOTE: See Annotation Record for more information on this processing.


Questions:

  • What kind of special processing does DDS do?
    • Is it pre-processing of what value gets stored in Lucene? Some pre-processing is done for certain fields. For example, the average star rating is calculated and encoded to allow for range queries; bounding-box lat-lon coordinates are encoded to support geospatial searching; XPaths are extracted from the XML documents to form the XPath Search Fields. See the section titled Search fields for details: DDS Search Fields
    • Is it that additional items are indexed in Lucene?
    • Is it post-processing when records are retrieved? Some post-processing occurs, for example to bring together the separately stored records, annotations and collection information into the single response for a resource.
  • With regards to annotation processing...
    • How does DDS know that a record is an Annotation Record, as opposed to a Metadata Record? This is configured in the system as described above; it is currently hard coded for dlese_anno only.
    • What process is used for locating existing annotation records? DDS searches its own index to find existing annotations.


  • Lucene
    • What happens? I'm guessing all element-value pairs and all attribute-value pairs are indexed. Does Lucene make a distinction between element-value pairs and attribute-value pairs? Is the construction of the Lucene index configured as part of the configuration of the XML Format? Answer: We index each element/attribute-value pair found in the XML document as described in the DDS documentation linked here.


  • File storage
    • adds file /DDS/xmlFormat/collectionKey/id with the contents of recordXml as the content of the new file
      (ex. /DDS/nsdl_dc/2200%2F2010039485/2200%2F2010039521)


NOTE: xmlFormat directory was created as part of the configuration process for the XML Format. collectionKey directory was created when the collection was created with putCollection.

Search for a Metadata Record

  • How do you get it back out...
    • What does Lucene do and what does DDS do?
      • Here comes my best guess...
        • Lucene gives back IDs for matching search results. Answer: Lucene gives back Documents, which contain all stored fields for the given item. The Document stores the record XML and data from each of the fields for the record, including fields from the associated annotations.
        • DDS uses these to locate the XML for each matching record and related annotation records
        • DDS uses the XML from the file system to construct the results, which are the record with annotatedBy elements added for the annotation records? Answer: XML is returned directly from the Lucene Document. I believe isAnnotatedBy annotations are fetched separately from Lucene and inserted into the response.
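
The retrieval flow in the answers above (record XML taken from the stored document, annotations fetched separately and attached) might look roughly like this in Python. The dictionary-based index and the key names are assumptions for illustration only; they stand in for Lucene Documents and are not the DDS response format.

```python
def build_result(record_id, documents):
    """Combine a record with the annotations that point at it.

    `documents` is a toy index: id -> {"xml": str, "annotates": id or absent}.
    """
    record = documents[record_id]
    # Second lookup: find annotation documents that annotate this record.
    annotations = [
        doc["xml"]
        for _, doc in sorted(documents.items())
        if doc.get("annotates") == record_id
    ]
    return {"id": record_id, "xml": record["xml"], "isAnnotatedBy": annotations}


index = {
    "rec1": {"xml": "<dc/>"},
    "anno1": {"xml": "<anno n='1'/>", "annotates": "rec1"},
    "anno2": {"xml": "<anno n='2'/>", "annotates": "other"},
}
print(build_result("rec1", index))
```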

Annotation Record

Annotation Configuration Process

The XML Format of the Annotation Record must be configured prior to being specified as an xmlFormat in the putRecord call. See DDS System Setup for pre-configuration process.


Configurable items for an Annotation Record's XML Format:

  • everything that can be configured for a Metadata Record
  • xpath to id in recordXml that identifies the id of the record that is being annotated (hardcoded?)
  • What else?


Creating an Annotation Record

putRecord

Parameters:

  • id - <annotation handle> An XML Format can be configured with an xpath to an id in the recordXML. I don't foresee NDR using this.
  • collectionKey - matches the <collection aggregator handle> passed in with putCollection
  • xmlFormat - must match the xmlFormat specified with putCollection for the collection where this annotation is being added
  • recordXml - the Annotation Record XML in the specified format

NOTE: The restriction that all records within a collection must have the same xmlFormat means that annotations, which have a metadata format like nsdl_anno, must reside in a collection separate from what they annotate. This creates limitations on the proposed NDR model, which currently does not require that all metadata within a Library Collection use only one metadata format. The current NDR model allows multiple formats of the same metadata to be stored within a Metadata object. From an implementation perspective, Metadata can be transformed into a common XML Format prior to being put into the DDS index. The problem occurs for Annotation Records in the same Library Collection as Metadata Records. It is unlikely that the XML Format of the Annotation Record could be transformed into an XML Format that is common with the Metadata Record's XML Format.

This should not be a problem. The records need to be split up and Put into different DDS collections, separate ones for metadata and separate ones for annotations, and DDS will do the work to combine them. Alternatively, the entire resource-centric response from NDR could be inserted into a single DDS collection, as described above in DDS processing of putCollection.
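
The idea of splitting one Library Collection's records into separate DDS collections by format can be sketched in Python as follows. The function name and the derived collection key (library key plus format) are a hypothetical naming scheme chosen for the example, not an NDR or DDS convention.

```python
from collections import defaultdict


def split_by_format(library_collection_key, records):
    """Group records so each DDS collection holds a single xmlFormat.

    `records` is an iterable of (record_id, xml_format, record_xml).
    Returns {(dds_collection_key, xml_format): [(record_id, record_xml), ...]}.
    """
    batches = defaultdict(list)
    for record_id, xml_format, record_xml in records:
        # Hypothetical key scheme: one DDS collection per format.
        dds_key = f"{library_collection_key}:{xml_format}"
        batches[(dds_key, xml_format)].append((record_id, record_xml))
    return dict(batches)


mixed = [
    ("m1", "nsdl_dc", "<dc/>"),
    ("a1", "nsdl_anno", "<anno/>"),
    ("m2", "nsdl_dc", "<dc/>"),
]
for (dds_key, fmt), items in split_by_format("2200/2010039485", mixed).items():
    print(dds_key, fmt, [record_id for record_id, _ in items])
```

Each resulting batch could then be created with one putCollection call followed by a putRecord call per record.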

NOTE: Another implementation issue may arise from the fact that the proposed NDR model allows annotations of Library Collections and Resources. It is unclear whether DDS could surface these types of annotations.

DDS allows annotations on collections (again, this needs configuration to be exposed). DDS could also be set up to allow annotations to be associated via URL (Resource) as well as ID. But I don't think these would be necessary.

DDS processing of putRecord

  • Special Processing by DDS Code
    • star rating annotation special processing (allows searches like: search for 4 or better)
      • can be stored individually in annotations
      • average star rating is calculated and attached to metadata record
    • What else?
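
A minimal Python sketch of the star rating processing follows. The averaging matches the description above; the encoding (a zero-padded fixed-width string, so that string order matches numeric order) is an assumption about one common way to support range queries in a text index, not a description of what DDS actually stores.

```python
def average_rating(ratings):
    """Average of the individual star ratings, or None if there are none."""
    if not ratings:
        return None
    return sum(ratings) / len(ratings)


def encode_rating(value, scale=100, width=4):
    """Encode a rating as a zero-padded string (assumed encoding).

    Fixed width keeps lexicographic order equal to numeric order,
    so "4 or better" becomes a simple term range query.
    """
    return str(round(value * scale)).zfill(width)


stars = [4, 5, 3]
avg = average_rating(stars)
print(avg, encode_rating(avg))
# A range check on the encoded form, as a search for "4 or better" might do:
print(encode_rating(avg) >= encode_rating(4))
```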


  • Lucene
    • processing of the Annotation Record is the same as that for a Metadata Record
    • additional processing creates Lucene search terms from the annotation record and adds them to the search terms for the record it annotates


  • File storage
    • adds new file /DDS/xmlFormat/collectionKey/id with the contents of recordXml as the content of the file
      (ex. /DDS/nsdl_anno/2200%2F2010039485/2200%2F2010039521)


NOTE: xmlFormat directory was created as part of the configuration process for the XML Format. collectionKey directory was created when the collection was created with putCollection.


Search for an Annotation Record

  • An individual annotation record can be found with the same process used for metadata record
  • Special processing for an annotation?