Proposed Architecture to Support Search with NDR-API v2.0

Overview

The following architecture separates the generation of the information to be indexed from the process that performs the indexing. This separation keeps the implementation flexible and leaves the choice of indexing scheme open. The architecture uses the Fedora messaging service queue to identify modified objects, combined with a search index service listener that consumes those messages.

Basic Process

Three new components would work together with an existing indexing service to maintain a searchable index for NDR.

NDR Fedora Message Queue

Fedora updates the message queue when an object is created or modified.

JW - Another approach to consider is to update the search index directly when the add/modify/delete metadata calls are made. This has the advantage of being able to send an error message back to the client if an error occurs (is this possible when using the message queue?), and of rolling back to ensure that the index and NDR are always in sync.
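
The synchronous alternative raised in the comment above can be sketched as follows. This is only an illustration under stated assumptions: `store` and `index` are in-memory dictionaries standing in for NDR and the search index, and `add_metadata_sync` and `index_record` are hypothetical names, not part of the NDR API.

```python
_MISSING = object()  # sentinel: the handle had no previous value


def index_record(index: dict, handle: str, xml: str) -> None:
    """Hypothetical indexing call; a real one would talk to the index service."""
    index[handle] = xml


def add_metadata_sync(store: dict, index: dict, handle: str, xml: str,
                      index_fn=index_record) -> None:
    """Write to the store, then index; undo the store write if indexing fails."""
    previous = store.get(handle, _MISSING)
    store[handle] = xml
    try:
        index_fn(index, handle, xml)
    except Exception:
        # Roll back so the store and the index stay in sync,
        # then re-raise so the caller can report the error to the client.
        if previous is _MISSING:
            del store[handle]
        else:
            store[handle] = previous
        raise
```

With this shape the client sees the indexing error directly, at the cost of coupling every write call to the availability of the index.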

Search Index Service Listener

listens to NDR Fedora Message Queue
calls getResourceView Disseminator to get XML blob to be indexed
makes appropriate calls to generate the index
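
A minimal sketch of the three listener steps above, assuming an in-memory `queue.Queue` in place of the Fedora message queue. The message field name `handle` and the callables `get_resource_view` and `index_record` are assumptions made for illustration; the real message format and service calls are not defined on this page.

```python
import queue
from typing import Callable


def run_listener(messages: "queue.Queue",
                 get_resource_view: Callable[[str], str],
                 index_record: Callable[[str, str], None]) -> int:
    """Drain the queue: fetch the view for each modified object and index it."""
    processed = 0
    while True:
        try:
            msg = messages.get_nowait()
        except queue.Empty:
            break  # nothing left to process in this pass
        handle = msg["handle"]                # assumed message field
        xml_blob = get_resource_view(handle)  # XML blob to be indexed
        index_record(handle, xml_blob)        # hand off to the indexing service
        processed += 1
    return processed


# Usage with in-memory stand-ins for the disseminator and the index.
index = {}
q = queue.Queue()
q.put({"handle": "2200/20061643458", "event": "modify"})
count = run_listener(
    q,
    lambda h: f"<record><header><handle>{h}</handle></header></record>",
    index.__setitem__,
)
```

Because the listener only sees the two callables, swapping the indexing service means replacing `index_record` and nothing else.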

NDR getResourceView Disseminator

The getResourceView disseminator on a metadata object would be simple code that determines the metadata object's resource object and then calls the getResourceView disseminator on that resource object. The getResourceView disseminator on the resource would return a resource-centric view including...

resource information
all metadata
all annotations
collection level information
- hierarchy of collections
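
The delegation from a metadata object to its resource object can be sketched as follows. The classes `ResourceObject` and `MetadataObject` and the dictionary returned here are hypothetical stand-ins; the actual disseminator would return the XML shown in the example output.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceObject:
    handle: str
    url: str
    metadata: list = field(default_factory=list)  # MetadataObject instances


@dataclass
class MetadataObject:
    handle: str
    resource: ResourceObject  # the resource this record describes


def get_resource_view(obj) -> dict:
    """Return the resource-centric view, whichever object type is passed in."""
    if isinstance(obj, MetadataObject):
        # A metadata object only locates its resource and delegates to it.
        return get_resource_view(obj.resource)
    return {
        "handle": obj.handle,
        "resourceURL": obj.url,
        "cataloguedBy": [m.handle for m in obj.metadata],
    }


# Usage: both entry points yield the same view.
res = ResourceObject("2200/20061643458", "http://spookystuff.com/")
md = MetadataObject("2200/20061212656", res)
res.metadata.append(md)
view = get_resource_view(md)
```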

NOTE: Branding info is not included in the example since handling of branding information is TBA. JW - Branding info is probably not needed at the item level, but should be reflected at the get/listCollection level

Example output of getResourceView()

The example includes:

1 resource defined in the /record/header
0 annotations of the resource
2 metadata records about the resource
- 1 metadata record from NCS
  - 1 annotation of the NCS metadata record
- 1 metadata record from OnRamp via WFI

JW - This response output would make it very easy to construct a resource-centric search with DDS. I assume this would be roughly the same response as from the list/getResourceMetadata calls?

<record>
  <header>
    <resourceURL>http://spookystuff.com/</resourceURL>
    <handle>2200/20061643458</handle>
    <handleURL>http://ndr.nsdl.org/api/getResourceMetadata/2200/20061643458</handleURL>
  </header>
  <annotatedBy />
  <cataloguedBy>
    <record>
      <header>
        <handle>2200/20061212656</handle>
        <handleURL>http://ndr.nsdl.org/api/getMetadataRecord/2200/20061212656</handleURL>
        <externalIdentifier source="NCS">CLC-000-000-000-061</externalIdentifier>
        <XMLFormat>nsdl_dc</XMLFormat>
        <collectionHierarchy>
          <parentCollection>
            <collectionName>Concepts Library Collaborative from NCS</collectionName>
            <collectionHandle>2200/20061258472</collectionHandle>
            <agentName>NCS</agentName>
            <agentHandle>2200/20061857483</agentHandle>
            <parentCollection>
              <collectionName>Concepts Library Collaborative</collectionName>
              <collectionHandle>2200/20061258334</collectionHandle>
              <agentName>NCS</agentName>
              <agentHandle>2200/20061857483</agentHandle>
              <parentCollection>
                <collectionName>National Science Digital Library</collectionName>
                <collectionHandle>2200/20061251230</collectionHandle>
                <agentName>NSDL</agentName>
                <agentHandle>NSDL_AGENT</agentHandle>
              </parentCollection>
            </parentCollection>
          </parentCollection>
        </collectionHierarchy>
      </header>
      <metadataXML>
        <nsdl_dc:nsdl_dc 
            xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/" 
            schemaVersion="1.02.000" 
            xmlns:dc="http://purl.org/dc/elements/1.1/" 
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
            xsi:schemaLocation="http://ns.nsdl.org/nsdl_dc_v1.02/ http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd"> 
          <dc:title>My life: a Monster's Story.</dc:title>
          <dc:description>The story of Frankenstein, blah blah blah.</dc:description> 
          <dc:identifier>http://www.frankenstein.org/autobiography</dc:identifier> 
          <dc:creator>Baron Von Frankenstein</dc:creator> 
          <dc:author>Mary Shelly</dc:author> 
        </nsdl_dc:nsdl_dc>   
      </metadataXML>
      <annotatedBy>
        <record>
          <header>
            <handle>2200/20061212656</handle>
            <handleURL>http://ndr.nsdl.org/api/getAnnotation/2200/20061212656</handleURL>
            <externalIdentifier source="NCS">CLC-000-000-000-054</externalIdentifier>
            <XMLFormat>nsdl_anno</XMLFormat>
            <annotatesHandle>2200/20061212656</annotatesHandle>
            <annotatesHandleURL>http://ndr.nsdl.org/api/getMetadataRecord/2200/20061212656</annotatesHandleURL>
            <collectionHierarchy>
              <parentCollection>
                <collectionName>Concepts Library Collaborative from NCS</collectionName>
                <collectionHandle>2200/20061258472</collectionHandle>
                <agentName>NCS</agentName>
                <agentHandle>2200/20061857483</agentHandle>
                <parentCollection>
                  <collectionName>Concepts Library Collaborative</collectionName>
                  <collectionHandle>2200/20061258334</collectionHandle>
                  <agentName>NCS</agentName>
                  <agentHandle>2200/20061857483</agentHandle>
                  <parentCollection>
                    <collectionName>National Science Digital Library</collectionName>
                    <collectionHandle>2200/20061251230</collectionHandle>
                    <agentName>NSDL</agentName>
                    <agentHandle>NSDL_AGENT</agentHandle>
                  </parentCollection>
                </parentCollection>
              </parentCollection>
            </collectionHierarchy>
          </header>
          <annotationXML>
            <nsdl_anno:nsdl_anno 
                xmlns:nsdl_anno="http://ns.nsdl.org/nsdl_anno_v1.02/" 
                schemaVersion="1.02.000" xmlns:dc="http://purl.org/dc/elements/1.1/" 
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                xsi:schemaLocation="http://ns.nsdl.org/nsdl_anno_v1.02/ http://ns.nsdl.org/schemas/nsdl_anno/nsdl_anno_v1.02.xsd"> 
              <title>Annotations for Global Sun Temperature Project</title> 
              <rating>5</rating> 
            </nsdl_anno:nsdl_anno>   
          </annotationXML>
        </record>
      </annotatedBy>
    </record>
    <record>
      <header>
        <handle>2200/20061293824</handle>
        <handleURL>http://ndr.nsdl.org/api/getMetadataRecord/2200/20061293824</handleURL>
        <externalIdentifier source="NCS">CLC-000-000-000-062</externalIdentifier>
        <XMLFormat>nsdl_dc</XMLFormat>
        <collectionHierarchy>
          <parentCollection>
            <collectionName>Concepts Library Collaborative from OnRamp RSS</collectionName>
            <collectionHandle>2200/20061258471</collectionHandle>
            <agentName>WFI</agentName>
            <agentHandle>2200/200618664235</agentHandle>
            <parentCollection>
              <collectionName>Concepts Library Collaborative</collectionName>
              <collectionHandle>2200/20061258334</collectionHandle>
              <agentName>NCS</agentName>
              <agentHandle>2200/20061857483</agentHandle>
              <parentCollection>
                <collectionName>National Science Digital Library</collectionName>
                <collectionHandle>2200/20061251230</collectionHandle>
                <agentName>NSDL</agentName>
                <agentHandle>NSDL_AGENT</agentHandle>
              </parentCollection>
            </parentCollection>
          </parentCollection>
        </collectionHierarchy>
      </header>
      <metadataXML>
        <nsdl_dc:nsdl_dc 
            xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/" schemaVersion="1.02.000" 
            xmlns:dc="http://purl.org/dc/elements/1.1/" 
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
            xsi:schemaLocation="http://ns.nsdl.org/nsdl_dc_v1.02/ http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd"> 
          <dc:title>My life: a Ghost's Story.</dc:title>
          <dc:description>The story of Casper, blah blah blah.</dc:description> 
          <dc:identifier>http://www.caspertheghost.org/autobiography</dc:identifier> 
          <dc:creator>Baron Von Casper</dc:creator> 
          <dc:author>Casper McFadden</dc:author> 
        </nsdl_dc:nsdl_dc>   
      </metadataXML>
      <annotatedBy />
    </record>
  </cataloguedBy>
</record>
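
As a rough check of how a consumer might walk this structure, the sketch below parses a trimmed copy of the example (headers and handles only) with Python's standard `xml.etree.ElementTree` and counts the records listed under "The example includes". The trimmed `SAMPLE` string is an abbreviation made for this sketch, not actual disseminator output.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above: handles only, same nesting.
SAMPLE = """<record>
  <header><handle>2200/20061643458</handle></header>
  <annotatedBy />
  <cataloguedBy>
    <record>
      <header><handle>2200/20061212656</handle></header>
      <annotatedBy>
        <record><header><handle>2200/20061212656</handle></header></record>
      </annotatedBy>
    </record>
    <record>
      <header><handle>2200/20061293824</handle></header>
      <annotatedBy />
    </record>
  </cataloguedBy>
</record>"""

root = ET.fromstring(SAMPLE)

# Annotations attached directly to the resource.
resource_annotations = root.findall("annotatedBy/record")

# Metadata records about the resource.
metadata_records = root.findall("cataloguedBy/record")

# Number of annotations per metadata record, keyed by the record's handle.
metadata_annotations = {
    m.findtext("header/handle"): len(m.findall("annotatedBy/record"))
    for m in metadata_records
}
```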

Implementation of a Search Index Service Listener

With the proposed architecture, any number of indexing services could be used. An indexing mechanism selected now can easily be replaced at a later date.

Options Being Explored

DDS as Search Index Service Listener

Overview of DDS

DDS is the indexing service used by DLESE.

Advantages:

It is well known to many of the development staff.
Existing tools are built around DDS.

Limitations:

There is one and only one collection of collections.
All DDS records in a DDS collection must be of the same XMLFormat.
- Metadata records of differing XML Formats cannot live in the same collection
- Metadata records and Annotation records cannot live in the same collection

NOTE: These limitations may or may not apply when indexing the results of getResourceView.

Potential Implementation Using DDS

If my understanding of DDS is correct, it is capable of indexing the results of the getResourceView disseminator. The DDS Listener would need to be written to handle the following functionality...

listen to the NDR Fedora Message Queue
when an object is updated, call getResourceView disseminator for the object
call putRecord with parameters...
- id - from xpath: /record/header/handle
- collectionKey - handle for NSDL generic collection (see question below)
- xmlFormat - "resourceCentricXMLFormat"
- recordXml - <record> XML returned from getResourceView disseminator
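
The parameter mapping above can be sketched as follows. The function `build_put_record_params` is hypothetical and only assembles the four values; how putRecord is actually invoked on DDS is not shown here. The collection key used in the usage lines is the handle of "National Science Digital Library" from the example, chosen as a placeholder until the collectionKey question below is settled.

```python
import xml.etree.ElementTree as ET


def build_put_record_params(record_xml: str, collection_key: str) -> dict:
    """Map getResourceView output onto the putRecord parameters listed above."""
    root = ET.fromstring(record_xml)
    # /record/header/handle, relative to the <record> root element
    handle = root.findtext("header/handle")
    if not handle:
        raise ValueError("no /record/header/handle in getResourceView output")
    return {
        "id": handle,
        "collectionKey": collection_key,
        "xmlFormat": "resourceCentricXMLFormat",
        "recordXml": record_xml,
    }


# Usage with a minimal record; the collection key is a placeholder.
record = "<record><header><handle>2200/20061643458</handle></header></record>"
params = build_put_record_params(record, "2200/20061251230")
```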

Questions:

collectionKey value?
- Since the XML blob contains all metadata from all collections, would there be an all-encompassing key instead of a collection key? Perhaps using whatever the handle is for the non-parented generic collection (e.g. the handle for the generic collection "National Science Digital Library")? JW - Since this is the resource-centric view, it might make sense to put these all in a single DDS collection (think of a collection as a container). To represent Library Collections, we could set up a collection of collections. These would hold the collection-level metadata for each of the collections that are referenced in the resourceCentricXMLFormat response, and could be used to generate browse, search and other UI structures related to collections. To set these up, listeners would need to be established for collection-level operations.
xmlFormat value?
- How would resourceCentricXMLFormat be defined? JW - Using a DDS Search Field Configuration. Standard and custom fields could be defined, and data could be pulled from multiple locations in the response.
  - Ultimately resourceCentricXMLFormat may have any number of metadata formats in the <metadataXML> and in the <annotationXML> elements. JW - We could define a DDS XML format, say 'ndr_resource_metadata', for the NDR resourceCentricXMLFormat response package.
  - How detailed is the definition of the xmlFormat?
  - Is there a way to identify standard fields that are desirable for more efficient search? For example, title... JW - Yes.
    - The resource itself does not have a title.
    - The titles in the various <metadataXML> elements may not match. JW - The DDS indexing config for ndr_resource_metadata would be set up to pull the standard fields (URL, title, description) from each of the expected XML blobs in the response package. It's OK to have multiple titles and descriptions, since what we're interested in is the terms used for search. The DDS search response would include the entire ndr_resource_metadata package, so a client could choose a particular XML record to form its display to the user.
other issues?
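
To illustrate the idea of pulling standard fields from every XML blob in the response package, here is a sketch in plain Python rather than in DDS configuration, whose syntax is not shown on this page. The function `collect_search_fields` and the abbreviated `SAMPLE` are assumptions made for illustration; every title and description found is kept, since multiple values are acceptable as search terms.

```python
import xml.etree.ElementTree as ET

DC = {"dc": "http://purl.org/dc/elements/1.1/"}


def collect_search_fields(record_xml: str) -> dict:
    """Gather URL, titles and descriptions from all <metadataXML> blobs."""
    root = ET.fromstring(record_xml)
    fields = {"url": [], "title": [], "description": []}
    url = root.findtext("header/resourceURL")
    if url:
        fields["url"].append(url.strip())
    for blob in root.iter("metadataXML"):
        for name in ("title", "description"):
            for el in blob.iterfind(f".//dc:{name}", DC):
                if el.text and el.text.strip():
                    fields[name].append(el.text.strip())
    return fields


# Abbreviated version of the example response, for demonstration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <header><resourceURL>http://spookystuff.com/</resourceURL></header>
  <cataloguedBy>
    <record><metadataXML>
      <dc:title>My life: a Monster's Story.</dc:title>
      <dc:description>The story of Frankenstein, blah blah blah.</dc:description>
    </metadataXML></record>
    <record><metadataXML>
      <dc:title>My life: a Ghost's Story.</dc:title>
    </metadataXML></record>
  </cataloguedBy>
</record>"""

search_fields = collect_search_fields(SAMPLE)
```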

gsearch as Search Index Service Listener

Overview of gsearch

gsearch is the generic search module that is part of Fedora's service architecture. It is specifically designed to accept the internal storage format of Fedora objects and to provide a mechanism for indexing their content, including that of datastreams.

Advantages:
- gsearch already exists as part of the Fedora architecture, and can take advantage of the message queuing service available in the recent release of Fedora.
- gsearch can be configured to recognize datastream/dissemination types and perform user transformations on their output for indexing.
- gsearch uses Lucene as its storage base.

Limitations:
- All objects are indexed by default. This may be an issue for objects that we do not wish to have represented in the index, or if we wish to exclude certain objects, such as unfinished works or works in progress. This may be controlled if object disseminations are used to feed the index.
- It is not DDS, which is already known and already fits into some downstream applications.

Potential Implementation Using gsearch

Based on my limited exposure to gsearch:

A Lucene index could be generated and maintained in a near-synchronous way through use of the message queuing features of Fedora.
This index could reflect the output of Fedora object disseminations (as above) or could be the result of direct transformations of Fedora object XML.
This implementation would not be restricted to the DDS input structures and mechanisms and could produce similar results.

solr as Search Index Service Listener

Overview of solr

solr (http://lucene.apache.org/solr/) is billed as "... the popular, blazing fast open source enterprise search platform from the Apache Lucene project..."

Advantages:
- popular service with strong community support.
- comes with a user interface for search and browse.

Limitations:
- ?

Potential Implementation Using solr

solr can be fed directly using gsearch and Fedora's message queue mechanism. This would allow a solr index to be either the de facto NSDL index or a supplemental index, if desired. This option would require the same effort as a gsearch or DDS listener implementation, plus the extra effort of installing and maintaining the solr instance.

I am not familiar with the features of solr, and cannot make a judgement on its utility or applicability to the NSDL.