Community:Search/TheLuceneindex

From NSDLWiki

The Search Index

The NSDL search index is created and accessed using Lucene 2.0. Each record represents a single resource in the NDR (the NSDL repository).

Each record in the index may contain data from three different sources:

  • the metadata for the resource in the NDR,
  • selected metadata from the collection(s) that contributed the resource to the NSDL,
  • the text of the resource web page, as obtained from the ContentCache crawler.

Most of this data is stored at two or three levels of granularity, allowing very specific searches, very general searches, or searches that are in between. Data is stored in individual search fields, and may also be merged into compound fields with other data that share a common category of meaning. All text fields are merged into a special "allFields" field, so an untargeted search will match any of the text data. Finally, any text data is stored in both regular and stemmed search fields for more flexible searching.

As might be expected, this flexibility comes at the cost of considerable complexity, which the following example illustrates.

An example

A resource may include metadata items for <dc:title> and for <dct:alternative>. These will be stored in the index fields for "dc:title" and "dct:alternative", respectively. Moreover, since dct:alternative is a refinement of dc:title, both of these metadata items will also be stored in the index field for "compound_title". Since these are text fields, they will be merged into the "allFields" field on the record as well.

This same resource may inherit a <dc:title> from the metadata of the collection that contributed it to the library. This title will be stored in the index field "coll.dc:title", and merged into the "compound_title" and "allFields".

If the ContentCache crawler is able to retrieve the web page for the resource, it may find an HTML <title> tag in the page. This title will also be merged into the "compound_title" field.

Each of these index fields has a stemmed counterpart, so the index contains fields named "dc:title.Stems", "dct:alternative.Stems", "coll.dc:title.Stems", "compound_title.Stems", and "allFields.Stems".

In summary, a search against the "dc:title" field will yield only those records that have matching data in the <dc:title> tags of their own metadata. The same search against the "compound_title" field will yield records that match in their <dc:title> or <dct:alternative> metadata tags, in the <dc:title> tag of their collection records, or in the title of their web pages.
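
The title example above can be modelled with a small routing table. This is only an illustrative sketch, not the indexer's actual code: the ROUTES table, the "html.title" key, and both helper functions are hypothetical names introduced here, and the stemmed ".Stems" counterparts are left out for brevity.

```python
from collections import defaultdict

# Hypothetical routing table: source item -> index fields it is copied into.
# Only the title-related items from the example are modelled.
ROUTES = {
    "dc:title": ["dc:title", "compound_title", "allFields"],
    "dct:alternative": ["dct:alternative", "compound_title", "allFields"],
    "coll.dc:title": ["coll.dc:title", "compound_title", "allFields"],
    # "html.title" is an invented key for the crawled page title; the text
    # above mentions only "compound_title" as its destination.
    "html.title": ["compound_title"],
}


def build_index_fields(sources):
    """Return {index field name: [values]} for one record."""
    fields = defaultdict(list)
    for source_key, values in sources.items():
        for field in ROUTES.get(source_key, []):
            fields[field].extend(values)
    return dict(fields)


def matches(fields, field, word):
    """Case-insensitive whole-word match against a single index field."""
    return any(
        word.lower() in value.lower().split()
        for value in fields.get(field, [])
    )


record = build_index_fields({
    "dc:title": ["Pond life"],
    "dct:alternative": ["Frogs and toads"],
    "coll.dc:title": ["Amphibian collection"],
    "html.title": ["All about frogs"],
})

print(matches(record, "dc:title", "frogs"))        # narrow field
print(matches(record, "compound_title", "frogs"))  # any title-related source
```

In this sketch the word "frogs" appears only in the alternative title and the page title, so the narrow "dc:title" search misses the record while the "compound_title" search finds it.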

Searching

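A wide variety of searches is therefore available by specifying the appropriate field name along with the search term(s). Because the colon separates the field name from the search term in the query syntax, a colon inside a field name must be escaped with a backslash.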

dc\:title:frogs        Search for "frogs" in "dc:title", matching records that have a <dc:title> metadata tag containing the word "frogs".
compound_title:frogs   Search for "frogs" in "compound_title", matching records that have the word "frogs" in any title-related tag.
allFields:frogs        Search for "frogs" in "allFields", matching records that have "frogs" in any text field.
frogs                  Search for "frogs" in "allFields.Stems" (the default search field), matching records that have "frogs", "frog", "froggy", "frogging", etc. in any text field.
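
Query strings like those above can be assembled programmatically. The following is a minimal sketch, assuming the standard Lucene query parser syntax, in which the characters + - && || ! ( ) { } [ ] ^ " ~ * ? : \ are special and are escaped with a backslash. The function names escape and field_query are hypothetical helpers, not part of any NSDL or Lucene API.

```python
import re

# Characters and operators that the Lucene query syntax treats as special.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')


def escape(text):
    """Prefix every special character or operator with a backslash."""
    return _SPECIAL.sub(r'\\\1', text)


def field_query(field, term):
    """Build a 'field:term' query string, escaping both parts."""
    return f"{escape(field)}:{escape(term)}"


print(field_query("dc:title", "frogs"))        # dc\:title:frogs
print(field_query("compound_title", "frogs"))  # compound_title:frogs
print(field_query("allFields", "frogs"))       # allFields:frogs
```

Only the colon in "dc:title" needs escaping here; underscores and dots are not special, so names such as "compound_title" and "allFields.Stems" pass through unchanged.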


List of Index Fields

A list of all fields in the index is available at http://ndrsearch.nsdl.org/indexFieldInfo.
