Community:CITI/OAIHarvestIngest

From NSDLWiki

Overview of the NSDL OAI production Harvest / Ingest Process

Collection Administration

  • NCS functions
    • manages collection records
    • If OAI harvests from sources with formats other than nsdl_dc or oai_dc can be treated as separate collections, the NCS should require no modification. If integration of multiple formats from a single provider is a requirement, changes would have to be made.
  • Harvest Manager (Harvest Manager technical features)
    • manages harvest process
    • submits scheduled harvest requests
    • If the NCS treats other formats as separate collections, no change is required from the Harvest Manager.

Harvesting

  • Harvest2crsd - Evaluates trigger files (sample here) & runs harvester (perl)
  • This currently runs on server8 under user ingestd. Cron script is:
# Crontab for Ingest processing
#
## process harvest xml files
*/10 * * * * umask 002; export LANG=en_US.UTF-8;PATH=/usr/local/bin:$PATH; export PATH;/usr/local/ingestCode/scripts/harvest2dbi.sh >/dev/null 2>&1
###
###
#35 23 * * * * /var/local/ingestd/bin/compress_ingest.pl > /dev/null 2>&1
#0-59/2 * * * * /var/local/ingestd/bin/chk_ingest_and_log.pl > /dev/null 2>&1
    • This process is metadata format agnostic. It harvests only a single metadata format at a time, but will harvest any format requested, and it assigns the results to whatever collection id is passed in by the harvest trigger file.
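
The one-format-per-harvest behaviour matches the OAI-PMH protocol itself: a ListRecords request carries exactly one metadataPrefix. The following is a minimal sketch of how such a request URL could be built, not the perl harvester's actual code. The function name and the example base URL are assumptions made for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # In OAI-PMH 2.0 resumptionToken is an exclusive argument: it must
        # not be combined with metadataPrefix or set.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # Exactly one metadata format per request.
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical provider URL, for illustration only.
print(list_records_url("http://example.org/oai", "oai_dc"))
```

Harvesting a second format from the same provider would therefore mean a second request sequence with a different metadataPrefix, which is consistent with treating each format as its own collection.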

Transformation

  • crs2dbi performs transformation - run as part of Harvest/Transform after harvesting is done
  • This is also run on server8, and is automatically run as part of the harvest process (above).
    • Currently recognizes ONLY oai_dc and nsdl_dc as input formats. Supporting another format will require at least a custom transform to convert the input OAI records into the interim format, dbIngest. This transform should be straightforward, and its output can be nearly identical to the current interim format. Estimated time to complete the new transform, integrate it into the application, and test: 20 hours.
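
The format restriction above amounts to a lookup from the harvested format to a transform that produces the interim format. The sketch below assumes a simple registry to show where a transform for a new format would plug in. All function names and the placeholder return values are hypothetical and do not reflect the actual crs2dbi code.

```python
# Placeholder transforms (hypothetical); the real ones would convert a
# harvested record into the interim format.
def transform_oai_dc(record):
    return {"native": record, "source_format": "oai_dc"}


def transform_nsdl_dc(record):
    return {"native": record, "source_format": "nsdl_dc"}


# Only oai_dc and nsdl_dc are recognized; supporting a new format would
# mean adding one entry (and one transform) here.
TRANSFORMS = {
    "oai_dc": transform_oai_dc,
    "nsdl_dc": transform_nsdl_dc,
}


def to_interim(metadata_prefix, record):
    """Dispatch a harvested record to the transform for its format."""
    try:
        transform = TRANSFORMS[metadata_prefix]
    except KeyError:
        raise ValueError(f"unsupported input format: {metadata_prefix}") from None
    return transform(record)
```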

Ingest to NDR

  • Check4IngestFiles - run by cron on ndr.nsdl.org under user fedorad
    • This process is metadata format agnostic.
  • Dbi2NDR is run by a cron task on ndr.nsdl.org under user fedorad
    • This appears to be metadata format agnostic. Requires testing.
  • Ingest2Repository processes individual item records.
    • This process currently requires an nsdl_dc element in each item. This part of the process enforces the "all metadata must have nsdl_dc" rule, and also generates a resource for the dc:identifiers within the metadata. If other formats can be identified in the modified input format produced by crs2dbi (above), the code changes needed to process them should be simple. Estimated time to modify & test: 8-15 hours.

OAI re-served via Fedora proai services

  • proai service
    • scans NDR for metadata objects as they are updated
    • builds OAI records from 2 Metadata object disseminations:
      • getMetadata - retrieves the metadata record formatted according to a format=... parameter
      • getMetadataAbout - retrieves provenance information for re-serving
    • caches OAI records in a mySQL database.


Notes on harvesting other metadata formats

Notes on impact of proposed Abstract API

Harvest / Ingest process detail

  • The harvest manager generates a Harvest trigger file conforming to http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd. The file name is of the form [collection-id]_[datestamp]_harvest.xml, for example 3694837_2008-05-11T23-00-02Z_harvest.xml.
  • The Harvest Manager writes the harvest trigger file directly to the server8.nsdl.org:/usr/local/ingest/watched_folders/ToBeProcessed folder. The Harvest manager fetches the relevant harvesting information from the NCS/DDS. The harvest trigger file contains the collection's target metadataProvider handle as the <collectionNA> element for the harvest within the trigger file.
  • A cron task running on server8.nsdl.org (as user ingestd) fires Harvest2crsd.sh, which runs Harvest2crsd.pl, which runs Harvest2crsd.class (java), which checks the directory in question for the existence of a harvest trigger file.
  • The harvest trigger file is parsed, and either a real harvest or a test harvest is launched, depending on the parameters in the trigger file.
  • Most requests (test, validation, or real harvest request) will launch the perl harvester (server8.nsdl.org:/usr/local/OAIharvester)
  • crsd2dbi - will wait for the perl harvester to complete, then check the resulting ListRecords response files for schema validity and proceed to process them into dbInsert files. This step involves:
    • parsing the ListRecords response
    • determining the format of the harvested records. We currently accept only oai_dc and nsdl_dc as input formats.
    • transforming the OAI input record into a dbInsert-formatted record (schema at: http://ns.nsdl.org/schemas/db_insert/itemRecs_v1.07.xsd, sample doc here: http://ns.nsdl.org/schemas/db_insert/itemRecs.xml) that contains the original metadata record (labeled as native...) and the nsdl_dc version that is the result of the transform.
    • placing all the generated ...dbinsert files in the ToBeProcessed directory of the NFS mounted ingest/ files. There may be many of these, depending on the size of the harvest. The record count for a single dbinsert file is believed to be 200 metadata records (unconfirmed).
    • Note: as a part of generating dbinsert files, crsd2dbi calculates the number of dbinsert files to be generated and writes this number into the header of each of the files as the value of the pieces (<pieces></pieces>) element.
  • Check4IngestFiles - run by cron on ndr.nsdl.org under user fedorad
    • copies/moves to location for ingest into Oracle MR, and NDR
  • Dbi2NDR is run by a cron task on ndr.nsdl.org under user fedorad
    • Determines how many (if any) dbinsert files there are to process
    • Launches multiple instances of Ingest2Repository - once per file.
    • Throttles at a rate set in the HarnessConfig class.
  • Ingest2Repository processes dbinsert files.
    • Looks up metadata provider Handle and Aggregation Handle based on incoming Collection identifier - CollectionNA.
    • queries the NDR for metadata provider and associated aggregator ids
    • verifies type of harvest - full reharvest or otherwise.
    • for a full reharvest:
      • the NDR is queried for the handles of all metadata records from the associated provider. This is used later to identify which records are either updated or deleted. Saved in a file with the unique id of the harvest + .hdls
      • as records are parsed, a list of the handles of the metadata items added, changed, or deleted is kept. Metadata items are discovered by the NDR api Find method, using the incoming OAI identifier and the metadataProvider handle as keys into the NDR metadata items.
      • once processing is completed for a set of files (a complete harvest, as determined by the <pieces> value), a harvest whose type is "full..." triggers a scan of the .hdls file. The handles in it are compared against the handle lists generated from each dbinsert file as it was processed, and any records whose handles were not updated are removed. This clears out old records in the event of a "full-reharvest".
      • one HTTP message is sent to the Harvest Manager at http://harvest.nsdl.org indicating success or failure at the 'end' of the process. Parameters are: status=[status code]&uuid=[uuid]&ts=[timestamp]
        • [status code] values are: 2 = completion; 3 = completion with errors or warnings; 4 = failure
        • [uuid] = the unique id of the harvest attempt - generated by the Harvest Manager when the harvest request is generated
        • [ts] = current time stamp (ISO8601 plus UTC Zed)
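
The bookkeeping in the steps above (counting pieces, finding records to remove after a full reharvest, and composing the completion message) can be sketched as follows. This is an illustration under stated assumptions, not the Ingest2Repository code: the function names are hypothetical, the 200-records-per-file default is the unconfirmed figure mentioned above, and the timestamp layout is one reading of "ISO8601 plus UTC Zed".

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Status codes as listed above.
STATUS_OK, STATUS_WARNINGS, STATUS_FAILURE = 2, 3, 4


def piece_count(total_records, records_per_file=200):
    """Number of dbinsert files for a harvest (200 per file is unconfirmed)."""
    return (total_records + records_per_file - 1) // records_per_file


def stale_handles(handles_before, handles_touched):
    """Handles known before a full reharvest that no dbinsert file touched."""
    return set(handles_before) - set(handles_touched)


def status_query(status, uuid, now=None):
    """Query string for the completion message sent to the Harvest Manager."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # assumed timestamp layout
    return urlencode({"status": status, "uuid": uuid, "ts": ts})


# Example: three handles on record before the harvest, two seen during it.
before = {"h1", "h2", "h3"}   # contents of the .hdls file (made-up values)
touched = {"h1", "h3"}        # handles collected while processing dbinsert files
print(stale_handles(before, touched))  # the one handle left to remove
print(status_query(STATUS_OK, "example-uuid"))
```

A set difference keeps the comparison independent of the order in which the dbinsert files were processed, which matters because several instances of Ingest2Repository run at once.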


(*) The OAI harvester is a version of Simeon Warner's perl OAI harvester, formerly available at the Open Archives Initiative Tools page (http://www.openarchives.org/pmh/tools/tools.php). This harvester is somewhat configurable, fairly robust, and generates .gz files from each harvest result returned from a request. It also does set checking and UTF-8 validation, and is generally a great tool, but it has one weak point: malformed xml responses cause the harvester to die with an xml parsing error, because the harvester parses the xml to retrieve resumption tokens.
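
The weak point described above comes from having to parse every response to find the resumption token. The sketch below is not the perl harvester's code. It shows the same step in Python, with the parse error turned into an exception the caller can catch, so a harvest could be logged as failed rather than dying. The function and exception names are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


class MalformedResponse(Exception):
    """Raised when an OAI response is not well-formed XML."""


def next_resumption_token(response_xml):
    """Return the resumptionToken text, or None when the list is complete."""
    try:
        root = ET.fromstring(response_xml)
    except ET.ParseError as err:
        raise MalformedResponse(f"invalid XML in OAI response: {err}") from err
    token = root.find(f".//{OAI_NS}resumptionToken")
    # A missing or empty resumptionToken means there are no more pages.
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# A truncated response, as a provider might send after an error.
try:
    next_resumption_token("<OAI-PMH><ListRecords>")
except MalformedResponse as err:
    print("harvest aborted:", err)
```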
