Community:CITI/OAIHarvestIngest/HarvestManager

Harvest Manager Features

See Harvest Manager UI, User Guide

  • Manages harvests
    • Reads collection records (from the NCS) to obtain harvest scheduling information
    • Initiates harvests at required intervals (daily, weekly, monthly) by generating a harvest trigger file for the ingest process
    • Has UI for administrators to initiate harvests manually
      • Restricted to harvest admins, using Tomcat basic authentication over SSL. Collection builders cannot initiate harvests themselves; a request to validate or harvest must be sent to a harvest admin.
    • E-mails final harvest status to NSDL collection administrators and collection builders
      • The Harvest Manager is notified by the harvest processes when they finish (success or failure)
    • Provides public OAI provider explorer and XML validator
      • Allows administrators and collection builders to explore and validate OAI data providers: Validate records, view sets and available XML formats
  • Displays Harvest Information and Reports
    • Displays harvest details and history to NSDL collection administrators and collection builders in a Web interface
      • Displays harvest logs for detailed harvest reporting for administrators (reads harvest log DB written by ingest process)
    • Displays harvest schedule information and details to administrators and collection builders. An internal database keeps track of past harvest histories.

Software Architecture

  • Harvest Manager is implemented as a Java Web application that runs in Tomcat
    • Uses the Struts application framework
    • Code resides in the NSDL SVN repository
    • Application is built and deployed using ant
  • Architecture notes:
    • Internal cron threads fire off harvests at scheduled intervals (to initiate harvest processes)
    • Network access: Uses Web services, JDBC, SMTP
    • Harvest logs are read from the log DB written by the ingest process
    • Maintains an internal DB (implemented as XML files) that stores information about past harvests
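
The internal cron threads noted above could be approximated with a standard Java scheduler. The following is a minimal sketch, not the Harvest Manager's actual code: the class name DailyScheduleSketch and both method names are hypothetical, and it assumes a single task that should run once a day at a fixed wall-clock time.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names are illustrative and not taken from the project code.
final class DailyScheduleSketch {

    /** Time remaining from 'now' until the next occurrence of 'runAt' in now's time zone. */
    static Duration delayUntilNextRun(ZonedDateTime now, LocalTime runAt) {
        ZonedDateTime next = now.with(runAt);
        if (!next.isAfter(now)) {
            // Today's run time has already passed (or is exactly now): use tomorrow's.
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    /**
     * Schedules 'task' to run once a day, starting at the next 'runAt' in 'zone'.
     * A fixed 24-hour period is used, so the run time shifts by an hour across
     * daylight-saving changes until the application is restarted.
     */
    static ScheduledExecutorService scheduleDaily(Runnable task, LocalTime runAt, ZoneId zone) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        long initialDelayMs = delayUntilNextRun(ZonedDateTime.now(zone), runAt).toMillis();
        executor.scheduleAtFixedRate(task, initialDelayMs,
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

For example, scheduleDaily(task, LocalTime.of(18, 30), ZoneId.of("America/New_York")) would start a daily run at 6:30 PM Eastern; the returned executor should be shut down when the web application stops.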

Features

  • Ability to view and sort harvest collections by title (without stop words like 'the')
  • Ability to find harvest collections by simple search or by collection name
  • Ability to validate records returned by data providers for QA and evaluation by administrators and collection builders
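
Sorting by title while ignoring a leading stop word can be sketched as below. This is an illustration only: the class TitleSortSketch is hypothetical, and the stop-word list is an assumption, since the text above names only 'the' as an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not taken from the Harvest Manager code.
final class TitleSortSketch {

    // Assumed list of leading stop words; only 'the' is confirmed by the feature description.
    private static final List<String> LEADING_STOP_WORDS = Arrays.asList("the", "a", "an");

    /** Lower-cases the title and drops one leading stop word, if present. */
    static String sortKey(String title) {
        String key = title.trim().toLowerCase(Locale.ROOT);
        for (String stop : LEADING_STOP_WORDS) {
            String prefix = stop + " ";
            if (key.startsWith(prefix)) {
                return key.substring(prefix.length()).trim();
            }
        }
        return key;
    }

    /** Returns a new list sorted by sortKey; the input list is left unchanged. */
    static List<String> sortTitles(List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparing(TitleSortSketch::sortKey));
        return sorted;
    }
}
```

With this key, a title such as "The Math Forum" is ordered under M rather than T, while the displayed title itself is left unchanged.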

Data Used From NCS

  • Data from collection record:
    • NCS Collection ID
    • OAI BaseURL
    • XML format
    • OAI sets (if applicable)
    • Harvest frequency (in months)
  • Data outside of collection record, provided by NCS service response header:
    • NDR Metadata Provider Handle

Daily harvest pseudo-code

  • Harvest Java thread kicks off once a day (currently 6:30 PM ET)
    • Fetch all NCS_Collection records (uses DDS search API from NCS)
    • For each NCS_Collection record that has OAI ingest metadata
      • Check if the collection is due to be harvested (based on last harvest date and requested harvest interval). If due for harvest:
        • Create an ingest trigger file (see example below)
        • Update last harvest date in internal database
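
The "due to be harvested" check in the steps above could look roughly like the following. It is a sketch under stated assumptions: the class HarvestDueCheck and its method names are hypothetical, the interval is modelled as a java.time.Period, and a collection with no recorded harvest is assumed to be due immediately.

```java
import java.time.LocalDate;
import java.time.Period;

// Hypothetical sketch; names and rules are assumptions, not the project's actual logic.
final class HarvestDueCheck {

    /** Earliest date on which the next harvest may run. */
    static LocalDate nextDueDate(LocalDate lastHarvest, Period interval) {
        return lastHarvest.plus(interval);
    }

    /**
     * Returns true when 'today' is on or after the next due date.
     * Assumption: a null lastHarvest means the collection has never been
     * harvested, so it is treated as due.
     */
    static boolean isDue(LocalDate lastHarvest, Period interval, LocalDate today) {
        if (lastHarvest == null) {
            return true;
        }
        return !today.isBefore(nextDueDate(lastHarvest, interval));
    }
}
```

For a monthly collection last harvested on 2010-01-15, isDue returns false on 2010-02-14 and true from 2010-02-15 onward. When a collection is due, the daily thread would then write the trigger file and record the new last harvest date.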

Example Trigger File Generated

<?xml version="1.0" encoding="UTF-8"?>
<harvestRequest xmlns="http://ns.nsdl.org/MRingest/harvest_v1.00/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.nsdl.org/MRingest/harvest_v1.00/  http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd" schemaVersion="1.00.000">
 <baseURL>http://mathforum.org/oai/provider</baseURL>
 <collectionNA>2200-20061002124657491T</collectionNA><!-- The MDP handle -->
 <runType>full_reharvest</runType>
 <providerEmail>mr-ingest@nsdl.org</providerEmail>
 <sets><!-- Sets are optional -->
   <set>set123</set>
 </sets>
 <formats>
   <format>nsdl_dc</format>
 </formats>
 <firstHarvest>false</firstHarvest>
 <uuid>NSDL-COLLECTION-4794-1260577739519</uuid>
</harvestRequest>

Trigger file schema: http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd
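
A trigger file like the example above could be produced with plain string building, as in this sketch. The class TriggerFileSketch and its methods are hypothetical; the element names, namespace, and schemaVersion are copied from the example, and the element order simply follows it. The output has not been validated against the schema here.

```java
import java.util.List;

// Hypothetical sketch; not the Harvest Manager's actual generator.
final class TriggerFileSketch {

    private static final String NS = "http://ns.nsdl.org/MRingest/harvest_v1.00/";
    private static final String XSD = "http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd";

    /** Escapes the characters that are unsafe in XML element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    private static void element(StringBuilder xml, String name, String value) {
        xml.append(" <").append(name).append('>')
           .append(escape(value))
           .append("</").append(name).append(">\n");
    }

    /** Builds the trigger XML; the sets block is omitted when 'sets' is empty. */
    static String build(String baseUrl, String mdpHandle, String runType,
                        String providerEmail, List<String> sets, List<String> formats,
                        boolean firstHarvest, String uuid) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<harvestRequest xmlns=\"").append(NS).append('"')
           .append(" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"")
           .append(" xsi:schemaLocation=\"").append(NS).append(' ').append(XSD).append('"')
           .append(" schemaVersion=\"1.00.000\">\n");
        element(xml, "baseURL", baseUrl);
        element(xml, "collectionNA", mdpHandle); // the MDP handle
        element(xml, "runType", runType);
        element(xml, "providerEmail", providerEmail);
        if (!sets.isEmpty()) { // sets are optional
            xml.append(" <sets>\n");
            for (String set : sets) {
                xml.append("  ");
                element(xml, "set", set);
            }
            xml.append(" </sets>\n");
        }
        xml.append(" <formats>\n");
        for (String format : formats) {
            xml.append("  ");
            element(xml, "format", format);
        }
        xml.append(" </formats>\n");
        element(xml, "firstHarvest", String.valueOf(firstHarvest));
        element(xml, "uuid", uuid);
        xml.append("</harvestRequest>\n");
        return xml.toString();
    }
}
```

The resulting string would be written to whatever location the ingest process watches for trigger files; that location is not described on this page.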

Harvest Manager installation and configuration

  • Host: server8
  • Tomcat: /usr/local/tomcat-harvest
  • Context: hm
  • Configuration is defined in:
    • /usr/local/tomcat-harvest/webapps/hm/WEB-INF/web.xml
    • /usr/local/tomcat-harvest/conf/server.xml

Project code is managed in the NSDL SVN repository: server9.nsdl.org/repos/harvestManager/trunk/harvest-manager-project

The software is built and deployed using Ant.
