Community:CITI/OAIHarvestIngest/HarvestManager
Harvest Manager Features
See Harvest Manager UI, User Guide
- Manages harvests
- Reads collection records (from the NCS) to obtain harvest scheduling information
- All collection and harvest schedule information is managed and edited in the NCS. See CRS replacement requirements
- Initiates harvests at required intervals (daily, weekly, monthly) by generating a harvest trigger file for the ingest process
- Has UI for administrators to initiate harvests manually
- Restricted to harvest admins; uses Tomcat basic authentication over SSL. Collection builders cannot initiate harvests: a request to validate or harvest must be sent to a harvest admin.
- E-mails final harvest status to NSDL collection administrators and collection builders
- Harvest Manager is notified by the harvest processes when they finish (either success or failure)
- Provides public OAI provider explorer and XML validator
- Allows administrators and collection builders to explore and validate OAI data providers: Validate records, view sets and available XML formats
- Displays Harvest Information and Reports
- Displays harvest details and history to NSDL collection administrators and collection builders in a Web interface
- Displays harvest logs for detailed harvest reporting for administrators (reads harvest log DB written by ingest process)
- Displays harvest schedule information and details to administrators and collection builders. An internal database keeps track of past harvest histories.
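The explorer functions listed above (viewing sets, available XML formats, and records) correspond to standard OAI-PMH requests against a provider's base URL. The following is a minimal sketch of how such request URLs could be built, not the Harvest Manager's actual code: the class and method names are hypothetical, and only the `ListSets`, `ListMetadataFormats`, and `ListRecords` verbs and the `metadataPrefix` and `set` arguments come from the OAI-PMH protocol itself.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds OAI-PMH request URLs; not part of the Harvest Manager code base. */
final class OaiRequestUrls {
    private OaiRequestUrls() {}

    // URL-encodes a single query argument (this overload requires Java 10 or later).
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Request that lists the sets a provider exposes. */
    static String listSets(String baseUrl) {
        return baseUrl + "?verb=ListSets";
    }

    /** Request that lists the metadata (XML) formats a provider supports. */
    static String listMetadataFormats(String baseUrl) {
        return baseUrl + "?verb=ListMetadataFormats";
    }

    /** Request that lists records in one format, optionally limited to a set. */
    static String listRecords(String baseUrl, String metadataPrefix, String set) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("?verb=ListRecords&metadataPrefix=")
                .append(enc(metadataPrefix));
        if (set != null && !set.isEmpty()) {
            url.append("&set=").append(enc(set));
        }
        return url.toString();
    }
}
```

For example, `OaiRequestUrls.listRecords("http://mathforum.org/oai/provider", "nsdl_dc", "set123")` builds a request for the `nsdl_dc` records in set `set123`, using the values from the example trigger file on this page.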
Software Architecture
- Harvest Manager is implemented as a Java Web application that runs in Tomcat
- Uses the Struts application framework
- Code resides in the NSDL SVN repository
- Application is built and deployed using Ant
- Architecture notes:
- Internal cron threads fire off harvests at scheduled intervals (to initiate harvest processes)
- Network access: Uses Web services, JDBC, SMTP
- Harvest logs are read from the ingest DB (written by the ingest process)
- Maintains an internal DB (implemented as XML files) that stores information about past harvests
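The internal scheduling thread could be sketched with `java.util.concurrent`. This is an illustration under stated assumptions rather than the Harvest Manager's own implementation: the class name `DailyHarvestScheduler`, its method names, and the use of `ScheduledExecutorService` are all hypothetical. The 6:30 PM ET run time is the one given in the daily harvest pseudo-code on this page.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a once-a-day harvest check; not the real Harvest Manager code. */
final class DailyHarvestScheduler {
    static final ZoneId EASTERN = ZoneId.of("America/New_York");
    static final LocalTime RUN_AT = LocalTime.of(18, 30); // 6:30 PM ET

    private DailyHarvestScheduler() {}

    /** Time from 'now' until the next occurrence of RUN_AT in Eastern time. */
    static Duration delayUntilNextRun(ZonedDateTime now) {
        ZonedDateTime local = now.withZoneSameInstant(EASTERN);
        ZonedDateTime next = local.with(RUN_AT);
        if (!next.isAfter(local)) {
            next = next.plusDays(1); // today's run time has already passed
        }
        return Duration.between(local, next);
    }

    /**
     * Starts one background thread that runs the given task every 24 hours.
     * Note: a fixed 24-hour period shifts by an hour across daylight-saving changes.
     */
    static ScheduledExecutorService start(Runnable harvestCheck) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        long initialDelayMs = delayUntilNextRun(ZonedDateTime.now(EASTERN)).toMillis();
        executor.scheduleAtFixedRate(
                harvestCheck, initialDelayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

In a Web application the executor would typically be started and shut down from a servlet context listener, so that the thread does not outlive the Tomcat context.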
Features
- Ability to view and sort harvest collections by title (without stop words like 'the')
- Ability to discover harvest collections by simple search or by collection name
- Ability to validate records returned by data providers for QA and evaluation by administrators and collection builders
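Sorting by title without stop words can be illustrated with a comparator that strips a leading article before comparing. This is a sketch only: the class name `TitleSort` and the stop-word list (`the`, `a`, `an`) are assumptions, since the page names only 'the' as an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of title sorting that ignores a leading stop word. */
final class TitleSort {
    // Assumed stop-word list; the real application may use a different one.
    private static final List<String> LEADING_STOP_WORDS = Arrays.asList("the", "a", "an");

    /** Compares titles by their sort keys rather than by their raw text. */
    static final Comparator<String> BY_TITLE = Comparator.comparing(TitleSort::sortKey);

    private TitleSort() {}

    /** Lower-cases the title and removes one leading stop word, if present. */
    static String sortKey(String title) {
        String key = title.trim().toLowerCase(Locale.ROOT);
        for (String word : LEADING_STOP_WORDS) {
            if (key.startsWith(word + " ")) {
                return key.substring(word.length() + 1).trim();
            }
        }
        return key;
    }

    /** Returns a sorted copy and leaves the input list unchanged. */
    static List<String> sorted(List<String> titles) {
        List<String> copy = new ArrayList<>(titles);
        copy.sort(BY_TITLE);
        return copy;
    }
}
```

With this key, "The Math Forum" sorts under M rather than T.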
Data Used From NCS
- Data from collection record:
- NCS Collection ID
- OAI BaseURL
- XML format
- OAI sets (if applicable)
- Harvest frequency (in months)
- Data outside of collection record, provided by NCS service response header:
- NDR Metadata Provider Handle
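The fields listed above could be carried in a small immutable value class. The sketch below is hypothetical: the class name, field names, and validation rules are assumptions made for illustration, and only the list of fields itself comes from this page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for the harvest data read from the NCS for one collection. */
final class CollectionHarvestInfo {
    final String ncsCollectionId;      // NCS Collection ID
    final String oaiBaseUrl;           // OAI BaseURL
    final String xmlFormat;            // XML format, e.g. nsdl_dc
    final List<String> oaiSets;        // OAI sets; empty when not applicable
    final int harvestFrequencyMonths;  // Harvest frequency (in months)
    final String mdpHandle;            // NDR Metadata Provider Handle (from the response header)

    CollectionHarvestInfo(String ncsCollectionId, String oaiBaseUrl, String xmlFormat,
                          List<String> oaiSets, int harvestFrequencyMonths, String mdpHandle) {
        // Assumed rule: a collection cannot be harvested without a base URL.
        if (oaiBaseUrl == null || oaiBaseUrl.isEmpty()) {
            throw new IllegalArgumentException("OAI BaseURL is required");
        }
        this.ncsCollectionId = ncsCollectionId;
        this.oaiBaseUrl = oaiBaseUrl;
        this.xmlFormat = xmlFormat;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.oaiSets = oaiSets == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(new ArrayList<>(oaiSets));
        this.harvestFrequencyMonths = harvestFrequencyMonths;
        this.mdpHandle = mdpHandle;
    }

    /** True when the harvest should be limited to specific OAI sets. */
    boolean hasSets() {
        return !oaiSets.isEmpty();
    }
}
```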
Daily harvest pseudo-code
- Harvest Java thread kicks off once a day (currently 6:30 PM ET)
- Fetch all NCS_Collection records (uses DDS search API from NCS)
- For each NCS_Collection record that has OAI ingest metadata
- Check if the collection is due to be harvested (based on last harvest date and requested harvest interval). If due for harvest:
- Create an ingest trigger file (see example below)
- Update last harvest date in internal database
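The "due for harvest" check in the pseudo-code above can be sketched as a date comparison. This is a minimal illustration, assuming the harvest frequency is expressed in months as listed under "Data Used From NCS"; the class and method names are hypothetical, and the handling of a missing last harvest date or a non-positive frequency is an assumption.

```java
import java.time.LocalDate;

/** Hypothetical sketch of the "is this collection due for harvest?" test. */
final class HarvestDueCheck {
    private HarvestDueCheck() {}

    /**
     * @param lastHarvest     date of the last harvest, or null if never harvested
     * @param frequencyMonths requested harvest interval in months
     * @param today           the date the daily check is running
     */
    static boolean isDue(LocalDate lastHarvest, int frequencyMonths, LocalDate today) {
        if (frequencyMonths <= 0) {
            return false; // assumption: no positive interval means no scheduled harvest
        }
        if (lastHarvest == null) {
            return true;  // assumption: a collection never harvested is due immediately
        }
        LocalDate nextDue = lastHarvest.plusMonths(frequencyMonths);
        return !today.isBefore(nextDue); // due on or after the next-due date
    }
}
```

For a collection last harvested on 2024-01-15 with a one-month interval, the check is false on 2024-02-14 and true from 2024-02-15 onward.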
Example Trigger File Generated
<?xml version="1.0" encoding="UTF-8"?>
<harvestRequest xmlns="http://ns.nsdl.org/MRingest/harvest_v1.00/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ns.nsdl.org/MRingest/harvest_v1.00/ http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd"
    schemaVersion="1.00.000">
  <baseURL>http://mathforum.org/oai/provider</baseURL>
  <collectionNA>2200-20061002124657491T</collectionNA><!-- The MDP handle -->
  <runType>full_reharvest</runType>
  <providerEmail>mr-ingest@nsdl.org</providerEmail>
  <sets><!-- Sets are optional -->
    <set>set123</set>
  </sets>
  <formats>
    <format>nsdl_dc</format>
  </formats>
  <firstHarvest>false</firstHarvest>
  <uuid>NSDL-COLLECTION-4794-1260577739519</uuid>
</harvestRequest>
Trigger file schema: http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd
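A trigger file with the same shape as the example above could be produced along these lines. This is a hedged sketch, not the Harvest Manager's actual generator: the class name `TriggerFileWriter`, the `build` method, and its parameter list are hypothetical, and the XML is assembled by hand with basic escaping to keep the example short. The element names, their order, and the namespace and schema URLs are copied from the example.

```java
import java.util.List;

/** Hypothetical sketch that renders a harvest trigger file as a string. */
final class TriggerFileWriter {
    static final String NS = "http://ns.nsdl.org/MRingest/harvest_v1.00/";
    static final String XSD = "http://ns.nsdl.org/schemas/MRingest/harvest_v1.00.xsd";

    private TriggerFileWriter() {}

    /** Escapes the characters that are not allowed in XML element text. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Appends one simple element with escaped text content.
    private static void element(StringBuilder xml, String indent, String name, String value) {
        xml.append(indent).append('<').append(name).append('>')
           .append(escape(value))
           .append("</").append(name).append(">\n");
    }

    static String build(String baseUrl, String mdpHandle, String runType, String providerEmail,
                        List<String> sets, String format, boolean firstHarvest, String uuid) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<harvestRequest xmlns=\"").append(NS).append("\"\n");
        xml.append("    xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n");
        xml.append("    xsi:schemaLocation=\"").append(NS).append(' ').append(XSD).append("\"\n");
        xml.append("    schemaVersion=\"1.00.000\">\n");
        element(xml, "  ", "baseURL", baseUrl);
        element(xml, "  ", "collectionNA", mdpHandle); // the MDP handle
        element(xml, "  ", "runType", runType);
        element(xml, "  ", "providerEmail", providerEmail);
        if (sets != null && !sets.isEmpty()) {          // sets are optional
            xml.append("  <sets>\n");
            for (String set : sets) {
                element(xml, "    ", "set", set);
            }
            xml.append("  </sets>\n");
        }
        xml.append("  <formats>\n");
        element(xml, "    ", "format", format);
        xml.append("  </formats>\n");
        element(xml, "  ", "firstHarvest", String.valueOf(firstHarvest));
        element(xml, "  ", "uuid", uuid);
        xml.append("</harvestRequest>\n");
        return xml.toString();
    }
}
```

A production generator would more likely use an XML library and validate the result against the trigger file schema before handing it to the ingest process.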
Harvest Manager installation and configuration
- Host: server8
- Tomcat: /usr/local/tomcat-harvest
- Context: hm
- Configuration is defined in:
- /usr/local/tomcat-harvest/webapps/hm/WEB-INF/web.xml
- /usr/local/tomcat-harvest/conf/server.xml
Project code is managed in the NSDL SVN repository: server9.nsdl.org/repos/harvestManager/trunk/harvest-manager-project
The software is built and deployed using Ant.