Community:RSSIngest

From NSDLWiki

Notes from the kickoff design meeting, July 28, 2008. Reformatted and adulterated: please correct, amplify as necessary.

Related Pages

Representing ingested data in the NDR

The Big Questions

What are we doing? – Use cases

SGER grant:

  • any PI with an account on a public bookmarking system can register their feed with NSDL, so that URLs from the feed, together with the feed metadata, become resources in NSDL (or, if a URL is already a resource, the metadata is added to it)
  • Starting with connotea tag sets, available from each user by tag, if memory serves
  • del.icio.us also.

OnRamp generates RSS

  • e.g. Destination to register records as resources in NDR - generate RSS with link='record Fez VIEW url' which will be the resources
  • e.g. Destination to register parts of records as resources in NDR - generates RSS with link='datastream Fez ESERV url' which will be the resources
  • e.g. Registering results from any destination as a resource in the NDR - link='OnFire cache url' which will be resources

ExpertVoices generates RSS

  • e.g. IPY stuff

Accept feeds in RSS 1.0, RSS 2.0, or Atom

  • Use an open source library: Apache Feedparser, Project ROME, other?
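As a rough illustration of what the parser has to cope with, here is a minimal Python sketch that tells the three formats apart by their root element and pulls out each item's title and link. It uses only the standard library rather than Feedparser or ROME, the name `parse_feed` is hypothetical, and it ignores many real-world wrinkles (RSS 0.9x variants, missing links, character encodings).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
RSS1_NS = "{http://purl.org/rss/1.0/}"
RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def parse_feed(xml_text):
    """Return (format, [(title, link), ...]) for an RSS 1.0, RSS 2.0, or Atom feed."""
    root = ET.fromstring(xml_text)

    if root.tag == "rss":
        # RSS 2.0: elements carry no namespace.
        items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
        return "rss2", items

    if root.tag == RDF_NS + "RDF":
        # RSS 1.0: an RDF document whose items live in the RSS 1.0 namespace.
        items = [
            (i.findtext(RSS1_NS + "title"), i.findtext(RSS1_NS + "link"))
            for i in root.iter(RSS1_NS + "item")
        ]
        return "rss1", items

    if root.tag == ATOM_NS + "feed":
        # Atom: the link is an attribute; prefer rel="alternate" (the default rel).
        items = []
        for entry in root.iter(ATOM_NS + "entry"):
            href = None
            for link in entry.findall(ATOM_NS + "link"):
                if link.get("rel", "alternate") == "alternate":
                    href = link.get("href")
                    break
            items.append((entry.findtext(ATOM_NS + "title"), href))
        return "atom", items

    raise ValueError("unrecognized feed root element: %s" % root.tag)
```

A library would hide these differences behind one interface, which is the main argument for using one.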

One user may have multiple feeds

When?

  • By 09/30/2008 "begin to develop"
  • Must be in production by 03/31/2009, but 01/01/2009 would be much nicer

Dean's 5 steps

(1) User Interface

  • register RSS feed and provide simple metadata
    • add to existing collection or create new collection?
    • collection image?
      • use existing,
      • create a "default" based on username or source,
      • user supplies one
    • incremental or full harvest?
      • see {The "incremental" problem, below}
  • view existing RSS feeds
    • any restrictions? view only the user's own feeds?
  • where is this information stored?
    • in the harvest info datastream in MetadataProvider (described below)?
    • in some other datastream within NDR
    • in MySql or some other external storage?
  • can other blessed applications create this information?
    • For example, a destination admin might find it convenient to set up all the information within OnRamp and have OnRamp write out the same information that the RSS Ingest Admin would create. This is only a convenience, so it is a lower priority, but keeping it in mind lets us make design decisions now that make it easier later.

(2) NDR encoding of harvest metadata

  • All info that is collected by the User Interface to be stored in the NDR
    • exact format to be determined.
  • Add info that is not available to the user
    • scheduling info

(3) Automated process to harvest RSS

  • guided by harvest metadata in NDR
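One possible reading of "guided by harvest metadata" is that the automated process checks each feed's last harvest date against a per-feed interval. The sketch below assumes that model; the function `is_due` and the idea of an hourly interval are hypothetical, since the scheduling fields have not been defined yet.

```python
from datetime import datetime, timedelta


def is_due(last_harvest, interval_hours, now):
    """Return True if a feed should be harvested at time `now`.

    last_harvest is a datetime, or None if the feed has never been harvested.
    interval_hours is an assumed scheduling field, not an agreed one.
    """
    if last_harvest is None:
        return True
    return now - last_harvest >= timedelta(hours=interval_hours)
```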

(4) Tool to parse RSS and add contents to NDR

(5) Store feedback on harvest into NDR

  • date of last harvest, number of records, etc.
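Since the exact format is still to be determined, the following is only a sketch of how steps (2) and (5) might share one XML record: the setup information from the User Interface in one element, and scheduling plus harvest feedback in another. Every element and field name here (`harvestInfo`, `setup`, `status`, `intervalHours`, and so on) is invented for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarvestInfo:
    # Collected by the User Interface (step 1).
    feed_url: str
    collection: str
    mode: str = "full"  # "full" or "incremental"
    # Not visible to the user: scheduling and feedback (steps 2 and 5).
    interval_hours: int = 24
    last_harvest: Optional[str] = None  # ISO 8601 timestamp, None if never harvested
    record_count: int = 0

    def to_xml(self) -> str:
        """Serialize to a hypothetical XML layout."""
        root = ET.Element("harvestInfo")
        setup = ET.SubElement(root, "setup")
        ET.SubElement(setup, "feedUrl").text = self.feed_url
        ET.SubElement(setup, "collection").text = self.collection
        ET.SubElement(setup, "mode").text = self.mode
        status = ET.SubElement(root, "status")
        ET.SubElement(status, "intervalHours").text = str(self.interval_hours)
        if self.last_harvest is not None:
            ET.SubElement(status, "lastHarvest").text = self.last_harvest
        ET.SubElement(status, "recordCount").text = str(self.record_count)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text: str) -> "HarvestInfo":
        """Parse the layout produced by to_xml."""
        root = ET.fromstring(text)
        return cls(
            feed_url=root.findtext("setup/feedUrl"),
            collection=root.findtext("setup/collection"),
            mode=root.findtext("setup/mode"),
            interval_hours=int(root.findtext("status/intervalHours")),
            last_harvest=root.findtext("status/lastHarvest"),  # None if absent
            record_count=int(root.findtext("status/recordCount")),
        )
```

Keeping user-supplied setup apart from system-written status would make it easier to apply different write permissions to each part later.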

Interpretation issues

The "incremental" problem

Do we perform a "full" harvest each time?

  • That means deleting all previous records from this source and re-harvesting from scratch
  • What about RSS streams that are truncated? Are we OK with losing the older records from the library?
  • case in point: Lynette showed an RSS feed from Fez that was up to item #1900, but had only 700 items in the feed.
  • When a harvest does not contain a record that it did contain previously, do we…
    • assume that the "full" RSS stream has been truncated, and keep the previous records?
    • assume that the older record has been deleted by the user?

Do we perform an "incremental" harvest?

  • Only accept items that have been added since our last harvest?
    • What if the items do not have pubdates? How do we define "incremental" then?
  • How is the user to delete or modify metadata?
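To make the choice concrete, the sketch below compares the links from the previous harvest with the links in the current feed, then applies one of the two interpretations of a missing record. It assumes records are identified by link alone, which sidesteps the pubdate question; the function names and the `missing_policy` values are hypothetical.

```python
def diff_harvest(previous_links, current_links):
    """Classify links as added, retained, or missing relative to the last harvest."""
    previous, current = set(previous_links), set(current_links)
    return {
        "added": sorted(current - previous),
        "retained": sorted(current & previous),
        "missing": sorted(previous - current),
    }


def links_to_keep(previous_links, current_links, missing_policy="keep"):
    """Return the links the library should hold after this harvest.

    missing_policy="keep":   assume the feed was truncated; old records stay.
    missing_policy="delete": assume the user deleted them; old records go.
    """
    diff = diff_harvest(previous_links, current_links)
    kept = diff["added"] + diff["retained"]
    if missing_policy == "keep":
        kept += diff["missing"]
    elif missing_policy != "delete":
        raise ValueError("unknown missing_policy: %r" % missing_policy)
    return sorted(kept)
```

With "keep", a user has no way to delete a record through the feed, which is the open question noted above.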

"Duplicate" records

  • Questions:
    • What if an incremental harvest contains a record that has already been harvested?
      • Without a pubdate, how do you tell?
    • What if a harvest (full or incremental) contains two items that both point to the same link?
  • Simple answer:
    • If we have multiple metadata records for the same resource from the same RSS feed, keep the most recent and discard any others.
  • Other answer:
    • keep all records.
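The "simple answer" could look something like the following sketch. It assumes each record from a feed is a dict with a `link` and a comparable `date` (the pubdate where one exists, otherwise something like the harvest time, which is an assumption); on a tie, the record seen later wins.

```python
from datetime import datetime


def keep_most_recent(records):
    """Keep one record per link: the one with the latest date.

    records: iterable of dicts with "link" and "date" keys (hypothetical shape).
    Ties on date are resolved in favor of the record seen later.
    """
    latest = {}
    for rec in records:
        link = rec["link"]
        if link not in latest or rec["date"] >= latest[link]["date"]:
            latest[link] = rec
    return list(latest.values())
```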

Implementation questions

Tie it to the NCS? No!

  • We don't want all of these PIs using NCS.
    • NCS is administrator-based. It takes training.
  • NCS is not NDR-native.
    • NCS has been tied to the NDR.
    • We want something that is more closely tied to the NDR.
  • We are anticipating a greater volume of feeds.
    • We don't want to have to approve each of them.

Multiple sources per collection

  • Examples:
    • A PI has tags on both connotea and del.icio.us
    • Multiple PIs on a project want to have their feeds in the same collection
    • We want a collection to take inputs from both OnRamp and ExpertVoices
  • More interesting example:
    • We want a collection to represent both an OAI source and an RSS source.
  • The problem?
    • Currently, when we do a full re-harvest on a collection, we blow it all away first.
  • Tim suggests an approach
    • Each RSS feed can (should?) be associated with a different metadata provider.
    • To do a full harvest, we first delete only those records in the collection that are associated with that metadata provider.
    • Our OAI harvest process could be modified to do the same thing, permitting both OAI and RSS in the same collection (or multiple OAI sources?)
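Tim's approach can be sketched with an in-memory stand-in for a collection: a list of records, each tagged with the metadata provider it came from. This is not the NDR API; the record shape, the `provider` key, and the function name are all hypothetical.

```python
def full_reharvest(collection, provider_id, new_records):
    """Replace only the records that belong to one metadata provider.

    collection:  list of dicts, each with a "provider" key.
    provider_id: the metadata provider being re-harvested.
    new_records: freshly harvested dicts (without a "provider" key).
    Returns a new list; records from other providers are untouched.
    """
    kept = [rec for rec in collection if rec["provider"] != provider_id]
    fresh = [dict(rec, provider=provider_id) for rec in new_records]
    return kept + fresh
```

For example, re-harvesting an RSS provider leaves the records from an OAI provider in the same collection alone.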

Where to store the harvest info?

  • (post-meeting discussion with Jim and Tim)
  • Each RSS feed (or OAI feed) is represented by a MetadataProvider.
  • Add a datastream to the MetadataProvider with XML that holds the harvest info
  • How do we secure this info?
    • Do we allow other applications/agents to create these datastreams?
      • --Elrayle 19:42, 30 July 2008 (UTC) This depends on what is in the datastream. Does it include 1) the setup information collected through RSS Ingest Admin, AND 2) information about the latest harvest and harvest scheduling? Or are these maintained separately? Seems like we may want to allow editing by outside applications of 1) setup information (see comments above under User Interface), but not allow outside editing of 2) harvest info.