Community:NCore/NCore in Fedora 3.0

From NSDLWiki

Current 2.2-era NCore overview

Fig. 1 Fedora's role in 2.2-era NCore

The NCore repository architecture in the era of Fedora 2.2 consists of a backend Fedora 2.2 instance and a public-facing middleware webapp. Users of NCore interact exclusively with the middleware for repository operations, rather than with Fedora objects or a running Fedora server through API-A/M. The middleware provides an NCore-model-centric API and model of interaction, and fulfills the following roles:

Provides a security architecture and the associated authorization and authentication mechanisms
Enforces the NCore model constraints in the repository
Provides a simplified NCore-centric object syntax for adding and modifying objects
Provides search and retrieval capabilities based on NCore object characteristics (as opposed to metadata search)
Performs high-level operations exposed as API methods.

In addition to this middleware, 2.2-era NCore employs two instances of Fedora's oaiprovider service to provide OAI-PMH access to the contributed Dublin Core metadata within the repository. Thus, ultimately all access to the repository content is mediated through one of these services (see Fig. 1): the middleware API for reading and writing NCore domain objects and relationships, and OAI-PMH for reading the user-contributed metadata payload (which does not contain an NCore model or domain knowledge). Other NCore services such as Search use data that originate from one of these two services.

What has worked well

Journalling has allowed a rather robust and flexible triad of leader/follower servers with (manual) switchover of a leader to a follower.
MPTStore triple store has practically eliminated stability problems and most performance issues.
General scalability has been great. The repository has had no problem managing 6 million objects in large batch-oriented loads.
Connection to third party apps via NDR API has been successful.
Maintaining modification histories has been useful for detective work.
Provenance of metadata is very clear.

What has been a problem

Polling for changed metadata by proai is error-prone and resource intensive. The oaiprovider places strict validation and state transition conditions on the data changes that it is able to interpret. Purging a metadata record from the repository, for example, does not make it disappear from OAI, because that is not a state transition the oaiprovider recognizes. In addition, the high level of I/O required has resulted in periods fraught with slow responses and timeouts. As a temporary fix, oaiprovider updates are run from a cron job at night.
Search service has infrequent updates. Many services rely on the search service as a source of data in addition to its search capabilities. Infrequent updates can make it seem like data is not in the repository. Currently, a fix would require capabilities not present in our repository infrastructure.
Discrepancies between NDR and Search. The Search service generates its own separate view of the data, based exclusively on the DC metadata exposed in OAI-PMH.
- The set of resources in Search is exclusively determined by dc:identifier metadata fields, which is not necessarily the case in the NCore repository.
- Metadata may not appear in OAI due to OAI-specific privacy settings. This has evolved into an ad-hoc control mechanism for keeping metadata records "out of the library", presuming that "search = the library", but applications using the NDR API to view the content directly would have no way of knowing that.
- Collection membership of a resource is only visible to the Search service by the existence of a metadata object, through a somewhat complex (and perhaps broken) process. For nsdl.org, collections are exposed entirely through the Search service.
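The purge problem in the first item above can be illustrated with a toy model. This is a minimal sketch, not the oaiprovider's actual algorithm: it assumes a hypothetical poller that only asks for records modified since its last poll, and every class and record name here is made up for illustration. It shows why a record that is purged outright never leaves such a poller's cache.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepository:
    """Hypothetical stand-in for the repository: record id -> last-modified tick."""
    records: dict = field(default_factory=dict)

    def modified_since(self, tick):
        # Only records that still exist can be reported as changed.
        return {rid for rid, t in self.records.items() if t > tick}


@dataclass
class ToyPoller:
    """Hypothetical incremental poller that caches every record it has seen."""
    cache: set = field(default_factory=set)
    last_tick: int = 0

    def poll(self, repo, now):
        self.cache |= repo.modified_since(self.last_tick)
        self.last_tick = now


repo = ToyRepository()
poller = ToyPoller()

repo.records["md-1"] = 1      # record added at tick 1
poller.poll(repo, now=2)      # poller picks it up

del repo.records["md-1"]      # record purged outright
poller.poll(repo, now=3)      # no change is reported for the purged record

# Records the poller still serves although the repository no longer has them.
stale = poller.cache - set(repo.records)
print(stale)
```

In this model the only way to drop the stale record is a full comparison against the repository, which is the kind of I/O-heavy operation the polling approach was meant to avoid.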

Possible directions

Fig. 2 The OAI/Search and the NDR API services build conflicting views of the data, as the OAI service essentially bypasses business logic in the NDR API/Middleware.

With some experience behind us, we may begin to address some of the shortcomings of the current 2.2-era Fedora/NCore architecture. To begin, let us examine problem #3 with the 2.2 architecture: the NDR API and OAI produce conflicting views of the data in the repository.

Fig. 2 provides a simple schematic representation of the tension between those services whose data ultimately derive from OAI metadata (from the oaiprovider service), and those whose data derive from the NDR API. Both services have orthogonal rules and business logic. The NDR API presents an atomistic, resource-centric view of the data in the repository, where resource objects are non-redundant, and related to information-carrying objects such as Metadata and groupings (Aggregators) via relationships in the NCore model. Concepts such as collections are represented as structures within the graph formed by these interrelated objects.

In contrast to the repository API, the oaiprovider service exposes only a subset of the data in the repository: the Dublin Core metadata payload of Metadata objects, and a few specific attributes related to the record's provenance. In particular, an NCore OAI record contains:

Dublin Core metadata
OAI Set specification
The OAI identifier of a metadata record describing the record's collection in the NSDL
Branding (brand image) of the record's collection
Miscellaneous information specifying details on how the record was harvested

Individual Metadata and MetadataProvider objects may contain properties which indicate the visibility of the OAI records to the oaiprovider service, e.g. public (record may be present in all oai services), protected (record may not be present in public oai services), and private (the record may not be served out via oai at all).
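The three visibility levels can be read as a simple decision rule. The sketch below is an interpretation of the description above, not the middleware's code: the `Visibility` enum, the `may_serve` function, and the `service_is_public` flag are hypothetical names, and how a Metadata object's setting combines with its MetadataProvider's setting is left out because the text does not specify it.

```python
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"
    PROTECTED = "protected"
    PRIVATE = "private"


def may_serve(visibility: Visibility, service_is_public: bool) -> bool:
    """Return True if a record with this visibility may appear in an OAI service."""
    if visibility is Visibility.PRIVATE:
        return False                    # never served out via OAI
    if visibility is Visibility.PROTECTED:
        return not service_is_public    # only non-public OAI services
    return True                         # public: all OAI services


for v in Visibility:
    print(v.value,
          may_serve(v, service_is_public=True),
          may_serve(v, service_is_public=False))
```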

Fig.3 All access, external and internal, is through the NDR API/Middleware.

The NCore search service harvests the Dublin Core metadata from the OAI service and, from it, creates its own non-redundant set of resource-oriented records, possibly combining several metadata records into one. This view of the data can differ starkly from the view obtained from the NDR API. Given an object viewed via the NDR API, it is not possible to determine whether the record is visible in the search service, or even what it may look like, without knowing the particular rules and algorithms used by both OAI and Search.
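To make the divergence concrete, here is a minimal sketch of how a resource-oriented view might be derived from harvested metadata by grouping on dc:identifier. The Search service's real merging rules are not documented here, so the merge strategy, the function name `build_resource_view`, and the sample records are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical harvested records: (OAI identifier, Dublin Core fields).
harvested = [
    ("oai:example:1", {"identifier": "http://example.org/a", "title": "Title A"}),
    ("oai:example:2", {"identifier": "http://example.org/a", "subject": "Physics"}),
    ("oai:example:3", {"identifier": "http://example.org/b", "title": "Title B"}),
]


def build_resource_view(records):
    """Group metadata records by dc:identifier and merge their remaining fields."""
    view = defaultdict(lambda: {"sources": [], "fields": defaultdict(list)})
    for oai_id, dc in records:
        resource = view[dc["identifier"]]
        resource["sources"].append(oai_id)
        for name, value in dc.items():
            if name != "identifier":
                resource["fields"][name].append(value)
    return view


view = build_resource_view(harvested)
for url, resource in view.items():
    print(url, resource["sources"], dict(resource["fields"]))
```

Three metadata records collapse into two resources here, and nothing in the merged result reflects the relationships that the NDR API would report for the same objects.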

One obvious approach to this dilemma is to reduce the number of ways the data in the repository may be interpreted. Fig. 3 represents a change in configuration where all access to NCore content is through the NDR API. There are a few problems with this layout, however:

Notification/polling of changed records. The Search service polls daily for new or updated records to index. The OAI-PMH protocol is a mature technology that allows this easily. The current API has no method for polling/querying for recent objects, nor does it have a messaging service.
Object traversal needed for higher-level constructs such as collection membership. Collection membership is represented in an out-of-band field in the OAI-PMH representation of metadata. The middleware generates this value by searching for the presence of a particular path in the repository between a metadata object and a collection aggregator. This is a moderately complex operation that would require numerous API calls and deduction to reach a conclusion.
- The collection, as currently represented to the OAI service, is provided in a dissemination. Disseminators are not currently exposed in the NDR API.
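The path search behind collection membership can be sketched as a walk over relationship triples. The predicate names and the two-hop path below are invented for illustration and are not the actual NCore relationships; the point is only that each hop is a separate lookup, which through the API would mean at least one call per hop.

```python
# Hypothetical relationship triples: (subject, predicate, object).
triples = [
    ("metadata:1", "metadataProvidedBy", "provider:1"),
    ("provider:1", "aggregatedBy", "aggregator:collection-A"),
    ("metadata:2", "metadataProvidedBy", "provider:2"),
]


def follow_path(start, predicates, triples):
    """Return all objects reachable from start by following predicates in order."""
    frontier = {start}
    for predicate in predicates:
        # One hop: keep the objects of matching triples whose subject is in the frontier.
        frontier = {o for s, p, o in triples if s in frontier and p == predicate}
    return frontier


PATH = ["metadataProvidedBy", "aggregatedBy"]  # assumed path, for illustration only

print(follow_path("metadata:1", PATH, triples))  # reaches the collection aggregator
print(follow_path("metadata:2", PATH, triples))  # empty: no such path exists
```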

Fig.4 Business logic is separated from the API/middleware and exposed (along with other functionality) as service endpoints on Fedora objects.

Another approach is to expose Fedora to backend services, but represent the functionality and business logic of the API through service disseminations on Fedora objects, as shown in Fig.4. Compared to the "All access through API" approach, this has a few benefits:

Fedora has out-of-box messaging capability which backend services may subscribe to.
Backend services with access to Fedora also have access to the ResourceIndex
Fedora dissemination architecture is already in place

... as well as a few potential drawbacks:

Capabilities of the API and of direct Fedora access are not exactly the same. Direct-to-Fedora access is richer, since disseminators are not exposed in the API.
Backend apps would have to address objects by PID, which is forbidden/hidden in the API.
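The messaging benefit listed above can be contrasted with polling in a small sketch. This is an in-memory stand-in, not the JMS API or Fedora's message format: the `ToyBus` class, the topic name, and the `{"method": ..., "pid": ...}` message shape are all assumptions. It shows a backend service keeping an index current by reacting to each change, including a purge, as it happens.

```python
from collections import defaultdict


class ToyBus:
    """In-memory stand-in for a message bus; not the JMS API."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)


index = set()  # hypothetical search index of object PIDs


def on_update(message):
    # The message shape is assumed for illustration.
    if message["method"] == "purge":
        index.discard(message["pid"])
    else:
        index.add(message["pid"])


bus = ToyBus()
bus.subscribe("updates", on_update)

bus.publish("updates", {"method": "ingest", "pid": "demo:1"})
bus.publish("updates", {"method": "ingest", "pid": "demo:2"})
bus.publish("updates", {"method": "purge", "pid": "demo:1"})
print(index)
```

Note that the handler addresses objects by PID, which is exactly the second drawback listed above: a subscriber of this kind works in Fedora's terms rather than the API's.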

NCore into Fedora 3.0

Goals

Represent NCore model in CMA
Unhide interactions with Fedora
Services on JMS notification bus
NCore in Fedora as "drop-in" core objects + "host environment"
Make NCore play better with the web architecture through Fedora

Base CMA objects defining NCore model and behaviour
