TNS Internal:EduPakArchRoadmap


Introduction

EduPak 1.0, the software distribution, currently consists of two applications, the NCS and the NDR, as well as a separately deployable DDS service. These applications were originally conceived independently, under differing sets of assumptions, and serve different purposes.

In EduPak 1.0, the NCS and the NDR are brought together in a simple way. The NCS can write some (or all) of its content to the NDR, and can bootstrap itself from "NCS formatted" data stored in the NDR. The NCS translates its data onto the NDR model (such as representing resources and metadata in an atomistic way, and translating its records into nsdl_dc), and overlays this representation with raw NCS record data. The NDR happily stores this data even though it lies outside the basic NDR model, but other applications that use the NDR will not necessarily expect or understand its content. With configuration, the DDS is able to index this "outside model" data introduced by the NCS.

Looking at these applications abstractly, they represent different philosophies towards questions such as extensibility, interoperability, patterns of communication with applications, and centrality of content. In this abstract sense, the NCS is a peer to applications such as the CCS and perhaps even the SMS, in that they share similar architectural patterns. Likewise, applications and infrastructure originally written to contribute content to the NSDL by way of the NDR, such as Web Feed Ingest and the EV blog plugins, share a different set of patterns.

In the near future, we face tasks including developing new learning applications and expanding EduPak. We need to look carefully at the architecture as a whole to develop a general road map for assuring that our infrastructure best meets these goals. The decisions made therein will define an engineering philosophy for new development, and help us understand how our current applications and infrastructure will fit into the big picture.

Framing the discussion

Here are some steps that seem logical for arriving at a road map for EduPak and the TNS architecture in general. This will be revised if any of these steps prove to be unnecessary, confusing, or illogical - this is just a first attempt.

  1. Define EduPak
  2. Define architectural aspects to be characterized
  3. Characterize current architectures, and mention practical aspects or lessons learned
  4. Reflect on common patterns, design philosophy
  5. Examine future goals and use cases
  6. Propose and characterize approaches to meeting goals
  7. Discuss and debate approaches
  8. Produce an EduPak architectural road map and implied engineering philosophy

Perhaps it makes sense to proceed as follows (the later steps especially may change to reflect reality - again, this is just a first stab in the dark):

A) (Sep. 23, 24 or 25? All interested parties, focus on developers)
Assemble all the developers who have an interest in EduPak development so that everybody is aware of what is going on, and can contribute to the process.
  • Review the roadmap and solicit opinions on improvements or major omissions; tweak as necessary
  • Discuss step 1 - Get an idea of how we view the nature of EduPak right now as developers.
  • Discuss steps 3, 4 - characterizing and reflecting on current architectures.
  • Gather ideas and questions for the PIs or other stakeholders for step 5 - future goals and use cases
  • Gather any other questions and concerns
B) (Start this in Oct. Boulder meeting.)
Speak with PIs/mgmt, stakeholders or any other relevant party
  • Review step 1 - How will EduPak be defined going forward?
    • Is this different from the impression of developers? If so, highlight the differences as something to communicate with the group as a whole.
  • Gather material for step 5
C)
Disseminate results of (B) to developers. Assemble options for step 6 - proposed approaches
D)
Assemble developers and mgmt for step 7 - discuss approaches (there may be only one by this point). In the end, everybody should be aware of the general direction that is decided upon.
E)
Fill in the blanks

Define EduPak

Before proceeding too far, it's important to make sure we're clear about what defines "EduPak" for the sake of this discussion, and how it relates to (or differs from) the infrastructure deployed by TNS. For starters, here are some public descriptions of EduPak:

From the NSDL wiki page for EduPak:

EduPak version 1.0 is a lightweight version of the NCore open-source digital library platform specifically designed to meet the needs of national educational organizations and institutions focused on establishing specialized digital collections, conducting educational research, or providing students, teachers and instructors with discipline-oriented pedagogical products and tools that require basic technology for educational digital repositories. Built with NCore components, EduPak is an all-in-one, educational digital repository solution that provides a general platform for building digital libraries united by a common data model and interoperable applications.

From the overview document included in the software release:

NSDL EduPak 1.0 consists of several applications and web services based on the NCore platform packaged together in a convenient bundle. Taken as a whole, these applications work together to provide a solution for cataloging, sharing, and re-using educational resources in various contexts.

From a Fedora Commons wiki page:

NSDL EduPak 1.0 is a publicly available lightweight version of NCore, established in 2008 as an open-source digital library platform of technology and standards that create a dynamic information layer on top of library resources. Based on Fedora open source repository software, NCore provides users, developers, information managers and decision-makers with systems for description, organization, interrelation and annotation of resources. NSDL EduPak is an all-in-one education digital repository solution that provides a platform for building digital libraries united by a common data model and interoperable applications.

All these seem to imply that EduPak is a software distribution - a lightweight and abridged version of the infrastructure used by the NSDL/TNS. However, recent events spell out some changes:

  • TNS has decided to move away from the digital library paradigm
  • New focus is on providing learning applications around richly curated data
  • Management is apparently thinking about moving away from the NCore brand

So - will the EduPak name be taken to mean the learning-oriented core infrastructure used by TNS? Will it be a downloadable suite of interconnected applications (with the NDR being the glue by which data is shared between them)? Will new learning apps be considered part of EduPak? Or is EduPak a software distribution of a more limited set of components?

Whether or not the term 'EduPak' is taken to encompass the architecture and techniques TNS will adopt going forward, the questions of architecture and development philosophy outlined here are relevant to consider. It is especially important to make sure that the developers are aware of the various requirements, tradeoffs, and ideals when discussing technical road maps. However, we do need to be clear on our terms and the scope of the conversation.

Input from Tammy: 9/18/2009

I think EduPak is the set of core components useful for creating learning apps (or educational digital libraries); however, the apps themselves are not distributed as part of the package. It may be that when we distribute an individual app, EduPak happens to be bundled with it under the hood.


On Characterizing

When we discuss our applications or infrastructure from a technical standpoint, it might make sense to consider the following categories:

Basic model and I/O
How applications create, read, update, and delete data, as well as the data format/nature at a most basic level.
Search and query
Query languages, indexes, and scope/completeness of searchable data.
  • What sort of indexing is used? By whom?
  • What sort of queries can applications perform? Queries to whom?
Interoperability
Data discovery, sharing, aggregation, and modes of access. Basically, this aspect focuses on how applications interact with one another through the data and APIs present in the system in question:
  • What sorts of data sharing occur between applications?
  • Which systems do applications need to be aware of and communicate with?
Scalability
How the application performs with respect to large sets of data or a large number of access requests.
Availability
Failures or maintenance procedures and how they affect system functionality
Security
Accounts/login, data access controls for reads or writes.
Extensibility
Procedures and recommended methods for adding new types of data to the system, or expanding functionality

Characterizing current EduPak systems

NDR 1.x

Basic I/O
The NDR API provides basic CRUD functionality. The API is basically a proprietary XML-based protocol that has both REST-like and RPC-like characteristics. The basic unit of data is the 'object', which is composed of 'properties' (key/value), 'datastreams' (XML or binary blobs), and 'relationships' (named relationships to other NDR objects); there is a pre-defined ontology of object types and property/relationship/datastream names. The API allows wholesale creation and deletion of single objects, or updates of individual components in an object. A Java-based toolkit is provided that allows manipulation of NDR objects as Java objects, hiding the details of the XML API, which is used in an underlying driver layer. (A hypothetical sketch of this model follows the notes below.)
In defining the various object types (Agent, Aggregator, Metadata, etc) and required relationships, the NDR model is specifically oriented towards representing provenance, and controlling access to data that is contributed by multiple applications.
  • The XML-based API itself has proven somewhat cumbersome and/or difficult to use in its raw form. The NDR toolkit helps immensely here.
  • Even with the NDR toolkit, manipulating higher-level concepts (e.g. 'collections', which are represented by multiple interrelated objects) has proven to be difficult and/or error prone. Creating a collection, for example, requires multiple steps that must be performed serially and correctly (e.g. creating the correct object types, adding the proper relationships to connect them in the correct way).
  • Graph traversal can be slow and cumbersome
  • A toolkit 2.0 prototype has been created to present a "high level" interface to constructs such as collections in the NDR.
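As a rough illustration of the object model, consider the sketch below. Every identifier in it is hypothetical - the real toolkit API differs - but the shape of the interaction is the point:

    // Hypothetical sketch only: illustrates objects, properties,
    // datastreams, and relationships; not actual NDR toolkit names.
    NdrConnection ndr = new NdrConnection("http://example.org/ndr/api", agentCredentials);
    NdrObject metadata = ndr.newObject(ObjectType.METADATA);
    metadata.setProperty("itemId", "oai:example.org:12345");       // key/value property
    metadata.setDatastream("format_nsdl_dc", nsdlDcXml);           // XML blob datastream
    metadata.addRelationship("metadataFor", resource.getHandle()); // named relationship
    ndr.commit(metadata); // wholesale creation of a single object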
Search and Query
The NDR API provides for matching objects based on their properties and relationships, as well as convenience methods for finding resources by their URL or finding all objects that have a relationship that points toward a given object.
  • NDR Search API is limited to "structural" components such as relationships or properties. Datastream content, such as full text or metadata, is NOT searchable via the NDR API.
  • In developing the Web Feed Ingest, Jim ran across many situations where tasks such as "List all titles of collections" could not be addressed by NDR search API (or NSDL metadata-oriented search), and could ONLY be addressed by brute force crawling of individual objects, which is very expensive
  • Outside the NDR API, some of our infrastructural applications (such as the OAI provider or sitemap generators) can and do use Fedora's Resource Index directly to perform more complex queries. Still, these queries are limited to NDR "structural" components such as properties and relationships, rather than "content" present in datastreams.
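To make these capabilities and limits concrete, here is a hypothetical continuation of the sketch above (again, the method names are illustrative, not the real API):

    // Structural queries over properties and relationships are supported...
    List<NdrHandle> members = ndr.find(Query.relationship("memberOf", aggregatorHandle));
    // ...as is the find-resource-by-URL convenience lookup.
    NdrObject resource = ndr.findResourceByUrl("http://example.org/lesson.html");
    // But datastream content (full text, metadata fields) cannot be
    // queried through the NDR API at all.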


Interoperability
The NDR 1.x strives to achieve interoperability through its shared data model. Resource objects are non-redundant, and are shared in the sense that all aggregations and metadata involving a given resource are co-located around the single physical object. All data, regardless of source, contributes to a single "big graph" of related objects, which itself is visible to every application. nsdl_dc is used as a least-common-denominator metadata format.
Scalability
The NDR was designed for managing millions of objects from many sources, and to accommodate both large highly parallel bulk operations and small ad-hoc updates.
  • The NDR has been able to successfully manage over 7 million objects, thanks in part to the adoption of the PostgreSQL-backed MPTStore triple store.
  • NDR API searches are very quick (typically < 10ms), and consume few system resources.
  • Data access involving a fetch from Fedora typically takes ~100ms, and heavy read/access load has not caused a problem for the NSDL in production.
  • RI queries outside the NDR API, as used by PROAI, can be quite expensive and I/O intensive with large data sets; they became a problem when run during the day in NSDL production, so we ran the job at night.
  • Ingesting into the repository is slower than ideal. On the NSDL's production hardware, concurrent data ingests never exceeded 10 records/second.
  • Fedora rebuilds are relatively slow - on a 7-million object repository, a rebuild of the RI and Fedora's internal SQL registry took several days on NSDL hardware.
  • As the repository gets large, if Fedora's journalling is enabled, the opening of a new journal file can take a noticeably long time. At its peak size, it took up to two minutes to open a journal in NSDL production.


Availability
The NDR has several potential points of failure: the NDR software itself, Fedora, and databases used by Fedora. Luckily, Fedora journalling allows for maintaining mirror servers that can step into production in the case of a failure.
  • Re-configuring the leader/follower cluster necessarily requires a re-start of the leader Fedora instance, which implies at least a few seconds of downtime.
  • Upgrading Fedora versions requires a major migration procedure, which requires downtime or read-only operation for the duration of the process
Security
The NDR provides a proprietary public-key-based authentication system inspired by the one used by Amazon S3 (a sketch of the general pattern follows the notes below). Write access to NDR objects is controlled by a permissions system, which uses the auth:authorizedToChange relationship to enumerate the specific agents (or aggregators of agents) allowed to modify given objects. Read access to objects is not restricted. Some operations (such as attaching Aggregators/MetadataProviders to other Agents) are restricted to so-called "trusted" applications.
  • In practice, access keys are assigned one per NDR-aware application, not on a per-user level. If the application manages users (such as the EV blog), the application itself is responsible for managing user rights within the application - the NDR doesn't do anything to assist. The application (under its own, single NDR access account) then performs actions on behalf of the user.
  • SSL is not yet supported in NDR 1.x.
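The exact wire format is NDR-specific, but the general sign-the-request pattern looks roughly like the following Java sketch. The string-to-sign layout and the variable names here are assumptions, not the actual protocol:

    import java.nio.charset.StandardCharsets;
    import java.security.Signature;
    import java.util.Base64;

    // Sign a canonical representation of the request with the agent's
    // private key; the server verifies it with the registered public key.
    String stringToSign = method + "\n" + requestUri + "\n" + timestamp;
    Signature sig = Signature.getInstance("SHA1withRSA");
    sig.initSign(agentPrivateKey); // a java.security.PrivateKey
    sig.update(stringToSign.getBytes(StandardCharsets.UTF_8));
    String authToken = Base64.getEncoder().encodeToString(sig.sign());
    // The token, plus the agent's identity, accompany the request.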
Extensibility
Although the NDR defines a basic relationship and property ontology, applications may define their own properties, relationships, or datastreams as they wish. These extensions are visible to all applications, and may be used in NDR search requests. In addition, the ordering and membership of objects may be given specific meaning (e.g. collections).
  • Object types are a fixed set
  • For custom properties, relationships, or datastreams, there is no guarantee that another application will understand anything outside of the basic ontology.

NCS 2.7.x + DDS

The NCS is an application for cataloging 'records' (discussed below), which are essentially XML files containing mostly arbitrary content. The DDS is a query and indexing service, based on Lucene, which can store the content of XML elements as searchable fields in its index. Although the NCS and DDS are distinct applications, the NCS is 'built around' a DDS instance such that this local DDS indexes all the records that have been generated by the NCS. The NCS cannot function without a DDS index of its data, hence the NCS is distributed with its own internal DDS instance in EduPak. At various points in this section, NCS and "NCS + DDS" are taken to mean the same thing: the NCS application with its included DDS instance.

Basic model and I/O
At the most basic level, the data model is any XML document with a schema. In addition, XML document models that extend a 'record' or an 'annotation' pattern acquire special features that enhance their searching and discoverability. Several schemas are configured by default (ncs_item, ncs_data, etc.). Access to NCS CRUD functionality is through the NCS GUI and its editing framework. The NCS stores records on the filesystem and updates the DDS, which then indexes these records. The DDS CRUD API is not exposed to applications outside the NCS. Records can be exported from the NCS to the NDR.
  • The 'record' pattern specifies some expected elements. For example, the following elements, expressed here as XPaths, are expected to be present in a 'record' (they can reside at any XPath - these are examples). An XML schema that is configured to specify these elements acquires uniform searchability by title, URL, description, ID, geospatial bounding box, and so forth:
    • /record/general/recordID
    • /record/general/url
    • /record/general/title
    • /record/general/description
  • The 'annotation' pattern is used to define a special relationship 'annotates'. An XML schema that is configured to specify this relationship acquires special behavior: It injects the content from the annotation record into the index document corresponding to the record being annotated. The record being annotated can then be searched using the data found in the annotation (relation 'isAnnotatedBy'), and search results in the REST service return the annotation XML along with the record XML.
Search and query
DDS maintains a Lucene index of all records, and exposes the DDS search API for queries over indexed fields.
  • By default, all elements and attributes present in the XML documents are indexed for searching. These search fields are generated automatically using field names that correspond to the XPath locations.
  • There is a 'standard' set of fields to index that are expected to be present for the 'record' and 'annotation' data models (enumerated in Basic model and I/O section)
  • Additional custom fields may be configured for indexing by specifying XPath locations.
  • Modifying the index field definitions (e.g. adding a new field to be indexed) would imply the need to re-build the index, unless the change is irrelevant to the current set of data.
Interoperability

The search REST API allows other applications read-only access to almost all data (i.e. data configured to be indexed into the DDS). There is no fully developed external write API to NCS + DDS; all data in NCS + DDS is either created by hand via the cataloging interface, or imported from another source by hand (with some exceptions).

  • The NSDL.org UI takes advantage of data in the NCS through its search API for associating brand images with collections or records.
  • Besides the standard fields, the fields accessible via DDS API are entirely dependent on the configuration of the index, and the record formats present in NCS.
  • The DDS REST search API can return JSON as well as XML, making it very JavaScript-friendly. An example query appears below.
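For instance, a read-only search might look like the following from Java. The service path and parameter names follow common DDS web-service conventions but are assumptions to be checked against the deployed version:

    // Query the DDS REST search API and read the raw response.
    // The endpoint path and parameters are assumed, not verified.
    String url = "http://localhost:8080/dds/services/ddsws1-1"
            + "?verb=Search&q=title:photosynthesis&s=0&n=10";
    try (java.io.InputStream in = new java.net.URL(url).openStream()) {
        String response = new String(in.readAllBytes(),
                java.nio.charset.StandardCharsets.UTF_8);
        // parse the XML (or JSON) search results here
    }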
Availability

Events that require downtime for NCS + DDS are rare. The NCS stores records on the file system, and the DDS can be re-built from these records. The NCS can also save records to, and import records from, an NDR.

Scalability
  • Once the number of records exceeds several thousand or so, update performance becomes a problem, since editing a record in the NCS results in a real-time update to the index.
  • On the read/access side, Lucene and DDS perform well.
Security

The NCS maintains its own local user, password, and role list, which are used for human login access. The DDS update API is disabled by default for external applications; when active, it permits access to all index records for clients at specified IP addresses.

  • Access to records in NCS is based on role, and is not enforced on an individual record or collection basis.
Extensibility

NCS can accommodate basically any XML-based record format.


DDS 3.4.x Stand-Alone

The DDS Stand-Alone application provides DDS search services for records that are managed by external applications. Records are placed into DDS Stand-Alone via the DDS CRUD REST API, via regularly occurring NDR imports, or from files on disk managed by another application. A DDS repository, like the NCS, consists of collections of XML records. External applications can define collections of arbitrary XML formats. Schema validation, if desired, must be performed by the external application prior to placing records into the DDS. The DDS Stand-Alone provides the same data model, configuration, and search capabilities as NCS + DDS (see above).

Defining collections of records
  • The DDS CRUD API has methods to put and delete collections, defined by XML format (nsdl_dc, etc.) and a unique collection identifier (key). Once a collection is defined, API methods may then be used to put and delete records in those collections (a hypothetical sketch follows this list).
  • NDR collections are configured using an NDR handle; the handle identifier becomes a collection in the DDS. The DDS updates its index at regular intervals to reflect any changes made in the NDR.
  • If the DDS is configured from files on disk, the external application writes collection records to disk, which are used to define the collections in the DDS. The DDS then updates its indexes at regular intervals to reflect changes made to the XML files on disk.
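Here is a hypothetical sketch of the put-collection/put-record flow. The verb and parameter names are assumptions about the DDS update service, not verified API:

    // httpPost() stands in for any HTTP client call; the base URL,
    // verbs, and parameters below are assumed, not verified.
    String base = "http://localhost:8080/dds/services/ddsupdatews1-1";
    // 1. Define a collection by unique key and XML format...
    httpPost(base + "?verb=PutCollection&collectionKey=mycoll&xmlFormat=nsdl_dc");
    // 2. ...then put or delete records within it.
    httpPost(base + "?verb=PutRecord&collectionKey=mycoll&id=MY-000-001", recordXml);
    httpPost(base + "?verb=DeleteRecord&id=MY-000-001");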
Java Bean support
  • The DDS provides native support for modeling Java Beans that are serialized to XML using the java.beans.XMLEncoder class. Properties defined in a Bean automatically become searchable fields in the index. This provides a simple way to search over data in a Bean and marshal to and from Java objects (see the example below).
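Since java.beans.XMLEncoder is standard JDK API, the round trip is straightforward. A minimal example (the bean and its properties are illustrative):

    // A bean needs only a public no-arg constructor and getters/setters
    // for XMLEncoder to serialize it.
    public class ResourceBean {
        private String title;
        private String url;
        public ResourceBean() {}
        public String getTitle() { return title; }
        public void setTitle(String t) { title = t; }
        public String getUrl() { return url; }
        public void setUrl(String u) { url = u; }
    }

    // Serialize to XML; each bean property (title, url) would then
    // become a searchable field in the DDS index.
    java.beans.XMLEncoder enc = new java.beans.XMLEncoder(
            new java.io.BufferedOutputStream(
                    new java.io.FileOutputStream("resource.xml")));
    enc.writeObject(new ResourceBean());
    enc.close();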
Availability
  • DDS indexes can be re-built or updated simultaneously while supporting searches. The process of updating or re-indexing a record consists of two separate operations: delete and then (re)index. Thus individual records become unavailable in searches momentarily at the time they are re-indexed.
  • When the DDS CRUD API is enabled, changes are made durable as a store of files on disk. The Lucene index can be restored from the file store or from a backup copy of the files.
  • The DDS CRUD API does not have the concept of a user or login; access is strictly all-or-nothing, based on IP address or restricted to a local Tomcat running behind a firewall.
Scalability

The NCS and DDS were designed for relatively small collections of rich content. The DDS Stand-Alone has been used with repositories as large as 300,000 records; the NCS + DDS with up to about 20,000 (on a slow machine). Further development can enable scaling to larger sizes.


Reflection

This section should be expanded based on observations and discussion between the developers, but here are some things to get the conversation started:

  • Right now in EduPak 1.0, the only instance of data sharing via the NDR occurs when the NCS publishes to the NDR, and a separately installed DDS service is configured to pull "collections" in from the NDR and index them according to a custom set of rules. In this scenario, the NDR is being used as a way to label and partition the data, where the data of interest are the record blobs produced by the NCS. The true information modeling occurs in the record blobs, with the NDR model used primarily for administrative purposes. This is really no different from the way the NDR has been used by NSDL.org - practically all business information is derived from metadata blobs, and the NDR maintains an administrative role that has allowed multiple repository writers (EV, Wiki plugins, NCS, Ingest) to co-exist. In general, the NDR really hasn't been used to enhance the information model itself. Are we happy with that role, or do we see it as an unfortunate consequence of the requirements of the library around it?
    • Suppose we face the task of adding a new data type - annotations. If the NDR is seen as purely administrative, then the task of information modeling is applied purely to the blobs. If we see the NDR as a participant in the information model, then the task of information modeling spreads into the territory of adding new object types to the NDR, or adding new datastreams, properties, or relationships to the existing ontology. As developers, consider some of the implications for applications reading from the NDR, writing to the NDR, and interpreting the data. Does one approach immediately stand out as more intuitive or palatable?
  • Specialization and centralization - The DDS-based apps maintain content and specialized indexes locally. The NDR is very centralized, but provides very limited search capability, primarily over object structure (and not higher-level content). Right now, applications almost have to maintain their own internal indexes if they are to derive and reason over data pulled from the NDR. What would the world look like if we stuck to this model? If an application wanted to query the data, would it need to know which application contains the index with the answer? What would the world look like if we created a richer centralized index - either a key-value-oriented index like Lucene, or an RDF-oriented index like the triple store? What sorts of data would we need in this central index in order for it to become useful? What costs and benefits might be associated with this?
  • The NCS is currently being used to satisfy the need of the NSDL.org UI in displaying the proper collection brand images with search results. That is to say, the brand image is catalogued in collection records in the NCS, and the NSDL.org UI queries the DDS instance in the NCS to determine the proper image for a given collection. The NSDL search index used to provide this data to the UI, and this data used to originate in the NDR. What were some factors that caused this change, and what does this say about agility?
  • In a sense, until we have completed the task of "Define EduPak", all of the rest of this is premature. Whom do we intend for EduPak to serve, and what services do we hope to provide? Until we have those answers, we can't evaluate the importance of things like security, availability, extensibility, etc. While waiting for the definition of Edupak, it may behoove us to begin work on those subsidiary issues, rather than to be paralyzed. Nonetheless, we must remember that no meaningful decisions can be made until we clearly, unambiguously state our goals: what are we trying to do, and for whom?
  • Can the NDR be agile? The NDR is an "Enterprise-level" piece of infrastructure, as shown by the emphasis on concurrency, scalability, authentication and authorization. This contrasts sharply with the NCS, which appears to be more limited in all these scopes. But how much agility does the NCS exhibit, in contrast to the NDR? If, for example, we wish to add a new object type to the NDR data model - annotations, perhaps - how long would we expect that to take? Weeks, certainly. Months, very possibly. Can we devise a way for the NDR to rapidly accommodate requests for new functionality?
  • Synchronicity - certain parts of our architecture are made to be synchronous - e.g. a change made by a user (or application) is instantly queryable in an index. Synchronous behavior can be great for user interaction, but can put a big damper on performance or scalability. For example, in the NCS, saving an edited record synchronously updates the index so that it instantly appears in lists and searches. In the NDR, the internal resource index (triple store) is synchronous (at great effort). The NSDL search service and the NDR OAI service are NOT synchronous, and are run as batch jobs. In our existing architectures, where has our choice of synchronous, asynchronous, or batch behavior paid off, and where has it caused pain?
  • Has the resource/metadata split in the NDR model been useful to us so far, or has it been an irritant, or hard to grok?

Future goals and use cases

  • Considering the types of learning apps we hope to develop, what sorts of queries can we expect on the data?
    • Perhaps ask the PIs for some general example queries, e.g. "I want to know the most highly rated resource that conforms to state standard X (or equivalent standard)"
    • What are some strategies for answering these queries?
  • Consider concrete example as a case study.

Proposed directions

  • Consider concrete example of meeting a new goal as a case study to illustrate how the approach works. For example...

Discussion

Road map and guiding principles

Stuff to put somewhere

In its immediate future (and by "future", we really mean the recent past), the NDR in EduPak is adopting simple Fedora 3.2 support. In essence, this entails adopting the minimal Fedora API and format changes required to store the NSDL model in Fedora 3.2 objects. The work for this is finished except for distributing the release formally and migrating our current production installation.

The next logical step in the evolution of the NDR is to take advantage of Fedora's wonderful Content Model Architecture (CMA). This has been discussed before, and referred to as "phase 2" of the NDR + Fedora 3 roadmap.

However, at this point it makes sense to take a step back and consider the current lessons from NDR developers, and future directions for EduPak as a whole. While evolving the NDR's use of Fedora's CMA makes sense on paper, we need to make sure that any improvements make sense in an evolving architecture that is moving away from the union catalogue paradigm and towards supporting ad-hoc learning applications.
