Ed Fox, Virginia Polytechnic Institute and State University

James Allan, University of Massachusetts at Amherst

Paul Berkman, The New Media Studio / EvREsearch LTD

Elizabeth Liddy, Syracuse University

Dean Zollman, Kansas State University

The session will involve a mix of presentations on some current implementations at both the core NSDL level and individual projects. These presentations will provide a basis for questions and discussion on the types of technical approaches being used for particular needs.

The PowerPoint files for each presentation have been uploaded:

Panel introduction and discussion questions: IR.ppt
Ed Fox, Virginia Tech (CITIDEL project): IRcitidel2.ppt
James Allan, U. Mass. Amherst: 03-10 nsdl pJamesi.ppt
Paul Berkman, New Media Studio / EvREsearch LTD: NSDL search Paul.ppt
Elizabeth Liddy, Syracuse U.: LiddyL.IR.ppt
Dean Zollman, Kansas State U.: NSDL IR panel.ppt

Notes - Information Retrieval: From Databases to Search Engines

40 attendees.

Presentations

Ed Fox IRcitidel2.ppt

CITIDEL project. A collection project. http://www.citidel.org
PPT includes current architecture. Includes filtering tied into the collection. Hope to have a million objects by the end of the project. Filters are restricted to people who log in, but may open up to others. Search engine: ESSEX, a fast, in-memory system. Open for others to use. Followed with snippets of other projects.
Grapezone: cluster search results. Using this approach over CITIDEL collections such as theses.
PIPE: Personalization. Personalize to people and different access technology modes (e.g. mobile). Subsets the collection depending on what you're interested in. Connecting to CITIDEL.
Standardization: Need to standardize logging of use of search engines. Working with collaborators at Villanova in this area. XML log format being developed. Working on analysis tools to go along with this. Tools are available for the community to use.
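
As a rough illustration of what one record in a standardized XML search log might look like, the sketch below builds a single entry with Python's standard library. The element and attribute names (logEntry, session, time, query, resultCount) and the sample values are hypothetical; they are not taken from the format under development.

```python
import xml.etree.ElementTree as ET


def make_log_entry(session_id, timestamp, query_text, result_count):
    """Build one hypothetical search-log record as an XML element."""
    entry = ET.Element("logEntry", {"session": session_id, "time": timestamp})
    query = ET.SubElement(entry, "query")
    query.text = query_text
    results = ET.SubElement(entry, "resultCount")
    results.text = str(result_count)
    return entry


def serialize(entry):
    """Return the record as a Unicode XML string."""
    return ET.tostring(entry, encoding="unicode")


# Illustrative values only.
example_xml = serialize(
    make_log_entry("s42", "2003-10-14T10:30:00", "sorting algorithms", 17)
)
```

A shared record shape like this is what would let analysis tools written by one project read logs produced by another.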

Regarding personalization: does it include personalization to the person as well as the UI? Yes.

James Allan 03-10 nsdl pJamesi.ppt

Talked about CI search as it currently is. Core infrastructure includes a metadata repository (MR) of primarily bibliographic information (not actual items). Many records have web-accessible content. Search gathers the metadata repository records and the content and provides integrated queries over the metadata and content. Search can be restricted to the DC fields and NSDL-specific fields. The query language was written for the NSDL as an extension of Z39.50. It's not tied to a particular system. The language is intended for portal builders, not end users directly, as it uses the SDLIP protocol. Will probably switch to SOAP (decision not finalized yet). Types of queries include free text, and a richer query that ranks by some terms and constrains results to particular fields (require component). Search engine fits between the MR (harvested via OAI) and the portals. The IR component is now implemented using Lucene. Switched to Lucene due to constraints on the original proprietary system being used. Lucene is open source and widely used, so there is a plausible Lucene community for support. Takes about a week from MR update to search (at the moment).
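
The two query types described above (free text, and ranking combined with required field constraints over both metadata and content) can be illustrated with a toy sketch. This is not the NSDL query language and does not use Lucene; the class names, the rank_by and require attributes, and the scoring by simple term counts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A metadata record plus text gathered from the resource itself."""
    identifier: str
    metadata: dict            # e.g. {"title": "...", "subject": "..."}
    content: str = ""


@dataclass
class Query:
    """Free-text terms to rank by, plus optional required field values."""
    rank_by: list
    require: dict = field(default_factory=dict)


def tokens(text):
    return text.lower().split()


def score(record, query):
    """Count rank_by term occurrences across metadata values and content."""
    words = tokens(record.content)
    for value in record.metadata.values():
        words.extend(tokens(value))
    return sum(words.count(term.lower()) for term in query.rank_by)


def search(records, query):
    """Drop records failing a require constraint, then rank the rest."""
    hits = []
    for record in records:
        ok = all(
            wanted.lower() in tokens(record.metadata.get(name, ""))
            for name, wanted in query.require.items()
        )
        s = score(record, query)
        if ok and s > 0:
            hits.append((s, record.identifier))
    # Highest score first; ties broken by identifier for stable output.
    return [ident for s, ident in sorted(hits, key=lambda h: (-h[0], h[1]))]
```

A query with an empty require mapping behaves as plain free text; adding an entry such as {"subject": "chemistry"} filters on that metadata field before ranking.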

Can you say something about performance, indexing, and search response time? Lucene is in Java, and this has overhead. But as indexing can be done offline, it's not a problem. As Lucene is not a pure in-memory model, there may be some scaling issues.
Will synthesizer and search engine be wrapped up as a tool for community members? It could be.

Could be using SQL for the metadata side, but the IR community doesn't work with that model.

Where is documentation on what you do? Have just written a small handout that will point to background information.

Content-based indexing. How deep do you go? Just take the first page. Reagan Moore is going to further depth as part of an archiving project. Open question as to what the resource is. Also, we can't get past a login page.

Paul Berkman NSDL search Paul.ppt

Introduced a project dealing with a number of agencies. IR is only part of the answer. Lists returned from search engines don't expose relationships between information. Digital information is already available, but more information doesn't mean more knowledge. Still have a paper paradigm: tagging, lists, organizing into folders. Challenge is to integrate based on user-defined parameters. Example using the Antarctic Treaty database. Working with any digital data and looking at automated pattern extraction from that data, so not limited to just text representations. Look at the patterns, or structure, of a resource, e.g. text sequences such as paragraphs, binary sequences such as a cloud structure in a satellite image. Builds a structure of relationships shown as a hierarchy to the user. User defines the granularity of the information they want to deal with by opening or collapsing the structure.

Can this be used with existing search engines? Yes, as it's modular, it could be used with other modular systems. It could be used with metadata, for example, because of the pattern in metadata.

How are individual parts tagged? Break the resource into granules, or tag in memory using offsets in the memory space (byte offset ontologies).
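
The idea of tagging granules by byte offset, rather than physically splitting the resource, can be sketched as follows. This only illustrates offsets over blank-line-separated paragraphs; the function names are hypothetical and this is not the presenter's system.

```python
def paragraph_offsets(data: bytes):
    """Return (start, end) byte offsets of paragraphs separated by blank lines."""
    spans = []
    start = 0
    while start < len(data):
        end = data.find(b"\n\n", start)
        if end == -1:
            end = len(data)
        if end > start:
            spans.append((start, end))
        start = end + 2
    return spans


def granule(data: bytes, span):
    """Recover one granule from the untouched resource using its offsets."""
    start, end = span
    return data[start:end].decode("utf-8")
```

Because only offsets are stored, the original resource stays intact and the same bytes can be tagged at several levels of granularity.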

Have you run this with any image-based collections? No, not yet. Text carries much of the information inherently. Images are harder, as you need some metadata for access to the image. But with attributes described, it can give a way in.

Liz Liddy LiddyL.IR.ppt

First-round project to break the metadata generation bottleneck. Did a small evaluation which suggested that automated was similar to manual. Now doing a second project, called MetaTest, to go further. Set of questions about use and value of metadata. Talked about types of evaluation of metadata, including quantitative information retrieval evaluation. Presented the methodology to be used in the project. Found that some fields had better coverage from automated extraction, as manual metadata seemed to be concentrated in a few DC fields. Looking at human eye tracking to help in evaluation of use. This is being used in some of the evaluation being done in this project. Their poster also has more info on this, including preliminary findings (some were in the presentation slides). Preparing for an IR experiment based on some evaluation of value to users of current metadata in NSDL.
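
Quantitative IR evaluation commonly compares what a system retrieves against a set of relevance judgments, using precision and recall. The sketch below shows those two standard measures for a single query; it is not the MetaTest methodology.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Running the same queries against an index built from automatically generated metadata and one built from manual metadata, then comparing these figures, is one way such a comparison could be made.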

Dean Zollman NSDL IR panel.ppt

Searching video. Much of the work is based on CMU human interaction projects. Project focuses on high school physics teachers. Talked today about two databases: Traditional, as it's somewhat like a traditional search engine, and Synthetic Interviews, more of a natural language interface. Issues: extracting metadata, and how do you present video? Traditional is based on Informedia, which includes a metadata extractor working from the audio track. Apply search and present possible video results that match the audio. Looking at a number of interface possibilities for this (examples in slides). Synthetic method is more of a conversational mode, following the ask-an-expert paradigm. Interviews with experts are recorded and marked, then teachers can ask a question and the search system takes them to relevant parts of the interviews with the expert in a more conversational approach.
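
Searching video through its audio track can be pictured as keyword search over a time-stamped transcript, with each match pointing back to a playback position. The sketch below assumes a transcript already exists as (start_seconds, text) pairs; it is not Informedia's implementation.

```python
def find_segments(transcript, term):
    """Return start times (seconds) of transcript segments mentioning term.

    transcript is a list of (start_seconds, text) pairs, as might be
    produced by speech recognition over a video's audio track.
    Matching is on whole, lowercased, whitespace-separated words.
    """
    needle = term.lower()
    return [start for start, text in transcript if needle in text.lower().split()]
```

An interface could then offer each returned start time as a jump point into the video.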

Can other NSDL collections use your service? We haven't tried that yet. Not as a service, but we'd be happy to take other videos and incorporate them into the pathway project.

Informedia has had a series of innovative UIs. What have you learnt as you've gone along? In terms of video, we've backed up, as it's a web interface. In the past we've had a thick client. The web interface has limitations compared with the thick client, but browsing could be a useful access method through the web interface.

User concerns about accessibility from Mac for the NSDL audience; current access is limited to Internet Explorer 5.5 on PC.

Panel Discussion

General questions to address:
What are special IR needs in NSDL?
How can advances in IR better serve NSDL users?
What should projects do regarding IR?
How can CI support NSDL IR needs?
What can the Tech committee do to help regarding IR?

Comment. Video is exciting, but I think Liddy should be congratulated on her research, as the result is interesting in terms of automated metadata. This was thought not to be possible, but Liz has tested this. Liddy started out unconvinced of the need for metadata, but now thinks there are specific things it is useful for. Not convinced controlled vocabularies map into what people think.

NSF data management program has funded a GWU and NIST project to look at search engine requirements for use in the math area, especially with respect to math equations.

Allan: special search engines for particular types of search, e.g. GIS. Talked about how to possibly integrate such search engine output (merge), but not at the indexing side.
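
One common way to merge output from separate engines without touching their indexes is to normalize each engine's scores and sum them per document (often called CombSUM). The sketch below uses min-max normalization; it is a generic technique shown for illustration, not necessarily the approach discussed in the session.

```python
def normalize(results):
    """Min-max normalize a {doc_id: score} mapping to the range [0, 1]."""
    if not results:
        return {}
    low, high = min(results.values()), max(results.values())
    if high == low:
        return {doc: 1.0 for doc in results}
    return {doc: (s - low) / (high - low) for doc, s in results.items()}


def merge(*result_lists):
    """Sum normalized scores per document and return ids, best first."""
    combined = {}
    for results in result_lists:
        for doc, s in normalize(results).items():
            combined[doc] = combined.get(doc, 0.0) + s
    # Ties are broken by document id so the order is deterministic.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))
```

Normalizing first matters because raw scores from a text engine and a specialized engine are generally not on comparable scales.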

What about Ontologies? Is a controlled vocabulary an ontology? Ontologies are useful and important for some areas. Liddy: two approaches, one using ontologies, or using NLP. Ontologies can be brittle to adaptation later on.

Georeferenced information. Work with gazetteers to link the words that work in search engines like Lucene to spatial extent. This is a way to link the words of the science to the binary data in the databases.
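
A gazetteer lookup of this kind can be sketched as a mapping from place names to bounding boxes, which can then be compared with the spatial extent of a dataset. The place names and coordinates below are invented for illustration, and the overlap test ignores boxes that cross the antimeridian.

```python
# Hypothetical gazetteer: place name -> (west, south, east, north) in degrees.
GAZETTEER = {
    "example bay": (-10.0, 40.0, -5.0, 45.0),
    "sample ridge": (20.0, -30.0, 25.0, -25.0),
}


def extents_in_text(text, gazetteer=GAZETTEER):
    """Return bounding boxes for gazetteer names that occur in the text."""
    lowered = text.lower()
    return {name: box for name, box in gazetteer.items() if name in lowered}


def overlaps(a, b):
    """True if two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

With this in place, a text query that mentions a place name can be turned into a box and matched against georeferenced records.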

Freestone: database and IR worlds diverged a number of years ago and the two sides need to come back together to learn from each other.

Issue of standards, and standards in education. How will these technologies work with standards? Liddy is also looking at this area, but it's a hard problem. Just the number of standards is a big problem.

Many groups in NSDL have their own sub-community vocabularies and ontologies. Meeting the IR needs of users who are not specialists in the area is a significant problem.

Seems there's a large emphasis on metadata. Is it really useful? One example: content that was on the site but didn't get used until it was moved to a new area of the site titled "Teaching Tips", along with some basic metadata, and then use went up.

NSDL thanks DLESE for hosting the swikis for the NSDL Annual Meeting 2003.