Wikicompany:Metasearcher

Introduction

This project is a Ruby on Rails metasearch web application for library information stores that are accessible via SRU (Search/Retrieve via URL).

At a later stage more search interfaces/protocols (such as RSS, OAI and OpenSearch) may be supported in the services backend.

user interface design

Ruby

  • Projects:
    • http://ruby-lang.org/en/
    • http://rubyforge.org
  • Documentation:
    • http://www.ruby-doc.org
    • http://www.ruby-doc.org/stdlib/
    • http://www.rubycentral.com/book/ (the updated version for Ruby 1.8 is not free)
  • Talk:
    • http://www.ruby-forum.com
    • http://debianlinux.net/ruby/ (blog aggregation)
    • http://del.icio.us/tag/ruby+blog

Rails

  • Projects:
    • http://www.rubyonrails.org
    • rubyforge rails projects
  • Documentation:
    • http://api.rubyonrails.org
  • Talk:
    • http://groups.google.com/group/rubyonrails-core
    • http://groups.google.com/group/rubyonrails-talk

Design principles

metasearcher implementation concept

Client

  • Use the browser's capabilities (such as AJAX and JavaScript) for a more responsive web experience.
  • The application must work correctly in most major browsers (MSIE, Mozilla Firefox, Safari, Opera).
  • Easy application installation and setup process.
  • Good workflow
  • Good web design styling using CSS.
  • Acceptable user-interaction performance
  • Clean and simple URL API design.

Server

  • Clean and simple code design.
    • Integrate existing code components as much as possible, instead of writing our own code.
  • Use open standards if possible, especially for public interfaces: SRU, CQL, AJAX, JavaScript, DOM, CSS, HTML

Threaded SRU workers

Note: for now we focus on handling SRU services; other popular search interfaces may be supported at a later stage.

Thread system

  • backgroundrb
    • backgroundrb is a Rails plugin for decoupling long-running tasks from the Rails HTTP request/response cycle. Periodic data view updates (and status reports) can be initiated from a client-side AJAX call.
    • svn source
    • mailinglist
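
The worker idea above can be illustrated without backgroundrb. The following is only a minimal sketch using plain Ruby threads: the `SearchJob` class, its method names, and the callable "fetchers" are hypothetical stand-ins for real SRU HTTP requests, not part of backgroundrb or of this project's code.

```ruby
# Hypothetical fan-out: one thread per SRU service. Hits are collected in a
# thread-safe Queue, and a per-service status hash can be polled by the client.
class SearchJob
  attr_reader :status

  # services: Hash of service name => callable that takes a CQL query and
  # returns an Array of hits (a stand-in for a real SRU HTTP fetch).
  def initialize(services)
    @services = services
    @results  = Queue.new
    @status   = {}
    @lock     = Mutex.new
  end

  # Queries every service in its own thread and waits for all of them.
  def run(query)
    threads = @services.map do |name, fetcher|
      set_status(name, :running)
      Thread.new do
        begin
          fetcher.call(query).each { |hit| @results << [name, hit] }
          set_status(name, :done)
        rescue StandardError
          # One failing service must not abort the whole search.
          set_status(name, :failed)
        end
      end
    end
    threads.each(&:join)
    self
  end

  # Drains the hits that have arrived so far as [service, hit] pairs.
  def hits
    collected = []
    collected << @results.pop until @results.empty?
    collected
  end

  private

  def set_status(name, state)
    @lock.synchronize { @status[name] = state }
  end
end
```

In a real deployment `run` would itself execute outside the request cycle, so that an AJAX poll can read `status` and `hits` while slower services are still responding.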

Worker status interface

  • status polling interface (in Model?)
  • Query action

XML result parsing

  • Live SRU servers:
    • http://www.loc.gov/standards/sru/servers.html
  • SRU service info:
    • fetch "explainResponse" document info:
      • serverInfo
      • indexInfo
      • schemaInfo
      • configInfo (?)
  • Results processing:
    • SRU XML response data (partial / complete XML file)
    • Hit data cleaning (the XML data must be normalized for further use)
    • Hit data insertion
  • Issues:
    • fetch speed, parse speed, parse complexity, data structure accessibility
  • Tools:
    • hpricot - html / xml parser (DOM mode)
    • REXML - xml parser (DOM and SAX mode)
    • clxmlserial - REXML based classes serializer (status ?)
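
As a rough illustration of the results-processing step, the sketch below uses REXML in DOM mode to pull the hit count and two Dublin Core fields out of an SRU 1.1 `searchRetrieveResponse`. The helper names `parse_sru_response` and `text_of` are hypothetical, and the code assumes records carry `dc:title` and `dc:creator` elements, which depends on the record schema a server returns. On recent Ruby versions REXML ships as a bundled gem.

```ruby
require 'rexml/document'

# Namespace URIs for SRU 1.1 responses and Dublin Core elements.
SRU_NS = {
  'srw' => 'http://www.loc.gov/zing/srw/',
  'dc'  => 'http://purl.org/dc/elements/1.1/'
}

# Returns { total: Integer, hits: [{ title: ..., creator: ... }, ...] }.
def parse_sru_response(xml)
  doc   = REXML::Document.new(xml)
  total = REXML::XPath.first(doc, '//srw:numberOfRecords', SRU_NS)
  hits  = REXML::XPath.match(doc, '//srw:record', SRU_NS).map do |record|
    {
      title:   text_of(record, './/dc:title'),
      creator: text_of(record, './/dc:creator')
    }
  end
  { total: total ? total.text.to_i : 0, hits: hits }
end

# Whitespace-trimmed text of the first matching element, or nil if absent.
def text_of(node, path)
  element = REXML::XPath.first(node, path, SRU_NS)
  element && element.text.to_s.strip
end
```

DOM parsing keeps the example short; for large result sets a stream-based parser would trade this simplicity for lower memory use.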

Model

queries table

  • id
  • qid
  • qstring (a normalized query string)
  • timestamp
  • hostip
  • ...
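
One possible way to fill a queries row is sketched below. The normalization rule (trim and collapse whitespace) and the derivation of `qid` from a digest of the normalized string are assumptions made for illustration only; the design above does not specify either, and the helper names are hypothetical.

```ruby
require 'digest/sha1'

# Assumed normalization: trim and collapse runs of whitespace. Case is left
# untouched because case sensitivity of CQL terms is up to each server.
def normalize_query(raw)
  raw.to_s.strip.gsub(/\s+/, ' ')
end

# Builds a Hash shaped like a queries row. Deriving qid from a digest
# (an assumption) gives identical normalized queries the same qid.
def query_row(raw, host_ip, now = Time.now)
  qstring = normalize_query(raw)
  {
    qid:       Digest::SHA1.hexdigest(qstring)[0, 12],
    qstring:   qstring,
    timestamp: now,
    hostip:    host_ip
  }
end
```

With a content-derived qid, repeating a search can reuse the hits already stored for it; a plain sequence number would work equally well if that reuse is not wanted.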

hits table

Considerations

  • Use good standards if possible: Dublin Core, MARC21, MODS, MediaRSS, etc.
  • element-name synonym mapping (handled by a service backend)
  • easy multi-value field extraction / storage / indexing / search
    • split/join multi-value column fields such as: 'food||water||fire'
  • multi-lingual field support (?) (besides the user-interface language options)
  • UTF-8 support
  • HTML support
  • performance considerations (use a stream/SAX based XML parsing approach if possible)
  • security considerations (system robustness, HTML/JS injection, SQL injection, DoS attacks, data pollution)
  • authority file usage for topic extraction/discovery
  • External content relations:
    • file-url for an object (e.g. a PDF or text file)
    • concept-url for an object (e.g. a Wikipedia URL)
    • Book web services (covers, ToC)
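
The multi-value field convention mentioned above ('food||water||fire') can be sketched as a pair of helpers. The names `join_values` and `split_values` are hypothetical, and the sketch assumes individual values never contain the separator themselves.

```ruby
MULTI_VALUE_SEPARATOR = '||'

# Packs several values into one column string, skipping blank entries.
def join_values(values)
  values.map { |v| v.to_s.strip }.reject(&:empty?).join(MULTI_VALUE_SEPARATOR)
end

# Unpacks a column string back into an Array of values.
def split_values(column)
  column.to_s.split(MULTI_VALUE_SEPARATOR).map(&:strip).reject(&:empty?)
end
```

A single delimited column keeps the hits table flat, at the cost of making per-value indexing and search harder than a separate values table would.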

Fields

  • id
  • qid (links to a corresponding query table entry)
  • (todo: add foreignkey fields)
  • dc:identifier (examples: isbn:9025126068, http://mydbsite.org/item/23223, ...)
  • dc:title
  • author -> dc:creator
  • dc:description
  • year -> dc:date (or _also_ have a separate year field?)
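
The element-name synonym mapping (author -> dc:creator, year -> dc:date) could look roughly like this. The two mappings come from the list above; the `normalize_fields` helper and the choice to lower-case incoming names are assumptions.

```ruby
# Synonyms taken from the field list; further entries would be added per service.
FIELD_SYNONYMS = {
  'author' => 'dc:creator',
  'year'   => 'dc:date'
}.freeze

# Renames known synonyms to their Dublin Core names; other fields pass through.
def normalize_fields(record)
  record.each_with_object({}) do |(name, value), out|
    key = name.to_s.downcase
    out[FIELD_SYNONYMS.fetch(key, key)] = value
  end
end
```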

New Fields:

  • dc:publisher
  • dc:type
    • DCMI types (an ! indicates probable usage):
    •  ! collection
    •  ! dataset (a subset of collection)
    •  ! event (e.g. an auction, exhibition)
    •  ! image
    • interactiveresource
    • movingimage
    • physicalobject
    •  ! service (name of the web service used)
    • software
    •  ! sound
    • stillimage
    •  ! text
  • dc:format
  • dc:language
  • dc:relation
  • dc:subject
  • dc:coverage (spatial / temporal)
  • dc:source
  • dc:rights
  • dc:contributor (also covers: illustrators?)

Considered:

  • subtitle
  • insertdate
  • lastupdatedate
  • version (or: edition, number)
  • editor
  • translator
  • organisation
  • location
  • price
  • material (e.g. bindingtype, ...)
  • annotation (or: remark)
  • attachment (e.g. CDROM)
  • status
  • itemcode (any form of internal object coding)
  • classification
  • size
  • scale
  • pages

Considered for serials:

  • serial (name of the whole series)
  • startdate
  • enddate (if the serial stopped)
  • frequency
  • pagestart
  • pageend

MRSS

  • media:group
  • media:content

optional elements:

  • media:adult
  • media:rating
  • media:description
  • media:keywords
  • media:thumbnail
  • media:category
  • media:hash
  • media:player
  • media:credit
  • media:copyright
  • media:text
  • media:restriction
  • media:title

Other Field options:

  • contact (email, address, ...)

services table

  • (to be designed)

facets table

  • (to be designed)

Controller

URL API

URL design

  • ms - controller which accepts other methods and parameters (it defaults to using the "list" method)
  • list - to show a list of result items
    • qid - the query id (interesting idea: combining multiple qids?)
    • page - the results page number
    • sort - the sort-column
    • sortd - the requested sort direction of the column, which can be: asc or desc
  • show - to show a specific result item
    • qid - the query id
    • oid - the object id
  • search - to search for items
    • q - a Common Query Language query string
    • format - The output format, which can be: html, rss or sru. By default the output will be html.
    • services - The ";" delimited list of the web services to be queried. By default all services will be queried.
  • new / create (for testing only)
  • edit / update (for testing only)
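
The parameter rules above can be sketched as plain Ruby functions that turn a raw parameter hash into validated values. This is not Rails controller code: the function names are hypothetical, and the default sort direction of "asc" is an assumption, since the design only lists the allowed values.

```ruby
FORMATS    = %w[html rss sru].freeze
DIRECTIONS = %w[asc desc].freeze

# Parameters of the "search" action.
def search_params(params)
  format = params.fetch('format', 'html') # html is the documented default
  raise ArgumentError, "unknown format: #{format}" unless FORMATS.include?(format)

  {
    q:        params.fetch('q'), # CQL query string, required
    format:   format,
    # ";" delimited service list; :all stands for "query every service"
    services: params.key?('services') ? params['services'].split(';') : :all
  }
end

# Parameters of the "list" action.
def list_params(params)
  sortd = params.fetch('sortd', 'asc') # default direction is an assumption
  raise ArgumentError, "unknown sort direction: #{sortd}" unless DIRECTIONS.include?(sortd)

  {
    qid:   params.fetch('qid'),
    page:  Integer(params.fetch('page', 1)),
    sort:  params['sort'],
    sortd: sortd
  }
end
```

For example, `search_params({ 'q' => 'Rembrandt' })` falls back to the html format and to all services, mirroring the defaults described in the list.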

URL examples

  • Todo: identify search queries by qid, and allow these qids to be used by other action methods, such as: list, show, bookmark.
  • list:
    • http://www.rijksmuseum.nl/ms (the ms controller's "index" method defaults to using the "list" method)
    • http://www.rijksmuseum.nl/ms/list?qid=423423532&page=4
  • show:
    • http://www.rijksmuseum.nl/ms/show/829132
  • search (a search first retrieves the results and returns a qid, then presents the results using the "list" method):
    • http://www.rijksmuseum.nl/ms/search?q=Rembrandt (searches for all records with "Rembrandt" in the default service list)
    • This search query should lead to:
      • http://www.rijksmuseum.nl/ms/list?qid=423423532
      • http://www.rijksmuseum.nl/ms/list?qid=423423532&page=1&format=html (equivalent to previous URL)

More search examples:

  • http://www.rijksmuseum.nl/ms/search?q=Rembrandt&services=rml;kb

Other controller issues

  • Query mapping: URL query -> SRU query
    • CQL string (can be used verbatim)
    • FromResult, NumberofResults / ToResult
  • Register query semantics: link the QID (query ID) to the hits rows (so we can determine which hits belong to which query)
  • Hit data removal (periodic SQL delete statement)
  • User session management via cookies (can be used for user-specific data which needs to be made persistent)
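
The query mapping described above (URL query -> SRU query) might be sketched as follows. `operation`, `version`, `query`, `startRecord` and `maximumRecords` are standard SRU 1.1 searchRetrieve parameters; the `sru_url` helper, the page size of 10, and the example base URL are assumptions made for illustration.

```ruby
require 'uri'

# Builds an SRU 1.1 searchRetrieve URL. The CQL string is passed through
# verbatim; the page number is translated into SRU's 1-based startRecord.
def sru_url(base_url, cql, page: 1, per_page: 10)
  start_record = (page - 1) * per_page + 1
  query = URI.encode_www_form({
    operation:      'searchRetrieve',
    version:        '1.1',
    query:          cql,
    startRecord:    start_record,
    maximumRecords: per_page
  })
  "#{base_url}?#{query}"
end

# Usage with a hypothetical endpoint:
#   sru_url('http://sru.example.org/sru', 'Rembrandt', page: 4)
```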

SRU API

  • automated mapping of the standard URL API to a subset of the SRU API standard
  • output SRU XML response (instead of html)
  • explainResponse document

View

List view

  • Show a list of query hits and update it dynamically as more data is fetched (after an AJAX poll).
  • Create a JS-based sortable-column interface
  • Pagination (beware of the bugs and slowness in the current Rails pagination code!)
    • pagination helper
    • HowtoPagination
  • Mark matched query terms in bold (as Google does) so users can easily see where in the record the query matches.
    • See: ActionView::Helpers::TextHelper
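
Rails' TextHelper provides a `highlight` helper for this purpose. To show the idea independently of Rails, here is a minimal plain-Ruby sketch; the `highlight_terms` name and the choice of `<strong>` as the wrapping tag are assumptions. The text is split on the matched terms first, and every piece is HTML-escaped separately so that record data cannot inject markup.

```ruby
require 'cgi'

# Wraps case-insensitive occurrences of the query terms in <strong> tags
# and HTML-escapes everything else.
def highlight_terms(text, terms)
  words = terms.map { |t| t.to_s.strip }.reject(&:empty?)
  return CGI.escapeHTML(text) if words.empty?

  alternatives = words.map { |w| Regexp.escape(w) }.join('|')
  pattern = Regexp.new("(#{alternatives})", Regexp::IGNORECASE)

  # Splitting on a pattern with a capture group keeps the matches in the
  # result, at the odd positions.
  text.split(pattern).each_with_index.map do |piece, index|
    html = CGI.escapeHTML(piece)
    index.odd? ? "<strong>#{html}</strong>" : html
  end.join
end
```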

To check:

  • SRU options for pagination
  • Option for third-party webservice usage (such as: "show a book thumbnail image from service X")

Detail view

  • Can show all the row info of a hit
  • Can show external service call data inline (availability, ToC, physical location, ...).

Other ideas

  • Relevance ranking:
    • Subjective: Learn from user queries -> make suggestions and recommendations
    • Objective
      • Relevance ranking algorithm? Some possible aspects to such a feature:
        • service
        • place of query matching in the record
        • number of text matches in the record
        • type of media
        • publishing year
        • ...
  • Usability improvements:
    • per-word autocompletion in the search text field
      • register each separate word for each user search in a special "querywords" table.
    • Query bookmarking: for easy/quick search-history access.
    • Object bookmarking: allow user to collect interesting objects for later review (cart system).
  • Automated topic maps
    • http://rubysvm.cilibrar.com - Framework for automated text categorization
    • Carrot2 demo - Java-based general metasearch engine
  • Context view
    • Allow users to combine multiple queries into a "context view" (better name needed? "query pool"?)
    • Store such views by a name
    • Get an updated RSS feed for such a view
    • Allow for default context views for the most common queries
    • ...
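
For the objective relevance ranking idea listed above, a very small scoring sketch could combine two of the named aspects: where in the record the query matches, and how many text matches there are. The field weights below are invented for illustration, and the function names are hypothetical; the remaining aspects (service, media type, publishing year) are left out.

```ruby
# Hypothetical weights: a match in the title counts more than one in the
# description. The design lists the aspects but not their importance.
FIELD_WEIGHTS = {
  'dc:title'       => 3.0,
  'dc:creator'     => 2.0,
  'dc:description' => 1.0
}.freeze

# Sums weight * number of case-insensitive term occurrences per field.
def relevance_score(hit, terms)
  words = terms.map { |t| t.to_s.downcase }.reject(&:empty?)
  FIELD_WEIGHTS.sum do |field, weight|
    text = hit[field].to_s.downcase
    weight * words.sum { |word| text.scan(word).size }
  end
end

# Orders hits from most to least relevant.
def rank(hits, terms)
  hits.sort_by { |hit| -relevance_score(hit, terms) }
end
```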

Open issues

  • Permanent linking to: single objects, queries and contextual views.
  • Non-disruptive result polling and view updates (with sorting).
  • SRU response handling strategy (DOM / streaming, SRU XML namespaces, automated index-field usage, ...)
  • Robust paging with multiple SRU servers (?)

Estimated Timeline

  • Aug: basic idea presentation (together with the other alternatives)
  • Sep-Oct: working demo
    • working data engine: URL -> query -> SRU XML response -> XML handling -> hits table and query table insertions
    • Primitive query form
    • qid functionality in data model and application logic
    • code design analysis
    • SRU, RSS, MediaRSS performance tests
    • discuss outstanding UI options
      • paging/ajax-polling system
        • SRU resultset retrieval
        • Result ordering
      • permanent links
      • bookmarking data system (full content / ID only)
    • Local Adlib setup and SRU testing (hopefully!)
    • Local Ockham OAI/SRU setup
  • Nov:
    • query form: more expressive query form (CQL support?)
    • relevance ranking (objective)
    • UI for selecting resources
    • Setup polling system for new results
    • Result content analysis: facet filtering system
  • Dec:
    • completion of any outstanding issues

Resources

  • ruby image-handling modules and applications