Wikicompany:Metasearcher
From Wikicompany
Introduction
This project is a metasearch web application, built with Ruby and Ruby on Rails, for library information stores that are accessible via SRU.
At a later stage more search interfaces/protocols (such as RSS, OAI and OpenSearch) may be supported in the services backend.
Ruby
- Projects:
- http://ruby-lang.org/en/
- http://rubyforge.org
- Documentation:
- http://www.ruby-doc.org
- http://www.ruby-doc.org/stdlib/
- http://www.rubycentral.com/book/ (the updated version for Ruby 1.8 is not free)
- Talk:
- http://www.ruby-forum.com
- http://debianlinux.net/ruby/ (blog aggregation)
- http://del.icio.us/tag/ruby+blog
Rails
- Projects:
- http://www.rubyonrails.org
- rubyforge rails projects
- Documentation:
- http://api.rubyonrails.org
- Talk:
- http://groups.google.com/group/rubyonrails-core
- http://groups.google.com/group/rubyonrails-talk
Design principles
Client
- Make use of the browser's potential for a more responsive web experience (such as AJAX and JS).
- The application must work correctly in most major browsers (MSIE, Mozilla Firefox, Safari, Opera).
- Easy application installation and setup process.
- Good workflow
- Good web design styling using CSS.
- Acceptable user-interaction performance
- Clean and simple URL API design.
Server
- Clean and simple code design.
- Integrate existing code components as much as possible, instead of writing our own code.
- Use open standards if possible, especially for public interfaces: SRU, CQL, AJAX, JavaScript, DOM, CSS, HTML
Threaded SRU workers
Note: for now we focus on handling SRU services. Other popular search interfaces may be supported at a later stage.
Thread system
- backgroundrb
- backgroundrb is a Rails plugin for divorcing long running tasks from the Rails HTTP request/response cycle. Periodic data view updates (and status reports) can be initiated from a client-side AJAX call.
- svn source
- mailing list
Worker status interface
- status polling interface (in Model?)
- Query action
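The worker and status-polling idea can be sketched with plain Ruby threads. This is only an illustration and does not use backgroundrb's API: the `WorkerPool` class, the status symbols and the stubbed jobs are all assumptions made for the example.

```ruby
# Hypothetical pool: one thread per service, plus a status map that can be polled.
class WorkerPool
  def initialize
    @status  = {}        # service name => :running, :done or :failed
    @results = {}        # service name => value returned by the job
    @lock    = Mutex.new # guards both hashes across threads
    @threads = []
  end

  # Start a job for one service; the block stands in for an SRU fetch.
  def start(service, &job)
    @lock.synchronize { @status[service] = :running }
    @threads << Thread.new do
      begin
        value = job.call
        @lock.synchronize do
          @results[service] = value
          @status[service]  = :done
        end
      rescue StandardError
        @lock.synchronize { @status[service] = :failed }
      end
    end
  end

  # Snapshot that a status action could return to an AJAX poll.
  def status
    @lock.synchronize { @status.dup }
  end

  def results
    @lock.synchronize { @results.dup }
  end

  # Block until every worker has finished.
  def wait
    @threads.each(&:join)
  end
end

pool = WorkerPool.new
pool.start('kb')  { ['hit 1', 'hit 2'] } # stubbed result list
pool.start('rml') { raise 'timeout' }    # simulated failure
pool.wait
p pool.status
```

In a Rails setting the `status` snapshot would be what a polling action serializes for the client. Where that state should live (model, session, or worker process) is still the open question noted above.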
XML result parsing
- Live SRU servers:
- http://www.loc.gov/standards/sru/servers.html
- SRU service info:
- fetch "explainResponse" document info:
- serverInfo
- indexInfo
- schemaInfo
- configInfo (?)
- Results processing:
- SRU XML response data (partial / complete XML file)
- Hit data cleaning (the XML data must be normalized for further use)
- Hit data insertion
- Issues:
- fetch speed, parse speed, parse complexity, data structure accessibility
- Tools:
- hpricot - html / xml parser (DOM mode)
- REXML - xml parser (DOM and SAX mode)
- clxmlserial - REXML based classes serializer (status ?)
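As a rough illustration of the results-processing steps (parse the SRU XML, clean the hit data, prepare it for insertion), here is a DOM-mode sketch using REXML. The sample response is made up, the choice of fields is an assumption, and real servers may wrap records in other schemas, so treat this as a starting point rather than a complete parser.

```ruby
require 'rexml/document'

# Namespace URIs for SRU 1.1 responses and Dublin Core elements.
NS = {
  'srw' => 'http://www.loc.gov/zing/srw/',
  'dc'  => 'http://purl.org/dc/elements/1.1/'
}

# Fields pulled out of each record (an illustrative subset).
FIELDS = %w[title creator date subject].freeze

# Made-up sample; a real response would come from an SRU server.
SAMPLE = <<~XML
  <srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/"
                              xmlns:dc="http://purl.org/dc/elements/1.1/">
    <srw:numberOfRecords>1</srw:numberOfRecords>
    <srw:records>
      <srw:record>
        <srw:recordData>
          <dc:title>  The Night Watch </dc:title>
          <dc:creator>Rembrandt</dc:creator>
          <dc:subject>painting</dc:subject>
          <dc:subject>militia</dc:subject>
        </srw:recordData>
      </srw:record>
    </srw:records>
  </srw:searchRetrieveResponse>
XML

# Returns one hash per record: field name => array of cleaned values.
def parse_hits(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//srw:record', NS).map do |record|
    FIELDS.each_with_object({}) do |field, hit|
      values = REXML::XPath.match(record, ".//dc:#{field}", NS)
                           .map { |el| el.text.to_s.strip } # basic cleaning
                           .reject(&:empty?)
      hit[field] = values unless values.empty?
    end
  end
end

hits = parse_hits(SAMPLE)
p hits
```

Each field keeps an array of values, so repeated elements such as `dc:subject` survive until the storage layer decides how to handle them.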
Model
queries table
- id
- qid
- qstring (a normalized query string)
- timestamp
- hostip
- ...
hits table
Considerations
- Use good standards if possible: Dublin Core, MARC21, MODS, MediaRSS, etc.
- element-name synonym mapping (handled by a service backend)
- easy multi-value field extraction / storage / indexing / search
- split/join multi-value column fields such as: 'food||water||fire'
- multi-lingual field support (?) (besides the user-interface language options)
- UTF-8 support
- HTML support
- performance considerations (use a stream/SAX based XML parsing approach if possible)
- security considerations (system robustness, HTML/JS injection, SQL injection, DoS attacks, data pollution)
- authority file usage for topic extraction/discovery
- External content relations:
- file-url for an object (e.g. a PDF or text file)
- concept-url for an object (e.g. a Wikipedia URL)
- Book web services (covers, ToC)
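The multi-value storage idea above ('food||water||fire') could look roughly like the following. The helper names are hypothetical, and the sketch assumes that individual values never contain the `||` separator themselves.

```ruby
SEPARATOR = '||'

# Join several values into one column string, dropping blanks and duplicates.
def join_values(values)
  values.map { |v| v.to_s.strip }.reject(&:empty?).uniq.join(SEPARATOR)
end

# Split a stored column string back into its individual values.
def split_values(column)
  column.to_s.split(SEPARATOR).map(&:strip).reject(&:empty?)
end

stored = join_values(['food', ' water ', '', 'fire', 'food'])
puts stored
p split_values(stored)
```

A separator-based column keeps the schema flat but makes per-value indexing and searching harder than a separate values table would, which is part of the trade-off listed above.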
Fields
- id
- qid (links to a corresponding query table entry)
- (todo: add foreignkey fields)
- dc:identifier (examples: isbn:9025126068, http://mydbsite.org/item/23223, ...)
- dc:title
- author -> dc:creator
- dc:description
- year -> dc:date (or _also_ have a separate year field?)
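The element-name synonym mapping (author -> dc:creator, year -> dc:date) might be expressed as a simple lookup table. Only those two mappings come from this page; the function name and the lower-casing of field names are assumptions for the sketch.

```ruby
# Synonyms taken from the field list above; extend as needed.
SYNONYMS = { 'author' => 'dc:creator', 'year' => 'dc:date' }.freeze

# Map incoming field names onto the canonical names, collecting values per field.
def normalize_fields(record)
  record.each_with_object({}) do |(name, value), out|
    key = name.to_s.downcase
    key = SYNONYMS.fetch(key, key)
    (out[key] ||= []) << value
  end
end

p normalize_fields('Author' => 'Rembrandt', 'dc:creator' => 'Bol', 'year' => '1642')
```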
New Fields:
- dc:publisher
- dc:type
- DCMI fields (an ! indicates probable usage):
- ! collection
- ! dataset (a subset of collection)
- ! event (e.g. an auction or exhibition)
- ! image
- interactiveresource
- movingimage
- physicalobject
- ! service (name of the web service used)
- software
- ! sound
- stillimage
- ! text
- dc:format
- dc:language
- dc:relation
- dc:subject
- dc:coverage (spatial / temporal)
- dc:source
- dc:rights
- dc:contributor (also covers: illustrators?)
Considered:
- subtitle
- insertdate
- lastupdatedate
- version (or: edition, number)
- editor
- translator
- organisation
- location
- price
- material (e.g. binding type, ...)
- annotation (or: remark)
- attachment (e.g. CD-ROM)
- status
- itemcode (any form of internal object coding)
- classification
- size
- scale
- pages
Considered for serials:
- serial (name of the whole series)
- startdate
- enddate (if the serial stopped)
- frequency
- pagestart
- pageend
MRSS
- media:group
- media:content
optional elements:
- media:adult
- media:rating
- media:description
- media:keywords
- media:thumbnail
- media:category
- media:hash
- media:player
- media:credit
- media:copyright
- media:text
- media:restriction
- media:title
Other Field options:
- contact (email, address, ...)
services table
- (to be designed)
facets table
- (to be designed)
Controller
URL API
URL design
- ms - controller which accepts other methods and parameters (it defaults to using the "list" method)
- list - to show a list of result items
- qid - the query id (interesting idea: combining multiple qids?)
- page - the results page number
- sort - the sort-column
- sortd - the requested sort direction of the column, which can be: asc or desc
- show - to show a specific result item
- qid - the query id
- oid - the object id
- search - to search for items
- q - a Common Query Language query string
- format - The output format, which can be: html, rss or sru. By default the output will be html.
- services - The ";" delimited list of the web services to be queried. By default all services will be queried.
- new / create (for testing only)
- edit / update (for testing only)
URL examples
- Todo: identify search queries by qid, and allow these qids to be used by other action methods, such as: list, show, bookmark.
- list:
- http://www.rijksmuseum.nl/ms (the ms controller's "index" method defaults to using the "list" method)
- http://www.rijksmuseum.nl/ms/list?qid=423423532&page=4
- show:
- http://www.rijksmuseum.nl/ms/show/829132
- search (a search first retrieves the results and returns a qid, then presents the results using the "list" method):
- http://www.rijksmuseum.nl/ms/search?q=Rembrandt (searches for all records with "Rembrandt" in the default service list)
- This search query should lead to:
- http://www.rijksmuseum.nl/ms/list?qid=423423532
- http://www.rijksmuseum.nl/ms/list?qid=423423532&page=1&format=html (equivalent to previous URL)
More search examples:
- http://www.rijksmuseum.nl/ms/search?q=Rembrandt&services=rml,kb
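To make the URL design concrete, the sketch below breaks one of the example URLs into controller, action and parameters using only the `uri` standard library. The function name is hypothetical. The design text specifies ";" as the services delimiter while the example above uses ",", so the sketch accepts both.

```ruby
require 'uri'

# Split an ms URL into controller, action and the documented parameters.
def parse_ms_url(url)
  uri    = URI.parse(url)
  params = URI.decode_www_form(uri.query.to_s).to_h
  _, controller, action = uri.path.split('/')
  {
    controller: controller,
    action:     action || 'list',                  # "list" is the default method
    q:          params['q'],
    format:     params.fetch('format', 'html'),    # html is the default output
    services:   params.fetch('services', '').split(/[;,]/) # empty means all services
  }
end

p parse_ms_url('http://www.rijksmuseum.nl/ms/search?q=Rembrandt&services=rml,kb')
p parse_ms_url('http://www.rijksmuseum.nl/ms')
```

In a Rails application the router and `params` would do most of this work; the sketch only shows which defaults the URL API implies.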
Other controller issues
- Query mapping: URL query -> SRU query
- CQL string (can be used verbatim)
- FromResult, NumberofResults / ToResult
- Register query semantics: link the QID (query ID) to the hits rows, so we can determine which query each hit belongs to
- Hit data removal (periodic SQL delete statement)
- User session management via cookies (can be used for user-specific data which needs to be made persistent)
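The URL query -> SRU query mapping could be sketched as follows. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) are standard SRU 1.1 searchRetrieve parameters; the function name, the base URL and the page size are assumptions for illustration.

```ruby
require 'uri'

# Build an SRU searchRetrieve URL; the CQL string is passed through verbatim.
def sru_url(base_url, cql, page: 1, per_page: 10)
  params = {
    'operation'      => 'searchRetrieve',
    'version'        => '1.1',
    'query'          => cql,
    'startRecord'    => (page - 1) * per_page + 1, # SRU record positions start at 1
    'maximumRecords' => per_page
  }
  "#{base_url}?#{URI.encode_www_form(params)}"
end

# sru.example.org is a placeholder, not a real service.
url = sru_url('http://sru.example.org/sru', 'dc.title = "night watch"', page: 3, per_page: 20)
puts url
```

Page 3 with 20 records per page maps to `startRecord=41`, which is one way to translate the `page` parameter of the "list" method into SRU's record-offset paging.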
SRU API
- automated mapping of the standard URL API to a subset of the SRU API standard
- output SRU XML response (instead of html)
- explainResponse document
View
List view
- show a list of query hits, and update dynamically as more data is fetched (after an AJAX poll).
- Create a JS-based sortable-column interface.
- Pagination (beware of the bugs and slowness in the current Rails pagination code!)
- pagination helper
- HowtoPagination
- Mark search query patterns in bold (in the same way Google does) for easily discovering where in the record the query matches.
- See: ActionView::Helpers::TextHelper
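Outside of Rails, the query-term highlighting described above can be approximated in plain Ruby. This is a stand-alone sketch, not the TextHelper implementation: the function name is hypothetical and `<strong>` is an assumed choice of markup. The text is HTML-escaped piece by piece so record data cannot inject markup.

```ruby
require 'erb'

# Wrap case-insensitive matches of the query terms in <strong>, escaping everything else.
def highlight_terms(text, terms)
  words = terms.reject(&:empty?)
  return ERB::Util.html_escape(text) if words.empty?

  alternatives = words.map { |w| Regexp.escape(w) }.join('|')
  pattern = Regexp.new("(#{alternatives})", Regexp::IGNORECASE)

  # Splitting on a capture group keeps the matches at the odd indexes.
  text.split(pattern).each_with_index.map do |piece, i|
    escaped = ERB::Util.html_escape(piece)
    i.odd? ? "<strong>#{escaped}</strong>" : escaped
  end.join
end

puts highlight_terms('Rembrandt & the <Night> Watch', %w[rembrandt watch])
```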
To check:
- SRU options for pagination
- Option for third-party webservice usage (such as: "show a book thumbnail image from service X")
Detail view
- Can show all the row info of a hit
- Can show external service call data inline (availability, ToC, physical location, ...).
Other ideas
- Relevance ranking:
- Subjective: Learn from user queries -> make suggestions and recommendations
- Objective
- Relevance ranking algorithm? Some possible aspects of such a feature:
- service
- place of query matching in the record
- number of text matches in the record
- type of media
- publishing year
- ...
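Two of the objective aspects listed above (place of the match in the record and number of text matches) could be combined into a weighted score. The weights, field names and function name below are purely illustrative assumptions; no ranking algorithm has been chosen for this project.

```ruby
# Hypothetical weights: a match in the title counts more than one in the description.
WEIGHTS = { 'title' => 3.0, 'creator' => 2.0, 'description' => 1.0 }.freeze

# Score a hit by counting case-insensitive occurrences of the term per field.
def relevance(hit, term)
  needle = term.downcase
  WEIGHTS.sum do |field, weight|
    weight * hit.fetch(field, '').downcase.scan(needle).size
  end
end

hits = [
  { 'title' => 'Etchings', 'description' => 'A pupil of Rembrandt' },
  { 'title' => 'Rembrandt self-portrait', 'description' => 'Painted by Rembrandt' }
]

ranked = hits.sort_by { |hit| -relevance(hit, 'rembrandt') }
ranked.each { |hit| puts "#{relevance(hit, 'rembrandt')}  #{hit['title']}" }
```

Other aspects such as service, media type or publishing year could be added as further weighted terms in the same sum.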
- Usability improvements:
- Per-word search word autocompletion in the search text field
- register each separate word for each user search in a special "querywords" table.
- Query bookmarking: for easy/quick search-history access.
- Object bookmarking: allow user to collect interesting objects for later review (cart system).
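The "querywords" idea for autocompletion can be sketched in memory as below. In the application this would be a database table; the class name, the tokenization rule and the ranking by frequency are assumptions for the example.

```ruby
# In-memory stand-in for a "querywords" table: word => number of times searched.
class QueryWords
  def initialize
    @counts = Hash.new(0)
  end

  # Register every separate word of a user search.
  def register(query)
    query.downcase.scan(/[[:alnum:]]+/).each { |word| @counts[word] += 1 }
  end

  # Suggest completions for a prefix, most frequently searched words first.
  def complete(prefix, limit = 5)
    needle = prefix.downcase
    @counts.select { |word, _| word.start_with?(needle) }
           .sort_by { |word, count| [-count, word] }
           .first(limit)
           .map(&:first)
  end
end

words = QueryWords.new
['Rembrandt etching', 'rembrandt portrait', 'Renaissance'].each { |q| words.register(q) }
p words.complete('re')
```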
- Automated topic maps
- http://rubysvm.cilibrar.com - Framework for automated text categorization
- Carrot2 demo - Java-based general metasearch engine
- Context view
- Allow users to combine multiple queries into a "context view" (better name needed? "query pool"?)
- Store such views by a name
- Get an updated RSS feed for such a view
- Allow for default context views for the most common queries
- ...
Open issues
- Permanent linking to: single objects, queries and contextual views.
- Non-disruptive result polling and view updates (with sorting).
- SRU response handling strategy (DOM / streaming, SRU XML namespaces, automated index-field usage, ...)
- Robust paging with multiple SRU servers (?)
Estimated Timeline
- Aug: basic idea presentation (together with the other alternatives)
- Sep-Oct: working demo
- working data engine: URL -> query -> SRU XML response -> XML handling -> hits table and query table insertions
- Primitive query form
- qid functionality in data model and application logic
- code design analysis
- SRU, RSS, MediaRSS performance tests
- discuss outstanding UI options
- paging/ajax-polling system
- SRU resultset retrieval
- Result ordering
- permanent links
- bookmarking data system (full content / ID only)
- Local Adlib setup and SRU testing (hopefully!)
- Local Ockham OAI/SRU setup
- Nov:
- query form: more expressive query form (CQL support?)
- relevance ranking (objective)
- UI for selecting resources
- Setup polling system for new results
- Result content analysis: facet filtering system
- Dec:
- completion of any outstanding issues
Resources
- ruby image-handling modules and applications