Free Software :: Free Culture & Archiving Planet

Free Culture projects:

Research links:

ToC

  1. OpenGeoData : Screencast on how to remove duplicate node in OpenStreetMap
  2. EFF : Google Superbowl Ad Explains The Need for Search Privacy
  3. OpenGeoData : USGS VGI meeting results
  4. NLP News : Google Working On Speech-To-Speech Translation Phone
  5. NLP News : Google (Voice) solves universal translation soonish
  6. EFF : EFF Fights for Cell Phone Users' Privacy in Thursday Hearing
  7. NLP News : Google translation phone "two years away"
  8. Linux Foundation : The Alexandria Project, Chap. 4: Beware of Greeks bearing Trapdoors
  9. NLP News : Google’s Real-Time Voice Translator Could Make Any Language Lingua ...
  10. NLP News : Coming Soon from Google: Hilariously Inaccurate Real-Time Voice ...
  11. NLP News : Google develops phone-call language translation
  12. NLP News : Coming Soon from Google: Hilariously Inaccurate Real-Time Translations
  13. NLP News : Google Developing Language Translator for Mobile Phones
  14. NLP News : Google developing mobile language translation software
  15. NLP News : Google's Translation Phone Not Imminent
  16. NLP News : Google working on speech translation for phones
  17. OpenGeoData : OSM in the news roundup
  18. NLP News : Subscribe To Gizmodo
  19. NLP News : Google Working On Live Translation Service
  20. NLP News : Google Working on Translator Phone
  21. NLP News : Google's Software Will Enable Translation of Foreign Languages in Smartphones
  22. NLP News : Google Taking Translate To The Next Level
  23. NLP News : Google is Working on Speech-to-Speech Translation for Android
  24. NLP News : Google is developing an instant speech-to-speech translator
  25. NLP News : Google developing real-time translator phone
  26. NLP News : Google working on instant speech translation for cell phones
  27. NLP News : AppTek's Hybrid Machine Translation Scores Highest During Independent ...
  28. NLP News : AppTek's Hybrid Machine Translation Scores Highest During Independent Comparative Evaluation
  29. NLP News : Google considering speech-to-speech translation
  30. NLP News : Google planning mobile speech-to-speech translator
  31. NLP News : Google Phone's Killer App: Real-Time Translation
  32. NLP News : Google Working on Speech-to-Speech Phone Translation
  33. NLP News : Vietnamese professor wins IBM award
  34. NLP News : Google “translator phone” project promises real-time translation
  35. NLP News : Don't bother to learn foreign languages: Google to launch smartphones ...
  36. NLP News : Hanoi academic receives international award
  37. NLP News : Google to offer "Babel Fish" abilities with future Android phones
  38. NLP News : Google Developing Universal Language Translator for Smartphones
  39. NLP News : Google to launch instant speech translator for smartphones
  40. Tesseract : Re: Chinese Tessdata Pack
  41. NLP News : iSOCO electronic invoicing exchange could save 30 percent of processing costs
  42. Tesseract : Tesseract 3.0 for iPhone Compiling
  43. NLP News : Google translation dream is pie in the sky
  44. NLP News : Google working on smartphone software to automatically translate foreign ...
  45. Ubiquity : The Home of Drunk Celebs
  46. NLP News : Blog heads east
  47. Linux Foundation : CodePlex Foundation Picks Paula Hunter as Executive Director
  48. NLP News : Google leaps barrier with translator phone
  49. Ubiquity : Ubiquity Commands Should Be Prefixed
  50. NLP News : Soon, software for mobiles that translates foreign languages instantly
  51. NLP : Blog heads east
  52. NLP News : Google leaps language barrier with translator phone
  53. Open Knowledge Foundation : Book Search, Museum View, and Exploitation
  54. Linux Foundation : Happy camper
  55. FRBR : Last Week in FRBR #14
  56. NLP News : Yinlips E-Book, an iPad Clone with E-Ink
  57. W3C Semantic Web : German Translation of the RDFa Primer
  58. Tesseract : Help a newbie!!
  59. Wikimedia : Tech folk will again meet in Berlin
  60. EFF : Patent Office Grants EFF Request for Reexamination of Dangerous VOIP Patent
  61. EFF : Know Before You Go: Tickets May Come at a Higher Price Than You Realize
  62. Science Commons : Data Sharing on the Web
  63. Public Library of Science : New spring range now available in the PLoS Store
  64. NLP News : A Wikipedia Matching Approach to Contextual Advertising
  65. AKSW Semantic Web : 2nd Leipzig Semantic Web Day on 6th May
  66. BioMed OA : GOseq – a new method for Gene Ontology analysis of RNA-seq data
  67. NLP News : Microsoft and NSF Enable Research in the Cloud
  68. NLP News : Thomson Reuters Unveils WestlawNext, the Next Generation in Legal Research
  69. Open Access : True or false? Defend your answer.
  70. Open Book Alliance : Responding to U.S. Department of Justice Statement of Interest Regarding Google Books Settlement
  71. Information Aesthetics : Data Fiction: Storytelling with Information Graphics
  72. Linux Foundation : Free Loans at 0% Interest
  73. Journal of Machine Learning : Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting; Philippos Mordohai, Gérard Medioni; 11(Jan):411--450, 2010.
  74. Journal of Machine Learning : A Convergent Online Single Time Scale Actor Critic Algorithm; Dotan Di Castro, Ron Meir; 11(Jan):367--410, 2010.
  75. Journal of Machine Learning : Bundle Methods for Regularized Risk Minimization; Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le; 11(Jan):311--365, 2010.
  76. Journal of Machine Learning : Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure; Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano; 11(Jan):285--310, 2010.
  77. Journal of Machine Learning : Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):235--284, 2010.
  78. Journal of Machine Learning : Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):171--234, 2010.
  79. Journal of Machine Learning : An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data; Yufeng Ding, Jeffrey S. Simonoff; 11(Jan):131--170, 2010.
  80. Journal of Machine Learning : Classification Methods with Reject Option Based on Convex Risk Minimization; Ming Yuan, Marten Wegkamp; 11(Jan):111--130, 2010.
  81. Journal of Machine Learning : On-Line Sequential Bin Packing; András György, Gábor Lugosi, György Ottucsàk; 11(Jan):89--109, 2010.
  82. Journal of Machine Learning : Model Selection: Beyond the Bayesian/Frequentist Divide; Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley; 11(Jan):61--87, 2010.
  83. Journal of Machine Learning : Online Learning for Matrix Factorization and Sparse Coding; Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro; 11(Jan):19--60, 2010.
  84. Journal of Machine Learning : An Efficient Explanation of Individual Classifications using Game Theory; Erik Štrumbelj, Igor Kononenko; 11(Jan):1--18, 2010.
  85. NLP News : News
  86. NLP News : Logic-Based Question Answering
  87. NLP News : Current Trends in Automated Deduction
  88. NLP News : Special Issue on Automated Deduction
  89. Open Video Conference : By: Entrevista: Elizabeth Stark « Vídeo Online
  90. OpenSocial API : Using IBM Mashup Center to build interoperable Web 2.0 applications with OpenSocial gadgets
  91. Open Knowledge Foundation : Rethinking Open Data: Lessons learned from the Open Data front lines
  92. NLP News : Former U.S. Air Force Chief Information Officer Joins Digital Reasoning as Special Advisor
  93. FreeSound : Zorg and Andy
  94. FreeSound : Umbrella Adventure
  95. FreeSound : Digitópia Competitions 2010
  96. Information Aesthetics : Huge Interactive Signpost Shows the Direction to Favorite Locations
  97. Google Research : Google Cluster Data
  98. NLP News : A wizard of oz component-based approach for rapidly prototyping and testing input multimodal interfaces
  99. Information Aesthetics : Ward Shelley: Infovis Oil Painting Artist
  100. OCRopus : status and progress
  101. FLOSS Manuals : Collaborative Futures Launch(es)
  102. Open Book Alliance : GBS 2.0 Objection Roundup
  103. NLP News : Former U.S. Air Force Chief Information Officer Joins Digital ...
  104. Linux Foundation : What to Expect at LinuxCon 2010 this August in Boston!
  105. Open Book Alliance : Reps. Gonzalez & Green Ask DOJ to Scrutinize Google Books Settlement
  106. AKSW Semantic Web : Doctoral and PostDoc positions available at AKSW
  107. Open Video Conference : By: Open Video « Vídeo Online
  108. Wikimedia : Wikimedia donates servers to deserving non-profits.
  109. Science Commons : T-shirt Contest Goes Global
  110. Ubiquity : Funny sms
  111. Ubiquity : Too much spammy messages?
  112. NLP News : Activistas: En Ciudad Juárez hay “escuadrones de la muerte”
  113. UMBEL : Draft paper submission deadline is extended: EISWT-10
  114. NLP News : A dictionary to identify small molecules and drugs in free text.
  115. Open Knowledge Foundation : 7th Communia Workshop, Luxembourg
  116. NLP News : Carnegie Mellon helps relief workers in Haiti bridge the language barrier
  117. EFF : EFF's 20th Birthday Commemorative Poster
  118. Open Medicine : This week at Open Medicine - new research
  119. Tesseract : Different results on subimages
  120. NLP News : Term distribution visualizations with Focus+Context
  121. Omeka : Omeka Outreach Month
  122. W3C Semantic Web : RDFa Working Group launched
  123. Open Access : February SOAN
  124. BioMed OA : UK government adopts Creative Commons licenses for open data: good news for public-sector researchers publishing in open access journals
  125. Planet Linked Data : Collaborating on Images
  126. Google Research : Announcing Google's Focused Research Awards
  127. NLP News : Graph Classification
  128. NLP News : The algorithms for preliminary text processing: Decomposition, annotation, morphological analysis
  129. Information Aesthetics : Information Landscapes in 1994 (MIT Prof Muriel Cooper)
  130. OCRopus : Chinese
  131. OCRopus : Bugfix for xml-entities.cc
  132. Open Archaeology : ArcheoFOSS 2010: posticipata al 19 febbraio la scadenza per presentare abstract
  133. FreeCulture.org : Lawrence Lessig talk on Fair Use and Online Video
  134. Ubiquity : Funny video clips
  135. EFF : Seven "Corporations of Interest" in Selling Surveillance Tools to China
  136. NLP News : Theoretical Foundations for Enabling a Web of Knowledge
  137. NLP News : Tools and Techniques in Qualitative Reasoning about Space
  138. NLP News : Linguistic Systems, Inc. Unveils New Online Translation Service at ...
  139. FreeCulture.org : Gifts for Free Culture X Registration!
  140. Ubiquity : List All Currently Open Tabs
  141. Planet Linked Data : Wash down the Apple tablet with a gulp of Kool Aid
  142. Music Brainz : Sorry for the downtime!
  143. Linux Foundation : The Alexandria Project, Chap. 3: I just HATE it when that Happens
  144. Journal of Machine Learning : A Survey of Accuracy Evaluation Metrics of Recommendation Tasks; Asela Gunawardana, Guy Shani; 10(Dec):2935--2962, 2009.
  145. Journal of Machine Learning : Efficient Online and Batch Learning Using Forward Backward Splitting; John Duchi, Yoram Singer; 10(Dec):2899--2934, 2009.
  146. Journal of Machine Learning : Online Learning with Samples Drawn from Non-identical Distributions; Ting Hu, Ding-Xuan Zhou; 10(Dec):2873--2898, 2009.
  147. Journal of Machine Learning : Adaptive False Discovery Rate Control under Independence and Dependence; Gilles Blanchard, Étienne Roquain; 10(Dec):2837--2871, 2009.
  148. Journal of Machine Learning : Cautious Collective Classification; Luke K. McDowell, Kalyan Moy Gupta, David W. Aha; 10(Dec):2777--2836, 2009.
  149. Journal of Machine Learning : Reproducing Kernel Banach Spaces for Machine Learning; Haizhang Zhang, Yuesheng Xu, Jun Zhang; 10(Dec):2741--2775, 2009.
  150. Journal of Machine Learning : Learning Halfspaces with Malicious Noise; Adam R. Klivans, Philip M. Long, Rocco A. Servedio; 10(Dec):2715--2740, 2009.
  151. Journal of Machine Learning : Structure Spaces; Brijnesh J. Jain, Klaus Obermayer; 10(Nov):2667--2714, 2009.
  152. Journal of Machine Learning : Bounded Kernel-Based Online Learning; Francesco Orabona, Joseph Keshet, Barbara Caputo; 10(Nov):2643--2666, 2009.
  153. Journal of Machine Learning : DL-Learner: Learning Concepts in Description Logics; Jens Lehmann; 10(Nov):2639--2642, 2009.
  154. NLP News : Exploiting geographic references of documents in a geographical information retrieval system using an ontology-based index
  155. Information Aesthetics : Making Digital Content on the Mobile Phone Physically Graspable
  156. Information Aesthetics : World Map of Barcelona Natural Science Museum Biodiversity Data
  157. Ubiquity : Google Site Search?
  158. Planet Linked Data : The Business Of Linked Data (BOLD) Discussion Space
  159. Planet Linked Data : Getting The Linked Data Value Pyramid Layers Right (Update #2)
  160. Planet Linked Data : What is the DBpedia Project? (Updated)
  161. Planet Linked Data : Getting The Linked Data Value Pyramid Layers Right (Update #2)
  162. Planet Linked Data : What is the DBpedia Project? (Updated)
  163. OCRopus : Introduction
  164. NLP : Coordinate ascent and inverted indices...
  165. Ubiquity : Fresh new pics
  166. Tesseract : Handwriting recognition
  167. Communia : 7th Communia Workshop abstracts and initial statements by speakers and chairpersons
  168. OCRopus : Page segmentation
  169. Open Book Alliance : Opposition Pours In
  170. Open Knowledge Foundation : CERN opens up bibliographic metadata!
  171. Science Commons : MichiganView releases remote sensing data under CC0 waiver
  172. Music Brainz : Testing the NGS live data feed
  173. Ubiquity : Fun pics
  174. Planet Linked Data : BBC Semantic Web use-case
  175. OCRopus : Newcomer: Intros
  176. W3C Semantic Web : New SW Use Case by the BBC
  177. EFF : Blogging ACTA Across The Globe: Lessons From Korea
  178. BioMed OA : How to publish raw clinical data: guidelines from Trials and the BMJ
  179. Music Brainz : Database on test server updated
  180. FLOSS Manuals : Collaborative Futures
  181. Linux Foundation : Limits?
  182. EFF : Blogging ACTA Across The Globe: The View from France
  183. Free Our Data : A new No.10 petition: free PostZon
  184. EFF : Obama Reverses Position on Disclosing Lobbyist Contacts
  185. OCRopus : Non-UTF8 comments in source code files
  186. Open Book Alliance : AMENDED GOOGLE BOOKS SETTLEMENT IS A “PALTRY PROPOSAL” THAT DEFIES ANTITRUST LAWS
  187. Linux Foundation : Linux can compete with the iPad on price, but where’s the magic?
  188. if:book : and now we have an ipad
  189. Open Knowledge Foundation : Clear Climate Code, and Data
  190. Open Book Alliance : And The Hits Keep Coming…
  191. Linux Foundation : Tagging the Noosphere
  192. NLP News : Hot Off the Wire
  193. W3C Semantic Web : New SPARQL drafts published
  194. FRBR : Last week in FRBR #13
  195. EFF : FCC's Net Neutrality Plan Would Permit Blocking of BitTorrent
  196. Science Commons : Design a new t-shirt for Science Commons and win a trip to Seattle to attend Science Commons Symposium – Pacific Northwest!
  197. Planet Linked Data : Behind Oz’s Curtain
  198. Ubiquity : Twitter software. Automatic twitter widget
  199. Inside Google Book Search : Updated Books Home Page and My Library

February 08, 2010

OpenGeoData

Screencast on how to remove duplicate node in OpenStreetMap

February 08, 2010 08:29 PM

EFF.org Updates

Google Superbowl Ad Explains The Need for Search Privacy

Google's ad during yesterday's Superbowl explained in less than a minute how the story of someone's life can be pieced together from their search queries. Using only the search terms and user's clicks of the search results, Google told the story of a user who seeks love while studying abroad in Paris, finds it, moves to Paris, marries and has a child.


The poignant story, along with Google's suite of search stories, masterfully illustrates how some of the most intimate information in our lives--from planning a trip to political activism--is routinely and vividly expressed in our interactions with Google, and highlights the need for that information to have strong protections.

The Superbowl ad was Google's first foray into national television advertising, and it's great that Google used this opportunity to illustrate the importance of search privacy to one of the world's largest audiences. Now that Google has shown how personal its records of user interaction are, it should follow through and protect that information from involuntary disclosure by anonymizing search queries. Microsoft's Bing is anonymizing this information after six months by deleting the entire Internet Protocol ("IP") address associated with your search queries. Google can and should anonymize search queries in the same way after six months or less.
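As a rough illustration of the retention policy described above, the sketch below blanks the entire IP address on search-log records older than about six months. The record layout (`timestamp`, `ip`, `query`), the function name, and the 180-day window are assumptions made for this example; it does not describe how Bing or Google actually store or process their logs.

```python
from datetime import datetime, timedelta

# Assumed retention window: roughly six months (180 days is an illustrative choice).
RETENTION = timedelta(days=180)


def anonymize_old_queries(records, now):
    """Return copies of the records with the IP removed once older than RETENTION.

    Each record is a hypothetical dict with 'timestamp' (datetime),
    'ip' (str or None) and 'query' (str) keys. The input list is not modified.
    """
    cleaned = []
    for record in records:
        copy = dict(record)
        if now - copy["timestamp"] > RETENTION:
            # Delete the whole address rather than masking part of it.
            copy["ip"] = None
        cleaned.append(copy)
    return cleaned


if __name__ == "__main__":
    logs = [
        {"timestamp": datetime(2009, 6, 1), "ip": "192.0.2.10", "query": "study abroad paris"},
        {"timestamp": datetime(2010, 1, 15), "ip": "192.0.2.11", "query": "churches in paris"},
    ]
    for row in anonymize_old_queries(logs, now=datetime(2010, 2, 8)):
        print(row["timestamp"].date(), row["ip"], row["query"])
```

Only the address field is touched here; whether removing the IP alone is enough to make a query log anonymous is a separate question that this sketch does not settle.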

February 08, 2010 08:08 PM

OpenGeoData

USGS VGI meeting results

You might be interested in the results of a meeting the USGS held on VGI and its implications. Minutes, presentations etc are over here.

February 08, 2010 05:49 PM

NLP News

Google Working On Speech-To-Speech Translation Phone

Google already runs a successful online translator, Google Translate, but they’ve got far loftier ideas than simply converting the written word. They want to translate languages spoken over the phone, according to their head of translation services.

February 08, 2010 05:16 PM

Google (Voice) solves universal translation soonish

Register: ... our favourite mad-scientists-for-a-better-tomorrow funding body, has been pouring money into machine translation for years in the hope of enabling US ...

February 08, 2010 04:56 PM

EFF.org Updates

EFF Fights for Cell Phone Users' Privacy in Thursday Hearing

Philadelphia - The Electronic Frontier Foundation (EFF) will be arguing this Thursday before the U.S. Court of Appeals for the 3rd Circuit in Philadelphia, urging the court to block a government attempt to seize telephone company records detailing a cell phone user's past locations without first getting a search warrant.

EFF is serving as a friend of the court or "amicus," joined by co-amici the ACLU, the ACLU of Pennsylvania, and the Center for Democracy & Technology. Professor Susan Freiwald of the University of San Francisco, who submitted a separate amicus brief to the panel, will be joining EFF Senior Staff Attorney Kevin Bankston in arguing on Thursday that federal privacy statutes in combination with the Fourth Amendment to the U.S. Constitution protect the privacy of cell phone users and require the government to show probable cause before obtaining cell phone location information.

WHAT:
Oral argument In the Matter of the Application of the United States of America for an Order Directing a Provider of Electronic Communication Service to Disclose Records to the Government

WHEN:
Thursday, February 11th
9:30am

WHERE:
Albert Branson Maris Courtroom (19th floor)
U.S. Courthouse
601 Market St.
Philadelphia, PA 19106

For more information on attending Thursday's hearing, contact press@eff.org.

For the full EFF amicus brief to the Third Circuit:
http://www.eff.org/files/filenode/celltracking/Filed%20Cell%20Tracking%2...

For more on the issue of cell phone tracking:
http://www.eff.org/issues/cell-tracking

Contacts:

Kevin Bankston
Senior Staff Attorney
Electronic Frontier Foundation
bankston@eff.org

Jennifer Stisa Granick
Civil Liberties Director
Electronic Frontier Foundation
jennifer@eff.org

Rebecca Jeschke
Media Relations Director
Electronic Frontier Foundation
press@eff.org

February 08, 2010 04:43 PM

NLP News

Google translation phone "two years away"

Telegraph.co.uk: “If you look at the progress in machine translation and corresponding advances in voice recognition, there has been huge progress recently,” he said.

February 08, 2010 04:33 PM

Browse Blogs

The Alexandria Project, Chap. 4: Beware of Greeks bearing Trapdoors

Our story so far:  Security expert Frank Adversego comes under suspicion when the Library of Congress is hacked by a mysterious cracker with motives unknown and a taste for the bizarre; to protect himself, Frank had better get to the bottom of things.

February 08, 2010 04:26 PM

NLP News

Google’s Real-Time Voice Translator Could Make Any Language Lingua ...

The real-time-translating Babel Fish from Douglas Adams' The Hitchhiker's Guide to the Galaxy was named for the Tower of Babel, a biblical structure fractured by linguistic confusion. Google engineers are working on a translator for Google Android ...

February 08, 2010 04:23 PM

Coming Soon from Google: Hilariously Inaccurate Real-Time Voice ...

Google’s (GOOG) built up quite a business scanning the written word for contextual advertising opportunities. Now it hopes to do the same for the spoken word as well. The company is reportedly developing a real-time translation technology for our ...

February 08, 2010 04:15 PM

Google develops phone-call language translation

First Post: "Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that's what we're ...

February 08, 2010 04:09 PM

Coming Soon from Google: Hilariously Inaccurate Real-Time Translations

Google's built up quite a business scanning the written word for contextual advertising opportunities. Now it hopes to do the same for the spoken word as well. The company is reportedly developing a real-time translation technology for our phones.

February 08, 2010 04:06 PM

Google Developing Language Translator for Mobile Phones

We think speech-to-speech translation should be possible and work reasonably well in a few years’ time. Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that’s ...

February 08, 2010 03:40 PM

Google developing mobile language translation software

With the Nexus One, Google made voice commands an integral part of the phone’s user interface. In addition to voice searching — a feature on all Android phones — Nexus One users can also dictate e-mails, SMS messages, and use voice commands in ...

February 08, 2010 03:25 PM

Google's Translation Phone Not Imminent

Although it is working hard on the problems of translation and speech recognition, Google says it won't have a translating phone any time soon. Contrary to media reports, Google is not about to launch an automatic translation phone, but the company is ...

February 08, 2010 03:18 PM

Google working on speech translation for phones

Head of translation services Franz Och tells The Sunday Times that the search giant is closing in on speech-to-speech translation.

February 08, 2010 03:13 PM

OpenGeoData

OSM in the news roundup

The Washington Post attended a recent OpenStreetMap mapping party. New York Times mentions OSM and Haiti. As does New Scientist. The Guardian brings everyone up to date on OpenStreetMap’s progress. Also, interesting article by Nat on opening up data.

February 08, 2010 03:11 PM

NLP News

Subscribe To Gizmodo

Google already runs a successful online translator, Google Translate, but they've got far-loftier ideas than simply converting the written word. They want to translate languages spoken over the phone, according to their head of translation services ...

February 08, 2010 03:04 PM

Google Working On Live Translation Service

Language lessons may be a thing of the past if Google cracks the live voice translation technology it admits it's been working on. The company would combine its advanced voice recognition know-how with its text translation service to create a mobile ...

February 08, 2010 02:57 PM

Google Working on Translator Phone

Search engine-based company Google is reportedly gearing up for the release of a mobile phone that would be capable of translating foreign languages almost instantly, the latest reports around the web suggest. It seems that the company is focused on ...

February 08, 2010 02:57 PM

Google's Software Will Enable Translation of Foreign Languages in Smartphones

TopNews United States: Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that's what we're ...

February 08, 2010 02:52 PM

Google Taking Translate To The Next Level

When I downloaded Google Translate and associated language packs on my Motorola Droid one of the FIRST things I did was attempt to speak into the phone using the speech-to-text function, have it translate in a different language, and then hear it ...

February 08, 2010 02:50 PM

Google is Working on Speech-to-Speech Translation for Android

Mashable (blog): “Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that's what we're ...

February 08, 2010 02:39 PM

Google is developing an instant speech-to-speech translator

Spencer Dalziel THE INQUIRER Google of Babelfish

February 08, 2010 02:27 PM

Google developing real-time translator phone

Google has confirmed that it is working on a technology to allow real-time language translation on the phone.

February 08, 2010 02:26 PM

Google working on instant speech translation for cell phones

Google has already pushed language translation forward online with its Google Translate service and its integration into services such as Google Reader. But the search giant has set itself a much greater challenge with the promise of an automatic speech translation service for cell phones. The idea is the equivalent of producing a universal translator that [...]

February 08, 2010 02:18 PM

AppTek's Hybrid Machine Translation Scores Highest During Independent ...

Business Wire (press release): “AppTek is the only human language technology provider to successfully develop a single system to enable fully-automatic, high quality machine translation,” ...

February 08, 2010 02:18 PM

AppTek's Hybrid Machine Translation Scores Highest During Independent Comparative Evaluation

MCLEAN, Va. -- AppTek, a leader in human language technology, today announced its hybrid machine translation system, a full integration of statistical and rule-based methodologies, was ranked first among machine translation systems tested during an independent comparative study.

February 08, 2010 02:16 PM

Google considering speech-to-speech translation

Automated speech-to-speech language translation should be possible in a few years' time, according to Google translation chief Franz Och. In an interview with The Sunday Times, Och said the...

February 08, 2010 01:47 PM

Google planning mobile speech-to-speech translator

Google has been discussing new software under development which it claims could let mobile phones translate speech "almost instantly".

February 08, 2010 01:30 PM

Google Phone's Killer App: Real-Time Translation

Google is developing software for the first phone capable of translating foreign languages almost instantly, like the Babel Fish in The Hitchhiker's Guide to the Galaxy.

February 08, 2010 01:21 PM

Google Working on Speech-to-Speech Phone Translation

Lifehacker: Google's head of translation services is quoted as saying that speech-to-speech requires "high-accuracy machine translation and high-accuracy voice ...

February 08, 2010 12:55 PM

Vietnamese professor wins IBM award

First in the country to get an award the US computer firm gives to encourage collaboration between researchers at universities

February 08, 2010 11:57 AM

Google “translator phone” project promises real-time translation

SlashGear (blog): “Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that's what we're ...

February 08, 2010 11:56 AM

Don't bother to learn foreign languages: Google to launch smartphones ...

A mobile phone that can act as an interpreter is being developed by Google. The firm says its device will convert spoken words into another language almost instantly. Franz Och, Google's head of translation services, said: 'We think speech-to-speech ...

February 08, 2010 11:51 AM

Hanoi academic receives international award

Pham Bao Son, a computer science professor in Hanoi, has become the first Vietnamese to win an IBM Faculty Award, an international program to promote innovation and foster collaboration between researchers.

February 08, 2010 10:52 AM

Google to offer "Babel Fish" abilities with future Android phones

Language translation abilities could be a selling point for future versions of the Android mobile phone operating system, as it's reported Google is working on a "Babel Fish" phone. Obviously the development would be a software, rather than hardware ...

February 08, 2010 10:39 AM

Google Developing Universal Language Translator for Smartphones

Phones Review: For it to work smoothly you need a combination of high accuracy machine translation along with high accuracy voice recognition which is what Google is working ...

February 08, 2010 10:24 AM

Google to launch instant speech translator for smartphones

Netimperative: "Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that's what we're ...

February 08, 2010 09:42 AM

tesseract-ocr Google Group

Re: Chinese Tessdata Pack

I got it. now attached chinese-tessdata.zip file

February 08, 2010 03:05 AM

NLP News

iSOCO electronic invoicing exchange could save 30 percent of processing costs

Gizmag: ... to automatically convert electronic invoices across different formats using a combination of semantic and natural language processing technology. ...

February 08, 2010 01:39 AM

tesseract-ocr Google Group

Tesseract 3.0 for iPhone Compiling

Hello, I'm a newbie. I'm trying to compile Tesseract SVN for iPhone OS. I
entered these commands:
svn checkout [link] tesseract-ocr
cd tesseract-ocr
./runautoconf
./configure
make
sudo make install
Then I copied the build_fat.sh file to the directory from
[link]

February 08, 2010 12:04 AM

February 07, 2010

NLP News

Google translation dream is pie in the sky

THINQ.co.uk: Anyone familiar with Google's on-line translation service is aware of the problems machine translation poses. Any voice recognition-based system requires ...

February 07, 2010 10:42 PM

Google working on smartphone software to automatically translate foreign ...

MobileCrunch (blog): It's not entirely machine translation, though, which is generally rubbish, since people can help contribute with certain words and phrases that might not ...

February 07, 2010 10:27 PM

NLP News

Blog heads east

I started this blog ages ago while still in grad school in California at USC/ISI. It came with me three point five years ago when I started as an Assistant Professor at the University of Utah. Starting some time this coming summer, I will take it even further east: to CS at the University of Maryland where I have just accepted a faculty offer.

These past (almost) four years at Utah have been fantastic for me, which has made this decision to move very difficult. I feel very lucky to have been able to come here. I've had enormous freedom to work in directions that interest me, great teaching opportunities (which have taught me a lot), and great colleagues. Although I know that moving doesn't mean forgetting one's friends, it does mean that I won't run into them in hallways, or grab lunch, or afternoon coffee or whatnot anymore. Ellen, Suresh, Erik, John, Tom, Tolga, Ross, and everyone else here have made my time wonderful. I will miss all of them.

The University here has been incredibly supportive in every way, and I've thoroughly enjoyed my time here. Plus, having world-class skiing a half hour drive from my house isn't too shabby either. (Though my # of days skiing per year declined geometrically since I started: 30-something the first year, then 18, then 10... so far only a handful this year. Sigh.) Looking back, my time here has been great and I'm glad I had the opportunity to come.

That said, I'm of course looking forward to moving to Maryland also, otherwise I would not have done it! There are a number of great people there in natural language processing, machine learning and related fields. I'd like to think that UMD should be and is one of the go-to places for these topics, and am excited to be a part of it. Between Bonnie, Philip, Lise, Mary, Doug, Judith, Jimmy and the other folks in CLIP and related groups, I think it will be a fantastic place for me to be, and a fantastic place for all those PhD-hungry students out there to go! Plus, having all the great folks at JHU's CLSP a 45-minute drive away will be quite convenient.

A part of me is sad to be leaving, but another part of me is excited at new opportunities. The move will take place some time over the summer (carefully avoiding conferences), so if I blog less then, you'll know why. Thanks again to everyone who has made my life here fantastic.

February 07, 2010 06:00 PM

Browse Blogs

CodePlex Foundation Picks Paula Hunter as Executive Director

As you may recall, the CodePlex Foundation indicated in January that it expected to name a permanent Executive Director within a few weeks’ time. That has now happened, and in the “small world” department, the new ED happens to be Paula Hunter - someone I’ve known for years, and worked with several times in the past. The full press release is below. Paula is someone I like and respect a lot, and a great choice for CodePlex.

February 07, 2010 05:35 PM

NLP News

Google leaps barrier with translator phone

GOOGLE'S making the first phone able to translate foreign languages almost instantly.

February 07, 2010 01:36 PM

ubiquity-firefox Google Group

Ubiquity Commands Should Be Prefixed

I sometimes find myself accidentally disabling commands. This happens
because I might type something and press Enter before checking it fully,
which often leads to disabled commands. To make things easier and also
to group things, there should be another layer to commands: all
Ubiquity commands should be prefixed with "ubiquity-".

February 07, 2010 12:00 PM

NLP News

Soon, software for mobiles that translates foreign languages instantly

London, Feb 7: It may soon be possible to transform communication among speakers of the world's 6,000-plus languages - all thanks to Google which is developing software for the first phone capable of translating foreign languages instantly.

February 07, 2010 11:05 AM

natural language processing blog

Blog heads east

I started this blog ages ago while still in grad school in California at USC/ISI. It came with me three point five years ago when I started as an Assistant Professor at the University of Utah. Starting some time this coming summer, I will take it even further east: to CS at the University of Maryland where I have just accepted a faculty offer.

These past (almost) four years at Utah have been fantastic for me, which has made this decision to move very difficult. I feel very lucky to have been able to come here. I've had enormous freedom to work in directions that interest me, great teaching opportunities (which have taught me a lot), and great colleagues. Although I know that moving doesn't mean forgetting one's friends, it does mean that I won't run into them in hallways, or grab lunch, or afternoon coffee or whatnot anymore. Ellen, Suresh, Erik, John, Tom, Tolga, Ross, and everyone else here have made my time wonderful. I will miss all of them.

The University here has been incredibly supportive in every way, and I've thoroughly enjoyed my time here. Plus, having world-class skiing a half hour drive from my house isn't too shabby either. (Though my # of days skiing per year declined geometrically since I started: 30-something the first year, then 18, then 10... so far only a handful this year. Sigh.) Looking back, my time here has been great and I'm glad I had the opportunity to come.

That said, I'm of course looking forward to moving to Maryland also, otherwise I would not have done it! There are a number of great people there in natural language processing, machine learning and related fields. I'd like to think that UMD should be and is one of the go-to places for these topics, and am excited to be a part of it. Between Bonnie, Philip, Lise, Mary, Doug, Judith, Jimmy and the other folks in CLIP and related groups, I think it will be a fantastic place for me to be, and a fantastic place for all those PhD-hungry students out there to go! Plus, having all the great folks at JHU's CLSP a 45-minute drive away will be quite convenient.

A part of me is sad to be leaving, but another part of me is excited at new opportunities. The move will take place some time over the summer (carefully avoiding conferences), so if I blog less then, you'll know why. Thanks again to everyone who has made my life here fantastic.

February 07, 2010 11:00 AM

NLP News

Google leaps language barrier with translator phone

Times Online: “Clearly, for it to work smoothly, you need a combination of high-accuracy machine translation and high-accuracy voice recognition, and that's what we're ...

February 07, 2010 12:10 AM

February 06, 2010

Open Knowledge Foundation Blog

Book Search, Museum View, and Exploitation

Read today a Google Books PR piece on the Guardian website. Of out-of-print or hard-to-get books, it says, “Although copies may be available in libraries, they are effectively dead to the wider world.” Also heard today that Google Street View is proposing inside views, museum interiors. Last week, I and some OKF people heard a Google [...]

February 06, 2010 10:36 PM

Browse Blogs

Happy camper

I broke down and bought a Nexus One last week.

I got the original G1 phone from Google when it came out, and I hardly ever used it. Why? I generally hate phones - they are irritating and disturb you as you work or read or whatever - and a cellphone to me is just an opportunity to be irritated wherever you are. Which is not a good thing.

February 06, 2010 09:06 PM

The FRBR Blog

Last Week in FRBR #14

Hi. I usually get this out on Fridays, but I hope you don’t miss it because it’s coming out on Saturday this week. Seems like it was a slowish week in FRBRania. The first couple of pieces involve the RDA-L mailing list archives (RDA being, of course, the new cataloguing rules Resource Description and Access) and also Karen Coyle.

Mix and Match: Mashups of Bibliographic Data

Mix and Match: Mashups of Bibliographic Data at the recent American Library Association conference had people from Google talking about Google Books metadata, OCLC talking about ONIX, and the Open Library talking about the Open Library. Eric Hellman was there and wrote it up in Google Exposes Book Metadata Privates at ALA Forum, which a lot of people have been pointing out, including on RDA-L.

Karen Coyle, who was the Open Library person at the session, brought the four FRBR user tasks into a discussion about alphabetical ordering of titles:

In FRBR we have the four user tasks: find, identify, select, obtain. These are fully imbued with the assumption of user knowledge.

“to find entities that correspond to the user’s stated search criteria (i.e., to locate either a single entity or a set of entities in a file or database as the result of a search using an attribute or relationship of the entity);”

This seems to eliminate the possibility that the user could be successful in the library catalog with a need like: “I just finished Twilight and loved it. What else might I like?” Yet that is a legitimate query to bring to the library, and even to the library catalog. Perhaps we should spend some time re-writing the FRBR user tasks, expanding them to meet a wider variety of user needs. Then we could look at our catalogs and say: “What does this mean in terms of catalog functionality?” I maintain that alphabetical order will not be at the top of our list, but will probably appear along some user tasks.

Peter Murray was also there, and wrote it up in Mashups of Bibliographic Data: A Report of the ALCTS Midwinter Forum:

[From the OCLC section.] If there is an exact match for the incoming ONIX record in WorldCat, the WorldCat record is enhanced with certain fields from the ONIX record (descriptions, author biographies, web links) — being careful not to override authority work being done by libraries, but adding enhancements that libraries may not otherwise input. In turn, enhancements from exact match record and FRBR work set records (hardcover versus softcover versus audiobook, etc.) are added to the ONIX record (non-English subject headings, adding a Dewey Decimal Classification (DDC) field from another similar record if one doesn’t already exist, change the author field to an authority-controlled version). If there is not an exact match for the ONIX record in WorldCat, a new WorldCat record is built from the ONIX record and it is subsequently enhanced by metadata found in the FRBR work set records.

RDA-L thread on RDA and granularity

Coyle began the RDA and Granularity thread, prompted by a chat at a library conference. As you can see from the archives, it started a big long discussion that changed Subject. Somewhere in there John Myers posted in the Systems v Cataloging subthread:

[C]onsider the FRBR expression entity. A significant aspect in textual works between expressions is translation. We do have a 240 field to record that, but since the application of the rules for Uniform titles were left to the discretion of the cataloging agency, indication of an expression for a translation can also appear in a translation note recorded in tag 500, sometimes in conjunction with the 240 but oftentimes alone (as several thousand records in my catalog will attest). Now, if this data were consistently recorded in the 240 (both with respect to the format and to the application of use of the 240), then machine FRBR-ization of these records for translations would be relatively simple.

There was more FRBR discussion in the replies.
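
Myers's point about tag 240 versus tag 500 can be made concrete with a rough sketch. It assumes a simplified, hypothetical record layout (a dict mapping a MARC tag to a list of fields, each field a dict of subfield code to value), treats subfield l of the 240 as the language of the expression, and uses a loose keyword match on 500 notes. The function name and the sample records are invented for illustration; this is not any cataloguing system's actual logic.

```python
import re

# Loose pattern for free-text translation notes such as "Translation of: ...".
TRANSLATION_NOTE = re.compile(r"\btranslat(?:ion|ed)\b", re.IGNORECASE)


def translation_evidence(record):
    """Report where a record signals that it is a translation.

    `record` is a hypothetical structure: tag -> list of fields, where each
    field is a dict of subfield code -> value. It is not a real library's model.
    """
    # Uniform title (240) carrying a language subfield ($l).
    in_240 = any("l" in field for field in record.get("240", []))
    # General note (500) whose text mentions a translation.
    in_500 = any(
        TRANSLATION_NOTE.search(field.get("a", ""))
        for field in record.get("500", [])
    )
    if in_240 and in_500:
        return "240+500"
    if in_240:
        return "240"
    if in_500:
        return "500 only"
    return None


# Made-up sample records, one per case.
uniform_title = {"240": [{"a": "Petit prince.", "l": "English"}]}
note_only = {"500": [{"a": "Translation of: Le petit prince."}]}
print(translation_evidence(uniform_title))
print(translation_evidence(note_only))
```

Records that come back as "500 only" are the ones that would need free-text parsing before they could be grouped by expression, which is the inconsistency Myers describes.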

RDA National Test Update

Jennifer Eustis’s RDA National Test: Update points to Testing Resource Description and Access (RDA) at the Library of Congress, which sketches out how a bunch of libraries are going to test RDA before committing to use it. Because FRBR is fundamental to RDA, this will also be the biggest test so far of how FRBR helps bibliographic organization.

RDA vs. AACR2: Implications for Social Justice

On 11 January the New York City Radical Reference Collective ran RDA vs. AACR2: Implications for Social Justice, with Rick Block from Columbia University.

Jessica Lingel wrote notes on the session, which are worth reading. It looks like there was a good review of FRBR and RDA and where things are at, and then some interesting questions about that and the social justice and progressive side of cataloguing.

Question – what aspects of cataloging relate to issues of social justice?

It’s mostly a matter of subject headings. But even in descriptive cataloging, what gets included and what doesn’t has implications. RDA won’t so much change that, although it raises the question of personal archiving.

I’d never thought about this angle on FRBR and RDA. Very interesting subject. The first thing that strikes me is that in the linked data and Semantic Web approach anyone can say anything about anything. It will be much easier for people to apply their own sets or subsets of terminology to a group of things while still keeping connected with the rest of the universe, and for anyone else who wants to use that vocabulary to mix it in with their own system. This is a big improvement.

February 06, 2010 04:47 PM

NLP News

Yinlips E-Book, an iPad Clone with E-Ink

PMP Today (blog)Yinlips E-Book, an iPad Clone with E-InkPMP Today (blog)The article (in Chinese) is a little long-winded and machine translation doesn't yield much but promises: Yinlips is planning to launch an iPad lookalike ...

February 06, 2010 02:49 PM

W3C Semantic Web Activity News

German Translation of the RDFa Primer

Stefan Schumacher has published a German translation of the RDFa Primer.

February 06, 2010 08:08 AM

tesseract-ocr Google Group

Help a newbie!!

Hey. I'm using Tesseract OCR in a Plate Recognition program. In the
country where I live, the car plate format is three alphabetic
characters and three numbers, for example CRX 117.
Therefore I have a question maybe some of you can answer. First of all I
want to know how to remove the characters I don't use, for
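
One way to exploit the fixed plate format described above, independent of how the OCR engine itself is configured, is to post-process the recognized text. The sketch below assumes the format is exactly three letters followed by three digits and uses a small, hypothetical table of look-alike characters (such as O and 0); the function name and the tables are illustrative and not part of Tesseract.

```python
import re

# Assumed format from the post: three letters, a space, three digits.
PLATE_RE = re.compile(r"[A-Z]{3} [0-9]{3}")

# Hypothetical look-alike corrections, applied according to position.
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}
TO_DIGIT = {"O": "0", "I": "1", "S": "5", "B": "8"}


def normalize_plate(raw):
    """Return OCR text as 'AAA 999', or None if it cannot fit the format."""
    # Drop spaces, dashes and other separators; compare in upper case.
    chars = [c for c in raw.upper() if c.isalnum()]
    if len(chars) != 6:
        return None
    # First three positions must be letters, last three must be digits.
    letters = "".join(TO_LETTER.get(c, c) for c in chars[:3])
    digits = "".join(TO_DIGIT.get(c, c) for c in chars[3:])
    candidate = letters + " " + digits
    return candidate if PLATE_RE.fullmatch(candidate) else None


print(normalize_plate("CRX 117"))
print(normalize_plate("C0X II7"))
```

Readings that still do not match after correction come back as None, so the caller can retry recognition rather than accept an impossible plate.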

February 06, 2010 02:41 AM

February 05, 2010

Wikimedia Technical Blog

Tech folk will again meet in Berlin

Developer Meet-Up (by Raymond, CC-BY-SA)

Wikimedia Germany invites all MediaWiki developers, Toolserver users, Gadget hackers, and other people interested in the technical side of Wikimedia projects to come to Berlin for a Developer Meet-Up on April 14-16. Last year’s meet-up in Berlin was a great success, and we hope to make it even better this time! This year we want to focus on structured (meta) data, search, and community building. The future of the Toolserver will also be a subject.

The dates are set, but it’s not clear yet if we start full throttle on Wednesday the 14th or if we have just an arrival event on that date and a full day on Friday the 16th instead – this depends on venue arrangements that are not sorted out yet. Note that registration in advance will be required – a website will be set up for this soon, we will announce it on blogs and mailing lists.

On that Friday, April 16, the Wikimedia Chapters and Board start their convention in Berlin. This will be a great opportunity to meet, to discuss interesting topics, to network and to exchange ideas and thoughts! Wikimedia Germany will host the event, so we will organize the venue, the hotel(s), some fun things to do in Berlin, food & drinks and lots of other things – and there might even be a party at the c-base again…

See you in Berlin!

February 05, 2010 09:11 PM

EFF.org Updates

Patent Office Grants EFF Request for Reexamination of Dangerous VOIP Patent

San Francisco - The Electronic Frontier Foundation (EFF) has won reexamination of an illegitimate patent on voice-over-Internet protocol (VoIP) that could cripple the adoption of new VoIP technologies.

A company named Acceris Communications Technologies, now C2 Communications Technologies, was awarded the bogus patent for hardware, software, and processes for implementing VoIP using analog telephones as endpoints -- covering many telephone calls made over the Internet. EFF and the law firm Fenwick & West LLP filed a reexamination request showing that both a prior patent and published reference materials described the underlying technology long before Acceris made its claim. Today the United States Patent and Trademark Office (USPTO) granted EFF's reexamination request, ruling that there were substantial new questions of patentability.

"Our American patent system is meant to encourage invention and innovation," said EFF Legal Director Cindy Cohn. "Undeserved patents instead serve to quash competition and hurt business and consumers."

"We are pleased that the USPTO agrees with the substantial new questions of patentability raised in EFF's request, and we look forward to the USPTO's ultimate decision on this patent," said Nikhil Iyengar of Fenwick & West.

The challenge to this patent is part of EFF's Patent Busting Project, which combats the chilling effects of bad patents on the public and consumer interests. So far eight patents targeted by EFF have been busted, invalidated, narrowed, or had a reexamination granted by the Patent Office.

For more on EFF's Patent Busting Project:
http://www.eff.org/patent/

Contacts:

Cindy Cohn
Legal Director
Electronic Frontier Foundation
cindy@eff.org

Nikhil Iyengar
Fenwick & West LLP
niyengar@fenwick.com

February 05, 2010 08:29 PM

Know Before You Go: Tickets May Come at a Higher Price Than You Realize

As part of our Terms of Ab(use) project, we pay close attention to the fine print of online agreements for provisions that are potentially dangerous to consumers. We've noticed a troubling change in the way event planners restrict the rights of individuals who attend their shows. Where once these limitations had to fit on the back of a ticket, increasingly event organizers have moved their fine print online, where they are able to use even more contract law to avoid the limits of trademark and copyright law and actively control what ticket holders can say or do even after the event is over.

These burdensome terms can show up in some pretty unexpected places. Last year we noted how the Burning Man Organization (BMO) used online ticket terms to require participants to assign to BMO—in advance—the copyright to any pictures they took on the playa. Tickets for the 2010 event went on sale in mid-January, and we hoped the new terms would acknowledge the concerns we had expressed. Sadly, the new terms are just as onerous as before.

The "assignment in advance" clause is not the only burdensome provision. The BMO ticket terms limit participants' rights to use their own photos online, obliging them to take down any photos to which BMO objects for any reason and forbidding them from allowing anyone else to download or copy the photos. This means participants cannot donate their works to the public domain or license their works, even through Creative Commons—no matter what is depicted or whether a use is noncommercial.

Even the notoriously protective Olympics allow spectators to take their own pictures or videos under their Ticket License Agreement, requiring only that the images "not be used for broadcast, publication, or any other commercial purpose." It is disappointing that the BMO cannot be at least as flexible.

Burning Man also continues to strip ticket holders of their right to make perfectly legal uses of its trademarks, forbidding participants from even using the (trademarked) term "Burning Man" on any website. In other words, participants who’d like to blog about their experiences at the event can’t use the words “Burning Man.” Thus Burning Man uses contract law to do what it cannot under either copyright or trademark law—exert extraordinary control over participants' speech.

Why would BMO—the organizer of “an annual experiment in temporary community dedicated to radical self-expression and radical self-reliance”—undermine speech and creativity like this? BMO claims that the terms in the Burning Man ticket agreement are necessary to protect Black Rock City’s unique culture and the privacy of its participants. Furthermore, BMO points out that the limitations are rarely enforced and they only claim copyright if the photos are used in a way BMO doesn't authorize. By claiming copyright in all photographs taken at the event, BMO can use the streamlined "notice and takedown" process enshrined in the Digital Millennium Copyright Act (DMCA) to quickly remove unapproved photos from the Internet.

But using online ticket terms for fast and easy takedown and to restrict CC-licensing and dedication to the public domain is a terrible precedent to set. As we pointed out in our post about this issue last year, doctors working with the tort reform group Medical Justice used contract law in the exact same way to censor negative reviews on Yelp and similar sites.

We understand the real challenges BMO faces in trying to preserve its noncommercial, community character. And we are aware that the current BMO enforces these terms only rarely. But a benevolent censor is still a censor—and BMO may not always be so benevolent. The bigger danger, though, is that other event organizers will take a page from the BMO/Medical Justice playbook, and assignment and abrogation of rights will become standard Terms of Ab(use) in all online contracts.

Copyright and trademark law were not intended to be used this way, and the collateral damage to speech and creativity inherent in the restrictions included in the Burning Man ticket agreement is too great. If the good folks at BMO continue to use ticket restrictions like this, Burning Man may very well become a new kind of model—not of temporary community and support for individual creative expression, but of permanent restriction on that very expression.

February 05, 2010 07:48 PM

Science Commons

Data Sharing on the Web

The February issue of Talis’ Nodalities magazine focuses on data sharing and includes an article by Science Commons’ own Kaitlin Thaney. Last October, Kaitlin joined Jordan Hatcher, Leigh Dodds and Tom Heath to give a four hour tutorial titled “Legal and Social Frameworks for Sharing Data on the Web” at the International Semantic Web Conference (ISWC). [...]

February 05, 2010 07:20 PM

Public Library of Science

New spring range now available in the PLoS Store

When we launched the PLoS store back in November 2009, we promised to continually update it with new designs and merchandise. We've been true to our word. For the last couple of months, we've been creating our Spring range, so now you, your home, office or lab and the kids in your life can look cool and support the PLoS cause.

Over the years, folks have continually asked us for a Future PLoS Author kids range - you are never too young (or old) to get behind Open Access. As you can see from the adjacent photo of Pete and his daughter, we road tested these designs and products on our own kids. When we were told by the eloquent three year old of one of our team that she couldn't read the words on the tees because they were too fancy, we changed them so that they were readable.

For those of you who prefer a more classic look, there are new embroidered tees, hats and hoodies. We think these look particularly nice on a classic polo or baseball cap, but you can decide for yourself how and where (they are smart enough for the boardroom or golf course) you want to wear them.

Since everyone loves dinosaurs and PLoS ONE in particular publishes a large amount of Paleontology research, we couldn't resist developing a range around this theme. We have dino tees, mugs and mousepads, just take your pick.

While we were in all out creative overdrive, we also came up with new mug designs and a few posters to brighten your home or lab walls, so check them out.

Visit the PLoS store today to find these featured products and many more. Through your purchase, you can support our non-profit mission. Happy shopping!

February 05, 2010 06:29 PM

NLP News

A Wikipedia Matching Approach to Contextual Advertising

Abstract: Contextual advertising is an important part of today’s Web. It provides benefits to all parties: Web site owners and an advertising platform share the revenue, advertisers receive new customers, and Web site visitors get useful reference links. The relevance of selected ads for a Web page is essential for the whole system to work. Problems such as homonymy and polysemy, low intersection of keywords and context mismatch can lead to the selection of irrelevant ads. Therefore, a simple keyword matching technique gives a poor accuracy. In this paper, we propose a method for improving the relevance of contextual ads. We propose a novel “Wikipedia matching” technique that uses Wikipedia articles as “reference points” for ads selection. We show how to combine our new method with existing solutions in order to increase the overall performance. An experimental evaluation based on a set of real ads and a set of pages from news Web sites is conducted. Test results show that our proposed method performs better than existing matching strategies and using the Wikipedia matching in combination with existing approaches provides up to 50% lift in the average precision. TREC standard measure bpref-10 also confirms the positive effect of using Wikipedia matching for the effective ads selection.

Content Type: Journal Article
DOI: 10.1007/s11280-010-0084-2
Authors: Alexander N. Pak and Chin-Wan Chung, Korea Advanced Institute of Science and Technology (KAIST), Division of Computer Science, Department of EECS, Daejeon, Korea
Journal: World Wide Web
Online ISSN: 1573-1413
Print ISSN: 1386-145X

February 05, 2010 06:17 PM

blog.aksw.org

2nd Leipzig Semantic Web Day on 6th May

After the great success of the first Leipziger Semantic Web Tag (LSWT) in 2009 AKSW is organizing another installment of this event on 6th May. In addition to demonstrating the benefits of semantic technologies to enterprises, this year’s LSWT focuses on Open Data in science and E-government. A particular highlight will be the festive announcement of the German chapter of the Open Knowledge Foundation. More information on the LSWT can be found (in German) at: http://aksw.org/Events/2010/LeipzigerSemanticWebDay.

February 05, 2010 04:50 PM

BioMed Central

GOseq – a new method for Gene Ontology analysis of RNA-seq data

Until recently, microarrays have been the method of choice for transcriptional profiling. The advent of next generation sequencing technologies, however, has seen the rise of direct sequencing of mRNA (RNA-seq) as a new method for such profiling. In a recent publication in Genome Biology, Alicia Oshlack and colleagues at the Walter and Eliza Hall Institute in Melbourne, Australia describe a new method for performing Gene Ontology analysis of RNA-seq data, called GOseq.

GOseq identifies whether a given transcriptional profile is over-represented with transcripts associated with specific biological processes. Until now, the statistical methods used for this kind of analysis of RNA-seq data have generally been modifications of methods developed for use with microarray data. Oshlack, however, shows that statistical methods are not interchangeable between the two techniques; in particular, there is a bias inherent in RNA-seq data whereby long or highly-expressed transcripts are more likely to be called as differentially expressed than short or less highly-expressed ones. The GOseq algorithm takes this into account, thus correcting the bias and providing a more reliable readout. As well as providing a useful new tool, this paper highlights the need for new statistical analysis techniques tailored specifically for the new technology of RNA-seq.

Given the extent to which RNA-seq is being embraced by the genomics community, for example in defining alternative transcripts, this method is a welcome addition to a growing field.
 

February 05, 2010 04:44 PM

NLP News

Microsoft and NSF Enable Research in the Cloud

iEncyclopedia: ... speech recognition, user-interface research, natural language processing, programming tools and methodologies, operating systems and networking, ...

February 05, 2010 04:24 PM

Thomson Reuters Unveils WestlawNext, the Next Generation in Legal Research

PR-USA.net (press release): A team with Ph.Ds in artificial intelligence and computer science - with specialties in machine learning, data mining, and natural language processing ...

February 05, 2010 11:38 AM

Open Access News

True or false? Defend your answer.

Prepping for your Graduate Record Exams? Here's a sample essay topic from a GRE study guide:

All results of publicly funded scientific studies should be made available to the general public free of charge. Scientific journals that charge a subscription or newsstand price are profiting unfairly.

See GRE Exam 2009 Edition Comprehensive Program, Kaplan Publishing, June 2008, p. 231. (Thanks to Amber Smith for the discovery.)

February 05, 2010 10:22 AM

Open Book Alliance

Responding to U.S. Department of Justice Statement of Interest Regarding Google Books Settlement

The Open Book Alliance today issued the following statement in response to the Department of Justice’s filing with the U.S. District Court:
The Open Book Alliance applauds the action taken today by the Department of Justice.  We believe that the DoJ’s Statement of Interest regarding the Google Books Settlement will help to preserve competition, promote innovation and protect the public interest.

The Department of Justice has made it crystal clear that the proposal before the court is overreaching and cannot be approved: “… the United States has reluctantly concluded that use of the class action mechanism in the manner proposed by the ASA is a bridge too far.”

We are particularly heartened that the Government identified the anti-competitive consequences this proposal would have on digital book sales and the search market, concerns that were voiced by the Open Book Alliance and its members.  The brief addressed the “exclusive access” Google would have been awarded by the settlement while also noting that, “This outcome has not been achieved by a technological advance in search or by operation of normal market forces; rather, it is the direct product of scanning millions of books without the copyright holders’ consent.”

February 05, 2010 02:10 AM

information aesthetics

Data Fiction: Storytelling with Information Graphics

The combination of storytelling and information visualization has long been predicted, although very few examples exist so far. On the other hand, some might claim that typical information aesthetic visualization is all about telling a compelling story.

The following project takes infographic storytelling one (literal) step further: Sumedicina [janalange.de] is the title of a fictional thriller story about an international virus scandal, and is mainly told through the medium of infographics.

One might not feel completely sure whether this is just an excuse to create a collection of visually impressive infographic representations, or whether the narrative of the story is somehow hidden within the graphs. Answer: the short notes below the graphs at the Flickr collection reveal it is probably a combination of the two.


February 05, 2010 12:53 AM

February 04, 2010

Browse Blogs

Free Loans at 0% Interest

In a recent Black Duck survey, we found that companies doing software development are using significant amounts of open source software; about 22% of code was identified as originating from an OSS project.
 
The cost savings from strategic use of open source can free up precious software development resources and compress project schedules. To calculate how much, see the Black Duck ROI calculator.
 

February 04, 2010 10:29 PM

JMLR

Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting; Philippos Mordohai, Gérard Medioni; 11(Jan):411--450, 2010.

We address instance-based learning from a perceptual organization standpoint and present methods for dimensionality estimation, manifold learning and function approximation. Under our approach, manifolds in high-dimensional spaces are inferred by estimating geometric relationships among the input instances. Unlike conventional manifold learning, we do not perform dimensionality reduction, but instead perform all operations in the original input space. For this purpose we employ a novel formulation of tensor voting, which allows an N-D implementation. Tensor voting is a perceptual organization framework that has mostly been applied to computer vision problems. Analyzing the estimated local structure

February 04, 2010 10:03 PM

A Convergent Online Single Time Scale Actor Critic Algorithm; Dotan Di Castro, Ron Meir; 11(Jan):367--410, 2010.

Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic

February 04, 2010 10:03 PM

Bundle Methods for Regularized Risk Minimization; Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le; 11(Jan):311--365, 2010.

A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L_1 and L_2 penalties. In addition to the unified framework we present tight convergence bounds, which

February 04, 2010 10:03 PM

Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure; Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano; 11(Jan):285--310, 2010.

We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse super-structures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained

February 04, 2010 10:03 PM

Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):235--284, 2010.

In part I of this work we introduced and evaluated the Generalized Local Learning (GLL) framework for producing local causal and Markov blanket induction algorithms. In the present second part we analyze the behavior of GLL algorithms and provide extensions to the core methods. Specifically, we investigate the empirical convergence of GLL to the true local neighborhood as a function of sample size. Moreover, we study how predictivity improves with increasing sample size. Then we investigate how sensitive the algorithms are to multiple statistical testing, especially in the presence of many irrelevant features. Next we discuss the role of the algorithm parameters and also show that Markov blanket

February 04, 2010 10:03 PM

Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):171--234, 2010.

We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare

February 04, 2010 10:03 PM

An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data; Yufeng Ding, Jeffrey S. Simonoff; 11(Jan):131--170, 2010.

There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, as well as the existence or non-existence of missing values in the testing data, are the most helpful criteria to distinguish different missing data methods. In particular, separate class is clearly the

February 04, 2010 10:03 PM

Classification Methods with Reject Option Based on Convex Risk Minimization; Ming Yuan, Marten Wegkamp; 11(Jan):111--130, 2010.

In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition.

February 04, 2010 10:03 PM

On-Line Sequential Bin Packing; András György, Gábor Lugosi, György Ottucsàk; 11(Jan):89--109, 2010.

We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is paid to reference strategies

February 04, 2010 10:03 PM

Model Selection: Beyond the Bayesian/Frequentist Divide; Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley; 11(Jan):61--87, 2010.

The principle of parsimony also known as "Ockham's razor" has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches.

February 04, 2010 10:03 PM

Online Learning for Matrix Factorization and Sparse Coding; Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro; 11(Jan):19--60, 2010.

Sparse coding--that is, modelling data vectors as sparse linear combinations of basis elements--is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally

February 04, 2010 10:03 PM

An Efficient Explanation of Individual Classifications using Game Theory; Erik Štrumbelj, Igor Kononenko; 11(Jan):1--18, 2010.

We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method's initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful.
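The sampling idea in this abstract can be illustrated with a minimal sketch. It estimates each feature's contribution by averaging, over random feature orderings, the change in the prediction when that feature's value is switched from a reference instance to the explained instance. This is a generic permutation-sampling approximation of Shapley-style contributions, not necessarily the authors' exact algorithm; the function name `sampled_contributions`, the `background` reference set, and the toy linear model in the usage note are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def sampled_contributions(
    predict: Callable[[Sequence[float]], float],
    instance: Sequence[float],
    background: Sequence[Sequence[float]],
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate per-feature contributions to predict(instance) by sampling.

    Hypothetical helper: for each feature i, draw a random feature ordering
    and a random reference instance, then measure how much the prediction
    changes when feature i joins the features that precede it in the ordering.
    """
    rng = random.Random(seed)
    n_features = len(instance)
    totals = [0.0] * n_features

    for i in range(n_features):
        for _ in range(n_samples):
            order = list(range(n_features))
            rng.shuffle(order)
            reference = rng.choice(background)
            preceding = set(order[: order.index(i)])

            # Features before i (and i itself) come from the explained instance;
            # all remaining features come from the reference instance.
            with_i = [
                instance[j] if (j in preceding or j == i) else reference[j]
                for j in range(n_features)
            ]
            # Same mixture, except feature i also comes from the reference.
            without_i = [
                instance[j] if j in preceding else reference[j]
                for j in range(n_features)
            ]
            totals[i] += predict(with_i) - predict(without_i)

    return [total / n_samples for total in totals]
```

For an additive model such as `lambda x: 2 * x[0] + 3 * x[1]`, every sample yields the same marginal change, so the estimate for each feature equals its weighted difference from the reference value. For models with feature interactions the estimates vary between runs and become more stable as `n_samples` grows.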

February 04, 2010 10:03 PM

NLP News

News

News

Content Type: Journal Article
Category: Community
DOI: 10.1007/s13218-010-0012-8
Journal: KI - Künstliche Intelligenz
Online ISSN: 1610-1987
Print ISSN: 0933-1875

February 04, 2010 06:15 PM

Logic-Based Question Answering

Abstract: Question answering systems aim to provide concise and correct responses to arbitrary questions, communicating with the user in a natural language. In this way they help make the knowledge of large textual sources accessible in an intuitive manner which goes beyond the capabilities of conventional search engines. In the LogAnswer project the universities of Hagen and Koblenz cooperate to build a German language question answering system which combines computational linguistics and automated reasoning to deduce answers from a knowledge base derived from Wikipedia.

Content Type: Journal Article
Category: Projekt
DOI: 10.1007/s13218-010-0010-x
Authors: Ulrich Furbach, Universität Koblenz-Landau AI Research Group, Computer Science Faculty Postfach 201602 56016 Koblenz Deutschland; Ingo Glöckner, FernUniversität in Hagen Intelligent Information and Communication Systems Group (IICS) 58084 Hagen Deutschland; Hermann Helbig, FernUniversität in Hagen Intelligent Information and Communication Systems Group (IICS) 58084 Hagen Deutschland; Björn Pelzer, Universität Koblenz-Landau AI Research Group, Computer Science Faculty Postfach 201602 56016 Koblenz Deutschland
Journal: KI - Künstliche Intelligenz
Online ISSN: 1610-1987
Print ISSN: 0933-1875

February 04, 2010 06:15 PM

Current Trends in Automated Deduction

Abstract: Automated deduction is one of the key areas in artificial intelligence. In this short article we give an overview of some of the main current research topics in automated deduction.

Content Type: Journal Article
Category: Fachbeitrag
DOI: 10.1007/s13218-010-0011-9
Authors: Jürgen Giesl, RWTH Aachen University Lehr- und Forschungsgebiet Informatik 2 Ahornstraße 55 52074 Aachen Deutschland
Journal: KI - Künstliche Intelligenz
Online ISSN: 1610-1987
Print ISSN: 0933-1875

February 04, 2010 06:15 PM

Special Issue on Automated Deduction

Special Issue on Automated Deduction

Content Type: Journal Article
Category: Editorial
DOI: 10.1007/s13218-010-0009-3
Authors: Jürgen Giesl, RWTH Aachen University Lehr- und Forschungsgebiet Informatik 2 Ahornstraße 55 52074 Aachen Germany
Journal: KI - Künstliche Intelligenz
Online ISSN: 1610-1987
Print ISSN: 0933-1875

February 04, 2010 06:15 PM

Comments on: Conference

By: Entrevista: Elizabeth Stark « Vídeo Online

[...] free expression and innovation in online video. She was responsible for producing the first Open Video Conference, which took place in New York at the beginning of [...]

February 04, 2010 06:06 PM

OpenSocial API Blog

Using IBM Mashup Center to build interoperable Web 2.0 applications with OpenSocial gadgets

Greetings…

As many of you know, IBM is committed to building and advancing open standards and platforms. Openness fosters an ecosystem of innovation, promotes customer choice, and allows enterprises to build solutions that best meet their needs. We’ve been working with the OpenSocial Foundation to help define and implement the requirements of enterprise customers.

Recently, we put together an article, available on DeveloperWorks, that describes how you can use IBM Mashup Center to build complex Web 2.0 applications that use OpenSocial gadgets by simply dragging them from a palette and dropping them on a page.

However, OpenSocial gadgets are not the only component model available. For example, there are OpenAjax widgets, as well as a model called iWidgets that IBM uses for many of its products. Because of the importance of interoperability with previously deployed products that leverage different component models, this article demonstrates how you can build a mashup using different types of gadgets/widgets and still have them interoperate and communicate with each other. This interoperability is achieved by leveraging the OpenAjax Hub, a proven technology for inter-gadget communication available from the OpenAjax Alliance.
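The kind of inter-gadget communication described above is usually built on topic-based publish/subscribe: components never call each other directly, they only agree on topic names. The following is a minimal sketch of that pattern in Python. It is not the OpenAjax Hub API (which is a JavaScript library); the class `ToyHub`, its methods, and the topic name in the usage note are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

# A handler receives the topic name and the published payload.
Handler = Callable[[str, Any], None]


class ToyHub:
    """Hypothetical in-process message hub illustrating publish/subscribe."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every message on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> int:
        """Deliver payload to all handlers of a topic; return how many ran."""
        # Copy the list so handlers may subscribe while we iterate.
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)
```

In use, one widget would call `hub.subscribe("map.center", on_center_changed)` and another would call `hub.publish("map.center", {"lat": 41.15, "lon": -8.61})`. Because the two widgets share only the topic name, they can be written against different component models, which is the decoupling the article relies on.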

The article on DeveloperWorks represents an initial step. Members of the OpenSocial Foundation and the OpenAjax Alliance are now working together to build inter-gadget communication into the next version of OpenSocial by integrating the OpenAjax Hub. The designs and work items can be found on the OpenSocial wiki.

Feel free to jump in and help continue to evolve OpenSocial for use in enterprise environments.

IBM SWG Strategy & Technology, Emerging Standards Team

February 04, 2010 05:55 PM

Open Knowledge Foundation Blog

Rethinking Open Data: Lessons learned from the Open Data front lines

Nat Torkington recently wrote the following piece on O’Reilly Radar. He kindly gave us permission to republish it on the Open Knowledge Foundation blog… In the last year I’ve been involved in two open data projects, Open New Zealand and data.govt.nz. I believe in learning from experience and I’ve seen some signs recently that other [...] Related posts:

  1. What features should be included in a catalogue of open government data?
  2. Australian government releases open data for MashupAustralia competition
  3. What do you think about open government data in Australia?

February 04, 2010 04:28 PM

NLP News

Former U.S. Air Force Chief Information Officer Joins Digital Reasoning as Special Advisor

FRANKLIN, Tenn.----Digital Reasoning® Systems Inc., the intelligence-software innovator, today announced that Gen. William "Tom" Hobbins has joined the company as a special advisor.

February 04, 2010 03:20 PM

The Freesound Blog

Zorg and Andy

Guy Davis, creator of the indie movie Zorg and Andy, writes to tell us about his use of sounds from Freesound. Not only that, but he is offering a nice discount for people who want to buy the DVD! Read on for details!

“While in post-production on our film, “Zorg and Andy,” I knew I wanted to try not to skimp on audio finishing (we’d skimped on so many other things, it seemed only natural to try to do something right).

Unfortunately, we were completely out of money (tragically, we still are) and so we had to find some creative (and free!) ways to try to do the necessary audio work.  We obviously couldn’t afford any of the commercial sound effects libraries to supplement our Apple Loops collection, but the problem disappeared after I stumbled across Freesound.

I used Freesound samples in a myriad of ways:  everything from ambience and effects, to transitional pieces and Foley (and if I never have to lay in another pair of footsteps by hand, it will be too soon).

I was absolutely amazed at not only the breadth of the samples available on the site, but also by the ease of searching.  For instance, in one particular scene I needed to find the sound of malevolent chanting (our movie is a goofy comedy concerning a number of cults on a college campus who are all fighting for possession of a mysterious, ancient idol)–on Freesound, no problem! One quick search and I found a ready-to-wear evil incantation.

For me, one of the most satisfying things about sound effects work is creating the illusion of reality by combining seemingly disparate elements, from using the sounds of an Uzi and a gas stove being lit to help sell a camera flash, to using the sounds of insects, birds flapping their wings, a dragon’s hiss (!) and a bean belt shaker to create a roomful of ravenous, flesh-eating beetles. I was able to find virtually everything I needed here.

Again, I can’t begin to thank you enough, Bram, for creating Freesound, and I’d also like to thank all of your members for contributing to this wonderful resource.

And if any of your members are fans of cheesy B-movies, “Zorg” is available on DVD and digital download through Film Baby. And for all of your help, we’d like to offer the cast and crew discount of 25% off DVDs and 20% off downloads.  At checkout, just use the discount code: freesound”


February 04, 2010 10:38 AM

Umbrella Adventure

David Smithson from Hive writes to tell me about the use of Freesound sounds in Umbrella Adventure.

Umbrella Adventure is an entirely hand-drawn exploration adventure game, which takes place in a huge forest full of platforming challenges, enemies and puzzles. The main character, Gopher, wakes up to find his collection of cakes stolen in the night, and it’s up to you to arm yourself with an umbrella and head out into the world to bring them all back, by using the umbrella to overcome the obstacles in your way, meeting new friends, and opening new doorways and paths through the forest.

One of the game’s biggest focal points is immersion - creating the feeling that the world you’re exploring is solid, consistent and real, and this, from our side, involves making sure the entire world feels right, moves right, and sounds right. For a game like this to work, we needed a huge range of different sounds which could be combined to create a rich, realistic soundscape for players to explore. In the three or four years since the development of the game began, we’ve used a ton of material from Freesound contributors, mostly ambient, nature and weather effects, although there are some bizarre sounds in there too; it’s pretty neat to think that, for example, the sound of someone scratching their beard has contributed to our work like that. A lot of care was taken to make sure everything in the game, from the player character to moving enemies and scenery, sounds the way it should, so players can really get inside the game and believe what they’re hearing, and in this respect Freesound has been absolutely essential to us. Visiting and being part of Freesound is something really integral to the ‘indie spirit’ of game development for us; the community there is really dedicated to promoting the creative output of all its members, and it’s great to see so many people united by their common love of finding and making great sounds. It’s something we’re all really thankful is still there, and very proud to be working with.

February 04, 2010 10:30 AM

Digitópia Competitions 2010

The guys at Casa da Música in Portugal are running a competition which involves Freesound.org and yours truly as a judge. Have a look! Enter! Win a midi controller!!

The first Digitópia Miniatures Competition took place in 2008. The aim of this competition was to create a musical miniature of up to 90 seconds, using the resources available at Digitópia. Entrants had to use at least one sound from those recorded on the Porto Underground by Factor E, Casa da Música’s resident team in the Education Department, on World Music Day in 2008. Aimed at stimulating the continual use of Digitópia’s resources and encouraging greater musical ambition, this competition attracted many and varied entries, and the winner, João Gravato, received a Midi controller and 20 tickets for events of his choice held at Casa da Música during 2009.

The next three competitions – all of them international, which is also reflected in the panels comprising major figures from the field – are physically separate from Digitópia, although the facility possesses all the resources needed to produce an entry. Nevertheless, they possibly come even closer to the philosophy of the project. The differences can immediately be seen in the prizes on offer: instead of a commercially-available Midi controller, we are developing a special one-off Midi controller, with the unmistakable Digitópia stamp. But more importantly, all the competitions reflect the idea of sharing with the community: from original Max/MSP or Pure Data patches distributed with open source licences, to new contributions of small musical gestures to freesound.org. It all culminates in a special competition which is full of potential: we will reward the most daring dream by making it come true!

February 04, 2010 10:20 AM

information aesthetics

Huge Interactive Signpost Shows the Direction to Favorite Locations

This gigantic, interactive signpost, sponsored by Nokia Ovi Maps, takes the form of a dynamically rotating electronic LED screen and allows passers-by to send in their favorite location and its coordinates via text or email. The giant pointer, mounted on a 60-ton construction at a height of 50 m, then automatically rotates to the given direction and displays the submitted description to the world.
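Pointing a sign at a submitted location comes down to computing a compass bearing from the sign's position to the target coordinates. The sketch below shows the standard initial great-circle bearing formula on a spherical Earth. It is only an illustration of the geometry; how the Nokia installation actually computes its direction is not described in the post, and the function name `initial_bearing_deg` is an assumption.

```python
import math


def initial_bearing_deg(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Initial great-circle bearing from point 1 to point 2.

    Inputs are latitudes and longitudes in decimal degrees. The result is in
    degrees clockwise from true north, normalized to the range [0, 360).
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)

    # East-west and north-south components of the direction to the target.
    east = math.sin(delta_lon) * math.cos(phi2)
    north = (
        math.cos(phi1) * math.sin(phi2)
        - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lon)
    )
    return math.degrees(math.atan2(east, north)) % 360.0
```

A target due east of the sign on the equator gives a bearing of 90 degrees, and one due north gives 0 degrees. The formula ignores the Earth's ellipsoidal shape, which is a negligible error for a signpost.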


See also the Nokia Blog [1,2] and FarFar. Via Engadget.


February 04, 2010 09:00 AM

Google Research Blog

Google Cluster Data



Google faces a large number of technical challenges in the evolution of its applications and infrastructure. In particular, as we increase the size of our compute clusters and scale the work that they process, many issues arise in how to schedule the diversity of work that runs on Google systems.

We have distilled these challenges into the following research topics that we feel are interesting to the academic community and important to Google:

To aid researchers in addressing these questions in a realistic manner, we will provide data from Google production systems. The initial focus of these data will be workload characterization. Details of the data can be found here. The data are structured as follows:
We solicit your feedback in terms of: (a) the quality and content of the data we are providing; (b) technical approaches and/or results related to the topics above; and (c) other research topics that you feel Google should be addressing in the area of Cloud Computing (along with details of the data required to address these topics).

February 04, 2010 07:31 AM

NLP News

A wizard of oz component-based approach for rapidly prototyping and testing input multimodal interfaces

Abstract: In this paper we present a novel approach for prototyping, testing and evaluating multimodal interfaces, OpenWizard. OpenWizard allows the designer and the developer to rapidly evaluate a non-fully functional multimodal prototype by replacing one modality or a composition of modalities that are not yet available by wizard of oz techniques. OpenWizard is based on a conceptual component-based approach for rapidly developing multimodal interfaces, an approach first implemented in the ICARE software tool and more recently in the OpenInterface tool. We present a set of wizard of oz components that are implemented in OpenInterface. While some wizard of oz (WoZ) components are generic to be reused for different multimodal applications, our approach allows the integration of tailored WoZ components. We illustrate OpenWizard using a multimodal map navigator.

Content Type: Journal Article
Category: Original Paper
DOI: 10.1007/s12193-010-0042-4
Authors: Marcos Serrano, University of Grenoble Grenoble France; Laurence Nigay, University of Grenoble Grenoble France
Journal: Journal on Multimodal User Interfaces
Online ISSN: 1783-8738
Print ISSN: 1783-7677

February 04, 2010 07:01 AM

information aesthetics

Ward Shelley: Infovis Oil Painting Artist

Unknown to me before, visual (infovis?) artist Ward Shelley [wardshelley.com] comes as a refreshing surprise.

Shelley's impressive oil paintings and pencil drawings use real information in an attempt to depict the understanding of how things evolve and relate to one another, and how this develops over time. Usual topics range from art or cultural history, such as the arc of an artist's career and its influences, or the effect of particular ideas in an aesthetic or political movement. The paintings are interpreted as being "wide-screen", as all information is available to the interacting eye at every moment.

These works are full of compact information, which takes months to collect and organize. The designs are done with pencil on paper because each piece goes through constant revisions during this time. Three different versions of the painting are made from the same information. Normally the pencil drawing goes through minor changes from version to version, and the painting is entirely different, using different colors and brushwork.

Thnkx Irene.


February 04, 2010 05:35 AM

ocropus Google Group

status and progress

I realize that things look pretty quiescent with OCRopus right now,
but we're busy working on the next release.
The biggest change for the release will be that all top-level
functionality will be exposed and programmable from Python. This
should make customization, documentation, and training significantly

February 04, 2010 05:01 AM

FLOSS Manuals News

Collaborative Futures Launch(es)

Collaborative Futures will have two launches, one in Berlin on Saturday Feb 6 (1700) during the transmediale Festival http://www ... (by AdamHyde)

February 04, 2010 03:03 AM

Open Book Alliance

GBS 2.0 Objection Roundup

By Gary Reback

The latest round of briefing produced yet another profound embarrassment – bordering on an outright humiliation – for the parties proposing the Google Book Settlement.  Their original proposal met with a torrent of criticism so intense that Google and the publishers withdrew it and replaced it with an “amended” version.

But the amendments turned out to be more illusory than substantive, as an even more diverse array of important interests filed strong objections to the proposed settlement last week.  Individuals, groups and commercial enterprises filed more than fifty new briefs and letters opposing the amended deal.  The opposition came from every quarter:

On the other hand, practically no one filed in support of the new version, which purports to limit the Settlement to books published in the U.S., the U.K., Canada and Australia.[1] The amendments drew dutiful letters of support from publisher associations in those countries, who doubtless cut a secret, sweetheart deal like the one members of the Association of American Publishers (AAP) secured in the U.S.  The interested publisher associations, plus a single authors’ group from the UK (which candidly admitted that many of its members object to the deal), represented all the support the parties could muster at this stage of the proceedings.  Also, of the 400-plus parties who objected to the initial version, only one commercial enterprise withdrew its objection.

Many opponents of the deal, including the United States Government in its September 15 filing, challenged the legitimacy of the putative class representatives to serve in that capacity – because of separate side deals for the named publisher plaintiffs, the diversity of publisher and author interests not represented among the named plaintiffs, etc.  Despite what must have been a no-expenses-barred effort by the AAP to drum up publisher support for the amendments, not a single publisher – not one among the thousands and thousands of publishers in the United States – came forward to stand behind the self-appointed class reps in court.  The silence, as the saying goes, was deafening.

The Proposed Settlement is simply incomprehensible, except to the most highly trained lawyers.  Nevertheless, authors, publishers and public interest groups in overwhelming numbers took the time and effort and in many cases bore significant expense (to hire lawyers) in order to make their complaints known.  An untold number of individual authors registered their protests by opting out of the deal.  In the court of public opinion, the Proposed Settlement was soundly rejected.


[1] Google said it was going to keep right on scanning books published in other countries; it just wasn’t going to pay rightsholders any consideration for its appropriation of their works.

February 04, 2010 01:52 AM

NLP News

Former U.S. Air Force Chief Information Officer Joins Digital ...

Digital Reasoning builds data analytic solutions based on a distinctive, patented mathematical approach to natural language processing. The value of Digital Reasoning is not only the ability to leverage the existing knowledge base, but also to reveal ...

February 04, 2010 12:14 AM

Browse Blogs

What to Expect at LinuxCon 2010 this August in Boston!

The call for participation and registration opened for LinuxCon today, signaling the beginning of planning for the 2nd Annual LinuxCon.

To recap on some of the highlights of LinuxCon 2009, which took place in Portland last September, we brought you:

February 04, 2010 12:09 AM

February 03, 2010

Open Book Alliance

Reps. Gonzalez & Green Ask DOJ to Scrutinize Google Books Settlement

Congressmen Charlie Gonzalez (TX-20) and Gene Green (TX-29) today announced that they sent a letter to U.S. Attorney General Eric Holder highlighting their questions and concerns about GBS 2.0. The Congressmen were particularly focused on the fact that many authors and publishers who are not part of the class action lawsuit will be affected by the settlement.

The letter said, “Today, there are hundreds of thousands of authors who are not members of the Authors Guild and hundreds of publishers who are not part of the [Association of American Publishers] who would be most acutely affected by the Google Books Settlement.  Yet their voices have been largely excluded from the process.”

Congressmen Gonzalez and Green said that they were spurred to contact Attorney General Holder as a result of several letters they have received from minority publishers.

February 03, 2010 10:14 PM

blog.aksw.org

Doctoral and PostDoc positions available at AKSW

For collaborative research projects in the area of Linked Data technologies and Semantic Web the research group Agile Knowledge Engineering and Semantic Web (AKSW) at Universität Leipzig opens positions for one postdoctoral researcher and 2 doctoral students. The complete job offers are available at: http://aksw.org/Jobs

February 03, 2010 09:03 PM

Comments on: Conference

By: Open Video « Vídeo Online

[...] The Open Video movement emerged at “The Open Video Conference (OVC)”, an event held at NYU Law School in New York on June 19 and 20, 2009. Starting from [...]

February 03, 2010 08:26 PM

Wikimedia Technical Blog

Wikimedia donates servers to deserving non-profits.

Every year, Wikipedia usage goes upward, and every year the technical folks working and volunteering with Wikimedia have to plan, purchase, and implement new servers to keep up with the growing popularity of Wikipedia and its sister projects.  With the advances in computing, running 9 new application servers this year took the load of 36 application servers from 3 years ago.

So when we upgrade, what happens to the old equipment that is too slow for Wikipedia, but not too slow for MANY other non-profits?  We donate it!  These systems were 1U rackmount servers with dual CPUs (2.5-3 GHz, single core), 2-4GB of RAM, and 2-4 HDD bays with 1-2 80-250GB HDDs. This year, three non-profits received our older systems (in alphabetical order): Drupal.org, OpenStreetMap Foundation, and Sugar Labs.

Drupal.org

Drupal is a free software package that allows an individual or a community of users to easily publish, manage and organize a wide variety of content on a website. Tens of thousands of people and organizations are using Drupal to power scores of different web sites.

OpenStreetMap Foundation

The OpenStreetMap Foundation is an international non-profit organisation supporting but not controlling the project. It is dedicated to encouraging the growth, development and distribution of free geospatial data and to providing geospatial data for anybody to use and share.

OpenStreetMap is an open initiative to create and provide free geographic data such as street maps to anyone who wants them.

Sugar Labs

The mission of Sugar Labs® is to produce, distribute, and support the use of the Sugar learning platform; it is a support base and gathering place for the community of educators and developers to create, extend, teach, and learn with the Sugar learning platform.

We hope the recipients of our servers will be able to put them to good use!

Below are some common questions involving Wikimedia and the server donation process:

Q. How can I get some of the decommissioned donation servers?

A. The best place to follow the goings on of our technical team is here, on the Wikimedia Technical Blog.  When we have a batch of servers up for decommissioning and donation, we will announce it on the tech blog, along with instructions on how to apply to receive some servers.

Q. Who is eligible to apply for servers?

A. We try to only donate servers to other non-profits whose core values are similar to or in support of our own.  This means we do not donate them for individual use.  Since these servers were purchased with donations to support Wikimedia, we feel we need to further donate them to other like-minded organizations, since that is how the money for the servers was meant to be spent.

Q. How often does this happen?

A. Most servers are kept in use by Wikimedia beyond three years.  Many of the servers that we have turned off in this batch are anywhere from 3 to 5 years old.  We only replace them when it makes sense from a technical standpoint to do so.  This means we cannot just say ‘we will do this every X months.’  We try to get the most use out of every server, as they were donated or purchased with donations.  So there is no set date. Just keep checking the Wikimedia Technical Blog; when we have more to donate, we will say so there!

Q. I am a student/person/so and so, and I want to learn to develop and do such and such.  Can you send me a server?

A. Sorry, unfortunately it is just not realistic or fair of us to try to sort out which personal use requests for servers are legitimate and which are folks wanting computers for any other reason.  We choose to limit our donations to other like-minded non-profit organizations.

Rob Halsell
Systems Administrator

February 03, 2010 07:33 PM

Science Commons

T-shirt Contest Goes Global

Our Science Commons t-shirt design contest is now open to entrants outside of the US.  We realized that while we might not be able to afford to fly someone to Seattle from Australia or Germany, we would be missing out on too much talent if we limited the contest to US residents. The prize will [...]

February 03, 2010 07:25 PM

ubiquity-firefox Google Group

Funny sms

Funny-Videos contains a large collection of funny videos/funny video
clips and movie downloads updated daily Our editors find the best
Funny video, Funny babies, Funny movies clips and pictures for you to
watch right now. [link]

February 03, 2010 05:42 PM

Too much spammy messages?

Please turn on authorize first message for this group, Google groups
SPAM is "rampant" these days.

February 03, 2010 05:40 PM

NLP News

Activists: There are “death squads” in Ciudad Juárez

The NarcoSphere: Advanced NLP revit license (Natural Language Processing) solutions save us precious time that we usually spend on proofreading and editing our emails ...

February 03, 2010 03:46 PM

UMBEL Google Group

Draft paper submission deadline is extended: EISWT-10

The 2010 International Conference on Enterprise Information Systems
and Web Technologies (EISWT-10) (website: [link])
will be held during 12-14 of July 2010 in Orlando, FL, USA. EISWT is
an important event in the areas of Enterprise Information Systems,

February 03, 2010 12:25 PM

NLP News

A dictionary to identify small molecules and drugs in free text.

A dictionary to identify small molecules and drugs in free text. Bioinformatics. 2009 Nov 15;25(22):2983-91

Authors: Hettne KM, Stierum RH, Schuemie MJ, Hendriksen PJ, Schijvenaars BJ, Mulligen EM, Kleinjans J, Kors JA

MOTIVATION: From the scientific community, a lot of effort has been spent on the correct identification of gene and protein names in text, while less effort has been spent on the correct identification of chemical names. Dictionary-based term identification has the power to recognize the diverse representation of chemical information in the literature and map the chemicals to their database identifiers.

RESULTS: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, manual check of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the different processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40 (0.80 for trivial names)); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary.

AVAILABILITY: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist.

PMID: 19759196 [PubMed - indexed for MEDLINE]
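To make the idea of dictionary-based term identification concrete, here is a minimal Python sketch. It is not the authors' system: the tiny dictionary, the placeholder identifiers (`EX:0001` and so on), and the stop-term filter are all assumptions made up for illustration. It shows the general pattern of matching dictionary terms in text, mapping them to identifiers, and scoring the result with precision and recall against a hand-annotated gold set.

```python
import re

# Hypothetical mini-dictionary: surface term -> identifier.
# The identifiers are illustrative placeholders, not real database IDs.
CHEM_DICT = {
    "aspirin": "EX:0001",
    "acetylsalicylic acid": "EX:0001",
    "caffeine": "EX:0002",
    "lead": "EX:0003",
}

# Assumed filter for terms that are too ambiguous in ordinary English.
STOP_TERMS = {"lead"}


def find_terms(text, dictionary=CHEM_DICT, stop_terms=STOP_TERMS):
    """Return (start, end, identifier) for each dictionary term found in text."""
    # Try longer terms first so multi-word names win over their substrings.
    terms = sorted((t for t in dictionary if t not in stop_terms),
                   key=len, reverse=True)
    if not terms:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), dictionary[m.group(1).lower()])
            for m in pattern.finditer(text)]


def precision_recall(predicted, gold):
    """Precision and recall of predicted annotations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Filtering an ambiguous term such as "lead" removes false positives but also loses any true mentions of that term, which mirrors the trade-off the abstract reports: each processing step raises precision at a small cost in recall.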

February 03, 2010 11:33 AM

Open Knowledge Foundation Blog

7th Communia Workshop, Luxembourg

We recently attended a workshop in Luxembourg as part of Communia, the EU policy network on the digital public domain. There was a focus on bringing together themes from previous events to make a series of policy recommendations to the European Commission (watch this space!). Below are a few notes highlighting some of the talks and [...] Related posts:

  1. First COMMUNIA Workshop - “Technology and the Public Domain”
  2. 2nd Communia Workshop, Torino
  3. Third COMMUNIA Workshop - Marking the public domain

February 03, 2010 11:27 AM

February 02, 2010

NLP News

Carnegie Mellon helps relief workers in Haiti bridge the language barrier

Pop City: LTI, with a focus on machine translation, speech processing, information retrieval, text mining and computer-assisted language learning, is part of Carnegie ...

February 02, 2010 11:24 PM

EFF.org Updates

EFF's 20th Birthday Commemorative Poster


UPDATE: Downloadable wallpaper now available.

To celebrate 20 years of fighting for your digital rights, EFF staff designer Hugh D'Andrade came up with a commemorative poster! You can download your own hi-res copy from our Flickr page (under a Creative Commons Attribution License), and limited edition prints will be given to VIP Donors at our upcoming 20th Birthday celebration on February 10th at the DNA Lounge in San Francisco. The design will also be available on t-shirts on sale at the event.

Hope to see you there!

EFF's 20th Birthday with Adam Savage and Friends

February 10, 2010
Doors open at 8 pm, VIP event at 7 pm
$30 donation (no one turned away)
DNA Lounge
375 Eleventh Street
San Francisco, CA

February 02, 2010 08:37 PM

Open Medicine Blog blogs

This week at Open Medicine - new research

According to new research by Patel et al., published today in Open Medicine, approximately 64 percent of Canadian adults (age 40 and older) have access to primary percutaneous coronary intervention (PCI), a treatment for severe heart attacks, within 60 minutes. (In the U.S., 80 percent of adults have access to PCI within a similar time frame.) PCI, in which a balloon catheter is used to restore blood flow to the heart, is currently the recommended treatment for ST-segment elevated myocardial infarction (STEMI). The authors used geographic information systems and census data to estimate transit time to PCI facilities in Canada using ground transportation. They found that transit times vary greatly from province to province. In New Brunswick, for example, only 15 percent of adults age 40 and older have access to PCI in that time frame, while in Ontario that number is 73 percent. The study also evaluated how the addition of four hypothetical facilities would affect access. This research on geographic access to PCI has important implications for policy makers creating regionalized care models for cardiac care.

This commentary discusses the policy implications of new research by Patel et al., using geographic information systems to estimate access to primary percutaneous coronary intervention (PCI), a technique used to manage severe heart attacks.

Late last year, the appointment of Bernard Prigent, vice-president of medical affairs at Pfizer Canada, to the governing council of the Canadian Institutes for Health Research (CIHR), generated much controversy and two hearings by a Standing Committee on Health of the House of Commons. In this editorial, the editors of Open Medicine present an official response from CIHR to a commentary and analysis piece by Steven Lewis, also published today in the journal.

Steven Lewis, a health policy consultant based in Saskatoon and a member of Open Medicine’s editorial board, criticizes the appointment of Bernard Prigent, vice-president of medical affairs at Pfizer Canada, to the governing council of the Canadian Institutes of Health Research. “More discouraging than the Prigent appointment in itself is the refusal of either government or the CIHR to acknowledge that the appointment raises even the possibility of conflict of interest,” he writes in a new commentary published today in Open Medicine. “It is one thing to make and defend a decision; it is quite another to refuse to recognize and substantively engage with the very real ethical issues at its core.”

Follow “Open Medicine” on Twitter and Identi.ca. Join us on Facebook: http://www.facebook.com/group.php?gid=6117690964

February 02, 2010 07:53 PM

tesseract-ocr Google Group

Different results on subimages

Hi everybody,
I'm writing an application to automatically scan tons of postal
orders, using TessNet2 library from C#. Tesseract is great and
recognizes about everything on the postal order. But, because some
fields contain only numbers and some others only letters, I want to
process single subimages from the whole picture, adjusting

February 02, 2010 06:40 PM

NLP News

Term distribution visualizations with Focus+Context

Abstract: Many text searches are meant to identify one particular fact or one particular section of a document. Unfortunately, predominant search paradigms focus mostly on identifying relevant documents and leave the burden of within-document searching on the user. This research explores term distribution visualizations as a means to more clearly identify both the relevance of documents and the location of specific information within them. We present a set of term distribution visualizations, introduce a Focus+Context model for within-document search and navigation, and describe the design and results of a 34-subject user study. This user study shows that these visualizations, with the exception of the grey scale histogram variant, are comparable in usability to our Grep interface. This is impressive given the substantial experience of our users with Grep functionality. Overall, we conclude that users do not find this visualization model difficult to use and understand.

Content Type: Journal Article
DOI: 10.1007/s11042-010-0479-1
Authors: Moses Schwartz, Curtis Hash, Lorie M. Liebrock (all New Mexico Institute of Mining and Technology, Department of Computer Science and Engineering, Socorro, NM, USA)
Journal: Multimedia Tools and Applications
Online ISSN: 1573-7721
Print ISSN: 1380-7501
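As a rough illustration of what a term distribution is, the Python sketch below splits a document into equal-width slices and counts how often a search term starts in each slice, then renders the counts as a one-line text histogram. This is only a guess at the general idea, not the visualizations from the paper; the function names, the bin count, and the shading characters are all assumptions.

```python
import re


def term_distribution(text, term, bins=10):
    """Count occurrences of `term` in each of `bins` equal-width slices of `text`."""
    if bins <= 0 or not text:
        return [0] * max(bins, 0)
    counts = [0] * bins
    for match in re.finditer(re.escape(term), text, re.IGNORECASE):
        # Map the character offset of the hit to a slice index.
        index = min(match.start() * bins // len(text), bins - 1)
        counts[index] += 1
    return counts


def render(counts, shades=" .:*#"):
    """Render counts as one character per slice, darker meaning more hits."""
    peak = max(counts, default=0)
    if peak == 0:
        return shades[0] * len(counts)
    top = len(shades) - 1
    # Ceiling division so that any non-zero count gets a visible mark.
    return "".join(shades[(c * top + peak - 1) // peak] for c in counts)
```

A reader scanning the rendered line can jump straight to the dense slices, which is the within-document navigation problem the abstract describes.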

February 02, 2010 05:50 PM

Omeka

Omeka Outreach Month

Six more weeks of winter lie ahead, making the forthcoming frigid days perfect for starting or tweaking an Omeka project. The Omeka team will be working hard as well in February during the newly-designated “Omeka Outreach Month.” We plan to write a few blog posts, seek assistance with plugin and theme development, and drop a new [...]

February 02, 2010 05:45 PM

W3C Semantic Web Activity News

RDFa Working Group launched

W3C launched today the RDFa Working Group, whose mission is to support the use of RDFa, a format for embedding structured data in Web documents. The Working Group's goals include making it easier to author RDFa, promoting continued adoption of the technology in HTML, XHTML, and XML, and helping developers create RDFa applications. The group is chartered to extend and enhance RDFa 1.0, including the specification of an API. The Working Group will also support the HTML Working Group in its work on incorporating RDFa in HTML5 and XHTML5 (as a followup on the currently published Working Draft for RDFa 1.0 in HTML5).

February 02, 2010 05:27 PM

Open Access News

February SOAN

I just mailed the February issue of the SPARC Open Access Newsletter.  This issue takes a close look at four analogies between the political fortunes of open access and the political fortunes of clean energy. 

The roundup section briefly notes 116 OA developments from January.

Here's a quick overview of the four analogies:

  1. The gap between breakthrough and uptake
  2. Putting obstacles in our way
  3. Slowing down to protect the incumbents
  4. Some pay for all

Update (3 hours later).  A list problem has snagged delivery of the email edition.  Apologies for the delay.  Meantime, the online edition (link above) is the same as the email edition and already available.

February 02, 2010 05:03 PM

BioMed Central

UK government adopts Creative Commons licenses for open data: good news for public-sector researchers publishing in open access journals

The UK government has in recent years made significant amounts of government data openly available for reuse.  They Work for You is an example of a website which creatively reuses data on UK parliamentary activity,  and its parent organization, MySociety, has played an important role in encouraging the UK government towards opening up more data.

The latest development in UK government open data sharing is data.gov.uk, launched in beta test form last month, which “provides a single access point to over 2,500 central government datasets that have been made available for free re-use”.

Buried in the small print of this announcement is an important change, with significant implications for open access publishing in the UK. This change is the adoption of Creative Commons-compatible licensing for UK government open content.

Up until now, open data from the UK government was licensed via the Office of Public Sector Information’s ‘Click Use’ license scheme. The Click Use model required any potential users or distributors of the data to first request their own ‘Click Use’ license from the UK government website, in order to gain permission to reuse the data.

In contrast, musicians, artists and other creators around the world who wish to share content openly while reserving some rights have increasingly standardized on the use of Creative Commons licenses, which do not require any such license request to be made.

BioMed Central, like many other open access publishers, uses the Creative Commons Attribution License, which requires only that the original version of the work should be correctly attributed when the work or any part of it is reused.

Until now, because work carried out by researchers at UK government agencies is often covered by ‘Crown Copyright’, and because Crown Copyright is legally distinct from normal copyright law, the applicability of Creative Commons licenses to such work has been in question. As a result, special license wording has in some cases been necessary for such articles published in BioMed Central journals, in order to indicate that they can be reused only under the ‘Click Use’ scheme. This had the potential to cause delays for authors and confusion for readers.

The good news is that the announced intention of OPSI to move away from ‘Click Use’ licensing towards Creative Commons-compatible licensing over the coming months should entirely solve this problem, making life easier for all concerned.

It also provides an important precedent for dealing with similar challenges in other (rather arcane) areas of copyright law. For example, the World Health Organization and other supra-national bodies do not recognize national jurisdictions, which causes similar challenges for Creative Commons licensing to those caused by Crown Copyright,  and requires similar workarounds via special-case license wording. BioMed Central is hopeful that a Creative Commons-compatible licensing scheme specifically designed for such supra-national bodies will soon resolve this and we are working with WHO and Creative Commons towards such a solution.

February 02, 2010 04:26 PM

Linked Data Blog Aggregator

Collaborating on Images


The Inkscape Process Can Also Aid Image Interchanges with Powerpoint

As we see more collaboration forums emerge, one question that naturally arises is the joint authoring or editing of images. This is particularly important as “official” slide decks or presentations come to the fore.

There are perhaps many different ways to skin this cat. In this article, I describe how to do so using the free, open source SVG editing program, Inkscape.

Why Inkscape?

Like many of you, I have been creating and editing images for years. I am by no means a graphics artist, but images and diagrams have been essential for communicating my work.

Until a few years back, I was totally a bitmap man. I used Paint Shop Pro (bought by Corel in 2004 and getting long in the tooth) and did a lot of copying and pasting.

I switched to Inkscape about two years ago for the following reasons:

How to Collaborate with Inkscape

Once you have a working image in Inkscape, make sure all collaborators have a copy of the software. Then:

  1. Isolate the picture (sometimes there are multiple images in a single file) by deleting all extraneous image material in the file
  2. From the toolbar, click on the Zoom to fit drawing in window icon; this will resize and put your target image in the full display window
  3. Under File -> Document Properties … check Show page border and Show border shadow, then Fit page to selection. This helps size the image properly in the exported file for sharing or collaboration
  4. Save the file in *.svg format, and name the file with a date/time stamp and author extension (useful for tracking multiple author edits over time)
  5. If in multiple author mode, make it clear who currently has “ownership” of the image.
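The naming convention in step 4 is easy to apply by hand, but a small helper keeps it consistent across collaborators. The Python sketch below is one possible scheme, not something prescribed by Inkscape or by this article: the `name_YYYYMMDD-HHMM_author.svg` layout, the function name, and the author tag `mb` are all assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path


def stamped_name(original, author, when=None):
    """Build a file name carrying a date/time stamp and an author tag.

    The layout used here (stem_YYYYMMDD-HHMM_author.ext) is an assumed
    convention; any scheme works as long as all collaborators share it.
    """
    when = when or datetime.now()
    path = Path(original)
    stamp = when.strftime("%Y%m%d-%H%M")
    return f"{path.stem}_{stamp}_{author}{path.suffix}"


if __name__ == "__main__":
    # Example: tag people.svg for a hypothetical author "mb".
    print(stamped_name("people.svg", "mb"))
```

Because the stamp sorts lexicographically, a plain directory listing shows the edit history in order, and the author tag shows who last held the image.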

How to Share with Powerpoint

Of course, more often than not, some collaborators will not have a copy of Inkscape, or the image will not have started out in SVG format.

The image below began as a Windows Powerpoint clip art file, which has then gone through some modifications. Note the bearded guy’s hand holding the paper is out of registry (because I screwed up in earlier editing, but I also can easily fix because it is a vector image!  ;)   ). Also note we have the border from Inkscape as suggested above.  This file, BTW, is people.png, and was created as a PNG after a screen capture from Inkscape:

PNG representation of an SVG

When beginning in Powerpoint or as clip art, files in the format of Windows metafile (*.wmf) or extended WMF (*.emf) work well. (For example, you can download and play with the native Inkscape format of people.svg, or the people.wmf or people.emf versions of the image above.) If you already have images in a Powerpoint presentation, save in one of these two formats, with (*.emf) preferred. (EMF is generally better for text.)

You can open or load these files directly into Inkscape. Generally, they will come in as a group of vectors; to edit the pieces, you should “ungroup.”

After editing per the instructions in the previous section, if you need to re-insert back into Powerpoint, please use the *.emf format (and make sure you do not save text as paths).

For example, see the following PNG graphic taken from an Inkscape file (figure_text.svg):

PNG representation of an SVG

We can save it as an EMF (figure_textpath.emf) to a Powerpoint, with the option of converting text to paths:

Text-to-path EMF

Or, we can save it as an EMF (figure_text.emf) to a Powerpoint, only this time not converting text to paths and then “ungrouping” once in Powerpoint:

EMF with no text to path

Note the latter option, text not as path, is the far superior one. However, also note that borders are added to the figures and vertical text is rotated 90° back to horizontal. Nonetheless, the figure is fully editable, including text. Also, if the original Inkscape figures are constructed with lines of the same color as fills, the border conversion also works well.

Frankly, especially with text, because there can be orientation and other changes going from Inkscape to Powerpoint, I recommend using Inkscape and its native SVG for all early modifications and to keep a canonical copy of your images. Then, prior to completion of the deck, save as EMF for import into Powerpoint and then clean up. If changes later need to be made to the graphic, I recommend doing so in Inkscape and then re-importing.

Other Alternatives

I should note there is an option, as well, in Inkscape to convert raster images to vector ones (use Path -> Trace bitmap … and invoke the multiple scans with colors). This is doable, but involves quite a bit of image copying, manipulation and color separation to achieve workable results. You may want to see further Inkscape’s documentation on tracing, or more fully this reference dealing with color.

Of course, there are likely many other ways to approach these issues of collaboration and sharing. I will leave it to others to suggest and explain those options.

February 02, 2010 04:26 PM

Google Research Blog

Announcing Google's Focused Research Awards



[cross-posted with the Official Google Blog]

It is said that Google is like a university — and not just because everyone eats their lunch off trays in the cafeteria. Like a university, we devote significant energy to research across a wide array of subjects — from semantics to help improve search, to ways we can improve the efficiency of our data centers. Along with our internal efforts, we've long invested in building a strong, mutually beneficial relationship with universities and the research community. We give approximately 150 research grants a year to fund projects across a variety of subjects, we host visiting faculty members here at Google on sabbatical, and last year we started the Google Fellowship Program to fund graduate students doing innovative research in several fields.

Today, we're announcing the first-ever round of Google Focused Research Awards — funding research in areas of study that are of key interest to Google as well as the research community. These awards, totaling $5.7 million, cover four areas: machine learning, the use of mobile phones as data collection devices for public health and environment monitoring, energy efficiency in computing, and privacy. These are all areas in which Google is already deeply invested, and yet there is a long way to go. We're excited to see what these projects contribute to the body of research in these important areas.

These unrestricted grants are for two to three years, and the recipients will have the advantage of access to Google tools, technologies and expertise. We've given awards to 12 projects led by 31 professors at 10 universities:

Machine Learning: William Cohen, Christos Faloutsos, Garth Gibson and Tom Mitchell, Carnegie Mellon University

Use of mobile phones as data collection devices for public health and environment monitoring: Gaetano Borriello, University of Washington and Deborah Estrin, UCLA

Energy efficiency in computing:

Privacy:

We look forward to working with these researchers over the coming years. And, as we continue to identify key areas of research that are of mutual interest to both university researchers and Google, we will provide awards to support these collaborations. For more information about all of our research programs, check out our University Relations site.

Update at 1:14 PM: Added Alessandro Acquisti and Norman Sadeh to the list of PIs on the CMU privacy project.

February 02, 2010 01:14 PM

NLP News

Graph Classification

Supervised learning on graphs is a central subject in graph data processing. In graph classification and regression, we assume that the target values of a certain number of graphs or a certain part of a graph are available as a training dataset, and our goal is to derive the target values of other graphs or the remaining part of the graph. In drug discovery applications, for example, a graph and its target value correspond to a chemical compound and its chemical activity. In this chapter, we review state-of-the-art methods of graph classification. In particular, we focus on two representative methods, graph kernels and graph boosting, and we present other methods in relation to the two methods. We describe the strengths and weaknesses of different graph classification methods and recent efforts to overcome the challenges.

Content Type: Book Chapter
DOI: 10.1007/978-1-4419-6045-0_11
Authors: Koji Tsuda, Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Hiroto Saigo, Max Planck Institute for Informatics, Saarbrücken, Germany
Book Series: Advances in Database Systems
Print ISSN: 1386-2944
Book Series Volume: Volume 40
Book: Managing and Mining Graph Data
DOI: 10.1007/978-1-4419-6045-0
Online ISBN: 978-1-4419-6045-0
Print ISBN: 978-1-4419-6044-3
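For readers unfamiliar with graph kernels, the following Python sketch shows about the simplest possible one. It is not one of the methods reviewed in the chapter: the kernel here just counts shared vertex labels and shared edge label pairs, and the classifier is a plain nearest-neighbour rule over the distance that kernel induces. The graph representation (a dict of node labels plus an edge list) and the toy "active"/"inactive" targets are assumptions chosen to echo the drug discovery example.

```python
from collections import Counter


def graph_features(graph):
    """Histogram of vertex labels and of unordered edge label pairs.

    `graph` is assumed to be a (labels, edges) pair, where `labels` maps
    node id -> label and `edges` is a list of (node, node) tuples.
    """
    labels, edges = graph
    features = Counter(("v", label) for label in labels.values())
    for a, b in edges:
        pair = tuple(sorted((labels[a], labels[b])))
        features[("e",) + pair] += 1
    return features


def histogram_kernel(g1, g2):
    """Dot product of the two feature histograms."""
    f1, f2 = graph_features(g1), graph_features(g2)
    return sum(f1[key] * f2[key] for key in f1.keys() & f2.keys())


def classify(query, training):
    """Return the target of the training graph closest to `query`.

    Distance is the squared distance induced by the kernel:
    k(a, a) - 2 k(a, b) + k(b, b).
    """
    def squared_distance(a, b):
        return (histogram_kernel(a, a)
                - 2 * histogram_kernel(a, b)
                + histogram_kernel(b, b))

    return min(training, key=lambda item: squared_distance(query, item[0]))[1]
```

Practical graph kernels compare richer substructures such as walks, paths, or subtrees, but they plug into a classifier in the same way: the learning algorithm only ever sees kernel values between pairs of graphs.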

February 02, 2010 06:48 AM

The algorithms for preliminary text processing: Decomposition, annotation, morphological analysis

Abstract: This paper considers the existing algorithms and suggests new algorithms for preliminary text processing that permit its quality to be increased, including: the deduction-inversion architecture of decomposition, modified algorithm of bidirectional interference, and morphological analysis based on preliminary annotation with tags of parts of speech.

Content Type: Journal Article
DOI: 10.3103/S0005105509060041
Authors: V. A. Yatsko, M. S. Starikov, E. V. Larchenko, T. N. Vishnyakov
Journal: Automatic Documentation and Mathematical Linguistics
Online ISSN: 1934-8371
Print ISSN: 0005-1055
Journal Volume: Volume 43
Journal Issue: Volume 43, Number 6 / December, 2009

February 02, 2010 06:46 AM

information aesthetics

Information Landscapes in 1994 (MIT Prof Muriel Cooper)

Back in 1994, Muriel Cooper, one of the co-founders of the MIT Media Lab where she taught interactive media design as the head of the Visible Language Workshop, presented her work at the TED5 conference in Monterey, CA.

Her presentation would initiate a new era of data visualization, and it changed the way designers thought of the possibilities of electronic media. (Maybe quite similar to how David Small's dynamic renditions of text changed my way of thinking about 3D visualization). Her work was revolutionary as it pushed typography into the 3 spatial dimensions, and augmented it with dynamics, animation and interactivity. Tragically, it was just after this event that she passed away.

For many years, David Young has carried around an old VHS tape that demonstrated this work, to show it to students as an example of Muriel's vision and as an inspiration to push creative boundaries (or, as told in the film: "We must reexamine the current stultifying interface standards and metaphors. We must define a rich vocabulary, tools and design strategies that are applicable to any information domain and to this multidimensional world"). He has finally digitized the tape and has posted it online for all to see (see the movie below).

"The work is a beautiful demonstration of the ideas that Muriel had been pursuing for much of her career. Dynamics, interactivity, typography, and live data. For this video she used the titles "Designing and information landscape in time and space" and "The dynamic visualization of information in two and three dimensions."

For more information about Muriel Cooper, you should read "Muriel Cooper's Visible Wisdom" by Janet Abrams.

Note: Image above taken from the Talmud Project by David Small, and Financial Viewpoints, by Lisa Strausfeld.


February 02, 2010 06:32 AM

ocropus Google Group

Chinese

Hello. I'm trying to train ocropus on Chinese using the code in the
repository as of last week. I'm using the training code in extras/
train-unicode (very cool, btw). After producing the training files, I
ran:
ocropus trainseg my.model out
which took all day but finally produced a model. From the log:

February 02, 2010 01:26 AM

Bugfix for xml-entities.cc

xml-entities.cc was refusing to compile on g++ on my machine, with the
following error:
ocr-utils/xml-entities.cc:119: error: invalid conversion from 'const
char*' to 'char*'
The following change fixes the error:
119c119
< while((p = strchr(begin, '&'))) {
---

February 02, 2010 01:05 AM

February 01, 2010

IOSA.it

ArcheoFOSS 2010: abstract submission deadline postponed to 19 February

Forward from the Scientific Committee:

I am writing to let you know that the abstract submission deadline has been
postponed from 30 January to 19 February.

UPDATE ON KEY DEADLINES:

Contributions may be proposed on any aspect of the design, development and use of free formats and/or free or open source software in archaeology.

Sessions are planned for the presentation of peer-reviewed papers.

A poster session is also planned for presenting ongoing or recently completed projects.

This edition will again include a dedicated laboratory session (OpenLAB). This year's laboratories, devoted to hands-on work with Open Source tools, will cover 3D modelling
applied to archaeology, management of 3D meshes with particular reference to Cultural Heritage, and the creation and management of databases and their application in archaeology.

For all information on the event, please visit www.archeologiadigitale.it/archeofoss/2010.html


February 01, 2010 09:26 PM

FreeCulture.org - Students for Free Culture

Lawrence Lessig talk on Fair Use and Online Video

On February 25th, 2010, Lawrence Lessig will deliver a talk on fair use and politics in online video from Harvard Law School in Cambridge, MA. Open Video Alliance and the Harvard Berkman Center are teaming up to provide a live webcast—you can tune in at http://openvideoalliance.org/lessig, or attend in person at one of many screening [...]

February 01, 2010 09:20 PM

ubiquity-firefox Google Group

Funny video clips

Dudes, it is awesome!!!check this out! [link]

February 01, 2010 09:16 PM

EFF.org Updates

Seven "Corporations of Interest" in Selling Surveillance Tools to China

Secretary of State Hillary Clinton's announcement of a new U.S. policy on global Internet Freedom included a bold new statement about the responsibilities of American technology companies:

...We are urging U.S. media companies to take a proactive role in challenging foreign governments' demands for censorship and surveillance. The private sector has a shared responsibility to help safeguard free expression. And when their business dealings threaten to undermine this freedom, they need to consider what’s right, not simply what’s a quick profit.

We couldn't agree more.

While Clinton focuses on media companies (meaning Internet media companies like Google, Yahoo! and Microsoft), there are plenty of other companies deserving scrutiny. Specifically, many U.S. (and multinational) technology companies may be knowingly selling Chinese authorities the surveillance equipment used to commit or facilitate human rights abuses. We think it's high time to pay attention to them as well.


The "Corporations of Interest"

Drawing from published news articles, EFF has compiled a list of seven corporations that are reportedly selling surveillance technology to the Chinese government and related entities. We're designating them "corporations of interest".

Of course, news articles alone are not absolute evidence that these companies are indeed fostering repression in China. But it's clear that China uses technology to employ rampant censorship, invasive data collection and intimidation. Learning exactly what is going on, especially in the Chinese environment of state secrecy and propaganda, is difficult. But news reports, especially those that include admissions of some level of involvement from company officials, are a sufficient basis to begin asking further questions.

  1. Cisco: Cisco's deep involvement in the building of China's Golden Shield Project has been admitted by the company. Cisco's involvement has even already been raised before Congress, including the fact that Cisco engineers gave a presentation acknowledging the repressive uses for their technology that quoted their Chinese government buyers as saying that Cisco's products could be used to "combat 'Falun Gong' evil religion and other hostiles." The UK's Guardian reports that Cisco provides over 60% of all routers, switches, and network gear to China and estimates that Cisco makes $500 million annually from China.

  2. Nortel: Rolling Stone and The Guardian report that Nortel has sold hardware to aid the Golden Shield Project for surveillance and censorship purposes, including working with Tsinghua University to develop speech recognition software to monitor telephone conversations.

  3. Oracle: Business Week reports that Oracle has sold software to the Chinese Ministry of Public Security for criminal and ideological investigations. Oracle admits that one-third of its business in China is with the government.

  4. Motorola: Business Week also reports that Motorola sold the Chinese authorities handheld devices for street cops to tap into "sophisticated data repositories" on Chinese citizens.

  5. EMC: Business Week also reports that EMC sold "sophisticated data repositories" to the Chinese public security authorities. The top EMC executive in Beijing is quoted as saying, "We can expect big revenue from public security agencies" in China.

  6. Sybase: Business Week also reports that Sybase sells database programs to the Shanghai police.

  7. L-1 Identity Solutions: Rolling Stone reports that this Connecticut-based biometrics company sold software to Chinese companies that aids government officials in identifying individuals for purposes of criminal investigations.

The question of which companies have assisted in Chinese surveillance is just a small piece of a very large puzzle and we're quite confident that there are more than just these seven. And obviously many countries other than China are engaged in Internet surveillance — from Iran's infamous repression of political dissent, to censorship efforts across the globe, to the USA's own domestic surveillance architecture. Corporate complicity in these efforts is equally deserving of scrutiny.

It's also worth keeping in mind that surveillance is only part of the equation. Other technologies created or sold by companies may also be misused by the Chinese authorities. For instance, Internet censorship systems curtail civil liberties almost as severely as Internet surveillance systems. Research by the OpenNet Initiative has shown that censorship systems in many repressive countries have been outsourced to U.S. corporations.


The Solution

What comes next? Again, there's simply not enough publicly available information to be absolutely certain about the extent of any one company's active involvement or complicity.

So, a good first step would be for the companies in question to clear the air and come clean with the public about their behavior. There are six steps we'd like to see them take:

  1. Clarify their actual relationships with the Chinese authorities engaged in surveillance and censorship of the Chinese people.

  2. Publicly disclose what sorts of products and services they are selling to the Chinese government.

  3. Publicly disclose whether they have been doing "customization" or otherwise facilitating targeting of human rights activists or other vulnerable groups in China.

  4. Publicly disclose whether they have learned that their products and services are being used for repression.

  5. Publicly disclose how much money they make selling products and services to the Chinese government.

  6. Publicly disclose the steps they can take to prevent their products and services being used to violate human rights.

EFF (and presumably the State Department) will be watching closely to see whether these and other corporations selling surveillance technologies to the Chinese authorities take these steps.

And if they don't? Then it may be time for Secretary Clinton, or her allies in Congress and the Administration, to pressure them to do so.

February 01, 2010 06:07 PM

NLP News

Theoretical Foundations for Enabling a Web of Knowledge

The current web is a web of linked pages. Frustrated users search for facts by guessing which keywords or keyword phrases might lead them to pages where they can find facts. Can we make it possible for users to search directly for facts embedded in web pages? Instead of a web of human-readable pages containing machine-inaccessible facts, can the web be a web of machine-accessible facts superimposed over a web of human-readable pages? Ultimately, can the web be a web of knowledge that can provide direct answers to factual questions and support these answers by referencing and highlighting relevant base facts embedded in source pages? Answers to these questions call for distilling knowledge from the web’s wealth of heterogeneous digital data into a web of knowledge. But how? Or, even more fundamentally, what, precisely, is this web of knowledge, and what is required to enable it? To answer these questions, we proffer a theoretical foundation for a web of knowledge: We formally define a computational view of knowledge in a way that enables practical construction and use of a web of knowledge.

Content Type: Book Chapter
DOI: 10.1007/978-3-642-11829-6_15
Authors: David W. Embley, Brigham Young University, Provo Utah 84602 U.S.A.; Andrew Zitzelberger, Brigham Young University, Provo Utah 84602 U.S.A.
Book Series: Lecture Notes in Computer Science
Online ISSN: 1611-3349
Print ISSN: 0302-9743
Book Series Volume: Volume 5956/2010
Book: Foundations of Information and Knowledge Systems
Book DOI: 10.1007/978-3-642-11829-6
Print ISBN: 978-3-642-11828-9

February 01, 2010 05:41 PM

Tools and Techniques in Qualitative Reasoning about Space

As a subfield of artificial intelligence, qualitative reasoning is about that kind of knowledge representation languages and automated deduction methods that is used by scientists and engineers when a precise quantitative description of the physical bodies is not available or when a complete quantitative calculation of their relationships is not feasible. A special area of qualitative reasoning is concerned with the qualitative aspects of representing and reasoning about spatial entities. Applications of qualitative spatial reasoning (QSR) can be found in natural language processing [1], spatial information systems [8], etc. They have given rise to numerous knowledge representation languages and automated deduction methods for space.

Content Type: Book Chapter
DOI: 10.1007/978-3-642-11829-6_1
Authors: Philippe Balbiani, CNRS, Université de Toulouse, Institut de recherche en informatique de Toulouse, 118 route de Narbonne, 31062 Toulouse Cedex 9, France
Book Series: Lecture Notes in Computer Science
Online ISSN: 1611-3349
Print ISSN: 0302-9743
Book Series Volume: Volume 5956/2010
Book: Foundations of Information and Knowledge Systems
Book DOI: 10.1007/978-3-642-11829-6
Print ISBN: 978-3-642-11828-9

February 01, 2010 05:41 PM

Linguistic Systems, Inc. Unveils New Online Translation Service at ...

Linguistic Systems, Inc. (LSI), a leading provider of language translation services, today unveiled its Select Translation Service version 2.0 at LegalTech 2010 in New York City (Booth # 315). The latest system provides online access to four levels ...

February 01, 2010 04:31 PM

FreeCulture.org - Students for Free Culture

Gifts for Free Culture X Registration!

Free Culture X is only two weeks away. If you have not yet registered, now is the time to do it! Register now! Give a dollar, $25, $100—it’s up to you. 100% of the proceeds will fund future Students for Free Culture projects. To sweeten the deal for you, we’re announcing some cool gifts: •If you register at [...]

February 01, 2010 04:28 PM

ubiquity-firefox Google Group

List All Currently Open Tabs

Is it possible to have a command that lists all open tabs? It would be
twice as good as to display them all in a grid with a close button in
the top right corner and when you click them you'd be taken to said
tab, but I'll happily settle for a list.

February 01, 2010 03:37 PM

Linked Data Blog Aggregator

Wash down the Apple tablet with a gulp of Kool Aid

I’m not in the least bit excited about the iPad, and it seems I’m not alone. The mood seems to have changed since before the launch, with countless tech journalists previously falling over themselves to declare tablets the next big thing. (Thankfully Rory Cellan-Jones from the BBC was more measured, focusing on personal projectors as a more exciting development). The mood since is considerably more downbeat, and I think more realistic.

I may be missing some crucial usage context that reveals the killer characteristics of the iPad, but I’ve tried really hard and still nothing. There are many obvious practical issues with the device:

The only scenarios I can conjure up where I could imagine using the device are:

Neither of these, or even both, are very compelling at all. TVs are getting good for viewing photos, by including e.g. an SD card slot, and rumours of the death of paper are greatly exaggerated.

Perhaps the most annoying thing about the scenarios used to promote the device is the one about the San Francisco to Tokyo flight, watching video all the way without running out of battery. Any airline with planes worth boarding has personal video screens. I don’t want to bring my own. I’d rather use the space to carry a decent pair of noise-canceling headphones, which I’m sure increase my enjoyment of onboard media far more than a little bit of extra screen real estate. The development I want to see is not a new device that I have to prop on the flimsy airline table, hold tight when we hit some turbulence, and stow away when my food arrives, but the capability to connect my own device to the in-built screen via USB or Bluetooth. Even a bare USB port with power but no connectivity would be a start, allowing me to run low-powered devices (that I already own) during long flights.

OK, so the flight reference is just a touchstone for how long the device can run without mains power, but I think it demonstrates a lack of grounding of the device in realistic scenarios.

Any new device has to have two key characteristics these days for me to get excited: interoperability and convergence. The iPad seems to have very little of either. You could argue that it offers some convergence between smartphones and e-readers, but that’s about as exciting as convergence between a smartphone and a wall clock.

I’m left wondering what the iPad is competing against. I’m guessing it’s paper, whether that’s in the form of a book, brochure, newspaper, restaurant menu or whatever. Unfortunately for Apple, paper is pretty well suited to each of these, especially when you introduce bath water, the risk of theft, or just ketchup, into the equation. Perhaps this is the end of electronic picture frames as a dedicated device? Probably about time. Maybe the iPad will make an excellent Spotify console for the living room. Who knows? Whatever happens I can’t see this becoming a mass-market product worthy of even a fraction of the hype.

Where I wish that Apple had expended their creative talent was in addressing the power issue. Not in making sure I could watch 10 hours of back to back video, but in enabling me to spend that energy in whatever way I choose, powering whichever device I choose. It drives me crazy that I carry several batteries around, and short of running my phone off my laptop via USB there is no interoperability between these power sources. If Apple could produce a universal power supply that was sleek, sexy, efficient and interoperable, then I would be interested. Sadly this doesn’t seem to be the way.


February 01, 2010 03:30 PM

MusicBrainz Blog

Sorry for the downtime!

Hi folks, just a quick announcement – our main web server frontend had some minor problems earlier, but we’ve since sorted it out. It seems to have only affected the web frontend, and not other services such as the web service.

We’re back up now though!

February 01, 2010 01:19 PM

Browse Blogs

The Alexandria Project, Chap. 3: I just HATE it when that Happens

…Sure enough, as Frank strode up the half-lit corridor in Cube City, there was Rick standing next to his cubicle, coffee cup in hand.  His face lit up as soon as he saw Frank.  “Morning, Frank,” he called out.  “Recovered from your big Saturday night yet?”  He raised his coffee cup in a mock toast and leaned casually against his cube so Frank could barely squeeze past. 

February 01, 2010 01:12 PM

JMLR

A Survey of Accuracy Evaluation Metrics of Recommendation Tasks; Asela Gunawardana, Guy Shani; 10(Dec):2935--2962, 2009.

Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric

February 01, 2010 08:03 AM

Efficient Online and Batch Learning Using Forward Backward Splitting; John Duchi, Yoram Singer; 10(Dec):2899--2934, 2009.

We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as l_1. We derive
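
The two-phase iteration described above can be illustrated for the l_1 case, where the second phase has a well-known closed-form solution (coordinate-wise soft-thresholding). This is only a minimal sketch and not the authors' code: the function names, the step size `eta` and the regularization weight `lam` are assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fobos_l1_step(w, grad, eta, lam):
    """One forward-backward splitting step with an l_1 regularizer.

    w, grad: lists of floats (current weights and loss gradient at w).
    eta: step size (assumed name); lam: l_1 weight (assumed name).
    """
    # Phase 1: unconstrained gradient descent step on the loss.
    w_half = [wi - eta * gi for wi, gi in zip(w, grad)]
    # Phase 2: minimize 0.5*||w - w_half||^2 + eta*lam*||w||_1,
    # which separates per coordinate into soft-thresholding.
    return [soft_threshold(v, eta * lam) for v in w_half]
```

Small coordinates are set exactly to zero in the second phase, which is how the sparsity mentioned in the abstract arises in this special case.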

February 01, 2010 08:03 AM

Online Learning with Samples Drawn from Non-identical Distributions; Ting Hu, Ding-Xuan Zhou; 10(Dec):2873--2898, 2009.

Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given

February 01, 2010 08:03 AM

Adaptive False Discovery Rate Control under Independence and Dependence; Gilles Blanchard, Étienne Roquain; 10(Dec):2837--2871, 2009.

In the context of multiple hypothesis testing, the proportion Π_0 of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficiency is called adaptive. In this paper, we focus on the issue of false discovery rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR. In a first part, assuming independence of the p-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation
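
To make the role of Π_0 concrete, here is a hedged sketch of the classical Benjamini-Hochberg step-up procedure, with an optional `pi0` argument that rescales the target level to `alpha / pi0`. This is the standard baseline, not one of the new procedures from the paper; the function name and the `pi0` parameter are assumptions, and how `pi0` would be estimated from data is left out entirely.

```python
def benjamini_hochberg(p_values, alpha=0.05, pi0=1.0):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule.

    pi0 is an assumed plug-in value for the proportion of true nulls;
    pi0=1.0 gives the plain (non-adaptive) procedure.
    """
    m = len(p_values)
    # Indices of the p-values in increasing order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    level = alpha / pi0
    k = 0  # largest rank whose p-value falls under the step-up threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * level / m:
            k = rank
    # Reject the k smallest p-values.
    return sorted(order[:k])
```

With `pi0 < 1` the thresholds loosen, so the procedure can reject more hypotheses at the same nominal level; that is the intuition behind calling such procedures adaptive.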

February 01, 2010 08:03 AM

Cautious Collective Classification; Luke K. McDowell, Kalyan Moy Gupta, David W. Aha; 10(Dec):2777--2836, 2009.

Many collective classification (CC) algorithms have been shown to increase accuracy when instances are interrelated. However, CC algorithms must be carefully applied because their use of estimated labels can in some cases decrease accuracy. In this article, we show that managing this label uncertainty through cautious algorithmic behavior is essential to achieving maximal, robust performance. First, we describe cautious inference and explain how four well-known families of CC algorithms can be parameterized to use varying degrees of such caution. Second, we introduce cautious learning and show how it can be used to improve the performance of almost any CC algorithm, with or without cautious

February 01, 2010 08:03 AM

Reproducing Kernel Banach Spaces for Machine Learning; Haizhang Zhang, Yuesheng Xu, Jun Zhang; 10(Dec):2741--2775, 2009.

We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semi-inner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established.

February 01, 2010 08:03 AM

Learning Halfspaces with Malicious Noise; Adam R. Klivans, Philip M. Long, Rocco A. Servedio; 10(Dec):2715--2740, 2009.

We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: Learning origin-centered halfspaces in R^n with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε^2 / log(n/ε)). (The best previous result

February 01, 2010 08:03 AM

Structure Spaces; Brijnesh J. Jain, Klaus Obermayer; 10(Nov):2667--2714, 2009.

Finite structures such as point patterns, strings, trees, and graphs occur as "natural" representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces

February 01, 2010 08:03 AM

Bounded Kernel-Based Online Learning; Francesco Orabona, Joseph Keshet, Barbara Caputo; 10(Nov):2643--2666, 2009.

A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory

February 01, 2010 08:03 AM

DL-Learner: Learning Concepts in Description Logics; Jens Lehmann; 10(Nov):2639--2642, 2009.

In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems. It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service.

February 01, 2010 08:03 AM

NLP News

Exploiting geographic references of documents in a geographical information retrieval system using an ontology-based index

Abstract: Both Geographic Information Systems and Information Retrieval have been very active research fields in the last decades. Lately, a new research field called Geographic Information Retrieval has appeared from the intersection of these two fields. The main goal of this field is to define index structures and techniques to efficiently store and retrieve documents using both the text and the geographic references contained within the text. We present in this paper two contributions to this research field. First, we propose a new index structure that combines an inverted index and a spatial index based on an ontology of geographic space. This structure improves the query capabilities of other proposals. Then, we describe the architecture of a system for geographic information retrieval that defines a workflow for the extraction of the geographic references in documents. The architecture also uses the index structure that we propose to solve pure spatial and textual queries as well as hybrid queries that combine both a textual and a spatial component. Furthermore, query expansion can be performed on geographic references because the index structure is based in an ontology.

Content Type: Journal Article
DOI: 10.1007/s10707-010-0106-3
Authors: Nieves R. Brisaboa, Miguel R. Luaces, Ángeles S. Places, Diego Seco (all University of A Coruña, Database Laboratory, Campus de Elviña, 15071 A Coruña, Spain)
Journal: GeoInformatica
Online ISSN: 1573-7624
Print ISSN: 1384-6175
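
The idea of answering textual, spatial and hybrid queries from one combined index can be illustrated with a toy sketch. It is not the index structure from the paper: it uses a plain place-name hierarchy instead of a real spatial index with geometries, and the `PARENT` table, class name and method names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy ontology of geographic space: place -> containing place.
PARENT = {
    "A Coruña": "Galicia",
    "Vigo": "Galicia",
    "Galicia": "Spain",
    "Madrid": "Spain",
}


def descendants(place):
    """Return the place itself plus every place contained in it."""
    found = {place}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


class HybridIndex:
    """Inverted index on terms plus a lookup of documents by place name."""

    def __init__(self):
        self.by_term = defaultdict(set)
        self.by_place = defaultdict(set)

    def add(self, doc_id, terms, places):
        for term in terms:
            self.by_term[term.lower()].add(doc_id)
        for place in places:
            self.by_place[place].add(doc_id)

    def query(self, term=None, place=None):
        """Textual query, spatial query, or both (results are intersected)."""
        results = None
        if term is not None:
            results = set(self.by_term.get(term.lower(), set()))
        if place is not None:
            # Query expansion: a place also matches everything it contains.
            spatial = set()
            for p in descendants(place):
                spatial |= self.by_place.get(p, set())
            results = spatial if results is None else results & spatial
        return results if results is not None else set()
```

In this sketch a query for a region also returns documents that mention only places inside it, which is the kind of query expansion on geographic references that an ontology makes possible.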

February 01, 2010 06:47 AM

information aesthetics

Making Digital Content on the Mobile Phone Physically Graspable

Fabian Hemmert [fabianhemmert.com] tries to answer the question "How to Make Digital Content Graspable?" in a quite original way. In his short TEDx talk, and the accompanying movie which you can watch below, you can check out his inventions for depicting and interacting with data on a mobile phone in three different ways.

The weight-shifting method allows a phone to communicate to users where to walk by dynamically changing its gravitational center along two axes. The shape-changing method conveys where more information is located outside of the screen by changing the thickness of a phone at its corners. And lastly, the 'living' method allows a mobile phone to display emotional states through a continuous heartbeat and breathing-like motion that can be felt ambiently in your trouser pocket.

See also Physical Weight of Data, DataMorphose and ce.real.


February 01, 2010 01:17 AM

World Map of Barcelona Natural Science Museum Biodiversity Data

The Natural Science Museum of Barcelona Data Base [bioexplora.cat] contains more than 150 years' worth of biological records collected around the world by the Natural Science Museum in Barcelona. The database consists of about 50,000 different records of molluscs, vertebrates and arthropods. All the information is structured following the Darwin Core Standard, developed by the Global Biodiversity Information Facility.

The interactive map allows users to drag and zoom, turn geographical layers on or off, and click on an active 'square' to download all the records located in the selected area. Circular frames show the detailed locations of individual finds.


February 01, 2010 12:40 AM

January 31, 2010

ubiquity-firefox Google Group

Google Site Search?

I saw another product that allowed you to type "gs this <term>", I'd
like to have Ubiquity "search this <term>" is that possible? and can
results be displayed in-line?

January 31, 2010 11:59 PM

Linked Data Blog Aggregator

The Business Of Linked Data (BOLD) Discussion Space

I've created a new discussion space that's squarely focused on the business development and marketing aspects of "HTTP based Linked Data" (Linked Data). As its name indicates, it's a BOLD attempt to fill a VoiD. :-)

Background

A few months ago, Aldo Bucchi posted a message to the LOD mailing list seeking a discussion space for more business and marketing oriented topics in relation to Linked Data. At the time, my assumption was that the existing LOD mailing list served that purpose absolutely fine, but in due course I came to realize that Aldo's request had a much larger foundation than I initially suspected.

Historic Oversight

Linked Data, like its umbrella Semantic Web Project, has suffered from an inadvertent oversight on the parts of many of its enthusiasts (myself included): 100% of the discussion spaces are created by, geared towards, or dominated by researchers (from Academia primarily) and/or developers. Thus, at the very least, we've been operating in an echo chamber that only feeds the existing void between the core community and those who are more interested in discussing business and marketing related topics.

The new discussion space seeks to cover the following:

  1. Brainstorming Value Proposition Articulation
  2. War Story Exchanges
  3. Case Studies and Use-cases
  4. Market Research & Positioning (for instance Linked Data is killer technology that redefines Data Integration, but none of the major research firms currently make that connection)

How Do I Join The Conversation? Simply sign up on the Google hosted BOLD mailing list, introduce yourself (ideally), and then start conversing! :-)

January 31, 2010 10:48 PM

Getting The Linked Data Value Pyramid Layers Right (Update #2)

One of the real problems that pervades all routes to Linked Data value prop. incomprehension stems from the layering of its value pyramid; especially when communicating with -initially detached- end-users.

Note to Web Programmers: Linked Data is about Data (Wine) and not about Code (Fish). Thus, it isn't a "programmer only zone", far from it. More than anything else, it's inherently inclusive and spreads its participation net widely across: Data Architects, Data Integrators, Power Users, Knowledge Workers, Information Workers, Data Analysts, etc. Basically, everyone that can "click on a link" is invited to this particular party; remember, it is about "Linked Data" not "Linked Code", after all. :-)

Problematic Value Pyramid Layering

Here is an example of a Linked Data value pyramid that I am stumbling across --with some frequency-- these days (note: 1 being the pyramid apex):

  1. SPARQL Queries
  2. RDF Data Stores
  3. RDF Data Sets
  4. HTTP scheme URIs

Basically, Linked Data deployment (assigning dereferenceable HTTP URIs to DBMS records, their attributes, and [optionally] attribute values) is occurring last. Even worse, this happens in the context of Linked Open Data oriented endeavors, resulting in nothing but confusion or inadvertent perpetuation of the overarching pragmatically challenged "Semantic Web" stereotype.

As you can imagine, hitting SPARQL as your introduction to Linked Data is akin to hitting SQL as your introduction to Relational Database Technology; neither is an elevator-style value prop. relay mechanism.

In the relational realm, killer demos always started with desktop productivity tools (spreadsheets, report-writers, SQL QBE tools etc.) accessing, relational data sources en route to unveiling the "Productivity" and "Agility" value prop. that such binding delivered i.e., the desktop application (clients) and the databases (servers) are distinct, but operating in a mutually beneficial manner to all, courtesy of a data access standards such as ODBC (Open Database Connectivity).

In the Linked Data realm, learning to embrace and extend best practices from the relational dbms realm remains a challenge, a lot of this has to do with hangovers from a misguided perception that RDF databases will somehow completely replace RDBMS engines, rather than compliment them. Thus, you have a counter productive variant of NIH (Not Invented Here) in play, taking us to the dreaded realm of: Break the Pot and You Own It (exemplified by the 11+ year Semantic Web Project comprehension and appreciation odyssey).

From my vantage point, here is how I believe the Linked Data value pyramid should be layered, especially when communicating the essential value prop.:

  1. HTTP URLs -- LINKs to documents (Reports) that users already appreciate, across the public Web and/or Intranets
  2. HTTP URIs -- typically not visually distinguishable from the URLs, so use the Data exposed by de-referencing a URL to show how each Data Item (Entity or Object) is uniquely identified by a Generic HTTP URI, and how clicking on the said URIs leads to more structured metadata bearing documents available in a variety of data representation formats, thereby enabling flexible data presentation (e.g., smarter HTML pages)
  3. SPARQL -- when a user appreciates the data representation and presentation dexterity of a Generic HTTP URI, they will be more inclined to drill down an additional layer to unravel how HTTP URIs mechanically deliver such flexibility
  4. RDF Data Stores -- at this stage the user is interested in the data sources behind the Generic HTTP URIs, courtesy of a natural desire to tweak the data presented in the report; thus, you now have an engaged user ready to absorb the "How Generic HTTP URIs Pull This Off" message
  5. RDF Data Sets -- while attempting to make or tweak HTTP URIs, users become curious about the actual data loaded into the RDF Data Store, which is where the data sets used to create powerful Lookup Data Spaces come into play, such as those from the LOD constellation as exemplified by DBpedia (extractions from Wikipedia).

January 31, 2010 10:46 PM

What is the DBpedia Project? (Updated)

The recent Wikipedia imbroglio centered around DBpedia is the fundamental driver for this particular blog post. At time of writing this blog post, the DBpedia project definition in Wikipedia remains unsatisfactory due to the following shortcomings:

  1. inaccurate and incomplete definition of the Project's What, Why, Who, Where, When, and How
  2. inaccurate reflection of project essence, by skewing focus towards data extraction and data set dump production, which is at best a quarter of the project.

Here are some insights on DBpedia, from the perspective of someone intimately involved with the other three-quarters of the project.

What is DBpedia?

A live Web accessible RDF model database (Quad Store) derived from Wikipedia content snapshots, taken periodically. The RDF database underlies a Linked Data Space comprised of: HTML (and most recently HTML+RDFa) based data browser pages and a SPARQL endpoint.

Note: DBpedia 3.4 now exists in snapshot (warehouse) and Live Editions (currently being hot-staged). This post is about the snapshot (warehouse) edition; I'll drop a different post about the DBpedia Live Edition, where a new Delta-Engine covers both extraction and database record replacement in real time.

When was it Created?

As an idea under the moniker "DBpedia" it was conceptualized in late 2006 by researchers at the University of Leipzig (led by Sören Auer) and Freie Universität Berlin (led by Chris Bizer). The first public instance of DBpedia (as described above) was released in February 2007. The official DBpedia coming out party occurred at WWW2007, Banff, during the inaugural Linked Data gathering, where it showcased the virtues and immense potential of TimBL's Linked Data meme.

Who's Behind It?

OpenLink Software (developers of OpenLink Virtuoso and providers of Web Hosting infrastructure), the University of Leipzig, and Freie Universität Berlin. In addition, there is a burgeoning community of collaborators and contributors responsible for DBpedia based applications, cross-linked data sets, ontologies (OpenCyc, SUMO, UMBEL, and YAGO) and other utilities. Finally, DBpedia wouldn't be possible without the global content contribution and curation efforts of Wikipedians, a point typically overlooked (albeit inadvertently).

How is it Constructed?

The steps are as follows:

  1. RDF data set dump preparation via Wikipedia content extraction and transformation to RDF model data, using the N3 data representation format - Java and PHP extraction code produced and maintained by the teams at Leipzig and Berlin
  2. Deployment of Linked Data that enables Data browsing and exploration using any HTTP aware user agent (e.g. basic Web Browsers) - handled by OpenLink Virtuoso (handled by Berlin via the Pubby Linked Data Server during the early months of the DBpedia project)
  3. SPARQL compliant Quad Store, enabling direct access to database records via SPARQL (Query language, REST or SOAP Web Service, plus a variety of query results serialization formats) - OpenLink Virtuoso since first public release of DBpedia

In a nutshell, there are four distinct and vital components to DBpedia. Thus, DBpedia doesn't exist if all the project offers is a collection of RDF data dumps. Likewise, it doesn't exist if you have a SPARQL compliant Quad Store without loaded data sets, and of course it doesn't exist if you have a fully loaded SPARQL compliant Quad Store that isn't up to the cocktail of challenges presented by live Web accessibility.

Why is it Important?

It remains a live exemplar for any individual or organization seeking to publish or exploit HTTP based Linked Data on the World Wide Web. Its existence continues to stimulate growth in both density and quality of the burgeoning Web of Linked Data.

How Do I Use it?

In the most basic sense, simply browse the HTML pages en route to discovering relationships that exist across named entities and subject matter concepts / headings. Beyond that, simply look at DBpedia as a master lookup table in a Web hosted distributed database setup, enabling you to mesh your local domain specific details with DBpedia records via structured relations (triples, i.e., 3-tuple records) comprised of HTTP URIs from both realms, e.g., owl:sameAs relations.
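To make the "master lookup table" idea concrete, here is a minimal sketch of meshing local records with DBpedia via owl:sameAs triples. The local namespace `http://example.com/id/` and the two sample records are assumptions made up for illustration; only the owl:sameAs predicate URI and the DBpedia resource namespace are real. The output is plain N-Triples built with string formatting, not the output of any particular RDF library.

```python
# Hypothetical local namespace and records; the owl:sameAs predicate and the
# DBpedia resource namespace are the only real identifiers used here.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DBPEDIA = "http://dbpedia.org/resource/"
LOCAL = "http://example.com/id/"  # assumed local namespace


def same_as_triples(local_to_dbpedia):
    """Return (subject, predicate, object) triples linking local URIs to DBpedia."""
    return [
        (LOCAL + local_id, OWL_SAME_AS, DBPEDIA + dbpedia_name)
        for local_id, dbpedia_name in sorted(local_to_dbpedia.items())
    ]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


links = same_as_triples({"country/de": "Germany", "city/berlin": "Berlin"})
print(to_ntriples(links))
```

Each printed line is one 3-tuple whose subject comes from the local realm and whose object comes from DBpedia, which is all that the "Master-Details" meshing requires at the data level.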

What Can I Use it For?

Expanding on the Master-Details point above, you can use its rich URI corpus to alleviate tedium associated with activities such as:

  1. List maintenance - e.g., Countries, States, Companies, Units of Measurement, Subject Headings etc.
  2. Tagging - as a complement to existing practices
  3. Analytical Research - you're only a LINK (URI) away from previously difficult to obtain research data spread across a broad range of topics
  4. Closed Vocabulary Construction - rather than commence the futile quest of building your own closed vocabulary, simply leverage Wikipedia's human curated vocabulary as our common base.

January 31, 2010 10:45 PM

ocropus Google Group

Introduction

Hi,
A couple of days ago I sent a message to the group with a help
request regarding page segmentation. At that time I missed your rule
that a first post should include a short
introduction. I am working at a startup which deals with NLP; we are
exploring the possibility to extend our products with the

January 31, 2010 05:31 PM

natural language processing blog

Coordinate ascent and inverted indices...

Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should handroll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of words appearing in that document. If there are N documents of length L, then answering a query is O(N*L) since we have to look over each document to see if it contains the word we care about. The alternative is to store an inverted index, where we have a list of words and for each word, we store the list of documents it appears in. Answering a query here is something like O(1) if we hash them, O(log |V|) if we do binary search (V = vocabulary), etc. Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook. And the computational difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany" versus going to the index in the back of the book.

Now, let's say we have an inverted index for, say, the web. It's pretty big (and in all honesty, probably distributed across multiple storage devices or multiple databases or whatever). But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; here's word v and its documents.
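As a small illustration of the two layouts, here is a sketch that builds a hashed inverted index and compares it with the scan over every document. The three toy documents and the function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for word in words:
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def scan_lookup(docs, word):
    """Forward layout: look over every document, O(N*L)."""
    return [doc_id for doc_id, words in enumerate(docs) if word in words]


def index_lookup(index, word):
    """Inverted layout: one hash lookup returns the posting list."""
    return index.get(word, [])


docs = [
    ["germany", "borders", "france"],
    ["france", "exports", "wine"],
    ["germany", "exports", "cars"],
]
index = build_inverted_index(docs)
print(index_lookup(index, "germany"))
```

Both lookups return the same document ids; the difference is only in how much data has to be touched to get them.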

Suppose that, outside of the index, we have a classification task over the documents on the web. That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document. It's either +1, -1 or ? (? == unknown, or unlabeled).

My argument is that this is a very plausible set up for a very large scale problem.

Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than iterating over words (where I'm assuming words == features, for now...). That would be ridiculously expensive.

The alternative is to do some sort of coordinate ascent algorithm. These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well. (In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.)
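To show what "iterating over words rather than documents" could look like, here is a rough sketch of coordinate-wise gradient ascent on a logistic log-likelihood, driven by a linear scan of the index. This is not Goodman's algorithm or the 2009 paper's; it is a plain fixed-step update chosen for brevity, with binary word features, and the function name, step size, and toy data are all assumptions. The only state kept outside the index is one cached score per labeled document.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coordinate_ascent(index, labels, passes=50, step=0.5):
    """Coordinate-wise gradient ascent on the logistic log-likelihood.

    index:  word -> list of doc ids (binary features, one posting list per word)
    labels: doc id -> +1 or -1; unlabeled ("?") documents are simply absent
    """
    weights = {word: 0.0 for word in index}
    scores = {doc: 0.0 for doc in labels}  # cached w.x for each labeled doc
    for _ in range(passes):
        for word, postings in index.items():  # linear scan of the index
            labeled = [d for d in postings if d in labels]
            # Partial derivative of sum_d log sigmoid(y_d * score_d) w.r.t. this word
            grad = sum(
                labels[d] * (1.0 - sigmoid(labels[d] * scores[d])) for d in labeled
            )
            delta = step * grad
            weights[word] += delta
            for d in labeled:  # keep the cached scores in sync
                scores[d] += delta
    return weights, scores


# Toy index: doc 4 is unlabeled and is skipped by the updates.
index = {"great": [0, 1], "awful": [2, 3], "movie": [0, 2, 4]}
labels = {0: +1, 1: +1, 2: -1, 3: -1}
weights, scores = coordinate_ascent(index, labels)
print(weights)
```

Each update touches only one posting list plus the label lookups for the documents in it, which is the access pattern the inverted index makes cheap.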

Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector+LASSO = DASSO), but they wouldn't really scale well in this problem because it would require a single pass over the entire index to make one update. Other approaches, such as boosting, etc., would fare very poorly in this setting.

This observation first led me to wonder if we can do something LASSO-like or boosting-like in this setting. But then that made me wonder if this is a special case, or if there are other cases in the "real world" where your data is naturally laid out as features * data points rather than data points * features. Sadly, I cannot think of any. But perhaps that's not because there aren't any.

(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)

January 31, 2010 12:37 PM

January 30, 2010

tesseract-ocr Google Group

Handwriting recognition

Hi -
I'm trying to put together a handwriting recognition system and am
looking for any advice that this group can offer.
I'm not interested in something simplistic like the restricted stroke
alphabet used in some PDAs. I'm aiming for something that can
recognize longhand cursive or shorthand, so this is rather ambitious.

January 30, 2010 08:27 PM

Communia Publications

7th Communia Workshop abstracts and initial statements by speakers and chairpersons

This document collects abstracts and initial position statements submitted by chairpersons and speakers before the workshop.

January 30, 2010 06:49 PM

ocropus Google Group

Page segmentation

Hi,
Having a screenshot of the web page I need to identify regions
containing navigation menus, content areas and advertisement blocks.
Based on the following posts in the group
[link]
[link]

January 30, 2010 08:48 AM

January 29, 2010

Open Book Alliance

Opposition Pours In

This week was the perfect storm for opponents of the revised GBS with condemnation raining down on the flawed book grab from around the globe.

Thursday was the last day for public filings in opposition to the settlement and there was no shortage of opponents or different takes on why the GBS was destructive, unfair and beyond salvation.

Consumer groups were especially active with comments filed from Public Knowledge and Consumer Watchdog.  In its brief, Consumer Watchdog noted that not only is the settlement an unconstitutional attempt to revise copyright law, but it also continues to award Google an unlawful and anti-competitive monopoly.

From the authors’ camp, filings were delivered to Judge Chin from the Science Fiction and Fantasy Writers of America and American Society of Journalists and Authors, Inc. Noted U.S. author and anti-GBS petition signer Ursula K. Le Guin also shared her case against the revised deal with the court.

And while the revised settlement cut out many non-English writing authors in an attempt to placate international objections to the deal – while also shattering the illusion of the library as being a representation of humanity’s knowledge – another round of objections sprang up from all corners of the globe.  Groups representing Japanese authors and Indian authors joined advocates from Germany, France, Italy and the U.K. in denouncing the plan.

We can only hope that Judge Chin and the settlement parties are listening to this overwhelming and varied outpouring of opposition.

January 29, 2010 10:33 PM

Open Knowledge Foundation Blog

CERN opens up bibliographic metadata!

As regular readers of the Open Knowledge Foundation blog will know, bibliographic metadata is a subject close to our heart (see e.g., here, here and here). Hence we were delighted to see today’s announcement that CERN Library are releasing their bibliographic metadata under an open license! From the announcement: Librarians are in general very favourable [...] Related posts:

  1. Response to ‘The Future of Bibliographic Control’ draft from the Library of Congress
  2. Biblios - “world’s largest database” of open bibliographic data goes beta!
  3. Open Bibliographic Data: The State of Play

January 29, 2010 08:32 PM

Science Commons

MichiganView releases remote sensing data under CC0 waiver

Puneet Kishor is a Science Commons Fellow, specializing in geospatial issues and open data, and a guest blogger here at Science Commons. Starting Jan 28, 2010, MichiganView is making available all of its more than 93 Gigabytes of Landsat 5 and 7, and NAIP imagery data in the public domain using the new CC0 Waiver [...]

January 29, 2010 08:27 PM

MusicBrainz Blog

Testing the NGS live data feed

If anyone would like to test out the NGS replication, which I’ve set up from the test server, follow these instructions:

Good luck!

January 29, 2010 07:09 PM

Linked Data Blog Aggregator

BBC Semantic Web use-case

After a very long time writing it, we finally have a BBC Semantic Web use-case on the W3C website! It describes work we did around BBC Programmes, BBC Music, BBC Wildlife Finder and Search+. I hope it all makes a bit of sense :-) For a more detailed writeup about these issues, Patrick's Linked Data on the BBC are very good.

January 29, 2010 03:37 PM

ocropus Google Group

Newcomer: Intros

Hey Folks,
I am coming from a few different areas
* Geogeek involved with FOSS for geomatics
* Ubuntu user and long time Linux convert
* heavy user of FOSS tools
* python scripting (always learning)
* willing to bash the CLI but love it when a project matures to a
simple GUI
* Willing advocate of mature FOSS projects that can solve real world

January 29, 2010 03:18 PM

W3C Semantic Web Activity News

New SW Use Case by the BBC

The BBC has provided a W3C Semantic Web Use Case on how Semantic Web Technologies are used on some of the BBC’s Web Sites. The main characteristic of the BBC’s approach is to use the Web as a Content Management System. Sites like BBC Music, BBC Programmes, or the BBC Wildlife Finder rely on external, publicly available datasets like Musicbrainz or Wikipedia; the BBC sites themselves show an aggregated view of this information, put in a BBC context. Furthermore, the BBC also creates Web identifiers for every item it has interest in; RDF representations of these Web identifiers allow developers to use the BBC’s data to build applications.

January 29, 2010 03:13 PM

EFF.org Updates

Blogging ACTA Across The Globe: Lessons From Korea

If there's one country that might have insight into what a post-ACTA future may look like, it's the Republic of Korea. Korea is known as having one of the most advanced networks in the world, but more recently it has also been the recipient of some of the strongest foreign pressure to ramp up its IP laws. Heesob Nam is a member (and former Chair) of IPLeft, a Korean digital rights activist group founded in 1999 to critique the increasingly maximalist IP rights agenda in that country, and research and present alternative policy proposals. He writes of the impact on Korea of ACTA and other international IP agreements.

For Korea, ACTA is the Anti-Commons Trade Agreement

In August 2008, our group, IPLeft, demanded that the Korean government disclose relevant information about its stance on the negotiation of ACTA. The disclosure was denied, as was our appeal. The reason for the denial was unconvincing: the disclosure, we were told, would result in "a harmful effect on a diplomatic relationship with foreign countries and severe damage to considerable national interests".

How does the participation in an international cooperation to combat the trade of "counterfeit and pirated goods" harmfully impact foreign relationships? Which national interests are to be damaged by open and transparent discussion? Unlike its attitude to civic society and the general public, it turned out that the Korean government already provided relevant information to, and sought opinions from, particular business groups from the earliest stages of the negotiation, at least from November 2007.

When it comes to ACTA, transparency and openness became principles that apply only to a small number of business interests. This is why the secrecy of ACTA is so bad: it mirrors a particular perspective that views the system of intellectual property as a means for maximizing commercial profit and which pays little attention to the broader social, cultural and economic implications of the IP system.

This imbalanced and biased approach is infused into the draft texts that we have seen. The draft chapters on civil enforcement, criminal enforcement, and border measures lack procedural justice and fairness. They improperly promote the interests of IP holders to the detriment of the other party in civil, criminal and administrative proceedings.

The provisions contained in the proposed Internet Chapter appear to impose undue obligations on ISPs. The extent to which ISPs are to be liable for copyright infringement by users is a matter of domestic cultural policy, not a trade issue. Careful balancing of interests and fine-tuning are necessary, including factors specific to local culture and environment which cannot be concluded in a closed room occupied by trade negotiators.

More significantly, the liability of ISPs is of great importance not merely for the protection of copyright: it is important for the protection and realization of everyone’s right to take part in cultural life as declared in legally binding international human rights instruments. One of our concerns about ACTA is the risk of undercutting the principle of the rule of law and the possible conflict with human rights, in particular with the right to a fair trial, the right to equality before courts and tribunals, the right to equality of arms, and the right to be presumed innocent. ACTA tries to introduce substantial changes in civil and criminal procedures. But the proposed changes give rise to issues of procedural justice and fairness, jeopardizing Korea’s obligations under the international human rights instruments, e.g., the International Covenant on Civil and Political Rights, and potentially weakening the democratic values recognized in our Constitution.

For instance, pursuant to the US-Japan joint proposal, any provisional measures such as a preliminary injunction may be rendered by judicial authorities without a prior hearing of the alleged infringer. Here, neither "irreparable harm to the right holder" nor "a demonstrable risk of evidence being destroyed" is explicitly required. Even the Customs office may take an ex-officio action to suspend the release of suspected copyright or trademark infringing goods. Moreover, right holders may be awarded a predetermined amount of damages without having a burden to show the amount of damage or even when the amount is greater than actual damage. An even more severe breach of principles of procedural justice is found in a so-called "camcorder provision" under which anyone who attempts to use an audiovisual recording device to make a copy of any part of an audiovisual work in a theater may be criminally punished. This out-of-proportion rule not only produces a direct conflict with the right to be presumed innocent but also undermines the principle of fair use or fair dealing.

National autonomy is vital in order to decide the proper level of local IP protection and enforcement. Korean IP law has undergone substantial revision due to the threat of trade sanctions from both the US and the EU since the early 1980s. This economic coercion has continued for about thirty years, and has led to an emergence of consistent domestic pressure for stronger IP protection.

Interestingly, the strongest advocates for these reforms in Korea are not the IP industries: they are the executive branches in government which claim competance over the administration of patent, trademark, and copyright. To them, stronger IP protection and enforcement is a chance to enhance their position. The unending economic pressure and the heavy reliance of our domestic economy upon exports have produced this environment. The problem is that these state actors are much more influential than other, emerging local businesses, because they possess institutional capacities and resources to promote a maximalist IP regulatory culture.

With this power, these government agencies have introduced new laws in Korea which may well be used to support controversial provisions currently being discussed in Guadalajara, Mexico. Examples include a filtering obligation imposed on certain online service providers, and a "graduated response" rule under which the Minister of Culture can suspend or terminate the Internet account of a repeat infringer or even shut down a website that the repeat infringer is using. Advocates claim that the shutting-down provision is incorporated in the US-Korea Free Trade Agreement (and Side Letters) (currently awaiting ratification), and may possibly be pushed by the US in modified form during the ACTA negotiations. If ACTA is concluded with provisions inspired by these laws and applied to Korea in the name of international harmony, our effort to reform the copyright system would be undermined, and opportunities for democratic policy discussion at the local level would be lost.

Contrary to the beliefs of ACTA negotiators, stronger criminal enforcement rules can create unintended consequences among the general public. In Korea, following the introduction of these new laws, reports of criminal copyright infringement skyrocketed from 14,838 to 90,979 between 2005 and 2008. Among these, juveniles accounted for 24% in 2008, an increase from 1.9% in 2005.

This reported increase, however, does not represent a rapid rise of the unauthorized use of copyrighted material by juveniles. Rather, it shows how criminal sanctions can be misused. Under the Korean Copyright Act, any unauthorized acts of reproduction or distribution of copyrighted works can invoke a criminal liability. This wide coverage of criminal sanction paves the way to abuse or misuse of criminal enforcement. ACTA is no different in this sense. In order to be "willful copyright piracy" under ACTA, an infringing activity needs to be "on a commercial scale". But commercial scale is defined so broadly that it covers activities with "no direct or indirect motivation of financial gain". With this broad definition, the infringement on a commercial scale may include almost every unauthorized use of copyrighted work. So, for instance, those who download a single piece of music may risk criminal penalties. In other words, ACTA opens the door to the global misuse of criminal enforcement rules, beyond even what we've seen in Korea.

Here, criminal sanctions have become a sort of new business model for lawyers acting for copyright holders (mainly the music and film industries). They monitor Internet users and send warning letters to suspected individuals threatening criminal action. In exchange for not taking the criminal action, they ask for a cash settlement. Because criminal proceedings are initiated by a complaint from the right holder, the threat of criminal action gives copyright holders leverage in settlement negotiations. Among the 90,979 complaints in 2008, 56% were settled out of court.

ACTA risks exporting Korea's criminal enforcement regime, while importing the worst of other countries' IP laws. But that's not the only reason to oppose it. A trade agreement that breaches procedural justice, fairness, transparency, and proportionality is not Anti-Counterfeiting: it's Anti-Commons.

January 29, 2010 02:27 PM

BioMed Central

How to publish raw clinical data: guidelines from Trials and the BMJ

An increasing number of peer-reviewed journals and research funding agencies require authors to make available the raw, unprocessed data supporting the findings reported in their research articles (click here for information on why this is important). Just this month, for example, The American Naturalist announced that its authors must make their data publicly available as a condition of publication and the UK Government has also recently launched an open data website.

But there is little practical guidance available on how data should be shared, particularly in clinical research, where sharing information about individuals without their consent presents risks to privacy from both a legal and an ethical perspective.

Recognizing this problem, in March 2009 the editors of the journal Trials made a commitment to produce guidance on preparing raw clinical data for publication.

"Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers" - co-published today in Trials and in the BMJ - represents that guidance. It proposes a minimum standard for anonymizing (or "de-identifying") datasets to protect patient privacy whilst allowing clinical data to be shared.

Research article
Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers
Iain Hrynaszkiewicz, Melissa L Norton, Andrew J Vickers, Douglas G Altman
Trials 2010, 11:9 (29 January 2010)

The guidance lists 28 items of personal and clinical information that can make patients identifiable and recommends that any direct identifiers, such as patients' names and addresses, should be removed from datasets before publication. Unless patients have consented to the sharing of their data, datasets containing three or more indirect identifiers, such as age or sex, should be reviewed by an independent researcher or ethics committee to determine any risks to privacy, before data are submitted for publication. If the independent review finds privacy could be at risk, alternatives to fully open access sharing of data must be considered.
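
The screening rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names and the `screen_dataset` helper are hypothetical, and the identifier sets below are placeholders rather than the 28 items listed in the guidance (only names, addresses, age and sex are mentioned in this post).

```python
# Placeholder identifier sets. Only "name", "address", "age" and "sex" come
# from the post; the other entries are illustrative assumptions and are NOT
# taken from the guidance's actual list of 28 items.
DIRECT_IDENTIFIERS = {"name", "address"}
INDIRECT_IDENTIFIERS = {"age", "sex", "occupation", "place_of_treatment"}

# "three or more indirect identifiers" triggers an independent review
REVIEW_THRESHOLD = 3


def screen_dataset(columns, consent_given=False):
    """Return (columns_to_remove, needs_independent_review) for a dataset.

    columns: iterable of column names in the dataset.
    consent_given: True if patients consented to sharing their data.
    """
    cols = {c.lower() for c in columns}
    # Direct identifiers should be removed before publication.
    to_remove = sorted(cols & DIRECT_IDENTIFIERS)
    # Without consent, three or more indirect identifiers call for review.
    indirect = cols & INDIRECT_IDENTIFIERS
    needs_review = (not consent_given) and len(indirect) >= REVIEW_THRESHOLD
    return to_remove, needs_review


if __name__ == "__main__":
    print(screen_dataset(["name", "age", "sex", "occupation", "outcome"]))
```

A check like this can only flag a dataset for human review; whether privacy is actually at risk remains a judgment for the independent researcher or ethics committee.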

Making raw clinical data available will benefit future research - and to that end, human health - and all researchers should obtain consent for sharing of supporting data when recruiting human subjects. Until this becomes a routine practice, however, concerns about patient privacy remain a common barrier to the sharing of information. This practical guidance aims to help remove this obstacle and enable other scientists and patients to benefit from full and transparent reporting of research data.

January 29, 2010 12:14 PM

MusicBrainz Blog

Database on test server updated

The database on the test server has been updated with the latest data and search indexes to match the latest work that ijabz has done. This update was in preparation for testing the replication system, which I’ll invite the general public to try in a few hours.

IMPORTANT NOTE: The passwords on the test server are now the same as the passwords from the main server as of January 23.

January 29, 2010 10:06 AM

FLOSS Manuals News

Collaborative Futures

Collaborative Futures "Collaboration on a book is the ultimate unnatural act." Tom Clancy There is something exciting about holding a freshly printed book in your ... (by AdamHyde)

January 29, 2010 03:09 AM

January 28, 2010

Browse Blogs

Limits?

Testing The Limits
by Kevin Dankwardt, Ph.D., K Computing

How big? How many? How long? Questions about the limit on things in Linux abound.

January 28, 2010 11:51 PM

EFF.org Updates

Blogging ACTA Across The Globe: The View from France

Today is day three of the seventh round of ACTA negotiations, currently taking place in Guadalajara, Mexico.

La Quadrature Du Net is a French advocacy group formed to promote digital rights and online freedom. Its name comes by analogy between the unsolvable mathematical problem of "squaring the circle", and similarly impossible attempts to "effectively control the flow of information in the digital age by the law and the technology without harming public freedoms, and damaging economic and social development". In our ongoing series of perspectives on ACTA from around the globe, today Jérémie Zimmermann and Félix Tréguer of La Quadrature du Net describe how the trade agreement undermines democratic challenges to IP policies in France and beyond.

ACTA: An agreement between lobbyists who hate democracy

For the past fifteen years, with the advent of the networked society, we have seen a growing confrontation between rightsholders, whose business-models rely on the control of culture, and ordinary citizens. Recently, a new battle has emerged, as a growing number of policy-making arenas consider implementing "three-strikes" schemes and Net filtering practices to deter filesharing.

In these important debates, citizens and public-interest groups have scored a few successes, convincing lawmakers that such proposals are at their core irreconcilable with fundamental rights and freedoms. Our own victories include amendment 138 in the Telecoms Package by the European Parliament (which provided that no restriction on users' access to the Internet could be imposed without a judicial decision), and the groundbreaking first decision of the French Constitutional Council on HADOPI, stating that the French Declaration of the Rights of Man implied the "freedom to access [public online communication] services" 1.

However, in the face of such opposition in traditional democratic forums, rightsholder lobbyists 2 have pushed for a global agreement that would impose extremist IP enforcement standards on all signatory countries.

As a result of this effort, ACTA is being negotiated outside of the traditional and relatively transparent IPR policy-making arenas, such as the WTO or WIPO. As made clear by a leaked summary of the draft Internet chapter of ACTA written by the Commission, this multilateral agreement could impose three-strikes schemes and Internet filtering practices:

To benefit from safe-harbours, ISPs need to put in place policies to deter unauthorised storage and transmission of IP infringing content (ex: clauses in customers' contracts allowing, inter alia, a graduated response).

This policy laundering endeavor on the part of the EU, the US and their home-grown IPR lobbies needs to be strongly opposed. Since the adoption of the TRIPS agreement in 1994, other countries have been compelled to implement increasingly harsher IP regimes. Bit by bit, the already shaky balance found in TRIPS between rightsholders and users of informational goods is neutralized at the expense of the latter.

The EU and the US have put increasing pressure on countries such as Canada or Brazil, which understand the value of flexible copyright and patent laws for fostering free speech and democracy, social justice and development 3. Through bilateral and multilateral agreements, international trade law is aligning with the most maximalist IP regimes that benefit only a few corporations located in the richest countries 4. Even worse is the fact that such trade agreements are debated away from public scrutiny (although they significantly impact domestic law and go way beyond tariff regulations). Through undemocratic means, the development of an inclusive knowledge society is undermined.

But with copyright and patent law, more is not always better 5 - especially for developing countries. This trend must be stopped. The ongoing negotiations on ACTA must be made transparent. Once these extremist IP enforcement measures are debated democratically, it will become clear that they do not rest on a principled basis and that they do not foster socio-economic progress. It is up to citizens and public-interest groups all around the world to act so that this fundamental debate can occur.

  1. The Constitutional Council stated that: "Article 11 of the Declaration of the Rights of Man and the Citizen of 1789 proclaims: 'The free communication of ideas and opinions is one of the most precious rights of man. Every citizen may thus speak, write and publish freely, except when such freedom is misused in cases determined by Law'. In the current state of the means of communication and given the generalized development of public online communication services and the importance of the latter for the participation in democracy and the expression of ideas and opinions, this right implies freedom to access such services." The full decision is available in PDF.
  2. Besides entertainment companies, the IP lobbyists fighting for ACTA also include the pharmaceutical and bioengineering industries. The list of the people in the U.S. industry who have been granted access to the draft ACTA is available on KEI's website.
  3. See, on this topic, Yochai Benkler, "The Wealth of Networks". Yale Press, 2006. See especially chapter 9, "Justice and Development".
  4. For recent examples regarding the EU's trade relations, see the EU-South Korea Free Trade Agreement and the EU-Canada Comprehensive Economic and Trade Agreement (CETA).
  5. See, for example, Josh Lerner, "Patent Protection and Innovation Over 150 Years". Working paper no. 8977, National Bureau of Economic Research, Cambridge, USA, 2002. Josh Lerner studied changes in intellectual property law in sixty countries over a period of 150 years. He found that when patent law was strengthened, investment in innovation for local firms slightly decreased.

January 28, 2010 11:07 PM

Free Our Data: the blog

A new No.10 petition: free PostZon

Mark Goodge added this as a comment to the data.gov.uk post, but it seems worth making more visible. So here it is: “While the launch of data.gov.uk is a big step in the right direction, the government’s response to the petition inspired by the forced closure of ernestmarples.com has been pathetic. As a consequence, I’ve created [...]

January 28, 2010 09:51 PM

EFF.org Updates

Obama Reverses Position on Disclosing Lobbyist Contacts

In yesterday's State of the Union address, President Obama made an important commitment to openness and transparency in government:

It's time to require lobbyists to disclose each contact they make on behalf of a client with my Administration or Congress.

This is welcome news. For the past few years, EFF has been litigating a Freedom of Information Act case against the government, seeking the identities of lobbyists who contacted the Department of Justice and the Office of the Director of National Intelligence on behalf of their telecommunications company clients in order to push for telecom immunity. With the help of lobbyists from AT&T, Verizon, and Sprint, the FISA Amendments Act passed with an unconstitutional provision to retroactively grant immunity to the telecoms for collaborating with the warrantless wiretapping program.

So far, the Obama Administration has been fighting hard to stop the release of the names of these representatives, appealing a court order that required disclosure. Just last month, the Obama Administration argued to the appeals court that "there is no public interest in the compelled disclosure of the representatives’ identities." To the contrary, the Administration argued, lobbyists had a "significant privacy interest in being able to communicate confidentially with the government."

While it's great to see Obama reverse his position in the State of the Union and acknowledge the strong public interest in disclosure of lobbying records, the Administration must do more than give speeches in order to fulfill its commitment to transparency. Instead, Obama must apply this policy to pending litigation, and release the identities of telecommunications representatives who lobbied for immunity for their telecommunications carrier clients.

January 28, 2010 09:45 PM

ocropus Google Group

Non-UTF8 comments in source code files

I noticed that in some source files -- most notably ocr-voronoi/voronoi-pageseg.cc -- there are a large number of code comments that are not in UTF8 or any other character set I can identify. As these comments might prove useful for newbies trying to understand the code (read: me) I wonder if someone could illuminate me as to how those [...]
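
One way to narrow down an unknown comment encoding is to try decoding the raw bytes with a list of candidates and look at what comes out. The sketch below assumes nothing about the actual file: the candidate list is an arbitrary guess (the post does not say which encoding was used), and `decodable_as` is a hypothetical helper name.

```python
# Candidate encodings to try; this list is an assumption for illustration,
# not a statement about what the ocropus sources actually use.
CANDIDATE_ENCODINGS = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp", "latin-1"]


def decodable_as(raw, encodings=CANDIDATE_ENCODINGS):
    """Map each encoding that decodes `raw` without error to the decoded text."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence; skip it.
            continue
    return results


if __name__ == "__main__":
    # Demo bytes: a short katakana word encoded as Shift_JIS.
    sample = "\u30b3\u30e1\u30f3\u30c8".encode("shift_jis")
    for enc, text in decodable_as(sample).items():
        print(enc, repr(text))
```

A successful decode is not proof of the right encoding: latin-1 accepts any byte sequence, so the decoded text still has to be read by someone who knows the language.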

January 28, 2010 07:30 PM

Open Book Alliance

AMENDED GOOGLE BOOKS SETTLEMENT IS A “PALTRY PROPOSAL” THAT DEFIES ANTITRUST LAWS

The Open Book Alliance today formally filed an amicus curiae brief in opposition to the proposed settlement between the Authors Guild, Inc., Association of American Publishers, Inc. and Google, Inc.

The OBA filing states, “The torrent of criticism to the settlement may have produced amendments to the class definition, but it has not affected Google’s conduct one iota… All in all, little has been accomplished, save from Google’s perspective as it continues to build its lead over competitors. … The Court’s procedures are ill-suited for resolution of what is now at stake in this matter – rewriting the copyright law, restructuring the publishing industry, and maintaining a competitive search market.”

The proposed Google book settlement is not a philanthropic effort to bring literature into the 21st century and bridge a literary divide.  In reality, Google is focused on becoming the sole owner of an immense digital library that will improve the company’s advertising-based search business.  This de facto exclusive license will provide Google with an enormous advantage over its search competitors.

The brief explores the market importance of so-called “tail queries” – rare or obscure search requests that are hard to fulfill accurately.  The brief explains how “[d]igital rights to virtually all out-of-print books provide Google with a decisive advantage in responding to tail queries.”  The brief continues, “[i]f Google can deny its search rivals the ability to integrate the same corpus of books, Google’s lead in search will become insurmountable.”

The brief also uncovers “carefully crafted exceptions” inside the settlement regarding the Google Partner Program.  Google has signed “Partner” agreements with thousands of publishers.  Doubtless, many – if not all – of the named publishers in the settlement have their own agreements with Google that will govern the payments they receive from Google, in lieu of the provisions that were negotiated in the settlement for other class members.  The brief states, “…[this] permits the parties to negotiate secret side deals to govern the economic terms of books licensed to Google under the Settlement at any time, even after a court review of the Amended Agreement, effectively evading judicial and public scrutiny…”.

The OBA believes that the proposed settlement threatens to bottleneck the access to and distribution and pricing of the largest, private digital database of books in the world. It would do so by using the class action mechanism to not only redress past harm, but to prospectively shape the future of digital book distribution, display and search.

In the Amicus, the OBA highlights five key reasons why the amended settlement is a “paltry proposal”:

January 28, 2010 07:13 PM

Browse Blogs

Linux can compete with the iPad on price, but where’s the magic?

Yesterday I watched Apple’s Steve Jobs unveil the iPad. Jobs clearly can create revolutionary products; he can also produce spin like no one else. Yesterday was no exception.

His main message about the iPad was “a magical device at a breakthrough price.” He repeated this many times throughout the pitch and twice at the end. This phrase demands an honest response: how will Linux-based devices compete with the iPad?

January 28, 2010 06:53 PM

if:book

and now we have an ipad

The iPad has arrived, to no one's surprise: as soon as you use an iPhone, you start wondering what a computer-sized version of the same would be like. (Those interested in how past predictions look now might look at this post by Ben from five years ago.) The iPad is an attractive device and at $500, it seems likely to take off. It seems entirely possible that a tablet could replace laptops and desktops for many users, to say nothing of Kindles and Nooks. My MacBook Pro suddenly feels rickety. Hardware-wise, it feels like the iPad might finally be Alan Kay's Dynabook.

And yet: standing on the verge of a potential transformation in how people use computers, I think it's worth stepping back for a second to think about where we are. I suggest that now might be a useful time to re-read Neal Stephenson's manifesto, "In the Beginning Was the Command Line". This is, it needs to be said, a dated piece of writing, as Stephenson has admitted; this is the perpetual curse of writing about technology. Stephenson was writing in 1999, when Microsoft's monopoly over the computer seemed to be without limit; Apple was then in an interregnum, and Google and Amazon were promising web players in a sea of many other promising web players. "In the Beginning Was the Command Line," however, is still worth reading because of his understanding of how we use computers. The promise of the open source movement that Stephenson described was that it would give users complete control over their computers: using Linux, you weren't tied into how corporations thought your computer could be used but were free to change it as you desired. And there's a deeper question that Stephenson gets at: the problem of how we understand our tools, if at all. We can theoretically understand how an open system works; but a closed system is opaque. Something magic is happening under the hood.

Things have changed since then: the Linux desktop never really took off in the way that seemed possible in 1999. Corporations – Apple and Google – showed that they could use open source in a tremendously profitable way. What made me think about this essay yesterday, however, was the iPad: Apple has created a computer that's entirely locked down. The only applications that will run on the iPad are those that have been approved by Apple. And this is one of the first computers where the user will be entirely unable to access the file system. I understand why this is possible from a design standpoint: file systems are arcane things, and most people don't understand them or want to understand them. But this means that Apple has a complete lock on how media gets into your iPad: you're tied into an Apple-approved mechanism. The user of the iPad, like the user of the iPhone, is directly tied into the Apple economy: your credit card on file with Apple not only lets you buy apps and media, but it will also allow you to buy internet connectivity.

It's simple – it's fantastically simple, and it will probably work. But I can't help but think of how Stephenson metaphorically equates the closed system of Windows 95 to a car with the hood welded shut: you can't get inside it. Apple's managed this on a scale that late 90s Microsoft could only dream of. I wonder as well what this means for our understanding of technology: maybe technology's become something we let others understand for us.

January 28, 2010 05:54 PM

Open Knowledge Foundation Blog

Clear Climate Code, and Data

The following guest post is by David Jones who is, among other things, a curator of the climate data group on CKAN (the OKF’s open source registry of open data) and co-founder of Clear Climate Code (which we blogged about back in 2008). Clear Climate Code have been working on ccc-gistemp, a project to reimplement in [...]

January 28, 2010 05:22 PM

Open Book Alliance

And The Hits Keep Coming…

In the lead up to February’s ruling, we’ve seen an avalanche of objections from authors, unions, academics and foreign governments.  This morning we can add a tidal wave of opposition from India to the list.

In a submission to Judge Chin, Indian authors and publishers, along with the Indian Reprographic Rights Organization (IRRO) and Federation of Indian Publishers (FIP) demanded that Google be forced to uphold the principles that govern copyright law and intellectual property.

“The Google Book Settlement is contrary to every international treaty that governs Copyright laws. Google’s unilateral conduct is a brazen attempt to turn Copyright law on its head, by usurping the exclusive rights of the Copyright holder,” says Siddharth Arya, legal counsel for IRRO.

Beyond an objection to the general disregard shown for copyright holders, the opponents outlined how the ruling impacts audiences far beyond U.S. shores.

The current scope of GBS2.0 is books that are either registered with the United States Copyright Office or published in the UK, Canada and Australia. However, it affects the rest of the world just as much, since any author published in the aforementioned countries is included in the settlement. In the current global economy, Indian authors like to see themselves published abroad for higher royalties and better professional services.

Bottom line, the GBS is causing the United States’ international partners to question our commitment to protecting intellectual property.  The opposition is yet another reminder of what a truly disruptive and harmful idea the GBS has become.

January 28, 2010 03:13 PM

Browse Blogs

Tagging the Noosphere

The last issue of Standards Today focused on XML - the underpinning of ODF and hundreds of other standards - and one of the most important standards ever developed.  Here is the editorial from that issue.

January 28, 2010 12:14 PM

NLP News

Hot Off the Wire

Using its proprietary and patented Natural Language Processing (NLP) technology, LifeCode, A-Life deciphers electronic transcribed patient encounters via the Internet through its data center, which are then appropriately coded for reimbursement ...

January 28, 2010 09:26 AM

W3C Semantic Web Activity News

New SPARQL drafts published

The W3C SPARQL Working Group has published a First Public Working Draft of SPARQL 1.1 Property Paths, which defines a more succinct way to write parts of basic graph patterns and also extends matching of triple patterns to arbitrary length paths. The group also published six updates, namely: SPARQL 1.1 Query adds support for aggregates, subqueries, projected expressions, and negation to the SPARQL query language. SPARQL 1.1 Update defines an update language for RDF graphs. SPARQL 1.1 Protocol for RDF defines an abstract interface and HTTP bindings for a protocol to issue SPARQL Query and SPARQL Update statements against a SPARQL endpoint. SPARQL 1.1 Service Description defines a vocabulary and discovery mechanism for describing the capabilities of a SPARQL endpoint. SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs describes the use of the HTTP protocol for managing named RDF graphs on an HTTP server. SPARQL 1.1 Entailment Regimes defines conditions under which SPARQL queries can be used with entailment regimes such as RDF, RDF Schema, OWL, or RIF.
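
To make "arbitrary length paths" concrete, here is a small Python sketch of the underlying idea: following one predicate for one or more steps through a set of triples. It is not SPARQL and does not use any SPARQL library; the `reachable` helper and the sample data are hypothetical, and plain tuples stand in for RDF triples.

```python
from collections import deque


def reachable(triples, start, predicate):
    """Return every node reachable from `start` by following `predicate`
    one or more times (a breadth-first walk over matching triples)."""
    # Index the outgoing edges for the chosen predicate only.
    edges = {}
    for s, p, o in triples:
        if p == predicate:
            edges.setdefault(s, []).append(o)

    seen = set()
    queue = deque(edges.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; this also guards against cycles
        seen.add(node)
        queue.extend(edges.get(node, []))
    return seen


if __name__ == "__main__":
    data = [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("carol", "knows", "alice"),
        ("bob", "likes", "dave"),
    ]
    print(sorted(reachable(data, "alice", "knows")))
```

A fixed-length triple pattern would need one pattern per hop; the walk above covers any number of hops, which is the kind of matching the Property Paths draft adds to the query language.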

January 28, 2010 08:40 AM

The FRBR Blog

Last week in FRBR #13

Assunção, FRBR and Music Uniform Title

Maria Clara Assunção has a paper called “FRBR and Music Uniform Title” in Páginas a & b 2:4 (2009), pp. 143-153.

The concepts of “work” and “expression” introduced by the FRBR model have particular implications for the rationale behind the construction of music uniform titles and can help to significantly improve the identification of musical works through this cataloguing resource. This study results from the practical need to establish a set of effective criteria in the development of uniform titles for musical works of a diverse nature, mostly of doubtful identification, often handwritten and sometimes anonymous. This paper aims to help clarify this resource, which is vital in the cataloguing of music but often avoided or misapplied.

LibraryThing, A FRBR Model of Publishers

I spent some time cleaning out my inbox. At work I’ve been doing Inbox Zero for a long time and it’s an enormous help, but my personal mailbox had a bunch of stuff in it that was dragging me down, so I started deleting. One thing I found was from Tim “Mr. LibraryThing” Spalding, sent in May 2009, pointing out a discussion on the LT site: A FRBR Model of Publishers.

As many know, LibraryThing has a concept of “works” being composed of editions. And we have author and tag aliases.

Together, these concepts resemble what librarians call the FRBR model, and its siblings FRAR, FRSAR, FRBRoo, and FR-lama-lama-ding-dong.

Now, I want to do publishers. That is, I want to have pages for publishers.

This requires some model of how publishers are. An ideal model would understand that HarperCollins used to be called Harper Collins, that Collins is an imprint of HarperCollins, but was an independent company, etc. Truly publishers and imprints are much worse than authors or works. They’re a river you can’t step in twice and that calls itself a stream the next day. Also the river is only really significant insofar as books float down it. And there are beavers making dams, and fish and… Okay, not the last part.

So, does anyone have any advice on this problem? What does FRBR look like when applied to publishers, imprints and etc.?

I don’t know if this led anywhere. To my surprise, even for the new Stephen King, Under the Dome, no publisher is listed in the Common Knowledge section. (It’s Scribner.) I had a look at a few books and didn’t see a Publisher field on any of them. I don’t know what’s going on there, or where Tim got with this, but that’s what happens when you let email sit around for eight months and then feel bad about not dealing with it.

January 28, 2010 06:48 AM

January 27, 2010

EFF.org Updates

FCC's Net Neutrality Plan Would Permit Blocking of BitTorrent

Remember what put the debate over net neutrality into high gear? In 2007, EFF and the Associated Press confirmed suspicions that Comcast was clandestinely blocking BitTorrent traffic. It was one of the first clear demonstrations that ISPs are technologically capable of interfering with your Internet connection, and that they may not even tell you about it. After receiving numerous complaints, the FCC in 2008 stepped in and threw the book at Comcast, requiring them to stop blocking BitTorrent. The Comcast-BitTorrent experience put net neutrality at the top of the FCC agenda.

Yet now that the FCC has formally issued draft net neutrality regulations, they have a huge copyright loophole in them — a loophole that would theoretically permit Comcast to block BitTorrent just like it did in 2007 — simply by claiming that it was "reasonable network management" intended to "prevent the unlawful transfer of content."

You heard that right — under these conditions, the new proposed net neutrality regulations would allow the same practices that net neutrality was first invoked to prevent, even if these ISP practices end up inflicting collateral damage on perfectly lawful content and activities.

When we saw the loophole, we had to ask ourselves, "Is this real net neutrality?" And the answer was simply, "No." The entertainment industry is already pressuring ISPs to become copyright cops. Carving a copyright loophole in net neutrality would leave your lawful activities at the mercy of overbroad copyright filtering schemes, and we already have plenty of experience with copyright enforcers targeting legitimate users by mistake, carelessness, or design.

If net neutrality regulations are to be taken seriously at all, then the loophole must be closed. Sign the petition to demand real net neutrality from the FCC.

January 27, 2010 11:43 PM

Science Commons

Design a new t-shirt for Science Commons and win a trip to Seattle to attend Science Commons Symposium – Pacific Northwest!

Science Commons needs a new t-shirt design. Sure, the “e=mc(shared)” shirts are great – mine gets worn more than any other shirt in the dresser – but after years of having that design, we think it would be nice to have a new one. We are turning to the crowd to find a [...]

January 27, 2010 11:23 PM

Linked Data Blog Aggregator

Behind Oz’s Curtain

Benjamin Nowack, creator of ARC and Trice, wrote an interesting blog post about the place of Microformats and RDFa in the HTML 5 specification. I am not deep into the specification itself, and so may lack some historical context. However, the most interesting point in this article is not related to Microformats, RDFa or the new HTML 5 specification.

The point is that, apparently, some people believe that it is RDF or nothing. This belief is not new, but is it true?

People (and particularly enterprises) want the benefits of structured data, not necessarily RDF. In fact, many people don’t know about RDF, or don’t understand RDF, or just don’t care about RDF. But does not knowing, understanding or caring about RDF mean that you cannot benefit from it? No, certainly not. And I think that is what Benjamin is talking about when he mentions things such as: “[...] to get RDF to the broader developer community”, “[...] here could have been a solution that would have served everybody sufficiently well, both HTMLers and RDFers”, “[...] they would most probably have been able to define RDFa 1.1 as a proper superset of Microdata”. RDF can be incarnated in multiple bodies, but it is still RDF. I think that is what Benjamin was suggesting, and it is the path we took at Structured Dynamics.

We choose to use RDF behind Oz’s curtain. This means that at the core of any of our methodologies, systems and specifications, we use RDF. Why? Because it is the most flexible description framework available, and it helps us handle any other source of data. However, does that mean that we should push RDF in everybody’s face? Certainly not.

Our work with different enterprises from all kinds of domains taught us that we have to look beyond RDF while still using it (as paradoxical as that may appear). For example, we developed structWSF and conStruct such that people can upload (and manage) their data in different formats while being able to export it in any of the other formats. At the core, these systems use RDF to manipulate all these different kinds of formats, but from the outside, users simply work with the format they care about, already use, or have available in their workflow. These users benefit from RDF without knowing it, understanding it or caring about it. We don’t think RDF is for everyone, but everyone can benefit from RDF.
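As a rough illustration of that hub-and-spoke idea, here is a minimal Python sketch. It is not structWSF or conStruct code: the function names, the `id` column convention and the JSON output shape are all assumptions made for this example. The only point it shows is that each format needs just one converter to or from a shared triple-based core.

```python
import csv
import io
import json

# Internal core: a plain list of (subject, predicate, object) triples.
# Every name below is hypothetical; none comes from structWSF or conStruct.

def csv_to_triples(text, id_column="id"):
    """Parse CSV text; each non-empty, non-id cell becomes one triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row[id_column]
        for column, value in row.items():
            # Skip the id column, empty cells and overflow cells (key None).
            if column is None or column == id_column or not value:
                continue
            triples.append((subject, column, value))
    return triples

def triples_to_json(triples):
    """Group triples by subject and serialize them as JSON records."""
    records = {}
    for subject, predicate, obj in triples:
        records.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return json.dumps(records, indent=2, sort_keys=True)

if __name__ == "__main__":
    uploaded = "id,name,city\nb1,Alice,Paris\n"
    print(triples_to_json(csv_to_triples(uploaded)))
```

The user on the CSV side and the user on the JSON side never see the triples; they only see the format they already work with.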

Another example of RDF behind Oz’s curtain is the irON description framework and its three serialization profiles: irJSON, irXML and commON that we developed. As stated in the Purpose section of this document, the goal was quite clear:

irON (instance record and Object Notation) is an abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON). The notation specification includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. Profiles and examples are also provided for each of the irXML, irJSON and commON serializations.

irON is premised on these considerations and observations:

The irON notation and vocabulary is designed to allow the conceptual structure (“schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naive information exchange notation expressive enough to describe most any data entity.
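To make the idea of describing RDF in a non-RDF form more concrete, here is a small Python sketch. The JSON layout below is invented for this example and is NOT actual irJSON syntax, and the function names are hypothetical. It assumes a "schema" part that links attribute names to existing vocabulary terms (FOAF here) and a "records" part holding instance records, and it treats every value as a plain literal for simplicity.

```python
import json

# Hypothetical dataset layout, loosely inspired by the description above.
# This is not real irJSON; it only illustrates schema + instance records.
DATASET = """
{
  "schema": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "nick": "http://xmlns.com/foaf/0.1/nick"
  },
  "records": [
    {"id": "http://example.org/people/bob", "name": "Bob", "nick": "bobby"}
  ]
}
"""

def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def to_ntriples(dataset_text):
    """Turn each record attribute into one N-Triples line."""
    data = json.loads(dataset_text)
    schema = data["schema"]
    lines = []
    for record in data["records"]:
        subject = record["id"]
        for attribute, value in record.items():
            predicate = schema.get(attribute)
            if predicate is None:
                continue  # "id" and unmapped attributes produce no triple
            literal = escape_literal(str(value))
            lines.append('<%s> <%s> "%s" .' % (subject, predicate, literal))
    return lines

if __name__ == "__main__":
    print("\n".join(to_ntriples(DATASET)))
```

The person writing the records only deals with short attribute names; the mapping to RDF predicates lives in the schema part and is applied behind the curtain.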

I think this is what Benjamin was talking about in his article, and the kind of mindset he was suggesting the RDF community adopt. At least this is the mindset we adopted at Structured Dynamics, and apparently it is the mindset Benjamin adopted for his own business. I am sure there are many other people and organizations out there that are adopting the same point of view regarding RDF and its role in the current data ecosystem.

January 27, 2010 08:50 PM

ubiquity-firefox Google Group

Twitter software. Automatic twitter widget

Twitter IM
Twitter IM is an open source desktop Twitter client for Windows.
* Base Twitter functionalities
  * Create and view tweets
  * View replies and reply to tweets
  * Retweet, Direct Messages and User Profile
and many more options. Read here [link]

January 27, 2010 08:07 PM

Google Book Search Blog

Updated Books Home Page and My Library



I'm happy to announce a few fresh features for Google Books. We've updated the home page by adding the ability to scroll through categories of books and magazines.



We also integrated the My Library feature into the home page to enable you to create and then share collections of books by adding them to "bookshelves." This new version of My Library gives you control over your collections by enabling you to keep some bookshelves private--if, say, you want to organize your own personal reading lists--while sharing others.



Previously, all books in your My Library were part of a single collection, and you could tag books with labels to organize them. Now, instead of tagging a book with a label, you can add it to one or more bookshelves. As part of this transition to bookshelves, we're migrating all the previously created labels to the new bookshelf system. For example, if you had tagged a book with a label called "favorite travel books," then you'll now see a custom bookshelf called "favorite travel books" that contains the same book.
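For readers who like to see the shape of such a change, the label-to-bookshelf migration can be pictured with a short Python sketch. This is purely illustrative: the data layout and the function name are assumptions for this example and say nothing about how Google Books stores or migrates data.

```python
def migrate_labels_to_bookshelves(library):
    """Turn a book -> labels mapping into a bookshelf -> books mapping.

    `library` is a hypothetical dict of book title -> list of labels.
    Each label becomes a bookshelf holding every book that carried it.
    """
    bookshelves = {}
    for book, labels in library.items():
        for label in labels:
            bookshelves.setdefault(label, []).append(book)
    return bookshelves

if __name__ == "__main__":
    old_library = {
        "A Walk in the Woods": ["favorite travel books"],
        "In Patagonia": ["favorite travel books", "to reread"],
    }
    print(migrate_labels_to_bookshelves(old_library))
```

A book that had several labels simply ends up on several bookshelves, which matches the "one or more bookshelves" behavior described above.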

As always, you have full control over your book collection data. We continue to offer the Book Search API as a way for you to extract and edit your data. Ultimately, we also hope that these open APIs will make it easier to build product integrations that synchronize reading lists across devices and applications.

You can search and discover millions of books on Google Books. Our hope is that these new tools will make it easier for you to find, organize and keep track of the books that you're interested in reading.

January 27, 2010 07:41 PM