Comedy: making reality acceptable

May 11, 2006 on 12:29 am | In wikicompany, media, features, spider

There's some great stand-up comedy being streamed onto the web by Cringe Humor NYC. Grab streamtuner and streamripper and have a laugh.

I wish streamtuner could also handle video, podcasts, skypecasts, and other 'live' streams from various sites, really decoupling the content from the web/RSS site interface with a simple and consistent browse/search/bookmark interface.

For Wikicompany I've been hacking at an intelligence-augmented web crawler which will create company profiles from URLs. It's pretty useful already.

The spider collects data from various sources, parses and mangles the data (including geocoding, auto tagging, logo handling) and creates a link for the Wikicompany publish form. From the publish form any manual changes can be made, before including the profile in Wikicompany. Parsing an existing profile on Wikicompany back to the form is on the todo list.
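As a rough illustration of the last steps of that pipeline, here is a minimal Python sketch that tags some page text and builds a pre-filled form link. Everything in it is an assumption made for illustration: the `PUBLISH_FORM` address, the field names, and the helpers `name_from_url`, `auto_tag` and `form_link` are hypothetical, and geocoding and logo handling are left out because they need external services. It is not the spider's actual code.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlencode, urlparse

# Hypothetical address; the real publish form URL is not given in this post.
PUBLISH_FORM = "https://example.org/publish"


@dataclass
class Profile:
    url: str
    name: str = ""
    tags: List[str] = field(default_factory=list)
    lat: Optional[float] = None  # would be filled in by a geocoding step
    lon: Optional[float] = None


def name_from_url(url: str) -> str:
    """Guess a display name from the host part of a URL."""
    host = urlparse(url).netloc or url  # no scheme: treat the input as the host
    parts = host.lower().split(".")
    if parts[0] == "www":
        parts = parts[1:]
    return parts[0].capitalize()


def auto_tag(text: str, vocabulary: List[str]) -> List[str]:
    """Return the vocabulary terms that occur as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(term for term in vocabulary if term in words)


def form_link(profile: Profile) -> str:
    """Encode a profile as query parameters for the (assumed) publish form."""
    fields = {
        "name": profile.name,
        "url": profile.url,
        "tags": ",".join(profile.tags),
    }
    if profile.lat is not None and profile.lon is not None:
        fields["lat"] = profile.lat
        fields["lon"] = profile.lon
    return PUBLISH_FORM + "?" + urlencode(fields)


page_text = "Organic soy milk and other dairy-free food products."
site = "http://www.whitewave.com/"
profile = Profile(
    url=site,
    name=name_from_url(site),
    tags=auto_tag(page_text, ["biotech", "food", "soy"]),
)
print(form_link(profile))
```

Passing the profile along as a link keeps the spider and the form decoupled: the form only has to read its query parameters, and a human can still correct every field before publishing.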

The spider is not perfect yet, but the results are very promising. I'm currently testing the algorithms on about 4000 company domains (mainly biotech companies).

I want to automate as much work as possible, but some things only a human can (currently) do, although some good NLP software might be able to automate even more. Some statistical approaches could also be interesting once more correct context data is known.

I also wrote a tool which can generate a company URL list from a list of company names, which really helps to collect large lists of company URLs from various sources on the web.
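This post doesn't go into how that tool works. One naive way to get from a name to URLs, shown here purely as an assumed sketch, is to squash the name into a slug and try a few top-level domains. The `candidate_urls` helper and the `LEGAL_SUFFIXES` set are hypothetical, and the output is only a list of guesses that would still need checking.

```python
import re

# Trailing words that usually do not appear in the domain (an assumption).
LEGAL_SUFFIXES = {"inc", "corp", "co", "company", "ltd", "llc"}


def candidate_urls(name, tlds=("com", "net")):
    """Return guessed host names for a company name; none of them is verified."""
    # Drop apostrophes first so "Whitey's" becomes "whiteys", not "whitey" + "s".
    words = re.findall(r"[a-z0-9]+", name.lower().replace("'", ""))
    # Strip trailing legal suffixes such as "Inc" or "Ltd".
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    slug = "".join(words)
    return [f"www.{slug}.{tld}" for tld in tlds] if slug else []


for company in ["Whitey's Ice Cream, Inc.", "White Rock Distilleries"]:
    print(company, "->", candidate_urls(company))
```

Guesses like these can easily miss (a company may keep "co" in its domain, for example), so each candidate would have to be confirmed by actually fetching the site or by a search lookup.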

Here's a small list of profiles which were gathered completely automatically:

  1. www.wholesoyco.com
  2. www.wholesomesweeteners.com
  3. www.wholefoodsmarket.com
  4. www.wholefoods.com
  5. www.whittakersearch.com
  6. www.whitlockpkg.com
  7. www.whitleyspeanut.com
  8. www.whitfieldfoods.com
  9. www.whiteysicecream.com
  10. www.whitewave.com
  11. www.whiterose.com
  12. www.whiterockdistilleries.net
