Topics in data mining, text and information retrieval, and search engines

I have always been interested in text and information retrieval. This probably stems from my desire to have intelligent agents supporting research, à la common depictions in fiction (we're not there yet).

After visiting the Thinking Machines headquarters outside of Boston, I collected a bunch of the company's interesting papers, including several on the use of the Connection Machine as a specialized information retrieval computer. Basically, it would be used as a kind of associative memory instead of a serial 'byte indexing/retrieval engine.' Very cool.

My first taste of this topic in a professional context was working on large (50-disc collections!) CD-ROM text and image retrieval systems for medical applications. Later, I was contracted to develop a front end to a system called CD-RDX, proposed as a text retrieval standard for the CIA.

After that, I was asked by the (late) Orientation.com to develop their next-generation text retrieval engine, destined to work in multiple languages. At the time, Orientation's on-line text engine was based on their earlier work in CD-ROM-based retrieval. That engine, BlackMagic, was developed by one of Orientation's founders.

Interesting work on information retrieval has been around for at least as long as computers have.

Open Source search engines

Google related

When discussing text and information retrieval, it's useful to consider past successes. The current 800-lb gorilla in this domain is google.com. Here are a few papers published by Google on its architecture.

Useful and assorted links
