Topics in data mining, text and information retrieval, and search
engines
I have always been interested in text and information retrieval. This
probably stems from my desire to have intelligent agents in support of
research, à la common depictions in fiction (we're not there yet).
After visiting their headquarters outside of Boston, I collected a bunch
of interesting papers from Thinking Machines, including several on the use
of the Connection Machine as a specialized information retrieval computer.
Basically, it would be used as a kind of associative memory instead of a
serial 'byte indexing/retrieval engine.' Very cool.
My first taste of this topic in a professional context was working on
large (50-disc collections!) CD-ROM text and image retrieval systems for
medical applications. Later, I was contracted to develop a front end
to a system called CD-RDX, proposed as a text retrieval standard for the
CIA.
Later, I was asked by the (late) Orientation.com to develop their next-
generation text retrieval engine, destined to work in multiple languages.
At the time, Orientation's on-line text engine was based on their earlier
work in CD-ROM based retrieval. That engine, BlackMagic, was developed
by one of Orientation's founders.
Interesting work on information retrieval has been around for at least as
long as computers have.
Open Source search engines
- The Xapian project is an Open Source Search Engine Library,
  released under the GPL. It's written in C++, with bindings to
  allow use from Perl, Python, PHP, Java, Tcl, C#, and Ruby (so far!).
  Xapian is a highly adaptable toolkit which allows developers
  to easily add advanced indexing and search facilities to their own
  applications. It supports the Probabilistic Information Retrieval model
  and also supports a rich set of boolean query operators.
- Semantic Indexing project
- S-EM is a text learning or classification system that learns
  from a set of positive and unlabeled examples (no negative examples).
  It is based on a "spy" technique, naive Bayes, and the EM algorithm.
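
To make the "probabilistic ranking plus boolean operators" combination in the Xapian entry above a little more concrete, here is a minimal pure-Python sketch. It does not use Xapian's API. BM25 is simply one well-known weighting formula from the probabilistic family, and the names bm25_search and required, along with the k1 and b defaults, are my own illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_search(docs, query_terms, required=(), k1=1.2, b=0.75):
    """Rank docs by a BM25 score over query_terms.

    `required` acts as a boolean AND filter: a document is considered
    only if it contains every term listed there. Returns a list of
    (score, doc_index) pairs, best match first.
    """
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    results = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        # Boolean part: skip documents missing any required term.
        if any(term not in tf for term in required):
            continue
        # Probabilistic part: sum per-term BM25 contributions.
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            results.append((score, idx))

    # Highest score first; ties broken by document order.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results


docs = [
    "the connection machine as associative memory",
    "cd-rom text retrieval for medical images",
    "probabilistic text retrieval with boolean operators",
]
ranked = bm25_search(docs, ["text", "retrieval"], required=["boolean"])
```

A real engine would keep an inverted index instead of scanning every document per query; the scan here just keeps the sketch short.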
Google related
When discussing text and information retrieval, it's useful to consider
past successes. The current 800-lb gorilla in this domain is
google.com. Here are a few papers
published by Google on its architecture.
Useful and assorted links