Who am I?

Hi, I’m Graeme and these are my notes, from my messy desk. I started this blog because Google proved to be more useful at finding content than anything else I’ve used.

So I started adding my own content in the hopes that Google would index it and allow me to find things again in the future.

It works.


Thursday, 25 Aug 2005

Indexing and searching

During a completely unrelated web search last night, I came across the Xapian project. Funny that, what with it being an indexing and searching library for C++. So I had a quick play around and it seems pretty neat. And pretty fast. I'm wondering if we could put it to use as a backend-independent search engine for MailManager instead of our current implementation, which makes use of whatever full text search the database backend provides. In the case of PostgreSQL, this is tsearch2, which by all accounts seems to be pretty good. MySQL also has its own full text search support. (And the ZODB we were using in 1.x has ZCatalog et al. to provide the searching capability.)

One problem with using a database backend's native text searching is that the language used to express search terms (and the range of searches possible) varies between backends. Backends also differ in whether they are case sensitive, whether they support stemming and how non-words are dealt with. Taking search expressions as an example, with tsearch2 we'd do something along the lines of:

[code lang="sql"]SELECT * FROM articles
WHERE idxfti @@ to_tsquery('default', 'this & that');[/code]

to search for records that contained both 'this' and 'that', whereas in MySQL it would be:

[code lang="sql"]SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('+this +that' IN BOOLEAN MODE);[/code]

So we have to create our own search language to present to the user, then normalise the searches for each backend. Actually, since neither of them uses the same language as we presented to the user in MM 1.x, we've been doing that anyway. :-)
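As a rough sketch of what that normalisation might look like, here is a tiny translator in Python. The user-facing syntax is an assumption made up for illustration (space-separated words that must all match, with a leading `-` to exclude a word); it is not the MM 1.x search language. The function names `parse`, `to_tsearch2` and `to_mysql_boolean` are hypothetical too.

```python
import re

# Assumed user syntax: words separated by spaces, optional leading "-" to exclude.
WORD = re.compile(r"^(-?)(\w+)$")


def parse(user_query):
    """Turn 'this that -other' into a list of (excluded, word) pairs."""
    terms = []
    for token in user_query.split():
        match = WORD.match(token)
        if match is None:
            # Refuse anything that could smuggle in backend-specific operators.
            raise ValueError("unsupported search term: %r" % token)
        terms.append((match.group(1) == "-", match.group(2).lower()))
    return terms


def to_tsearch2(terms):
    """Build the query string handed to to_tsquery(): 'this & !other'."""
    return " & ".join(("!" if excluded else "") + word for excluded, word in terms)


def to_mysql_boolean(terms):
    """Build the string for AGAINST (... IN BOOLEAN MODE): '+this -other'."""
    return " ".join(("-" if excluded else "+") + word for excluded, word in terms)
```

For the query `this that`, the two functions produce `this & that` and `+this +that`, matching the two SQL examples above. Either string should still be passed to the database as a bound parameter rather than pasted into the SQL.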

There's also the problem of using other backends that just don't have their own text searching capabilities. SQLite certainly doesn't. I guess Oracle, MS-SQL et al. do have some text searching capability, but we'd need to write our own interface to each of them every time we added support for a new backend. And we have to provide the lowest common denominator of search functionality to the front end (or at least cope with the backend potentially not supporting a particular search method and notifying the client in a friendly manner).
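One way to picture the "lowest common denominator" problem is to have each backend declare what it can do. The sketch below is only an illustration: the class names are hypothetical and the capability sets are placeholders, not an authoritative feature list for any database.

```python
class UnsupportedSearch(Exception):
    """Raised when a backend cannot run the requested kind of search."""


class SearchBackend:
    name = "generic"
    capabilities = frozenset()  # feature names this backend supports

    def check(self, feature):
        """Fail with a user-presentable message if the feature is missing."""
        if feature not in self.capabilities:
            raise UnsupportedSearch(
                "Sorry, %s searches are not available with the %s backend."
                % (feature, self.name)
            )


# Illustrative capability sets only (assumptions, not verified feature lists).
class Tsearch2Backend(SearchBackend):
    name = "PostgreSQL"
    capabilities = frozenset(["boolean", "stemming"])


class MySQLBackend(SearchBackend):
    name = "MySQL"
    capabilities = frozenset(["boolean", "phrase"])


def common_capabilities(backends):
    """The lowest common denominator: features every backend supports."""
    sets = [backend.capabilities for backend in backends]
    return frozenset.intersection(*sets) if sets else frozenset()
```

The front end could either offer only `common_capabilities(...)`, or offer everything and catch `UnsupportedSearch` to show the friendly message.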

So how about factoring out the search capabilities into something completely separate? With the Xapian libraries, we'd need to write some code to intelligently split apart the email messages so they can be effectively indexed, provide an interface to query the index and make sure the search index stayed (nearly?) in sync with reality. We could have it running as a separate daemon, so that indexing and searching wouldn't directly impact upon MailManager's main job, which is dealing with the emails themselves. (Actually, I'd also argue that processing incoming mail should be separated out into its own daemon so that the Zope daemon can carry on doing what it does best: nothing, uh, I mean, serving web pages to users :-) but that's for another post.)
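To make the "split apart the email messages" step a bit more concrete, here is a minimal Python sketch using only the standard library. The in-memory `TinyIndex` is a stand-in for a real index and does not use the Xapian API at all; the choice of fields (subject, from, body) and all the names are assumptions for illustration.

```python
import re
from collections import defaultdict
from email import message_from_string

TOKEN = re.compile(r"\w+")


def split_message(raw):
    """Break a raw message into named text fields worth indexing."""
    msg = message_from_string(raw)
    body_parts = []
    for part in msg.walk():
        # Only plain text parts are indexed in this sketch.
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    return {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "body": "\n".join(body_parts),
    }


class TinyIndex:
    """Stand-in inverted index: (field, term) -> set of message ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, msg_id, fields):
        for field, text in fields.items():
            for term in TOKEN.findall(text.lower()):
                self.postings[(field, term)].add(msg_id)

    def search(self, field, *terms):
        """Ids of messages whose field contains every given term."""
        sets = [self.postings.get((field, term.lower()), set()) for term in terms]
        return set.intersection(*sets) if sets else set()
```

Keeping the fields separate means a search can be restricted to, say, subjects only. A separate daemon would call `add` as mail arrives and answer `search` requests from the web front end.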

At some point in the near future, when I've some CFT (copious free time), I'm going to have a play around with Xapian and see if I can convince it to be useful. One thing I'd like to do is combine it with a code parser to index and search C/C++/Python code, like a souped-up uber-smart version of ctags, just to see how well it really works.
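For the Python half of that idea, the standard library's `ast` module already does the parsing. The sketch below only pulls out function and class definitions, ctags-style; C and C++ would need a different parser, and feeding the results into Xapian is left out. The function name `extract_symbols` is hypothetical.

```python
import ast


def extract_symbols(source, filename="<string>"):
    """Return (name, kind, line) for each function and class defined in source."""
    symbols = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append((node.name, "function", node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append((node.name, "class", node.lineno))
    # ast.walk does not visit nodes in source order, so sort by line number.
    return sorted(symbols, key=lambda symbol: symbol[2])
```

Each tuple could become one indexed document, with the name as a searchable term and the file and line stored alongside it.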
