Indexing and searching
Thursday, August 25, 2005 at 2:05PM
One problem with using a database backend's native text searching is that the language for expressing search terms (and the range of searches possible) varies between backends. Backends also differ on whether they are case sensitive, whether they support stemming and how they deal with non-words. Taking search expressions as an example, with tsearch2 we'd do something along the lines of:
[code lang="sql"]SELECT * FROM articles
WHERE idxfti @@ to_tsquery('default', 'this & that');[/code]
to search for records that contained both 'this' and 'that', whereas in MySQL it would be:
[code lang="sql"]SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('+this +that' IN BOOLEAN MODE);[/code]
So we have to create our own search language to present to the user, then normalise the searches for each backend. Actually, since neither of them uses the same language as we presented to the user in MM 1.x, we've been doing that anyway. :-)
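To make the normalisation step concrete, here is a minimal sketch of how a single user-facing query could be turned into the two expression strings shown above. The mini-language (every word must match) and the function names `parse_terms`, `to_tsearch2` and `to_mysql_boolean` are assumptions made for illustration, not anything MailManager actually contains.

```python
import re


def parse_terms(user_query):
    """Split a user query into plain word terms.

    Assumed mini-language: every word the user types must match.
    Keeping only word characters also drops backend operators such
    as '&' or '+' that the user may have typed.
    """
    return re.findall(r"\w+", user_query)


def to_tsearch2(terms):
    # tsearch2 joins required terms with '&'.
    return " & ".join(terms)


def to_mysql_boolean(terms):
    # MySQL boolean mode marks a required term with a leading '+'.
    return " ".join("+" + term for term in terms)


terms = parse_terms("this that")
print(to_tsearch2(terms))       # this & that
print(to_mysql_boolean(terms))  # +this +that
```

The resulting strings would be passed to the backend as bound query parameters rather than pasted into the SQL text.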
There's also the problem of using other backends that just don't have their own text searching capabilities. SQLite certainly doesn't. I guess Oracle, MS-SQL et al. do have some text searching capability, but we'd need to write our own interface to each of them every time we added support for a new backend. And we have to provide the lowest common denominator of search functionality to the front end (or at least cope with the backend potentially not supporting a particular search method and notify the client in a friendly manner).
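One way to cope with a backend that lacks a particular search method is to check a capability table before running the query and hand a readable message back to the client. The sketch below assumes a hypothetical `CAPABILITIES` table and `search` function; the table entries are illustrative only and are not a survey of what each backend really supports.

```python
# Illustrative capability table (an assumption, not a feature survey).
CAPABILITIES = {
    "postgresql": {"all_terms", "stemming"},
    "mysql": {"all_terms"},
    "sqlite": set(),
}


def search(backend, feature, terms, run_query):
    """Run a search if the backend supports it, else explain why not.

    run_query is a caller-supplied function that performs the real
    backend query; it is only called when the feature is available.
    """
    supported = CAPABILITIES.get(backend, set())
    if feature not in supported:
        message = (
            f"Sorry, '{feature}' searches are not available "
            f"with the {backend} backend."
        )
        return {"ok": False, "message": message, "results": []}
    return {"ok": True, "message": "", "results": run_query(terms)}
```

The front end can then show `message` to the user instead of surfacing a raw database error.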
So how about factoring out the search capabilities into something completely separate? With the Xapian libraries, we'd need to write some code to intelligently split apart the email messages so they can be effectively indexed, provide an interface to query the index and make sure the search index stayed (nearly?) in sync with reality. We could have it running as a separate daemon, so that indexing and searching wouldn't directly impact upon MailManager's main job, which is dealing with the emails themselves. (Actually, I also argue that processing incoming mail should be separated out into its own daemon so that the Zope daemon can carry on doing what it does best:
At some point in the near future, when I've some CFT, I'm going to have a play around with Xapian, see if I can convince it to be useful. One thing I'd like to do is combine it with a code parser to index & search C/C++/Python code, like a souped-up uber-smart version of ctags, just to see how well it really works.
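As a starting point for the message-splitting step mentioned earlier, here is a rough sketch using only Python's standard `email` package. It pulls out a few headers and the plain-text body parts so that each could be indexed as a separate field. The function name `split_message` and the choice of fields are assumptions; no Xapian calls are shown, and the returned dictionary is simply what an indexer would be fed.

```python
from email import message_from_string


def split_message(raw):
    """Break a raw email into named text fields ready for indexing."""
    msg = message_from_string(raw)
    fields = {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
    }
    body_parts = []
    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        body_parts.append(text)
    fields["body"] = "\n".join(body_parts)
    return fields
```

Keeping the fields separate leaves room for searches restricted to, say, the subject line, at the cost of a slightly larger index.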