« Building Facebook applications on Amazon Web Services (AWS)
» Sphinx is live and kicking

Solr, Sphinx, Lucene, PHP, API, MySQL

Sphinx and Lucene search engines, first impressions

2008-02-15 | 2 Comments

MySQL full text search is the easiest way to provide full text functionality on your site. It is easy to implement and easy to use. You just have to specify MySQL FULLTEXT index, and after that you search the tables with appropriate SQL queries.

However, there are some disadvantages. On a large databases full text search might perform very slow. Also, MySQL full text indexes are available only on MyISAM tables, so if you are using InnoDB table engine to store critical data, you won’t be able to use MySQL full text search.

In order to scale the search of large data sets you will have to look for the solution beyond MySQL. Here I will write about my first thoughts on two popular open source search engines, Lucene and Sphinx.

I will not talk about performances and speed. This post presents basic things about installation, configuration and implementation of each search engine, so it might help you to choose between Lucene and Sphinx if you want to switch from MySQL full text search on something better, quickly and easily.

MySQL Lucene Sphinx search comparationDon’t think about performances at this moment, search should be faster than MySQL full text search. I will show you only one slide from this presentation which you can find on MySQL Performance Blog to demonstrate the speed difference between MySQL, Lucene and Sphinx on a 5GB test database (of course, results may vary depending on your datasets). If you are interested in alternative full text search engines, you should take a look on the entire presentation (38 slides, PDF)

Lucene

Lucene logoLucene is actually a Java library, so basically you don’t have to install anything to use Lucene. It is a very powerful tool for indexing and searching documents. Lucene doesn’t care do you index files, database records or something else. It is your duty to prepare data and send it to Lucene index as Document object. That means that you will need some extra work to index your database records, or whatever data you want to search. Once created, index can be easily updated or expanded by adding/updating documents from your scripts and programs.

Lots of popular heavy traffic sites are powered by Lucene search like Wikipedia or Digg, so the performance and scalability is not under the question. Lucene is also rich with search options providing boolean search, phrase search, wildcard search, field search and other features.

Implementation around Lucene might be a little bit problematic if you are not using Java. If your project is in PHP, like mine, you can use something like PHP-Java bridge. There is also a PHP Lucene class in Zend framework. I haven’t tried the class, but I have read here that indexing and searching performances are very poor comparing to the original Lucene Java class. It might not be the truth, but you should have that in your mind.

Solr is perhaps the best way to use Lucene search on any platform. Solr is also a Java application and you have to install it on some servlet container, like Tomcat. Once installed, you can access Solr from any development platform using REST protocol. You send add/update XML requests to the update URL (for example http://localhost:8080/solr/update) to index documents.

<add>
<doc>
<field name="id">05991</field>
<field name="description">This is a blog for webmasters and developers</field>
</doc>
</add>

On search, you send GET queries to the Solr select URL (for example http://localhost:8080/solr/select?q=test) and you will receive XML response from the Solr containing structured information with search results. Cool, right? Here you can find the list of very popular web sites that use Solr for search.

Sphinx

Sphinx logoSphinx is a full text search engine and it is ready-made to be used with open source databases like MySQL (MyISAM, InnoDB) and PostgreSql. It is very easy for installation. Configuration is simple as well. Basically, you will specify the SQL connection parameters and the SQL query to fetch the data from your database, and the Sphinx indexer will create the index for your database.

Indexing is very fast, but the problem is that there is no way to update index partially, for example to update/add only couple of records. Instead, you have to re-index an entire dataset from the beginning. However, there is a proposed solution that can be implemented automatically. You create a main index and delta index with recently added/updated records. You regularly re-index only delta index, while you re-index main only periodically. As far as I can see, this works very good in practice once you get it right, at least for my 1GB table, especially because, as I already sad, indexing is very fast. In my case, 5000+ documents delta index is re-indexed in a split second.

All major programming languages are supported, including PHP, Python, Java, Ruby and C++. Sphinx supports features like boolean search, phrase search, field search… It doesn’t support wildcards search, but I’m not crazy about that feature.

My pick

I have tried Lucene with Solr. It needs more coding to index data and an extra work to install and configure Tomcat and Solr. Although everything worked fine on my desktop, I had trouble installing Tomcat on the production server so I have given up Lucene and Solr for now.

On the other hand, Sphinx is very easy to install and configure. If you are switching from MySQL full text search, you only need to adjust your scripts to use sphinx API in combination with SQL instead of MySQL full text search query. I still haven’t launched the Sphinx search on the production server, but the test system is working fine. Basically, Sphinx has all the features I need, I only hope that they will add some live updates and/or partial re-indexing soon. :(

It is up to you. If you are not using Java for your project and you don’t want to spend extra time installing and configuring php-java-bridge or Tomcat/Solr, Sphinx might be the better and instant solution, especially if you are migrating from MySQL full text search. However, if your data sets are frequently updated, you will have to seriously consider can you organize partial indexing with Sphinx main+delta indexing scheme.

Lucene however is still known as the most advanced open source search engine with lots of search options, live updates and other things, so if you need all those functionalities, it might be better to go straight to Lucene and Solr.

2 Comments

have your say

Add your comment below, or trackback from your own site. Subscribe to these comments.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

:

:


« Building Facebook applications on Amazon Web Services (AWS)
» Sphinx is live and kicking