I've been playing with a number of different ways to optimise queries over geospatial data and data that often carries a text-based component. An example query might be something like:
Give me the latest 10 posts in and around lat,lng where the words "moo" and "chunky" appear somewhere in the post text
MongoDB seemed like a good starting point and it certainly allows for easy persistent storage of this sort of data in JSON format, but querying it is another matter. As of version 2.4.9, MongoDB cannot use an index for both a geospatial search and a text lookup in the same query.
Even if it could, regex-based text searches in MongoDB don't use an index (except for case-sensitive prefix expressions anchored to the start of the string). Given that we need to search across an entire post's text, MongoDB was not the way forward.
ElasticSearch (http://www.elasticsearch.org/) provides a great text and geospatial indexing system that you can install on your own server and query against very easily.
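The example query from the top of the post maps quite naturally onto ES's query DSL. Here's a sketch of what that search body might look like under ES 1.x (the index layout and the field names `text`, `location` and `created` are all assumptions):

```javascript
// Latest 10 posts near a point where "moo" and "chunky" both appear.
// Uses ES 1.x's "filtered" query: full-text match scored as the query,
// geo_distance applied as a non-scoring filter.
var query = {
  size: 10,
  sort: [{ created: { order: "desc" } }],   // newest first
  query: {
    filtered: {
      query: {
        // operator "and" requires both terms to appear in the text
        match: { text: { query: "moo chunky", operator: "and" } }
      },
      filter: {
        geo_distance: {
          distance: "5km",                  // search radius (assumed)
          location: { lat: 51.5, lon: -0.12 }
        }
      }
    }
  }
};

console.log(JSON.stringify(query, null, 2));
```

POSTing that body to the index's `_search` endpoint returns the matching hits already sorted and limited.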
I had my ES (ElasticSearch) server up and running in about 1 hour (including reading through the docs and whatnot), had a basic index working within 2 hours, and had integrated realtime CRUD from my application to ES within 3. By hour 4 I had completely replaced my app's search system with ES.
I now have a hybrid solution that works well for my application. MongoDB is my persistent data store while ES acts as my index for queries.
My stack looks like this:
- Debian 7
- Apache2 with PHP 5.4 (will be replacing Apache soon because of memory overhead... looking for alternatives - maybe NGINX)
- Node.js 0.10.26 for realtime typeahead response (will be looking to replace with ES now that it's up and running)
- MongoDB 2.4.9
- ElasticSearch 1.0.1
In my PHP application I push inserts, updates and removes directly to ES via PHP's ES plugin. This is a manual process, but my application already had hooks in place to determine when to fire notifications to users, so it was a very simple job to add an extra step that keeps ES up to date with data changes.
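As a rough sketch of what that extra step amounts to (the `posts` index and `post` type names are hypothetical), each CRUD event maps onto a single request against ES's document REST API:

```javascript
// Map an application CRUD event onto the ES REST call that keeps the
// index in sync. Inserts and updates both PUT the full document under
// its id (ES reindexes in place); removes DELETE the same endpoint.
function esRequest(event, post) {
  var path = "/posts/post/" + post.id;
  switch (event) {
    case "insert":
    case "update":
      return { method: "PUT", path: path, body: JSON.stringify(post) };
    case "remove":
      return { method: "DELETE", path: path, body: null };
  }
}

console.log(JSON.stringify(esRequest("insert", { id: "abc123", text: "moo" })));
```

Because indexing by id is idempotent, replaying an update after a missed insert still leaves the index correct.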
At present I'm using ES purely as an indexing system: it returns the ids of the items that match a query, and I then query MongoDB for the data behind each matching item. Mongo's lookups by id are blazingly fast, so the extra round trip isn't a concern, and it keeps ES's memory usage down because ES doesn't have to store the full object data as well.
So searching looks like this:
- Client types a letter
- Request sent to PHP
- PHP asks ES for the ids of any items that match the query
- ES gives PHP the ids
- PHP asks MongoDB for the data for those ids
- MongoDB sends the data back
- PHP sends the data to the client
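Steps 4 and 5 of that flow can be sketched as follows (the response shape is ES's standard hits envelope; the ids are made up for illustration):

```javascript
// A trimmed-down ES search response: each hit carries the document id.
var esResponse = {
  hits: {
    total: 2,
    hits: [
      { _id: "5316a7f0c1e1a80001000001", _score: 1.4 },
      { _id: "5316a7f0c1e1a80001000002", _score: 1.1 }
    ]
  }
};

// Collect the ids ES returned...
var ids = esResponse.hits.hits.map(function (hit) {
  return hit._id;
});

// ...and hand them to MongoDB as a single $in lookup, which goes
// straight through the _id index.
var mongoQuery = { _id: { $in: ids } };

console.log(JSON.stringify(mongoQuery));
```

One batched `$in` query per search keeps the Mongo side to a single round trip rather than one query per hit.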
Optimising This Approach
ElasticSearch has a great REST-based API that allows queries to be posted to it and data to be returned at lightning speed. I'm tempted to let the client connect directly to ES, but I'm a little concerned about security. A simple solution here would be to write a quick Node.js app that does a similar job to the PHP component described above, but significantly faster, because it would not be creating and destroying the whole PHP app stack (including an Apache process) on every request.
The PHP process could also direct requests at this app to get data back from the index, which would avoid creating a new connection to ES on every request and would abstract the query process away from the PHP app.
This way the query system could be moved onto a separate self-contained server (or servers) without affecting the PHP app at all. The abstraction is also nice because it would allow the query system to become a software-as-a-service offering for people who want to index data at high speed but are less tech-savvy.
If you're looking for a great way to index complex text and geospatial data for high-throughput search queries, ES is a very good choice!
If you're one for getting a solution up and running straight away without all the server-setup annoyance, Irrelon is building a new service based on ES that will let you simply index any JSON data and query it at lightning speed from a simple API. Stay tuned!