Spatial indexing a MongoDb with Rtree

Update : recent versions of MongoDB support native geospatial indexing.

In my previous post about populating a MongoDb I mentioned the existence of the Rtree package as a solution to the lack of a spatial index mechanism in MongoDb. In this post 'm going to try this out. Rtree is a Python package with the sole purpose of creating spatial indexes. To install Rtree I downloaded the Windows installer and ran the installer.

To test Rtree I decided to create a spatial index for my previously created POI database in MongoDb. To do this I first connect to my running MongoDb instance. Then I create an Rtree disk index in the same directory as my script called poi_index. This creates two files (poi_index.dat and poi_index.idx) who will contain the index. Finally I loop over all the POIs in the database and add the x and y values to the index with as reference the id of the POI.

from pymongo.connection import Connection
from rtree import Rtree

mongo_conn = Connection('localhost', 27017)
poidb = mongo_conn.poidb
pois = poidb.pois

poi_index = Rtree("poi_index", overwrite=1)

for poi in pois.find():
    poi_index.add(poi["ID"], (poi["x"], poi["y"]))

Now we have created an index for the 500.002 POIs in my database. The file size of this index is 350 mb which is more than I expected it to be when compared my Mongo POI database which has a total file size of 460 mb. Querying this index is very easy and very fast. To time my query I added a handy wrapper function. Essentially to query the index the only thing you need to do is create a bounding box and call the intersection method of the index. This returns the ids of the intersecting points.

import time

def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        print '%s took %0.0f ms' % (func.func_name, (t2-t1)*1000.0)
        return res
    return wrapper

minx, maxx = 4.5, 5.0
miny, maxy = 50.5, 51.0
bbox = ( minx, miny, maxx, maxy )

@print_timing
def intersect(index, bbox):
    print len(index.intersection(bbox)), "hits"
intersect(poi_index, bbox)

On my machine this bounding box query took 265 ms on the MongoDb and only 15 ms with Rtree. This is a great improvement for my query. But the disadvantage is that I still need to query my POIs out of the MongoDb with the ids I got from Rtree. Sadly enough this doesn't go very fast so the total improvement is very small and with lots of points even negative. Strange thing is that when I do an IN query on MongoDb the result is always slower than querying the pois one by one by ID. My temporary conclusion is that Rtree is really fast but you only get the ids. The best solution would be that MongoDb could natively use the Rtree index or another spatial index. A small downside of Rtree is that the id can only be an integer value. So you can't store strings as ids. We have come to the end of my small exploration of Rtree as a mean to add a spatial index to MongoDb. I hope you liked it and I still welcome any comments/suggestions.

Related Posts
Populating a MongoDB with POIs
PostGIS Loading and querying data
Spatial queries with Pythonnet

2 comments:

dwight said...

i think mongodb $in query plans, for the particular way used here, are suboptimal. ticket created in bugdb SERVER-100

varun said...

Well, you could save features as wkt in any popular key value tables instead of mongoDB, and then query from there for example redis, memcached etc.... they have near pretty fast search capabilities!!!!