Populating a MongoDB with POIs

I have been looking at alternatives to relational databases for a while now, and this week I came across MongoDB. MongoDB is a document-oriented database that tries to bridge the gap between key/value stores and relational databases. In this post I'm going to install MongoDB and populate it with 500,000 POIs from Tele Atlas.

To install MongoDB I only needed to download the Windows binaries, and that was it. Then I followed the guidelines in the Quickstart to start the database. Because the default data path for MongoDB on Windows is c:\data\db, I specified a custom data path on my D: drive.

> mongod.exe --dbpath "D:\mongodb\data" run

Once I got MongoDB up and running I downloaded and installed the Python bindings, named PyMongo. This also went very smoothly. On the MongoDB website there is a great introductory tutorial for PyMongo.

To populate the MongoDB I wrote a Python program that loops over the POIs in a layer and creates a dictionary with each POI's properties and coordinates. Each POI dictionary is then inserted into the pois collection of the poidb database.
First I connect to my running MongoDB and create a reference to a new database called poidb. Then I create a new collection called pois.

import arcgisscripting
from pymongo.connection import Connection

mongo_conn = Connection('localhost', 27017)
poidb = mongo_conn.poidb
pois = poidb.pois

Then comes some code that uses ESRI functions to create a list of the fields in the POI layer and a feature cursor for the POIs. I removed the SHAPE field from this list because the geometry is treated separately.

gp = arcgisscripting.create()
poi_path = r'D:\ta.gdb\eureur_____pi'

fields = gp.listfields(poi_path)
names = [f.name.upper() for f in iter(fields.next, None)]
names.remove('SHAPE')
fc = gp.searchcursor(poi_path)

In the next part we loop over the POIs in the cursor. For each POI we extract the x and y coordinates and save them in the poi dictionary. For all the fields we fetch the value and put the non-empty values in the dictionary. Strings are stored as unicode. Finally we insert the poi dictionary into the pois collection of the poidb. The script stops when more than 500,000 POIs have been inserted (500,002 to be exact).

for index, feat in enumerate(iter(fc.next, None)):
    poi = {}
    ## extract the coordinates of the point geometry
    point = feat.shape.getpart(0)
    poi['x'] = point.x
    poi['y'] = point.y
    for field in names:
        value = feat.getvalue(field)
        ## only keep non-empty values; decode strings to unicode
        if value and len(str(value).strip()) > 0:
            if isinstance(value, str):
                value = unicode(value, 'latin-1')
            poi[field] = value

    pois.insert(poi)
    if index > 500000:
        break

To count the POIs you only need to call the count method of the pois collection object.

print pois.count()

Before querying this data I'm going to create ascending indexes on the x and y fields. It would be better if spatial indexes were supported, but for reasonably sized datasets a double index is sufficient for bounding box queries. I could also try the RTree package, but that will be for another time. At the end, the information about the indexes is printed.

from pymongo import ASCENDING
pois.ensure_index('x', ASCENDING)
pois.ensure_index('y', ASCENDING)

print pois.index_information()

Query parameters are passed to MongoDB in a nested dictionary structure, so defining a bounding box query is a straightforward task. The $gt operator means greater than and $lt means less than. In this query I search for all the POIs with a longitude greater than 4.5 and smaller than 5.0 and a latitude greater than 50.5 and smaller than 51.0. Then I print the result count, the number of milliseconds the query took and the first resulting document.

query_pois = pois.find({"x":{"$gt":4.5, "$lt":5.0}, "y":{"$gt":50.5, "$lt":51.0}})

print query_pois.count()
print query_pois.explain()["millis"]
print query_pois.next()

The result looks like this :

4585
265
SON([(u'HSNUM', u'173'), (u'MUNCD', u'23062'), (u'MUNID', 10560032003493.0), (u'x', 4.5000037996786091), (u'NAME', u'Shell Overijse Frans Verbeekstraat'), (u'OBJECTID', 134471), (u'TELNUM', u'+(32)-(2)-6573607'), (u'STNAME', u'Frans Verbeekstraat'), (u'FEATTYP', 7311), (u'ADDRPID', 10560000020526.0), (u'RELPOS', 75), (u'POSTCODE', u'3090'), (u'STNAMELC', u'DUT'), (u'ARNAMELC', u'DUT'), (u'y', 50.769815399855986), (u'IMPORT', 2), (u'MUNNAME', u'Overijse'), (u'_id', ObjectId('YB\xd6l\x99\xe0$U\x8f\t>\xc3')), (u'CLTRPELID', 10560001815786.0), (u'ID', 10560300013584.0), (u'BUANAME', u'Hoeilaart')])

So it took 265 milliseconds to query 4585 POIs from a collection of 500,002 POIs with two range query parameters. If you need to optimize a query, you can take a look at the MongoDB site for some tips and tricks.
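
If you want to experiment yourself, PyMongo lets you force a specific index with the hint method of a cursor and compare the timings via explain. A minimal sketch, reusing the pois collection from above:

from pymongo import ASCENDING

## force the query to walk the index on x and time it again
hinted = pois.find({"x": {"$gt": 4.5, "$lt": 5.0},
                    "y": {"$gt": 50.5, "$lt": 51.0}}).hint([("x", ASCENDING)])
print hinted.explain()["millis"]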

We have come to the end of my first exploration of MongoDB. What I especially liked is the flexibility you get from this kind of database, and the ease of installation and use. The downside for geographic applications is that, at the moment, there is no built-in support for geometries. I hope you enjoyed this post and I welcome any comments and suggestions.


Python Toolbox 2 : Pygments

This is the second post in my series about Python tools, and I'm going to talk about Pygments. Pygments is a generic syntax highlighter implemented in Python. I started using it to be able to show the code samples on this blog in a decent way.

Pygments can highlight a very large list of programming languages, template languages and other sources like config files, shell scripts, CSS and many more. The markup of your sources can then be exported to multiple output formats like HTML, GIF or BMP images, LaTeX, RTF, SVG, BBCode and the terminal. More info about the formatters can be found here.
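
Pygments can also tell you at runtime which lexers and formatters your installation supports. A small sketch using the get_all_lexers and get_all_formatters helpers:

from pygments.lexers import get_all_lexers
from pygments.formatters import get_all_formatters

## every lexer is described by a (name, aliases, filetypes, mimetypes) tuple
for name, aliases, filetypes, mimetypes in get_all_lexers():
    print name, aliases

## formatters are exposed as classes with a name attribute
for formatter_class in get_all_formatters():
    print formatter_class.name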

Once installed you can call this library from within Python or you can call the provided executable from the command line. The documentation for pygmentize is very clear and extensive, so I won't reproduce it here, but I'll show you some small samples.

Below is a sample of command line usage. The first command outputs a file with your code wrapped in div and span elements with CSS class attributes. The second command creates the stylesheet for your code.

pygmentize -o test.html test.py
pygmentize -S default -f html > style.css

If you want to create a full HTML file from your code with the styles included, then all you need is the following command:

pygmentize -O full=true -o test.html test.py
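
The library equivalent of that command is the full option of the HtmlFormatter, which wraps the highlighted code in a complete HTML page with the styles embedded. A small sketch:

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = open('test.py').read()
## full=True produces a standalone page instead of a bare <div>
html = highlight(code, PythonLexer(), HtmlFormatter(full=True))
open('test.html', 'w').write(html)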

For the code on my blog I use the following simple Python script, which converts the scripts in a folder to HTML files. I also added the styles to the CSS of my blog.
First I import the relevant modules from pygments, plus the os module, and then create a basic HtmlFormatter.

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
import os

formatter = HtmlFormatter()

If you want a printout of all the different style classes, the following line is all you need. I ran this once and copied the output into the style definitions of my blog.

print formatter.get_style_defs('.highlight')

The following code converts all the Python files in a directory to HTML files with the same name. We loop over the elements in the folder, read the Python files, create a highlighted version of each and write it to a new file.

indir = r''  ## directory that contains the Python files
for x in os.listdir(indir):
    if x.endswith('.py'):
        infile = os.path.join(indir, x)
        outfile = infile.replace('.py', '.html')
        code = open(infile).read()
        html = highlight(code, PythonLexer(), formatter)
        f = open(outfile, 'w')
        f.write(html)
        f.close()
        print 'finished :', x

Projections and transformation (2)

In a comment on my previous post about projections and transformations, Gene pointed out that I could use OGR with its Python bindings. I have used the C# bindings for OGR in the past, but I thought it would be a nice exercise to use OGR from Python.

If you are on Windows and need to install GDAL/OGR and its Python bindings, I suggest you use the following installation guideline.

Below is the short piece of code I came up with, but there are numerous other samples on the web (sample 1, sample 2) and of course the documentation on the GDAL site. Note: I got an error when trying to use the TransformPoints method.

from osgeo import osr

def get_spatialref(epsg_code):
    spatialref = osr.SpatialReference()
    spatialref.ImportFromEPSG(epsg_code)
    return spatialref

lambert72_spatialref = get_spatialref(31300)
etrs89_spatialref = get_spatialref(4258)

coordinate_transformation = osr.CoordinateTransformation(lambert72_spatialref, etrs89_spatialref)

## start location : random point in Belgium
x = 155335.20410
y = 166096.08270
print coordinate_transformation.TransformPoint(x, y)
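
Until I figure out the TransformPoints error, a simple workaround is to call TransformPoint in a loop. A sketch with a couple of made-up Lambert 72 coordinates:

## hypothetical list of Lambert 72 coordinate pairs
points = [(155335.20410, 166096.08270),
          (150000.0, 160000.0)]

for px, py in points:
    ## TransformPoint returns an (x, y, z) tuple
    lon, lat, z = coordinate_transformation.TransformPoint(px, py)
    print lon, lat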

Another solution is to install and use pyproj, the Python bindings for the Proj4 library. The following sample code comes from its documentation.

import pyproj

# projection 1: UTM zone 15, grs80 ellipse, NAD83 datum
# (defined by epsg code 26915)
p1 = pyproj.Proj(init='epsg:26915')
# projection 2: UTM zone 15, clrk66 ellipse, NAD27 datum
p2 = pyproj.Proj(init='epsg:26715')
# find x,y of Jefferson City, MO.
x1, y1 = p1(-92.199881, 38.56694)
# transform this point to projection 2 coordinates.
x2, y2 = pyproj.transform(p1, p2, x1, y1)
print x2, y2

If you need to look up a spatial reference you might want to take a look at spatialreference.org.
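
To compare a definition with what spatialreference.org shows, you can also ask osr for the WKT or Proj4 representation; ExportToWkt and ExportToProj4 are part of the osr.SpatialReference API:

## print the WKT and Proj4 definitions of Belgian Lambert 72
print lambert72_spatialref.ExportToWkt()
print lambert72_spatialref.ExportToProj4()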

Good luck!

Python Toolbox 1 : Logging

This is the first post in a series about internal and external Python modules. Today I am going to give a brief overview of the built-in logging module.

The logging module is a core Python module that was introduced in Python 2.3. There are five logging levels (debug, info, warning, error and critical; fatal is an alias for critical). You can also log an exception, which outputs a nice traceback. There are numerous output possibilities, defined as handlers. The log information can, for example, be sent to a file, an HTTP server, the Windows event log, an e-mail address, a stream and more. The logger can also be configured to log to a different output depending on the logging level: you could for example log to a file when the level is debug, but receive an e-mail when a critical error occurs. In this post we will use a basic file logger and then create a rotating file logger. The rotating file logger makes a back-up of the logfile when it reaches a predefined maximum size.
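
To make the file-plus-e-mail scenario concrete, here is a minimal sketch with two handlers at different levels. The mail host, addresses and logger name are placeholders you would replace with your own:

import logging
from logging.handlers import SMTPHandler

logger = logging.getLogger('demo')
logger.setLevel(logging.DEBUG)

## everything from debug upwards goes to a file
file_handler = logging.FileHandler('demo.log')
file_handler.setLevel(logging.DEBUG)
logger.addHandler(file_handler)

## critical messages are also sent by e-mail (placeholder addresses)
mail_handler = SMTPHandler('localhost', 'logger@example.com',
                           ['admin@example.com'], 'Critical error')
mail_handler.setLevel(logging.CRITICAL)
logger.addHandler(mail_handler)

logger.debug('written to the file only')
logger.critical('written to the file and mailed')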

As the following code shows, basic logging is easy to set up. In short, it first imports the logging module and then initializes it by calling the basicConfig method. According to the help, the input parameters for this method are:

filename  Specifies that a FileHandler be created, using the specified filename,
          rather than a StreamHandler.
filemode  Specifies the mode to open the file, if filename is specified
          (if filemode is unspecified, it defaults to "a").
format    Use the specified format string for the handler.
datefmt   Use the specified date/time format.
level     Set the root logger level to the specified level.
stream    Use the specified stream to initialize the StreamHandler. Note that
          this argument is incompatible with 'filename' - if both are present,
          'stream' is ignored.

Then we test the different logging levels and also deliberately throw an error to test the exception logging. At the end I added some clean-up code because the logging module doesn't get unloaded between two test runs in some IDEs.

import logging

logging.basicConfig(filename='gissolved.log',
                    level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s : %(message)s')

try:
    logging.debug('debug')
    logging.info('info')
    logging.warning('warning')
    logging.error('error')
    logging.critical('critical')
    logging.fatal('fatal')
    try:
        raise TypeError, 'Demo error'
    except TypeError:
        logging.exception('exception')
finally: ## close logger properly
    logging.shutdown()
    logging.getLogger(None).handlers = []

OK, now let's create a rotating file logger. To do this we not only import the logging module but also the handlers sub-module and the Formatter class. Then we define the log path and create a logger object. Now it's time to create the RotatingFileHandler that is defined in the handlers sub-module. The parameters are the filename, the mode (defaults to 'a', which means append), maxBytes (the maximum number of bytes before a rollover occurs) and backupCount (the number of back-up files that will be created as needed). Then we define a format for the handler and add it to the logger object. Finally we set the logging level of our logger.

Now it's time to test the logger. I've put the code in a loop in order to log more bytes than the predefined maximum number of bytes per logfile. Internally, the rotating file handler first checks whether writing the current message would push the logfile past the maximum size. If so, it closes the current file, renames it to gissolved_rotate.log.1 and shifts any existing back-ups up a number (.1 becomes .2, and so on), deleting the oldest one once the maximum number of back-up files is reached; a new file containing the message is then created. The file being written to is always the original gissolved_rotate.log. In the finally block the current handler is removed from our logger and closed. If you've defined multiple handlers for your logger, you could loop over logger.handlers, call their close methods and then set logger.handlers to an empty list like I did in the first example.

import logging
from logging import handlers, Formatter

try:
    log_path = 'gissolved_rotate.log'
    logger = logging.getLogger('gissolved')
    ## create a rotating file handler and add it to the logger
    handler = handlers.RotatingFileHandler(log_path,
                                           maxBytes=5000000,
                                           backupCount=4)
    log_format = Formatter("%(name)s %(asctime)s %(levelname)s : %(message)s")
    handler.setFormatter(log_format)
    logger.addHandler(handler)
    ## set the root logging level
    logger.setLevel(logging.DEBUG)
    
    for i in range(11000):
        ## write test logging
        logger.debug('debug')
        logger.info('info')
        logger.warning('warning')
        logger.critical('critical')
        logger.error('error')
        logger.fatal('fatal')
        
        try:
            raise TypeError, 'Demo error'
        except TypeError:
            logger.exception('exception')

        logger.info('next round ' + str(i))

finally: ## close logger properly
    logger.removeHandler(handler)
    handler.close()

If you still have some questions, shoot, and I'll try to answer them.