Benchmarking reading binary values from a file with F#, Python, Julia, R, Go and OCaml

In one of my recent posts I showed some F# code for storing and reading integers as binary data. Since then I've created different versions of the reading part in Python, Julia, Go, OCaml and R to compare the performance of the different languages on this simple task and to get a feel for what it's like to program in Julia, Go and OCaml, as I hadn't created any programs in these languages yet.

The table below shows the benchmark results for reading 10 times 10000 values and 10000 times 10 values, with links to the source code files in the different programming languages:

Language   10 times 10000 values   10000 times 10 values
F#         6 seconds               20 seconds
Python     26 seconds              40 seconds
Julia      45 seconds              72 seconds
Go         8 seconds               25 seconds
OCaml      2.5 seconds             48 seconds
R          110 seconds             NA

The overall fastest version is the one written in F#, but note that it's also the version I have tweaked the most. I'm not very experienced in most of these languages, so any performance tips are welcome. Note that I tried using memory-mapped files in .NET and Python; this improved performance when querying lots of values from the same file but made it worse in other cases.
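
For the curious, a memory-mapped read in Python could be sketched roughly as follows; this is a minimal sketch rather than the exact code I benchmarked, and read_values_mmap is just an illustrative name:

import mmap
import struct

def read_values_mmap(filename, indices):
    # map the whole file read-only and unpack native-order 4-byte ints in place
    values = []
    with open(filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for index in indices:
                v = struct.unpack_from("@i", mm, index*4)[0]
                values.append(None if v == -2147483648 else v)
        finally:
            mm.close()
    return values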

The implementation of the functionality is most of the time rather similar in the different languages. Some notable differences were:

  • Julia apparently doesn't have a null value so I refrained from checking whether the read integer value was equal to the int32 minimum value (-2147483648).
  • In Go converting the bytes to integers was faster with a custom function.
  • I didn't find a function in the OCaml Core library to convert bytes to a 32-bit integer, but luckily I found one on Stack Overflow.

F#:
open System
open System.IO

let readValue (reader:BinaryReader) cellIndex = 
    reader.BaseStream.Seek(int64 (cellIndex*4), SeekOrigin.Begin) |> ignore
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)
        
let readValues indices fileName = 
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    let values = Array.map (readValue reader) indices
    values

Python:
import os
import struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index*4L, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            if v == -2147483648:
                v = None
            values.append(v)
    return values

Julia:
function readvalue(stream, position)
    seek(stream, position)
    return read(stream, Int32)
end

function readvalues(filename::String, indices)
    stream = open(filename, "r")
    try
        return Int32[readvalue(stream, index*4) for index in indices]
    finally
        close(stream)
    end
end

Go:
import ("os")
func bytes2int(b []byte) int32 {
    v := int32(0)
    for  i := 0; i < 4; i++ {
        v = v | (int32(b[i]) << (uint(8*i)))
    }
    return v
}

func readValues(indices []int, filename string) []int32 {
    results := make([]int32, len(indices))
    b := make([]byte, 4)
    f, _ := os.Open(filename)
    defer f.Close()
    for i, cellIndex := range indices {
        f.Seek(int64(cellIndex*4), os.SEEK_SET)
        f.Read(b)
        value := bytes2int(b) // around 10-20% faster than binary.Read
        if value != -2147483648 {
            results[i] = value
        } else {
            results[i] = 99999
        }
    }
    return results
}

OCaml:
let input_le_int32 inchannel = (* http://stackoverflow.com/a/6031286/477367 *)
  let res = ref 0l in
    for i = 0 to 3 do
      let byte = input_byte inchannel in
        res := Int32.logor !res (Int32.shift_left (Int32.of_int byte) (8*i))
    done;

    match !res with
      | -2147483648l -> None
      | v -> Some(v)

let readvalue inchannel index =
  seek_in inchannel (index*4);
  input_le_int32 inchannel

let readvalues (indices:int array) filename =
  let inchannel = open_in_bin filename in
    try
      let result = Array.map (readvalue inchannel) indices in
        close_in inchannel;
        result
    with e ->
      close_in_noerr inchannel;
      raise e

R:

read.values <- function(filename, indices) {
  conn <- file(filename, "rb")
  read.value <- function(index) {
    seek(conn, where=index*4)
    readBin(conn, integer(), size = 4, n = 1, endian = "little")
  }
  r <- sapply(indices,read.value)
  close(conn)
  r[r==-2147483648] <- NA
  r
}

Any suggestions for improving the implementation in one of the above programming languages? Which language would you like to compare my results with? Which other language(s) do you expect to be faster than my benchmark results? Can you help me out with a version in your favorite language or in C, Fortran, Common Lisp, Scheme, Clojure, Java or J?

Becoming Functional

Here is my first book review from the O'Reilly Reader Review Program. I picked Becoming Functional by Joshua Backfield because I thought it would be nice to compare it with the last book I read on functional programming in JavaScript, by Michael Fogus (incidentally also published by O'Reilly).

Becoming Functional is a step-by-step introduction to some basic concepts of functional programming like higher-order functions (passing a function to a method), pure functions (functions that always return the same output for the same input, without side effects), immutable variables, recursion and lazy evaluation.

The first chapters are in Java but starting from chapter 4 the book switches to Groovy, and the last chapters are in Scala. The Java code is not very complicated and is probably understandable for anyone with knowledge of statically typed object-oriented programming languages. Note that all Java samples use Java 7, which is rather verbose for functional programming; this is one of the reasons the book switches to Groovy and Scala. The other reason is of course that these languages support functional concepts out of the box.

Content-wise the book starts with an introduction to functional programming, then makes the transition from first-class functions to higher-order functions and pure functions. Next are immutable variables and recursion, with a nice introduction to tail recursion, followed by small side steps into laziness and statements, to conclude with pattern matching and functional object-oriented programming. The last chapter is surprising as it first gives advice on transitioning into functional programming, then talks about new design patterns and ends with a full implementation of a simplistic database in Scala.

It's not a bad book in the sense that it teaches incorrect material, but it's not a very good book either. It's a book written for Java developers who have never heard of the basic concepts of functional programming, but personally I prefer a different approach to achieve this goal. Instead of refactoring code from the imperative style into the functional style and introducing concepts along the way, I think it's faster and easier to learn functional programming by clearly explaining and showing the concepts directly. Only then should you elaborate on how to migrate from the original code to the functional style and explain why this is a good thing to do. Although it's a rather short book (an estimated 140 pages) it is not very dense, but it gives semi-real-world examples based on requirements from a fictional company XXY. Only the most common functional programming concepts are introduced.

My advice is to buy this book if you're a Java developer who has no experience in functional programming.

Otherwise I would skip this book and instead read:
Or take a look at my page of recommendations.

Storing and fetching raster values in F#

In my quest for a fast system to fetch values from rasters at random cell locations I've been experimenting with a few different implementations in F#. More specifically, I now have 4 methods for reading raster files. As usual all source code is available online: AsciiToBin.fsx, and all comments on it are very welcome.

The first and simplest method I came up with was converting the ASCII raster files with integers to binary files where every integer is converted to its binary equivalent and written to disk. This saves about 50% of disk space compared to the very large ASCII files and also provides a fast and easy-to-program way of accessing the values. Note that there are cells with no data; these are stored as Int32.MinValue. As you can see the code is rather short.

module SimpleReadWrite = 

    let writeValue (writer:BinaryWriter) (value:int option) =
        match value with
        | Some(v) -> writer.Write(v)
        | None -> writer.Write(Int32.MinValue)

    let writeValues fileName (values:seq<int option>) =
        use writer = new BinaryWriter(File.Open(fileName, FileMode.OpenOrCreate))
        values
        |> Seq.iter (writeValue writer)
            
    let readValue (reader:BinaryReader) cellIndex = 
        // set stream to correct location
        reader.BaseStream.Position <- cellIndex*4L
        match reader.ReadInt32() with
        | Int32.MinValue -> None
        | v -> Some(v)
        
    let readValues fileName indices = 
        use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
        let values = List.map (readValue reader) (List.ofSeq indices)
        values

The second version I created uses a memory-mapped file for reading the values from the same format as before. This is slightly faster (about 2 times) when we want to query lots of values from the same raster, but also about 2 times slower when you query, for example, 10000 times 10 values from different rasters.

module MemoryMappedSimpleRead =

    open System.IO.MemoryMappedFiles

    let readValue (reader:MemoryMappedViewAccessor) offset cellIndex =
        let position = (cellIndex*4L) - offset
        match reader.ReadInt32(position) with
        | Int32.MinValue -> None
        | v -> Some(v)
        
    let readValues fileName indices =
        use mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open)
        let offset = (Seq.min indices ) * 4L
        let last = (Seq.max indices) * 4L
        let length = 4L+last-offset
        use reader = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read)
        let values = (List.ofSeq indices) |> List.map (readValue reader offset)
        values

The third version is similar to the simple reader but it fetches multiple bytes at once when two or more indexes are within a certain range. The performance is a bit worse than the simple reader so I'm not going into any further details, but if you want you can check the solution on GitHub. Any suggestions on easier ways of grouping the indexes by inter-distance are welcome (one possible approach is sketched below).
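
For reference, one simple way to group sorted indexes by their inter-distance could look like the following Python sketch; it is not the F# code from the repository and max_gap is an arbitrary illustrative threshold:

def group_indices(indices, max_gap=16):
    # group sorted cell indexes into runs where consecutive indexes are at most
    # max_gap apart, so each run can be fetched with a single read
    groups = []
    current = [indices[0]]
    for index in indices[1:]:
        if index - current[-1] <= max_gap:
            current.append(index)
        else:
            groups.append(current)
            current = [index]
    groups.append(current)
    return groups

# group_indices([1, 2, 3, 40, 41, 90]) gives [[1, 2, 3], [40, 41], [90]]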

The last version I created is more space efficient in my case. As I work with world oceanic data, about two thirds of my grids don't have any data (land). To avoid storing this data I separately store a file indicating which cells don't have data and skip those cells when writing the binary file. The disadvantage is that this makes everything a lot more complex because you have to map your cell indexes to the location of your binary value in your file in a space-efficient way (a rough sketch of this index mapping follows the F# code below). To be able to store the bitmap I created some BitConverter extension methods to convert a boolean array to a byte array and back, which I have also posted on fssnip. The end result has a performance comparable to the one from the simple reader, so if disk space is no problem then this solution isn't worth the additional complexity.

module BitConverter = 
    let pow2 y = 1 <<< y
    // convert booleans to bytes in a space efficient way
    let FromBooleans (bools:bool []) =
        seq {
            let b = ref 0uy
            for i=0 to bools.Length-1 do
                let rem = (i  % 8)
                if rem = 0 && i<> 0 then 
                    yield !b
                    b := 0uy
                if bools.[i] then
                    b := !b + (byte (pow2 rem))
            yield !b
        } |> Array.ofSeq
    // to booleans only works for bytes created with FromBooleans
    let ToBooleans (bytes:byte []) = 
        bytes
        |> Array.map (fun b -> Array.init 8 (fun i -> ((pow2 i) &&& int b) > 0))
        |> Array.concat
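
The idea behind the index mapping can be sketched in a few lines of Python; this is only an illustration of the approach (assuming a has_data boolean per cell, as encoded by the bitmap above), not the F# implementation from the repository:

def build_offsets(has_data):
    # offsets[i] = number of cells with data before cell i,
    # so the value for cell i (if it has data) starts at byte offsets[i] * 4
    offsets = []
    count = 0
    for present in has_data:
        offsets.append(count)
        if present:
            count += 1
    return offsets

def value_position(offsets, has_data, cell_index):
    # byte position of the value for cell_index, or None for no-data cells
    if not has_data[cell_index]:
        return None
    return offsets[cell_index] * 4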

After lots of tweaking I managed to get performance similar to the SimpleReadWrite version, with 30% less disk space needed but a more complex codebase.

Some performance related things I learned on the way are:
  • The Get method from System.Collections.BitArray is slow
  • You might want to convert lots of Seq chaining to one for loop
  • Some use of mutable values (within a function) might be necessary
  • Precompute as much as you can
And as I've tweeted before, I really like to evaluate my code in the REPL (same for my Python and R work).

Any thoughts on how to make this faster? Do you know a fast library/system that achieves the same results? Should I compress my rasters more to decrease disk access? Any other suggestions?


Think Python

Several years ago, when I was learning Python for the first time, I read the online book "How to Think Like a Computer Scientist". Today I just finished Think Python: How To Think Like a Computer Scientist by Allen B. Downey (free e-version), which is an evolved version of that book.

You should read this book if you want to learn programming or if you want to learn Python. But if you already have advanced programming skills then I would suggest just skimming the free online version and spending your money on a more advanced Python book.

Think Python starts with the most basic things like operators, variables and assignment; then functions, conditionals, recursion and iteration are introduced, followed by strings, lists, dictionaries and tuples. The next chapter is a practical chapter on files and the book ends with 4 chapters introducing object-oriented programming. Between the different chapters there are 4 case studies where the learned concepts and techniques are applied.

What you won't learn in this book: 

  • what the different standard libraries are
  • web development in python
  • popular (scientific) libraries like numpy
  • writing (unit) tests for your programs
  • Python 3, except some small remarks
What I particularly liked:
  • the information about debugging your programs at the end of every chapter and the appendix 
  • the step by step introduction to object-oriented programming
  • the case studies
  • the exercises
  • the glossary at the end of every chapter
As you might have inferred from the above, this book is not a reference book (we have the web for that) but a book that teaches you to think like a programmer.

An alternative way to learn Python is Learn Python the Hard Way (html) (pdf + epub + video) by Zed Shaw.


If you want to improve your Python skills then take a look at this list of advanced books. I especially liked Expert Python Programming by Tarek Ziadé.

Functional Programming in JavaScript

Last week I was really surprised to find Functional JavaScript: Introducing Functional Programming with Underscore.js by Michael Fogus in the local library. I just finished it and wanted to leave a short review here.

This book really delivers what the title promises: an introduction to functional programming in JavaScript using the library Underscore.js. It doesn't teach you JavaScript or Underscore.js but teaches what the different functional programming concepts are and how they can be implemented in JavaScript. This is done in less than 250 pages of densely (in a good way) written text and example code. Starting from the basics like first-class functions, applicative programming, variable scoping and closures, the book moves on to higher-order functions, currying and partial function application. Then some side steps are made with a great chapter on recursion, which ends with the trampoline, and a chapter on other important functional aspects like purity and immutability. Next is a chapter about flow-based programming: what it is, why it matters and different ways to define flows in your programs. The last chapter makes the connection with object-oriented programming and introduces mixins.

I really enjoyed reading this book because it is written very fluently, without heavy (unnecessary) jargon, and probably sits at a sweet spot on my learning curve. I've already read Real World Functional Programming: With Examples in F# and C# by Thomas Petricek and Jon Skeet and the first chapters of SICP, but I haven't used functional programming a lot in the wild. I've written my share of JavaScript programs but nothing very advanced, except maybe a Google Maps-like library from scratch. If you're new to JavaScript AND functional programming then I would advise against this book, but otherwise, if you're motivated and don't get scared away by the first chapters, everything will be fine. Some playing around with the examples (like I did in this fiddle) and learning the basics of how to call passed-in functions and how the often used Underscore.js functions (map, reduce, ...) work might be needed to get the most out of this book. Overall this book is a very complete introduction to functional programming; the only thing I missed was a part on functional pattern matching. Note that this book is more about introducing different functional programming techniques than about when and how to apply these techniques in your day-to-day programming.

Other books you might be interested in:
JavaScript: The Good Parts by Douglas Crockford (he popularized JSON and wrote JSLint and JSMin)
JavaScript: The Definitive Guide by David Flanagan
JavaScript Allongé by Reginald Braithwaite
Real World Functional Programming: With Examples in F# and C# by Thomas Petricek and Jon Skeet
Learn you a Haskell for Great Good! by Miran Lipovača
Learn you some Erlang for Great Good! by Fred Hébert

Working in lat long: great circle distance, bearing, midpoint and centroid calculations

For my work in species distribution modeling I'm mostly working with lat long coordinates, so I needed some great circle functions to calculate the point-to-point distance, the point-to-line distance and the centroid of a group of points. Starting from 2 excellent resources, http://www.movable-type.co.uk/scripts/latlong.html and http://williams.best.vwh.net/ftp/avsig/avform.txt, I created the following functions in Python, with some tests. As usual the full source code can be found online: GreatCircle.py.

I started with importing some functions from the math module and defining a small Point class:

from math import radians, degrees, cos, sin, sqrt, atan2, asin, fabs, pi

class Point(object):
    def __init__(self,x,y):
        self.x = x
        self.y = y

Then came some very simple functions for calculating the distance (with the Haversine formula), the bearing and the midpoint. These are almost verbatim translations from the previously mentioned resources.


def distance_haversine(A, B, radius=6371000):
    """ note that the default distance is in meters """
    dLat = radians(B.y-A.y)
    dLon = radians(B.x-A.x)
    lat1 = radians(A.y)
    lat2 = radians(B.y)
    a = sin(dLat/2) * sin(dLat/2) + sin(dLon/2) * sin(dLon/2) * cos(lat1) * cos(lat2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    return c * radius

def bearing(A, B):
    dLon = radians(B.x-A.x)
    lat1 = radians(A.y)
    lat2 = radians(B.y)
    y = sin(dLon) * cos(lat2)
    x = cos(lat1)* sin(lat2) - sin(lat1) * cos(lat2)* cos(dLon)
    return atan2(y, x)

def bearing_degrees(A,B):
    return degrees(bearing(A,B))

def midpoint(A,B):
    ## little shortcut
    if A.x == B.x: return Point(A.x, (A.y+B.y)/2)
    if A.y == B.y: return Point((A.x+B.x)/2, A.y)
    
    lon1, lat1 = radians(A.x), radians(A.y)
    lon2, lat2 = radians(B.x), radians(B.y)
    dLon = lon2-lon1

    Bx = cos(lat2) * cos(dLon)
    By = cos(lat2) * sin(dLon)
    lat3 = atan2(sin(lat1)+sin(lat2), sqrt((cos(lat1)+Bx)*(cos(lat1)+Bx) + By*By))
    lon3 = lon1 + atan2(By, cos(lat1) + Bx)
    return Point(degrees(lon3),degrees(lat3))

Then came the more puzzling part, where I implemented the cross track error to calculate the great circle distance from a point to a line.


def crosstrack_error(p,A,B, radius=6371000):
    """ distance (in meters) from a point to the closest point along a track
        http://williams.best.vwh.net/avform.htm#XTE """
    dAp = distance_haversine(A, p, radius=1)
    brngAp = bearing(A,p)
    brngAB = bearing(A,B)
    dXt = asin(sin(dAp)*sin(brngAp-brngAB))
    return fabs(dXt) * radius

But the results didn't always match the distances that I calculated with PostGIS, so I wrote a recursive binary search function consisting of 3 parts: calculating the midpoint, calculating the distance from the point to the two ends and the midpoint, and continuing with the half that is closest to the point until the difference in distance between the point and the two ends of the line is within a certain tolerance. The disadvantages of this function are that the rounding errors in calculating the midpoint do add up and that it is not very fast. So if anyone knows what's wrong with my cross track implementation or knows a better method for calculating the great circle distance between a point and a line, feel free to comment below or send me an e-mail at mail@<thisdomainhere>.com.


def point_line_distance(p, A,B, radius=6371000, tolerance=0.1): 
    # tolerance and radius in meters
    """ recursive function that halves the search space until result is within tolerance
        disadvantages:
        1) the rounding errors in midpoint do add up
        2) not very fast """
    def rec_point_line_distance(p,A,B,dA,dB):
        C = midpoint(A,B)
        dC = distance_haversine(p,C)
        if fabs(dC-dA) < tolerance or fabs(dC-dB) < tolerance:
            return dC
        elif dA < dB:
            return rec_point_line_distance(p,A,C,dA,dC)
        else:
            return rec_point_line_distance(p,C,B,dC,dB)
    dA = distance_haversine(p,A)
    dB = distance_haversine(p,B)
    return rec_point_line_distance(p,A,B,dA,dB)

The last function I needed was finding the centroid of a set of points in lat long coordinates. Below is an implementation based on a response on gis.stackexchange.com. First all coordinates are converted to cartesian coordinates, then the averages of the x, y and z coordinates are calculated, and finally these average coordinates are converted back to longitude and latitude.
 
def get_centroid(points):
    """ 
    http://www.geomidpoint.com/example.html 
    http://gis.stackexchange.com/questions/6025/find-the-centroid-of-a-cluster-of-points
    """
    sum_x,sum_y,sum_z = 0,0,0
    for p in points:
        lat = radians(p.y)
        lon = radians(p.x)
        ## convert lat lon to cartesian coordinates
        sum_x = sum_x + cos(lat) * cos(lon)
        sum_y = sum_y + cos(lat) * sin(lon)
        sum_z = sum_z + sin(lat)
    avg_x = sum_x / float(len(points))
    avg_y = sum_y / float(len(points))
    avg_z = sum_z / float(len(points))
    center_lon = atan2(avg_y,avg_x)
    hyp = sqrt(avg_x*avg_x + avg_y*avg_y) 
    center_lat = atan2(avg_z, hyp)
    return Point(degrees(center_lon), degrees(center_lat))

For the implementation and running of my tests I used the built-in unittest module in the following way.

import unittest
class Test_gc(unittest.TestCase):
    def test_distance_haversine(self):
        d = distance_haversine(Point(-10.0001, 80.0001), Point(7.935, 63.302))
        self.assertAlmostEqual(d, 1939037.0, places=0)
        
    def test_bearing(self):
        b = bearing(Point(-10.0001, 80.0001), Point(7.935, 63.302))
        self.assertAlmostEqual(b, 2.661709,places=6)

    def test_bearing_degrees(self):
        b = bearing_degrees(Point(-10.0001, 80.0001), Point(7.935, 63.302))
        self.assertAlmostEqual(b, 152.504694,places=6)

    def test_midpoint(self):
        m = midpoint(Point(-20.0,5.0), Point(-10.0,5.0))
        self.assertAlmostEqual(-15.0, m.x, places=6)
        m = midpoint(Point(7.0,5.0), Point(7.0,15.0))
        self.assertAlmostEqual(10.0, m.y, places=6)
        
    def test_crosstrack_error(self):
        p,A,B = Point(0.0, 1.0), Point(1.0, 1.0), Point(2.0, 1.0)
        d = crosstrack_error(p,A,B)
        self.assertAlmostEqual(d, distance_haversine(p,A), places=0)
        ## Apparently crosstrack_error returns incorrect results in some very specific cases (when p is completely off track)
        
    def test_point_line_distance(self):
        
        p,A,B = Point(0.0, 1.0), Point(1.0, 1.0), Point(2.0, 1.0)
        d = point_line_distance(p,A,B)
        self.assertAlmostEqual(d, distance_haversine(p,A), places=0)

        p,A,B = Point(2.0, 1.2), Point(1.0, 1.0), Point(3.0, 1.0)
        d = point_line_distance(p,A,B)
        self.assertAlmostEqual(d, distance_haversine(p,Point(2.0,1.0)), places=0)

        p,A,B = Point(0.0, 90.0), Point(-179.0, -21.0), Point(179.5, -22.0)
        d = point_line_distance(p,A,B)
        self.assertAlmostEqual(d, distance_haversine(p,A), places=0)

        p,A,B = Point(0.0, 89.0), Point(-179.0, -22.0), Point(179.5, -22.0)
        d = point_line_distance(p,A,B)
        self.assertAlmostEqual(d, distance_haversine(p,Point(0, -22)), places=0)
        
    def test_get_centroid(self):
        ## check crosses the north pole
        center = get_centroid([Point(0.0, 80.0), Point(180.0,80.0)])
        self.assertEqual(90.0, center.y)
        ## check multiple points
        center = get_centroid([Point(0.0, 0.0), Point(10.0,0.0), Point(5.0,10.0)])
        self.assertAlmostEqual(5.0, center.x)
        ## even more points
        center = get_centroid([Point(0.0, 0.0), Point(10.0,0.0),Point(10.0,30.0), Point(10.0,10.0), Point(20.0, 0.0)])
        self.assertAlmostEqual(10.0, center.x)
        ## not lat lon average
        center = get_centroid([Point(0.0, 30.0), Point(10.0,30.0),Point(10.0,60.0), Point(0.0,60.0)])
        self.assertNotAlmostEqual(45.0, center.y)
        self.assertAlmostEqual(5.0, center.x)
        ## crosses date line
        center = get_centroid([Point(170.0, 30.0), Point(-160.0,30.0),Point(170.0,60.0), Point(-160.0,60.0)])
        self.assertNotAlmostEqual(45.0, center.y)
        self.assertAlmostEqual(-175.0, center.x)

if __name__ == '__main__':
    unittest.main()

That's it for this post, the full code is on GitHub. If you'd like more posts like this on this blog then feel free to make some topic suggestions in the comments and make sure to subscribe to the updates of this site via RSS or via e-mail.

Culture & Empire: Digital Revolution

It's been a while since I've read such an influential book, so I wanted to share it here:

Culture & Empire: Digital Revolution
written by Pieter Hintjens, a campaigner, writer, and programmer (creator of ØMQ)

A book about the digital revolution and the battle between our new communities and the old power and old money. Full of gems like how to build a community, why Germany went crazy in 1939, what good property systems are, the societal cost of patents and many more.

Start reading now on a single page, as pdf or epub, or buy it on Amazon (affiliate link).

Here is a quote from the beginning of Chapter 4:
Once upon a time, there was a great Empire that ruled the known world. It owned all the lands, the wealth beneath, and the wealth above. The Empire was run by an old, faceless society of criminals. It ran on cheap oil and cheap blood. It smashed its opponents in the name of Peace. It burned their lands in the name of Reconstruction. It enslaved them in the name of Freedom. It built massive castles of edict and punishment to govern its populations, and it fed them a river of pap to keep them docile. It was powerful, invincible, and paranoid.

Far away, in a different place, a civilization called Culture had taken seed and was growing. It owned little except a magic spell called Knowledge. The Culture ran on light, and built little bubbles of fire and hope. It seduced its critics by giving them what they wanted, no matter how unusual. And as it pulled in more people, it grew and built more of its bubbles.

When the Empire first encountered the Culture, it was puzzled. There were no armies to crush, no statesmen to corrupt and recruit, no castles to loot and burn. So it ignored the Culture and its pretty bubbles, hoping it would go away.

The Culture grew, and grew faster than you could follow. In less than a generation, it had started to build cities, impossibly beautiful spheres of fire and hope, massive, and yet gentler than the breeze. More people quietly left the castles to move to the cities of the Culture, where they too learned to build their own bubbles of flames and joy.

The Culture seemed harmless. However, the Empire depended on its vassal masses. If the masses left to go to the Culture's cities, the Empire would starve and die. Total War was inevitable. Both the Empire and the Culture knew it, and prepared for it in very different ways.

The Empire attacked. It tore down the cities closest to it and told the Culture, stop building or we will come back. And for each city it burnt, a hundred others sprang up. Culture shrugged and said, "We enjoy building new cities." So the Empire sent its infiltrators and spies into the cities to try to corrupt them. And the Culture laughed, clapped its hands, and exclaimed, "We do much worse to ourselves every day. Look, we enjoy this game!" And it opened its hands. And there lay some of the Empire's darkest and deepest secrets, for all to see.

So the Empire, the cold finger of fear touching its heart, smiled its most sincere smile and welcomed the Culture into its lands. And then it began to erect a far wall so wide and so high that it could cover all the cities of the Culture in darkness. If the Culture ran on light, thought the Empire, then it would destroy light.

For other recommended books take a look at the following page.

Array manipulation and booleans in numpy, hunting for what went wrong

Recently I was doing some raster processing with gdal_calc.py but I was getting some surprising results.
The goal was to convert some continuous values into a binary value based on a threshold and then detect places where only A was True (value: -1), where only B was True (value: 1) and where both A and B were True (value: 0). With a threshold set to 2 I came up with the following formula:

(A > 2) - (B > 2)

But after calling gdal_calc.py with my formula and inspecting the results, I only got values of 0 and 1.
After inspecting gdal_calc.py I noticed that it uses numpy, and more specifically numpy arrays, for the raster manipulation.

This is how my Python shell session went (numpy_array_math.py):

>>> import numpy as np
>>> a = np.array([1,2,3,4])
>>> b = np.array([1,5,3,2])
>>> print(a > 2)
[False False  True  True]
>>> print(b > 2)
[False  True  True False]
>>> print(True-False)
1
>>> print(False-True)
-1
>>> print((a>2)-(b>2))
[False  True False  True]
>>> print((a>2)*1-(b>2))  # we got a winner
[ 0 -1  0  1]

The problem was that boolean subtraction in Python does generate the expected numeric results, but the results were converted back into a boolean array by numpy after the subtraction. And indeed, converting -1, 1 or any other non-zero number to a boolean gives True, which when converted back to a number for writing the raster to disk gives you the value 1.
The solution was to force at least one of the arrays to be numeric so that we subtract numeric values.

>>> bool(1)
True
>>> bool(2)
True
>>> bool(-1)
True
>>> bool(0)
False
>>> print((a>2)*1)
[0 0 1 1]
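
An equivalent and perhaps more explicit fix is to cast the boolean arrays to an integer type before subtracting; this is just a variation on the same idea, not something gdal_calc.py does for you:

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([1, 5, 3, 2])

# casting both masks to int keeps the subtraction result numeric
result = (a > 2).astype(np.int32) - (b > 2).astype(np.int32)
print(result)  # [ 0 -1  0  1]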

If you want to try this yourself on Windows then the easiest way to install gdal is with the OSGeo4W installer. A Windows installer for numpy can be found on the website of Christoph Gohlke, but consider also installing the full SciPy stack with one of the scientific Python distributions.

What surprising results have you encountered with numpy.array or gdal_calc?

Minimal introduction to Python

A few years ago I gave a short introduction to Python to some of my former colleagues and this was the result (also on github):

# 1) data basics

# numbers
i = 1234    # integer
f = 1.234   # float

# operators
a = 12 + 5  # = 17
b = 3 * 4   # = 12
c = 6 / 3   # = 2
d = 46 % 5  # = 1 (=> modulo operator)

a += 1      # = 18
a -= 2      # = 16
a *= 2      # = 32
a /= 2      # = 16
a %= 5      # = 1

# text
s = 'string'                
s = s.upper()               # 'STRING'
s = s.lower()               # 'string'
s.startswith('s')           # true
s.endswith('ng')             # true
s = s.replace('str', 'bl')  # 'bling'
l = s.split('i')            # list ['bl', 'ng']
strings = ['s', "t", "r'i'n", 'g"s"tring', "3"]

## add the prefix r for e.g. paths to files to escape the backslash character

testdoc = r'test\test.txt' ## same as : 'test\\test.txt'

# list
l = ['a1']
l.append('b2')
l.append('c3')
l[0] # 'a1'


mixed = ['3', 'abc', 3, 4.56]

d = {} # dictionary (key-value pairs)

d = {'a' : 'blabla',
     'b' : 1,
     3 : 'ablabl',
     4 : 12.3,
     16.9 : 'dif3'}

d['a'] # 'blabla'

# 2) conditions, loops and functions

# indentation is required !!!

if 3 > 2:
    print('3 > 2')
elif 1 == 0:
    print('1 = 0')
else:
    print('else clause')

if 'a' in d:
    print(d['a'])
else:
    print('not found')

for x in range(0, 5):  # from 0 to 4 (0, 1, 2, 3, 4)
    print(x)

letters = ['a', 'b', 'c']
for letter in letters:
    print(letter)

# list comprehension
upper_letters = [letter.upper() for letter in letters]
print(upper_letters)


d = { 1:'a', 2:'b', 3:'c'}
for key, value in d.iteritems():
    print(str(key), value)
for key in d.keys():
    print('key :', str(key))
for value in d.values():
    print('value :', value)

def special_sum(numberlist):
    total = 0
    for element in numberlist:
        if element < 5:
            continue # go to the next element
        elif total > 100:
            break # stop the loop
        else:
            total += element
    return total

print(special_sum([1,2,2,4,8,50, 60])) # = 118

# 3) using os and shutil

# import modules

import os
import shutil

def setup_test(directory):
    if not os.path.isdir(directory):
        os.mkdir(directory)
    file1 = os.path.join(directory, 'file1_test.dll')
    open(file1, 'a').close() # create empty file and close it
    file2 = os.path.join(directory, 'file2_test.txt')
    open(file2, 'a').close() # 'a' creates a file for appending (doesn't overwrite)
    
setup_test('test')

# looping over files in dir and subdirs and renaming some of them
def list_files(startdir):
    for element in os.listdir(startdir):
        path = os.path.join(startdir, element)
        if os.path.isfile(path):
            print(path)
            root, extension = os.path.splitext(path)
            if extension.lower() == '.dll': # add an extra extension to dll's
                shutil.copyfile(path, path + '.REMOVETHIS')
                # with os.rename you can replace the file
                ## os.rename(path, path + '.REMOVETHIS')
        else:
            list_files(path)

startdir = r'test'
list_files(startdir)

# or you can loop with os.walk

for root, directories, files in os.walk(startdir, topdown=True):
    print('dir : %s' % root)
    if files:
        print('files :')
        for f in files:
            print('\t', os.path.join(root, f))

# 4) reading, parsing and writing files

def create_tab_test(inputpath):
    with open(inputpath, 'w') as inputfile: # 'w' creates a file for (over)writing
        # create some dummy content separated by tabs and with newlines at the end
        lines = ['\t'.join([str(1*i),str(2*i)])+'\n' for i in range(5)]
        # write to the file
        inputfile.writelines(lines)
        
def tab_to_csv(inputpath, outputpath): # only for small files
    lines = []
    with open(inputpath) as inputfile:
        for line in inputfile:
            line = line.replace('\n', '')
            columns = line.split('\t')
            line = ';'.join(columns)
            lines.append(line)
    with open(outputpath, 'w') as outputfile: # overwrites the outputfile if it exists
        outputfile.write('\n'.join(lines))
    
inputpath = r'test\test.txt' ## or 'D:\temp\test.txt'
outputpath = r'test\test.csv'

create_tab_test(inputpath)
tab_to_csv(inputpath, outputpath)

Want to learn more or need some help?
Contact me at mail@samuelbosch.com or take a look at one of the following resources:
If you want a bigger list of Python and other CS-related books then the following free programming books list might be what you're looking for.

-- update --
Some suggestions by Meriam in the comments:

BIOMOD2: invalid term in model formula

While trying out BIOMOD2 for species distribution modeling I encountered the following error when trying to create a random forest (RF) model:

Error in terms.formula(formula, data = data) : 
  invalid term in model formula
Error in predict(model.bm, Data[, expl_var_names, drop = FALSE], on_0_1000 = TRUE) : 
  error in evaluating the argument 'object' in selecting a method for function 'predict': Error: object 'model.bm' not found

After lots of digging in the source code of BIOMOD I discovered what happened and how to fix it.

The line where the error was occurring looked like this:

model.sp <- try(randomForest(formula = makeFormula(resp_name, 
    head(Data), "simple", 0), data = Data[calibLines, 
    ], ntree = Options@RF$ntree, importance = FALSE, 
    norm.votes = TRUE, strata = factor(c(0, 1)), 
    nodesize = Options@RF$nodesize, maxnodes = Options@RF$maxnodes))

As you can see a formula is created from our data. The generated formula looked like this:

654987634 ~ 1 + nitrate_mean + par_mean + sst_mean + sst_range


The number 654987634 is my response variable name, which I set when calling BIOMOD_FormatingData and which apparently has been added to the Data data.frame as the first column.


Solution: adding a letter prefix to my response variable name fixed my issue.

Lessons learned:
  1. It's incredibly useful that you can take a look at external source code in an R session by typing its name; for internals you add the package name and three colons, e.g. biomod2:::.Biomod.Models. Generic functions can be fetched with getMethod, e.g. getMethod(biomod2::predict, signature="RF_biomod2_model")
  2. Read the docs very carefully: this is the description for resp.name: response variable name (character). The species name.
  3. An assertion in BIOMOD_FormatingData would have saved me an hour or two. But note that a naive check like is.character(resp.name) wouldn't work in this case, because once the formula is made it isn't a character anymore.
  4. Adding more diagnostic information to your error messages saves people time. If the generated formula had been printed as well, I would probably have noticed the issue directly.
  5. Libraries make you win time, but you lose some as well.
If you liked this content then don't forget to subscribe on the top left.

Creating a kernel density estimate map in R

In this post I'm going to create a kernel density estimate map in R from a file with latitude/longitude coordinates. WARNING: depending on your application the following gives incorrect results because a non-spherical kernel density estimator is used with spherical data (big thanks to Brian Rowlingson for pointing that out). I didn't find any spherical kernel density estimation implementation in R, but if you know any please let me know. While exploring this issue I also wrote an implementation similar to the code below that generates a kernel density estimate with the spatial statistics package splancs instead of KernSmooth (splancs_kernel_density.R).

To create this map we'll use the bkde2D function in the KernSmooth package to generate the kernel density estimates and the raster package for outputting a raster file. If you don't have the KernSmooth and raster packages then they can, as usual, be installed with:

install.packages("KernSmooth")
install.packages("raster")

In order to create the 2D binned kernel density map I first loaded the KernSmooth and raster packages.

library("KernSmooth")
library("raster")

Then we read the input file with the coordinates of the points that we want to generate the kernel density estimate of. This input file is an export from the Ocean Biogeographic Information System (OBIS) and represents distribution records of the seaweed Alaria esculenta.


input <- "Alaria_esculenta.csv"
output <- "Alaria_esculenta_density.asc"

# get the coordinates
records <- read.csv(input)
coordinates <- records[,2:3]

Next we compute the 2D binned kernel density estimate. In order to limit the size of the generated output files I set all values smaller than 0.00001 to 0.

# compute the 2D binned kernel density estimate
est <- bkde2D(coordinates, 
              bandwidth=c(3,3), 
              gridsize=c(4320,2160),
              range.x=list(c(-180,180),c(-90,90)))
est$fhat[est$fhat<0.00001] <- 0 ## ignore very small values

Finally we create a raster file from the output of bkde2D, inspect it visually and export it as an ascii file.

# create raster
est.raster = raster(list(x=est$x1,y=est$x2,z=est$fhat))
projection(est.raster) <- CRS("+init=epsg:4326")
xmin(est.raster) <- -180
xmax(est.raster) <- 180
ymin(est.raster) <- -90
ymax(est.raster) <- 90
# visually inspect the raster output
plot(est.raster)
# write an ascii file
writeRaster(est.raster,output,"ascii")

And the output looks like this. As you can see most records are located near 50°N and 0°, which corresponds to the distribution records map generated by OBIS.
The output ascii raster file can be used as an input file for OccurrenceThinner which uses the kernel density estimate grid for filtering out records from densely sampled regions in order to avoid overfitting while building species distribution models (SDM).

Code and data for this post are in my blog github repository.