Showing posts with label Fsharp. Show all posts
Showing posts with label Fsharp. Show all posts

Benchmarking reading binary values from a file with F#, Python, Julia, R, Go and OCaml

In one of my recent posts I showed some F# code for storing and reading integers as binary data. Since then I've created different versions of the reading part in Python, Julia, Go, OCaml and R to be able to compare the performance of the different languages on this simple task and to get a feeling on what it's like to program in Julia, Go and OCaml as I hadn't created any programs in these languages yet.

Below table shows the benchmark results for reading 10 times 10000 values and 10000 times 10 values with links to the source code files in the different programming languages:

Language 10 times 10000 values 10000 times 10 values
F# 6 seconds 20 seconds
Python 26 seconds 40 seconds
Julia 45 seconds 72 seconds
Go 8 seconds 25 seconds
OCaml 2.5 seconds 48 seconds
R 110 seconds NA

The overall fastest version is the one written in F# but note that it's also the version I have tweaked the most. As I'm not very experienced in most of the languages so any performance tips are welcome. Note that I tried using memory mapped files in .NET and Python this improved performance when querying lots of values from the same file but also made it worse in other cases.

The implementation of the functionality is most of the times rather similar in the different languages. Some notable differences where:

  • Julia apparently doesn't have a null value so I refrained from checking whether the read integer value was equal to the int32 minimum value (-2147483648).
  • In Go converting the bytes to integers was faster with a custom function.
  • I didn't find a function in the OCaml Core library to convert bytes to a 32-bit integer, but luckily I found one on Stack Overflow.

F#:
open System
open System.IO

let readValue (reader:BinaryReader) cellIndex = 
    reader.BaseStream.Seek(int64 (cellIndex*4), SeekOrigin.Begin) |> ignore
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)
        
let readValues indices fileName = 
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    let values = Array.map (readValue reader) indices
    values

Python:
def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index*4L, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            if v == -2147483648:
                v = None
            values.append(v)
    return values

Julia:
function readvalue(stream, position)
    seek(stream, position)
    return read(stream, Int32)
end

function readvalues(filename::String, indices)
    stream = open(filename, "r")
    try
        return Int32[readvalue(stream, index*4) for index in indices]
    finally
        close(stream)
    end
end

Go:
import ("os")
func bytes2int(b []byte) int32 {
    v := int32(0)
    for  i := 0; i < 4; i++ {
        v = v | (int32(b[i]) << (uint(8*i)))
    }
    return v
}

func readValues(indices []int, filename string) []int32 {
    results := make([]int32, len(indices))
    b := make([]byte, 4)
    f,_ := os.Open(filename)
    for i, cellIndex := range indices {
        f.Seek(int64(cellIndex*4), os.SEEK_SET)
        f.Read(b)
        value := bytes2int(b) // around 10-20% faster then binary.Read
        if value != -2147483648 {
            results[i] = value
        } else {
            results[i] = 99999
        }
    }
    return results
}

OCaml:
let input_le_int32 inchannel = (* http://stackoverflow.com/a/6031286/477367 *)
  let res = ref 0l in
    for i = 0 to 3 do
      let byte = input_byte inchannel in
        res := Int32.logor !res (Int32.shift_left (Int32.of_int byte) (8*i))
    done;

    match !res with
      | -2147483648l -> None
      | v -> Some(v)

let readvalue inchannel index =
  seek_in inchannel (index*4);
  input_le_int32 inchannel

let readvalues (indices:int array) filename =
  let inchannel = open_in_bin filename in
    try
      let result = Array.map (readvalue inchannel) indices in
        close_in inchannel;
        result
    with e ->
      close_in_noerr inchannel;
      raise e

R:

read.values <- function(filename, indices) {
  conn <- file(filename, "rb")
  read.value <- function(index) {
    seek(conn, where=index*4)
    readBin(conn, integer(), size = 4, n = 1, endian = "little")
  }
  r <- sapply(indices,read.value)
  close(conn)
  r[r==-2147483648] <- NA
  r
}

Any suggestions for improving the implementation in one of the above programming languages ? Which language would you like to compare my results with ? Which other language(s) do you expect to be faster than my benchmark results ? Can you help me out with a version in your favorite language or in C, fortran, Common Lisp, Scheme, Clojure, Java or J ?

Storing and fetching raster values in F#

In my quest for a fast system to fetch values from rasters at random cell locations I've been experimenting with a few different implementations in F#. More specifically I have now 4 methods for reading raster files. As usual all source code is available online: AsciiToBin.fsx and all comments on it are very welcome.

The first and simplest method I came up with was converting the ASCII raster files with integers to binary files where every integer is converted to its binary equivalent and written to disk. This saves about 50% of disk space compared to the very large ASCII files but also provides a fast and easy to program way of accessing the values. Note that there are cells with no data so these are stored as Int32.MinValue. As you can see the code is rather short.

module SimpleReadWrite = 

    let writeValue (writer:BinaryWriter) (value:int option) =
        match value with
        | Some(v) -> writer.Write(v)
        | None -> writer.Write(Int32.MinValue)

    let writeValues fileName (values:seq<int option>) =
        use writer = new BinaryWriter(File.Open(fileName, FileMode.OpenOrCreate))
        values
        |> Seq.iter (writeValue writer)
            
    let readValue (reader:BinaryReader) cellIndex = 
        // set stream to correct location
        reader.BaseStream.Position <- cellIndex*4L
        match reader.ReadInt32() with
        | Int32.MinValue -> None
        | v -> Some(v)
        
    let readValues fileName indices = 
        use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
        let values = List.map (readValue reader) (List.ofSeq indices)
        values

The second version I created uses a memory mapped file for reading the values from the same format as before. This is slightly faster (about 2 times) when we want to query lots of values from the same raster. But also 2 times slower when you query for example 10000 times 10 values from different rasters.

module MemoryMappedSimpleRead =

    open System.IO.MemoryMappedFiles

    let readValue (reader:MemoryMappedViewAccessor) offset cellIndex =
        let position = (cellIndex*4L) - offset
        match reader.ReadInt32(position) with
        | Int32.MinValue -> None
        | v -> Some(v)
        
    let readValues fileName indices =
        use mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open)
        let offset = (Seq.min indices ) * 4L
        let last = (Seq.max indices) * 4L
        let length = 4L+last-offset
        use reader = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read)
        let values = (List.ofSeq indices) |> List.map (readValue reader offset)
        values

The third version is similar to the simple reader but it fetches multiple bytes at once when two or more indexes are within a certain range. The performance is a bit worse then the simple reader so I'm not going into any further details. But if you want you can check the solution on github and any suggestions on easier ways of grouping the indexes by inter-distance are welcome.

The last version I created is more space efficient in my case. As I work with world oceanic date, about two thirds of my grids don't have any data (land). To avoid storing this data I separately store a file indicating which cells don't have data and skip those cells without data when writing the binary file. The disadvantage is that this makes everything a lot more complex because you have to map your cell indexes to the location of your binary value in your file in a space efficient way. To be able to store the bitmap I created some BitConverter extension methods to convert a boolean array to a byte array and back which I have also posted on fssnip. Then end result has a performance comparable to the one from the simple reader so if disk space is no problem then this solution isn't worth the additional complexity.

module BitConverter = 
    let pow2 y = 1 <<< y
    // convert booleans to bytes in a space efficient way
    let FromBooleans (bools:bool []) =
        seq {
            let b = ref 0uy
            for i=0 to bools.Length-1 do
                let rem = (i  % 8)
                if rem = 0 && i<> 0 then 
                    yield !b
                    b := 0uy
                if bools.[i] then
                    b := !b + (byte (pow2 rem))
            yield !b
        } |> Array.ofSeq
    // to booleans only works for bytes created with FromBooleans
    let ToBooleans (bytes:byte []) = 
        bytes
        |> Array.map (fun b -> Array.init 8 (fun i -> ((pow2 i) &&& int b) > 0))
        |> Array.concat

After lots of tweaking I managed to get a similar performance as the SimpleReadWrite but with 30% less diskspace needed and a more complex codebase.

Some performance related things I learned on the way are:
  • The Get method from System.Collections.BitArray is slow
  • You might want to convert lots of Seq chaining to one for loop
  • Some use of mutable values (within a function) might be necessary
  • Precompute as much as you can
And as I've tweeted before, I really like to evaluate my code in the REPL (same for my Python and R work).

Any thought on how to make this faster ? Do you know a fast library/system that achieve the same results ? Should I compress my rasters more to decrease disk access ? Any other suggestions ?

Other posts you might like:


Azimuthal equidistant projection: An implementation in F#, Python and Julia

Recently I needed to calculate distance from one point to another set of points in order to find the nearest point and its distance to the origin point. I opted to project all points to the azimuthal equidistant projection. The most important property of this projection is that all distances from the center of the projection to all other points represents the true distance (assuming a spheric earth) to all other projected points. To calculate the distance you just have to calculate the euclidean distance and multiply it with the average earth radius (6371000.0 meters).

The formulas for the azimuthal equidistant projection can be found at mathworld.wolfram.com. This formula can then be directly translated into F# like this:

open System
module AzimuthalEquidistantProjection =

    let inline degToRad d = 0.0174532925199433 * d; // (1.0/180.0 * Math.PI) * d

    let project centerlon centerlat lon lat =
        // http://mathworld.wolfram.com/AzimuthalEquidistantProjection.html
        // http://www.radicalcartography.net/?projectionref
        let t:float = degToRad lat
        let l:float = degToRad lon
        let t1 = degToRad centerlat // latitude center of projection
        let l0 = degToRad centerlon // longitude center of projection
        let c = acos ((sin t1) * (sin t) + (cos t1) * (cos t) * (cos (l-l0)))
        let k = c / (sin c)
        let x = k * (cos t) * (sin (l-l0))
        let y = k * ((cos t1) * (sin t) - (sin t1) * (cos t) * (cos (l-l0)))
        (x, y)

I also implemented the azimuthal equidistant projection by using units of measure. I create one set of measures to distinguish between degrees and radians and another one to distinguish between x and y.


open System
module AzimuthalEquidistantProjectionWithMeasures = 
    [<Measure>] type deg // degrees
    [<Measure>] type rad // radians
    [<Measure>] type x
    [<Measure>] type y
    type lon = float<x deg>
    type lat = float<y deg>
    
    let inline degToRad (d:float<'u deg>) = d*0.0174532925199433<rad/deg>
    let cos (d:float<'u rad>) = Math.Cos(float d)
    let sin (d:float<'u rad>) = Math.Sin(float d)
    
    let project (centerlon:lon) (centerlat:lat) (lon:lon) (lat:lat) =
        // http://mathworld.wolfram.com/AzimuthalEquidistantProjection.html
        // http://www.radicalcartography.net/?projectionref
        let t = degToRad lat
        let l = degToRad lon
        let t1 = degToRad centerlat // latitude center of projection
        let l0 = degToRad centerlon // longitude center of projection
        let c = acos ((sin t1) * (sin t) + (cos t1) * (cos t) * (cos (l-l0)))
        let k = c / (sin c)
        let x:float<x> = k * (cos t) * (sin (l-l0)) 
                         |> LanguagePrimitives.FloatWithMeasure
        let y:float<y> = k * ((cos t1) * (sin t) - (sin t1) * (cos t) * (cos (l-l0))) 
                         |> LanguagePrimitives.FloatWithMeasure
        (x, y)

In the past I've used the python OGR bindings and the ESRI Projection Engine for my map projection needs but this time I needed a pure python implementation so I translated the above code and optimized it a bit by precalculating some values that we'll need when we project the points during the initialization of my projection class. A similarly optimized version in F# is on fssnip.net.

from math import cos, sin, acos, radians

class Point(object):
    def __init__(self,x,y):
        self.x = x
        self.y = y       

class AzimuthalEquidistantProjection(object):
    """ 
        http://mathworld.wolfram.com/AzimuthalEquidistantProjection.html
        http://www.radicalcartography.net/?projectionref
    """
    def __init__(self, center):
        self.center = center
        self.t1 = radians(center.y) ## latitude center of projection
        self.l0 = radians(center.x) ## longitude center of projection
        self.cost1 = cos(self.t1)
        self.sint1 = sin(self.t1)
        
    def project(self, point):
        t = radians(point.y)
        l = radians(point.x)
        costcosll0 = cos(t) * cos(l-self.l0)
        sint = sin(t)
        
        c = acos ((self.sint1) * (sint) + (self.cost1) * costcosll0)
        k = c / sin(c)
        x = k * cos(t) * sin(l-self.l0)
        y = k * (self.cost1 * sint - self.sint1 * costcosll0)
        return Point(x, y)

import unittest
class Test_AzimuthalEquidistantProjection(unittest.TestCase):
    def test_project(self):
        p = AzimuthalEquidistantProjection(Point(0.0,0.0))
        r = p.project(Point(1.0,1.0))
        self.assertAlmostEqual(0.01745152022, r.x)
        self.assertAlmostEqual(0.01745417858, r.y)

        p = AzimuthalEquidistantProjection(Point(1.0,2.0))
        r = p.project(Point(3.0,4.0))
        self.assertAlmostEqual(0.03482860733, r.x)
        self.assertAlmostEqual(0.03494898734, r.y)

        p = AzimuthalEquidistantProjection(Point(-10.0001, 80.0001))
        r = p.project(Point(7.935, 63.302))
        self.assertAlmostEqual(0.1405127567, r.x)
        self.assertAlmostEqual(-0.263406547, r.y)

if __name__ == '__main__':
    unittest.main()

I've also implemented the straightforward projection code in Julia. Julia is a promising language built for technical computing. It's still only on version 0.2 but a lot of packages have already been created for the language.


function degToRad(d)
    d*0.0174532925199433
end

function project(centerlon, centerlat, lon, lat)
  t = degToRad(lat)
  l = degToRad(lon)
  t1 = degToRad(centerlat)
  l0 = degToRad(centerlon)
  c = acos(sin(t1) * sin(t) + cos(t1) * cos(t) * cos(l-l0))
  k = c / sin(c)
  x = k * cos(t) * sin(l-l0)
  y = k * (cos(t1) * sin(t) - sin(t1) * cos(t) * cos(l-l0))
  x, y
end

More Python, F#, Julia and geospatial code will be posted here so make sure to subscribe to my e-mail updates on the left. I also occasionally tweet as @gissolved.