Storing and fetching raster values in F#

In my quest for a fast system to fetch values from rasters at random cell locations I've been experimenting with a few different implementations in F#. More specifically I have now 4 methods for reading raster files. As usual all source code is available online: AsciiToBin.fsx and all comments on it are very welcome.

The first and simplest method I came up with was converting the ASCII raster files with integers to binary files where every integer is converted to its binary equivalent and written to disk. This saves about 50% of disk space compared to the very large ASCII files but also provides a fast and easy to program way of accessing the values. Note that there are cells with no data so these are stored as Int32.MinValue. As you can see the code is rather short.

module SimpleReadWrite = 

    let writeValue (writer:BinaryWriter) (value:int option) =
        match value with
        | Some(v) -> writer.Write(v)
        | None -> writer.Write(Int32.MinValue)

    let writeValues fileName (values:seq<int option>) =
        use writer = new BinaryWriter(File.Open(fileName, FileMode.OpenOrCreate))
        |> Seq.iter (writeValue writer)
    let readValue (reader:BinaryReader) cellIndex = 
        // set stream to correct location
        reader.BaseStream.Position <- cellIndex*4L
        match reader.ReadInt32() with
        | Int32.MinValue -> None
        | v -> Some(v)
    let readValues fileName indices = 
        use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
        let values = (readValue reader) (List.ofSeq indices)

The second version I created uses a memory mapped file for reading the values from the same format as before. This is slightly faster (about 2 times) when we want to query lots of values from the same raster. But also 2 times slower when you query for example 10000 times 10 values from different rasters.

module MemoryMappedSimpleRead =

    open System.IO.MemoryMappedFiles

    let readValue (reader:MemoryMappedViewAccessor) offset cellIndex =
        let position = (cellIndex*4L) - offset
        match reader.ReadInt32(position) with
        | Int32.MinValue -> None
        | v -> Some(v)
    let readValues fileName indices =
        use mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open)
        let offset = (Seq.min indices ) * 4L
        let last = (Seq.max indices) * 4L
        let length = 4L+last-offset
        use reader = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read)
        let values = (List.ofSeq indices) |> (readValue reader offset)

The third version is similar to the simple reader but it fetches multiple bytes at once when two or more indexes are within a certain range. The performance is a bit worse then the simple reader so I'm not going into any further details. But if you want you can check the solution on github and any suggestions on easier ways of grouping the indexes by inter-distance are welcome.

The last version I created is more space efficient in my case. As I work with world oceanic date, about two thirds of my grids don't have any data (land). To avoid storing this data I separately store a file indicating which cells don't have data and skip those cells without data when writing the binary file. The disadvantage is that this makes everything a lot more complex because you have to map your cell indexes to the location of your binary value in your file in a space efficient way. To be able to store the bitmap I created some BitConverter extension methods to convert a boolean array to a byte array and back which I have also posted on fssnip. Then end result has a performance comparable to the one from the simple reader so if disk space is no problem then this solution isn't worth the additional complexity.

module BitConverter = 
    let pow2 y = 1 <<< y
    // convert booleans to bytes in a space efficient way
    let FromBooleans (bools:bool []) =
        seq {
            let b = ref 0uy
            for i=0 to bools.Length-1 do
                let rem = (i  % 8)
                if rem = 0 && i<> 0 then 
                    yield !b
                    b := 0uy
                if bools.[i] then
                    b := !b + (byte (pow2 rem))
            yield !b
        } |> Array.ofSeq
    // to booleans only works for bytes created with FromBooleans
    let ToBooleans (bytes:byte []) = 
        |> (fun b -> Array.init 8 (fun i -> ((pow2 i) &&& int b) > 0))
        |> Array.concat

After lots of tweaking I managed to get a similar performance as the SimpleReadWrite but with 30% less diskspace needed and a more complex codebase.

Some performance related things I learned on the way are:
  • The Get method from System.Collections.BitArray is slow
  • You might want to convert lots of Seq chaining to one for loop
  • Some use of mutable values (within a function) might be necessary
  • Precompute as much as you can
And as I've tweeted before, I really like to evaluate my code in the REPL (same for my Python and R work).

Any thought on how to make this faster ? Do you know a fast library/system that achieve the same results ? Should I compress my rasters more to decrease disk access ? Any other suggestions ?

Other posts you might like:

Think Python

Several years ago, when I was learning about Python for the first time, I read the online book: "How to think like a computer scientist". Today I just finished Think Python: How To Think Like a Computer Scientist by Allen B. Downey (free e-version) which is an evolved version of this book.

You should read this book if you want to learn programming or if you want to learn python. But if you already have advanced programming skills then I would suggest to just skim the free online version and spend your money on a more advanced python book.

Think Python starts with the most basic things like operators, variables and assignment then functions, conditionals, recursion and iteration are introduced, followed up with strings, lists , dictionaries and tuples. The next chapter is a practical chapter on files and the book ends with 4 chapters introducing object oriented programming. Between the different chapters there are 4 case studies where the learned concepts and techniques are applied.

What you won't learn in this book: 

  • what the different standard libraries are
  • web development in python
  • popular (scientific) libraries like numpy
  • writing (unit) tests for your programs
  • Python 3, except some small remarks
What I particularly  liked:
  • the information about debugging your programs at the end of every chapter and the appendix 
  • the step by step introduction to object-oriented programming
  • the case studies
  • the exercises
  • the glossary at the end of every chapter
As you might have inferred from the above, this book is not a reference book (we have the web for that) but a book that learns you to think like a programmer.

An alternative way to learn Python is Learn Python the Hard Way (html) (pdf + epub + video) by Zed Shaw.

Other books by Allen B. Downey:

If you want to improve your Python skills then take a look at this list of advanced books. I especially liked Expert Python Programming by Tarek Ziadé.