
Speeding up mean and standard deviation calculation for a stack of rasters in R

While modelling the distribution of invasive seaweeds during my PhD, I wanted to calculate the mean and standard deviation of a set of SDMs that appeared to be equivalent for the current climate but gave different results for the future climate. However, the standard naive approach for calculating the mean and standard deviation of a RasterStack was rather slow. I got a 20-fold speed-up by computing both statistics in a single streaming pass over the layers. For the mean this is rather obvious: sum all rasters and divide by the number of rasters. For the standard deviation the formula is only slightly more complicated and can be found both on Wikipedia and in code on StackOverflow. If your rasters contain very large numbers, you might need to adapt this code to use Welford's method.
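For reference, the streaming approach only keeps the number of layers $n$, the running sum $s_1 = \sum_i x_i$ and the running sum of squares $s_2 = \sum_i x_i^2$; both statistics then follow directly (this is exactly what the fast version below computes):

$$\text{mean} = \frac{s_1}{n}, \qquad \text{sd} = \sqrt{\frac{n\,s_2 - s_1^2}{n\,(n-1)}}$$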

# Option 1: shorter but slower
rasterstack_meansd_slow <- function(x) {
  mean <- raster::mean(x)
  sd <- raster::calc(x, sd)
  list(mean=mean, sd=sd)
}

# Option 2: faster but more code
rasterstack_meansd_fast <- function(x) {
  s0 <- nlayers(x)          # n: the number of layers
  s1 <- raster(x, layer=1)  # running sum of the values
  s2 <- s1^2                # running sum of the squared values
  for(ri in 2:s0) {
    r <- raster(x, layer=ri)
    s1 <- s1 + r
    s2 <- s2 + r^2
  }
  list(mean=s1/s0, sd=sqrt((s0 * s2 - s1 * s1)/(s0 * (s0 - 1))))
}
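If you do need the numerically stable route for very large values, a Welford-style variant might look like the sketch below (my addition, untested; rasterstack_meansd_welford is a hypothetical name, the raster arithmetic operates layer by layer as above):

# Welford-style running mean/variance over the layers of a RasterStack (sketch)
rasterstack_meansd_welford <- function(x) {
  n <- 0
  mean <- raster(x, layer=1) * 0  # running mean, initialised to zero (NA stays NA)
  m2 <- mean                      # running sum of squared deviations from the mean
  for(ri in 1:nlayers(x)) {
    r <- raster(x, layer=ri)
    n <- n + 1
    delta <- r - mean
    mean <- mean + delta / n
    m2 <- m2 + delta * (r - mean)
  }
  list(mean=mean, sd=sqrt(m2 / (n - 1)))
}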

As a small example, I calculated the mean and standard deviation of the sea surface temperature for different climate scenarios for 2050. The fast version takes only 10 seconds on my machine, while the slow version takes 225 seconds to calculate the mean and standard deviation.


library(sdmpredictors)

sstfuture <- load_layers(c('BO2_RCP26_2050_tempmean_ss', 'BO2_RCP45_2050_tempmean_ss', 
                           'BO2_RCP60_2050_tempmean_ss', 'BO2_RCP85_2050_tempmean_ss'))

system.time({ fast <- rasterstack_meansd_fast(sstfuture) }) # 10 seconds
system.time({ slow <- rasterstack_meansd_slow(sstfuture) }) # 225 seconds
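As a quick sanity check (my addition, not in the original post), both versions should return the same values up to floating-point error:

all.equal(raster::values(fast$mean), raster::values(slow$mean))
all.equal(raster::values(fast$sd), raster::values(slow$sd))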

Creating a single pixel wide raster along the coastline or another edge



At one point during my PhD I needed a single-pixel-wide strip of raster values along the coastline. To get this I used the following code in R:

library(sdmpredictors)
library(raster)

coast_mask <- function(layer) {
  # inner boundaries: coastal cells get value 1, other cells 0 or NA
  edges <- raster::boundaries(raster(layer, layer=1), type="inner")
  values <- getValues(edges)
  # TRUE for every cell that should be masked out (NA or not on the boundary)
  is.na(values) | values == 0
}

l <- load_layers("BO_sstmean", equalarea = FALSE)

mask <- coast_mask(l)
l[mask] <- NA

plot(l, col=rev(heat.colors(255)))
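Once the mask is applied, the remaining one-pixel coastline strip can be summarised like any other raster. For example (my addition, assuming the raster package is attached):

cellStats(l, stat = "mean", na.rm = TRUE)   # mean sea surface temperature along the coastline
cellStats(l, stat = "range", na.rm = TRUE)  # min and max along the coastline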

Source code available on GitHub at https://github.com/samuelbosch/blogbits/blob/master/misc/coastalraster.R

Main R Packages used: sdmpredictors, raster

Creating a high resolution png from a raster in R

In this post I'll show you two different ways to create a png from an equal-area bathymetry grid in R.

Version 1: using the raster::plot function

The first option for plotting a raster is to use the raster::plot function and write the result to a sufficiently large png file. This works reasonably well; the disadvantage is that both the title and the legend are really small. The title is optional, but I didn't find a way to disable the legend. The main trick to get the raster::plot function to output a high resolution png is setting the maxpixels parameter. Colors were selected on ColorBrewer, with one small addition: the color range was reversed and the darkest color repeated to ensure that rare depths (-6000 to -10000) get the same color.

if(!requireNamespace("sdmpredictors")) {
  install.packages("sdmpredictors")
}
library(raster)
x <- sdmpredictors::load_layers("BO_bathymean", equalarea = TRUE)
png("bathymetry_plot1.png", width=ncol(x), height=nrow(x))
col <- rev(c("#f7fbff", "#deebf7", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6","#2171b5","#08519c", rep("#08306b",7)))
plot(x, maxpixels = ncell(x), col = col, colNA = "#818181", 
     main = "Bathymetry", axes = FALSE, ylim=extent(x)[3:4])
dev.off()

Version 2: write to png using leaflet colors

While this version is a bit more complicated, it produces really good looking results. The main gist of this code is that it transforms the values to a range from 0 to 1000 while making sure that extreme values at both ends get the same colors as more common values. This is similar to what you would generally do when creating a color scale in e.g. QGIS. Once values are mapped to colors, they are converted to raw bytes with the right dimensions and written to a png file. Note that this code was inspired by some of the internal functions of the leaflet package.


if(!requireNamespace("leaflet")) {
  install.packages("leaflet")
}
if(!requireNamespace("sdmpredictors")) {
  install.packages("sdmpredictors")
}
library(sdmpredictors)
library(raster)

# create colors
colors <- leaflet::colorNumeric(rev(c("#f7fbff", "#deebf7", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6","#2171b5","#08519c", "#08306b")),
                                -1:1001, na.color =  "#818181")
cols <- c(colors(-1:1001), colors(NA))
x <- sdmpredictors::load_layers("BO_bathymean", equalarea = TRUE)
# scale values and remove extreme values from the color range 
vals <- values(x)
vals <- scale(vals)
minmax <- quantile(vals, probs=c(0.01, 0.99), na.rm = TRUE)
vals <- round((((vals - minmax[1]) / (minmax[2] - minmax[1])) * 1000))
vals[vals < 0] <- 0
vals[vals > 1000] <- 1000
vals[is.na(vals)] <- 1002

# lookup colors for scaled values, convert to raw and write to file
valcolors <- cols[vals+2] # +2 because -1 and 0 are in cols (value 0 is at index 2 in cols)

rgb_data <- col2rgb(valcolors, alpha = TRUE)
raw_data <- as.raw(rgb_data)
dim(raw_data) <- c(4, ncol(x), nrow(x))

png::writePNG(raw_data, "bathymetry_plot2.png")


All source code is available on GitHub at https://github.com/samuelbosch/blogbits/blob/master/misc/raster2png.R

Main R Packages used: sdmpredictors, raster, leaflet, png

Compressing TIFF files in R and Windows

Compressing TIFF files is an easy way to save disk space when working with raster files or image outputs for research papers. But compression is rarely the default option, so disk usage quickly inflates. For instance, the figure below, a 7000 by 5000 pixel image, uses 136MB as an uncompressed TIFF.



After lossless compression with the libtiff command tiffcp it is only 8MB. To install libtiff on Windows, download and run the latest setup from http://gnuwin32.sourceforge.net/packages/tiff.htm. To compress a TIFF file you need to run the following command (assuming that tiffcp is in your path). Make sure to specify a different output file name, otherwise the original will be overwritten with a broken and incomplete copy of itself.
tiffcp -c lzw uncompressed_input.tif compressed_output.tif

In R you can also write compressed (Geo)TIFF raster files by passing extra options to the writeRaster function from the raster package.

library(raster)

x <- raster("uncompressed_input.tif")
tifoptions <- c("COMPRESS=DEFLATE", "PREDICTOR=2", "ZLEVEL=6")
writeRaster(x, "compressed_output.tif",
            options = tifoptions, overwrite = FALSE)
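To check the effect of the chosen options, you can compare the file sizes directly from R (a quick sanity check, my addition, assuming both files exist on disk):

# compression ratio achieved by the DEFLATE options above
file.size("uncompressed_input.tif") / file.size("compressed_output.tif")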

Storing and fetching raster values in F#

In my quest for a fast system to fetch values from rasters at random cell locations, I've been experimenting with a few different implementations in F#. More specifically, I now have four methods for reading raster files. As usual, all source code is available online (AsciiToBin.fsx) and all comments on it are very welcome.

The first and simplest method I came up with was converting the ASCII raster files with integers to binary files, where every integer is converted to its binary equivalent and written to disk. This saves about 50% of disk space compared to the very large ASCII files and provides a fast, easy-to-program way of accessing the values. Note that cells with no data are stored as Int32.MinValue. As you can see, the code is rather short.

open System
open System.IO

module SimpleReadWrite = 

    let writeValue (writer:BinaryWriter) (value:int option) =
        match value with
        | Some(v) -> writer.Write(v)
        | None -> writer.Write(Int32.MinValue)

    let writeValues fileName (values:seq<int option>) =
        use writer = new BinaryWriter(File.Open(fileName, FileMode.OpenOrCreate))
        values
        |> Seq.iter (writeValue writer)
            
    let readValue (reader:BinaryReader) cellIndex = 
        // set stream to correct location
        reader.BaseStream.Position <- cellIndex*4L
        match reader.ReadInt32() with
        | Int32.MinValue -> None
        | v -> Some(v)
        
    let readValues fileName indices = 
        use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
        let values = List.map (readValue reader) (List.ofSeq indices)
        values

The second version I created uses a memory-mapped file for reading the values from the same format as before. This is slightly faster (about 2 times) when querying lots of values from the same raster, but also 2 times slower when you query, for example, 10000 times 10 values from different rasters.

module MemoryMappedSimpleRead =

    open System.IO.MemoryMappedFiles

    let readValue (reader:MemoryMappedViewAccessor) offset cellIndex =
        let position = (cellIndex*4L) - offset
        match reader.ReadInt32(position) with
        | Int32.MinValue -> None
        | v -> Some(v)
        
    let readValues fileName indices =
        use mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open)
        let offset = (Seq.min indices) * 4L
        let last = (Seq.max indices) * 4L
        let length = 4L+last-offset
        use reader = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read)
        let values = (List.ofSeq indices) |> List.map (readValue reader offset)
        values

The third version is similar to the simple reader, but it fetches multiple bytes at once when two or more indexes are within a certain range. The performance is a bit worse than the simple reader, so I'm not going into any further details. But if you want you can check the solution on GitHub; any suggestions on easier ways of grouping the indexes by inter-distance are welcome.

The last version I created is more space efficient in my case. As I work with world oceanic data, about two thirds of my grids don't have any data (land). To avoid storing this data, I separately store a bitmap indicating which cells have no data and skip those cells when writing the binary file. The disadvantage is that this makes everything a lot more complex, because you have to map your cell indexes to the locations of the binary values in your file in a space-efficient way. To be able to store the bitmap I created some BitConverter extension methods to convert a boolean array to a byte array and back, which I have also posted on fssnip. The end result has a performance comparable to the simple reader, so if disk space is no problem then this solution isn't worth the additional complexity.

module BitConverter = 
    let pow2 y = 1 <<< y
    // convert booleans to bytes in a space efficient way
    let FromBooleans (bools:bool []) =
        seq {
            let b = ref 0uy
            for i=0 to bools.Length-1 do
                let rem = (i  % 8)
                if rem = 0 && i<> 0 then 
                    yield !b
                    b := 0uy
                if bools.[i] then
                    b := !b + (byte (pow2 rem))
            yield !b
        } |> Array.ofSeq
    // to booleans only works for bytes created with FromBooleans
    let ToBooleans (bytes:byte []) = 
        bytes
        |> Array.map (fun b -> Array.init 8 (fun i -> ((pow2 i) &&& int b) > 0))
        |> Array.concat

After lots of tweaking I managed to get a performance similar to the SimpleReadWrite version, but with 30% less disk space needed and a more complex codebase.

Some performance related things I learned on the way are:
  • The Get method from System.Collections.BitArray is slow
  • You might want to convert lots of Seq chaining to one for loop
  • Some use of mutable values (within a function) might be necessary
  • Precompute as much as you can
And as I've tweeted before, I really like to evaluate my code in the REPL (same for my Python and R work).

Any thoughts on how to make this faster? Do you know a fast library/system that achieves the same results? Should I compress my rasters more to decrease disk access? Any other suggestions?


Creating a kernel density estimate map in R

In this post I'm going to create a kernel density estimate map in R from a file with latitude/longitude coordinates. WARNING: depending on your application, the following gives incorrect results because a non-spherical kernel density estimator is used with spherical data (big thanks to Brian Rowlingson for pointing that out). I didn't find any spherical kernel density estimation implementation in R, but if you know one, please let me know. While exploring this issue I also wrote an implementation similar to the code below that generates a kernel density estimate with the spatial statistics package splancs instead of KernSmooth (splancs_kernel_density.R).

To create this map we'll use the bkde2D function from the KernSmooth package to generate the kernel density estimates and the raster package for outputting a raster file. If you don't have the KernSmooth and raster packages, they can as usual be installed with:

install.packages("KernSmooth")
install.packages("raster")

To create the 2D binned kernel density map, we first load the KernSmooth and raster packages.

library("KernSmooth")
library("raster")

Then we read the input file with the coordinates of the points that we want to generate the kernel density estimate of. This input file is an export from the Ocean Biogeographic Information System (OBIS) and represents distribution records of the seaweed Alaria esculenta.


input <- "Alaria_esculenta.csv"
output <- "Alaria_esculenta_density.asc"

# get the coordinates
records <- read.csv(input)
coordinates <- records[,2:3]

Next we compute the 2D binned kernel density estimate. To limit the size of the generated output files, I set all values smaller than 0.00001 to 0.

# compute the 2D binned kernel density estimate
est <- bkde2D(coordinates, 
              bandwidth=c(3,3), 
              gridsize=c(4320,2160),
              range.x=list(c(-180,180),c(-90,90)))
est$fhat[est$fhat<0.00001] <- 0 ## ignore very small values

Finally we create a raster file from the output of bkde2D, inspect it visually and export it as an ascii file.

# create raster
est.raster <- raster(list(x=est$x1, y=est$x2, z=est$fhat))
projection(est.raster) <- CRS("+init=epsg:4326")
xmin(est.raster) <- -180
xmax(est.raster) <- 180
ymin(est.raster) <- -90
ymax(est.raster) <- 90
# visually inspect the raster output
plot(est.raster)
# write an ascii file
writeRaster(est.raster,output,"ascii")

And the output looks like this. As you can see, most records are located near 50°N, 0°E, which corresponds to the distribution records map generated by OBIS.
The output ASCII raster file can be used as an input file for OccurrenceThinner, which uses the kernel density estimate grid to filter out records from densely sampled regions, helping to avoid overfitting when building species distribution models (SDMs).

Code and data for this post are in my blog github repository.