2018

Great circle calculations with numpy

In this very short post I want to point you to some code for calculating the centroid and distance to that centroid for a set of points in numpy. As bonus I also included some profiling. Note that I already blogged about the centroid function in a previous post.

from math import atan2, sqrt, degrees
import numpy as np
from math import radians, sin, cos

RADIUS = 6371.009


def get_centroid(points):
    xy = np.asarray(points)
    xy = np.radians(xy)
    lon, lat = xy[:, 0], xy[:, 1]
    avg_x = np.sum(np.cos(lat) * np.cos(lon)) / xy.shape[0]
    avg_y = np.sum(np.cos(lat) * np.sin(lon)) / xy.shape[0]
    avg_z = np.sum(np.sin(lat)) / xy.shape[0]
    center_lon = atan2(avg_y, avg_x)
    hyp = sqrt(avg_x * avg_x + avg_y * avg_y)
    center_lat = atan2(avg_z, hyp)
    return degrees(center_lon), degrees(center_lat)


def gc_distance_points(a, points):
    b = np.asarray(points)
    lat1, lng1 = radians(a[1]), radians(a[0])
    lat2, lng2 = np.radians(b[:, 1]), np.radians(b[:, 0])

    sin_lat1, cos_lat1 = sin(lat1), cos(lat1)
    sin_lat2, cos_lat2 = np.sin(lat2), np.cos(lat2)

    delta_lng = np.subtract(lng2, lng1)
    cos_delta_lng, sin_delta_lng = np.cos(delta_lng), np.sin(delta_lng)

    d = np.arctan2(np.sqrt((cos_lat2 * sin_delta_lng) ** 2 +
                 (cos_lat1 * sin_lat2 -
                  sin_lat1 * cos_lat2 * cos_delta_lng) ** 2),
                 sin_lat1 * sin_lat2 + cos_lat1 * cos_lat2 * cos_delta_lng)

    return RADIUS * d


def gc_dist(a, b):
        from math import radians, sin, cos, sqrt, atan2
        lat1, lng1 = radians(a[1]), radians(a[0])
        lat2, lng2 = radians(b[1]), radians(b[0])

        sin_lat1, cos_lat1 = sin(lat1), cos(lat1)
        sin_lat2, cos_lat2 = sin(lat2), cos(lat2)

        delta_lng = lng2 - lng1
        cos_delta_lng, sin_delta_lng = cos(delta_lng), sin(delta_lng)

        d = atan2(sqrt((cos_lat2 * sin_delta_lng) ** 2 +
                       (cos_lat1 * sin_lat2 -
                        sin_lat1 * cos_lat2 * cos_delta_lng) ** 2),
                  sin_lat1 * sin_lat2 + cos_lat1 * cos_lat2 * cos_delta_lng)

        return RADIUS * d


if __name__ == '__main__':

    import random
    random.seed(42)
    pts = [(random.uniform(-180, 180), random.uniform(-90, 90)) for _ in range(1000000)]
    
    centr = get_centroid(pts)

    import cProfile

    cProfile.runctx('gc_distance_points(centr, pts)', globals(), locals())
    cProfile.runctx('dists = [gc_dist(centr, b) for b in pts]', globals(), locals())

All code is also available at Github. Feel free to post a comment or email in case you have any questions.

Adding support for sftp to RCurl

While trying to download data from an FTP with the sftp protocol from R I encountered the folllowing error on my mac:

Error in function (type, msg, asError = TRUE)  : 
  Protocol "sftp" not supported or disabled in libcurl

The easiest way to fix this for me was performing the following steps:

brew install curl --with-libssh2

Then check whether sftp is now among the list of supported protocols:

/usr/local/opt/curl/bin/curl -V

Link to the new version of curl you installed:

brew link --force curl

Re-install RCurl from source:

install.packages("RCurl", type = "source")

And finally check if sftp is now supported:

RCurl::curlVersion()

A new fizzbuzz: generating sums for kids

With my oldest quizzing me for one sum after another for him to solve. I thought about writing a little piece of R code that would generate these for me. The code for doing so was less obvious then I imagined it to be.

Goal:

Generate all possible sums and subtractions for kids of the form

10 - 4 = ?
4 + 3 = ?

Constraints:

- Only numbers between 0-10
- Result should be >= 0

Solution:

generate_sums <- function(maxresult=10) {
  sums <- c()
  for (i in 0:(maxresult-1)) {
    for (j in 0:(maxresult-i)) {
      sums <- c(sums, paste0(i, ' + ', j, ' = ?'))
    }
  }
  sample(sums[!duplicated(sums)])
}
generate_subtractions <- function(maxresult=10) {
  substractions <- c()
  for (i in 0:maxresult) {
    for (j in i:maxresult) {
      substractions <- c(substractions, paste0(j, ' - ', i, ' = ?'))
    }
  }
  sample(substractions[!duplicated(substractions)])
}

I still believe that a more elegant approach should be possible but at least it works.

Fetching all records for a taxonomic group in OBIS

For example, if you want to download the records for all Nudibranchia then in R it is as simple as doing:

    install.packages("robis")

    data <- robis::occurrence("Nudibranchia")

    # get species data
    spdata <- data[!is.na(data$taxonRank) & data$taxonRank == "species",]

    # plot on a map
    robis::leafletmap(spdata)

Alternatively, you can download all data with the new mapper that is being developed:

http://iobis.org/mapper2?scientificname=Nudibranchia&download=true

Error in linbin2D: (list) object cannot be coerced to type 'double'

When calling bkde2D you might encounter the following error:

Error in linbin2D(x, gpoints1, gpoints2) :
(list) object cannot be coerced to type 'double'

This issue can manifest itself when the passed in data is a tibble instead of a data.frame.

Easy fix:

data <- as.data.frame(data)

Speeding up mean and standard deviation calculation for a stack of rasters in R

While modelling the distribution of invasive seaweeds during my PhD I wanted to calculate the mean and standard deviation of a set of SDMs that appeared to be equivalent in for the current climate but who gave different results for the future climate. However, the standard naive approach for calculating the mean and standard deviation of a RasterStack was rather slow. I managed to get a 20 fold speed up by calculating the mean and standard deviation by aggregating them in a stream. This rather obvious for calculating the mean, summing all rasters and then dividing by the number of rasters. For the standard deviation, the formula is only slightly more complicated and can be found both on Wikipedia and in code on StackOverflow. If your raster contains very large numbers, you might need to adapt this code to use Welford's method.

# Option 1: shorter but slower

rasterstack_meansd_slow <- function(x) {
mean <- raster::mean(x)
  sd <- raster::calc(x, sd)

  list(mean=mean, sd=sd)

}

# Option 2: faster but more code
rasterstack_meansd_fast <- function(x) {
  s0 <- nlayers(x)
  s1 <- raster(x, layer=1)
  s2 <- s1^2
  for(ri in 2:s0) {
    r <- raster(x, layer=ri)
    s1 <- s1 + r
    s2 <- s2 + r^2
  }
  list(mean=s1/s0, sd=sqrt((s0 * s2 - s1 * s1)/(s0 * (s0 - 1))))
}

As a small example I calculate the mean and standard deviation of the sea surface temperature for different climate scenarios for 2050. The fast version only takes 10 seconds on my machine while the slow version takes 225 seconds to calculate the mean and standard deviation.

library(sdmpredictors)

sstfuture <- load_layers(c('BO2_RCP26_2050_tempmean_ss', 'BO2_RCP45_2050_tempmean_ss', 
                           'BO2_RCP60_2050_tempmean_ss', 'BO2_RCP85_2050_tempmean_ss'))

system.time({ fast <- rasterstack_meansd_fast(sstfuture) }) # 10 seconds
system.time({ slow <- rasterstack_meansd_slow(sstfuture) }) # 225 seconds

Selecting comma separated data as multiple rows with SQLite

A while back I needed to split data stored in one column as a comma separated string into multiple rows in a SQL query from a SQLite database.

My table looked like this:

CREATE TABLE "predictor_sets" 
  (`id` INTEGER, `nvar` INTEGER, `predictors` TEXT, 
    `experiment` TEXT, PRIMARY KEY(`id`));

Insert some sample data:

INSERT INTO predictor_sets VALUES
  (1659, 5, 'BO_bathymax,BO_calcite,BO_parmax,BO_damean,BO_sstrange', 'bathymetry'),
  (1660, 5, 'BO_bathymin,BO_calcite,BO_parmax,BO_damean,BO_sstrange', 'bathymetry'),
  (1661, 5, 'BO_bathymean,BO_calcite,BO_parmax,BO_damean,BO_sstrange', 'bathymetry');

Splitting up the different predictors in seperate rows can be done with the following recursive common table expression:

WITH RECURSIVE split(predictorset_id, predictor_name, rest) AS (
  SELECT id, '', predictors || ',' FROM predictor_sets WHERE id
   UNION ALL
  SELECT predictorset_id, 
         substr(rest, 0, instr(rest, ',')),
         substr(rest, instr(rest, ',')+1)
    FROM split
   WHERE rest <> '')
SELECT predictorset_id, predictor_name 
  FROM split 
 WHERE predictor_name <> ''
 ORDER BY predictorset_id, predictor_name;

Check out the documentation for more info on writing your own common table expressions in SQLite, PostgreSQL or your favorite database.

Creating a single pixel wide raster along the coastline or another edge

At one point during my PhD I needed a single pixel of raster values along the coastline. In order to get this I used the following code in R:

library(sdmpredictors)
library(raster)

coast_mask <- function(layer) {
  edges <- raster::boundaries(raster(layer, layer=1), type="inner")
  values <- getValues(edges)
  is.na(values) | values == 0
}

l <- load_layers("BO_sstmean", equalarea = FALSE)

mask <- coast_mask(l)
l[mask] <- NA

plot(l, col=rev(heat.colors(255)))

Source code available on GitHub at https://github.com/samuelbosch/blogbits/blob/master/misc/coastalraster.R

Main R Packages used:

Creating a high resolution png from a raster in R

In this post I'll show you two different ways to create a png from an equal-area bathymetry grid in R.

Version 1: using the raster::plot function

The first option for plotting a raster is to use the raster::plot function and write this to a sufficiently large png file. This works reasonably well but the only disadvantage is that both the title and legend are really small. The title is optional but I didn't find a way to disable the legend. The main trick to get the raster::plot function to output a high resolution png consists in setting the maxpixels parameter. Colors where selected on ColorBrewer but with a small addition that the color range was reverted and that the darkest color is repeated to ensure that rare depths (-6000 to -10000) get the same color.

if(!requireNamespace("sdmpredictors")) {
  install.packages("sdmpredictors")
}
library(raster)
x <- sdmpredictors::load_layers("BO_bathymean", equalarea = TRUE)
png("bathymetry_plot1.png", width=ncol(x), height=nrow(x))
col <- rev(c("#f7fbff", "#deebf7", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6","#2171b5","#08519c", rep("#08306b",7)))
plot(x, maxpixels = ncell(x), col = col, colNA = "#818181",

     main = "Bathymetry", axes = FALSE, ylim=extent(x)[3:4])
dev.off()

Version 2: write to png using leaflet colors

While this version looks and is a bit more complicated, it produces really good looking results. The main gist of this code is that it transforms the values to a range from 0 to 1000 while making sure that extreme values at both extremes will get the same colors as more common values. This is similar to what you general would do when e.g. creating a color scale with QGIS. Once values are mapped to colors they are converted to raw bytes with the right dimensions and written to a png file. Remark that this code was inspired by some of the internal functions in the leaflet package.

if(!requireNamespace("leaflet")) {
  install.packages("leaflet")
}
if(!requireNamespace("sdmpredictors")) {
  install.packages("sdmpredictors")
}
library(sdmpredictors)
library(raster)

# create colors
colors <- leaflet::colorNumeric(rev(c("#f7fbff", "#deebf7", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6","#2171b5","#08519c", "#08306b")),
                                -1:1001, na.color =  "#818181")
cols <- c(colors(-1:1001), colors(NA))
x <- sdmpredictors::load_layers("BO_bathymean", equalarea = TRUE)
# scale values and remove extreme values from the color range 
vals <- values(x)
vals <- scale(vals)
minmax <- quantile(vals, probs=c(0.01, 0.99), na.rm = TRUE)
vals <- round((((vals - minmax[1]) / (minmax[2] - minmax[1])) * 1000))
vals[vals < 0] <- 0
vals[vals > 1000] <- 1000
vals[is.na(vals)] <- 1002

# lookup colors for scaled values, convert to raw and write to file
valcolors <- cols[vals+2] # +2 because -1 and 0 are in cols (value 0 is at index 2 in cols)

rgb_data <- col2rgb(valcolors, alpha = TRUE)
raw_data <- as.raw(rgb_data)
dim(raw_data) <- c(4, ncol(x), nrow(x))

png::writePNG(raw_data, "bathymetry_plot2.png")

All source code is available on GitHub at https://github.com/samuelbosch/blogbits/blob/master/misc/raster2png.R

Main R Packages used:

Samuel Bosch

Pages