January 2014

What I learned: Part 2 PostgreSQL and PostGIS

In my previous post in the "What I learned" series on Linux commands. Today its time to share some things that I learned while working with PostgreSQL and PostGIS.

Problem: Where are the PostgreSQL logs in ubuntu?

On Ubuntu you can find them here:

cd /var/log/postgresql

With the tail command you can keep an eye on the logs while it is being written to. On my system the exact command for doing this is:

tail -f /var/log/postgresql/postgresql-9.1-main.log

Problem: Copying a subset of a table

I needed to develop some SQL scripts for some very big tables. To be able to quickly test my work I decided to copy the tables to a new schema but with a limited set of records.

CREATE TABLE  AS (SELECT * FROM  LIMIT );

The PostgreSQL documentation contains more information on this topic and some extra samples.

If you also want to copy the original column constraints then its better to create your table with the following two commands:

CREATE TABLE tbl_new (LIKE tbl_org INCLUDING CONSTRAINTS);

INSERT INTO tbl_new 
SELECT col_a, col_b, array[col] aS col_c
FROM tbl_org;

Note that primary and foreign keys and indexes aren't created so if you want them you'll have to create them for your new table.

Problem: Typing the same query over and over again in pgAdmin

Fortunatly you can add macros to pgAdmin with following steps:

Open a query window in pgAdmin
Click Macros -> Manage Macros
Click the key combination that you'd like and type a name and the actual query and hit save.
Repeat for all other macro queries that you'd like

Some sample macros:

-- inspect the first 100 records of a table.

-- Select the table and hit the keyboard shortcut you've picked.
SELECT * FROM $SELECTION$ LIMIT 100

-- how many records does a statement return. Also useful for timing queries and subqueries without having to load all the results in your Data Output window.
SELECT COUNT(*) FROM ($SELECTION$) counttable

Some extra information can be found in the official docs.

Problem: Getting the centroid of multiple points in PostGIS

SELECT ST_CENTROID(ST_COLLECT(geom)) FROM points

Note that ST_Collect is many times faster than ST_Union. The difference is that ST_Collect puts the geometries into one multi geometry and ST_Union combines geometries whenever they overlap.

Problem: Creating an array from a query in PostgreSQL

SELECT array(SELECT id FROM table)

SELECT array_agg(id) FROM table

Problem: Repair polygons that are not closed in PostGIS 1.5.4

Solution add the first point at the end of the exterior ring of the polygon when the first point is not the same as the last point. Note that polygons lose their inner rings if there were any. See also: http://lists.osgeo.org/pipermail/postgis-users/2011-December/031615.html

WITH fixed_geom As (
  SELECT gid, ST_Collect(ST_BuildArea(ST_AddPoint(ST_EXTERIORRING(geom), ST_StartPoint(ST_EXTERIORRING(geom)), -1 /* append the point */))) geom
    FROM (SELECT gid,(ST_DUMP(geom)).geom geom FROM qc.global_shorelines) l
   WHERE NOT ST_StartPoint(ST_EXTERIORRING(geom)) = ST_EndPoint(ST_EXTERIORRING(geom))
   GROUP BY gid
)
UPDATE qc.global_shorelines gs SET geom = f.geom
  FROM fixed_geom f
 WHERE gs.gid  = f.gid

Problem : Executing a SQL script within psql

If you want to execute a SQL script within psql then start psql and type:

\i "path .. to .. file"

Problem: Cancelling a command with psql

To cancel a command hit Ctrl+c and to quit Ctrl+d. An alternative for quitting is typing \q.

Problem: Writing output of psql to a file

If you want to redirect psql output to a file instead of to the prompt then type:

\o filename.here

Type \o again to restore to the standard output.

Problem: Get execution time from query in psql

\timing

Problem: Listing the open connections and active sql queries

SELECT datname,usename,procpid,client_addr,waiting,query_start,current_query FROM pg_stat_activity;

Problem: killing all open connections to a database

SELECT pg_terminate_backend(pg_stat_activity.procpid)
FROM pg_stat_activity
WHERE pg_stat_activity.datname = 'YOUR_DATABASE_NAME'
  AND procpid <> pg_backend_pid()

Problem: Copying data from one server to another with psql

\c dbname username source_db_server
COPY (SELECT * FROM your_table WHERE some_filter) TO '/data/table_copy';
\c dbname username destination_db_server
COPY '/data/table_copy' TO your_table

Problem: On the fly compression of the copy output

To solve this problem in PostgreSQL 9.3 an option was added to the COPY command to directly pipe the output through an external program.
This can be used to pipe the copy output to gzip in the following way:

COPY test TO PROGRAM 'gzip > /data/test.gz';

To achieve the same behavior on older systems you can copy to STDOUT with psql and then pipe this to gzip.

psql -c "COPY test TO STDOUT WITH CSV" | gzip > /data/test.gz

Problem: Tired of typing your password

After filling in my password for the n-th time while using tools like pg_dump and pg_restore I searched for a solution:

export PGPASSWORD=

Conclusions

I hope some of the tips on PostgreSQL and PostGIS might come in handy some day. If you want to learn more on PostgreSQL I would suggest you to check the excellent documentation. For all things related to performance there is the excellent book: PostgreSQL 9.0 High Performance.
The best book on PostGIS is definitely PostGIS in Action. The second edition is coming out in may 28, 2014.

Loading and querying spatial data with PostGIS
What I learned: tmux, top, ls and pscp

Monte Carlo Experiment of the Orchard Game: What are the odds of winning ?

Update november 2021: Simon Naarmann noticed a bug in the worst scenario, I've updated this post to include the fix.

Recently my little son received the Orchard Game (Amazon). After playing it a few time I started wondering what the odds of winning the game are. To find this out I wrote a short Monte Carlo simulation of the game.

The rules

The orchard game is a very simple collaborative game with the following rules:

On the board 4 types of fruits with 10 pieces each.
Every player throws a six sided dice with 4 colors (for each fruit a color), a fruit basket and a raven.
If you throw a color you may harvest one piece of fruit with the same color.
If you throw the fruit basket you can take two pieces of the fruit of your choice.
If you throw the raven you have to put one of the 9 pieces of the raven on the board.
If all fruits are collected by the players before the raven is complete then the players have won the game.

Note that to play perfectly you have to pick up the most abundant fruits when you've thrown the fruit basket.

The program

I wrote the following simulation in Python. First thing I wrote was a simulation of the game. The actual strategy used to decide which fruit is harvested when the fruit basket is thrown is passed in as a function.

import random

def orchard(fruitbox_strategy):
    raven = 0
    fruits = 4 * [10]
    while sum(fruits) > 0 and raven < 9:
        n = random.randint(0,5)
        if n == 5:
            raven = raven+1
        elif n == 4:
            fruitbox_strategy(fruits)
            fruitbox_strategy(fruits)
        elif fruits[n] > 0:
            fruits[n] = fruits[n]-1
    return raven < 9

Next I implemented 3 different fruit basket strategies. The perfect playing strategy is decrease_max. It will always harvest the most abundant fruit(s).

def decrease_max(fruits):
    mi, mv = 0,0
    for i, v in enumerate(fruits):
        if v > mv:
            mi,mv = i,v
    fruits[mi] = mv-1

def decrease_min(fruits):
    mi, mv = 0, 11
    for i, v in enumerate(fruits):
        if v < mv and v > 0:
            mi,mv = i,v

    if mv < 11:
        fruits[mi] = mv-1

def decrease_random(fruits):
    mi = random.randint(0,3)
    v = fruits[mi]
    if v > 0:
        fruits[mi] = v-1
    elif sum(fruits) > 0:
        decrease_random(fruits)

After that I implemented the Monte Carlo simulation. It will repeatedly play the game with the given strategy and return the percentage of games it won. The number of times the game is simulated can be picked up front. To get an idea of the robustness of the result I repeated the simulation 10 times to be able to inspect the variability in the outcome. Increasing the number of games played within a simulation makes the results more stable.

def monte_carlo_simulation(game, count):
    won = 0
    for _ in range(count):
        if game():
            won = won + 1
    return (won*100) /count

def simulate_orchard_best(count):
    return monte_carlo_simulation(lambda:orchard(decrease_max), count)

def simulate_orchard_worst(count):
    return monte_carlo_simulation(lambda:orchard(decrease_min), count)

def simulate_orchard_random(count):
    return monte_carlo_simulation(lambda:orchard(decrease_random), count)


print('Winning rates of 10 runs of the best strategy with 50 games: \n%s' %
      ([str(simulate_orchard_best(50)) + '%' for _ in range(10)]))
    
print('Winning rates of 10 runs of the best strategy with 1000 games: %s' %
      ([str(simulate_orchard_best(1000)) + '%' for _ in range(10)])) 

print('Winning rates of 10 runs of the worst strategy with 50 games: \n%s' %
      ([str(simulate_orchard_worst(50)) + '%' for _ in range(10)]))

print('Winning rates of 10 runs of the worst strategy with 1000 games: \n%s' %
      [str(simulate_orchard_worst(1000)) + '%' for _ in range(10)])

print('Winning rates of 10 runs of the random strategy with 50 games: \n%s' % 
      ([str(simulate_orchard_random(50)) + '%' for _ in range(10)]))

print('Winning rates of 10 runs of the random strategy with 1000 games: \n%s' %
      [str(simulate_orchard_random(1000)) + '%' for _ in range(10)])

A typical result looks like this:

Winning rates of 10 runs of the best strategy with 50 games: 
['58%', '80%', '70%', '62%', '70%', '66%', '62%', '78%', '64%', '70%']
Winning rates of 10 runs of the best strategy with 1000 games:
['67%', '68%', '68%', '68%', '68%', '67%', '69%', '67%', '71%', '69%']
Winning rates of 10 runs of the worst strategy with 50 games: 
['60%', '44%', '62%', '42%', '50%', '50%', '62%', '54%', '64%', '58%']
Winning rates of 10 runs of the worst strategy with 1000 games: 
['53%', '51%', '54%', '54%', '53%', '52%', '54%', '57%', '50%', '54%']
Winning rates of 10 runs of the random strategy with 50 games: 
['72%', '58%', '60%', '66%', '60%', '58%', '60%', '70%', '60%', '72%']
Winning rates of 10 runs of the random strategy with 1000 games: 
['62%', '62%', '63%', '63%', '62%', '60%', '60%', '62%', '61%', '65%']

The full source code is in the github repository of this blog.

What I learned: Part 1 Linux commands

Note: the "What I learned" series of posts is intended as a reference.

Problem 1: starting a long running process process on a GNU/Linux (Ubuntu) machine without having to keep your session open

If you find yourself working regularly on a the same GNU/Linux server but your session doesn't stay alive then there is one command that is really helpful:

tmux

The short flow for using tmux is:

ssh into the remote machine
start tmux by typing tmux into the shell
start the process you want inside the started tmux session
leave/detach the tmux session by typing Ctrl-B and then D or close your ssh session

Whenever you want to continue working in your tmux session you can just do this with

tmux attach

Some useful tmux commands:

Ctrl+b d         detach

Ctrl+b %         split window vertically

Ctrl+b "         split window horizontally

Ctrl+b x         kill pane

Ctrl+b arrow     navigate between panes

For more information on tmux:
http://robots.thoughtbot.com/post/2641409235/a-tmux-crash-course
http://askubuntu.com/a/220880

Problem 2: Viewing the lists of processes and their memory usage

top

Problem 3: Displaying the user rights of a directory

ls -la "directory_name"

Problem 4: copy a file from Windows to a GNU/Linux box with ssh

pscp /path/to/your/file.txt username@example.com:/path/on/server

pscp can be downloaded at the same place as the ssh tool putty: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

Conclusion

A lot can be done with the built-in linux tools or some downloads but you have to know they exist or where to look for them. A very good introduction to GNU/Linux and its wonderful world of commands is A Practical Guide to Linux Commands, Editors, and Shell Programming.

Samuel Bosch

Pages