Paul’s Blog

A blog without a good name

Real Performance Numbers for a Style

The tile list used for performance testing a stylesheet is critical. It needs to represent a realistic mix of zooms and tile complexity, and be large enough to have reasonable caching behavior. The best source for a tile list is logs from a real rendering server.

Logs for the tile.openstreetmap.org rendering servers are available, both tile accesses and one day’s worth of rendering. The former contains a lot of cache hits, but the latter is the actual workload for the rendering server.

There’s a lot of interesting information in the log file, but all that’s needed here are the lines indicating the start of rendering a tile, which look like

May  3 06:34:00 yevaud renderd[3459]: DEBUG: START TILE default 16 43368-43375 27984-27991, age 78.57 days

After START TILE are the name of the style, zoom level, x range, y range, and age of the old rendered tile.
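The shape of the extraction can be seen by pulling those fields out of the single sample line above (a sketch; the line is hardcoded here, and the same sed expression is applied to the whole log in the next section):

```shell
# Extract the zoom and the start of the x and y ranges from one renderd log line.
line='May  3 06:34:00 yevaud renderd[3459]: DEBUG: START TILE default 16 43368-43375 27984-27991, age 78.57 days'
echo "$line" | sed -n -r \
  's@^.*DEBUG: START TILE default ([[:digit:]]+) ([[:digit:]]+)-[[:digit:]]+ ([[:digit:]]+)-[[:digit:]]+.*$@\1 \2 \3@p'
# 16 43368 27984
```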

Making a list

A bit of magic with sed can turn the log file into a list of tiles in a standard z x y form

mkdir ~/tile_logs
cd ~/tile_logs
curl -sL 'http://planet.openstreetmap.org/tile_logs/renderd/renderd.yevaud.20150503.log.xz' | unxz \
  | sed -n -r 's@^.*DEBUG: START TILE default ([[:digit:]]+) ([[:digit:]]+)-[[:digit:]]+ ([[:digit:]]+)-[[:digit:]]+.*$@\1 \2 \3@p' \
  > all_metas.txt

Looking at the file, we can see how many tiles were rendered at each zoom

1
2
3
4
for z in `seq 0 19`; do
  echo -n "$z: "
  grep -c "^$z " all_metas.txt
done

This shows that of the 354 659 requests, only 12 were from zoom levels below 13. Low-zoom tiles follow a different caching logic: instead of being frequently re-rendered, they are re-rendered in bulk every month. For a stable benchmark, these low-zoom tiles can be discarded. The coordinates also describe 8×8 metatiles, renderd’s unit of work, so each entry corresponds to a single tile 3 zoom levels lower, and can be put into the same z/x/y format as used before

grep '^1[3-9] ' all_metas.txt | awk '{ print $1-3 "/" $2/8 "/" $3/8 }' > all_tiles.txt
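As a sanity check on the arithmetic, the zoom 16 metatile from the sample log line (x starting at 43368, y at 27984) should come out as a single zoom 13 tile:

```shell
# Apply the same conversion as above to the sample log line's metatile:
# drop 3 zoom levels and divide x and y by 8.
echo '16 43368 27984' | awk '{ print $1-3 "/" $2/8 "/" $3/8 }'
# 13/5421/3498
```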

The full list of about 354k requests is too long for a reasonable benchmark. Instead, a list of about 20k tiles, representing roughly 90 minutes of load on the rendering server, is used; generating it is as easy as head -n20000 all_tiles.txt > tiles.txt.

It’s important that this list is big enough, which can be checked by clearing the memory cache and rendering the tiles: it takes about a quarter of the list before all the RAM is in use.
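On Linux, one way to clear the operating system’s page cache for such a check is the drop_caches sysctl (a system-administration fragment needing root; note this clears only the kernel’s caches, and PostgreSQL’s own shared buffers survive until the database is restarted):

```shell
# Write dirty pages out to disk first, then drop the page cache,
# dentries and inodes.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
```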

Running the benchmark

Running the benchmark as before with time parallel -a tiles.txt -j8 --progress curl -s -o /dev/null http://localhost:8080/{}.pbf and discarding the first run results in an average time of 1216.5 seconds and a standard deviation of 2.7 seconds.
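For reference, the mean and standard deviation over a set of runs can be computed with a few lines of awk (the file name times.txt and the timing values in it are made up for illustration):

```shell
# Hypothetical per-run wall-clock times in seconds, one per line.
printf '%s\n' 1213.2 1217.9 1215.1 1219.4 1216.9 > times.txt

# Mean and (population) standard deviation of the runs.
awk '{ sum += $1; sumsq += $1*$1 }
     END { mean = sum/NR;
           printf "mean %.1f stddev %.1f\n", mean, sqrt(sumsq/NR - mean*mean) }' times.txt
# mean 1216.5 stddev 2.2
```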

Experience tells me that this is reasonable. A longer tile list would improve the standard deviation but take too long to run, while a shorter list carries too much error.

Indexing problems

When a GiST index is created, the result is non-deterministic, so rendering performance changes after a REINDEX DATABASE command. This used to be particularly bad when clustering on a GiST index, because the table order as well as the indexes was then non-deterministic. When purely testing a stylesheet change this doesn’t matter, because the same indexes can be used with multiple versions of the style, but if the testing involves a reimport with osm2pgsql, the index non-determinism adds noise to the results.

The only fix is to reindex multiple times and run the benchmark on each resulting set of indexes. Reindexing five times and running the benchmark on each, for a total of 25 results, gives an average time of 1203 seconds and a standard deviation of 11 seconds, much higher than before.
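A sketch of that procedure (the database name gis, the use of GNU time, and the output file names are assumptions; the parallel/curl invocation is the one from the benchmark above, and this only runs against a live rendering setup):

```shell
# Reindex five times; each pass yields a different, non-deterministic
# GiST index layout, and the benchmark is run against each one.
for i in $(seq 1 5); do
  psql -d gis -c 'REINDEX DATABASE gis'
  for run in $(seq 0 5); do   # run 0 is the warm-up and is discarded
    /usr/bin/time -f '%e' -o "time_${i}_${run}.txt" \
      parallel -a tiles.txt -j8 curl -s -o /dev/null http://localhost:8080/{}.pbf
  done
done
```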