Paul’s Blog

A blog without a good name

Osm2pgsql Style File Size

Recently I had occasion to test if an osm2pgsql style file with fewer columns could improve import and rendering speed. Osm2pgsql uses a .style file to define what columns to create when importing data, for example a column for the name tag, the highway tag, the waterway tag, the landuse tag, etc. This file determines what you can render from your database.

Obviously, the fewer columns you have, the smaller the database is. It’s been proposed to use a smaller style file to improve performance, but benchmarking this is tricky.

I ran a series of benchmarks of osm2pgsql with multiple style files, testing rendering speed and if it is improved by a smaller style file. I was able to do this because the default osm2pgsql style file contains some columns that are not used by the openstreetmap-carto style. Most of these columns, such as surface, contain additional information about the objects and few objects will have one of these tags and not some other tag still included. This means that there will still be approximately as many rows as before.

To test the speed I had to import the planet with osm2pgsql twice, once with the normal style and once with the reduced style. As the style only affects the rendering tables I used the drop option to speed up the import and reduce database size.

The smaller style reduced the database size from 89.71GB to 89.45GB, a decrease of 0.28%. If this was the only benefit, it would hardly be worth it, but there are three ways the smaller style could speed up the database: import speed, rendering speed, and update speed.

For all of my benchmarking, I used a modified version of osm2pgsql which doesn’t automatically create indexes so that I can see exactly how long each step takes.

An osm2pgsql import can be broken into three phases: PBF processing, where the PBF is read in to osm2pgsql; pending ways, where ways that could be areas are turned into areas; and PostgreSQL indexing and clustering, where a series of lengthy SQL commands are run.

Testing revealed that the slimmer style increased speed by 1.1% during processing, 0.45% during pending ways, and significantly less than run to run variance for creating indexes and clustering. On the machine I’ve been using for benchmarking, the smaller style took two minutes less, on an import that took 5 hours and 40 minutes, or about 0.5% overall. This is hardly a significant speed gain.

On a high capacity rendering server in a production environment, the initial import time is not significant compared to the rendering speed. You might end up reimporting the planet 2-4 times a year for style changes, to improve clustering, and to fix errors, but the rest of the time it’s rendering and consuming updates.

Before considering rendering performance, it’s necessary to consider the strange nature of geometry indexes. Geometry indexes are gist indexes, and these indexes have an annoying property when benchmarking: they’re different every time you create them. I don’t know if the indexes are non- deterministic or depend on an internal PostgreSQL state, but if you benchmark the performance, re-index the database, and benchmark the performance again you will get different results.

This is handled in two ways. Instead of clustering on a gist index, a ST_GeoHash index is used instead. Clustering on a ST_GeoHash index has the same effect of storing data on disk into spatially correlated manner, like clustering on a GiST index does, but the way the GeoHash index sorts the data consistent from run to run. The second way is to repeatedly re-index the database and run the benchmarks. Both are necessary, the GeoHash clustering to have data in a repeatable position on disk, and the repeated indexing to average out variations.

My standard benchmarking routine is to shut down PostgreSQL, drop all memory caches, start it back up, then render a list of 18000 meta-tiles three times, throwing away the results from the first rendering of the list. This ensures consistent performance and consistent cache behavior, which is very important for database benchmarking. When I tried rendering the list of 18000 tiles more times, I found the run to run variation was about 0.05%, which makes this a very consistent benchmark.

With five runs with regenerated indexes for both the normal and reduced style files, I got an average of 8.655 MT/s and standard deviation of 0.022 MT/s for the reduced style and 8.859 MT/s and standard deviation of 0.066 MT/s for the standard. This is an increase in speed of 2.35%, approximately ten times the percent reduction in size, and five times the import speed increase.

Clearly using as small a style as possible has positive impacts for speed; furthermore the speed increase is more than the percentage decrease in database size.