Paul’s Blog

A blog without a good name

Osm2pgsql Multipolygons

Today I’m going to write about osm2pgsql and multipolygons. Not your run-of-the-mill OpenStreetMap multipolygons, but the PostGIS MULTIPOLYGONs and the osm2pgsql options that create them. Osm2pgsql has the option -G or --multi-geometry, which tells it to “generate multi-geometry features in PostgreSQL tables.” This is a bit vague, particularly if you haven’t studied the PostGIS data model.

In OpenStreetMap terms a multipolygon is a relation used to create an area from multiple ways. These ways don’t need to be closed individually, but combined they need to form a closed geometry. One of the most common uses of multipolygon relations is for an area with a hole, such as a building with an interior open area.

Building with interior open area

PostGIS doesn’t need a MULTIPOLYGON to represent this, but can use a POLYGON with an interior ring. MULTIPOLYGONs are only required when multiple disjoint areas are part of the same object. This is a situation sometimes found with boundaries and islands, where the islands will be part of the same administrative or other area, but not the water in between them.

Two disjoint areas which need a MULTIPOLYGON to represent

Two separate POLYGONs
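To make the distinction concrete, here is a minimal sketch of the three shapes described above, written out as WKT (the text format PostGIS accepts for geometries). The coordinates and helper names are made up for illustration, and this is plain Python rather than anything osm2pgsql itself runs.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Ring = List[Point]            # closed ring: first point equals last point
Polygon = List[Ring]          # first ring is the exterior, the rest are holes
MultiPolygon = List[Polygon]  # one object made of several disjoint polygons


def ring_wkt(ring: Ring) -> str:
    return "(" + ", ".join(f"{x:g} {y:g}" for x, y in ring) + ")"


def polygon_body(polygon: Polygon) -> str:
    return "(" + ", ".join(ring_wkt(r) for r in polygon) + ")"


def polygon_wkt(polygon: Polygon) -> str:
    return "POLYGON" + polygon_body(polygon)


def multipolygon_wkt(multi: MultiPolygon) -> str:
    return "MULTIPOLYGON(" + ", ".join(polygon_body(p) for p in multi) + ")"


def split(multi: MultiPolygon) -> List[str]:
    """Turn one MULTIPOLYGON into one POLYGON per part."""
    return [polygon_wkt(p) for p in multi]


# A building with an interior open area: one POLYGON, two rings.
building: Polygon = [
    [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)],  # outer wall
    [(1, 1), (3, 1), (3, 3), (1, 3), (1, 1)],  # courtyard (hole)
]

# Two islands belonging to the same object: needs a MULTIPOLYGON.
islands: MultiPolygon = [
    [[(0, 0), (1, 0), (1, 1), (0, 0)]],
    [[(5, 5), (6, 5), (6, 6), (5, 5)]],
]

print(polygon_wkt(building))
print(multipolygon_wkt(islands))
print(split(islands))
```

The hole never forces a MULTIPOLYGON; only the disjoint islands do, and splitting them yields two independent POLYGON rows.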

It is always possible to convert a MULTIPOLYGON into multiple POLYGONs, and this is what osm2pgsql does by default. Why would we want to do this? Speed, particularly when doing index lookups.

A common task is to find out what areas you’re within. This is done by first checking if you are within the bounding box of an area (the smallest axis-aligned rectangle that can contain the area), then doing slower calculations to see which side of the boundary you lie on. The bounding box check is normally assisted by an index, so you don’t even have to retrieve geometries where you’re not within the bounding box. A similar query is used for rendering to see what features to return to the renderer.
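The two-step check can be sketched in a few lines. This is an illustration of the idea only: PostGIS does the first step with a spatial index and the second with its own geometry routines, whereas the sketch below uses a plain comparison and a simple ray-casting test on a made-up L-shaped area.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Ring = List[Point]                        # closed: first point equals last point
Box = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


def bbox(ring: Ring) -> Box:
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def in_bbox(pt: Point, box: Box) -> bool:
    """Cheap test: four comparisons, no geometry needed."""
    x, y = pt
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def in_ring(pt: Point, ring: Ring) -> bool:
    """Slow test: cast a ray to the right and count edge crossings."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):          # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def contains(pt: Point, ring: Ring, box: Box) -> bool:
    # The exact test only runs for points that pass the bounding box check.
    return in_bbox(pt, box) and in_ring(pt, ring)


# Hypothetical L-shaped area; its bounding box is the full 4x4 square.
l_shape: Ring = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4), (0, 0)]
l_box = bbox(l_shape)

print(contains((5, 5), l_shape, l_box))      # rejected by the bounding box alone
print(contains((3, 3), l_shape, l_box))      # inside the box, outside the area
print(contains((0.5, 0.5), l_shape, l_box))  # inside both
```

The point at (3, 3) is the expensive case: it passes the bounding box check, so the exact test has to run even though the answer is no.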

Picture the state of Alaska in WGS84 (latitude and longitude). It has one large part in the western hemisphere and a small part in the eastern hemisphere, just across the 180° meridian near Russia. Because the two parts sit at opposite ends of the longitude range, the combined bounding box spans nearly every longitude and covers everything from about 50 degrees north to 73 degrees north. Within this bounding box are the UK, Finland, Sweden, Norway, Poland and much of Germany and Russia. Every time you want to check where you are in northern Europe, you have to evaluate if you’re in Alaska or not, because you’re within Alaska’s bounding box.

Alaska bounding box stretching around the world

If you split the Alaska MULTIPOLYGON into multiple POLYGONs you can avoid this problem. Each POLYGON has its own bounding box, and none intersects northern Europe. When you’re within one part of Alaska, you don’t have to do any calculations involving the other part. In the case of a large complicated geometry, this can save a significant amount of time.

Multiple small Alaska bounding boxes
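A rough sketch of the Alaska case, using two made-up rectangles in place of the real geometry. The numbers are only approximate stand-ins for the mainland and the far-western islands, not real boundary data, and the helper names are hypothetical.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def box_contains(box: Box, lon: float, lat: float) -> bool:
    return box[0] <= lon <= box[2] and box[1] <= lat <= box[3]


def union_box(boxes: Iterable[Box]) -> Box:
    """Bounding box of a MULTIPOLYGON: the box around all of its parts."""
    boxes = list(boxes)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Illustrative boxes only (assumed values, not taken from OpenStreetMap).
MAINLAND: Box = (-170.0, 51.0, -130.0, 71.5)
FAR_WEST_ISLANDS: Box = (172.0, 51.0, 180.0, 53.0)
parts = [MAINLAND, FAR_WEST_ISLANDS]

combined = union_box(parts)

stockholm = (18.07, 59.33)  # (lon, lat)

# With one MULTIPOLYGON, Stockholm passes the cheap check and needs the slow one.
hit_combined = box_contains(combined, *stockholm)

# With split POLYGONs, every part is rejected by its bounding box alone.
hit_any_part = any(box_contains(p, *stockholm) for p in parts)

print(combined)
print(hit_combined, hit_any_part)
```

Only the combined box forces the expensive geometry test for a point in Sweden; the split boxes rule it out immediately.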

With these advantages, why would you ever want to use MULTIPOLYGONs? Well, if you count the number of US states, you now have two Alaskas, for a total of 51. Similarly, area calculations can be messed up. You can get around this with ST_Collect and GROUP BY, but at that point you’ve lost any performance gains from splitting MULTIPOLYGONs.
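The counting problem and the GROUP BY workaround can be sketched with an in-memory SQLite table standing in for the osm2pgsql polygon table. The rows and areas are invented, and SQLite has no ST_Collect, so summing a stored area column stands in for re-assembling the geometry; in PostGIS the same GROUP BY would wrap the geometry column in ST_Collect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE planet_osm_polygon (osm_id INTEGER, name TEXT, way_area REAL)"
)

# Invented rows: without -G, each part of a multipolygon becomes its own row,
# and all parts of one object share the same osm_id.
conn.executemany(
    "INSERT INTO planet_osm_polygon VALUES (?, ?, ?)",
    [
        (-1, "Alaska", 1500.0),  # first part
        (-1, "Alaska", 20.0),    # second part, same object
        (-2, "Hawaii", 30.0),
    ],
)

# Naive count: one per row, so Alaska is counted twice.
naive_count = conn.execute(
    "SELECT COUNT(*) FROM planet_osm_polygon"
).fetchone()[0]

# Counting objects instead of rows.
object_count = conn.execute(
    "SELECT COUNT(DISTINCT osm_id) FROM planet_osm_polygon"
).fetchone()[0]

# Re-assembling per-object totals with GROUP BY.
areas = conn.execute(
    "SELECT osm_id, name, SUM(way_area) FROM planet_osm_polygon "
    "GROUP BY osm_id, name ORDER BY osm_id"
).fetchall()

print(naive_count, object_count)
print(areas)
```

Every query that cares about whole objects has to repeat this grouping, which is where the performance gained by splitting is spent again.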

What about rendering? If all you’re doing is rendering an area fill, POLYGONs are better, hands down, but there’s a pretty important case where the difference does matter. There’s an example from an openstreetmap-carto issue showing exactly how bad labelling can be with a MULTIPOLYGON split into POLYGONs.

Small Isles National Scenic Area labelled many times

To fix this problem, osm2pgsql has the -G option, which keeps MULTIPOLYGONs intact rather than breaking them down, leading to a rendering that works.

Small Isles National Scenic Area labelled once

Of course it’s still possible to label each component POLYGON individually if you want to, but with -G you are able to render one label per object and have meaningful geometries for analysis such as areas.

I’ve said there’s a performance hit, but how big is it? Using the benchmarking methods I’ve used previously, we can find out exactly how much slower it is, and at what zooms the speed decrease occurs.

The rendering rate on this machine for a database without hstore is 8.665 ± 0.017 metatiles per second (MT/s). With -G we get 8.19 ± 0.07 MT/s, a decrease of 5.5%.
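The 5.5% figure follows directly from the two rates. The sketch below also propagates the quoted ± values into an uncertainty on the decrease, under the assumption (mine, not stated above) that they are independent standard uncertainties.

```python
import math

# Rates from the text, in metatiles per second.
base, base_err = 8.665, 0.017   # default (split POLYGONs)
multi, multi_err = 8.19, 0.07   # with -G

ratio = multi / base
decrease = 1 - ratio

# Assumption: independent errors, so relative errors add in quadrature.
ratio_err = ratio * math.sqrt((multi_err / multi) ** 2 + (base_err / base) ** 2)

print(f"decrease: {decrease:.1%} ± {ratio_err:.1%}")
```

Under that assumption the slowdown is clearly distinguishable from zero, with most of the uncertainty coming from the -G measurement.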

By looking at the percent difference in time spent rendering at different zooms we can get a better look at where the speed reduction is occurring.

Increasing performance losses are seen at higher zooms

Unfortunately, this graph doesn’t help us hugely. The speed decrease is greatest, both on a total and per-metatile basis, at high zooms. It being most significant for high zooms is consistent with the explanations above: at high zooms you will run into more cases where a MULTIPOLYGON’s bounding box intersects the rendered area without any of its parts doing so, because the rendered area is smaller relative to the “gaps” between the different parts. It being greatest on a total basis is just a consequence of the high zooms being most of the server load.