Paul’s Blog

A blog without a good name

Improve Your ST_GeoHash Sorting With These Three Simple Tricks

It’s common to want to spatially order a table for improved performance. This is best done with the ST_GeoHash function, which allows you to put your data into a spatially correlated order using a Morton code.

This can be done with the SQL

1
2
3
CREATE TABLE foo AS
  SELECT * FROM foo_unordered
    ORDER BY ST_GeoHash(ST_Transform(geom,4326))

It turns out there’s three ways to improve the performance of this.

Quicker transforms

ST_GeoHash requires coordinates in EPSG:4326, so they need to be transformed into that projection. But this involves reprojecting every point. For any projections that don’t cross the boundaries of EPSG:4326 or excessively distort lines, we can use the fact that the bounding box of the transformed geometry is the transformed bounding box of the original geometry. For data stored in EPSG:3857 (Spherical Mercator), this is always true. Even in odder projections, it’s still close enough to be useful for clustering.

The ST_Envelope function gives the bounding box of a geometry, so we can change the SQL

1
2
3
CREATE TABLE foo AS
  SELECT * FROM foo_unordered
    ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(geom),4326));

Shorter GeoHashes

ST_GeoHash takes an argument which lets you define the length of the resulting GeoHash. It turns out that 10 GeoHash characters are ideal, being 5% faster than the full length GeoHashes. It also doesn’t impact the quality of the resulting order, as 12 characters only has 0.02% more unique hashes on a planet-wide dataset. Shorter would be worse, as 8 characters has 23% fewer hashes.

Our SQL is now 15% faster

1
2
3
CREATE TABLE foo AS
  SELECT * FROM foo_unordered
    ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(geom),4326),10);

Stupider sorting

ST_GeoHash produces text, which on most databases will be sorted with a locale-specific UTF8 sort. This will properly sort accents, capitals, and other special characters. But GeoHashes are in base-32, using only 0-9 and b-z. This lets us sort in the “C” locale, which sorts by byte order.

Changing the ORDER BY collation lets us change how it’s sorted

1
2
3
CREATE TABLE foo AS
  SELECT * FROM foo_unordered
    ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(geom),4326),10) COLLATE "C";

This SQL is 40% faster than we started with, and it’s what osm2pgsql has started using to save up to an hour off of some imports.