It Starts With the Planet

To do something with OpenStreetMap data, we have to download it first. This can be the entire data from planet.openstreetmap.org or a smaller extract from a provider like Geofabrik. If you’re doing this manually, it’s easy. Just a single command will call curl or wget, or you can download it from the browser. If you want to script it, it’s a bit harder. You have to worry about error conditions, what can go wrong, and make sure everything can happen unattended. So, to make sure we can do this, we write a simple bash script.

The goal of the script is to download the OSM data to a known file name, and return 0 if successful, or 1 if an error occurred. Also, to keep track of what was downloaded, we’ll make two files with information on what was downloaded, and what state it’s in: state.txt and configuration.txt. These will be compatible with osmosis, the standard tool for updating OpenStreetMap data.

Before doing anything else, we specify that this is a bash script, and that if anything goes wrong, the script is supposed to exit.

#!/usr/bin/env bash

set -euf -o pipefail

Next, we put the information about what’s being downloaded, and where, into variables. It’s traditional to use the Geofabrik Liechtenstein extract for testing, but the same scripts will work with the planet.

PLANET_FILE='data.osm.pbf'

PLANET_URL='http://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf'
PLANET_MD5_URL="${PLANET_URL}.md5"

We’ll be using curl to download the data, and every time we call it, we want to add the options -s and -L. Respectively, these make curl silent and cause it to follow redirects. Two files are needed: the data, and it’s md5 sum. The md5 file looks something like 27f7... liechtenstein-latest.osm.pbf. The problem with this is we’re saving the file as $PLANET_FILE, not liechtenstein-latest.osm.pbf. A bit of manipulation with cut fixes this.

CURL='curl -s -L'
MD5="$($CURL "${PLANET_MD5_URL}" | cut -f1 -d' ')"
echo "${MD5}  ${PLANET_FILE}" > "${PLANET_FILE}.md5"

The reason for downloading the md5 first is it reduces the time between the two downloads are initiated, making it less likely the server will have a new version uploading in that time.

The next step is easy, downloading the planet, and checking the download wasn’t corrupted. It helps to have a good connection here.

$CURL -o "${PLANET_FILE}" "${PLANET_URL}" || { echo "Planet file failed to download"; exit 1; }

md5sum --quiet --status --strict -c "${PLANET_FILE}.md5" || { echo "md5 check failed"; exit 1; }

Libosmium is a popular library for manipulating OpenStreetMap data, and the osmium command can show metadata from the header of the file. The command osmium fileinfo data.osm.pbf tells us

Header:
  Bounding boxes:
    (9.47108,47.0477,9.63622,47.2713)
  With history: no
  Options:
    generator=osmium/1.5.1
    osmosis_replication_base_url=http://download.geofabrik.de/europe/liechtenstein-updates
    osmosis_replication_sequence_number=1764
    osmosis_replication_timestamp=2018-01-15T21:43:03Z
    pbf_dense_nodes=true
    timestamp=2018-01-15T21:43:03Z

The osmosis properties tell us where to go for the updates to the data we downloaded. Despite not needing the updates for this task, it’s useful to store this in the state.txt and configuration.txt files mentioned above.

Rather than try to parse osmium’s output, it has an option to just extract one field. We use this to get the base URL, and save that to configuration.txt

REPLICATION_BASE_URL="$(osmium fileinfo -g 'header.option.osmosis_replication_base_url' "${PLANET_FILE}")"
echo "baseUrl=${REPLICATION_BASE_URL}" > 'configuration.txt'

Replication sequence numbers needed to represented as a three-tiered directory structure, for example 123/456/789. By taking the number, padding it to 9 characters with 0s, and doing some sed magic, we get this format. From there, it’s easy to download the state.txt file representing the state of the data that was downloaded.

REPLICATION_SEQUENCE_NUMBER="$( printf "%09d" "$(osmium fileinfo -g 'header.option.osmosis_replication_sequence_number' "${PLANET_FILE}")" | sed ':a;s@\B[0-9]\{3\}\>@/&@;ta' )"

$CURL -o 'state.txt' "${REPLICATION_BASE_URL}/${REPLICATION_SEQUENCE_NUMBER}.state.txt"

After all this has been run, we’ve got the planet, it’s md5 file, and the state and configuration that correspond to the download.

Combining the code fragments, adding some comments, and cleaning up the files results in this shell script

#!/usr/bin/env bash

set -euf -o pipefail

PLANET_FILE='data.osm.pbf'

PLANET_URL='http://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf'
PLANET_MD5_URL="${PLANET_URL}.md5"
CURL='curl -s -L'

# Clean up any remaining files
rm -f -- "${PLANET_FILE}" "${PLANET_FILE}.md5" 'state.txt' 'configuration.txt'

# Because the planet file name is set above, the provided md5 file needs altering
MD5="$($CURL "${PLANET_MD5_URL}" | cut -f1 -d' ')"
echo "${MD5}  ${PLANET_FILE}" > "${PLANET_FILE}.md5"

# Download the planet
$CURL -o "${PLANET_FILE}" "${PLANET_URL}" || { echo "Planet file failed to download"; exit 1; }

md5sum --quiet --status --strict -c "${PLANET_FILE}.md5" || { echo "md5 check failed"; exit 1; }

REPLICATION_BASE_URL="$(osmium fileinfo -g 'header.option.osmosis_replication_base_url' "${PLANET_FILE}")"
echo "baseUrl=${REPLICATION_BASE_URL}" > 'configuration.txt'

# sed to turn into / formatted, see https://unix.stackexchange.com/a/113798/149591
REPLICATION_SEQUENCE_NUMBER="$( printf "%09d" "$(osmium fileinfo -g 'header.option.osmosis_replication_sequence_number' "${PLANET_FILE}")" | sed ':a;s@\B[0-9]\{3\}\>@/&@;ta' )"

$CURL -o 'state.txt' "${REPLICATION_BASE_URL}/${REPLICATION_SEQUENCE_NUMBER}.state.txt"

Paul’s Blog

A blog without a good name

It Starts With the Planet