A terminal screenshot with clickhouse client executing a geoip query

A Cleaner Approach to IP-based Geolocation in ClickHouse

~4 min read

Updated on


This is a short note on what I found to be a better way of implementing IP-based geolocation in ClickHouse. It may be not for everyone.

Contents

Intro

Since I use ClickHouse for building observability at my company, there are some situations when IP geolocation is needed.

There’s a really great article on the ClickHouse blog explaining in detail how this can be approached.

It’s a good start, but I decided to make my own flavor by adding country names from another dataset in addition to country codes, and squash everything into a single Dictionary definition to eliminate the hustle of maintaining any possible tables backing the dictionary.

Datasets Used

Implementation

Server Configuration

By default, ClickHouse expects the default user to be available, when it uses the CLICKHOUSE SOURCE for a dictionary.

For improved security, I usually disable this user, and create separate users as needed, each protected by a password.

  1. Disable the default user

    Create this file at /etc/clickhouse-server/users.d/default.xml

    <clickhouse>
      <users>
          <default remove="remove"></default>
      </users>
    </clickhouse>
  2. Create a user for dictionaries

    /etc/clickhouse-server/users.d/dict_updater.xml

    <clickhouse>
      <users>
          <dict_updater>
              <networks>
                  <ip>127.0.0.1/32</ip>
              </networks>
              <password_sha256_hex>REPLACEME</password_sha256_hex>
          </dict_updater>
      </users>
    </clickhouse>

    Where REPLACEME is a SHA256 hash. It can be generated using openssl:

    $

    (read P; echo -n $P | openssl dgst -sha256) # Use the `read` builtin with an `-s` option to prevent echoing if your shell supports this

Dictionary

I like to avoid using the default database, and use a separate database for geolocation and other supplementary data.

CREATE DATABASE IF NOT EXISTS meta;

Now, create the Dictionary itself. As of the time writing, I update geolocation data every 2 weeks, so LIFETIME is set to 1209600 seconds. Don’t forget to replace REPLACEME with your password. It should be the password itself this time, not its hash.

Warning

If you are using an interactive ClickHouse client, make sure it doesn’t store command history for the query below. Otherwise, it will leak your dict_updater password.

I usually just connect with the --history-file /dev/null option using clickhouse client for that purpose.

CREATE DICTIONARY IF NOT EXISTS meta.ipv4_geo
(
    cidr          String,
    country_code  String,
    country_name  String,
    city          String,
    latitude      Float64,
    longitude     Float64
)
PRIMARY KEY cidr
SOURCE(CLICKHOUSE(
    user 'dict_updater'
    password 'REPLACEME'
    query '
        SELECT cidr, country_code, country_name, city, latitude, longitude FROM
        (
            WITH
                bitXor(ip_range_start, ip_range_end) AS xor,
                IF(xor != 0, CEIL(LOG2(xor)), 0) AS unmatched,
                32 - unmatched AS cidr_suffix,
                toIPv4(bitAnd(bitNot(POW(2, unmatched) - 1), ip_range_start)::UInt64) AS cidr_address
            SELECT
                *,
                CONCAT(toString(cidr_address), \'/\', toString(cidr_suffix)) AS cidr
            FROM
            (
                SELECT
                    *
                FROM
                    url(
                        \'https://raw.githubusercontent.com/sapics/ip-location-db/master/dbip-city/dbip-city-ipv4.csv.gz\',
                        \'CSV\',
                        \'
                            ip_range_start IPv4,
                            ip_range_end   IPv4,
                            country_code   String,
                            state1         String,
                            state2         String,
                            city           String,
                            postcode       String,
                            latitude       Float64,
                            longitude      Float64
                        \'
                    )
            )
        ) AS geoip
        JOIN
        (
            SELECT Code as country_code, Name as country_name
            FROM
            (
                SELECT
                    *
                FROM
                    url(
                        \'https://raw.githubusercontent.com/annexare/Countries/refs/heads/main/dist/countries.csv\',
                        \'CSVWithNames\',
                        \'
                            Code String,
                            Name String
                        \'
                    )
            )
        ) AS countries
        ON geoip.country_code = countries.country_code
    '
))
LAYOUT(ip_trie)
LIFETIME(1209600);

Custom Lookup Functions

To make your life easier and queries cleaner, create a couple of custom functions for performing IPv4 lookups:

CREATE FUNCTION IF NOT EXISTS lookupIPv4 AS
    (address, selection) -> dictGet('meta.ipv4_geo', selection, tuple(IPv4StringToNumOrDefault(address)));

CREATE FUNCTION IF NOT EXISTS lookupIPv4All AS
    (address) -> dictGet('meta.ipv4_geo',
                         tuple('country_code', 'country_name', 'city', 'latitude', 'longitude'),
                         tuple(IPv4StringToNumOrDefault(address)));

Usage

lookupIPv4

If an IP address is invalid, or there’s no data for this address, default values of each data type are returned.

(String, String or Tuple) -> Tuple or any value

select lookupIPv4('172.67.199.23', 'country_code');
select lookupIPv4('172.67.199.23', ('country_name', 'longitude', 'latitude'));

lookupIPv4All

The behavior is same is above.

(String) -> Tuple

select lookupIPv4All('172.67.199.23');

The Result

Now you have an automatically updating in-memory dictionary that provides you with geoip data. Functions lookupIPv4 and lookupIPv4All can make queries cleaner.

The dictionary will be populated when you first try to access a value from it, so the query will take a long time. Consequent queries will be very fast.

If you restart the server, the dictionary will be wiped from memory.

Even though this covers IPv4 only, I think people will come up with a similar solution for IPv6.


Thank you for reading, and have a good rest of your day! (^ ~ ^ )

If you have any questions/suggestions, or found an error, contact me!


Found this article to be helpful? Consider supporting the author.