Contents
Intro
Since I use ClickHouse for building observability at my company, there are some situations when IP geolocation is needed.
There’s a really great article on the ClickHouse blog explaining in detail how this can be approached.
It’s a good start, but I decided to make my own flavor by adding country names from another dataset in addition to
country codes, and squash everything into a single Dictionary
definition to eliminate the hustle of maintaining any possible
tables backing the dictionary.
Datasets Used
- Base dataset: sapics/ip-location-db/dbip-city
- Country codes to names mapping: annexare/Countries
Implementation
Server Configuration
By default, ClickHouse expects the default
user to be available, when it uses the CLICKHOUSE
SOURCE
for a dictionary.
For improved security, I usually disable this user, and create separate users as needed, each protected by a password.
-
Disable the default user
Create this file at
/etc/clickhouse-server/users.d/default.xml
<clickhouse> <users> <default remove="remove"></default> </users> </clickhouse>
-
Create a user for dictionaries
/etc/clickhouse-server/users.d/dict_updater.xml
<clickhouse> <users> <dict_updater> <networks> <ip>127.0.0.1/32</ip> </networks> <password_sha256_hex>REPLACEME</password_sha256_hex> </dict_updater> </users> </clickhouse>
Where
REPLACEME
is a SHA256 hash. It can be generated using openssl:$
(read P; echo -n $P | openssl dgst -sha256) # Use the `read` builtin with an `-s` option to prevent echoing if your shell supports this
Dictionary
I like to avoid using the default database, and use a separate database for geolocation and other supplementary data.
CREATE DATABASE IF NOT EXISTS meta;
Now, create the Dictionary
itself. As of the time writing, I update geolocation data every 2 weeks, so LIFETIME
is
set to 1209600
seconds. Don’t forget to replace REPLACEME
with your password. It should be the password itself this time, not its hash.
Warning
CREATE DICTIONARY IF NOT EXISTS meta.ipv4_geo
(
cidr String,
country_code String,
country_name String,
city String,
latitude Float64,
longitude Float64
)
PRIMARY KEY cidr
SOURCE(CLICKHOUSE(
user 'dict_updater'
password 'REPLACEME'
query '
SELECT cidr, country_code, country_name, city, latitude, longitude FROM
(
WITH
bitXor(ip_range_start, ip_range_end) AS xor,
IF(xor != 0, CEIL(LOG2(xor)), 0) AS unmatched,
32 - unmatched AS cidr_suffix,
toIPv4(bitAnd(bitNot(POW(2, unmatched) - 1), ip_range_start)::UInt64) AS cidr_address
SELECT
*,
CONCAT(toString(cidr_address), \'/\', toString(cidr_suffix)) AS cidr
FROM
(
SELECT
*
FROM
url(
\'https://raw.githubusercontent.com/sapics/ip-location-db/master/dbip-city/dbip-city-ipv4.csv.gz\',
\'CSV\',
\'
ip_range_start IPv4,
ip_range_end IPv4,
country_code String,
state1 String,
state2 String,
city String,
postcode String,
latitude Float64,
longitude Float64
\'
)
)
) AS geoip
JOIN
(
SELECT Code as country_code, Name as country_name
FROM
(
SELECT
*
FROM
url(
\'https://raw.githubusercontent.com/annexare/Countries/refs/heads/main/dist/countries.csv\',
\'CSVWithNames\',
\'
Code String,
Name String
\'
)
)
) AS countries
ON geoip.country_code = countries.country_code
'
))
LAYOUT(ip_trie)
LIFETIME(1209600);
Custom Lookup Functions
To make your life easier and queries cleaner, create a couple of custom functions for performing IPv4 lookups:
CREATE FUNCTION IF NOT EXISTS lookupIPv4 AS
(address, selection) -> dictGet('meta.ipv4_geo', selection, tuple(IPv4StringToNumOrDefault(address)));
CREATE FUNCTION IF NOT EXISTS lookupIPv4All AS
(address) -> dictGet('meta.ipv4_geo',
tuple('country_code', 'country_name', 'city', 'latitude', 'longitude'),
tuple(IPv4StringToNumOrDefault(address)));
Usage
lookupIPv4
If an IP address is invalid, or there’s no data for this address, default values of each data type are returned.
(String, String or Tuple) -> Tuple or any value
select lookupIPv4('172.67.199.23', 'country_code');
select lookupIPv4('172.67.199.23', ('country_name', 'longitude', 'latitude'));
lookupIPv4All
The behavior is same is above.
(String) -> Tuple
select lookupIPv4All('172.67.199.23');
The Result
Now you have an automatically updating in-memory dictionary that provides you with geoip data.
Functions lookupIPv4
and lookupIPv4All
can make queries cleaner.
The dictionary will be populated when you first try to access a value from it, so the query will take a long time. Consequent queries will be very fast.
If you restart the server, the dictionary will be wiped from memory.
Even though this covers IPv4 only, I think people will come up with a similar solution for IPv6.
Read Next
Building SQL-based Observability With ClickHouse and Grafana