Getting Started

ClickHouse is the fastest and most resource efficient real-time data warehouse and open-source database. This tutorial walks you through how you quickly can get started with data warehousing in Eyevinn Open Source Cloud.

Prerequisites

Step 1: Setup secrets

Create a service secret to hold the password for the default user.

Skärmavbild 2025-02-13 kl  08 27 57

Step 2: Create a ClickHouse server instance

Click on the button "Create clickhouse-server" in the web user interface. Then enter the following values in the create dialog.

Skärmavbild 2025-02-13 kl  08 30 40

Press the button Create and wait for the instance to be in status Running.

Step 3: Insert and Query data

As an example we will create a table and insert a large dataset (two million rows of the New York taxi data). Then you will run queries on the dataset, including an example of how to create a dictionary and use it to perform a JOIN.

Click on the ClickHouse server instance card.

Skärmavbild 2025-02-13 kl  08 32 56

Create the following trips table by pasting below in to the input field above. Enter the user and password that you set when creating the instance.

CREATE TABLE trips
(
    `trip_id` UInt32,
    `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    `pickup_date` Date,
    `pickup_datetime` DateTime,
    `dropoff_date` Date,
    `dropoff_datetime` DateTime,
    `store_and_fwd_flag` UInt8,
    `rate_code_id` UInt8,
    `pickup_longitude` Float64,
    `pickup_latitude` Float64,
    `dropoff_longitude` Float64,
    `dropoff_latitude` Float64,
    `passenger_count` UInt8,
    `trip_distance` Float64,
    `fare_amount` Float32,
    `extra` Float32,
    `mta_tax` Float32,
    `tip_amount` Float32,
    `tolls_amount` Float32,
    `ehail_fee` Float32,
    `improvement_surcharge` Float32,
    `total_amount` Float32,
    `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    `trip_type` UInt8,
    `pickup` FixedString(25),
    `dropoff` FixedString(25),
    `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    `pickup_nyct2010_gid` Int8,
    `pickup_ctlabel` Float32,
    `pickup_borocode` Int8,
    `pickup_ct2010` String,
    `pickup_boroct2010` String,
    `pickup_cdeligibil` String,
    `pickup_ntacode` FixedString(4),
    `pickup_ntaname` String,
    `pickup_puma` UInt16,
    `dropoff_nyct2010_gid` UInt8,
    `dropoff_ctlabel` Float32,
    `dropoff_borocode` UInt8,
    `dropoff_ct2010` String,
    `dropoff_boroct2010` String,
    `dropoff_cdeligibil` String,
    `dropoff_ntacode` FixedString(4),
    `dropoff_ntaname` String,
    `dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;

Press button Run.

Now insert the dataset by copying and pasting this.

INSERT INTO trips
SELECT * FROM s3(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
    'TabSeparatedWithNames', "
    `trip_id` UInt32,
    `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    `pickup_date` Date,
    `pickup_datetime` DateTime,
    `dropoff_date` Date,
    `dropoff_datetime` DateTime,
    `store_and_fwd_flag` UInt8,
    `rate_code_id` UInt8,
    `pickup_longitude` Float64,
    `pickup_latitude` Float64,
    `dropoff_longitude` Float64,
    `dropoff_latitude` Float64,
    `passenger_count` UInt8,
    `trip_distance` Float64,
    `fare_amount` Float32,
    `extra` Float32,
    `mta_tax` Float32,
    `tip_amount` Float32,
    `tolls_amount` Float32,
    `ehail_fee` Float32,
    `improvement_surcharge` Float32,
    `total_amount` Float32,
    `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    `trip_type` UInt8,
    `pickup` FixedString(25),
    `dropoff` FixedString(25),
    `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    `pickup_nyct2010_gid` Int8,
    `pickup_ctlabel` Float32,
    `pickup_borocode` Int8,
    `pickup_ct2010` String,
    `pickup_boroct2010` String,
    `pickup_cdeligibil` String,
    `pickup_ntacode` FixedString(4),
    `pickup_ntaname` String,
    `pickup_puma` UInt16,
    `dropoff_nyct2010_gid` UInt8,
    `dropoff_ctlabel` Float32,
    `dropoff_borocode` UInt8,
    `dropoff_ct2010` String,
    `dropoff_boroct2010` String,
    `dropoff_cdeligibil` String,
    `dropoff_ntacode` FixedString(4),
    `dropoff_ntaname` String,
    `dropoff_puma` UInt16
") SETTINGS input_format_try_infer_datetimes = 0

Skärmavbild 2025-02-13 kl  08 35 30

Verify it works with this command.

SELECT count() FROM trips

Skärmavbild 2025-02-13 kl  08 36 22

You should have about 2 million rows in your table. Now let us try with this query.

SELECT DISTINCT(pickup_ntaname) FROM trips

Skärmavbild 2025-02-13 kl  08 37 46

This query has to process 2M rows and return 190 values, but notice it does this in about 1 second. The pickup_ntaname column represents the name of the neighborhood in New York City where the taxi ride originated.

Step 4: Analyze the data

Let's run some queries to analyze the 2M rows of data...

We will start with some simple calculations, like computing the average tip amount:

SELECT round(avg(tip_amount), 2) FROM trips

This query computes the average cost based on the number of passengers:

SELECT
    passenger_count,
    ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count

Find more examples in the ClickHouse tutorials.

Using the ClickHouse client

You can also use the ClickHouse client to insert and query data from your server. Install the client.

% brew install clickhouse

Find the IP and Port on the ClickHouse server instance card.

Skärmavbild 2025-02-13 kl  08 42 18

The ClickHouse URL in this example is then clickhouse://myuser:secret@172.232.131.169:10523 and what you specify when you run the client.

% clickhouse client clickhouse://myuser:secret@172.232.131.169:10523
ClickHouse client version 25.1.3.23 (official build).
Connecting to 172.232.131.169:10523 as user myuser.
Connected to ClickHouse server version 24.12.3.

eyevinnlab-guide-7cc86668b-wmq57 :)

Now we can try with the same query as above.

eyevinnlab-guide-7cc86668b-wmq57 :) SELECT round(avg(tip_amount), 2) FROM trips

SELECT round(avg(tip_amount), 2)
FROM trips

Query id: 20428ab7-3f2d-420d-b05a-bab2a789ce57

   ┌─round(avg(tip_amount), 2)─┐
1. │                      1.68 │
   └───────────────────────────┘

1 row in set. Elapsed: 0.017 sec. Processed 2.00 million rows, 8.00 MB (117.53 million rows/s., 470.13 MB/s.)
Peak memory usage: 71.18 KiB.