Vector denormalisation in PostgreSQL

PostgreSQL is a relational database system, and is designed for the general case. That means it aims for scalability rather than efficiency at any specific data size.

The consequence of this is that costs that would be necessary for storing massive amounts of information also have to be paid for smaller databases. Additionally, features such as transactionality are available in all cases: they are there and active whether you want them or not. Every piece of data may be modified in a transaction, and it needs to be stored in a way that makes this possible. So it has a tuple header containing information to indicate which transactions it is visible in, whether it can be vacuumed, which version of the schema it was created under, etc.
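Part of that header is visible through the system columns that every table implicitly has. For example (any_table stands for whichever table you care to inspect):

SELECT ctid, xmin, xmax, *
FROM any_table
LIMIT 1;

Here ctid is the tuple's physical location, and xmin and xmax are the ids of the transactions that created it and, if applicable, deleted it.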

The size of the tuple header is significant for tables with small tuples. One case where this is pronounced is a table that stores vector data. By vector data I mean large numbers of individual values; data that would likely be stored in array form in a programming language. In a relational database, each data point becomes a tuple. The per-tuple overhead is multiplied by the number of tuples, and takes up the majority of the table size.

I’ll use the data from my hardware monitoring experiment as an example. The techniques are applicable to any scenario in which you want relational access to vectors of data without the large per-data point overhead.

The table is:

CREATE TABLE hardware.log
(
    time TIMESTAMP NOT NULL,
    sensor_id INTEGER NOT NULL,
    value DOUBLE PRECISION NOT NULL,

    CONSTRAINT log_pk PRIMARY KEY (time, sensor_id)
);

CREATE INDEX log_ix_sensor ON hardware.log (sensor_id);

This is in a highly normal form: (sensor_id, time) is a key, and value is fully functionally dependent on it. I’m not especially concerned with formal normalisation, except to the extent that this obvious table layout is “normalised”, and a less obvious but possibly more efficient form is “denormalised”. (I’m a self-taught database programmer, and apart from some homework at university — well after I became experienced with relational DBs — have not done much functional decomposition. See Wikipedia for more on database normalisation.)

Data is collected for 36 sensors, about every 5 seconds. After 24 days of data collection, this table contains 4.8 million rows. (Data is collected only while the machine is running, which is about 8 hours a day.)

As a normalised table with indexes on all subsets of its keys, this table provides efficient, indexed performance for a large variety of queries. But the size of this table, including indexes, is:

SELECT pg_relation_size('hardware.log'), pg_total_relation_size('hardware.log'),
pg_total_relation_size('hardware.log')/(SELECT COUNT(*) FROM hardware.log) AS per_tuple;
 pg_relation_size | pg_total_relation_size | per_tuple
------------------+------------------------+-----------
        249716736 |              508559360 |       106
(1 row)
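As a rough, back-of-the-envelope accounting (assuming a 64-bit build, where tuples are padded to 8-byte boundaries), those 106 bytes per data point break down approximately as:

  • Heap tuple: 23-byte header padded to 24, plus 20 bytes of column data (8 + 4 + 8) padded to 24 by alignment, plus a 4-byte line pointer: about 52 bytes. This matches the observed heap size of 249716736 / 4.8 million ≈ 52 bytes per row.
  • log_pk entry: 8-byte index tuple header plus 12 bytes of key data padded to 16, plus a 4-byte line pointer: about 28 bytes.
  • log_ix_sensor entry: 8-byte header plus 4 bytes of key data padded to 8, plus a 4-byte line pointer: about 20 bytes.

That is roughly 100 bytes per data point before page headers, free space and the upper levels of the indexes, which accounts for most of the measured figure.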

This is not just a storage cost: any query, even if it uses indexes, will need to read and process an amount of data that is much larger than the nominal amount of information. Assuming the table is ordered by its primary key, a latitudinal query (all readings in a date range) will read 106 bytes per output data point.

EXPLAIN ANALYZE SELECT time, sensor_id, value
FROM hardware.log
WHERE time >= '2011-03-18' AND time < '2011-03-19' ORDER BY time;
                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=44903.25..45192.14 rows=115558 width=20) (actual time=36.152..43.812 rows=100908 loops=1)
   Sort Key: "time"
   Sort Method:  quicksort  Memory: 9144kB
   ->  Bitmap Heap Scan on log  (cost=2969.46..35185.83 rows=115558 width=20) (actual time=10.339..20.272 rows=100908 loops=1)
         Recheck Cond: (("time" >= '2011-03-18 00:00:00'::timestamp without time zone) AND ("time" < '2011-03-19 00:00:00'::timestamp without time zone))
         ->  Bitmap Index Scan on log_pk  (cost=0.00..2940.57 rows=115558 width=0) (actual time=10.247..10.247 rows=100908 loops=1)
               Index Cond: (("time" >= '2011-03-18 00:00:00'::timestamp without time zone) AND ("time" < '2011-03-19 00:00:00'::timestamp without time zone))
 Total runtime: 49.364 ms
(8 rows)

A longitudinal query (all readings for a single sensor in a date range) will read 1876 bytes for every output data point (effectively the whole table), because the data for that sensor is dispersed over every page. This could be changed by ordering the table on sensor_id instead, which improves the longitudinal case at the expense of the latitudinal one. But ordering by time is more desirable because new data arrives as time increases. Since the data is appended to the table, the table maintains its ordering. If it were ordered on sensor_id, the table would need periodic clustering so that all the readings for a sensor were kept together.
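On recent versions, that periodic clustering would look something like this (CLUSTER rewrites the whole table and takes an exclusive lock while it runs):

CLUSTER hardware.log USING log_ix_sensor;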

EXPLAIN ANALYZE SELECT time, value FROM hardware.log WHERE sensor_id = 6 ORDER BY time;
                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=44287.21..44585.82 rows=119444 width=16) (actual time=119.237..128.240 rows=133070 loops=1)
   Sort Key: "time"
   Sort Method:  quicksort  Memory: 9295kB
   ->  Bitmap Heap Scan on log  (cost=2238.47..34214.52 rows=119444 width=16) (actual time=23.225..88.607 rows=133070 loops=1)
         Recheck Cond: (sensor_id = 6)
         ->  Bitmap Index Scan on log_ix_sensor  (cost=0.00..2208.61 rows=119444 width=0) (actual time=16.407..16.407 rows=133070 loops=1)
               Index Cond: (sensor_id = 6)
 Total runtime: 134.489 ms
(8 rows)

(Since every block in the table contains a selected tuple, a sequential scan should be even more efficient for this query. Surprisingly, this is not the case: it takes almost 800 ms. The only explanation I can think of is that sequential scans have special-case buffer cache handling; large sequential scans do use a small ring buffer rather than the general shared buffer cache.)
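For anyone wanting to reproduce that comparison, the planner can be pushed towards a sequential scan by disabling the index-based plan types for the session:

SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN ANALYZE SELECT time, value FROM hardware.log WHERE sensor_id = 6 ORDER BY time;
RESET enable_indexscan;
RESET enable_bitmapscan;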

Vectorisation schemes

An alternative indexing scheme over the same table layout would have primary key (sensor_id, time), and a supplementary index on time. This results in the same table size, but the relative cost of the queries changes. A longitudinal query will read consecutive tuples for output data points, so has a cost of 106 bytes per point. A latitudinal query for a single point in time will find its data points one per block, so will need to read 8192 bytes per point.
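For reference, that layout would look something like the following (the log2 names are purely illustrative; it is the same columns with the key order swapped):

CREATE TABLE hardware.log2
(
    time TIMESTAMP NOT NULL,
    sensor_id INTEGER NOT NULL,
    value DOUBLE PRECISION NOT NULL,

    CONSTRAINT log2_pk PRIMARY KEY (sensor_id, time)
);

CREATE INDEX log2_ix_time ON hardware.log2 (time);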

There are two overhead costs we can identify and try to reduce.

  • Per-tuple storage overhead: this is the tuple header and the index entries.
  • Non-consecutive reads overhead: this is the cost of reading a block when only part of the data on the block is required.

The per-tuple overhead can be reduced by storing more than one data point in a tuple. The data in log has two dimensions: time and sensor. This forms a two-dimensional array, and hence we have two main options for segmenting it into vectors: vertically and horizontally.

(Diagram: horizontal vectorisation)

(Diagram: vertical vectorisation)

Additionally, we can choose the size of the vectors. The choice can depend on the granularity and extent of the dimension that the vector runs along, as well as on usability considerations. For instance, in the case of horizontal vectorisation, if we assume that there are only a few sensors (say, fewer than 100) and that they have incremental ids starting from 1, we can conveniently assign each to an indexed element of the vector. There is no need for NULL elements in the vector for missing ids, and a single vector can span the entire dimension.

In the case of vertical vectorisation, we have an ever-growing time dimension, with entries appearing every few seconds. We will not be able to fit all the readings for one sensor into a single vector, so each sensor's column must be divided into multiple vectors. A nice way to divide it is to have a vector for each minute of data for each sensor, so that all vectors start at an offset of :00. This gives each vector 60 element slots (one per second), which is a nice tradeoff between minimising the per-row overhead and keeping vectors small enough to be manageable.
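Concretely, a reading's timestamp determines both which vector it falls into and which element it occupies within that vector. For example (the timestamp here is just an illustrative value):

SELECT date_trunc('minute', t) AS first_time,
       EXTRACT(second FROM t)::integer AS element
FROM (VALUES (TIMESTAMP '2011-03-18 10:23:37')) AS v(t);

This gives first_time = 2011-03-18 10:23:00 and element = 37.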

(Diagram: vertical vectorisation with segmentation)

The ideal vectorisation scheme depends on how the data will be used. But we need some idea of the performance effects to give us a heuristic for our decision.

Let’s apply both of these schemes to the sensor database.

Option 1. Horizontal vectorisation

We change the schema to:

CREATE TABLE hardware.hlog
(
    time TIMESTAMP NOT NULL,
    values DOUBLE PRECISION[] NOT NULL,

    CONSTRAINT hlog_pk PRIMARY KEY (time)
);

Each tuple now represents the data for a single point in time, from all the sensors. The reading for sensor i is values[i]. Populating it is easy, assuming that the sensor ids start at 1 and are contiguous. We double-check this:

SELECT DISTINCT n, m1, m2
FROM (SELECT COUNT(DISTINCT sensor_id), MIN(sensor_id), MAX(sensor_id) FROM hardware.log) AS x(n, m1, m2);
 n  | m1 | m2
----+----+----
 36 |  1 | 36
(1 row)

There are 36 distinct sensor ids, with a minimum of 1 and a maximum of 36, so they are contiguous and start at 1.

INSERT INTO hardware.hlog
SELECT l.time, ARRAY(SELECT l2.value FROM hardware.log AS l2 WHERE l2.time = l.time ORDER BY l2.sensor_id)
FROM (SELECT DISTINCT time FROM hardware.log) AS l;

Comparing the table sizes:

SELECT tn, pg_relation_size(tn), pg_total_relation_size(tn)
FROM (VALUES ('hardware.log'), ('hardware.hlog')) AS t(tn);
      tn       | pg_relation_size | pg_total_relation_size
---------------+------------------+------------------------
 hardware.log  |        249716736 |              508469248
 hardware.hlog |         47144960 |               50167808
(2 rows)

The denormalised table is 50 MB, compared to 500 MB for the original. It lacks an index on sensor_id, but one would be pointless since every tuple contains data for every sensor. The lack of indexability on that column should be more than compensated for by the fact that the table is one tenth the size: it is quicker to scan, and is more likely to fit in cache.

Let’s benchmark a couple of queries:

EXPLAIN ANALYZE SELECT time, values[6] FROM hardware.hlog ORDER BY time;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using hlog_pk on hlog  (cost=0.00..9226.31 rows=133070 width=318) (actual time=0.019..58.669 rows=133070 loops=1)
 Total runtime: 64.972 ms
(2 rows)

This compares to 134.489 ms for a longitudinal query on the normalised table, so it is about twice as efficient.

EXPLAIN ANALYZE SELECT time, sensor_id, values[sensor_id]
FROM hardware.hlog, generate_series(1,36) AS gs(sensor_id)
WHERE time >= '2011-03-18' AND time < '2011-03-19' ORDER BY time;
                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..69863.87 rows=3480000 width=322) (actual time=0.055..33.385 rows=100908 loops=1)
   ->  Index Scan using hlog_pk on hlog  (cost=0.00..263.86 rows=3480 width=318) (actual time=0.037..1.014 rows=2803 loops=1)
         Index Cond: (("time" >= '2011-03-18 00:00:00'::timestamp without time zone) AND ("time" < '2011-03-19 00:00:00'::timestamp without time zone))
   ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.002 rows=36 loops=2803)
 Total runtime: 38.268 ms
(5 rows)

This compares to 49.364 ms for the equivalent latitudinal query on the normalised table, so it is slightly more efficient.

Option 2. Vertical vectorisation with segmentation

We change the schema to:

CREATE TABLE hardware.vlog
(
    first_time TIMESTAMP NOT NULL,
    sensor_id INTEGER NOT NULL,
    values DOUBLE PRECISION[] NOT NULL,

    CONSTRAINT vlog_pk PRIMARY KEY (first_time, sensor_id)
);

CREATE INDEX vlog_ix_sensor ON hardware.vlog (sensor_id);

Each tuple now represents a sequence of measurements for the same sensor, beginning at first_time. If we choose to store one minute’s worth per tuple, then the per-row overhead is reduced by almost 12 times. There is some additional overhead in using an array: the array has a header and it uses an extra bit per element to indicate NULL elements.
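The array overhead can be sanity-checked with pg_column_size, for instance by comparing an empty array with one that has a full minute of element slots (array_fill is available from 8.4; the exact figures depend on the version):

SELECT pg_column_size(ARRAY[]::double precision[]) AS empty_array,
       pg_column_size(array_fill(0.0::double precision, ARRAY[60])) AS sixty_slots;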

Populating the table is a little trickier, because the time values are less regular than the sensor ids were in option 1. We can write a function to populate each tuple:

CREATE OR REPLACE FUNCTION hardware.copy_to_vlog(first_time TIMESTAMP, sid INTEGER) RETURNS VOID
VOLATILE STRICT LANGUAGE 'plpgsql' AS
$FUNC$
DECLARE
    v DOUBLE PRECISION[];
    r RECORD;
    idx INTEGER;
BEGIN
    FOR r IN
        SELECT time, value
        FROM hardware.log
        WHERE log.sensor_id = sid
        AND log.time >= first_time AND log.time < first_time + INTERVAL '1 minute'
    LOOP
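        -- The seconds offset within the minute becomes the array index;
        -- seconds with no reading are simply left as NULL elements.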
        idx := EXTRACT('second' FROM (r.time - first_time));
        v[idx] := r.value;
    END LOOP;

    INSERT INTO hardware.vlog VALUES (first_time, sid, v);
END;
$FUNC$;

SELECT hardware.copy_to_vlog(ft, sid)
FROM (SELECT DISTINCT date_trunc('minute', time), sensor_id FROM hardware.log) AS s(ft, sid);

SELECT tn, pg_relation_size(tn), pg_total_relation_size(tn)
FROM (VALUES ('hardware.vlog')) AS t(tn);
      tn       | pg_relation_size | pg_total_relation_size
---------------+------------------+------------------------
 hardware.vlog |         70025216 |               92168192
(1 row)

The size of the table, including its indexes, is 92 MB.

EXPLAIN ANALYZE SELECT first_time + interval '1 second' * i AS time, values[i]
FROM hardware.vlog, generate_series(0,59) AS gs(i)
WHERE sensor_id = 6 AND values[i] IS NOT NULL
ORDER BY time;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=2358665.55..2387224.54 rows=11423595 width=134) (actual time=398.988..411.198 rows=128251 loops=1)
   Sort Key: ((vlog.first_time + ('00:00:01'::interval * (gs.i)::double precision)))
   Sort Method:  quicksort  Memory: 7058kB
   ->  Nested Loop  (cost=217.28..238592.30 rows=11423595 width=134) (actual time=3.236..315.113 rows=128251 loops=1)
         Join Filter: (vlog."values"[gs.i] IS NOT NULL)
         ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.016..0.062 rows=60 loops=1)
         ->  Materialize  (cost=217.27..9421.54 rows=11481 width=130) (actual time=0.054..1.256 rows=11363 loops=60)
               ->  Bitmap Heap Scan on vlog  (cost=217.27..9364.13 rows=11481 width=130) (actual time=3.193..14.644 rows=11363 loops=1)
                     Recheck Cond: (sensor_id = 6)
                     ->  Bitmap Index Scan on vlog_ix_sensor  (cost=0.00..214.40 rows=11481 width=0) (actual time=1.845..1.845 rows=11363 loops=1)
                           Index Cond: (sensor_id = 6)
 Total runtime: 418.144 ms
(12 rows)

This is significantly worse than the 134.489 ms for the normalised table! Let’s see how the latitudinal query performs:

EXPLAIN ANALYZE SELECT first_time + interval '1 second' * i AS time, sensor_id, values[i]
FROM hardware.vlog, generate_series(0,59) AS gs(i)
WHERE values[i] IS NOT NULL AND first_time >= '2011-03-18' AND first_time < '2011-03-19'
ORDER BY time;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2312418.94..2339990.39 rows=11028580 width=138) (actual time=286.868..294.153 rows=100908 loops=1)
   Sort Key: ((vlog.first_time + ('00:00:01'::interval * (gs.i)::double precision)))
   Sort Method:  quicksort  Memory: 9144kB
   ->  Nested Loop  (cost=285.93..230759.42 rows=11028580 width=138) (actual time=0.977..234.044 rows=100908 loops=1)
         Join Filter: (vlog."values"[gs.i] IS NOT NULL)
         ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.019..0.055 rows=60 loops=1)
         ->  Materialize  (cost=285.92..9512.78 rows=11084 width=134) (actual time=0.016..0.759 rows=8568 loops=60)
               ->  Bitmap Heap Scan on vlog  (cost=285.92..9457.36 rows=11084 width=134) (actual time=0.922..2.176 rows=8568 loops=1)
                     Recheck Cond: ((first_time >= '2011-03-18 00:00:00'::timestamp without time zone) AND (first_time < '2011-03-19 00:00:00'::timestamp without time zone))
                     ->  Bitmap Index Scan on vlog_pk  (cost=0.00..283.15 rows=11084 width=0) (actual time=0.896..0.896 rows=8568 loops=1)
                           Index Cond: ((first_time >= '2011-03-18 00:00:00'::timestamp without time zone) AND (first_time < '2011-03-19 00:00:00'::timestamp without time zone))
 Total runtime: 299.793 ms
(12 rows)

Well, that’s not great either. Option 1 definitely seems like a better choice for this set of data.

I have used option 2 successfully before, though. That data was per-day, and I grouped one month's worth into each vector. Also, instead of a regular and small set of ids on the horizontal axis, I had an irregular and large (about 500k) set of keys. Time values were consistent for each key, but did not span the entire dimension. The cost of empty entries in each array was reduced, and this meant the per-item overhead of the generator was also reduced (currently an average of about 5 generated numbers per item).

The cost of denormalisation is that queries on the denormalised tables are harder. Ideally, we can query a normalised view of the data instead, thus hiding the details of how the underlying data has been arranged. In the follow-up post I explain re-normalisation of the data using views.
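As a taste of that, a minimal sketch of such a view over the horizontally vectorised table from option 1 (not necessarily the exact definition used there) might be:

CREATE VIEW hardware.log_view AS
SELECT time, sensor_id, values[sensor_id] AS value
FROM hardware.hlog, generate_series(1, 36) AS gs(sensor_id)
WHERE values[sensor_id] IS NOT NULL;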
