Two indexing tricks with PostgreSQL

I would like to share two indexing tricks I stumbled upon earlier this year. They are useful when changing the schema for large tables.

Context

Our database schema seemed rather inefficient:

We were using 8-byte bigint types for small integer keys, and 8-byte timestamp types for simple dates.
The columns were defined in a random order, with the resulting padding required to make the columns align (e.g. if the first two columns are a 4-byte integer and an 8-byte timestamp, the timestamp will need to be padded with 4 bytes of empty space so that it aligns to an 8-byte boundary). I was alerted to this padding cost by RhodiumToad on freenode’s #postgresql channel.
We used monthly partitions which made maintenance tasks, such as vacuuming and reindexing, effectively impossible.
We had indexes on both the primary key (which had 7 columns in it), and all columns used in foreign keys to help with referential integrity checks (e.g. when source_id on table forecast is a foreign key to table source, a delete or update on source needs to scan table forecast using source_id to check whether any of its rows are affected).
XML blobs were stored as plain text; at only 700 bytes each they were not large enough to be automatically compressed.

There is an obvious hazard of “premature optimisation” when addressing any of these. But when your data is 7 TB, your average partition size is 500 GB with another 500 GB of indexes, and it’s all running on incredibly slow NFS storage, it can be worth trying to do something.

(The database had earlier been migrated “as is” from Oracle, with the understanding that when it was working we could work on optimising it. I still wonder if we’d have found the migration less painful in terms of time and storage if we’d performed at least some optimisation of the schema beforehand.)

So that is what I eventually did — with a certain amount of proof-of-concept work in a separate, much, much smaller test database. This post is about the first point from the list above: inefficient column types and how to change them without rewriting all existing data.

Basic set up

Assuming the existing data is too unwieldy to convert to the new schema, we can create new tables with the new schema, and put a view in front of both the old and new tables to combine them into something the application can continue to work with.

In my case, the existing table looks like:

CREATE TABLE forecast (
    source_id BIGINT NOT NULL,
    validity_date TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    ...
    PRIMARY KEY (source_id, validity_date, ...)
);

There are actually 13 columns on the table of various types, but these two columns are sufficient to illustrate the indexing tricks.

We will replace the forecast table with two tables and a view:

forecast_0 – formerly knows as forecast; the existing data in the old schema
forecast_1 – new data from this date on, in the new schema
forecast – a new view to be used in place of the old forecast table

ALTER TABLE forecast RENAME TO forecast_0;
CREATE TABLE forecast_1 (
    source_id INTEGER NOT NULL,
    validity_date DATE NOT NULL,
    ...,
    PRIMARY KEY (source_id, validity_date, ...)
);
CREATE VIEW forecast AS
SELECT source_id::integer, validity_date::date, ...
FROM forecast_0
UNION ALL
SELECT source_id, validity_date, ...
FROM forecast_1;
CREATE TRIGGER ...; /* various triggers for insert
    update,delete on forecast if needed by the app */

The new forecast view has the same column types as forecast_1 table. (In my case, the application will work with the new types without change.)

Querying

There is a now a huge problem. A typical query such as

SELECT * FROM forecast WHERE source_id = 101
AND validity_date = '2016-06-02';

is rewritten by the planner as

SELECT * FROM forecast_0 WHERE source_id::integer = 101
AND validity_date::date = '2016-06-02'
UNION ALL
SELECT * FROM forecast_1 WHERE source_id = 101
AND validity_date = '2016-06-02';

The criteria for the first subquery are on compound expressions such as source_id::integer and validity_date::date, which are not the simple column names source_id and `validity_date1 for which we have existing indexes. None of the indexes can be used and we get full table scans!

EXPLAIN SELECT * FROM forecast WHERE source_id = 42 and validity_date = '2016-06-02';

                                          QUERY PLAN
------------------------------------------------------------------------------------------------
 Append  (cost=0.00..2319239.47 rows=1215 width=423)
   ->  Seq Scan on forecast_0  (cost=0.00..2319233.42 rows=1214 width=423)
         Filter: (((source_id)::integer = 42) AND ((validity_date)::date = '2016-06-02'::date))
   ->  Index Scan using forecast_0_pkey on forecast_1  (cost=0.29..6.06 rows=1 width=82)
         Index Cond: ((source_id = 42) AND (validity_date = '2016-06-02'::date))
(5 rows)

This plan is from a small sample of one day’s worth of data and takes 15 minutes to run; in the full database it would take four days, which is problematic to say the least.

Trick 1: Create indexes for the new column types

If the expression being queried is source_id::integer, then an expression index on (source_id::integer), as another for (validity_date::date), could very well be the answer. Or, since both columns tend to be used in queries, a single index on (source_id::integer, validity_date::date).

CREATE INDEX ON forecast_0 ((source_id::integer), (validity_date::date));

This results in plan, using the new index, that brings the query time back to less than a second.

EXPLAIN SELECT * FROM forecast WHERE source_id = 42 and validity_date = '2016-06-02';

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.56..14.65 rows=2 width=252)
   ->  Index Scan using forecast_0_source_id_validity_date_idx on forecast_0  (cost=0.56..8.60 rows=1 width=423)
         Index Cond: (((source_id)::integer = 42) AND ((validity_date)::date = '2016-06-02'::date))
   ->  Index Scan using forecast_1_pkey on forecast_1  (cost=0.29..6.06 rows=1 width=82)
         Index Cond: ((source_id = 42) AND (validity_date = '2016-06-02'::date))

But there is a cost, in terms of time and storage space, of creating those indexes on the existing data. Instead, we can define the view forecast in terms of the old column types, and create the expression indexes on the new, empty tables:

CREATE VIEW forecast AS
SELECT source_id, validity_date, ... FROM forecast_0
UNION ALL
SELECT source_id::bigint, validity_date::date, ... FROM forecast_1;
CREATE INDEX ON forecast_1 ((source_id::bigint));
CREATE INDEX ON forecast_1 ((validity_dated::timestamp without time zone));

Of course, bigint indexes are bigger than integer indexes, so at some point down the road (perhaps when the old data has fallen out of the retention window and been discarded) it might be worth changing the view and the indexes to the new, more efficient types once and for all.

To make this easier when the time comes, the new tables can have indexes on the simple column names as well. When the view is rewritten in terms of the new types, the expression indexes created above will no longer be used and can be dropped.

Trick 2: Create mapping tables

What if we could tell PostgreSQL’s query planner that the index on bigint column “source_id” can be used for queries by source_id::integer?

We can, if we provide a way for it to map each integer source_id to its equivalent bigint value, and the same for the date/timestamp validity_date values. PostgreSQL is a relational database, and a lot of its smartness pertains to tables and the indexes on them. Let’s create some tables to contain the mappings for these two columns:

CREATE TABLE source_lookup
(
    source_id INTEGER,
    legacy_source_id BIGINT,
    PRIMARY KEY (source_id, legacy_source_id)
);
CREATE TABLE validity_lookup
(
    validity_date DATE,
    legacy_validity_date TIMESTAMP WITHOUT TIME ZONE,
    PRIMARY KEY (validity_date, legacy_validity_date)
);

We need to populate the mapping tables with every value that might be needed. Fortunately there are not that many of them. The source_ids can be found in the source table, and the validity_dates can be generated over a reasonable range covering all existing data:

INSERT INTO source_lookup SELECT source_id::integer, source_id from source;
INSERT INTO validity_lookup
SELECT d::date, d FROM generate_series('2015-01-01'::timestamp without time zone, '2017-01-01'::timestamp without time zone, '1 day') as s(d);

Now, we can recreate the forecast view using the mapping tables:

CREATE VIEW forecast AS
SELECT source_id, validity_date, ... FROM forecast_0
    JOIN source_lookup ON legacy_source_id = forecast_0.source_id
    JOIN validity_lookup ON legacy_validity_date = forecast_0.validity_date
UNION ALL
SELECT source_id, validity_date, ... FROM forecast_1;

How will PostgreSQL plan queries on this view? Note that source_id and validity_date in the view do not come from forecast_base_0; the come from the mapping tables, where they are the first columns of the primary keys. The secondary columns in those tables are then used to find the rows of forecast_base_0 needed to fill out the rest of the view.

EXPLAIN SELECT * FROM forecast WHERE source_id = 42 and validity_date = '20160602';

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Append  (cost=2706.11..229455.53 rows=244 width=422)
   ->  Nested Loop  (cost=2706.11..229447.05 rows=243 width=423)
         ->  Index Only Scan using validity_lookup_pkey on validity_lookup vl  (cost=0.28..8.29 rows=1 width=12)
               Index Cond: (validity_date = '2016-06-02'::date)
         ->  Nested Loop  (cost=2705.83..228951.36 rows=48557 width=439)
               ->  Index Only Scan using source_lookup_pkey on source_lookup sl  (cost=0.28..8.29 rows=1 width=6)
                     Index Cond: (source_id = 42)
               ->  Bitmap Heap Scan on forecast_0  (cost=2705.56..228251.37 rows=69170 width=443)
                     Recheck Cond: ((source_id = sl.legacy_source_id) AND (validity_date = vl.legacy_validity_date))
                     ->  Bitmap Index Scan on forecast_0_pkey  (cost=0.00..2688.26 rows=69170 width=0)
                           Index Cond: ((source_id = sl.legacy_source_id) AND (validity_date = vl.legacy_validity_date))
   ->  Index Scan using forecast_1_pkey on forecast_1  (cost=0.29..6.06 rows=1 width=82)
         Index Cond: ((source_id = 42) AND (validity_date = '2016-06-02'::date))

Again, this results in a more efficient query that is able to use the existing indexes. Furthermore, it avoids the expensive step of creating expression indexes.