SQL laziness

A small, but cathartic, rant on what’s irritating me today.

I’ve been working on an SQL statement. It’s a join over several tables, with some additional filter conditions. Here’s what it looks like, approximately and obfuscatedly:

SELECT DISTINCT context_vertex_name, context_form_value AS port, ...
FROM context_info i, context c, context_vertex_form vf, context_form f, ...
WHERE i.context_id = c.context_id
AND i.context_name = "FOO"
AND c.context_vertex_name = "BAR"
AND vf.context_vertex_id = c.context_vertex_id
AND vf.context_form_id = f.context_form_id
AND ...
GROUP BY ...

There are several annoyances in that. The first is that it uses the non-standard double quote delimiter for strings. I often see this in the Sybase-dominated work environment. Double quote is supposed to delimit identifiers, not strings! It’s useful when you want to use certain characters in a name, for example SELECT EXTRACT(year, NOW() - date_of_birth) AS "Age in Years", ..., or when some helpful DBA has given you table and column names with certain characters in them that you need to safely reference. It also controls the case-folding policy in SQL (when the DBA has helpfully used mixed case identifiers).

Next is an old one that is unfortunately allowed by the standard. The “AS” keywords are omitted in FROM-clause. The table names and their aliases are separated only by a single space character, which is only slightly distinguishable from the underscores in tables names. Even syntax highlighting won’t help here, because both tokens are of the same type in the grammar and have the same format. It’s not ambiguous, but it’s a pain to read especially in a big query. Whose idea was it to make AS optional? (It’s not optional for column aliases in the SELECT-clause; the standards committee must have been firing on all cylinders the day they wrote that part.) It’s rather uncommon to see constructions of the form “name name” in formal languages; the only exception I can think of is lambda calculus and other functional languages (such as Haskell) in which a function call is the application of two terms, e.g. “f x” or “first 5 primes” (where currying allows us to treat all functions as taking one argument).

Finally, and worst of all, the joins are all implicit! That means, there is a simple list of tables, then additional criteria in the WHERE-clause to filter their cartesian product. There’s not much difference performance-wise between implicit and explicit joins. The optimiser will normally decompose the ON-clauses of explicit joins into the WHERE-clause, and then re-extract useful join criteria from the WHERE-clause when planning how to scan the tables. But there is useful semantic information in distinguishing between join criteria and filter criteria — useful to a human reader who’s trying to understand the query. When you have dozens of criteria, dividing them into those that preserve relational meaningfulness, and those that merely control the subset of results returned, really helps to both comprehend what the intent of the query is, and to alter it to achieve that. Having those criteria mixed in together in one giant WHERE-clause does not.

I blame these persistent habits largely on the DBMS we use. It’s so old and nonstandard that people will use whatever works to get the job done. And it’s so ubiquitous in the environment — in cahoots with its barely-evolved offspring, MS SQL Server; don’t get me started on the evils of T-SQL! — that people don’t realise there is a large, established world of (more-or-less-)standards-compliant, well-factored, SQL practice out there.

This entry was posted in Programming, Rants and tagged , . Bookmark the permalink.

Leave a comment