How the countability of the rational numbers implies an enumeration for regular languages.
Someone posted in r/programming a link to regex-genex, which is a Haskell library for generating all sentences from a regular expression. It’s an interesting computational problem, which I encountered when I worked on a similar program.
A while ago I wrote a Python program for drawing state diagrams from regular expressions. It would parse the expression into a structure, apply all those amazing RE -> GFA -> NFA -> DFA transformations on it, then pipe a description of the result to GraphViz. Unfortunately I’ve lost the source but I fondly remember working on a method to generate the strings matching an expression. Seeing this on Reddit brought to mind the trickiness of doing this properly, and I thought I’d try to recreate some of the solution here. (These regular expressions are the pure ones I was taught in computer science, with simple concatenation, grouping and + and * operations. I use literals such a and b to represent arbitrary subexpressions rather than concrete symbols.)
There are three operations in regular expression construction:
- Alternation. L(a+b) is the union of L(a) and L(b). (For some purposes languages behave more like bags than sets, so if a word appears in both L(a) and L(b) it may appear twice in the union)
- Concatenation. L(ab) is the set of all possible ways of concatenating a word in L(a) with one from L(b); it’s the Cartesian product of those two underlying languages.
- Kleene star. L(a*) is the set of any words that are any finite concatenation of words in L(a). If L(a) has at least one non-empty word, then L(a*) is an infinite set (thanks to Konstantin for spotting my earlier mistake).
So, some regular languages are infinite. Therefore, enumeration of the words in them has to be done lazily if it is done at all. In a lazy implementation, every element of the set should be returned “eventually”. Even then you have to ensure that eventually means within a finite number of steps for any given element. That’s not always easy!
In the case of alternation, L(a+b) can be generated for finite underlying languages easily enough:
def generate_alt(re1, re2): for w in generate(re1): yield w for w in generate(re2): yield w
But if L(a) is infinite, nothing in L(b) will ever be generated. It’s easy to fix in this case, by alternately picking elements from each:
def generate_alt(re1, re2): for w1,w2 in izip_longest(generate(re1), generate(re2)): if w1 is not None: yield w1 if w2 is not None: yield w2
(It’s a bit messy because we need to handle the cases of two finite sequences, finite and infinite sequences, and two infinite sequences.)
Or take the alternation ab. The language for this, L(ab) is the cartesian product of L(a) and L(b). When we’re talking about finite languages, this can be generated simply by a couple of nested loops over L(a) and L(b):
def generate_concat(re1, re2): for w1 in generate(re1): for w2 in generate(re2): yield '%s%s' % (w1, w2)
What about L(a*b*) ? The problem is that the outer loop’s first iteration will never end, because the inner loop will continue forever. The generator will continue to spit out valid words in L(ab), but only the ones starting with the first word in L(a). The others will have to wait until those ones are exhausted, which, given there’s an infinite supply of them, puts the time of their arrival somewhat at odds with the notion of “eventually”. We need to iterative over both underlying languages in a way that covers all pairs of elements.
The two underlying languages are countably infinite sets; so finding L(a*b*) is akin to a proof-by-construction that their product is also countable. It’s essentially the same as proving that the rational numbers are countable, by finding a mapping from them to the natural numbers. This can be done by drawing a one-dimensional path through the two-dimensional matrix of rational numbers, as done by Cantor’s pairing function:
We want a similar function from pairs of regular expressions to some countable set; if we assume the two sets are countable, then there is a mapping from each regexp to a natural number, and we can use the same function to map the pair of these to a natural number. Finally, that number determines the generation order of that pair of elements:
def generate_concat(re1, re2): q =  for wq in generate(re1): # Every outer iteration, a word is taken from re1 q.append(wq) # Then we iterate over that in reverse, at the same time as we take elements from re2 # Iteration over generate(re2) will be limited to the length of q for w1,w2 in izip(reversed(q), generate(re2)): yield '%s%s' % (w1, w2)
By construction, this implements lazy, eventual generation of elements in regular expressions containing alternation and concatenation. It also handles infinite underlying languages such as those defined by the Kleene star; the problem is that we haven’t defined the generator for that case.
If L(a) is finite, L(a*) can be generated using the recursive relation: L(a*) = L(e + aa*) (where e is the empty string).
def generate_star(re1): yield '' for w1 in generate(re1): for w2 in generate_star(re1): yield '%s%s' % (w1, w2)
For infinite L(a), we can combine this with the solution for concatenation over infinite languages:
def generate_star(re1): yield '' q =  for wq in generate(re1): q.append(wq) for w1,w2 in izip(reversed(q), generate_star(re1)): yield '%s%s' % (w1, w2)
When we generated concatenated regexps, we drew a zigzagging diagonal path over the pairs of words from each underlying language. What is the shape of the path over a Kleene star? It’s not three-dimensional; it has an indefinite number of dimensions! Each level of recursion in the above function adds a dimension, and each dimension corresponds to an additional enumeration of L(a).
The need to invoke countability theorems in enumerating regular expressions is surprisingly involved. And I don’t remember using it when I wrote my first implementation! So either that implementation was buggy (in the sense of not generating all words “eventually”), or there’s a simple trick I’m missing! Or maybe both are true…
Update. Posted to Reddit but fumbled the intro paragraph. 665 visits and 0 comments!