Enumerating regular languages

How the countability of the rational numbers implies an enumeration for regular languages.

Someone posted in r/programming a link to regex-genex, which is a Haskell library for generating all sentences from a regular expression. It’s an interesting computational problem, which I encountered when I worked on a similar program.

A while ago I wrote a Python program for drawing state diagrams from regular expressions. It would parse the expression into a structure, apply all those amazing RE -> GFA -> NFA -> DFA transformations on it, then pipe a description of the result to GraphViz. Unfortunately I’ve lost the source but I fondly remember working on a method to generate the strings matching an expression. Seeing this on Reddit brought to mind the trickiness of doing this properly, and I thought I’d try to recreate some of the solution here. (These regular expressions are the pure ones I was taught in computer science, with simple concatenation, grouping and + and * operations. I use literals such a and b to represent arbitrary subexpressions rather than concrete symbols.)

There are three operations in regular expression construction:

  • Alternation. L(a+b) is the union of L(a) and L(b). (For some purposes languages behave more like bags than sets, so if a word appears in both L(a) and L(b) it may appear twice in the union)
  • Concatenation. L(ab) is the set of all possible ways of concatenating a word in L(a) with one from L(b); it’s the Cartesian product of those two underlying languages.
  • Kleene star. L(a*) is the set of any words that are any finite concatenation of words in L(a). If L(a) is not empty, then L(a*) is an infinite set.

So, some regular languages are infinite. Therefore, enumeration of the words in them has to be done lazily if it is done at all. In a lazy implementation, every element of the set should be returned “eventually”. Even then you have to ensure that eventually means within a finite number of steps for any given element. That’s not always easy!

In the case of alternation, L(a+b) can be generated for finite underlying languages easily enough:

def generate_alt(re1, re2):
    for w in generate(re1):
        yield w
    for w in generate(re2):
        yield w

But if L(a) is infinite, nothing in L(b) will ever be generated. It’s easy to fix in this case, by alternately picking elements from each:

def generate_alt(re1, re2):
    for w1,w2 in izip_longest(generate(re1), generate(re2)):
        if w1 is not None:
            yield w1
        if w2 is not None:
            yield w2

(It’s a bit messy because we need to handle the cases of two finite sequences, finite and infinite sequences, and two infinite sequences.)

Or take the alternation ab. The language for this, L(ab) is the cartesian product of L(a) and L(b). When we’re talking about finite languages, this can be generated simply by a couple of nested loops over L(a) and L(b):

def generate_concat(re1, re2):
    for w1 in generate(re1):
        for w2 in generate(re2):
            yield '%s%s' % (w1, w2)

What about L(a*b*) ? The problem is that the outer loop’s first iteration will never end, because the inner loop will continue forever. The generator will continue to spit out valid words in L(ab), but only the ones starting with the first word in L(a). The others will have to wait until those ones are exhausted, which, given there’s an infinite supply of them, puts the time of their arrival somewhat at odds with the notion of “eventually”. We need to iterative over both underlying languages in a way that covers all pairs of elements.

The two underlying languages are countably infinite sets; so finding L(a*b*) is akin to a proof-by-construction that their product is also countable. It’s essentially the same as proving that the rational numbers are countable, by finding a mapping from them to the natural numbers. This can be done by drawing a one-dimensional path through the two-dimensional matrix of rational numbers, as done by Cantor’s pairing function:

Cantor's pairing function, mapping pairs of natural numbers to natural numbers

We want a similar function from pairs of regular expressions to some countable set; if we assume the two sets are countable, then there is a mapping from each regexp to a natural number, and we can use the same function to map the pair of these to a natural number. Finally, that number determines the generation order of that pair of elements:

def generate_concat(re1, re2):
    q = []
    for wq in generate(re1):
       # Every outer iteration, a word is taken from re1
       q.append(wq)
       # Then we iterate over that in reverse, at the same time as we take elements from re2
       # Iteration over generate(re2) will be limited to the length of q
       for w1,w2 in izip(reversed(q), generate(re2)):
            yield '%s%s' % (w1, w2)

By construction, this implements lazy, eventual generation of elements in regular expressions containing alternation and concatenation. It also handles infinite underlying languages such as those defined by the Kleene star; the problem is that we haven’t defined the generator for that case.

If L(a) is finite, L(a*) can be generated using the recursive relation: L(a*) = L(e + aa*) (where e is the empty string).

def generate_star(re1):
    yield ''
    for w1 in generate(re1):
        for w2 in generate_star(re1):
            yield '%s%s' % (w1, w2)

For infinite L(a), we can combine this with the solution for concatenation over infinite languages:

def generate_star(re1):
    yield ''
    q = []
    for wq in generate(re1):
       q.append(wq)
       for w1,w2 in izip(reversed(q), generate_star(re1)):
            yield '%s%s' % (w1, w2)

When we generated concatenated regexps, we drew a zigzagging diagonal path over the pairs of words from each underlying language. What is the shape of the path over a Kleene star? It’s not three-dimensional; it has an indefinite number of dimensions! Each level of recursion in the above function adds a dimension, and each dimension corresponds to an additional enumeration of L(a).

The need to invoke countability theorems in enumerating regular expressions is surprisingly involved. And I don’t remember using it when I wrote my first implementation! So either that implementation was buggy (in the sense of not generating all words “eventually”), or there’s a simple trick I’m missing! Or maybe both are true…

Update. Posted to Reddit but fumbled the intro paragraph.  665 visits and 0 comments!

Advertisements
This entry was posted in Programming and tagged , . Bookmark the permalink.

8 Responses to Enumerating regular languages

  1. Paddy3118 says:

    HI, In the last generate_star definition you izip(reveresed … (re2)):

    Should that have been re1 and not re2?

    – I enjoyed your post :-)

    • ejrh says:

      And that’s why copy-and-paste programming should be avoided! Fixed. (I suppose I could have simply called generate_concat instead of duplicating most of it.)

      Thanks for spotting the error.

  2. Pingback: Twitted by 2ndZeta

  3. David says:

    Is there a bound on the amount of the memory that will be required by generate_concat and generate_star? It seems in the limit, q would have had all of one language appended to it.

    Thanks for the interesting article and techniques in any case.

    • ejrh says:

      Hi David, thanks for reading it and posing an interesting question. One which I don’t feel entirely competent to answer, but here goes…

      It’s not easy to analyse space complexity with these recursive functions. In the case of generate_star, I think we have:

      For an empty language L, constant memory will be used (there is only one word in L*, namely the empty word).

      For a non-empty language L, L* will be infinite, and although it will *eventually* generate all words, the space required for each subsequent word will increase indefinitely. But we can still discuss, for instance, the space required to emit a particular word of a given language; this will differ depending on the internal structure of that language.

      In the simpler case of generate_concat, Cantor’s diagonalisation diagram provides a clue: the memory cost of generating word N depends on the maximal extent along the x-axis of the path to point N in that diagram; approximately sqrt(N). (Not counting the memory cost of child generators; that complicates things for infinite sublanguages.)

      A good way to test this for small cases might be to add some instrumentation to the Python code, to keep track of memory. Or we could just count the number of objects to get a rough idea of memory growth for each word:

      import gc   
      for w in generate(re):
          print w, len(gc.get_objects())
  4. gergo says:

    In the good old spirit of “my programming language has had this feature built in for 40 years”, here’s a fair way to enumerate all the words of a language (which need not be regular) in Prolog:

    ?- length(Expression, N), phrase(my_grammar, Expression).

    The idea here is that the call to length/2 will enumerate all integers N in the order 0, 1, 2, …, bind Expression to a list of that length, and phrase/2 will then enumerate all words of my_grammar of that length. When that’s done, the next larger N is tried.

    I guess as similar implementation would be possible in Python by threading a counter through all the generators, and there would be no need to invoke countability arguments.

  5. Matt says:

    In your generate_concat function, you used r2 instead of re2

  6. patrikja says:

    It would be interesting to try this out using “Feat”: https://hackage.haskell.org/package/testing-feat
    (Functional Enumeration of Algebraic Types)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s