## Data flow analysis and regular expressions

I have some unfinished business with regular expressions.  It arises in the rather peculiar application of static analysis of programs — the regular expressions represent the possible paths through a program.  (Perhaps it’s not all that peculiar to compiler experts, but it’s not the usual “parse title and links out of this web page” stuff, either.)  Bear with me as I explain the background.

In 2006, I took David Pearce’s excellent COMP 463 class, Advanced Software Engineering.  I was a little wary of the “software engineering” part, but luckily it was really the school’s “compiler design” class.  (Its modern successor appears to be SWEN 430 — which again is under “Software Engineering”, argh!)

Some schools have compiler courses in which students get to build actual working compilers.  Victoria doesn’t really do that; parsing (the “front end”) was covered in formal computer science (COMP 202, in my day), and a course like this covered static analysis (the “middle end”).  Building a compiler remains one of my most serious programming ambitions.  I have made some progress towards it, which I hope to eventually get round to posting about.

COMP 463 covered several things of interest to compiler writing.  A little bit of low-level program structure, type theory, the 2-SAT problem…  Most of its subject matter was static analysis.  The big course project was “integer range analysis”.  This is the problem of analysing the possible values of integer variables at various points in the program, and asserting that they satisfy constraints specified by the programmer.

## Data flow analysis

Data flow analysis is a technique for analysing a program (or similar structure) by propagating information about the state of the program at each point in its execution.  It doesn’t require running the program, but it affectively simulates it by nondeterministically applying its instructions to a set of states.  There is a large variety of analyses, here are some of the ones I’m familiar with:

• Live variables: at each point, which variables have values that may be needed at future points?
• Definite assignment: does every variable have an assigned value at the points it is used, regardless of the path to that point?
• Null pointers: are all dereferences of a pointer guaranteed to occur after assigning an object to that pointer?
• Virtual method calls: can we turn a virtual method call into a direct call by recognising the precise class in use at that point?
• Dead code: are any branches (from if statements, etc.) discardable, because the conditions leading to them will never occur?
• Integer ranges: is the possible range of an integer value within certain (programmer-specified constraints) at every point?

The last example is the one that lead to my question, but the idea is applicable to dataflow analysis in general.

The data flow analysis maintains a state structure for every edge in the graph (or alternately, an in-state and several out-states for every vertex).  This structure records facts about the state of the program at that point in execution. The possible states form a lattice, with two special states, top and bottom.  Each state represents information about the program at that point.  Top is best-case information; bottom is worst-case — nothing is known.

There is a function join that combines two states conservatively: if a fact is known in state A, and in state B, then it is also known in A join B (otherwise it is not known).  Additionally there is a flow function that computes new states using the contents of each vertex: out-state = flow(in-state).  For a vertex with several out-edges (for instance, a branch), the flow function will compute appropriately for each one (for a branch, one out-state is the state assuming the branch is taken, the other is assuming it is not).

Many analyses cover several variables.  In the case where the analysis of each variable is independent, the state at each edge is a map from variable names to variable states.  The join function returns a map from variable names to the join of their corresponding variable states from each input.

The analysis then propagates the facts forward through the program graph. Since a program may contain loops, propagation can be cyclic. Generally the propagation is applied through a vertex each time the state on its in-edge is updated. The analysis is complete when a steady state is reached in the graph — that is, when no states have been updated.

Well that’s my understanding of DFA.  The internet has better explanations, including a few interesting papers like this and several good introductions in the form of course notes.

## Integer range analysis

In the case of integer flow analysis, the structure is an integer range state, which maps variable names to sets of possible values.  Ideally, the sets of possible values form a lattice equivalent to the power set of the value range.  For an integer range of 2^32, there are 2^(2^32) elements in this lattice, which is quite a lot: you need 2^32 bits to represent a state in it.

So in practice, we represented states approximately as modular ranges.  A state is a tuple of (m, n, s), and it indicates that the value is known to be an element of {m, m+s, m+s*2, …, n}.  Arithmetic operations on states are implemented.  For example, (m, n, s) + (p, q, s) = (m+p, n+q, s).  Any operation on states whose result cannot be represented precisely can safely emit an imprecise state: one which indicates the value is an element of a superset of the true set of possible values.

Here are two examples of code, with some trivial results of analysis shown in comments. In the first, if the value is known on the in-state to the vertex, then it can be known on the out-state.

```/* Assuming x is 4 on the in-edge to this vertex... */
x = x + 3;
/* x is 7 on the out-edge of this vertex. */```

In this example, if the value is used in a simple if-expression, its possible range is known at the start of each branch.

```if (x > 5) {
/* x > 5 on this branch. */
} else {
/* x <= 4 on this branch. */
}```

If these two snippets were in immediate succession, an additional result could be obtained. In the else branch, both x = 7 and x <= 4 are true; this is impossible.  A dead code analysis code use this information to safely remove that entire branch.  (Similarly, it could recognise that x = 7 means the test for x > 5 is unnecessary, and simply jump straight to the contents of the first branch.)

This analysis is far from perfect. Whenever it cannot perfectly represent the result of a flow or join operation, it will throw out information. It must always be conservative if it cannot be accurate. Which means, in complicated programs, it cannot prove everything that needs to be proved, which can result in spurious warnings or missed optimisations.

```if (y = 1) {
x = 1;
} else {
x = 2;
}
/* Here we join the states x = 1 and x = 2. */
/* Our state structure has no representation for x in {1, 2}, so we must use a conservative approximation: x is unknown. */

if (x <= 3) {
...
} else {
/* x is unknown, so we have to assume this branch could be taken. */
error("internal program error (x is invalid); please submit bug report!");
}```

In this case, we’ve lost information that was known in every possible path we’ve come from. In COMP 463, this was one of the cases are project would be judged against.

## Parallel analysis

In my implementation, the analysis is extended as follows.  The state structure used is a “parallel state”, which maintains a mapping from paths (sequences of vertices) to integer range states.

When a state is updated for vertex, the update is applied to each item in that map, and the label of that vertex is appended to each path.

```1. if (y = 1) {
2. x = 1;
/* State is now:  [1,2]={x=1}. */
} else {
3. x = 2;
/* State is now:  [1,3]={x=2}. */
}
/* Here we join the states.  The result is:  [1,2]={x=1}, [1,3]={x=2}. */

4. if (x <= 3) {
5. ...
} else {
/* On the else-branch, the state is:  [1,2,4]={x=1,x>3}, [1,3,4]={x=2,x>3}. */
/* All paths in the state have now an impossible state, so the state as a whole is impossible at this point.
The error below has been proved impossible, and this branch can be discarded. */
6. error("internal program error (x is invalid); please submit bug report!");
}```

Another problem that this helps to solve is with correlated if statements. Consider a program that occasionally prints parentheses around an expression (based on the value of a variable x):

```b = 0;
if (x = 1) {
print("(");
b = b + 1;
/* State here is b in {0,1}. */
}
/* b=0 join b in {0,1}; resulting state is b in {0,1}.*/

print("some stuff");

if (x = 1) {
print(")");
b = b - 1;
/* State here is b in {-1,0}. */
}
/* State here is b in {-1,0,1} -- as far as the DFA can tell at this point, either or both of the if branches may have triggered. */

if (b != 0) {
/* Unfortunately this branch cannot be discarded, because b could be -1 or 1! */
error("Parentheses not balanced!");
}```

Imagine a DFA analysis that wanted to prove that the error would never occur. Even using accurate integer range states (recording the exact set of all possible values), the analysis cannot tell at the error check that b = 0 and there is no error. With parallel states:

```1. b = 0;
2. if (x = 1) {
3. print("(");
4. b = b + 1;
/* State is:  [1,2,3,4]={b=1,x=1}. */
}
/* Joined state is [1,2]={b=0,x=0}, [1,2,3,4]={b=1,x=1}. */

5. print("some stuff");
/* State is [1,2,5]={b=0,x=0}, [1,2,3,4,5]={b=1,x=1}. */

6. if (x = 1) {
/* Here the state loses all paths that contradict x=1 -- they are impossible. */
/* State is [1,2,3,4,5,6]={b=1,x=1}. */
7. print(")");
8. b = b - 1;
/* State is:  [1,2,3,4,5,6,7,8]={b=0,x=1}. */
}
/* Joined state is [1,2,5,6]={b=0,x=0}, [1,2,3,4,5,6,7,8]={b=0,x=1}. */

9. if (b != 0) {
/* In both paths in the state, b=0, and hence this error is impossible. */
10. error("Parentheses not balanced!");
}```

Information is lost when states are joined; the parallel state effectively defers joins. Instead of recalling that the in-state came from path X, or path Y, the state will record which state came from each.

One big challenge in DFA is loops. The algorithm itself will correctly handle them. The problem is the computational cost: for integer range analysis in a bounded loop, the analysis will iterate for the range of the loop (for an unbounded one it may iterate until it has reduced the state to bottom). With parallel states we have an additional problem: storage of the state. Take this simple loop:

```1. x = 0;
2. while (x < 100) {
3. if (x = 100) {
4. error();
}
5. x = x + 1;
/* State here after 5 iterations is:
[1,2,3,5]={x=1},
[1,2,3,5,2,3,5]={x=2}
[1,2,3,5,2,3,5,2,3,5]={x=3}
[1,2,3,5,2,3,5,2,3,5,2,3,5]={x=4}
[1,2,3,5,2,3,5,2,3,5,2,3,5,2,3,5]={x=5} */
}
/* State here is: [1,2,3,5,...,2,3,5,2]={x=100} */```

It will eventually exhaust the loop, but in the meantime it will consume storage proportional to the square of the loop range.

In my implementation, this was controlled using a max_path_length parameter. Any paths in the parallel state which grew beyond this limit have their prefix removed, and joined are if that results in duplicates. With max_path_length=10, after 5 iterations the state above would be:

```    5. x = x + 1;
/* State here after 5 iterations is:
[1,2,3,5]={x=1},
[1,2,3,5,2,3,5]={x=2}
[1,2,3,5,2,3,5,2,3,5]={x=3}
[5,2,3,5,2,3,5,2,3,5]={x could be anything}.```

## Regular expression compression

There are obvious patterns in the paths. The question arises: instead of simply keeping only the last 10 items, can we find another way of lossily compressing the path?

One of the lecturers suggested using regular expressions to represent sets of paths.  In the previous example, the paths consisted of many repetitions of a cycle 2,3,5. We have a path for 1 repetition, 2 repetitions, etc. If having 100 distinct paths is too many, we could discard the number of repetitions, and just represent the path as 1(235)*.

## My question

The trick is, given two paths represented as regular expressions X and Y, how do we join them?  This is where I’m stuck.

We need a regular expression Z such that L(Z) is a superset of L(X) union L(Y). (L(X) is the language of the regular expression X; in our application the language is an approximation of the set of paths through the program up to that point.)

Also, we want a regular expression that is simple enough: if it’s too complicated we need to simplify it further.  During runtime they would likely be manipulated as finite state machines, so a suitable limit might be a maximum number of states.

There are three manipulations we need to do:

1. Test if two FSMs are the same.  Should be easy (depending on how their stored).
2. Append a point to a FSM.  This can be done by drawing an edge from all accepting states to a new state (which is also accepting), and labelling those edges with the label of the point.
3. Find the union of two finite state machines and, if necessary, reduce its number of states (by making it accept additional paths).  There is an algorithm for the union.  Simplification is tricky.  In fact I’m not sure a regular expression can be simplified in this way!

That’s about as far as I’ve got.  If anyone knows about applying regular expressions to program paths, I would appreciate a few hints on how to do it.  :-)

This entry was posted in Programming and tagged , , . Bookmark the permalink.