I’ve made some improvements to the program discussed last month in Call graphs in Python. It’s had a significant rewrite in program analysis, paying more attention to how names are used. I’ve also experimented a little more with rendering the call graphs as diagrams.
The program pyan.py now creates distinct nodes for each object encountered, e.g. Message.send and Socket.send are no longer considered the same thing. In that particular case, the two send methods could not even be said to be implementations of any meaningful abstract method. The first sends the subject (
this), the second sends an object (a parameter) using the subject.
The program tries to infer which is being used in each particular instance; if it can’t do that, it will add use-edges to every possible match. For example, in this program:
x = Message() x.send()
it can infer a reference to Message.send, after making certain reasonable assumptions, such as that Message is a class. But it is possible that Message is a function that just happens to begin with capital letter. This is all very heuristical; so can’t be relied upon as proper static analysis. But it does clean up the call graphs and makes them better approximate the program in most cases.
Analysing program structure using types
The main goal of this call/usage analysis is to identify, for each variable used in a function, which possible objects that variable could be bound to when the function is executed. This is done using a static analysis: the target program is not executed itself, but properties of its code are analysed to determine roughly how it will behave.
One of the major concepts in program analysis is types. Types are the sets of values that expressions (e.g. variables, and the results of operations on variables) can take. Types can be formal, for instance when there are a fixed set of abstract types , e.g. integer, string, and reference to window; and each expression has one of those types. That can also be informal, and implicit in the program behaviour, for instance the type of integers greater than 0, or the set of window references that are non-NULL).
Ultimately all program behaviour in terms of correctness can be defined in terms of informal types. Programs are functions, and functions have types, too– it’s often desirable to prove that the program is in the type of all functions that correct implement a given specification. Formal types are much more limited, and are a course-grained way to specify the types of parts of the program. In static languages, roughly speaking, the formal types of expressions are independent of the program’s input, and in dynamic languages they are less independent.
Thinking about languages I’ve used (and without researching the official terminology) I can identify five approximate levels of type “staticness”:
- Static typing, in which each variable and expression is assigned a distinct type in the syntax of the program. That variable will only be assigned to objects of that type during execution. Types are disjoint — there is no inheritance or interfaces. C is an example of this kind of language. Every variable has a precise type — otherwise the compiler does not know how to compile machine code for storing and manipulating it. (C also casts that can allow an expression of one type to be assigned to a variable of a another; and unions that allow several variables of different types to occupy the same space. But all expressions still have distinct types.) Note that it is possible for types to be inferred in a static language.
- Interface typing, in which a hierarchy of interfaces (or classes) are defined, and each variable is assigned a distinct interface. The variable may be assigned to any object whose type is a subtype of that interface. Subtyping is a partially ordered relation. A language where this occurs is Java; for example …
- Inferential typing, in which the types of individual variables may be unknown, but are assigned type variables. Two variables with the same type variable must be the same type, although that type is known. Variables may also have their types inferred using standard type inference such as (a -> b) being the type of functions from a to b (whatever types a and b may actually be); if we know that x is of type Int, and we know f is of type a -> [a], then can infer that f(x) is of type [Int]. An example of this Haskell.
- Dynamic typing, in which variables are merely bindings to objects, and can be rebound to objects of other types. Types pertain to objects, not variables or expressions. For example, Python has types (built in primitive types, any number of user-defined classes). But an expression in a program could be of one type at one point, and a completely different type at another. Even when the expression is always the same type, it is difficult to determine which type that is without running the program, or conducting a sophisticated analysis.
- Untyped or “stringly-typed” programs. In these all objects are of the same type, which is made to represent all kinds of data. Typing here is an issue for the programmer’s mental model and arguably for the runtime validity of operations when applied to certain values, but the language model does not provide enough conceptual support for automatic analysis. An example would be shell scripting, in which strings are used for most data.
pyan.py analyses the abstract syntax tree of its target in a single pass. It can use assignments to a name on one line as an indication of the type of that name on subsequent lines.
There are a variety of erroneous analyses it does. One earlier example is this:
class IRC: def join(self, channel): pass print ','.join(range(5))
In this case the program will confuse the join method on strings with the join method on the IRC class. This was fixed by adding code to analyse constants. The type of ‘,’ is str, which means join on that line is a reference to str.join, not to a generic join method that could apply to any class.
I’ve continued to look for neat ways to draw the resulting graph.
The two programs I’ve experimented with are GraphViz and yEd. GraphViz is well known and easy to automate. It consists of a set of command line tools, with a common, text-based graph input format. It’s major drawback is that it only has a few layout modes (though they can be heavily tweaked). It works great for simple or highly hierarchical graphs.
yEd is newer and is a very refined GUI program. Unfortunately it is not open source, but the yEd editor is free. (I understand that graph layout libraries are available for separate (i.e. automated) use, for a fee. The layout libraries that come with yEd are obfuscated to prevent external use — the same company also sells the Java obfuscator!) I have restricted myself to laying out the graph using the options in the yEd GUI.
Updated call graph for my problematic program. Laid out by yEd (not as compactly as I’d like though). The objects from each module are colourised (manually, since the pyan.py export format is TGF which means Trivial Graph Format — no room for colour information in it).
Here is how yEd has laid out the program:
The objects are grouped into modules, and the layout algorithm attempts to keep the groups separate.
I’ve coloured each module’s objects manually. pyan.py exports to yEd using the Trivial Graph Format, which consists only of a list of vertices and a list of edges, each with optional names — there’s no room for formatting information. If I write an exporter for a more advanced format such as yEd’s GraphML, it should be possible for pyan.py to calculate colours for each node and to pass them through.