As a guide to program structure, and hopefully an aid to refactoring.
It’s depressing to look at previous creations and realise how defective they are. This situation is something that programmers — perhaps more than any other group — routinely find themselves embarrassed by. The things we make are complicated and there are many ways to make them. Often they make sense only to the people who wrote them; and a few months on, the writer is a different person and no longer able to read their own work!
In my case, I’ve returned to work on a work project dating from 2008. It’s a Python program that loads fixed-width text files, extracts a few columns, and generates XML containing that data. But it also does lots of other things:
- Gathering and merging config from the command line, a file, and a remote DB
- Downloading input from FTP servers
- Mapping some data using remote DB tables
- Putting output on queues
- Automatically retrying certain steps according to the config
- Sending status emails
The new work is to support a new type of input; the only real changes are in the input parser. But I can hardly test it without setting up two remote databases, and FTP server, and an MQ queue. The problem is the program has a large “Processor” class that is responsible for doing all the high level work; some lower-level parts have been put in modules, but all the work of comprehending the config and locating all these external dependencies, then acquiring the input, iterating over it, mapping each data item, collecting the results, then sending them out is implemented in this monstrous God class.
The thing needs drastic refactoring.
The immediate problem is that there’s no dependency injection. The code that coordinates processing the input goes and finds its own DBs, queues and servers. Instead, there should be a simple default main function that finds these dependencies, and passes them into the processor. That way, my test can have a similar function that uses different data sources (say, a test input file, and a test output directory, and a test mapping table rather than the FTP server, MQ queues, and DBs).
But I haven’t worked on it for 3 years. I need to understand the program before I can figure out how to modularise that processer class. As a dynamic language, Python does not support much automatic refactoring. I need to manually track down every object before I can move it somewhere else.
Searching for “refactoring python”, I found that others have had the same idea about refactoring using the call graph is a guide. They’ve written a simple program to analyse a Python program for the function called in it.
It’s pretty cool. Unfortunately it doesn’t handle classes.
I’ve written a similar program. It analyses a set of Python source files, finding all the “callables” (modules, classes, and functions), and all references to them. All objects with the same name are treated as the same item — because it’s very hard to detect when they aren’t. These are compiled into a directed graph containing edges for “defines” and “uses”. An object X defines Y if a definition of Y appears inside X, for example, classes and functions in a module, and methods within a class. An object X uses Y if the code for X includes a reference to Y, and Y is a callable. This could be literal function call, or it could be a simple reference to the callable, for instance passing it as a callback.
For instance, take this Python code.
# In a file called X.py def Z(n): return W(n)+1 class X: def Y(n): return Z(n)
The resulting diagram is:
The module X and the class X have been combined into the same node (since in any reference to X, it would be hard to tell which was being used). X defines three callables: itself, the method Y, and the function Z (self-definition and self-use are loops in the graph and can be omitted with a small code change). The method Y calls Z. Z also calls W, but there’s no definition of W anywhere, so it doesn’t appear on the graph.
Let’s try running it on it’s own source:
Ok, so the universe didn’t collapse (which is always a danger when you fool around with self-referential systems :p). It’s not a very complicated program, so the result is quite readable. That’s largely because it uses an external module to do most of the work.
The program uses Python’s compiler module to parse the input, and to walk over the abstract syntax trees produced by the parser. This contrasts with Prashanth Ellina’s approach, which uses a concrete syntax tree produced by the parser module. The compiler code is higher-level and, in my opinion, more appropriate; but on the other hand, the module is deprecated! (Which is a pain, because it’s pretty useful. I’m not 100% sure what’s replacing it.)
And here’s the output for my problematic work program:
Not so friendly a diagram. The aspect ratio is a problem, since it makes it harder to print the diagram into a single A3 sheet. (GraphViz includes an option for forcing an aspect ratio, but strangely it corrupts this image by adding sets of spurious multiedges between nodes that already have more than one relationship.)
Objects that use each other form natural clusters. It’s possible that these correspond to sensible modules in the program. This diagram shows a couple of clusters: the sink-related objects in the bottom-right, and some parsing-related objects in the bottom-left. But mostly the program is a mess of tentacles spreading out from the processor.
I have a couple of ideas that could make these relationships more obvious. The first is to colour-code the nodes according to which module they belong to. The presence of the same coloured nodes all over the graph would be a sign that a module is relatively intrusive in other modules. The other idea is to weight the edges by number of uses: this could help graphviz to pull strongly related objects closer together, making this clusters more obvious.
Still, I’m not precisely sure what the equivalent of a nice program would look like. But I’m guessing it would be much less messy. I guess the next steps are:
- Refactor the program (using conventional wisdom; there is plenty of this on the C2 wiki).
- Generate its new call graph.
- Compare against this one.
The results of that experiment could give us some insight into what a good program looks like.