More thoughts on version control

The shadow of Git has lately begun to loom over my programming habits. It has actually become the principal version control system at work, with most active projects migrated to it from ClearCase, Mercurial, CVS, etc. And recent collaborative programming work on my own projects (such as on the Python analysis program) has made me realise that using a distributed source control system from day one can pay off if and when someone wants to fork your work.

I have been an extremely loyal Subversion user for almost ten years — I started using it in November 2002. Not only has it been my main source control system for all my personal projects, documents, and general data, but it’s been a satisfying outlet for some of my own version-control-related endeavours:

  1. A few small patches I’ve contributed to the Subversion project.
  2. One of the first attempts at an SQL storage backend.
  3. A full text repository indexing tool and database.
  4. A web-based repository browser, like ViewVC or Github (lost!).
  5. A general repository rewriting program, for retroactively fixing metadata and file contents (also lost!).
  6. A revision replay script, which has actually been useful for copying changes from The Ur-Quan Masters to Project 6014.

Even now, the Subversion command line and the TortoiseSVN GUI for Windows are the easiest to use, most robust, and best supported version control tools I know of. I believe they were the right choice for Project 6014, even with distributed alternatives available, because all of those alternatives are harder to set up, require much more understanding to use (commit vs push, pull vs fetch vs checkout), and frankly, still don’t have the UI support. These are still important requirements, because not everyone who could benefit from version control is going to be a skilled programmer. For 90% of version control activity, Subversion just works.

Unfortunately, Subversion has the inherent limitation of centralisation. This is actually more of a problem for me as an individual user than it would be for a large corporation. I like to program both on my home computer and my netbook, and without constant network connectivity there is no way to have full source control on each. The Subversion repository would be hosted on one (the desktop computer), with a checked-out working copy on the other (the netbook). Using my netbook away from home, I would be unable to view the history of the files, and unable to check in changes; I would have to wait until I had a network connection. Worst of all, when I made many separate changes to the source during the day, they would all be checked in at once, or have to be carefully managed by hand as temporary files to be checked in individually. That kind of tedious manual work and potential for error undermines the point of having version control.

The centralised Subversion model is still useful for a closed organisation with a good network, and its portability, simple command line, and well-polished GUIs should make it appealing to a corporate environment with users of many skill levels. Unfortunately for Subversion, it scored badly against several alternatives in our evaluation. This was somewhat unfair, as it was incorrectly deemed to fail on several requirements, but there are admittedly other weaknesses: it is somewhat inefficient, and it creates bloated and inconvenient metadata directories throughout the working copy.

And it isn’t distributed, and its merging is very primitive (and prone to error when you try to do something complicated). (But from what I have seen, these are unlikely to matter in our organisation.) So, Git it was. And admittedly Git is the best of the alternatives. It’s fast (in some circumstances; as I type this I’m importing 5000 files into a new repository at a rate of one per second). And it’s distributed (if you need that; and it’s better to have it available if you do). And it’s the way of the future.

Unfortunately, Git is much more complicated. It is more powerful (which excuses some of the complexity), but it also has accidental complexity, and its interface has not been optimised for simple, everyday use the way Subversion’s has. We also use Github at work; I sometimes wonder if we chose it because people didn’t realise Git and Github were different things. It has proved to be a rather flaky web-based repository interface, with a confusing layout and too much focus on flashy features like dynamic page-changing animations. It is good to have a centralised location for our organisation’s source code (hey, it almost sounds like we’re reinventing centralised version control…), and a web-based interface is the best general browsing mechanism for it. But we should remember that Git doesn’t imply Github (or vice versa), and ensure that we can continue to use version control without needing to go via the centralised location.

It is important to note that I’ve used Subversion for almost a decade, while I am still learning Git, having only used it for some PostgreSQL hacking (implementing loose index scans) a couple of years ago (before the fire; I still have the code, but it got messed up when PostgreSQL formally moved their code base to Git, and my clone of an experimental Git repository was no longer compatible with the new official one). And recently I’ve been adopting it at work. So my assessment of its usability needs to be tempered by an acknowledgement of my inexperience with it. It’s obvious, though, that Git will become the most popular version control system from now on, and it will only get more polished and better supported.

What seems to be a suboptimal choice for a corporate environment can have features that are ideal for personal use. The public Github site has free hosting (for open source), features like wikis and issue tracking, and some interesting graphs. My favourite is the punch card:

Tuesday seems to be the best time for fractals!

And as advertised, it’s “social”. One can fork a repository and have that relationship tracked. So it’s great when an open source project you want to hack on is hosted there: you can fork it to get your own copy to work on. There are a few projects I’d like to fork and work on, which will be easier with Git (even those that aren’t hosted on Github). And occasionally someone might want to fork one of mine. I would like to save them having to create an isolated copy; with a fork we can exchange improvements as each side continues development.

Furthermore, it’s often said that “a Github account” is necessary for an impressive portfolio. No one says the same of “a Google Code account”, but that’s the power of marketing for you. Of course, it won’t make a practical difference, but having put all that effort into those programs, I would appreciate some improved visibility for them. You never know what someone else might find useful.

Github has a more personal system, with a page for each user and subpages for their repositories; this will be a change from the Google Code system of monolithic project pages, and gives me a chance to split my projects out into separate repositories.

So I’ve created a new account at Github and, with some trepidation, I’ll be moving my code into it. I’m not sure how it will work out, so I’ll be taking some time to learn how best to use it and evaluate its effect on my project productivity…



Programmer-friendly blogging

Commenting on a blog post about an interesting use of generics in Java, I’ve had the pleasure of seeing my carefully typed code translated from:

public Connection<Closed> close(Connection<Open> c) ....

Connection<Closed> c2 = close(c);

to:

public Connection close(Connection c) ....

Connection c2 = close(c);

Gosh, I never realised type erasure could be performed at the HTML level. Unfortunately this almost completely garbled my comment, in full view of the public. (It can be reconstructed with a bit of common sense, but you have to have the generosity to assume the writer would not intentionally write such spurious junk, which may be too much to ask of the online community :p.)

Writing about programming using standard blogging software is not generally as painful as it once was. WordPress, for instance, provides a sourcecode shortcode, which enables code for many languages to be safely pasted into a post and reliably formatted (but unfortunately not XML-based code, which is subject to being re-escaped every time it is viewed in the editor, culminating in a bloated mess like &amp;lt;data x=&amp;quot;42&amp;quot;/&amp;gt;). So there are many blogs with nicely presented source code samples.
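For anyone who hasn’t seen it, usage looks roughly like this in the post editor (a sketch from memory; the language attribute accepts the usual suspects):

[sourcecode language="python"]
def mandelbrot(c, limit=100):
    # Count iterations until the orbit escapes; None means "probably inside".
    z = 0
    for i in range(limit):
        z = z * z + c
        if abs(z) > 2:
            return i
    return None
[/sourcecode]

The angle brackets, quotes, and operators inside the block survive intact, which is the whole point.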

Commenting, however, remains programmer-unfriendly. Comment boxes often support plain text, with a limited set of HTML elements (for bold, italics, linking). And as witnessed in today’s example, the elements that aren’t supported can be stripped out rather than treated as intentional text input. The other problem with comments is that they can’t be edited, so the author does not know how their unsupported text will be rendered until it is too late to use a workaround.

This addition to my experience as a blog writer and commenter makes me feel that HTML is no longer the state of the art for simple, human- (and programmer-) friendly text markup. Humans want something that:

  1. Supports common typesetting features like bold, italics, lists, and links.
  2. Is easy to type, with few unusual symbols and no need to worry about proper nesting in simple cases.
  3. Is readable when being written as markup (an approximation of its rendered form).

And programmers desire these features too, as well as something that:

  1. Allows source code to be pasted in and rendered attractively, without requiring the code’s various symbols to be escaped lest they trigger some other behaviour on the input system.

Markdown seems to be the way to go. Github supports it, and Gists are increasingly being used as blog posts (though the Github interface is awful). StackOverflow supports it for answers, where it provides a live preview of the rendered result.
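As a quick illustration of why it suits programmers, here is my comment from earlier passed through the python-markdown package (my choice of library for this sketch; I don’t know what any of these sites use internally):

import markdown

comment = """Generics example:

    public Connection<Closed> close(Connection<Open> c) ...

Indented lines are treated as code, so nothing needs escaping."""

print(markdown.markdown(comment))
# The indented block comes out inside <pre><code> with the angle brackets
# escaped to &lt;Closed&gt; and &lt;Open&gt;, rather than being stripped
# as bogus HTML tags.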

As mentioned when I first started blogging here, I think it would be great if WordPress supported Markdown — for posts and comments. It would be great if other platforms, such as the Blogspot site where I attempted to post my comment, followed suit, so that it became the de facto standard for text input on the internet. I do think even normal humans would find themselves using it as a straightforward, efficient, and reliable middle ground between the misfocus of WYSIWYG and the unfriendliness of HTML.

PS. Hmm, my code examples got type-erased when I wrote this post, too…


Schema diagrams for PostgreSQL

I have made some progress towards the longstanding goal of drawing nice diagrams of database schemas. Firstly, I’ve figured out how to use yEd’s Entity Relationship node types as table nodes. These special node types have both a node label and an additional content field. The additional field is normally free text but can also contain HTML data (akin to GraphViz’s support for HTML nodes).

So a database table can be drawn with its name in the label, and a list of fields in the node body. And, with HTML mode, that list can be a table with columns for name, type, PK/FK status, etc.

But yEd has not quite caught up with GraphViz. In GraphViz, cells in an HTML table can be identified with ports, and an edge drawn between nodes can then target specific ports in each, as in the sketch below. yEd only supports a handful of predefined ports per shape (the centre, each side, and each corner). Furthermore, its automatic layout algorithms ignore the port data, and draw edges to the nearest part of a node. Still, those layout algorithms are the great strength of yEd, and trading port flexibility for good layout is a reasonable deal.
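To make the comparison concrete, here is a minimal sketch of GraphViz ports, written from Python to match the rest of my tooling (the table and port names are invented for illustration):

import subprocess

# Each cell of an HTML-like label can declare a PORT; an edge can then
# target that specific cell using the node:port syntax.
dot = """digraph schema {
  node [shape=plaintext];
  users [label=<<TABLE BORDER="1" CELLBORDER="0" CELLSPACING="0">
    <TR><TD PORT="id">id</TD></TR>
    <TR><TD PORT="name">name</TD></TR>
  </TABLE>>];
  posts [label=<<TABLE BORDER="1" CELLBORDER="0" CELLSPACING="0">
    <TR><TD PORT="author">author</TD></TR>
  </TABLE>>];
  posts:author -> users:id;
}
"""
subprocess.run(["dot", "-Tpng", "-o", "ports.png"], input=dot.encode())

The edge attaches to the “author” cell of posts and the “id” cell of users, not merely to the nodes themselves.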

Since entering the HTML contents by hand is time-consuming and prone to error, I’ve written a quick program to do it. The program, pgschemagraph.py, is run against a live PostgreSQL database. It will examine the table and foreign key definitions in the given schemas, and generate a GraphML representation of them.

Here is an example, taken from my “filesys” database:

Schema for the “filesys” database, generated by pgschemagraph.py.

The complete steps to create this diagram are:

  1. pgschemagraph.py -h localhost -d filesys -u edmund -p ***** -s public > filesys-schema.graphml
  2. Open filesys-schema.graphml in yEd.
  3. Run Orthogonal Layout with default settings (Alt-Shift-O, Enter).

As you can see, pgschemagraph.py has found the tables, their fields and types (including standard SQL types, user-defined types like “cube”, and domains like “imageid”), the primary keys (indicated by underlining) and whether each field can be NULL (indicated by italics). It has also discovered the foreign key relations between tables.

Because the program works in terms of abstract database definitions, rather than nicely drawn diagrams, it does not do much layout. When first loaded in yEd, this file will show the tables stacked atop one another. The most pgschemagraph.py can do is guess the height of each node from the number of rows in it, using a heuristic of 22 points per line in the default font; it can’t guess the width, because it does not know how wide each character is.
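The height guess is as simple as it sounds; something like this (the helper name and the extra header row are illustrative, not lifted verbatim from the script):

POINTS_PER_LINE = 22  # heuristic: one row of text in yEd's default font

def guess_node_height(field_count):
    # One line per field, plus one for the table header row.
    return (field_count + 1) * POINTS_PER_LINE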

But it does perform about 80% of the work required to draw a good diagram. A human can easily and quickly do the rest.

At present it uses the standard information_schema catalog rather than the PostgreSQL-specific pg_catalog. The aim was that this would make it easier to port to other systems that support information_schema. The downsides are that information_schema is actually slightly difficult to use (since it’s very generic), and does not provide some useful information such as inheritance and partition relationships. In some of my schemas, tables are partitioned into multiple subtables; it would be nice to have that represented in the diagram, rather than have the subtables floating around unconnected. So a rewrite to use pg_catalog may be in order. (Porting to alternative systems will then require a more system-specific approach.)
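For a flavour of what querying information_schema involves, here is a cut-down sketch of the sort of thing pgschemagraph.py does (using psycopg2; the real program gathers rather more, including primary and foreign keys):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="filesys", user="edmund")
cur = conn.cursor()
# Every column in the requested schema, with its type and nullability,
# in the order the columns appear within their tables.
cur.execute("""
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = %s
    ORDER BY table_name, ordinal_position
""", ("public",))
for table, column, datatype, nullable in cur.fetchall():
    print(table, column, datatype, nullable)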

More information can always be shown on the diagram. One idea is to allow different colours to be used. This could be done by schema, or by grouping of tables.

It would be possible to have the program run yEd’s layout engine directly. Sadly, yEd is proprietary software, and however much I appreciate its help in laying out my diagrams, I’m averse to depending on it directly. Generating standard GraphML output (and potentially .dot output for GraphViz), and running the layout tool myself, is a reasonable compromise.

The program, pgschemagraph.py, is on Github.


Coloured call graphs!

Juha Jeronen has added some features to the Python call-graph generator (pyan) I’ve previously blogged about. With a single command line, I can now get pictures like this:

pyan.py -c backup.py journal.py journalcmd.py links.py --dot | dot -Tpng > backup-use-and-def.png

A bit of colour always shows a program’s structure in a better light. Green for the backup program, orange and blue for high-level and low-level journal manipulation, and yellow for link creation.

As well as automatic colouring by namespace, there are options to control whether use- and define-edges are shown, and whether the nodes in each namespace are grouped together. For instance, instead of drawing define-edges between namespaces and their members as above, we could omit those edges and instead group the members. The “fdp” layout algorithm seems to render best for this graph:

pyan.py -n -g -c backup.py journal.py journalcmd.py links.py --dot | fdp -Tpng > backup-use-and-group.png

Juha’s colour and grouping enhancements are implemented for the GraphViz output. pyan also outputs .tgf files for yEd. But “trivial graph format” is just too trivial to support these options, so I may soon extend the program to write the more advanced .graphml format.
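The GraphML skeleton itself is easy to produce; here is a minimal sketch of what such an extension might emit (yEd’s colours and labels would go in additional <data> elements, omitted here, and the node names are invented):

import xml.etree.ElementTree as ET

graphml = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
graph = ET.SubElement(graphml, "graph", id="G", edgedefault="directed")
# One node per function or namespace, one edge per use/define relation.
ET.SubElement(graph, "node", id="backup.main")
ET.SubElement(graph, "node", id="journal.append")
ET.SubElement(graph, "edge", source="backup.main", target="journal.append")
ET.ElementTree(graphml).write("callgraph.graphml",
                              xml_declaration=True, encoding="utf-8")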


Simple but efficient backups

Backups are perennially on my To-do list. Just having them would be good, but more than that I want to automate their creation and maintenance.

My problem, of course, is that I prefer to make my own system for doing this rather than use someone else’s. This doesn’t always work out. One of my early attempts was coming along great, and I decided to test it — on its own source directory. It was soon apparent that I had transposed the from and to arguments to the copy function, as the backup program’s source files had all been truncated to zero length. Karmic neatness aside, there is a lesson there: always make a good backup before testing a backup program for the first time. Continue reading
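For the record, the mistake amounted to something like this (shutil standing in for whatever copy routine the program actually used):

import shutil

def backup_file(source, destination):
    # Intended: shutil.copy(source, destination)
    # What I wrote instead, arguments transposed:
    shutil.copy(destination, source)  # clobbers the source with the (empty) backup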


New SSD

A couple of Fridays ago, a solid state drive — the 256GB Crucial M4 — on which I’d been keeping my eye became available again at a reasonable price, so within a few minutes I placed an order. It arrived on Monday, swaddled in bubblewrap (a useful material to keep around for future eventualities — which inevitably turn out to be the fun of popping bubbles).

It came in a small box, sealed with a sticker but with no accompanying material beyond the packing foam. This seemed to indicate either that the drive was so simple and idiot-proof that no instructions were needed, or that it would be so complicated that prior experience was necessary and a mere few pages of instructions just weren’t going to suffice. It transpires that the drive was not quite idiot-proof after all, as demonstrated in the following account, which I will punctuate with some of the fundamental lessons I’ve learned in the last two weeks. Continue reading


Curious line-endings in FTP

Whilst hurriedly implementing basic FTP support in a program that’s due in a couple of days, I ran into a strange phenomenon:

  • Retrieving ftp://login:password@server/data.csv, a multiline text file, will return the file intact.
  • Retrieving ftp://login:password@server/data.dat, another multiline text file, won’t: all the data will be on one line.
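A quick way to see what is actually happening is to compare the two FTP transfer modes from Python (a sketch using ftplib, with the placeholder login details from the URLs above):

from ftplib import FTP

ftp = FTP("server")
ftp.login("login", "password")

# Binary (image) mode: bytes arrive exactly as stored on the server,
# so the file's real line terminators are visible.
raw = bytearray()
ftp.retrbinary("RETR data.dat", raw.extend)
print(raw[:100])

# ASCII mode: the server may translate line endings in transit.
lines = []
ftp.retrlines("RETR data.dat", lines.append)
print(len(lines), "lines seen in ASCII mode")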

Continue reading
