Text is one of the most underappreciated technologies that computers give us. This post is about why it’s so awesome, and why we should consider using it more — even in places where the trend seems to be the opposite.
By text, I mean, well, text. I mean sequences of readable characters, with no invisible formatting: every byte in the sequence is rendered as a character (with appropriate renderings of white space). I call it a technology — which it is: written language is one of the most important inventions (or discoveries) in history. I refer also to the computer technologies that make manipulation of text so versatile.
Text is somewhat frowned upon: too much of it is intimidating, and it is less immediate than graphics. Graphical interfaces are more “intuitive” (which, for want of a better interpretation, I’ve always read as meaning easy to use for easy things when you’ve used something similar; and arbitrary, clumsy, and inefficient for anything serious). So there are always moves to diminish its presence — which may make some things easier, but can make other things harder, or impossible.
There are three roles of text that I’m interested in:
- As a programming language — contrasted with the use of graphical or table-based languages.
- As a data format — contrasted with a binary format. In a similar way, we could also contrast a relatively light-weight text format like JSON with something heavyweight like full-blown XML.
- As a writing and editing format — usually contrasted with WYSIWYG. In the other direction, we can also contrast a simple markup language with plain text.
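To make the JSON-versus-XML contrast concrete, here is a sketch using Python’s standard json and xml.etree modules, with a made-up record; the point is that both are text, readable without special software, and differ mainly in ceremony:

```python
import json
import xml.etree.ElementTree as ET

# The same invented record in a lightweight text format (JSON)
# and a heavier one (XML).
record = {"name": "Ada", "id": "7"}

as_json = json.dumps(record)
as_xml = ET.tostring(ET.Element("record", record)).decode()

print(as_json)  # {"name": "Ada", "id": "7"}
print(as_xml)   # <record name="Ada" id="7" />
```

Either form can be inspected and edited in any text editor, which is exactly the property this post is about.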
When displayed, text can include formatting. This can be simple syntax highlighting. It can also be simple ornamentation where the symbols in the text imply typographical markup (such as asterisks indicating bold, etc.). In these cases it still counts as text, because each byte is rendered and editable.
Text may also be a mere representation of the intended final form, as in the case of TeX or Wiki markup. I don’t want to suggest that the input text is the superior format for reading — only that it is superior for editing, and adequately readable to make it the default format for editing.
Advantages and disadvantages
- Portability: support for ANSI text is virtually ubiquitous, and Unicode support increasingly so. A text file from a proprietary system may look ugly, but it is at least readable in the absence of the software.
- Text is easier to control. In a WYSIWYG editor, control codes are hidden from the user. It can be hard to predict what will happen to the document when additional formatting is applied. And it can be easy to get a document into a state where formatting will always misbehave! (If this happens, sometimes the only recourse is to copy the text into a plain text editor, then paste it into a new document and reapply all the formatting.) Apart from this, some formatting constructs are difficult to represent as toolbar options. Nested tables might be one: they’ve been introduced in word processors recently — having been borrowed from HTML, a markup language — and IMO can be much harder to maintain than the original flat tables were.
- It makes version control possible. Sure, you can keep copies of older versions of files. But what if someone asks, what changed between these two versions? Comparing two documents by eye is time consuming and error-prone. It’s so much easier when you have tools to find and highlight the exact change, who made it, when and why. Word processors can often do this. What they can’t help with is merging: two people change their private copies of a document, and later the changes are to be combined. Whose do you pick as the base, and whose do you have to pore through, looking for changes to copy to the other — and changes that must be ignored? Programmers know they are supposed to use version control for source code. It’s quite incongruous that they often don’t realise the same thing works for documentation!
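The kind of change-finding described above is trivial once the document is text. A sketch using Python’s standard difflib, with invented file contents:

```python
import difflib

# Two versions of a (made-up) plain-text document, as lists of lines.
old = ["Budget: 10\n", "Owner: alice\n"]
new = ["Budget: 12\n", "Owner: alice\n"]

# Produce a unified diff: exactly which lines changed, and how.
diff = list(difflib.unified_diff(old, new, fromfile="v1", tofile="v2"))
print("".join(diff))
```

Real version control tools (diff, git) do the same thing at scale, plus the who/when/why bookkeeping; none of it works on an opaque binary document format.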
Of course, text has a few disadvantages, or weaknesses.
- The lack of internationalisation, and the complexity of attempts to address that. Unicode support is improving, and there’s no excuse not to use UTF-8 by default: for plain ASCII content, UTF-8 is byte-for-byte identical to ASCII, so English-speakers lose nothing. The problem is that there are alternatives to UTF-8. Some of these are legacy (such as the code pages on Windows). Worse than that, there are other kinds of Unicode. Windows operating system data is often UTF-16. It looks bizarre when viewed as text (even using Windows tools), and it can feel unnecessary to convert back and forth between this and a more widespread format like ANSI or UTF-8.
- Disagreements over end-of-line markers. Some systems, Windows among them, use two characters (carriage return followed by line feed) to mark the end of a line. I’ve heard that it arose from the time when teletype machines had separate commands for seeking to the start of the line, and feeding the page by one line. So it’s a historical issue, but one that’s still around to bite us from time to time. It’s a manageable, but nevertheless tedious, gotcha when sharing data between Windows and Unix.
- The cost of parsing (for file formats). There are cases where a binary format is simple and efficient, where the equivalent human-readable text format is not. One case where virtually everyone agrees on binary formats is image data — no one in their right mind transmits a rectangular array of pixel values as XML! In other cases it’s not so obvious what to do. The costs of text for data storage are the runtime cost of parsing and unparsing it, the storage cost of using a more verbose format, and the usability cost of not being able to seek to a precise spot in the data. For small files these are less significant, but there is still the cost of implementing the parser.
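On the encoding point: a quick Python sketch shows why UTF-8 is painless for ASCII-range text while UTF-16 looks bizarre in a byte-oriented view:

```python
text = "café"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")

# The ASCII characters are unchanged; only the é becomes two bytes.
print(utf8)   # b'caf\xc3\xa9'
# Every character takes two bytes, after a two-byte byte-order mark.
print(utf16)  # b'\xff\xfec\x00a\x00f\x00\xe9\x00'
```

Those interleaved NUL bytes in the UTF-16 form are exactly what makes it look broken in tools that expect one byte per character.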
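The end-of-line mismatch, at least, is easy to normalise in code. A minimal sketch, which deliberately ignores lone carriage returns (as found on classic Mac OS):

```python
def to_unix(data: bytes) -> bytes:
    # Replace Windows CRLF line endings with Unix LF.
    # Simplification: a lone b'\r' is left alone.
    return data.replace(b"\r\n", b"\n")

windows_file = b"first line\r\nsecond line\r\n"
print(to_unix(windows_file))  # b'first line\nsecond line\n'
```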
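The parsing and seeking costs can be illustrated with a made-up example: a thousand integers stored as fixed-width binary versus as JSON text. The binary form is smaller and allows seeking straight to the nth value; the text form must be parsed in full:

```python
import json
import struct

values = list(range(1000))

# Binary: 1000 four-byte little-endian integers.
binary = struct.pack("<1000i", *values)
# Text: the same values as a JSON array.
text = json.dumps(values)

print(len(binary), len(text))  # the text form is larger

# Random access: the 500th value sits at a known byte offset.
print(struct.unpack_from("<i", binary, 4 * 500)[0])  # 500
```

To find the 500th value in the JSON form, by contrast, you must parse from the start, because the values have variable width.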
Some of these are inherent, and a reasonable trade-off must be made when deciding when and how to use text. Others are a result of text’s early implementations in proprietary and limited systems, and are gradually being overcome as the world standardises.
Text is the established medium for programming. This hasn’t always been the case, obviously! Originally, programming was done by twiddling switches. This gave way to punched cards: a card with 80 columns could store up to 80 characters (one character encoded by the pattern of holes in each column) — enough for a single line of a program. Consequently, some early languages used a fixed-width format, in which characters in certain columns had certain meanings. The layout of modern assembly language still retains something of this: label in the first column, instruction in the second, etc.
In modern languages a program is typically a string that belongs to a particular formal grammar. None of it is executable unless all of it is. Some other languages have been looser with input, such as early line-based BASIC variants, or shell scripts. The lines are read one at a time and executed independently against the current program state. There are advantages to both approaches. The former encourages more correctness earlier. The latter makes it easy to get some parts of the program running early, even if the rest of it has syntax errors.
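The looser, line-at-a-time approach can be sketched in a few lines of Python (a toy, using exec purely for illustration): each line runs independently against shared state, and a syntax error in one line does not stop the rest.

```python
state = {}
program = [
    "x = 1",
    "y = x + 1",
    "z = +* broken",   # syntax error: this line is skipped
    "w = y * 2",
]

for line in program:
    try:
        # Each line is compiled and run on its own, against shared state.
        exec(line, {}, state)
    except SyntaxError:
        print("skipping bad line:", line)

print(state)  # {'x': 1, 'y': 2, 'w': 4}
```

A compiler for a whole-grammar language would instead reject the entire program until the broken line was fixed, which is the trade-off described above.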
There are attempts to replace this with graphical or table-based languages. Some examples:
- The “RCX Code” programming environment for Lego Mindstorms. There are predefined commands (such as emit a sound, or power the motors). A program is made by dragging blocks of these commands around on a 2D surface.
- A certain “ETL” database tool in which data flow is represented by a succession of tasks, with arrows drawn between them.
The premise is that visual things are more user friendly and easier for humans to reason about. This may be true for absolute beginners but is grossly misguided for anything beyond that. There is a reason that text is ubiquitous in civilisation: it’s compact, can be copied and pasted, it’s expressive and supports recursively enumerable languages, can be abstract or specific, and it is indexable and seekable.
Similarly, mathematics is done with symbols and not with piles of pebbles. This can be intimidating during learning, but more than pays off in abstraction and efficiency once the basics are understood. It is infinitely extendible. Programming is another formal language (arguably, it is maths) and the same rationale applies to it.
Another fad is using data-driven behaviour, in the form of rules or tables. This is great when the behaviour is simple, or at least highly regular; when the behaviour does not require a high level of computational complexity. Non-programmers can then maintain this part of the program, implementing the behaviour without worrying about control structures or types. Unfortunately, there is always a temptation to make the rule or table system just a tiny bit more advanced. At some point, more advanced abstraction or recursion is needed. You end up with a program that is not maintainable by non-programmers (because it is full of abstraction and recursion), and not maintainable by programmers either (because it’s full of kludges that aren’t required in a proper general-purpose programming language).
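For the simple, regular case, a rule table really is pleasant. A minimal sketch, with an invented discount table (amounts in cents, to keep the arithmetic exact):

```python
# Hypothetical rule table: (minimum order total in cents, percent off).
# Ordered from highest threshold to lowest; a non-programmer could edit it.
RULES = [
    (10000, 10),
    (5000, 5),
    (0, 0),
]

def discount_cents(total_cents: int) -> int:
    # Apply the first rule whose threshold the order meets.
    for threshold, pct in RULES:
        if total_cents >= threshold:
            return total_cents * pct // 100
    return 0

print(discount_cents(12000))  # 1200 (10% off a 120.00 order)
print(discount_cents(6000))   # 300  (5% off a 60.00 order)
```

The trouble starts when someone asks for rules that depend on other rules, or on arbitrary conditions; at that point the table stops being a table.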
For data and configuration storage
Configuration is often kept in text files, especially on Unix (see The Art of Unix Programming by Eric Raymond). This has all the advantages of simple editability and version control. The latter is essential for managing changes and really pays off when your configuration files start to become programs in their own right.
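A sketch of what this looks like in practice, using Python’s standard configparser and a made-up INI fragment — plain, diffable text that a human can read and fix without any special tooling:

```python
import configparser

# A hypothetical configuration file, as it would appear on disk.
text = """[server]
host = example.com
port = 8080
"""

cfg = configparser.ConfigParser()
cfg.read_string(text)

print(cfg["server"]["host"])         # example.com
print(cfg.getint("server", "port"))  # 8080
```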
This contrasts with the Windows Registry, which is a database of name/value pairs. This has its own advantages, such as concurrent updates and a degree of type safety. It is generally sensible to follow the practices of the platform your program runs on, but I personally prefer it when program settings are kept as text. (I was especially disappointed when Mozilla started to store user data in SQLite databases — it makes it harder to examine or fix, and I’ve not noticed any real performance advantages from it.)
Word processors turn computers into glorified typewriters. Instead of adding new ways to organise ideas and help the writer concentrate on their words, most innovations have been presentational — in the graphical output, and in the WYSIWYG editing mode of it. They are excellent tools to help people make pretty documents. For 80% of writing (and 99% of writing at work), prettiness is irrelevant. Furthermore, 80% of writers do not want to do graphical or typographical design — so giving them a tool that specialises in that when all they want to do is write a document is distracting. Smart users of word processors write text first, then apply styles. Unfortunately the programs are full of sparkly features that encourage ad-hoc formatting, and this is how the majority of users write documents.
The good news is that the web gives us a chance to remedy some of this. It’s because of its limitations: web interfaces need to be portable, so can’t rely on large OS-specific programs. And they need to be light-weight enough to be transmitted as needed over the internet. People still want pretty web pages — but they’ve learned that it’s possible to get 99% of the prettiness with simple markup languages. And these languages are simple enough for 99% of people to figure them out.
One problem is that there are many standards for markup, including several Wiki variants, Markdown, reStructuredText, etc. All of them are easy to learn, but there’s still the issue of moving documents between them. A wiki markup standard could help with this. It could also get as baroque and over-engineered as most other standards, thereby undoing the benefits of simple markup languages.
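The simplicity of these markup languages is real: a single substitution already handles a toy subset. A sketch, not any particular wiki’s actual rules:

```python
import re

def render(text: str) -> str:
    # Toy subset of wiki-style markup: *word* becomes bold HTML.
    return re.sub(r"\*([^*]+)\*", r"<b>\1</b>", text)

print(render("this is *important* text"))  # this is <b>important</b> text
```

Crucially, the input remains readable text either way — the markup is ornamentation, not hidden control codes.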