Formats

More and more thought is being given to the way we store data. How the bits and pieces inside a file are arranged is called the file format, or just format. The format used gains importance as digital products increasingly replace their physical counterparts and are expected to stay usable for a long time.

Data Value

Now that digital cameras are common, people realize what businesses have known for a long time: data is very valuable. Photographs have always been used to document the past, be it public history or the history of one's family. This property of photographs is transferred more and more to their digital offspring. Nowadays people still print their digital pictures, but that will soon disappear. So they will want to pass on their pictures to their children and grandchildren.

For that "passing on" to work the pictures need to be stored in a format that is documented, interchangeable and free from license restrictions. And not only that, the archive medium you use should have such a format as well, or one might suffer the same fate as windows backups. Users of Microsoft's backup software on windows 95/98/ME discovered that NT/XP was not recognizing their backups any longer. And Microsoft says that this is by design and that you should install Windows 95/98 in order to access your backups. Hmm. Can you imagine getting Windows 98 on a PC five years from now? Yes? What about ten? Not likely.

Office Data

Businesses have of course known this for a long time. And besides self-interest, there are legal obligations for the storage of company documents in almost every country of the world. So most companies fulfill these requirements by printing documents, as you cannot expect Microsoft Office documents to remain readable long enough.

"But that is so twenty century!" as the saying goes. A way out of this finally arrives with OpenDocument as a free (from legal obligation) and open and interchangeable format. The state of Massachusetts is the first one driving this in the public sector. Understandably, since government needs to keep records much longer than business.

Text Data

A volunteer project for long-term archiving is Project Gutenberg. The project was started in 1971 (yes, that was 35 years ago at the time of this writing) and there are few digital things of that age. (OK, Unix and the Bourne shell.) Project Gutenberg requires, whenever possible, that submitted data be in text format, or what they call "plain vanilla ASCII". The history page has a good explanation of their reasons and how it has worked for them.

Another example comes from software engineering itself: the format used for writing software. It is plain text. Everywhere. The text might be in different character encodings like ASCII, ISO-8859-1 or UTF-8, but text it is nevertheless. One of the reasons for sticking to it is the easy interchange of program text between different computer systems. Another good reason is that a whole army of useful tools has been developed (and continues to be developed) to manipulate text files.

Human vs. Machine Readable

There is a continuous shift happening in the usage of data when you look at the ratio of machine vs. human "reading" of data. The most famous example of a machine reader is Google. Think about it. Google's computers are busy reading the internet (and also books). Continuously.

This is a dramatic change from the way of working that Office applications were designed for. Office data is meant to be printed (on paper, or on a wall during presentations), just so that humans can read it. No software other than Office itself is expected to read Office data!

What a change! So, we want data formats that both humans and computers can read and that have a chance of still being readable several (hundred) years from now. Which brings us to...

XML

XML is a markup language that brings structure into text files beyond mere lines and paragraphs. If you know HTML and its < and > thingies, XML is its bigger brother.
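To make that concrete, here is a tiny, made-up example (the tag names are invented for illustration and carry no predefined meaning): a plain text note gains structure that a program can pick apart.

    <?xml version="1.0" encoding="UTF-8"?>
    <note>
      <to>Alice</to>
      <from>Bob</from>
      <body>Remember the backups.</body>
    </note>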

HTML is a way to tell a piece of software (a browser) how to display a text file (the HTML file itself). How to interpret all the brackets is written down in a specification, which is in turn a document intended to be printed and read by humans, not by computers.

The limitation of HTML is that it talks only about the meaning of tagged text in the context of a document. Don't get me wrong: HTML is a huge success, and all ongoing efforts to further improve it are great. However, if one wants to talk about things in a context other than documents, titles and paragraphs, HTML is not the standard to use. If you want to talk about vector graphics, mathematics or ways to build a Java application, you cannot express that in HTML in a way a machine can interpret.

That is where XML kicks in, as XML gives people a way to assign any meaning they want to tags. People can make up their own tags. XML does that by defining what a tag looks like and how it can appear inside a text, sorry, XML file. But XML does not define any meanings for tags (it just defines the syntax, not any semantics). So people can come up with new tag vocabularies for all sorts of applications. Google Earth is such an example: its import format, KML, is one of these XML applications. It defines tags such as "longitude" and "latitude" with the obvious meanings.
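Here is a minimal sketch of what such a KML file can look like (the place name and the coordinates are made up; only the tag names come from the KML vocabulary):

    <?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2">
      <Placemark>
        <name>Some place</name>
        <LookAt>
          <longitude>13.4</longitude>
          <latitude>52.5</latitude>
          <range>5000</range>
        </LookAt>
        <Point>
          <coordinates>13.4,52.5,0</coordinates>
        </Point>
      </Placemark>
    </kml>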

[Figure: a typical XML application]

But the specifications of how to act on the XML dialects were still written in documents intended for print. So for each specification, people had to train the machine (i.e. program the computer) on what to do with the different tags. This is necessary for many applications, but not for all.

XSLT

Some really, really smart people came up with the idea of XML files which talk about other XML files, especially about how to transform a source XML file into something else. So they taught some software how to apply the first kind of XML file to the second to produce, you might have guessed it, another XML file (or even an HTML or plain text file). So it is possible to write machine-readable specifications, in XML, on how to transform other XML files.

Time for a practical example. This website is a set of XML files. There is one XML file for the header, one for the footer, one for the menu, and several more, one for each page you are reading. And there is yet another XML file which specifies how to transform the set of input XML files into a bunch of HTML files. The trick is that one can easily change the specification for the generated HTML and the computer applies that change to the whole site. Saves a lot of work.
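To give a flavour of what such a transformation file looks like, here is a minimal sketch. The source format (a page element with a title and a body) is made up for illustration and is not the actual format behind this site; the xsl: elements are standard XSLT.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <!-- turn <page><title>...</title><body>...</body></page> into an HTML page -->
      <xsl:template match="/page">
        <html>
          <head><title><xsl:value-of select="title"/></title></head>
          <body>
            <h1><xsl:value-of select="title"/></h1>
            <p><xsl:value-of select="body"/></p>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>

Change the template here, run an XSLT processor (xsltproc, for example) over all the source files, and every generated page picks up the change.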

XHTML

Update (2006-04-26): Coincidence strikes again. There is a quite recent paper by the W3C which describes how to embed RDF in XHTML. The method is, it seems to me, to tag existing HTML so that machines can extract meaning from it. Have a look at how meta properties are defined and how they can embrace existing text in the document.
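A made-up snippet in the spirit of that paper (the Dublin Core properties are just one possible vocabulary; the author name and date are invented): the visible text stays exactly as it is, while the property attributes let a machine extract the author and the date.

    <p xmlns:dc="http://purl.org/dc/elements/1.1/">
      This article was written by
      <span property="dc:creator">Some Author</span>
      on <span property="dc:date">2006-04-26</span>.
    </p>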

While the markup mechanisms themselves can be regarded as elegant, the HTML document at the end of the paper does not look like something a human would want to write. But who says they have to? I think it more likely that one would apply an XSL transformation to generate human-readable HTML output which tags the semantics using this RDF mechanism at the same time! Interesting.