One of the biggest differences between well-trained software engineers and self-taught programmers is the way they manage data in programs. When working on programs of non-trivial complexity, the trained professional will spend a great deal of time working out how to organise the information that must be handled. Self-taught programmers tend to give most of their attention to constructing the algorithms.
We all have some intuitive understanding of algorithms: cooking is based around algorithms ("recipes"); everything we learned in school mathematics is about algorithms; even in subjects such as English and foreign languages, we were taught algorithms for spelling, plurals, tenses, word-ending agreements and so on. (Parents of primary school children in the UK will find that teachers now use the word "algorithm" quite a lot as a way of introducing computing concepts to children.) Few of us, however, have any comparable sophistication, either intuitive or taught, in methods of handling information.
Before we go any further, incidentally, we might as well clear up a distinction between the terms data and information. Data consists of the numbers and words stored in computer memory (or even on paper sheets). Information is data that has meaning associated with it. The same number (as stored in a computer file) can be used to designate the length of a journey (in kilometres or miles?), the weight of a bag of flour, or the time (in minutes?) it takes to perform a task. The number is of little use until it is assigned an interpretation.
Computer algorithms do not know anything about the meaning of the data they handle. They will perform the same calculation on our numbers whether they are distances, times, or weights. It is up to the program designer to ensure that the numbers used by the program have been given a consistent interpretation. (A NASA spacecraft sent to Mars was notoriously lost because one part of the guidance software produced a quantity in imperial units that another part interpreted as metric units.)
Programmers often make mistakes when constructing software. The ones that are easiest to find are those where the error is in the specification of an algorithm, because odd behaviour caused by an error in one program line can usually be noted and tracked to its source with systematic searches. An error in interpretation of meaning often produces much more subtle failures and can be extremely difficult to track to the source of the problem because it is not localised.
We handle the design of complicated algorithms by decomposition - breaking down the algorithm into component parts, and breaking those down into simpler parts, until at the lowest level we can see exactly how to code the algorithms. Usually the decomposition is mirrored in the program structure by using subprograms. Hence, the problem of manipulating photographs into the modified forms displayed on this website (as in the Woodland Transformations Gallery, for example) breaks down into the following top-level steps (sketched in code after the list):
- Read photograph from a JPEG disk file.
- Loop until the program is terminated:
  - Handle a control input (generally a key press, perhaps associated with a mouse position).
  - Use the control input to modify the stored parameters that govern the image transformation algorithm.
  - Transform the photographic image.
  - Display the transformed image.
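As a rough illustration, here is a minimal Processing-style skeleton of that decomposition. The subprogram names (handleControlInput, transformImage) and the control scheme are invented for this sketch; they are not taken from the actual gallery programs.

```java
PImage source;            // the photograph read from disk
PImage transformed;       // the output image drawn each frame
float parameter = 1.0f;   // a stored parameter governing the transformation

void setup() {
  size(800, 600);
  source = loadImage("photo.jpg");   // read photograph from a JPEG disk file
  transformed = createImage(source.width, source.height, RGB);
}

// draw() is called repeatedly until the program is terminated - the loop step
void draw() {
  transformImage();          // apply the mapping under the current parameters
  image(transformed, 0, 0);  // display the transformed image
}

void keyPressed() {
  handleControlInput();      // a key press, perhaps combined with the mouse position
}

void handleControlInput() {
  if (key == '+') parameter *= 1.1;  // hypothetical controls that modify the
  if (key == '-') parameter /= 1.1;  // stored transformation parameters
}

void transformImage() {
  // placeholder: the real transformation is mathematically complex (see below)
}
```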
Each of these steps is implemented with a subprogram, but the "Transform" step, which is mathematically complex, is itself broken down into a number of subprograms, including an upper layer that iterates over every pixel in the output image plane and a lower layer that applies a complex mapping algorithm to generate the pixel value at each point.
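Continuing the hypothetical sketch above, the split might look like this. The brightness scaling in mapPixel is a deliberately trivial stand-in for the genuinely complex mapping mathematics; the two-layer structure is the point of the example.

```java
// Upper layer: iterate over every pixel in the output image plane.
void transformImage() {
  source.loadPixels();
  transformed.loadPixels();
  for (int y = 0; y < transformed.height; y++) {
    for (int x = 0; x < transformed.width; x++) {
      transformed.pixels[y * transformed.width + x] = mapPixel(x, y);
    }
  }
  transformed.updatePixels();
}

// Lower layer: generate the pixel value at one point.
// A real version would apply the complex mapping algorithm.
color mapPixel(int x, int y) {
  color c = source.pixels[y * source.width + x];
  return color(min(red(c) * parameter, 255),
               min(green(c) * parameter, 255),
               min(blue(c) * parameter, 255));
}
```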
The professional software engineer is also trained to decompose information into hierarchical structures, and to understand how relationships between different bundles of information can be accurately represented in software. This skill is not intuitive for the vast majority of people, and normally needs a good deal of practice to master. Few self-taught programmers bother to acquire it at the level of software professionals. Many do not even realise that they lack it.
What do we mean by this? Consider the compilation of an electoral register. In the UK voting rights go with residence. So, we need to compile a list of all places of residence - all valid addresses in the area covered by the Register. It is, however, people who have the right to vote, so we also need a list of the people living at each address. To the software engineer, it makes sense to keep the list of people separate from the list of addresses and simply create an association between a person and an address. There are several reasons why this is a good idea. Firstly, people move but houses don't. (In general! Though new houses are built and sometimes old ones get knocked down.) If we simply create a list of individual electors, each with their own address, then almost inevitably we will find that addresses are recorded inconsistently. (A computer does not necessarily know that two addresses differing only in commas, or in the spacing of postcodes, are the same address.) Secondly, if people move away and are removed from the Register, we would also lose all knowledge of the existence of an address until the next residents fill out their registration forms correctly (perhaps!). Thirdly, elections take place over a range of geographical scales, from the most local parish councils through town and county councils up to national elections. An address that is associated with a ward in a local council will thereby also be associated with the county ward that aggregates the smaller-scale electoral regions, and refers up in turn to a parliamentary constituency. It is therefore easy to build lists of electors for any level of election, and just as easy, when electoral boundaries have to change, to move whole groups of addresses from one area to another.
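To make the separation concrete, here is one possible arrangement in Java-style code. It is a sketch, not a definitive design, and all the class and field names are invented for the illustration.

```java
// An address exists independently of whoever happens to live there,
// and refers upward through the hierarchy of electoral areas.
class Address {
  String street, town, postcode;
  Ward localWard;               // the local-council ward containing this address
}

class Ward {
  String name;
  CountyWard countyWard;        // aggregated into a larger county ward
}

class CountyWard {
  String name;
  Constituency constituency;    // which refers up to a parliamentary constituency
}

class Constituency {
  String name;
}

// An elector is a separate record, associated with an address
// rather than carrying its own free-text copy of one.
class Elector {
  String name;
  Address residence;
}
```

Notice what this structure buys us: deleting an Elector who has moved away leaves the Address record intact, and a boundary change means re-pointing the links between wards, not editing every elector.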
This was actually a rather trivial example. Consider the problems faced by governments building databases of people entitled to claim benefits, or who need to pay taxes. If the initial methods of structuring are incorrect then small inconsistencies in the data mean that we cannot reliably and consistently interpret the data as information. The inconsistencies will cause ever-increasing problems, and new layers of software need to be added to stop the inconsistencies spreading and corrupting the whole database. The reason we hear so often that Government IT projects go over budget and sometimes fail completely is that the design problems really are extremely complex and difficult to get right - and unfortunately it is not always the case that the people employed to commission these large systems, or even those who build them, are adequately qualified and experienced. (Really good IT experts can normally earn much more outside government!)
Most programs that are worth writing in the first place normally experience multiple cycles of subsequent modification. (Only completely unsuccessful programs or completely perfect programs - I have never seen one - never need to be modified.) All of us who work with software know the nightmare of "legacy" systems that have to be given new functionality in spite of less than adequate initial design. We often spend more time papering over the inconsistencies associated with the original structure than we do on delivering the new functions. (Ideally, of course, one should start again, and indeed it is often likely to be cheaper to do so - but try explaining all this to managers with little knowledge of software engineering. A system was apparently doing everything it should do last week, and it really requires only a small amount of new functionality, doesn't it?)
Even on the smaller scale, in my experience it is much more common to find that a program becomes difficult to modify because the information handling is badly structured, rather than that the algorithms are badly structured. It is, in fact, often fairly straightforward to reallocate essential algorithmic code into sensible subroutine hierarchies. (I have had to do this many times.) Correcting the information structures, on the other hand, can be like trying to replace the foundations of a house: difficult, expensive and by no means guaranteed to produce satisfactory results. I have insisted on complete redesign of certain software systems when faced with this problem. (You need a lot of credibility with management to get away with this.)
Many of the features of modern programming languages are intended to supply tools for controlling this type of information complexity (for example, object-oriented programming). They are indeed powerful tools, and when used by experienced programmers who fully understand the theoretical background they will do all that is desired. Unfortunately, many programmers assume that they can be used correctly without the necessary background knowledge: it is like putting a Lamborghini in the hands of a newly licensed teenager. Expect a crash!
The programs on this website are really rather small compared with typical commercially useful software. Early examples make no use of the more sophisticated methods, mainly because I do not want to confuse my readers. Somewhat later programs do use a few object-oriented techniques, but purely as a matter of minor convenience, and also to introduce some of the methods slowly into the mix. I know from my professional experience that a full-on OO approach can considerably confuse those who have not been trained in it.
The first data handling problem most Processing programmers face is using mouse/keyboard inputs to control images drawn on the screen.
For this, we need to change data that is visible within the draw() subprogram (which is what actually modifies the screen image). Some user actions, such as key-down or mouse-button-down, do indeed modify the state of globally visible variables (e.g. the mousePressed and mouseButton variables) that can be seen from within draw(). However, we often wish to produce a change in the value of some variable that persists even after, for example, the mouse button is released. (In many of my examples, for instance, I need to set and reset the amplitude of a harmonic.) The simplest approach here is to define a variable outside the setup() and draw() subprograms, which then becomes visible within all subprograms - a globally visible variable. We have the same issue when we wish to supply initial configuration parameters for a sketch. We can set values within setup(), but in order for them to be visible in draw() they need to be globally defined.
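Here is what the pattern looks like in a minimal sketch. The harmonic-amplitude control is hypothetical, loosely modelled on the kind of control my examples need.

```java
float amplitude = 0.5f;   // globally visible: set in setup() and keyPressed(), read in draw()

void setup() {
  size(400, 400);
  amplitude = 0.5f;       // initial configuration, visible later in draw()
}

void draw() {
  background(255);
  // draw a sine wave whose height depends on the global amplitude
  for (int x = 0; x < width; x++) {
    float y = height/2 + amplitude * 100 * sin(TWO_PI * x / width);
    point(x, y);
  }
}

void keyPressed() {
  // the change persists after the key is released, because it is stored
  // in a global variable rather than in a transient state flag
  if (key == 'a') amplitude += 0.1f;
  if (key == 'z') amplitude -= 0.1f;
}
```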
This is perfectly satisfactory as long as things do not get too complicated. However, one soon finds that the list of globally defined variables becomes rather long, and it can then get difficult to keep track of which subprograms are modifying global values and which are just using them. We lose track of where information starts and where it ends up.
We can achieve a first step in simplification by grouping related data items together using classes, with an instance of the class defined globally, as in the sketch below. This is just like creating a row of labelled cardboard boxes on a shelf. The only function of the class is to hold the data in its box. It works: it is easier to comprehend which items of data logically sit together and should travel together. We still have the problem of keeping track of where information starts and where it ends up, but it is conceptually easier for the programmer, who now follows fewer (lumped) objects.
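A minimal version of the "labelled box" idea, with names invented for the illustration:

```java
// A class whose only job is to hold related configuration data together.
class WaveSettings {
  float amplitude = 0.5f;
  float frequency = 2.0f;
  int   harmonics = 3;
}

WaveSettings wave = new WaveSettings();  // one globally visible "box" on the shelf

void keyPressed() {
  if (key == 'a') wave.amplitude += 0.1f;  // related data items travel together,
  if (key == 'f') wave.frequency += 0.5f;  // under one name, in one box
}
```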
This approach can only go so far, and there is a danger in allowing all data that needs to pass between subprograms to live in a globally visible space. There might indeed be errors arising from accidentally reading or setting a global value, but the real danger comes from a design mind-set which solves every problem by putting data into the global address space: the programmer avoids the hard thinking about structuring the program so that data flows just where it needs to be. In a well-structured program it is transparently obvious to the program reader where data has come from and what data is created or modified as a result of executing a subprogram.
We have two conceptually different approaches that help to tackle this issue, which ultimately are perhaps not as different as they seem:
- Functional programming: holds that all operations within a program should take place through function calls (subprograms that return a single data structure as a result), and that these should be without side effects. This means that all data used by a subprogram call should go in through the subprogram interface, and no global data should be read or modified, so that only the return value contains information generated by the call. (A small code sketch follows this list.) For the purest form of functional programming one needs to use a language such as Haskell, in which it is difficult to follow any other programming paradigm. However, more widely used languages such as Python and Fortran 95 do allow a pure functional approach for the disciplined programmer (that is to say, one who is able to resist the temptation of easy short-term solutions). Functional programming has a number of attractions for critical software systems, since it facilitates proofs of correctness, but in its purest form it is not widely deployed in the commercial world. It should nevertheless be part of the mental discipline of every advanced programmer.
- Object Oriented Programming: at first sight this seems to be diametrically opposite to functional programming, since side effects are an essential element. OO programming was originally promoted as a magic bullet for making software model the real world: we could identify classes of objects in the real world and their associated behaviour, and then define software classes that directly represented that behaviour. Just as objects in the real world are quite separate, we would automatically achieve a high degree of modularity by following this simple rule. In reality it is much less easy than it might at first appear. In my experience, it is more often than not the case that the division of our model into classes is much less intuitive and obvious than one might think. We have to make choices in which we project our notions of what we consider to be significant onto the world, and these choices will constrain the future development of the software. Furthermore, many required classes turn out to represent not concrete objects but abstractions, such as events. Yes, I find that OO programs can be made highly modular and easily modifiable, but they also require a good deal of careful design input.
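To make the functional discipline concrete before turning to the elements of OO, here is a small Java-style sketch (the names are invented). The first subprogram communicates through a global variable and so has a side effect; the second takes all its data in through its interface and lets only the return value carry information out.

```java
float total = 0;  // global state

// Side-effecting version: silently reads and modifies the global.
void addReading(float reading) {
  total = total + reading;
}

// Functional version: no global is touched; the caller threads the
// state through explicitly, e.g.  total = withReading(total, reading);
float withReading(float total, float reading) {
  return total + reading;
}
```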
Object oriented programming has three essential elements:
- Classes: these are used to describe the common behaviour of a class of objects that are of interest to our world model. All instances of a class - the objects - have the same behaviour. The class definition also specifies the state variables that are associated with each instance of a class.
- Encapsulation: each object instance has its own unique identity and also its own internal state, defined by the values of its state variables.
- Inheritance: it is possible to define sub-classes that inherit all the behaviour and state variable definitions of their super-classes, but to which we can add additional behaviour and state variables, as the sketch below illustrates.
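A minimal Processing-style illustration of all three elements; the shapes and their behaviour are invented for this example.

```java
// Class: common behaviour for every instance, plus per-instance state variables.
class Shape {
  float x, y;                        // encapsulated state: each instance has its own position

  Shape(float x, float y) {
    this.x = x;
    this.y = y;
  }

  void moveBy(float dx, float dy) {  // behaviour shared by all Shapes
    x += dx;
    y += dy;
  }
}

// Inheritance: the sub-class keeps Shape's behaviour and state variables,
// and adds behaviour and a state variable of its own.
class Circle extends Shape {
  float radius;                      // additional state variable

  Circle(float x, float y, float radius) {
    super(x, y);
    this.radius = radius;
  }

  void display() {                   // additional behaviour
    ellipse(x, y, 2 * radius, 2 * radius);
  }
}
```

Two Circle objects created with `new Circle(...)` share the same behaviour, but each carries its own position and radius - its own identity and internal state.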