Artful Computing

I spent a lot of May 2021 acting as a Section Leader in Stanford University's "CodeInPlace" project, which aimed to teach Python remotely to a very large number of students world-wide (about 12,000) and was staffed largely by volunteers. It was a great experience, and I am posting here some of the general advice I gave to my students.

I used to believe that the purpose of testing was to eliminate errors from my code before I released it into the wild. It turned out I was wrong. About 20 years ago, I found myself on a training workshop led by Beverley Littlewood, a distinguished professor of software engineering at London’s City University, and was forced to go through the extremely painful process of changing my mind. Unfortunately, Professor Littlewood had extensive statistical evidence that supported his alternative view in a highly convincing way.

As it happens, there are organisations, such as IBM and the major telecommunications suppliers, that develop software and routinely keep very careful records of every single error that has ever occurred in their systems, both before and after release. For these companies software errors can have a major impact and turn into public embarrassment, so any method of reducing errors gives them a direct benefit. Hence, in a project some years ago, they turned their databases over to Littlewood for careful statistical analysis of the effectiveness of testing.

Littlewood's conclusions seemed highly counter-intuitive to me at first - but the data did not lie. He discovered that if the system testers find, say, 100 faults prior to release, it is likely that another 100 will eventually turn up during the system’s production service. On the other hand, if the testers find 1000 prior to release, then it is likely that a further 1000 will turn up in service.

Surely, you might think, if the testers work harder and find more problems, there should be fewer problems left in the released code?

The alternative viewpoint, promoted by Prof Littlewood, is that in any real software project there is a finite amount of resource that will be allocated to testing, debugging and fixing problems. (Most software project managers budget for it to be about 25% of total project costs, and once it starts to climb to 50% they cross their fingers and ship the product regardless. Hence the demise of the Lotus spreadsheet system and company.) 

Furthermore, the nature of software errors is such that at least 50% of them are of a kind that is extremely difficult to find using typical testing techniques. (They might be subtle errors in design assumptions that will only be revealed in extraordinary, unanticipated circumstances. If the testers are making the same wrong assumptions, they will not produce the tests that probe those assumptions.) Therefore, the number of errors found during the testing phase, in a reasonable length of time, is a measure of the quality of the construction process: a statistical indication of the rate at which this team has been introducing design errors into this product. Testing is therefore analogous to typical quality control on a production line, where we take a sample of the products for detailed examination (the quality people do not need to examine every single object coming off the production line to determine whether the overall quality is satisfactory).

It is, in fact, fairly easy to convince yourself that any software which is of a size required to do a useful job (say more than a few hundred lines of code) can never be tested in a way that will prove that the system is completely free from errors. You have to do that in a completely different way - using mathematics.

Every time you use a conditional statement (“if” and “while” in Python) you create a fork in the way the program execution may flow. Whether you get an error on either of these paths will depend on the state of the computer memory at the time. (You may, for example, get a divide-by-zero along one path if a certain variable has the value zero.) When you have completed an if-else block, the state of the computer memory will be different depending on which path you took (or else why did you have the branch point?). That means that each time you introduce a branch point you double the number of potential memory states and the number of possible execution paths through the code. A fault is a combination of a particular memory state acting together with the specific code on a particular path, so we are doubling the number of potential faults at every branch.
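
To make this concrete, here is a toy Python function (a made-up example of my own, not one of Littlewood's): it contains just three branch points, yet already has 2 × 2 × 2 = 8 distinct execution paths, each with its own opportunities for a fault such as the divide-by-zero mentioned above.

```python
def describe_order(total_price, item_count, discount_code):
    # Branch 1: two possible paths so far.
    if item_count > 0:
        average = total_price / item_count   # a divide-by-zero hides here if item_count slips through as 0
    else:
        average = 0.0

    # Branch 2: each earlier path splits again - four paths now.
    if discount_code == "STAFF":
        average *= 0.9

    # Branch 3: splitting again - eight distinct paths through one small function.
    if average > 100:
        return "expensive"
    return "normal"

# Covering every path needs at least eight carefully chosen test cases; here are just two of them.
print(describe_order(250, 2, "STAFF"))   # items present, discount applied, over the threshold -> "expensive"
print(describe_order(0, 0, "NONE"))      # empty order, no discount, under the threshold -> "normal"
```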

Conditional statements turn up quite frequently in code, perhaps every 10 lines, so in a program of 1000 lines (a very modest code by modern standards) there may be 100 such statements, and if you executed a million test cases every second for the age of the Universe you would still be millions of times short of testing every possible execution path through your short program.
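
The arithmetic behind that claim is easy to check for yourself (the figures below are rough, but the conclusion does not depend on the details):

```python
paths = 2 ** 100                            # 100 independent branch points
age_of_universe_s = 13.8e9 * 3.15e7         # ~13.8 billion years, in seconds
tests_possible = 1e6 * age_of_universe_s    # a million test cases per second, for all of that time

print(f"possible execution paths: {paths:.1e}")                   # ~1.3e+30
print(f"tests you could ever run: {tests_possible:.1e}")          # ~4.3e+23
print(f"shortfall factor:         {paths / tests_possible:.1e}")  # ~3e+06 - millions of times short
```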

In fact, unless you are very systematic in the way you build test cases, you will probably find that your first attempt at a suite of test cases does not even force every line of your program to be executed at some point. (There are tools that let you check this - most people’s unsystematic first attempts reach perhaps only 60-70% of the lines. They are always surprised when you face them with the evidence.) Over-optimistic testing is one of the reasons why so much consumer-facing software appears to fail so often.
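
Here is the kind of thing that happens in practice, again with a made-up example of my own: the two tests below feel perfectly reasonable, yet they never execute the error-handling branch at all.

```python
def safe_average(values):
    """Return the mean of a list of numbers, or None for an empty list."""
    if not values:
        return None                 # this line is never executed by the tests below
    return sum(values) / len(values)

# A typical unsystematic first attempt at testing: plausible inputs only.
assert safe_average([1, 2, 3]) == 2
assert safe_average([10]) == 10
# The empty-list branch never runs, so a mistake there would sail through unnoticed.
```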

High quality software testing mixes a number of different approaches, including getting inside the minds of the users, to anticipate the way they will use your product - which may not be the way you intended. (Yes, sometimes we even watch them use the stuff.) Ultimately, we want their experience of errors to be sufficiently infrequent that they will not cause inconvenience and harm. Engineering is not about perfection: it is about delivering something that is good enough at the right price and the right time.

It is hard to test your own code, because it is hard to step outside the assumptions you have made about the way the code's users will understand and apply the product. (It helps, of course, if your only user is yourself.) When things really matter we do not just get a colleague to test our code; it may even be given to an independent organisation with a “code-breaking” mindset.

The trouble is that most of us are tempted to stop testing when we stop finding errors - but we are probably just not looking in all the right places. Well-trained professionals use specialist monitoring tools to check whether our tests have at some point exercised every line of code and forced every conditional statement to go both ways. (There are free tools available for Python - coverage.py, for example - and IDEs such as PyCharm offer similar facilities.) This is still a long way from covering every possible path through the code, but I assure you that when I look at the test cases offered by the author of a program it is fairly rare to find that they even manage to execute more than 70% of the code lines, and the coverage of conditional branching is usually much lower. The statistics are sobering - but you will meet quite a lot of consumer-facing code in App stores that is of this level of quality. It explains a lot.
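
As an illustration of what such a monitoring tool reports, here is a minimal sketch using the free coverage.py package (my choice purely for the example; install it with "pip install coverage"), applied to the toy function and casual tests from the earlier sketch:

```python
import coverage

def safe_average(values):
    """The toy function from the earlier sketch, repeated so this block runs on its own."""
    if not values:
        return None                          # the line our casual tests never reach
    return sum(values) / len(values)

def run_casual_tests():
    assert safe_average([1, 2, 3]) == 2      # plausible inputs only - nobody tries the empty list
    assert safe_average([10]) == 10

cov = coverage.Coverage(branch=True)         # ask for branch coverage, not just line coverage
cov.start()
run_casual_tests()
cov.stop()
cov.report(show_missing=True)                # lists the lines (and branch directions) never exercised
```

The same kind of report is available from the command line by running "coverage run --branch" on a test script and then "coverage report -m".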

So, we sometimes use other software tools that take code in at one end, examine it, and write test cases out at the other which will automatically exercise a much larger fraction of the system’s behaviour. You may be satisfied with writing a few dozen test cases; doing things this more sophisticated way may give you 50,000 tests or so.
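
One well-known family of such tools in Python is property-based testing; here is a minimal sketch using the Hypothesis library (chosen purely for illustration, and not necessarily what a particular test team would use). You state a property the output must always satisfy, and the tool invents thousands of input cases for you.

```python
from hypothesis import given, settings, strategies as st

def my_sort(values):
    return sorted(values)                    # stand-in for the code under test

@settings(max_examples=5000)                 # ask for thousands of generated test cases
@given(st.lists(st.integers()))              # Hypothesis invents the input lists itself
def test_sort_properties(values):
    result = my_sort(values)
    assert len(result) == len(values)                       # nothing is lost or invented
    assert all(a <= b for a, b in zip(result, result[1:]))  # the output really is in order

test_sort_properties()                       # calling it runs all the generated cases
```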

The real challenge here, known as the “Oracle” problem, is working out whether the tests are giving the right or wrong results. If we already know all the answers, why would we need the code? In fact, you can do various things that help a lot, but surveying them would take too long here. There are plenty of books!  (One technique, for example, is working things backward: quite often, particularly in maths and physics, if the software gives an answer it can be relatively easy to check that it actually solves the specific problem you defined.)
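
A toy example of that backwards check, again my own: we may not know in advance what the roots of a randomly chosen quadratic equation should be (that is the oracle problem), but once the solver returns an answer we can simply substitute it back into the equation and confirm that it works.

```python
import math
import random

def solve_quadratic(a, b, c):
    """Return the real roots of a*x**2 + b*x + c = 0 (assumes real roots exist)."""
    root = math.sqrt(b * b - 4 * a * c)
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

for _ in range(50_000):                       # the sort of volume automatic generation makes feasible
    a = random.uniform(0.1, 10.0)
    b = random.uniform(-100.0, 100.0)
    c = random.uniform(-100.0, 100.0)
    if b * b - 4 * a * c < 0:
        continue                              # no real roots for this case: skip it
    for x in solve_quadratic(a, b, c):
        residual = a * x * x + b * x + c      # should be (numerically) zero if x really is a root
        assert abs(residual) < 1e-6 * (abs(a * x * x) + abs(b * x) + abs(c) + 1.0)
```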

The take-home message: good testing is much harder than you think - even if you already think it is hard!
