Artful Computing

I spent much of May 2021 acting as a Section Leader in Stanford University's "CodeInPlace" project, which aimed to teach Python remotely to a very large number of students worldwide (about 12,000) and was staffed largely by volunteers. It was a great experience, and I am posting here some of the general advice I gave to my students.

I rarely write a Python program without using dictionary data structures. In fact, I have heard it said that you can’t really call yourself a proper Python programmer unless you have mastered the use of dictionaries. 

Dictionaries, as you have already discovered, are a method of creating associations between a “key” (often a string, though any hashable object such as a number or a tuple will do) and a “value”, which can be any type of object. If you know the key you can quickly find the value - far more efficiently than checking a long list of keys, one by one, for matches. (Google “hash algorithms” if you are interested in the clever details.)
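Here is a minimal sketch of the difference between looking a value up by key and scanning a list for it. The institution codes and names are invented purely for illustration, and `find_in_list` is a hypothetical helper written only to show the slow alternative.

```python
# Invented example data: institution codes mapped to names.
institutions = {
    "U001": "Northern University",
    "U002": "Southern University",
    "U003": "Eastern College",
}

# Direct lookup: Python hashes the key and goes more or less straight to the value.
name = institutions["U002"]

# .get() returns a default instead of raising KeyError when the key is missing.
missing = institutions.get("U999", "unknown")

# The slow alternative: hold the same data as a list of (key, value) pairs
# and check the keys one by one until a match turns up.
pairs = list(institutions.items())


def find_in_list(pairs, wanted):
    """Hypothetical helper: linear search through (key, value) pairs."""
    for key, value in pairs:
        if key == wanted:
            return value
    return None


print(name)
print(missing)
print(find_in_list(pairs, "U002"))
```

With three entries you will not notice any difference in speed, but the list search has to examine more and more entries as the data grows, while the dictionary lookup takes roughly the same time however many keys there are.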

Dictionaries start to become really useful when you want to connect information in different datasets, which is something we want to do very often in practice when handling large amounts of data. (This is the core of “data science”.) It is an area where a language like Python is frequently the tool of choice.

Let me give an example that currently interests me. I sometimes give talks to high school students about working in science and engineering, and certain questions keep turning up, such as “Where is the best place to study..?” The answer, of course, is that it usually depends as much on the personal characteristics of the student as on those of the institutions. University promotional material can leave students feeling confused, and sometimes misled, because, while it is never inaccurate, the departments are usually selective with the evidence they present. I do not want to give the student just more opinions: I want to show them a more balanced view of the actual evidence.

The UK government does actually collect and make publicly available a large amount of reliable “Unistats” data about course admissions and educational outcomes, including, for example, employment rates and the distribution of salaries actually earned at six and thirty months from graduation. (So, when you see a course advertising “96% employed after six months”, you can show how many of those have gained “professional” jobs and how many are still serving at tables.)

The raw data tables are, however, hard to interpret because they are held in a relational database form that makes for efficient storage, consistent updating and easy searching - which is all a “good thing”. It is just that you now have to use a computer program to get at the information that you need. For many purposes that tool would be a relational database (or sometimes a spreadsheet if data volumes are small - but why struggle with Excel when you can use Python?). However, when my data manipulation needs are relatively straightforward, the data volume is moderately large and I also want to do quite a lot of data presentation using Python’s matplotlib module, it makes sense to do the whole job within one computing environment.

Fortunately, when data is dumped out of a relational database or an Excel spreadsheet you get tables, often in CSV form, that to some extent look like Python dictionaries: each row has a unique key that is used to associate with the rest of the information on that row. Those same unique keys can appear as values in the columns of other tables, so we are able to follow data trails and connect together information in different tables. 
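To make the idea concrete, here is a small sketch that reads two tiny CSV tables into dictionaries and follows a key from one table into the other. The table contents, the column names (`inst_id`, `course_id`, `title`) and the helpers `load_table` and `describe` are all invented for illustration; the real Unistats files use their own layout, which I am not reproducing here.

```python
import csv
import io

# Invented sample data standing in for two CSV files dumped from a database.
INSTITUTIONS_CSV = """inst_id,name
U001,Northern University
U002,Southern University
"""

COURSES_CSV = """course_id,inst_id,title
C10,U001,Physics
C11,U002,Physics
C12,U002,Mechanical Engineering
"""


def load_table(text, key_column):
    """Read CSV text into a dictionary of rows, keyed by one chosen column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key_column]: row for row in reader}


institutions = load_table(INSTITUTIONS_CSV, "inst_id")
courses = load_table(COURSES_CSV, "course_id")


def describe(course_id):
    """Follow the trail: course row -> its inst_id -> institution row."""
    course = courses[course_id]
    institution = institutions[course["inst_id"]]
    return f"{course['title']} at {institution['name']}"


print(describe("C12"))  # prints: Mechanical Engineering at Southern University
```

With real files you would pass an opened file to `csv.DictReader` rather than an `io.StringIO` object, but the cross-referencing step stays the same: the value in one table's column becomes the key used to look up a row in another.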

I quickly convinced myself that cross-referencing the Unistats data tables using Python dictionaries would do everything that I actually needed to do in a reasonably efficient and straightforward way. Furthermore, the Python ecosystem already has a sophisticated data handling library called pandas, which is specifically designed to read and manipulate tables of exactly this sort, and its label-based way of indexing rows and columns works very much like looking things up in a Python dictionary. Even better, pandas is designed to connect easily to matplotlib, because a lot of data analysis does end in plotting graphs.
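As a rough sketch of what this looks like with pandas, the following joins two small tables on a shared column and summarises the result. It assumes pandas is installed. The column names and every number are made up for illustration and are not real Unistats values.

```python
import pandas as pd

# Invented figures for illustration only; these are not real Unistats data.
institutions = pd.DataFrame({
    "inst_id": ["U001", "U002"],
    "name": ["Northern University", "Southern University"],
})
courses = pd.DataFrame({
    "course_id": ["C10", "C11", "C12"],
    "inst_id": ["U001", "U002", "U002"],
    "title": ["Physics", "Physics", "Mechanical Engineering"],
    "employed_pct": [90, 80, 94],
})

# Join each course row to its institution row through the shared inst_id column.
merged = courses.merge(institutions, on="inst_id", how="left")

# Average the (invented) employment percentage for each institution.
summary = merged.groupby("name")["employed_pct"].mean()
print(summary)

# With matplotlib installed, summary.plot(kind="bar") would chart this result.
```

The `merge` call does in one line what the dictionary cross-referencing does by hand, which is why I tend to reach for pandas once there are more than a couple of tables to connect.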

That is why we love Python: a potentially complicated job turns out to be relatively straightforward because most of the hard work has already been done for you. You just have to find the right building blocks and stick them together, and dictionaries are frequently used as the glue.
