Discussions of the Python programming language.
I spent a lot of May 2021 acting as a Section Leader in Stanford University's "CodeInPlace" project, which aimed to teach Python remotely to a very large number of students world-wide (about 12,000), staffed largely by volunteers. It was a great experience and I am posting here some of the general advice I gave to my students.
I thought it might be worth giving the perspective of a Python user who still regards himself primarily as a physicist/engineer (rather than a computer scientist/software engineer). We do more heavy data handling than most other scientific disciplines, but the actual computing knowledge and skill requirements are fairly modest: 90% of the work requires 10% of the available functionality. We learn Python in order for it to be useful, not because it is a neat, elegant tool, or because we enjoy programming (though many of us do). It has to earn its keep.
In fact, CodeInPlace has covered most of that essential 10%, so a practicing scientist needs to add just a few more areas from the standard library and the Python module index.
That actually covers at least 90% of what I have ever done with scripting languages, and it delivers the majority of the benefit to most practicing scientists.
I do feel, however, that those who need to do a lot of heavy data handling and statistical analysis might also wish to look at numpy and pandas. You do not need to learn a great deal to get most of the benefits of Python. By all means go a lot further (and become even more valuable to your employer), but do not imagine that you need to absorb the content of a CS degree to be sufficiently competent.
There is one other tool - not specifically Python - that any serious programmer really needs to use on a regular basis: install a Software Configuration Manager (SCM), which will allow you to keep a history of your code and associated documents, scripts and test cases. You do not need to pay for these (unless you have a particularly well-disposed employer who is prepared to equip the team with the best commercial tools). Many Integrated Development Environments (if you like that type of thing) come with a built-in SCM, but if you prefer stand-alone tools (I do) then Open Source products such as Mercurial or Git are excellent and very widely used. (There is no lock-in: it is fairly easy to transfer information from one of these to the other, and the way they are applied is remarkably similar. Each has minor advantages in particular contexts. I happen to use Mercurial just because it is more familiar. The choice often depends on what those around you are using - because they can give you support.) Do it now!
Any serious project involving more than one person, or with external stakeholders, will also need an Issue Tracker. You will be surprised at how quickly issues build up even on comparatively small projects, and you might as well get a handle on them from the start. There are free tools available (e.g. Bugzilla) which are widely used on large Open Source projects - but they may be a bit of an overkill for your individual needs. Even a spreadsheet is better than nothing for a small project. Employers looking to maximise the productivity of their development teams (and keep a close eye on progress) may well invest in commercial tool suites that integrate project management and issue tracking with SCM and other software development support tools. In such environments you will not have any choice about which tools you use or how you work.
I rarely write a Python program without using dictionary data structures. In fact, I have heard it said that you can’t really call yourself a proper Python programmer unless you have mastered the use of dictionaries.
Dictionaries, as you have already discovered, are a way of creating associations between keys (typically strings, though numbers and other immutable objects work too) and values of any type. If you know the key you can find the value very quickly - far more efficiently than checking a long list of keys, one by one, for matches. (Google “hash algorithms” if you are interested in the clever details.)
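To make that concrete, here is a tiny sketch (the chemical data is just for illustration) showing a dictionary lookup next to the slower list-scan equivalent:

    # A dictionary maps keys to values and finds them in (roughly)
    # constant time, however large it grows.
    atomic_number = {"H": 1, "He": 2, "Li": 3, "C": 6}
    print(atomic_number["C"])     # 6 - direct lookup by key
    print("Fe" in atomic_number)  # False - fast membership test

    # The list equivalent checks every pair until it finds a match,
    # which gets slower as the list grows.
    pairs = [("H", 1), ("He", 2), ("Li", 3), ("C", 6)]
    for symbol, number in pairs:
        if symbol == "C":
            print(number)         # 6, but only after scanning
            break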
Dictionaries start to become really useful when you want to connect information in different datasets, something we want to do very often in practice when handling large amounts of data. (This is the core of “data science”, and it is one reason why a language like Python is frequently the tool of choice in this area.)
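As a small illustration (with invented student records), connecting two datasets amounts to looking up the keys that they share:

    # Two toy datasets that share student IDs as their keys.
    names = {"s001": "Asha", "s002": "Ben", "s003": "Carla"}
    marks = {"s001": 72, "s003": 55}

    # Cross-reference them: for each student, look up a mark if one exists.
    for student_id, name in names.items():
        mark = marks.get(student_id, "no mark recorded")
        print(name, "-", mark)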
Let me give an example that currently interests me. I sometimes give talks to high school students about working in science and engineering, and certain questions keep turning up, such as “Where is the best place to study..?” The answer, of course, is that it usually depends as much on the personal characteristics of the student as on those of the institutions. University promotional material can leave students confused, and sometimes misled, because, while it is rarely inaccurate, departments are usually selective with the evidence they present. I do not want to give the students just more opinions: I want to show them a more balanced view of the actual evidence.
The UK government does actually collect and make publicly available a large amount of reliable “Unistats” data about course admissions and educational outcomes, including, for example, employment rates and the distribution of salaries actually earned at six and thirty months after graduation. (So, when you see a course advertising “96% employed after six months”, you can show how many of those graduates have gained “professional” jobs and how many are still serving at tables.)
The raw data tables are, however, hard to interpret because they are held in a relational database form that makes for efficient storage, consistent updating and easy searching - all a “good thing”. It is just that you now have to use a computer program to get at the information you need. For many purposes that tool would be a relational database (or sometimes a spreadsheet, if data volumes are small - but why struggle with Excel when you can use Python?). When my data manipulation needs are relatively straightforward, the data volume is moderately large, and I also want to do quite a lot of data presentation using Python’s matplotlib module, it makes sense to do the whole job in one calculation environment.
Fortunately, when data is dumped out of a relational database or an Excel spreadsheet you get tables, often in CSV form, that look to some extent like Python dictionaries: each row has a unique key that is associated with the rest of the information on that row. Those same unique keys can appear as values in the columns of other tables, so we can follow data trails and connect together information held in different tables.
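Python’s standard csv module makes this straightforward. The sketch below is hypothetical - the file names, column names and key values are invented, and the real Unistats tables use their own identifiers - but it shows the pattern: load each table into a dictionary keyed on its unique column, then follow a key from one table into another.

    import csv

    def load_table(filename, key_column):
        # Read a CSV file into a dictionary of rows, keyed on one
        # column whose values are unique to each row.
        with open(filename, newline="") as f:
            return {row[key_column]: row for row in csv.DictReader(f)}

    # File and column names here are invented for illustration.
    courses = load_table("courses.csv", "COURSEID")
    institutions = load_table("institutions.csv", "INSTID")

    # Follow the data trail: a course row carries an institution ID,
    # which is itself a key into the institutions table.
    course = courses["C0042"]
    institution = institutions[course["INSTID"]]
    print(course["TITLE"], "at", institution["NAME"])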
I quickly convinced myself that cross-referencing the Unistats data tables using Python dictionaries would do everything I actually needed in a reasonably efficient and straightforward way. Furthermore, Python already has a sophisticated data handling module called pandas which is specifically designed to read and manipulate tables of exactly this sort, and its tables can be built from, and behave much like, Python dictionaries. Even better, pandas is designed to connect easily to Python’s matplotlib, because a lot of data analysis does end in plotting graphs.
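In pandas the same cross-referencing collapses into a few lines. Again, the file and column names below are invented for illustration, but the pattern - read each table, merge on the shared key, plot - is the standard one:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Each CSV file becomes a DataFrame: a table of labelled columns.
    # (File and column names are hypothetical.)
    courses = pd.read_csv("courses.csv")
    outcomes = pd.read_csv("outcomes.csv")

    # merge() joins the two tables on their shared key column,
    # much like cross-referencing two dictionaries.
    combined = courses.merge(outcomes, on="COURSEID")

    # pandas plotting is a thin wrapper around matplotlib.
    combined.plot.scatter(x="ENTRY_TARIFF", y="MEDIAN_SALARY")
    plt.show()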
That is why we love Python: a potentially complicated job turns out to be relatively straightforward because most of the hard work has already been done for you. You just have to find the right building blocks and stick them together, and dictionaries are frequently used as the glue.