Scala – just good for Java shops?

Scala talk given at the inaugural Thames Valley Functional Programming Meet-Up.


Mining Python Software


Automate, automate, automate

I’ve recently been working on a new Python project, which started off as a bit of an experiment at the recent PyPy London Sprint. Working on a brand new repository is always nice, a blank slate and a chance to write some really elegant code, without all the crud of a legacy project.

In this case, the infrastructure for the project is pretty involved. I was using the pytest unit-testing framework and the RPython toolkit from PyPy, both for the first time.

That led to an interesting situation. When I run the unit tests, I want to use the CPython interpreter. This means I can use all the standard library modules that I know well, and can test the basic algorithms I’m writing. When I want to “translate” my code into a binary executable, I use PyPy and some of its rlib replacements for the Python standard library modules. When I get a runtime error in the translation, I need to know whether that is related to my use of the rlib libraries or whether my code is just plain wrong, and using CPython helps me to do that.

The problem is that I have to keep switching between different standard libraries and interpreters. Somewhere in my code there is a switch for this:   

DEBUG = True

In testing that switch should be True and in production it should be False, but changing that line manually is a real pain, so I need some scripts to catch when I’ve set the DEBUG flag to the wrong mode.

Test automation #1

Here’s my (slightly simplified) first go at automating a test script:

import subprocess

debug_file = ...
framework = 'pytest.py'
try:
    retcode = subprocess.check_output(['grep', 'DEBUG = False', debug_file])
    print 'Please turn ON the DEBUG switch in', debug_file, 'before testing.'
except subprocess.CalledProcessError:
    subprocess.call(('python', framework))

What does this do? First the script calls the UNIX utility grep to find out whether the DEBUG flag is wrongly set to False:

retcode = subprocess.check_output(['grep', 'DEBUG = False', debug_file])

If it is, the script prints a warning message:

print 'Please turn ON the DEBUG switch in', debug_file, 'before testing.'

which tells me I have to edit the code, and if not, the script runs the tests:

subprocess.call(('python', framework))

Nice, but I still have to edit the file if the flag is wrong.

Test automation #2

It would be nicer for the script to change the flag for me. Fortunately, this is easily done with the Python fileinput module. Here’s the second version of the full test script (slightly simplified):

import fileinput
import subprocess
import sys

debug_file = ...
debug_on = 'DEBUG = True'
debug_off = 'DEBUG = False'

def replace_all(filename, search_exp, replace_exp):
    """Replace all occurrences of search_exp with replace_exp in filename.

    Code by Jason on:
    http://stackoverflow.com/questions/39086/search-and-replace-a-line-in-a-file-in-python
    """
    for line in fileinput.input(filename, inplace=1, backup='.bak'):
        if search_exp in line:
            line = line.replace(search_exp, replace_exp)
        sys.stdout.write(line)

def main():
    """Check and correct debug switch. Run testing framework.
    """
    framework = 'pytest.py'
    opts = ''

    try:
        retcode = subprocess.check_output(['grep', debug_off, debug_file])
        print 'Turning ON the DEBUG switch in', debug_file, 'before testing...'
        replace_all(debug_file, debug_off, debug_on)
    except subprocess.CalledProcessError:
        pass
    finally:
        subprocess.call(('python', framework, opts))
    return

if __name__ == '__main__':
    main()

Test automation #3

So, now the flag is tested, set correctly if need be, and the tests are run. But I still have to run the test script! What a waste of typing. So, the next step is simply to call this script from a git pre-commit hook.
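As a rough illustration of that last step, here is a minimal sketch of what the hook could look like, written for Python 3. The script name run_tests.py and the function name pre_commit are assumptions made for this example, not names from the project. Git aborts a commit when the pre-commit hook exits with a non-zero status, so the sketch simply passes on the exit status of the test script; that only helps if the test script itself exits non-zero when the tests fail.

```python
import subprocess
import sys

# Hypothetical file name for the test script shown above.
TEST_COMMAND = [sys.executable, 'run_tests.py']


def pre_commit(command=TEST_COMMAND):
    """Run the test command and return its exit status.

    Git treats a non-zero exit status from the pre-commit hook as a
    request to abort the commit.
    """
    return subprocess.call(command)
```

To use it, save the code as .git/hooks/pre-commit with a Python shebang line at the top, end the file with sys.exit(pre_commit()), and make the file executable.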

Code for this post

The full history for this script can be found here and here.


West Midlands Employment Data

At the Government Open Data Hack Day event organised by James Cattell and Gavin Broughton, Andy Pryke, Christophe Ladroue and I had a go at analysing employment statistics for the West Midlands. In particular we were looking for correlations between employment data and other factors, such as census data about age and gender. As with all data mining work, the most difficult and time-consuming job was cleaning the available data before it could be used in an analysis. Christophe wrote a very clear account of the work he did using R to deal with nomis data. You can see a summary of our results in the video below.

… and if you want to download the data yourself, it is publicly available here:

https://docs.google.com/spreadsheet/ccc?key=0AtT1QPEACWUldE9NTVduRGVSUC1yMHZiMkZDVXZYT2c&usp=sharing


#efdhack2012 26th May 2012

This one’s a little different. Python West Midlands is hosting a hackday to kick off a new open source project for a very interesting little charity called Evidence for Development (EfD). EfD wants to help people make better decisions about aid projects – at local and national level – by putting real data about the real situation in the hands of the people making the decisions.

If you want to know if your aid programme is making a difference to the right people then you need to model the economy of your target village or district, before and after. Makes sense; simple science, right? Problem is you can’t afford a bunch of western econometricians crawling all over the place (costs too much, takes too long) and anyway their cash-based economic models don’t work that well in a place where cash is only a small part of the economy (grow your own; harvest wild food; get paid in kind or cash or both for day labour; trade crops, labour or other goods; etc, etc). So EfD developed simple economic models that work in this environment, that can be learned and applied by locally trained people and that are built to run on laptops. No reliance on big foundations’ data centres.

Last year EfD, in partnership with Chancellor College of the University of Malawi and the University of Wolverhampton, developed a Python/MySQL app to model local economies that is already in use in several countries in Southern Africa.

This year the challenge is bigger – to build software that can model national and international economies. The model exists and works (it has a great track record of predicting famine effects from annual summary surveys of rural economies). But the only current implementations are proprietary, ill-supported and not extensible. Smells like open source spirit.

So for this hackday we’re going to have with us the two developers who led the IHM development last year (from Chancellor College in Zomba, Malawi) and the developers of the modeling methodologies from EfD (from Barnes and Surrey – exotic eh?). We’ll have a pretty complete MySQL database schema to work on and we hope to finish the day with a simple demo scenario that downloads reference data about a geographical area (a livelihood zone), produces a spreadsheet template to capture information about that livelihood zone (what they grow there, what they eat, how they make a living), runs some local completeness reports, and uploads the captured data for merging (with other livelihood zone surveys) to allow analysis of a national survey.

I’m not a software developer, can I still contribute?

Yes! Absolutely. There are a number of jobs that can be done without writing any code. We would really appreciate the support of contributors who can build a web presence for these projects, write user and developer documentation, help spread the word and do any number of other jobs! If you’re keen to help out, there will definitely be a place for you.

When:

10:30 onwards, 26th May 2012. Please sign up here.

Where:

Thyme Software, Coventry University Technology Park, Puma Way, Coventry, CV1 2TT [map]


The great Christmas email experiment of 2011-12

This year I took pretty much all the holiday time I could over Christmas, probably for the first time ever. As an experiment, I let all the emails I received over this period accumulate in my Inbox, with the exception of things like posts to mailing lists, which get automatically filtered and labeled and skip the Inbox. Generally, I try to follow an Inbox Zero policy, which means my Inbox is usually empty and every email I get is either dealt with as soon as I read it or saved in a “Next Action” list to be dealt with later. That policy makes it much easier to carve out large blocks of time for more difficult tasks, like writing lectures, marking or programming, which all require uninterrupted concentration. I think this works pretty decently, and at least I haven’t had to declare email bankruptcy.

So, the point of this experiment was really to see whether my Inbox Zero policy is working as well as I thought and, in particular, whether the bulk of the email I deal with is sensible content that really requires attention.

Of course, the “experiment” as such is a little silly. After all, this is email from a vacation period and out of term time, so the results are heavily skewed. Usually I get a lot more email per day and a lot more relevant, sensible email that needs attention, and the aim is always to maximise the time spent on those emails and minimise the time spent on unnecessary emails.

Starting point

Anyway, enough caveats. My starting point was this:

Inbox: 316

Action list: 50

Before going on vacation I cleared out both the Inbox and the Action List of everything that could be dealt with then. So, the starting point here is all the email accumulated over a short vacation and all the items on my to-do list that couldn’t be finished before the holiday started.

The data

Yesterday I spent a happy (!!) afternoon going through each email and either responding to it, deleting it, reporting it as SPAM or filing it. In a Google Docs spreadsheet I wrote down the sender (anonymously unless the sender was a company), sender type and action for each email or group of emails from the same sender. I say “email”, but actually I mean “email thread”, so one email on my spreadsheet could well mean a thread of many emails from various senders. However, what I’m interested in here is really the aggregate data from the 300 emails, which you can see in this table:

Aggregated data from 300 emails
So, there are two things I’m interested in here:
  1. Where is the email from? Is it from people I need to communicate with or from companies and others sending “news” and other updates that can be ignored or processed in a more convenient way, such as via an RSS reader? Obviously emails from colleagues (including external collaborators) and students are all important. Other senders vary considerably depending on the content of the email.
  2. How were the emails processed? Emails that were deleted or marked as SPAM are emails I don’t want to receive repeatedly, so are best unsubscribed from. Emails that needed real attention can be filtered to be marked as important if they aren’t already.

Where do emails come from?

330 emails broken down by sender type

So, thinking of this email as signal and noise, the signal here is email from students, colleagues, friends and open source projects. Of course, SOME of the other emails will be important and will need some action too, but this is a rough guide. The total number of “signal” emails, according to the sender, worked out as 78 out of 316, or around 25%.

Now, 25% to my mind is astonishingly low. Given that most of the email that hits my account gets filtered out and never sees the Inbox in the first place, 25% is really not what I expected to see here. 

What happened to all those emails?

300 email conversations broken down by next action

The other way to look at signal vs noise is how the emails were processed. The signal in this case is the emails that were actioned immediately or saved for working on next week, which was 73 out of 316 or just over 23%. That’s very close to the previous SNR, because the sender of a message is a good predictor of its importance.

Again though, 23% is astonishingly low. The main culprit is web apps and social media apps that send frequent notifications, updates and other fluff. Often when you sign up to these things they subscribe you to all sorts of email alerts automatically, then it takes effort on your part to change your settings and unsubscribe. A better way to deal with this, if you use GMail, is to use the Gmail plus trick which allows you to filter out all these emails automatically.
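For anyone who hasn’t met the plus trick: Gmail delivers mail addressed to name+anything@gmail.com to the mailbox for name@gmail.com, so you can give each service its own tagged address and then filter on the “To” address. A tiny sketch of building such an address follows; the function name plus_address and the example address are made up for illustration.

```python
def plus_address(address, tag):
    """Return address with +tag inserted before the @ sign."""
    local, domain = address.rsplit('@', 1)
    return '%s+%s@%s' % (local, tag, domain)


# Hypothetical address, tagged for web sign-ups.
signup_address = plus_address('someone@gmail.com', 'web-signups')
```

A filter that matches mail sent to the tagged address can then label the messages and keep them out of the Inbox.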

A point about unsubscribing from mailing lists 

When you unsubscribe from an email alert you are informing the sender that you no longer wish to be contacted. The very LAST thing you then need is another email saying “Well done! You have unsubscribed” which you then have to deal with separately. Seriously, this is a terrible way to treat potential customers. Very few of the email alerts I unsubscribed from did this, but those that did really annoyed me. TripIt, Klout, SAA, Costa, the Electoral Reform Society and UCU: consider yourselves mildly whinged at. Hurumph.

End point

Just for the record…

Inbox: 0

Action list: 89

Actioned immediately: 34

The take home…

This stuff is boring common sense. It’s motherhood and apple pie. You know it all already. So you’re doing this already, right?

  • Email is a huge sink of time.
  • Process email in batch mode, once or twice a day. Don’t let incoming emails dictate your work schedule.
  • Unsubscribe from everything you can at the first chance you get. Better still, don’t sign up in the first place.
  • If you use GMail, use the Gmail plus trick.
  • If you sign up to a lot of web apps and different services with logins and passwords, keep confirmation emails in a specific folder or label (I use web-signups) so you can keep track of which services you already have an account for.
  • Filter and label emails automatically whenever you can. Don’t let anything into your Inbox that doesn’t need to be there (looking at you, posts to mailing lists).
  • Learn the keyboard shortcuts on your favourite email client. Use them. Banish the mouse.
  • Deal with emails that can be dealt with immediately, immediately.
  • Keep a “next action” folder of emails that cannot be dealt with immediately. Don’t have them hanging around your Inbox making you feel guilty, nervous and demoralised.
  • Keep a sensible hierarchy of folders or labels to organise your email. Or use something like ActiveInbox.

What errors does my Python module define and raise?

On StackOverflow someone asked a while ago whether you can find out what errors a module defines and throws. In Python, a function does not declare that it throws a particular error object, so you need to look inside the module to see what exceptions it defines, or what exceptions it raises. You can do this by reading the docs (RTFM!) but of course they may be out of date, or what have you, so an alternative is to use the Python API to do the looking for you.

Which errors does a module define?

To first find which exceptions a module defines, just write a simple script to go through each object in the module dictionary module.__dict__ and see if it ends in the word Error or if it is a subclass of Exception:
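A minimal sketch of such a script, written for Python 3, might look like the following. It is not the original listing: the function names find_exception_types and report are invented for this example, and the sketch only considers classes, which is an extra assumption on top of the rule described above.

```python
import importlib
import inspect


def is_exception_type(name, obj):
    """Apply the rule above: name ends in 'Error' or class derives from Exception."""
    if not inspect.isclass(obj):
        # Assumption: only classes count, so plain values are skipped.
        return False
    return name.endswith('Error') or issubclass(obj, Exception)


def find_exception_types(module):
    """Return the sorted names in module.__dict__ that look like exception types."""
    return sorted(name for name, obj in vars(module).items()
                  if is_exception_type(name, obj))


def report(module_name):
    """Import the named module and print one line per exception type found."""
    print('Looking for exception types in module: %s' % module_name)
    module = importlib.import_module(module_name)
    for name in find_exception_types(module):
        print('%s.%s is an exception type' % (module_name, name))
```

Calling report with the first command-line argument (sys.argv[1]) would turn this into a command-line script.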

If I run this on the shutil module from the standard library I get this:

$ python listexn.py shutil
Looking for exception types in module: shutil
shutil.Error is an exception type
shutil.WindowsError is an exception type
$

That tells you which errors are defined, but not which ones are thrown. Of course, if the module has errors with funny names, or ones that are not subclasses of Exception, then this code will miss them.

What errors are thrown by a module?

To find out what errors a module can throw, we need to walk over the abstract syntax tree generated when the Python interpreter parses the module, and look for every raise statement, then save a list of names which are raised. The code for this is a little long, but pretty straightforward, so first I’ll show the output:


$ python listexn-raised.py /usr/lib/python2.6/shutil.py
Looking for exception types in: /usr/lib/python2.6/shutil.py
/usr/lib/python2.6/shutil.py:OSError is an exception type
/usr/lib/python2.6/shutil.py:Error is an exception type
$

So now we know that shutil.py defines the errors Error and WindowsError and raises the exceptions OSError and Error. If we want to be a bit more complete, we could write another method to check every except clause to also see which exceptions shutil handles.

Here’s the code to walk over the AST. It just uses the compiler.visitor interface to create a walker which implements the visitor pattern from the Gang of Four book:
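The compiler module was removed in Python 3, so the sketch below is not the original listing. It swaps in the standard library’s ast module, whose NodeVisitor class offers the same visitor-style walk. The names RaiseCollector and find_raised_names are invented for this example.

```python
import ast


class RaiseCollector(ast.NodeVisitor):
    """Visitor that records the names used in raise statements."""

    def __init__(self):
        self.raised = []

    def visit_Raise(self, node):
        exc = node.exc
        # For "raise Foo(...)" look at the thing being called.
        if isinstance(exc, ast.Call):
            exc = exc.func
        if isinstance(exc, ast.Name):
            name = exc.id          # raise Foo
        elif isinstance(exc, ast.Attribute):
            name = exc.attr        # raise module.Foo
        else:
            name = None            # bare "raise" or some other expression
        if name is not None and name not in self.raised:
            self.raised.append(name)
        self.generic_visit(node)


def find_raised_names(source):
    """Parse Python source text and return the names it raises, in order."""
    collector = RaiseCollector()
    collector.visit(ast.parse(source))
    return collector.raised
```

Reading a file and passing its text to find_raised_names gives the list of names to print. One limitation: a statement such as "raise err", where err is a variable holding an exception, is recorded under the variable’s name rather than the exception’s type.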
