Google Refine and the Fotherington-Thomas issue

26 Nov 2012

Google refine.320x320

If you ever work with spreadsheets (!) you no doubt keep your data meticulously clean. I bet you never ever enter “n/a” into a column that’s otherwise full of numbers, or mix “U.K.” with “United Kingdom”.

Today’s news flash: there are spreadsheets out there not created by people like you and me. People who, late at night, mistakenly copy columns and accidentally transpose the data, or make typos and don’t spot them.

Google Refine is my number one tool for fixing this stuff. If you recognise the problems I’ve described, you should give it a good look. It’s worth watching the videos to get an idea of its potential.

Anyway, Google Refine has a built-in toTitlecase() function. I’m a perfectionist, and I like to take care to get people’s names right. So, for me, if you’ll permit me to write some code:

toTitleCase("FOTHERINGTON-THOMAS") = "Fotherington-thomas" //FAIL

Google Refine’s toTitlecase behaves as above - in other words, it doesn’t treat a hyphen as a word boundary. Today, I came up with a not-too-complicated way to make it work better with hyphenated names.

replace(toTitlecase(replace("FOTHERINGTON-THOMAS","-",".")),".","-") = "Fotherington-Thomas" // SUCCESS

This works by replacing hyphens with full stops, then calling toTitlecase, then putting the hyphens back. toTitlecase treats full stops as word boundaries. Here’s what to put into Google Refine’s “transform” box:

replace(toTitlecase(replace(value,"-",".")),".","-")

Tags: recommendation, data, code