The true power of regular expressions emerges if we can use them not only for finding data, but also for manipulating data. GREL enables that and a lot of other functionality as well. It is a simple language designed for easily manipulating values. It consists of a set of built-in functions as well as several variables that are available through OpenRefine. We've briefly touched upon it in several recipes and we will show you the way to build your own GREL expressions here.
One place where GREL comes in handy especially is when you need to transform the values in your cells. You can either change cells in-place (by navigating to the respective column dropdown and then to Edit cells | Transform…) or make a new column with the transformed values (by navigating to the respective column dropdown and then to Edit column | Add column based on this column…). We will try some examples on the Object Title column here.
When you open any of the transformation dialogs, you see a first GREL expression. It is simply called value
and gives back the original value of the cell. That is not really useful, but we can start from there to learn some tricks. For instance, the GREL expression "TITLE: " + value + "."
adds the text TITLE:
before the value and appends a dot to the end. Note that literal text must be surrounded by single or double quotes so OpenRefine doesn't confuse it with commands. You can see the preview get updated as you build the expression.
Something more useful is to change some text. For instance, we can replace the string stone
by stones
in each cell using value.replace("stone", "stones")
. The replace
function is called here on the value with the dot notation. We're telling OpenRefine to take the cell value
and to replace all occurrences of stone
with stones
.
It gets all the more powerful if we combine this with regular expressions. As you can see, some of the cell values start with a number. We can remove this number by doing value.replace(/^\d+ /, "")
. Note how the regular expression is enclosed in slashes //
, whereas the replacement string is enclosed in double quotes ""
.
We can even go further by using groups. Not only are those used within regular expressions to treat several characters as a single unit, they also save the matched contents for later reuse. Therefore, we can replace the numbers at the start with something more meaningful. For instance: 234
becomes Object 234:
. This is possible with the following expression: value.replace(/^(\d+) /, "Object $1: ")
. You identify a group with a dollar sign followed by its number. The first group is called $1
, and the special identifier $0
is used for the result of the entire expression.
The functionality is only limited by the complexity of your expressions. A more difficult example we have noticed before is that cells in the multi-valued Categories column sometimes contain empty values. For instance, a value such as "|Stones||Bones|||Buildings"
actually only contains three values, but if you split on the vertical bar, you will get seven of them (including four empty ones) because there are six divider bars. We can get rid of all these fake bars. Bars at the beginning must be removed, so we first do value.replace(/^\|/, "")
. This looks a little more complicated than it is because the bar has to be escaped as it has a special meaning in regular expressions. To change multiple occurrences into a single one, we can do value.replace(/\|+/, "|")
. Note how the vertical bar is escaped in the regular expression, but not in the replacement string, as it has no special meaning there.
OpenRefine has other methods than replace
, as you can see in the useful Help tab of the expression editor, or in the overview at https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions. One example is the split
method. When used on the Categories field, we can count how many different values are present in each cell as follows: value.split("|").length()
. Here, we instruct OpenRefine to split the value whenever it encounters a vertical bar, and then uses the length
function to count the number of resulting values. You can even build more complicated expressions such as value.split("|").uniques().join("|")
, which, as we've seen in Chapter 3, Advanced Data Operations, removes duplicate values from the field.
Now it is time to teach you something that we have been hiding for a long time: every time you create a facet, you actually execute a GREL expression. For instance, let's create a simple facet by clicking on the Object Title dropdown and navigating to Facet | Customized Facets | Facet by blank. Now click on the change link at the top-right of the newly created facet. OpenRefine will reveal the underlying expression just underneath the facet title and will show you a dialog box to customize the facet as shown in the following screenshot:
We see that the expression for a blank facet is isBlank(value)
. Indeed, the isBlank
function returns true
if its value is blank and false
if its value is not. We change this now to a different facet, for instance, if we want to know whether or not the title starts with the number 1. The expression for that would be value.startsWith("1")
. That gives us a facet where true starts with 1 and false does not start with 1 (and (error) indicates that the value was empty).
Note
This is a great opportunity to learn about GREL. As every facet has an underlying GREL expression, you can study them to learn how they work. For instance, a Duplicates facet will teach you the function facetCount
. If you're in doubt, remember that all dialogs that allow you to edit expressions also have a Help tab with an overview and explanation of all available functions.
If you want a facet of your own, you don't have to start with any of the pre-made facets. For instance, if we want to filter on the number of categories, we can create a custom facet by going to the Categories dropdown and clicking on Facet, Custom text facet…. Enter the expression value.split("|").length()
and click on OK. You see a new Categories facet on the left-hand side which lets you choose the number of categories. If, like here, the output of your expression is a number, you might want to create a numeric facet instead. Follow the same steps, but navigate to Facet | Custom numeric facet… this time. This allows you to filter your records in a numeric way. Custom facets are a great way to explore your dataset in detail without touching your data.
Finally, some knowledge of GREL comes in handy in those places where something doesn't quite work the way you expect. For instance, we saw in Chapter 2, Analyzing and Fixing Data, that the duplicates facet works fine on the Registration Number column, but not on Record ID. Let's see if we can fix that. Create a duplicates facet by clicking on the Record ID dropdown and navigating to Facets | Customized facets | Duplicates facet.
Note
Be sure to try this with the original dataset that has not been cleaned, otherwise the duplicates have already been removed.
As you might recall, OpenRefine does not find duplicates. So, let's edit the expression to update that. When we click on the Change link, we see the original expression:
facetCount(value, 'value', 'Record ID') > 1
The problem here is that value
is a number and facetCount
only works with strings. The solution is to convert value
to a string, like this:
facetCount(value.toString(), 'value', 'Record ID') > 1
This will correctly indicate the duplicate rows in the facet.
You might be wondering: how could I ever have found this out myself? The secret is practice and exploration. The more datasets you analyze, clean, and link in OpenRefine, the better you will get at it. Also, you'll gain more and more experience with regular expressions and GREL as you gradually begin to master the basics. From that point onwards, it is a small step to the things that are increasingly complex. This book has tried to show you the direction; now it's time to go your own way.