Book Image

Using OpenRefine

Book Image

Using OpenRefine

Overview of this book

Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value. Using OpenRefine takes you on a practical and actionable through this popular data transformation tool. Packed with cookbook style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone that wants to maximize the value of their data. This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show extract links from keyword and full-text fields using reconciliation and named-entity extraction. Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.
Table of Contents (13 chapters)
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

General Refine Expression Language (GREL)


The true power of regular expressions emerges if we can use them not only for finding data, but also for manipulating data. GREL enables that and a lot of other functionality as well. It is a simple language designed for easily manipulating values. It consists of a set of built-in functions as well as several variables that are available through OpenRefine. We've briefly touched upon it in several recipes and we will show you the way to build your own GREL expressions here.

Transforming data

One place where GREL comes in handy especially is when you need to transform the values in your cells. You can either change cells in-place (by navigating to the respective column dropdown and then to Edit cells | Transform…) or make a new column with the transformed values (by navigating to the respective column dropdown and then to Edit column | Add column based on this column…). We will try some examples on the Object Title column here.

When you open any of the transformation dialogs, you see a first GREL expression. It is simply called value and gives back the original value of the cell. That is not really useful, but we can start from there to learn some tricks. For instance, the GREL expression "TITLE: " + value + "." adds the text TITLE: before the value and appends a dot to the end. Note that literal text must be surrounded by single or double quotes so OpenRefine doesn't confuse it with commands. You can see the preview get updated as you build the expression.

Something more useful is to change some text. For instance, we can replace the string stone by stones in each cell using value.replace("stone", "stones"). The replace function is called here on the value with the dot notation. We're telling OpenRefine to take the cell value and to replace all occurrences of stone with stones.

It gets all the more powerful if we combine this with regular expressions. As you can see, some of the cell values start with a number. We can remove this number by doing value.replace(/^\d+ /, ""). Note how the regular expression is enclosed in slashes //, whereas the replacement string is enclosed in double quotes "".

We can even go further by using groups. Not only are those used within regular expressions to treat several characters as a single unit, they also save the matched contents for later reuse. Therefore, we can replace the numbers at the start with something more meaningful. For instance: 234 becomes Object 234:. This is possible with the following expression: value.replace(/^(\d+) /, "Object $1: "). You identify a group with a dollar sign followed by its number. The first group is called $1, and the special identifier $0 is used for the result of the entire expression.

The functionality is only limited by the complexity of your expressions. A more difficult example we have noticed before is that cells in the multi-valued Categories column sometimes contain empty values. For instance, a value such as "|Stones||Bones|||Buildings" actually only contains three values, but if you split on the vertical bar, you will get seven of them (including four empty ones) because there are six divider bars. We can get rid of all these fake bars. Bars at the beginning must be removed, so we first do value.replace(/^\|/, ""). This looks a little more complicated than it is because the bar has to be escaped as it has a special meaning in regular expressions. To change multiple occurrences into a single one, we can do value.replace(/\|+/, "|"). Note how the vertical bar is escaped in the regular expression, but not in the replacement string, as it has no special meaning there.

OpenRefine has other methods than replace, as you can see in the useful Help tab of the expression editor, or in the overview at https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions. One example is the split method. When used on the Categories field, we can count how many different values are present in each cell as follows: value.split("|").length(). Here, we instruct OpenRefine to split the value whenever it encounters a vertical bar, and then uses the length function to count the number of resulting values. You can even build more complicated expressions such as value.split("|").uniques().join("|"), which, as we've seen in Chapter 3, Advanced Data Operations, removes duplicate values from the field.

Creating custom facets

Now it is time to teach you something that we have been hiding for a long time: every time you create a facet, you actually execute a GREL expression. For instance, let's create a simple facet by clicking on the Object Title dropdown and navigating to Facet | Customized Facets | Facet by blank. Now click on the change link at the top-right of the newly created facet. OpenRefine will reveal the underlying expression just underneath the facet title and will show you a dialog box to customize the facet as shown in the following screenshot:

We see that the expression for a blank facet is isBlank(value). Indeed, the isBlank function returns true if its value is blank and false if its value is not. We change this now to a different facet, for instance, if we want to know whether or not the title starts with the number 1. The expression for that would be value.startsWith("1"). That gives us a facet where true starts with 1 and false does not start with 1 (and (error) indicates that the value was empty).

Note

This is a great opportunity to learn about GREL. As every facet has an underlying GREL expression, you can study them to learn how they work. For instance, a Duplicates facet will teach you the function facetCount. If you're in doubt, remember that all dialogs that allow you to edit expressions also have a Help tab with an overview and explanation of all available functions.

If you want a facet of your own, you don't have to start with any of the pre-made facets. For instance, if we want to filter on the number of categories, we can create a custom facet by going to the Categories dropdown and clicking on Facet, Custom text facet…. Enter the expression value.split("|").length() and click on OK. You see a new Categories facet on the left-hand side which lets you choose the number of categories. If, like here, the output of your expression is a number, you might want to create a numeric facet instead. Follow the same steps, but navigate to Facet | Custom numeric facet… this time. This allows you to filter your records in a numeric way. Custom facets are a great way to explore your dataset in detail without touching your data.

Solving problems with GREL

Finally, some knowledge of GREL comes in handy in those places where something doesn't quite work the way you expect. For instance, we saw in Chapter 2, Analyzing and Fixing Data, that the duplicates facet works fine on the Registration Number column, but not on Record ID. Let's see if we can fix that. Create a duplicates facet by clicking on the Record ID dropdown and navigating to Facets | Customized facets | Duplicates facet.

Note

Be sure to try this with the original dataset that has not been cleaned, otherwise the duplicates have already been removed.

As you might recall, OpenRefine does not find duplicates. So, let's edit the expression to update that. When we click on the Change link, we see the original expression:

facetCount(value, 'value', 'Record ID') > 1

The problem here is that value is a number and facetCount only works with strings. The solution is to convert value to a string, like this:

facetCount(value.toString(), 'value', 'Record ID') > 1

This will correctly indicate the duplicate rows in the facet.

You might be wondering: how could I ever have found this out myself? The secret is practice and exploration. The more datasets you analyze, clean, and link in OpenRefine, the better you will get at it. Also, you'll gain more and more experience with regular expressions and GREL as you gradually begin to master the basics. From that point onwards, it is a small step to the things that are increasingly complex. This book has tried to show you the direction; now it's time to go your own way.