Book Image

Using OpenRefine

Book Image

Using OpenRefine

Overview of this book

Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value. Using OpenRefine takes you on a practical and actionable through this popular data transformation tool. Packed with cookbook style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone that wants to maximize the value of their data. This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show extract links from keyword and full-text fields using reconciliation and named-entity extraction. Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.
Table of Contents (13 chapters)
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Regular expressions for text patterns


OpenRefine offers many ways to find the data you need. But what if you don't know the exact text, or more commonly with large datasets, what if the text slightly varies from cell-to-cell? While finding all cells with a value that contains the letter a is easy, finding values that contain a number (or worse, the letter a followed by a number) is more difficult. This is where regular expressions come in handy.

The purpose of a regular expression is to define a pattern of text instead of a precise chunk of text. It allows you to say things such as "a letter followed by a number" and many more complex things. Regular expressions are built from character classes that represent the characters you are looking for woven together by quantifiers, anchors, choices, and groups.

OpenRefine allows you to try out regular expressions in an easy way. Create a text filter on any column (the Object Title column in the Powerhouse Museum dataset is especially interesting) by clicking on its dropdown and choosing Text filter. In the newly created filter, tick the regular expression checkbox and you're ready to go (refer to the following screenshot). The data will now be filtered according to the expression you enter:

Character classes

Regular expressions use character classes (letters, numbers, whitespace) as an indication for the character you are looking for. For instance, if you are looking for letters or numbers, you can type them directly:

  • The pattern Aar will look for all texts that contain a capital A followed by a small a and r. If the case sensitive box is not ticked, the capitalization will not matter.

  • The pattern 123 finds all texts that contain this number. Note that texts with the number 41235 are also matched, since 123 is a textual part of that.

  • As you can see, this does not differ from regular text matching yet. With regular expressions, we can also say that we expect any letter or any number. We can give a set of characters from which can be chosen by surrounding the alternatives with square brackets []. We can indicate a character range in those brackets by separating the limits with a hyphen.

  • The pattern [0123456789] matches all texts that contain a number.

  • The pattern [0-9] does the same, only it is more condense. It says everything between 0 and 9 is fine.

  • Similarly, the pattern [a-z] matches all lowercase letters (and also uppercase if the case-sensitive option is not used). [A-Z] matches uppercase letters and [a-zA-Z] matches all letters, regardless of the case-sensitive option.

  • If you want to look for numbers and letters, use [0-9a-zA-Z]. Read this as "every character between 0 and 9, between a and z, or between A and Z".

  • As in the earlier examples, we can put several matchers next to each other. analy[sz]e will match both American (analyze) and British (analyse) spellings.

  • Similarly, the expression [0-9][a-z] will find numbers directly followed by at least one letter. The expression [a-z][0-9][0-9][0-9] finds letters directly followed by at least three numbers.

  • If you want to find measurements such as 1 cm or 25 in, you will not find them with the preceding expressions because of the space in between. Luckily, a space is also a symbol, so the expression [0-9] [a-z] will find them. Note the space in between the bracket pairs, and note the fact that this will even find longer measurements such as 12345 meters. Indeed, the expression matches a number followed by a space and a letter, so in this case, 5 m. It doesn't matter whether other numbers or letters are also present.

  • If you're just looking for measurements in inches, the expression [0-9] in will do the job.

  • It might be a bit cumbersome to write [0-9] in full every time. This is why shorthand character classes were introduced. They are symbols that stand for a set of characters. For instance, the symbol \d stands for any digit and is thus equivalent to [0-9], but a lot faster to write. The opposite is \D, which means anything that is not a digit and includes letters as well as symbols. Unfortunately, there is no shorthand for any letter, but there is a shorthand for any letter, number, or underscore, which is \w, which stands for any word character. Again, the opposite is \W, which matches for all non-letters and non-numbers that are not underscores either.

  • \d [a-z] will again match for texts with measurements: any number followed by a space and a letter.

  • \d\d\d matches texts with numbers with a length of at least three.

  • \D\D matches texts that have at least two consecutive non-numbers. If you're surprised to see texts with numbers as well, think about it: the expression only says that the text should contain a non-number followed by a non-number; it doesn't say that there should be no numbers at all. However, if a text field only contains numbers, it won't be matched. To rephrase, the expression means "I'm looking for two non-numbers in a row, do you have them?"

  • You might wonder how \D works. Does it then translate to a huge set that includes everything but numbers? The answer is more simple: using a caret ^ as the first character in braces means none of the characters should be present in the pattern. Therefore, \D stands for [^0-9].

  • The expression [^a-zA-Z0-9] looks for texts that contain something which is not a letter or number. Most texts will match, as for instance spaces and punctuation marks are still allowed. However, empty fields will not match, as there must be at least one character which is not a letter or number.

  • The expression [^a-zA-Z0-9]\d[^a-zA-Z0-9] means any digit surrounded by non-letters or non-numbers. For instance, we see that single digits in parentheses are matched. Note that items such as (123) will not be matched, but (3) will, as we explicitly say that it should be a single digit.

If you really want to match any character, the dot . is what you need. This symbol matches anything (except newlines, which are \n). While that might not seem useful by itself, it is very useful when combined with other character classes:

  • The pattern a.a.a matches any text where an a is followed by any character, another a, any character, and another a. For instance, texts with the words dulcamara, alabaster, and salamander will match.

  • The pattern 19.. (there is a space before and after) will match years in the 20th century.

  • However, you should be careful with the dot: the last pattern also matches 19th and 19M$, because a dot really means anything. If a year is what you want, 19\d\d is more accurate.

Now, you might wonder what to do if you want to match an actual dot, actual square brackets, or an actual backslash followed by a "d". The trick is to tell in the regular expression when you don't want a symbol to be recognized as something special. For letters, numbers, spaces, and a few other characters, this is the default. Characters that have a special meaning need to be escaped by preceding them with a backslash \. A literal dot is thus represented as \.; the backslash itself as \\.

  • To find texts that have three consecutive dots in them, use \.\.\.. (This can actually be done in a more handy way, as we'll see in the next section.)

  • To find texts with a backslash, use \\. (There are none in the dataset.)

  • Texts that contain an opening or a closing square bracket are found with [\[\]]. This looks complicated, but it is quite logical: the first and last bracket say "choose any of the things within". And the things within are actual square brackets, but they have to be escaped, so they become \[ and \].

  • With [2], you find texts that contain the number 2 ("choose any of the things within", and the only thing within is a 2). With \[2\], you find texts with the number 2 surrounded in square brackets, as the brackets are escaped this time and thus have no special meaning anymore.

Quantifiers

So far, we've seen ways in regular expressions to indicate that we want to encounter certain characters. However, we cannot yet express if we want to see characters a certain number of times without this number being known in advance. For instance, how can we say "a number within single quotes"? We could start with '\d', but that would show us only texts that contain 0, 1, 5, and so on, but not multidigit numbers such as 23 or 478. Indeed, computers follow the expression slavishly: one single quote, one digit, one single quote.

Quantifiers can express repetition. There are three simple quantifiers: a plus sign +, which means one or more times, an asterisk *, which means zero or more times, and a question mark ?, which means zero or one time. They only exert an effect on the symbol that is directly to the left of them:

  • bre+d matches texts that contain bred, breed, or breeed with any number of e, as long as at least one e is present.

  • bre*d matches texts that contain brd, bred, breed, or breeed with any number of e, even without. Note how brd is matched with the asterisk, but not with the plus sign.

  • bre?d matches brd and bred, but not breed or any other number of es. Put simply, it makes the e optional.

  • Combinations are also possible. br?e+d matches bed, bred, beed, breed, beeed, breeed, and so on.

  • In addition, you can be explicit about the number of times you would like a symbol to occur by using curly braces {min,max}. You can specify the minimum and maximum number of times it can occur. Either one can be empty, in which case there is no minimum or maximum. Leave the comma in so OpenRefine knows whether you specified the minimum or the maximum. If you just supply a single number without a comma, the exact number of times will be matched.

  • N\d{5,8}, matches texts that contain a code that starts with N followed by five, six, seven, or eight digits, and then a comma.

  • N\d{5,}, matches texts that contain a code that starts with N, at least five digits, and then a comma.

  • N\d{,8}, matches texts which contain a code that starts with N, at most eight digits, and then a comma.

  • N\d{5}, matches texts which contain a code that starts with N, exactly five digits, and then a comma. This is equivalent to N\d\d\d\d\d but much more compact (especially if numbers get large).

  • Needless to say, the quantifiers and braces are special symbols, so if you want to match them literally, you should escape them with a backslash. For instance, question marks can be matched with \?.

Anchors

Sometimes, you don't only want to say how many characters to match, but also where they should be matched. This is possible with anchors . A caret ^ indicates that the match should happen at the beginning of a line, while a dollar sign $ indicates that the match should stop at the end of a line. (Don't confuse this with the caret inside square brackets to indicate a negation; it has a different meaning if it occurs outside of brackets.) Additionally, we can indicate that the match should begin or end at a word boundary with the anchor \b:

  • ^\d matches all texts that begin with a number.

  • \d$ matches all texts that end with a number.

  • ^\d.*\d$ matches all texts that begin and end with a number. Read this expression as: start of the text, a number, zero or more times any character (the dot), a number, end of the text. If we would use ^\d+$ instead, we would have all texts that contain only a number (of any length).

  • \b\d{3}\b searches for texts that contain at least one number of exactly three digits, since the \b anchor matches at word boundaries. If the text contains four-digit numbers but no three-digit numbers, it does not match. (If we would remove the \b anchors, it would.)

  • ^\d{3}\b finds texts that start with a three-digit number.

Choices

We have already seen a mechanism for indicating choices. Square brackets give regular expressions the possibility to choose one of the many characters: [a-z123] searches for texts with at least one lowercase letter or any of the digits 1, 2, and 3. Often, the choices are larger, as they can be several letters or words long. For this, the or operator | was introduced. It works as an OR operator between several alternatives:

  • glass|wood|steel will match texts that contain glass, wood, or steel.

  • \d+|many|few matches either a number, many or few.

  • N\d{5},|N\d{8}, matches either five-digit or eight-digit numbers that start with an N and end with a comma, but nothing in between (so no six-digit numbers for instance).

Groups

The only things we haven't covered so far are groups. If you want to use quantifiers on a group of characters instead of on a single character, you have to enclose them in parentheses ():

  • While la+ matches texts that contain la, laa, laaa, and so on, the expression (la)+ finds texts that contain la, lala, lalala, and so on.

  • The expression analyz|se would match texts that contain analyz and texts that contain se. This is probably not that useful. On the other hand, the expression analy(z|s)e matches both analyze and analyse. Note that in this case, this is equivalent to analy[zs]e because the choice consists of a single letter. However, this would not be the case with analyz(e|ing), which would match analyze and analyzing.

As parentheses have a special meaning, they need to be escaped as well. \( and \) match parenthesis literally.