Using OpenRefine

Overview of this book

Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value. Using OpenRefine takes you on a practical and actionable tour through this popular data transformation tool. Packed with cookbook-style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone who wants to maximize the value of their data. This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show how to extract links from keyword and full-text fields using reconciliation and named-entity extraction. Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.
Table of Contents (13 chapters)
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Preface

Data is often dubbed the new gold, as it is of tremendous value for today's data-driven economy. However, we prefer to think of data as diamonds. At first they're raw, but with great skill, they can be polished to become the shiny assets that are so valuable to us. This is precisely what this book covers: how your dataset can be transformed in OpenRefine so you can optimize its quality for real-world (re)use.

As the vast amount of functionality in OpenRefine can be overwhelming to new users, we are convinced that a decent manual can make the difference. This book will guide you from your very first steps to really advanced operations that you probably didn't know were possible. We will spend time on all the different aspects of OpenRefine, so in the end, you will have obtained the necessary skills to revive your own datasets. This book starts out with cleaning the data to fix small errors, and ends by linking your dataset to others so it can become part of a larger data ecosystem.

We realize that every dataset is different, yet learning is easiest by example. This is why we have chosen the Powerhouse Museum dataset to demonstrate the techniques in this book. However, since not all steps apply to your dataset, we have structured the different tasks as recipes. Just like in a regular cookbook, you can pick the recipes you need for what you want to achieve. Some recipes depend on each other, but this is indicated at the start of each chapter.

In addition, the example dataset in this book illustrates a healthy data culture; the people at Powerhouse decided to bring it online even though they were aware that there were still some quality issues. Interestingly, that didn't stop them from doing it, and in fact, it shouldn't stop you; the important thing is to get the data out. Since then, the data quality has significantly improved, but we're providing you with the old version so you can perform the cleaning and linking yourself.

We are confident this book will explain all the tools necessary to help you get your data in the best possible shape. As soon as you master the skill of polishing, the raw data diamonds you have right now will become shiny diamonds.

Have fun learning OpenRefine!

Ruben and Max.

What this book covers

Chapter 1, Diving Into OpenRefine, teaches you the basic steps of OpenRefine, showing you how to import a dataset and how to get around in the main interface.

Chapter 2, Analyzing and Fixing Data, explains how you can get to know your dataset and how to spot errors in it. You'll also learn several techniques to repair mistakes.

Chapter 3, Advanced Data Operations, dives deeper into dataset repair, demonstrating some of the more sophisticated data operations OpenRefine has to offer.

Chapter 4, Linking Datasets, connects your dataset to others through reconciliation of single terms and with named-entity recognition on full-text fields.

Appendix, Regular Expressions and GREL, introduces you to advanced pattern matching and the General Refine Expression Language.

What you need for this book

This book does not assume any prior knowledge; we'll even guide you through the installation of OpenRefine in Chapter 1, Diving Into OpenRefine.

Who this book is for

This book is for anybody who is working with data, particularly large datasets. If you've been wondering how you can gain an insight into the issues within your data, increase its quality, or link it to other datasets, then this book is for you.

No prior knowledge of OpenRefine is assumed, but if you've worked with OpenRefine before, you'll still be able to learn new things in this book. We cover several advanced techniques in the later chapters, with Chapter 4, Linking Datasets, entirely devoted to linking your dataset.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Program code inside text is shown as follows: "The expression that transforms the reconciled cell to its URL is cell.recon.match.id".

New terms are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "After clicking on OK, you will see a new column with the corresponding URLs".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example files

You can download the raw data and OpenRefine projects to follow along with the recipes in the book. Each chapter has its own example file which can be downloaded from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.