Using OpenRefine

Overview of this book

Data today is like gold, but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed: data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once the data is clean, you can start finding value in it. Using OpenRefine takes you on a practical, actionable tour of this popular data transformation tool. Packed with cookbook-style recipes that will help you get to grips with data, this book is an accessible tutorial for anyone who wants to maximize the value of their data. It will teach you the skills needed to handle any large dataset and turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how to solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show how to extract links from keyword and full-text fields using reconciliation and named-entity extraction. Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.

Recipe 3 – exploring your data


In this recipe, you will get to know your data by scanning the different zones of the interface, which give access to the total number of rows/records, the various display options, the column headers and menus, and the actual cell contents.

Once your dataset has been loaded, you will access the main interface of OpenRefine as shown in the following screenshot:

Four zones are seen on this screen; let's go through them from top to bottom, numbered as 1 to 4 in the preceding screenshot:

  1. Total number of rows: If you remembered to specify that quotation marks are to be ignored (see Recipe 2 – creating a new project), you should see a total of 75814 rows from the Powerhouse file. When the data are filtered on a given criterion, this bar will display something like 123 matching rows (75814 total).

  2. Display options: Try alternating between rows and records by clicking on either word. Admittedly, not much will change, except that zone 1 may now read 75814 records. The number of rows is always equal to the number of records in a new project, but the two can diverge from then on. This zone also lets you choose whether to display 5, 10, 25, or 50 rows/records per page, and it provides the links for navigating from page to page.

  3. Column headers and menus: Here you will find the first row, which was parsed as column headers when the project was created. In the Powerhouse dataset, the columns read Record ID, Object Title, Registration Number, and so on (if you deselected the Parse next 1 line as column headers option box, you will see Column 1, Column 2, and so on instead). The leftmost column is always called All and is divided into three subcolumns containing stars (to mark good records, for instance), flags (to mark bad records, for instance), and IDs. Starred and flagged rows can easily be faceted, as we will see in Chapter 2, Analyzing and Fixing Data. Every column also has a menu (see the following screenshot) that can be accessed by clicking on the small dropdown to the left of the column header.

  4. Cell contents: This is the main area, which displays the actual values of the cells.
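
The difference between rows and records in zone 2 can be illustrated outside OpenRefine. The sketch below is only an approximation under a stated assumption: that a record starts at every row whose first cell is non-blank, and that rows with a blank first cell belong to the preceding record. The function name `count_rows_and_records` and the sample data are hypothetical and not taken from the Powerhouse file.

```python
def count_rows_and_records(rows):
    """Return (row_count, record_count) for a list of rows.

    Assumption: a new record starts at every row whose first cell is
    non-blank; rows with a blank first cell continue the previous record.
    """
    row_count = len(rows)
    record_count = sum(1 for row in rows if row and (row[0] or "").strip())
    return row_count, record_count


# Hypothetical data: a fresh project where every row has its own ID.
fresh = [["7", "Lamp"], ["8", "Chair"], ["9", "Clock"]]

# Hypothetical data after one cell was split over two rows:
# the blank first cell ties the second row to record 7.
after_split = [["7", "Lamp"], ["", "Lantern"], ["8", "Chair"], ["9", "Clock"]]

print(count_rows_and_records(fresh))        # (3, 3)
print(count_rows_and_records(after_split))  # (4, 3)
```

In the first case the two counts agree, as they do in a new project; in the second they no longer match, which is why zone 1 can report different totals depending on the mode selected.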

Before starting to profile and clean your data, it is important to get to know them well and to be at ease with OpenRefine: have a look at each column (using the horizontal scrollbar) to verify that the column headers have been parsed correctly, that the cell types were guessed correctly, and so on. Change the number of rows displayed per page to 50 and go through a few pages to check that the values are consistent (ideally, you should already have done so during the preview, before creating your project). When you feel sufficiently familiar with the interface, you can move along to the next recipe.
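
The same checks can be approximated in a short script, which is sometimes handy for comparing against what OpenRefine displays. The following is a minimal sketch and not part of OpenRefine: it assumes a tab-separated file with one header line, treats quotation marks as ordinary characters (as in Recipe 2), and uses the hypothetical helper names `load_rows`, `page`, and `summary`. The sample data is made up.

```python
import csv
import io


def load_rows(stream):
    """Parse tab-separated text, treating quotation marks as plain characters."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = next(reader)          # first line parsed as column headers
    rows = [row for row in reader]  # remaining lines are data rows
    return headers, rows


def page(rows, number, size=50):
    """Return page `number` (1-based) with `size` rows per page."""
    start = (number - 1) * size
    return rows[start:start + size]


def summary(rows, predicate):
    """Mimic the zone 1 message shown when a filter is active."""
    matching = sum(1 for row in rows if predicate(row))
    return f"{matching} matching rows ({len(rows)} total)"


# Tiny made-up sample standing in for a real file.
sample = io.StringIO(
    'Record ID\tObject Title\n'
    '7\tLamp with "brass" base\n'
    '8\tChair\n'
    '9\tClock\n'
)
headers, rows = load_rows(sample)
print(headers)                                # ['Record ID', 'Object Title']
print(page(rows, 1, size=2))                  # the first two data rows
print(summary(rows, lambda r: "C" in r[1]))   # 2 matching rows (3 total)
```

To try it on a real export, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points to your own file.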