Book Image

Using OpenRefine

Book Image

Using OpenRefine

Overview of this book

Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value. Using OpenRefine takes you on a practical and actionable through this popular data transformation tool. Packed with cookbook style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone that wants to maximize the value of their data. This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show extract links from keyword and full-text fields using reconciliation and named-entity extraction. Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.
Table of Contents (13 chapters)
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Recipe 5 – using the project history


In this recipe, you will learn how you can go back in time at any point in the project's life and how to navigate through the history even if your project has been closed and opened again.

A very useful feature of OpenRefine is its handling of the history of all modifications that affected the data since the creation of the project. In practice, this means that you should never be afraid to try out things: do feel free at all times to fiddle with your data and to apply any transformation that crosses your mind, since everything can be undone in case you realize that it was a mistake (even if months have passed since the change was made).

To access the project history, click on the Undo / Redo tab in the top-left of the screen, just next to the Facet / Filter one, as shown in the following screenshot:

In order to turn back the clock, click on the last step that you want to be maintained. For instance, to cancel the removal of the column Provenance (Production) and all subsequent steps, click on 2. Rename column Description. to Description. Step 2 will be highlighted, and steps 3 to 5 will be grayed out. This means that the renaming will be kept, but not the next three steps. To cancel all changes and recover the data as they were before any transformation was made, click on 0. Create project. To undo the cancellation (redo), click on the step up to which you want to restore the history: for instance, click on 4. Reorder rows to apply steps 3 and 4 again, while maintaining the suppression of step 5 (rows removing).

Be cautious, however, that going back and doing something else will erase all subsequent steps. For instance, if you go back from step 5 to step 2 and then choose to move the column Description on the left, step 3 will now read 3. Move column description to position 1 and the gray steps in the preceding screenshot will disappear for good: you cannot have two conflicting histories recorded at the same time. Be sure to experiment with this in order to avoid nasty surprises in the future.

It is important to notice that only the operations actually affecting data are listed in the project history. Visual aids such as switching between the rows and records view, displaying less or more records on a page, or collapsing and expanding columns again, are not really transformations and are therefore not saved. A consequence is that they will be lost from one session to the other: when you come back to a project that was previously closed, all columns will be expanded again, whereas renamed and removed columns will still be the way you left them last time, along with every other operation stored in the project history. In Chapter 2, Analyzing and Fixing Data, we will see that this is also true for other types of operations: while cell and column transformations are registered in the history, filters and facets are not.

Note that operation history can also be extracted in JSON format by clicking on the Extract... button just under Undo / Redo. This will allow you to select the steps you want to extract (note that only reusable operations can be extracted, which excludes operations performed on specific cells), which will then be converted into JSON automatically and can then be copy-pasted. Steps 1 and 2 from the preceding screenshot would be expressed as:

[
  {
    "op": "core/column-move",
    "description": "Move column Registration Number to position 1",
    "columnName": "Registration Number",
    "index": 1
  },
  {
    "op": "core/column-rename",
    "description": "Rename column Description. to Description",
    "oldColumnName": "Description.",
    "newColumnName": "Description"
  }
]

Tip

Downloading the example files

You can download the example files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

In the preceding code, op stands for operation, description actually describes what the operation does, and other variables are parameters passed to the operation (function). Steps that were previously saved as JSON in a text file can subsequently be reapplied to the same or another project by clicking on the Apply... button and pasting the extracted JSON history in the text area. Finally, in case you have performed several hundreds of operations and are at a loss to find some specific step, you can use the Filter field to restrict the history to the steps matching a text string. A filter on remove, for instance (or even on rem), would limit the displayed history to steps 3 and 5.