Book Image

Using OpenRefine

Book Image

Using OpenRefine

Overview of this book

Data today is like gold - but how can you manage your most valuable assets? Managing large datasets used to be a task for specialists, but the game has changed - data analysis is an open playing field. Messy data is now in your hands! With OpenRefine the task is a little easier, as it provides you with the necessary tools for cleaning and presenting even the most complex data. Once it's clean, that's when you can start finding value. Using OpenRefine takes you on a practical and actionable through this popular data transformation tool. Packed with cookbook style recipes that will help you properly get to grips with data, this book is an accessible tutorial for anyone that wants to maximize the value of their data. This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show extract links from keyword and full-text fields using reconciliation and named-entity extraction. Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.
Table of Contents (13 chapters)
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Recipe 7 – going for more memory


In this last recipe, you will learn how to allocate more memory to the application in order to deal with larger datasets.

For large datasets, you might find that OpenRefine is performing slowly or shows you OutOfMemory errors. This is a sign that you should allocate more memory to the OpenRefine process. Unfortunately, this is a bit more complicated than the other things we have done so far, as it involves a bit of low-level fiddling. But don't worry: we'll guide you through it. The steps are different for each platform. A word of caution: the maximum amount of memory you can assign depends on the amount of RAM in your machine and whether you are using the 32 bit or 64 bit version of Java. When in doubt, try to increase the amount of memory gradually (for instance, in steps of 1024 MB) and check the result first.

Windows

On Windows, you will have to edit the openrefine.l4j.ini file in OpenRefine's main folder. Find the line that starts with -Xmx (which is Java speak for "maximum heap size"), which will show the default allocated memory: 1024M (meaning 1024 MB or 1 GB). Increase this as you see fit, for instance to 2048 M. The new settings will be in effect the next time you start OpenRefine.

Mac

The instructions for Mac are a bit more complicated, as this operating system hides the configuration files from sight. After closing OpenRefine, hold control and click on its icon, selecting Show package contents from the pop-up menu. Then, open the info.plist file from the Contents folder. You should now see a list of OpenRefine settings. Navigate to the Java settings and edit the value of VMOptions (these are the properties of the Java Virtual Machine). Look for the part that starts with -Xmx and change its default value of 1024 M to the desired amount of memory, for instance, -Xmx2048M.

Linux

This might come in as a surprise, but increasing allocated memory is easiest in Linux. Instead of starting OpenRefine with ./refine as you usually would do, just type in ./refine -m 2048M, where 2048 is the desired amount of memory in MB. To make the change permanent, you can create an alias in the hidden .bashrc file located in your home folder by adding the following line at the end of the file:

alias refine='cd path_to_refine ; ./refine -m 2048M'

Here, path_to_refine is the relative path from your home folder to the OpenRefine folder. Then, the next time you start OpenRefine with ./refine, it will be allocated 2 GB by default.