Book Image

Clojure Data Analysis Cookbook - Second Edition

By : Eric Richard Rochester
Book Image

Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Overview of this book

Table of Contents (19 chapters)
Clojure Data Analysis Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Identifying and removing duplicate data


One problem when cleaning up data is dealing with duplicates. How do we find them? What do we do with them once we have them? While a part of this process can be automated, often merging duplicated data is a manual task, because a person has to look at potential matches and determine whether they are duplicates or not and determining what needs to be done with the overlapping data. We can code heuristics, of course, but at some point, a person needs to make the final call.

The first question that needs to be answered is what constitutes identity for the data. If you have two items of data, which fields do you have to look at in order to determine whether they are duplicates? Then, you must determine how close they need to be.

For this recipe, we'll examine some data and decide on duplicates by doing a fuzzy comparison of the name fields. We'll simply return all of the pairs that appear to be duplicates.

Getting ready

First, we need to add the library to...