Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Maintaining consistency with synonym maps


One common problem with data is inconsistency. Sometimes a value is capitalized, and sometimes it is not. Sometimes it is abbreviated, and sometimes it is spelled out in full. At times, it is simply misspelled.

When the values come from an open domain, such as words in a free-text field, the problem can be quite difficult. However, when the data represents a limited vocabulary (such as US state names, our example here), there's a simple trick that can help. While it's common to use full state names, the standard two-letter postal abbreviations are also often used. A mapping from common forms or mistakes to a normalized form is an easy way to fix variants in a field.
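To make the idea concrete, here is a minimal sketch of such a mapping. The names example-synonyms and normalize-example, and the handful of entries in the map, are hypothetical and chosen only for illustration; they are not necessarily the definitions used later in this recipe. The sketch assumes that keys are stored in upper case and that any value without an entry should pass through in its upper-cased form.

```clojure
(require '[clojure.string :as string])

;; Hypothetical synonym map: keys are upper-cased variants,
;; values are the normalized postal abbreviations.
;; Only a few illustrative entries are included.
(def example-synonyms
  {"VIRGINIA" "VA"
   "VIRGINA"  "VA"   ; a plausible misspelling
   "VA."      "VA"
   "MARYLAND" "MD"
   "MD."      "MD"})

(defn normalize-example
  "Trims and upper-cases value, then looks it up in example-synonyms.
  A value with no entry is returned in its trimmed, upper-cased form."
  [value]
  (let [k (string/upper-case (string/trim value))]
    (get example-synonyms k k)))

(map normalize-example ["Virginia" "va" "Virgina" " md. " "Ohio"])
;; => ("VA" "VA" "VA" "MD" "OHIO")
```

Upper-casing before the lookup means the map only has to list one capitalization of each variant, and using the input as the default for get leaves unknown values intact rather than turning them into nil.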

Getting ready

For the project.clj file, we'll use a very simple configuration:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

We just need to make sure that the clojure.string/upper-case function is available to us:

(require '[clojure.string :refer [upper-case]])

How to do it…

  1. For this recipe, we'll define...