Learning Haskell Data Analysis

By: James Church

Overview of this book

Haskell is trending in the field of data science by providing a powerful platform for robust data science practices. This book provides you with the skills to handle large amounts of data, even if that data is in a less than perfect state. Each chapter in the book helps to build a small library of code that will be used to solve a problem for that chapter. The book starts with creating databases out of existing datasets, cleaning that data, and interacting with databases within Haskell in order to produce charts for publications. It then moves towards more theoretical concepts that are fundamental to introductory data analysis, but in the context of a real-world problem with real-world data. As you progress in the book, you will be relying on code from previous chapters in order to help create new solutions quickly. By the end of the book, you will be able to manipulate, find, and analyze large and small sets of data using your own Haskell libraries.

A crash course in regular expressions


A regular expression is made up of atoms. An unmodified atom in a regular expression must match one, and exactly one, instance of a matching sequence in a string in order to satisfy the expression. When two or more unmodified atoms appear consecutively in an expression (such as Jim used in Chapter 3, Cleaning Our Datasets), the sequence of J, followed immediately by i, which is then followed immediately by m, must appear somewhere in the string. This behavior is similar to the Find feature found in most text editors and word processors, and it is the point from which regular expressions begin to differ from a simple substring search. The examples in this appendix assume that the =~ operator is in scope in GHCi (for instance, through import Text.Regex.Posix from the regex-posix package). The expression Jim can be tested using the following statements:

> ("My name is Jim." =~ "Jim") :: Bool
True
> ("My name is Frank." =~ "Jim") :: Bool
False
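
The same Boolean test can be lifted out of GHCi and applied to a whole list of strings. The following is a minimal sketch, assuming the regex-posix package is installed; containing is a hypothetical helper name chosen for illustration and is not part of any library:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: keep only the strings in which the pattern is found.
containing :: String -> [String] -> [String]
containing pat = filter (\s -> (s =~ pat :: Bool))

main :: IO ()
main = print (containing "Jim" ["My name is Jim.", "My name is Frank.", "Jimmy"])
-- prints ["My name is Jim.","Jimmy"]
```

Note that Jimmy is kept as well, since the sequence J, i, m only has to appear somewhere in the string.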

The three repetition modifiers

Atoms can be modified. In our initial examples, we see our modifiers acting on a single character, but (as demonstrated later) an atom may be a single character or a grouping of characters. There are three primary modifiers in regular expressions. The * modifier means that an atom may be found zero or more times in a string. For example, a* can match a, aaaaaa, rabbit, and even cow. The + modifier means that an atom should be found one or more times. The a+ expression can match a, aaaaa, and rabbit, but it will never match cow. Finally, there is ?, which means that an atom may exist zero times or once (think of it as the maybe modifier). These modifiers are known as greedy modifiers, which means that they try to gobble up as much of the string with an atom as possible. The * and ? modifiers may both match with 0 instances of an atom. Since zero instances of any atom can be found in every string, an atom modified by * or ? will always return a successful match unless it is joined by something else.

When using the * modifier on its own, it is important to remember that the atom will always have a match, so the expression will never be evaluated as False:

> ("a" =~ "a*") :: Bool
True
> ("aaaaaaaa" =~ "a*") :: Bool
True
> ("rabbit" =~ "a*") :: Bool
True
> ("cow" =~ "a*") :: Bool
True
> ("" =~ "a*") :: Bool
True

When using the + modifier, the atom must match at least one character in order to be True:

> ("a" =~ "a+") :: Bool
True
> ("aaaaaaaa" =~ "a+") :: Bool
True
> ("rabbit" =~ "a+") :: Bool
True
> ("cow" =~ "a+") :: Bool
False
> ("" =~ "a+") :: Bool
False
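
To see what the modifiers actually consume, we can ask for the matched text instead of a Bool. The following is a minimal sketch, assuming the regex-posix package, where requesting a String result yields the text of the first match (or the empty string when nothing matches); firstMatch is a hypothetical helper name:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: the text of the first match, or "" if there is none.
firstMatch :: String -> String -> String
firstMatch pat s = s =~ pat

main :: IO ()
main = do
  print (firstMatch "a+" "aaaaaaaa")  -- greedy: the whole run of a's is consumed
  print (firstMatch "a+" "rabbit")    -- the single a after the r
  print (firstMatch "a*" "rabbit")    -- zero a's, found at the very start of the string
```

The last line illustrates why a* is always True: the engine is satisfied by an empty match before it ever reaches the a in rabbit.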

Depending on where you live in the world, the agreed-upon correct spelling of a term that is a synonym of hue can be color or colour. Using the ? modifier, we can craft a regular expression that evaluates to True for both spellings and to False for incorrect spellings. Like an atom modified by *, an atom modified by ? always succeeds on its own, either by matching the intended atom (and thus consuming the matched character) or by matching nothing. The last expression in the following example evaluates to False because of the other atoms in the expression:

> ("color" =~ "colou?r") :: Bool
True
> ("colour" =~ "colou?r") :: Bool
True
> ("coluor" =~ "colou?r") :: Bool
False

Anchors

Regular expressions can be anchored to either of the two ends of a string. The symbols for the start and end of the string are ^ and $, respectively. Let's examine a regular expression that matches strings containing grand, but only at the beginning of the string:

> ("grandmother" =~ "^grand") :: Bool
True
> ("hundred grand" =~ "^grand") :: Bool
False

Likewise, we can anchor an expression to the end of a string. Let's examine the expression to match words ending with the -ing suffix:

> ("writing" =~ "ing$") :: Bool
True
> ("zingers" =~ "ing$") :: Bool
False

We can use ^ and $ together when the expression must stretch across the entire string. When using the anchors together, the regular expression engine expects the regular expression to match from the start of the string to the end of the string. The first atom that fails to match will cause the entire expression to be evaluated as False. The last example in the following set evaluates to True because the empty string consists of exactly 0 occurrences of a:

> ("a" =~ "^a*$") :: Bool
True
> ("aaaaaaaa" =~ "^a*$") :: Bool
True
> ("rabbit" =~ "^a*$") :: Bool
False
> ("cow" =~ "^a*$") :: Bool
False
> ("" =~ "^a*$") :: Bool
True
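
Anchoring both ends is a common way to validate a whole field rather than search inside it. The following is a minimal sketch, assuming the regex-posix package and single-line input; matchesWhole is a hypothetical helper name. It wraps the supplied pattern in ^( and )$ (parentheses are covered under Groups) so that both anchors apply to the complete pattern:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: True only when the pattern accounts for the whole string.
-- Assumes the string contains no newline characters.
matchesWhole :: String -> String -> Bool
matchesWhole pat s = s =~ ("^(" ++ pat ++ ")$")

main :: IO ()
main = mapM_ (print . matchesWhole "a*") ["a", "aaaaaaaa", "rabbit", "cow", ""]
```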

The dot

The dot (or the period) is a special atom that allows us to match any one character. By modifying it with * to make .*, we have crafted an expression that will match everything. To craft an expression that matches only a period, you have to escape the dot with a \ for the regular expression engine. Since Haskell interprets backslash escapes in string literals before the string is passed to the regular expression engine, we have to escape the \ with a second \:

> ("." =~ ".") :: Bool
True
> ("a" =~ ".") :: Bool
True
> ("." =~ "\\.") :: Bool
True
> ("a" =~ "\\.") :: Bool
False
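
Forgetting to escape the dot is an easy mistake to make when matching file names. The following is a minimal sketch, assuming the regex-posix package; hasCsvExtension is a hypothetical helper name used only for illustration:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: does the name end in a literal ".csv"?
hasCsvExtension :: String -> Bool
hasCsvExtension name = name =~ "\\.csv$"

main :: IO ()
main = do
  print (hasCsvExtension "data.csv")    -- True
  print (hasCsvExtension "datacsv")     -- False: there is no literal dot
  print ("datacsv" =~ ".csv$" :: Bool)  -- True: the bare dot matches the second a
```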

Character classes

Characters can be grouped into character classes using the square brackets, [ and ]. A character class is a collection of characters, exactly one of which must match. Both grey and gray are considered correct spellings. We can craft a regular expression to match both spellings:

> ("grey" =~ "gr[ae]y") :: Bool
True
> ("gray" =~ "gr[ae]y") :: Bool
True
> ("graey" =~ "gr[ae]y") :: Bool
False

By beginning a character class with ^, we create the complement of a character class. That is, a character will match because it is not found in that character class. For example, we can check to see whether a word doesn't contain any vowels (counting only a, e, i, o, and u) using a regular expression. This requires us to modify the character class so that it matches at least one character using + and to anchor it to the beginning and end of the string:

> ("rabbit" =~ "^[^aeiou]+$") :: Bool
False
> ("cow" =~ "^[^aeiou]+$") :: Bool
False
> ("why" =~ "^[^aeiou]+$") :: Bool
True

Character classes can also support a range of characters. Rather than spelling out a character class such as [abcdefghijklmnopqrstuvwxyz] to match a lowercase letter, it is clearer to write [a-z]. The [A-Z] range works for uppercase letters and [0-9] works for digits:

> ("a" =~ "[a-z]") :: Bool
True
> ("A" =~ "[a-z]") :: Bool
False
> ("s" =~ "[A-Z]") :: Bool
False
> ("S" =~ "[A-Z]") :: Bool
True
> ("S" =~ "[a-zA-Z]") :: Bool
True
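
Ranges combine naturally with + and the anchors when checking that a field holds only one kind of character. The following is a minimal sketch, assuming the regex-posix package and single-line input; isAllDigits is a hypothetical helper name:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: True when the string is one or more digits and nothing else.
-- Assumes the string contains no newline characters.
isAllDigits :: String -> Bool
isAllDigits s = s =~ "^[0-9]+$"

main :: IO ()
main = print (filter isAllDigits ["2015", "20a5", "", "007"])
-- prints ["2015","007"]
```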

Groups

Atoms can be grouped together using parentheses, and these groups can be modified like any other atom. Therefore, the regular expression (row, )+row your boat will match the entire string of row, row, row your boat. An added benefit of grouping is that, in engines that support it, the text matched by a group can be referred to later on in the same expression (support for this feature varies between engines). The following statements show the modified group at work:

> ("row your boat" =~ "(row, )+row your boat") :: Bool
False
> ("row, row your boat" =~ "(row, )+row your boat") :: Bool
True
> ("row, row, row your boat" =~ "(row, )+row your boat") :: Bool
True
> ("row, row, row, row your boat" =~ "(row, )+row your boat") :: Bool
True

The lyrics to this song are row, row, row your boat. We can enforce that there are exactly three row words in our string (two followed by a comma and one without). To do so, we need our anchors and the {} modifier, which enforces an explicit number of repetitions:

> ("row your boat" =~ "^(row, ){2}row your boat$") :: Bool
False
> ("row, row your boat" =~ "^(row, ){2}row your boat$") :: Bool
False
> ("row, row, row your boat" =~ "^(row, ){2}row your boat$") :: Bool
True
> ("row, row, row, row your boat" =~ "^(row, ){2}row your boat$") :: Bool
False
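
Groups also record the text that they matched, which is handy for pulling fields out of a string. The following is a minimal sketch, assuming the regex-posix package, where requesting a [[String]] result returns one list per match containing the full match followed by the text of each group; dateFields is a hypothetical helper name and the date layout is only an illustration:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: split a YYYY-MM-DD string into year, month, and day,
-- or return Nothing when the string does not have that shape.
dateFields :: String -> Maybe (String, String, String)
dateFields s =
  case (s =~ "^([0-9]{4})-([0-9]{2})-([0-9]{2})$" :: [[String]]) of
    [[_, y, m, d]] -> Just (y, m, d)  -- one match: full text plus three groups
    _              -> Nothing

main :: IO ()
main = do
  print (dateFields "2015-03-14")     -- Just ("2015","03","14")
  print (dateFields "14 March 2015")  -- Nothing
```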

Alternations

Alternations can be used any time we want to match one expression or another. For example, suppose we wish to create a regular expression to match the year of birth of someone who is still alive. At the time of writing this book, the oldest living person was born in 1899. We would like to craft a regular expression to match any birth year from 1899 through 2099. We can do this with an alternation.

Using the pipe character |, we can say that the regular expression A|B|C must match A, B, or C in order to be evaluated as True. Now, we must craft three separate regular expressions for the year 1899, any year in the 1900s, and any year in the 2000s. Because | has lower precedence than the anchors, we wrap the alternatives in parentheses so that ^ and $ apply to all three of them rather than only to the first and the last:

> ("1898" =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$") :: Bool
False
> ("1899" =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$") :: Bool
True
> ("1900" =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$") :: Bool
True
> ("1999" =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$") :: Bool
True
> ("2015" =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$") :: Bool
True
> ("2115" =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$") :: Bool
False
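
Alternation binds more loosely than anchors and concatenation, so anchors written outside a group attach only to the first and last alternatives. The following is a minimal sketch, assuming the regex-posix package, that compares an ungrouped and a grouped form of the birth-year expression; looseYear and strictYear are hypothetical names:

```haskell
import Text.Regex.Posix ((=~))

-- Ungrouped: ^ applies only to 1899 and $ only to the 20xx alternative.
looseYear :: String -> Bool
looseYear s = s =~ "^1899|19[0-9][0-9]|20[0-9][0-9]$"

-- Grouped: both anchors apply to every alternative.
strictYear :: String -> Bool
strictYear s = s =~ "^(1899|19[0-9][0-9]|20[0-9][0-9])$"

main :: IO ()
main = do
  print (looseYear "born in 1950, probably")   -- True: 19[0-9][0-9] is unanchored
  print (strictYear "born in 1950, probably")  -- False
  print (strictYear "1950")                    -- True
```

Both forms agree on strings that hold nothing but four digits. They differ only when the year is surrounded by other text.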

A note on regular expressions

Regular expressions are defined by their engine, and every regular expression engine has differences. We did our best in this appendix to include features of regular expressions that are common across most engines. (When creating this appendix, we discovered differences between the regex-posix package on Windows and Linux.) Some excellent resources to learn about regular expressions include Mastering Regular Expressions by Jeffrey Friedl (O'Reilly Media) and Mastering Python Regular Expressions by Felix Lopez and Victor Romero (Packt Publishing).

Regular expressions should be avoided whenever possible. They are difficult to read, debug, and test, and are prone to being slow. Sometimes, a parser is a better solution than a regular expression due to the recursive nature of some text. If you find a regular expression on the Internet that you intend to use in your code, be sure to test it thoroughly. If you can find a function that does something similar to your needs without having to craft a regular expression, I recommend that you use the function instead.

Having said why you shouldn't use regular expressions, I believe that crafting expressions to match patterns of text provides a fun intellectual challenge. A simple regular expression can help you find features in your datasets that would be awkward to locate with a simple substring search.