Often, when we deal with real-world data analytics, we face a reality that is as annoying as it is ubiquitous: data can be dirty. The format of text and numbers, the order of rows and columns, the presence of undesired data points, and the lack of some expected values are all possible glitches that can slow down or even jeopardize the process of creating value from data. Indeed, the lower the quality of the input data, the less useful the resulting output will be. This inconvenient truth is often summarized with the acronym GIGO: Garbage In, Garbage Out. As a consequence, one of the preliminary phases of a data analytics workflow is Data Cleaning: the process of systematically identifying and correcting inaccurate or corrupt data points. Let's learn how to build a full set of data cleaning steps in KNIME through a realistic example.
In this tutorial, we are going to clean a table that captures information on the users of an e-commerce website, such as name, age, email address, available credit, and so on. This table has been generated by pulling all the available raw data directly from the web server. Our ultimate objective is to create a clean list of contactable users, which we can leverage as a mailing list for sending email newsletters. Since the list of users constantly changes (users subscribe and unsubscribe every day), we want to build a KNIME workflow that systematically cleans the latest data for us every time we want to update our mailing list:

Figure 2.14: The raw data: we certainly have some cleaning chores ahead
As you can see from Figure 2.14, a first look at the raw table unveils a series of data quality flaws to be looked after. For instance:
- Some rows are exact duplicates of others.
- First names and surnames show inconsistent capitalization and stray whitespace at the beginning or end of the string.
- Some email addresses are clearly invalid (missing the @ symbol or the full domain), making the respective users not contactable.
- Some credit values are negative, which makes no sense for this field.
- Some values are simply missing. In KNIME, missing values are indicated with a red question mark symbol, ?. For reference, in computer science, a missing value is referred to with the expression NULL.
We have an Excel file (DirtyData.xlsx) with an excerpt of the raw data, showing samples of all those issues listed above. By using this file as a base, we can build a KNIME workflow that polishes the data and exports a good-looking and ready-to-use mailing list. Let's do this one step at a time:
As a first step, we create a new, empty workflow and name it Cleaning data.
Excel Reader
This node (IO > Read) opens Excel files, reads the content of the specified worksheet, and makes it available as a table at its output port. In the main tab of the configuration dialog, after indicating which file or folder to open (click on Browse... to change it), you can specify the worksheet to consider in the Sheet selection section: by default, the node will read the first sheet available in the workbook, but you can indicate the name of a specific sheet or its position instead. If your sheet includes column headers, you can ask KNIME to use them as column names in the resulting table: in the Column Header section, you can select which row contains the headers. You can also restrict the reading to a portion of the sheet by specifying the range of columns and rows to read within the Sheet area section. You can check whether the node is configured correctly by looking at the bottom of the window, which gives you a preview of what KNIME is reading from the file:

Figure 2.15: Configuration of the Excel Reader node: select file, sheets, and areas to read
If you want to apply some transformations (like renaming columns, reordering them, and so on) as the data gets read, you can use the Transformation tab, which works the same as in the CSV Reader node we have already met.
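If it helps to see the same operation as code, here is a minimal pandas sketch of what the Excel Reader does with our file (pandas is not part of the KNIME workflow; the settings shown mirror the node defaults described above):

```python
import pandas as pd

# Read the first worksheet of DirtyData.xlsx, using the first row as
# column headers, mirroring the Excel Reader defaults described above.
raw = pd.read_excel("DirtyData.xlsx", sheet_name=0, header=0)

print(raw.head())  # a quick preview, like the one at the bottom of the dialog
```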
To remove the duplicated rows, we can use a new node that does exactly that: the Duplicate Row Filter.
Duplicate Row Filter
This node (Manipulation > Row > Filter) identifies rows having the same values in selected columns and manages them accordingly. In the first tab of the configuration window, you select which columns should be considered in the search for duplicates.
If more than one column is selected, the node will treat as duplicates only those rows that share exactly the same values across all the selected columns. In the configuration of many KNIME nodes, we will be asked to select a subset of columns, so it makes sense to spend some time becoming acquainted with the interface:
By default, you select columns manually, moving them between the red Exclude box on the left and the green Include box on the right using the arrow buttons in the middle. If you prefer to select columns by a pattern in their name (for example, all columns whose name starts with A) or by type (integers, decimal numbers, strings, and so on), you can pick the other options available on the radio selector on top (Wildcard/Regex Selection or Type Selection). The second tab of the configuration window (titled Advanced) lets you decide what to do with the duplicate rows once identified (by default, they get removed, but you can also keep them and add an extra column specifying whether they are duplicates or not) and which row should be kept among the duplicates (by default, the first row is kept and all others are removed, but other strategies are available):
Figure 2.16: Configuration of the Duplicate Row Filter: select which columns to use for detecting duplicate rows
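As a scripted analogue, this is roughly the logic of the Duplicate Row Filter in pandas (a sketch only; searching for duplicates on __Email_Entered is an assumption based on the step that follows):

```python
# Keep the first occurrence of every duplicate group and drop the rest,
# which is the Duplicate Row Filter's default behavior.
deduped = raw.drop_duplicates(subset=["__Email_Entered"], keep="first")
```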
In the first tab of the configuration dialog, we select the columns to consider in the search for duplicates, making sure that __Email_Entered appears among the included columns on the right. We click on OK and run the node (F7). We can now proceed to fix the formatting of names and surnames. To do so, we will start using a very versatile node for working on textual data, called String Manipulation.
String Manipulation
This node (Manipulation > Column > Convert & Replace) applies transformations to strings, making it possible to reformat textual data as needed. The node includes a large set of pre-built functions for text manipulation, such as replacement, capitalization, and concatenation, among others:

Figure 2.17: String Manipulation: build your text transformation selecting functions and columns to use
The configuration window provides several panels: the Column List on the left shows the columns available at the input port, while the Function box lists all the pre-built functions, together with a short description of each, such as upperCase(), which converts a string into all-capital letters. When you double-click on a function here, it will get added to your expression. Likewise, when you double-click on a column name, it will be added to the expression wrapped in dollar signs ($) on either side to indicate a column. The Expression editor in the middle is where the full transformation gets composed. Table 2.1 summarizes the most useful functions available within this node.
| Function | Description | Example | Result |
|---|---|---|---|
| strip(x) | Removes any whitespace from the beginning and the end of a string. | strip("  hello  ") | "hello" |
| upperCase(x), lowerCase(x) | Converts all characters to upper or lower case. | upperCase("hello") | "HELLO" |
| capitalize(x) | Converts the first letters of all words in a string to upper case. | capitalize("john smith") | "John Smith" |
| compare(x,y) | Compares two strings and returns 0 if they are equal and -1 or 1 if they differ, depending on their alphabetical sorting. | compare("abc","abd") | -1 |
| replace(x,y,z) | Replaces all occurrences of substring y within x with z. | replace("hello","ll","LL") | "heLLo" |
| removeChars(x,y) | Removes from string x all characters included in y. | removeChars("hello","lo") | "he" |
| join(x,y,...) | Concatenates any number of strings into a single string. | join("data"," ","cleaning") | "data cleaning" |
| length(x) | Counts the number of characters in a string. | length("hello") | 5 |
Table 2.1: Useful functions within String Manipulation
This node is perfect for our needs as we have a few strings to manipulate. We need to fix the capitalization of names and surnames, remove those bad-looking whitespaces, and create a new column with the full name:
1. The first column to fix is First name. We want to see a nice upper-case character at the beginning of every word, and we also require whitespaces to be stripped from both ends of the string. We add a String Manipulation node and build the expression by double-clicking first on capitalize() and strip() in the Function box and then on First name in the Column List. By clicking in this order, we should have obtained the expression capitalize(strip($First name$)), which is exactly what we wanted. In this case, we want to substitute the raw version of the first name with the result of this expression, so we need to select Replace column and then First name. We are all set, so we can click on OK and close the window.
2. We repeat the same operation with a second String Manipulation node, this time using Surname instead of First name. Just make sure that both the expression and the Replace column setting refer to Surname this time.
3. To create a new column holding the full name, we add a third String Manipulation node and use the join() function. Let's double-click first on join() in the Function box and then on First name in the Column List. Since we want names and surnames to be separated by a blank space, we need to add this character to the expression by typing the sequence ," ", in the expression box just after $First name$. We complete the expression by double-clicking on the column Surname, and we are done. The overall expression should be: join($First name$," ",$Surname$). Before closing, we need to decide where to store the result. This time we want to create a new column, so we select Append and then type the name of the new column, which could be Full name. Click on OK and check the results.
Since, in the end, we are going to keep only the Full name column, we could have combined the last three nodes into a single one. In fact, Full name can be created at once with the expression join(capitalize(strip($First name$))," ",capitalize(strip($Surname$))).
We took the longer route to get some practice with the node. It's up to you to decide which version to keep in your workflow.
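For reference, the combined expression translates almost one-to-one into pandas string methods (a sketch only; KNIME's capitalize() behaves like Python's str.title()):

```python
# Strip surrounding whitespace, capitalize every word, then join the parts.
clean = deduped.copy()
clean["First name"] = clean["First name"].str.strip().str.title()
clean["Surname"] = clean["Surname"].str.strip().str.title()
clean["Full name"] = clean["First name"] + " " + clean["Surname"]
```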
With all names fixed, we can move on to the next hurdle and remove the ill-formatted email addresses. It's time to introduce a new node that will be ubiquitous in our future KNIME workflows: Row Filter.
Row Filter
This node (Manipulation > Row > Filter) filters rows according to the criteria you specify. Such criteria can either be based on the values of a specific column (like all strings starting with A or all numbers greater than 5.2) or on the position of the row in the table (for instance, only the top 20 rows). To configure the node, you first specify the type of criterion you would like to apply using the selector on the left. You also need to specify whether the rows that match your criterion should be kept (Include rows...) or dropped, keeping all the others (Exclude rows...). You have multiple ways to specify the criteria behind your filtering:
The first way is to test the values of a column against a matching pattern. After selecting the column to test, you type the pattern in the matching criteria box, for instance maria: this will match rows like "MARIA" or "Maria," unless you check the case sensitive match option, which would consider lower and upper cases as different. Another option is to use wild cards in your search pattern (remember to tick contains wild cards): in this case, the star character "*" will stand for any sequence of characters (so "M*" selects all names starting with "M," like "Mary" and "Mario"), while the question mark "?" will match any single character ("H?" refers to any string of two characters starting with "H," so it will include "Hi" and exclude "Hello"). If you want to implement more complex searches, you can also use the powerful Regular Expressions (RegEx), which offer great flexibility in setting criteria.
The second way is range checking on the values of a column: you set a lower and an upper bound, and all rows whose value falls within the interval are matched. Remember that bounds are always considered as included in the interval. If you want to exclude the endpoint of an interval, you need to reverse the logic of your filtering. For instance, if you want to include all non-zero, positive numbers, you need to select the option Exclude rows by attribute value and set 0 as the upper bound.
The third way is to filter rows by their position in the table, specifying a range of row numbers. For example, by entering 1 in the First row number selector and 10 in Last row number, you will match only the top 10 rows of the table. If you want to match all the rows after a certain position, like from the 100th onwards, you can set the threshold in the first selector (100) and tick the check box below (to the end of the table).
Figure 2.18: Configuration dialog for Row Filter: specify which rows to keep or remove from your table
If your filtering criteria require testing several columns, you can chain multiple instances of this node, each one looking at a different column. An alternative is the Rule-based Row Filter node, which lets you define several filtering rules at once. Other nodes, such as Row Filter (Labs) and Rule-based Row Filter (Dictionary), can perform more sophisticated filtering if needed. Check them out if you are curious.
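To make the three kinds of criteria concrete, here is how each would look in pandas (illustrative only; the tested columns and thresholds are arbitrary examples):

```python
# Pattern matching, range checking, and filtering by position:
by_pattern = clean[clean["Full name"].str.startswith("M")]  # like the wild card "M*"
by_range = clean[clean["Age"] >= 18]                        # bounds are included
by_position = clean.head(20)                                # the top 20 rows
```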
Let's see our new node in action straight away as we filter out all the email addresses that do not look valid:
1. We add a Row Filter node and connect it to the output of the last String Manipulation node. In the configuration dialog, we choose to include rows by attribute value and select the column to test: __Email_Entered. One simple pattern we can use for checking the validity of an email address is the wild card expression *@*.*. This will check for all strings that have at least an @ symbol followed by a dot . with some text in between. This is not going to be the most thorough validity check for email addresses, but it will certainly spot the ones that are clearly irregular, and it is good enough for us at this stage. Remember to tick the contains wild cards checkbox and click OK to move on.
2. We add a second Row Filter node to remove the users showing a negative credit. This time, we select the _Credit column and use range checking: by setting 0 as Lower bound, we are good to go, since all negative values will be filtered out. We can click OK and move on to the next challenge.
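Both filtering steps, sketched in pandas (the regex below is a loose translation of the wild card *@*.*):

```python
# An @ followed by at least one character and a dot, loosely matching *@*.*;
# na=False makes rows with a missing email fail the test, like in KNIME.
valid_email = clean["__Email_Entered"].str.contains(r"@.+\.", na=False)
filtered = clean[valid_email & (clean["_Credit"] >= 0)]
```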
At this point, we want to manage the little red question marks appearing here and there in the table, signaling that some values are missing. Also in this case, KNIME offers a powerful node to manage the situation quickly, with a couple of clicks.
Missing Value
This node (Manipulation > Column > Transform) handles missing values (NULLs) in a table, offering multiple methods for imputing the best available replacement. In the first tab of the configuration window (Default), you can define a default treatment for all columns of a given data type (string, integer, and double) by selecting it in the dropdown menus. The second tab (Column settings) allows you to set a specific strategy for each individual column: double-click on the name of the column in the list on the left and set the strategy through the menu that appears.
Unless you have a large number of columns that you want to treat with the same missing value strategy, it's best to be explicit and use the second tab. That way you only impute missing values for the precise columns specified.
You have a vast list of possible methods to treat your missing values. The most useful ones are:
- Fix Value: replaces every missing entry with a constant value that you specify.
- Mean, Median, Most Frequent Value: replace missing entries with a statistic computed on the rest of the column.
- Previous Value, Next Value: replace missing entries with the closest non-missing value found above or below in the same column.
- Remove Row*: drops the rows containing missing values altogether.
If you substitute missing values with the median of a numeric column, your imputed values are going to sit "in the middle" of the existing distribution, making your imputation less disruptive and more robust to outliers. Of course, this will depend on your business case and on the actual distribution of the data, but it's worth giving this approach a try.

Figure 2.19: Configuration of Missing Value: decide how to manage the empty spots of your table
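In pandas terms, the two strategies we are about to apply look like this (a sketch only; Age and Logins are the affected columns of our table):

```python
imputed = filtered.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())  # Median strategy
imputed["Logins"] = imputed["Logins"].fillna(0)                  # Fix Value: 0
```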
Going back to our case, we noticed that we have two columns displaying some question marks. Let's manage them appropriately by leveraging the Missing Value node:
1. We add a Missing Value node and open its configuration. In the second tab, for the column Age (double-click on it in the list on the left), we select Median: by doing so, we will assign to the users missing an age a value that is not "far off" from the age that most users tend to have in our table.
2. When it comes to the number of times users have logged in (the Logins column), we assume that the lack of a value means that they haven't logged in yet. So the best strategy to select is Fix Value, keeping 0 as the default value for all. We can click on OK and close the dialog.
The only steps left ahead of us are of an aesthetic nature: we want to drop the columns we don't need, sort the remaining ones, and give them more intuitive names before finally saving the output file. We are going to need a few more nodes to complete this last bit.
Column Filter
This node (Manipulation > Column > Filter) drops unneeded columns from a table. The only required configuration step is to select which columns to keep at the output port (the green box on the right) and which ones to filter out (the red box on the left):

Figure 2.20: Configuration of Column Filter: which columns would you like to keep?
We add a Column Filter node and remove the columns we no longer need (First name, Surname, and IP_Address) by moving them onto the left panel.
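The pandas analogue of this step (a sketch):

```python
# Drop the three columns we no longer need.
kept = imputed.drop(columns=["First name", "Surname", "IP_Address"])
```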
Column Rename
This node (Manipulation > Column > Convert & Replace) lets you change the names and the data types of columns. To configure it, double-click on the column you would like to edit (you'll find a list on the left) and tick the Change box: you will then be able to enter the new name in the box beside it. To change the data type of a column and convert all its values, you can use the drop-down menu on the right, which is prepopulated with the list of data types each column can be safely converted into:

Figure 2.21: Configuration of Column Rename: pick the best names for your columns
We add a Column Rename node and fix the unappealing column names: __Email_Entered, which can become simply Email, and _Credit, which can be renamed to Credit.
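The same renaming, sketched in pandas:

```python
# Give the two remaining ugly columns friendlier names.
renamed = kept.rename(columns={"__Email_Entered": "Email", "_Credit": "Credit"})
```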
Column Resorter
This node (Manipulation > Column > Transform) changes the order of the columns in a table. In the configuration window, you will find, on the left, all the columns available at the input port and, on the right, a series of buttons to move them around. Select the column you wish to move and then use the buttons to shift it up or down, place it first or last in the table, or sort all columns in alphabetical order. If new columns appear at the input port (imagine the case where your source file comes in with some additional columns), they will be placed where the <any unknown new column> placeholder lies:

Figure 2.22: Configuration of Column Resorter: shuffle your columns to the desired order
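In pandas, reordering columns is a simple reindexing (the final column order shown here is an assumption for illustration):

```python
# Select the columns in the order we want them to appear.
ordered = renamed[["Full name", "Email", "Age", "Credit", "Logins"]]
```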
CSV Writer
This node (IO > Write) saves the input data table as a CSV file, either on the local disk or in a remote location. The only required configuration step is to specify the full path of the file to create: you can click on the Browse... button to select the desired folder. The optional configuration steps let you change the format of the resulting CSV file, such as the column delimiter (Format section), keep or remove the column headers as the first row (Write column header), and compress the newly generated file in .gzip format to save space on disk (go to the Advanced Settings tab for this):

Figure 2.23: Configuration of CSV Writer: save your table as a text file
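And the pandas analogue of the CSV Writer (the output file name is an assumption):

```python
# Write the table to disk; index=False leaves out pandas' row index.
ordered.to_csv("MailingList.csv", index=False)
```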
Well done for completing your second data workflow! Building a clean mailing list out of a messy raw dataset required a dozen nodes and some of our time, but the effort was certainly worth it. Now we can clean any number of records whenever we like by just re-running the same workflow, making sure that the name of the input file and its path stay the same. To do so, you just need to reset the workflow (right-click on its name in the Explorer on the left and then click on Reset, or select the first node and press F8) and execute it again (the simplest way is to press Shift + F7 on your keyboard, or right-click on the last node and select Execute):

Figure 2.24: The full data cleaning workflow: twelve nodes to make our user data spotless