KNIME terminologies


It is important to be able to discuss your ideas and problems using common terms. This makes it easier to reach your goal, and others will appreciate work that is easy to understand. This section introduces the main concepts of KNIME.

Organizing your work

In KNIME, you store your files in a workspace. When KNIME starts, you can specify which workspace you want to use. Workspaces are not just for files; they also contain settings and logs. It might be a good idea to set up an empty, preconfigured workspace; then, instead of customizing a new one each time you start a new project, you just copy (or extract) it to the place you want to use and open it with KNIME (or switch to it).

The workspace can contain workflow groups (sometimes referred to as workflow sets) or workflows. The groups are like folders in a filesystem; they help you organize your workflows. Workflows are your programs and processes: they describe the steps that should be applied to load, analyze, visualize, or transform your data, something like an execution plan. Workflows contain the executable parts and can be edited using the workflow editor, which is similar to a canvas. Both the groups and the workflows might have metadata associated with them, such as the creation date, author, or comments (even the workspace can contain such information).

Besides the previously introduced metadata, workflows might contain nodes, meta nodes, connections, workflow variables (or just flow variables), workflow credentials, and annotations.

Workflow credentials are the place where you can store the login names and passwords for different connections. These are kept safe, but you can access them easily.

Tip

It is safe to share a workflow if you use only the workflow credentials for sensitive information (although the user name will be saved).

Nodes

Each node has a type, which identifies the algorithm associated with the node. You can think of the type as a template: it specifies what is executed for the different inputs and parameters, and what the result should be. Nodes are similar to functions (or operators) in programs.

The node types are organized according to the following general types, which specify the color and the shape of the node for easier understanding of workflows. The general types are shown in the following image:

Example representation of different general types of nodes

The nodes are organized in categories; this way, it is easier to find them.

Each node has node documentation that describes what can be achieved using that type of node, possibly along with use cases or tips. It also contains information about the parameters and the possible input ports and output ports. (Sometimes the last two are called inports and outports, or even in-ports and out-ports.)

Parameters are usually single values (for example, a filename, a column name, a text, a number, a date, and so on) associated with an identifier, although having an array of texts is also possible. These are the settings that influence the execution of a node. There are other things that can modify the results, such as workflow variables or any other state observable from KNIME.
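To illustrate the idea, here is a minimal sketch of how node parameters typically look in the KNIME SDK, assuming the org.knime.core.node.defaultnodesettings classes; the configuration keys, default values, and the ParameterSketch class are made up for illustration:

import org.knime.core.node.NodeSettingsWO;
import org.knime.core.node.defaultnodesettings.SettingsModelIntegerBounded;
import org.knime.core.node.defaultnodesettings.SettingsModelString;

final class ParameterSketch {
    // Each parameter is a single value associated with an identifier (the configuration key)
    private final SettingsModelString m_fileName =
        new SettingsModelString("fileName", "input.csv");
    private final SettingsModelIntegerBounded m_rowLimit =
        new SettingsModelIntegerBounded("rowLimit", 1000, 0, Integer.MAX_VALUE);

    // A node model would persist these values into the workflow like this
    void saveSettingsTo(final NodeSettingsWO settings) {
        m_fileName.saveSettingsTo(settings);
        m_rowLimit.saveSettingsTo(settings);
    }
}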

Node lifecycle

Nodes can have any of the following states:

  • Misconfigured (also called IDLE)

  • Configured

  • Queued for execution

  • Running

  • Executed

Warnings can be present in most of these states, and they might be important; you can read them by moving the mouse pointer over the warning triangle sign.

Meta nodes

Meta nodes look like normal nodes at first sight, but they contain other nodes (or meta nodes) inside them. The context associated with a meta node might offer options for special execution. Usually, meta nodes help to keep your workflow organized and less intimidating at first sight.

A user-defined meta node

Ports

Ports are where data, in some form, flows from one node to another. The most common port type is the data table; these ports are represented by white triangles. The input ports (where data is expected to enter) are on the left-hand side of a node, while the output ports (where the created data comes out) are on the right-hand side. You cannot mix and match different kinds of ports. It is also not allowed to connect a node's output to its own input or to create cycles in the graph of nodes; you have to create a loop if you want to achieve something similar to that.

Note

Currently, all ports in the standard KNIME distribution present their results only when they are complete, although the infrastructure already allows other strategies, such as streaming, where you can view partial results too.

The ports might contain information about the data even if their nodes are not yet executed.

Data tables

The data table is the most common port type. It is similar to an Excel sheet or a table in a database. Sometimes it is called an example set or a data frame.

Each data table has a name, a structure (or schema, a table specification), and possibly properties. The structure describes the data present in the table by storing some properties about the columns. In other contexts, columns may be called attributes, variables, or features.

A column can contain data of only a single type (the types form a hierarchy, and the most general type at the top can hold any kind of value). Each column has a type, a name, and a position within the table. Besides these, a column might also carry further information, for example, statistics about the contained values or color/shape information for visual representation. There is also always something in a data table that looks like a column, even if it is not really one: this is where the identifiers of the rows, the row keys, are held.

There can be multiple rows in a table, just like in most other data handling software (rows are similar to observations or records). The row keys are unique (textual) identifiers within the table. They have multiple roles besides identification; for example, row keys are usually the labels when the data is shown, so always try to find user-friendly identifiers for the rows.

At the intersections of rows and columns are the (data) cells, similar to the data found in Excel sheets or in database tables (in other contexts, these might be called values or fields). There is a special cell that represents missing values.

Note

The missing value is usually represented as a question mark (?).

Tip

If you have to represent more information about missing data, you should consider adding a new column for each column where this requirement is present and storing that information there; in the original column, you simply declare the value as missing.

There are multiple cell types in KNIME; the following list contains the most important ones (cell type, symbol, and remarks):

  • Int cell (I): This represents integral numbers in the range from -2^31 to 2^31-1 (approximately 2E9).

  • Long cell (L): This represents larger integral numbers, in the range from -2^63 to 2^63-1 (approximately 9E18).

  • Double cell (D): This represents real numbers with double (64-bit) floating point precision.

  • String cell (S): This represents unstructured textual information.

  • Date and time cell (calendar and clock icon): With these cells, you can store either date or time.

  • Boolean cell (B): This represents logical values from Boolean algebra (true or false); note that you cannot exclude the missing value.

  • Xml cell (XML): This cell is ideal for structured data.

  • Set cell ({…}): This is a collection cell type that can contain multiple cells of the same type; neither duplicates nor the order of the values are preserved.

  • List cell ({…}): This is also a collection cell type, but it keeps the order and does not filter out duplicates.

  • Unknown type cell (?): When you have different types of cells in a column (or in a collection cell), this is the generic cell type used.

There are other cell types, for example, ones for chemical data structures (SMILES, CDK, and so on), for images (SVG cell, PNG cell, and so on), or for documents. This set is extensible, so other extensions can define custom data cell types.

Note

Note that any data cell type can contain the missing value.
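If you are curious how these concepts map to the KNIME SDK, the following minimal sketch (assuming the org.knime.core.data and org.knime.core.node classes) shows a table specification, rows with typed cells, and a missing value being created inside a node's execute() method; the column names, values, and the TableSketch class are made up for illustration:

import org.knime.core.data.*;
import org.knime.core.data.def.*;
import org.knime.core.node.*;

final class TableSketch {
    static BufferedDataTable createExampleTable(final ExecutionContext exec) {
        // Structure (table specification): column names and types
        DataTableSpec spec = new DataTableSpec(
            new DataColumnSpecCreator("Name", StringCell.TYPE).createSpec(),
            new DataColumnSpecCreator("Age", IntCell.TYPE).createSpec(),
            new DataColumnSpecCreator("Weight", DoubleCell.TYPE).createSpec());
        BufferedDataContainer container = exec.createDataContainer(spec);
        // Each row has a unique row key and one cell per column
        container.addRowToTable(new DefaultRow(new RowKey("Row0"),
            new StringCell("Alice"), new IntCell(42), new DoubleCell(61.5)));
        // A missing value is a special cell that can appear in any column
        container.addRowToTable(new DefaultRow(new RowKey("Row1"),
            new StringCell("Bob"), DataType.getMissingCell(), new DoubleCell(82.0)));
        container.close();
        return container.getTable();
    }
}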

Port view

The port view allows you to get information about the content of a port. The complete content is available only after the node has been executed, but usually some information is available even before that. This is very handy when you are constructing a workflow, as you can check the structure of the data; in the later stages of data exploration during workflow construction, you will usually use node views instead.

Flow variables

Workflows can contain flow variables, which can act as a loop counter, a column name, or even an expression for a node parameter. These are not constants; you can also introduce them at the workflow level.

This is a powerful feature; once you master it, you can create workflows you previously thought were impossible to create using KNIME. A typical use case is to assign roles to different columns (by assigning the column names to the role names as flow variables) and use this information in node configurations. If your workflow has some important parameters that should be adjusted or set before each execution (for example, a filename), flow variables are an ideal way to expose them to the user, instead of a preset value that is hard to find. As the automatic generation of figures gets more support, flow variables will find use there too.

Iterating a range of values or files in a folder should also be done using flow variables.
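As a rough illustration, the following fragment (assuming the protected push/peek methods of org.knime.core.node.NodeModel in the KNIME SDK; the variable names and values are made up) shows how a node implementation could publish and read flow variables from within its execute() method:

// Publish flow variables for the downstream nodes
pushFlowVariableString("fileName", "/data/input.csv");
pushFlowVariableInt("iteration", 3);
// Read a flow variable set by an upstream node
String fileName = peekFlowVariableString("fileName");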

Node views

Nodes can also have node views associated with them. These help you visualize your data or a model, show a node's internal state, or select a subset of the data using the HiLite feature. An important feature is that a node's views can be opened multiple times. This allows you to compare different visualization options without taking screenshots or having to remember what a view looked like and how you reached that state. You can also export these views to image files.

HiLite

The HiLite feature of KNIME is quite unique. Its purpose is to help identify a group of data that is important or interesting for some reason. It is related to node views, as this selection is only visible in nodes with node views (for example, it is not available in port views). Support for data highlighting is optional, because not all views support this feature.

The HiLite selection is based on row keys, and this information can be lost when the row keys change. For this reason, some of the non-view nodes also have an option to propagate this information to the adjacent nodes. On the other hand, when the row keys remain the same, the marks in different views point to the same data rows.

It is important to note that the HiLite selection is only visible within a well-connected subgraph of the workflow. It can also be available for non-executed nodes (for example, the HiLite Collector node).

Note

The HiLite information is not saved in the workflow, so once you are satisfied with your selection, you should use the HiLite filter node to save that state; you can then reset the HiLite later.

Eclipse concepts

Because KNIME is based on the Eclipse platform (http://eclipse.org), it inherits some of Eclipse's features too. One of them is the workspace model with projects (workflows in the case of KNIME), and another important one is modularity. You can extend KNIME's functionality using plugins and features; these are sometimes called KNIME extensions. Extensions are distributed through update sites, which allow you to install updates or new software from a local folder, a zip file, or an Internet location.

The help system, the update mechanism (with proxy settings), and the file search feature are also provided by Eclipse. Eclipse's main window is the workbench. Its most typical features are perspectives and views. Perspectives describe how the parts of the UI are arranged, while the views are these independently configurable parts. These views have nothing to do with node views or port views. The Eclipse/KNIME views can be detached, closed, moved around, minimized, or maximized within the window. Usually, each view can have at most one instance visible (the Console view is an exception). KNIME does not support alternative perspectives (arrangements of views), so this distinction is not too important for you; however, you can still reset the perspective to its original state.

It might be important to know that Eclipse keeps a cached image of the contents of files and folders. If you generate files outside of KNIME, you should refresh the content to load it from the filesystem. You can do this from the context menu, but it can also be automated if you prefer that option.

Preferences

The preferences are associated with the workspace you use. This is where most of the Eclipse and KNIME settings are specified. The node parameters are stored in the workflows (which are also within the workspace), and these parameters are not considered preferences.

Logging

KNIME has something to tell you about almost every action. Usually, you do not care to read these logs, and you do not need to do so. For this reason, KNIME dispatches these messages using different channels. There is a file in the workspace that, by default, collects all the messages in considerable detail. There is also a KNIME/Eclipse view named Console, which initially contains only the most important messages.
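For completeness, here is a minimal sketch of how messages are typically emitted from node code, assuming the org.knime.core.node.NodeLogger class of the KNIME SDK; the LoggingSketch class and the messages are made up for illustration:

import org.knime.core.node.NodeLogger;

final class LoggingSketch {
    // One logger per class; messages go to the log file and, depending on the
    // configured log level, to the Console view as well
    private static final NodeLogger LOGGER = NodeLogger.getLogger(LoggingSketch.class);

    void process() {
        LOGGER.debug("Detailed message, by default written only to the log file");
        LOGGER.warn("Important message, also shown in the Console view");
    }
}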