KNIME Essentials

By: Gábor Bakos

Overview of this book

KNIME is an open source data analytics, reporting, and integration platform, which allows you to analyze a small or large amount of data without having to reach out to programming languages like R. "KNIME Essentials" teaches you all you need to know to start processing your first data sets using KNIME. It covers topics like installation, data processing, and data visualization including the KNIME reporting features. Data processing forms a fundamental part of KNIME, and KNIME Essentials ensures that you are fully comfortable with this aspect of KNIME before showing you how to visualize this data and generate reports. "KNIME Essentials" guides you through the process of the installation of KNIME through to the generation of reports based on data. The main parts between these two phases are the data processing and the visualization. The KNIME variants of data analysis concepts are introduced, and after the configuration and installation description comes the data processing which has many options to convert or extend it. Visualization makes it easier to get an overview for parts of the data, while reporting offers a way to summarize them in a nice way.
Table of Contents (11 chapters)

User interface


So far, you have become familiar with the concepts of KNIME and installed it. Let's run it!

Getting started

When you start the program, the first dialog asks for the location of the workspace you want to use. If the location does not exist, it will be created.

After this, a splash screen will inform you about the progress of the start, and bring you to the welcome screen.

In the background, your firewall might notify you that this program wants to connect to other computers. This is normal; it loads tips from the Internet and tests whether other services (for example, the public repository of KNIME workflows) are available or not. You can allow this if you have permission to do so, but unless you want to connect to other servers, you do not have to give that permission.

The welcome screen shows two main options: one initializes the workbench for first use, and the other installs new extensions.

Before we select either of them, we will introduce the most important preferences, because configuring before the first use is always useful.

Setting preferences

Navigate to File | Preferences... to open the preferences dialog. In the General section, you will see an option to enable Show heap status. It is useful because it can help you optimize the memory settings for KNIME, so I suggest you turn it on. The heap status will then be visible in the lower-right corner of the status bar.

KNIME

You can set some KNIME-related options in the preferences of the KNIME category.

The KNIME GUI subcategory contains confirmation, Console logging, workflow editor grid options, and some text-related options.

If you want to connect to databases, you should find a driver for your database and register it by navigating to KNIME | Database Driver. There, you can add the driver's archive file; later, you will be able to use the driver in database connections.

Tip

Database drivers

You can find JDBC database drivers on your database provider's homepage, but you can also try the JDBC database: http://www.databasedrivers.com/jdbc/
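Before you register an archive under KNIME | Database Driver, it can help to check that it really contains a JDBC driver. The following is a minimal sketch, assuming the driver follows the JDBC 4 convention of listing its driver class in the META-INF/services/java.sql.Driver entry of the archive; older drivers may not ship this entry. The function names and the demo archive are hypothetical and only serve as an illustration.

```python
import io
import zipfile

# JDBC 4 drivers conventionally declare their driver class in this entry.
SERVICE_ENTRY = "META-INF/services/java.sql.Driver"


def find_jdbc_driver_classes(jar):
    """Return the driver class names declared in a JAR's JDBC service file.

    `jar` is a path or a binary file-like object. An archive without the
    service file yields an empty list.
    """
    with zipfile.ZipFile(jar) as archive:
        if SERVICE_ENTRY not in archive.namelist():
            return []
        text = archive.read(SERVICE_ENTRY).decode("utf-8")
    classes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            classes.append(line)
    return classes


def make_demo_jar(service_text=None):
    """Build a tiny in-memory archive that stands in for a real driver JAR."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("README.txt", "demo archive")
        if service_text is not None:
            archive.writestr(SERVICE_ENTRY, service_text)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = make_demo_jar("# demo entry\ncom.example.DemoDriver\n")
    print(find_jdbc_driver_classes(demo))
```

With a real driver, you would pass the path of the downloaded archive instead of the in-memory demo.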

With Preferred Renderers, you can set the default renderers for the columns. This option is especially useful if you are working with chemical structures.

The main KNIME preference page contains the file logging detail settings, the parallelism option, and the path to the temporary files.

Other preferences

To set up the proxy, you should navigate to General | Network Connections.

In the General | Keys page, you can redefine the key bindings for KNIME commands. So, you can use the shortcuts with which you are familiar or comfortable on your keyboard.

General | Web Browser and the Help pages are especially useful when you have problems displaying help, or you want to browse local help in your browser.

You can also set some update sites by navigating to Install/Update | Available Software Sites, but usually that is also done by navigating to Help | Install New Software....

To uninstall extensions, navigate to Help | About KNIME and click the Installation Details button. The Installed Software tab of the dialog that opens lists the extensions; you can uninstall them with a button.

Installing extensions

For installing extensions you need some update sites. You already have the default KNIME options, which contain some useful extensions. There are community nodes that also add helpful functionality to KNIME. The stable update site is http://tech.knime.org/update/community-contributions/2.8, while nightly builds are available at http://tech.knime.org/update/community-contributions/nightly.

To add update sites, navigate to Help | Install New Software.... Once you have selected an update site, it will download its summary so you can select which extensions (features) you want to install. These features have short descriptions, so you can get an idea of what functionality each will offer after installation. Once you have selected what you want to install from the update site, you should click Next.

The wizard's next page gives some details and summaries about the selected features.

On the next page, you can check the licenses and accept them if you are OK with them. After clicking Finish, the installation starts. During the installation, you might be asked to confirm that you really want to install extensions with unsigned content, or to accept a signing key. Once the installation is ready, you will be asked to restart your workbench. After restarting it, you can use the features that were installed; however, sometimes there are some preferences to be set before using them.

Workbench

So far, we have set up the work environment and installed some extensions. Now let's select the large button named Open KNIME Workbench.

The initial workbench

The menu bar is similar to any other menu bar, just like the toolbars and the status bar. We will cover the menu bar and the toolbar in detail.

The KNIME Explorer view can be used to handle your workflows, workflow groups, or connect to KNIME servers. The Favorite Nodes view contains the favorite, last used, and most used nodes as a shortcut. You can specify the maximum number of items that should be there.

Tip

You should play with the view controls a bit more and get familiar with their usage.

Node Repository is one of the most important views. It contains nodes organized in categories. The search box is really helpful during workflow design, especially when you remember part of a node's name but not its category. You will use this feature quite often.

The Outline view gives an overview of what is in the current editor window; it can also help you navigate if the workflow is too large for the window.

Tip

It is considered bad practice to have a single, huge workflow for your task. Using meta nodes, you can have more compact parts in every level.

The Console view contains messages—initially only the important ones.

The Node Description tab provides you with help information for the selected node. It explains how you should use the node, what its parameters are, what its input should be, what its output is, and what kind of views are available. When you select a category in the Node Repository view, the contents of the category will be displayed.

And finally, the central area of the window is for the workflow editor. A workflow named KNIME_project was created. Now, you can start working on it. Try adding the File Reader node from the IO | Read category in Node Repository. Drag it from the repository to the workflow or just double-click it in the repository, move it around, add another, delete it using the context menu, and that would be a good start.

The Undo (Ctrl + Z) and Redo (Ctrl + Y) commands from the Edit or the context menu (or from the toolbar: curved left and right arrows) can help you go back to the previous editing state.

Workflow handling

To create a workflow group, open the context menu of the LOCAL (Local Workspace) item in the KNIME Explorer view and select New Workflow Group... from the menu. Specify the name of the workflow group and where it should be created (once you have more groups, you can create groups inside those too). Creating a workflow can also be done using the New Workflow... command. These commands are also available from the File | New... (Ctrl + N) dialog.

Note

The key bindings are not always easy to remember because there are many of them; for more information and help about them, navigate to the Help | Key Assist... menu item or use Ctrl + Shift + L.

To load a workflow, first you have to make it available locally. There are many options to do that. You can import it to the workspace using the File | Import KNIME workflow... dialog (also available from the context menu).

Tip

There is a file named ExampleFlow.zip in the installation folder; you can use that.

Note

The Example Flow workflow loads the iris dataset (do not reset that node), colors the rows according to their class label, and visualizes the data in three different ways.

Another option is to download a workflow from the KNIME Server. Fortunately, the public KNIME Server is available for guests too. First you have to log in using the context menu. Select the only available option, Login. Once the catalog has been loaded, you can browse it just as you would the local workspace, but you cannot open a workflow from there. You have to select the one you want to import and copy it (in the context menu, use Copy or press Ctrl + C). Once you have found the right place in the local workspace, insert the workflow (in the context menu use Paste, or press Ctrl + V).

The metadata information can be handy if you want to know when a workflow was created, who the author is, or what comments someone left. The comment information is especially handy when you are choosing which workflow to download. To get (or set for local workflows) this information, the context menu's Show Meta Information (or Edit Meta Information...) command should be used.

Tip

Describe your dependencies

If you mention the prerequisites to your workflow, it will help the next user (who may be the future you) to set up things properly.

In loaded workflows, there are sometimes yellow notes about the structure of the workflow that draw your attention to customization options and other details. You can create your own notes from the context menu of the workflow editor using the New Workflow Annotation menu item. You can close the workflow by closing its editor.

Using the context menu, you can Rename... (F2) (only available for closed workflows), Delete... (Delete), Copy (Ctrl + C), Paste (Ctrl + V), or Cut (Ctrl + X) workflows and workflow groups; you can also move them by dragging.

The quickstart.pdf file describes how you can export workflows to share them with other users. The web guide for this is available at: http://tech.knime.org/workbench#export

Node controls

Once you have nodes in the editor, you will want to configure them. To configure a node, double-click it, use the Configure... command from the context menu or the Node menu, or use the toolbar's checklist icon (also accessible by pressing F6). This opens a configuration dialog (here, that of the Line Reader node), as shown in the following screenshot:

Example configuration dialog

This way you can set the parameters of the node. There can be various controls, usually with helpful tooltips; you can open them in a side window, and add the node description too. You might wonder what the v=? button does. It opens up the variable settings. For example, you can use the filename in subsequent nodes as a flow variable, or substitute it with a flow variable, if that is what you need.

The configurations are organized in tabs. The last two tabs are present in all the configuration dialogs. The Flow Variables tab allows you to assign flow variables to the parameters as values, as shown in the following screenshot:

The Flow Variables tab

The Memory Policy tab is seldom needed; you can specify how the data should be handled within KNIME during execution of the node, as shown in the following screenshot:

The Memory Policy tab

It really helps to identify the nodes or their purpose if you give them meaningful names. To change the name, click on a previously selected node or press F2. If you want more detailed information, you might consider adding a workflow annotation around it. Alternatively, you might want to add a node description to it by navigating to the context menu item Edit Node Description..., or the Node menu Edit Node Name and Description... (Alt + F2), or by clicking the toolbar's yellow speech balloon. This information will be the tooltip of the node.

If you find the names distracting, or if they are still the default names, you can hide or show them by navigating to Node | Hide Node Names, by pressing Ctrl + Alt + Q, or by using the struck-through text icon on the toolbar.

The way from not configured to configured, and then the executing and executed states.

We want to execute the node to get the results. To do this, select Execute (F7) from the context menu or the Node menu. On the toolbar, this is the play button (a white triangle on a green circle). You can also execute the node and have its first view opened afterwards (Shift + F10). If you change your mind, you can try to stop the execution before it is finished: navigate to Node | Cancel Execution (F9) for the selected nodes, or to Node | Cancel All Execution (Shift + F9).

There might be warnings or errors even after the execution; you will be notified about those.

If the execution finishes successfully, you can check the ports by selecting one of them from the context menu; alternatively, if you want to check the first output port, navigate to Node | Open First Out-Port View (Shift + F6, a magnifier over a table on the toolbar). Checking views is a good idea too (it can be selected from the context menu or via Node | Open First View, F10, a magnifier on the toolbar). The node views also have some common parts: the File and the HiLite menus.

If you make changes to the configuration, your node will be reset to the configured state; it can also be achieved using Node or the context menu's Reset (F8) command (or the toolbar's x-table button). The reset will not delete the previously set parameters.
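The node lifecycle described above can be pictured as a small state machine. The following is only a sketch, not KNIME's implementation: the class and method names are hypothetical, and it assumes that cancelling a running node returns it to the configured state, which the text does not state explicitly.

```python
from enum import Enum


class State(Enum):
    NOT_CONFIGURED = "not configured"
    CONFIGURED = "configured"
    EXECUTING = "executing"
    EXECUTED = "executed"


class Node:
    """Toy model of a node's lifecycle (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.settings = {}
        self.state = State.NOT_CONFIGURED

    def configure(self, **settings):
        # Changing the configuration puts the node into the configured
        # state, even if it had been executed before.
        if self.state is State.EXECUTING:
            raise RuntimeError("cannot configure a node while it executes")
        self.settings.update(settings)
        self.state = State.CONFIGURED

    def execute(self):
        if self.state is not State.CONFIGURED:
            raise RuntimeError("only a configured node can be executed")
        self.state = State.EXECUTING

    def finish(self):
        if self.state is not State.EXECUTING:
            raise RuntimeError("node is not executing")
        self.state = State.EXECUTED

    def cancel(self):
        # Assumption of this sketch: a cancelled node becomes configured again.
        if self.state is State.EXECUTING:
            self.state = State.CONFIGURED

    def reset(self):
        # Reset discards the results but keeps the previously set parameters.
        if self.state is State.EXECUTED:
            self.state = State.CONFIGURED


if __name__ == "__main__":
    reader = Node("File Reader")
    reader.configure(path="iris.csv")  # hypothetical setting name
    reader.execute()
    reader.finish()
    reader.reset()
    print(reader.state, reader.settings)
```

The point of the sketch is the last step: after a reset the node is configured again, and its settings are still in place.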

To connect a node's output port to another node's input port, just drag the output port to the input port; when the mouse button is released they will be connected (assuming the ports are compatible and the connection would not create a cycle in the graph of nodes). From one output port, you can connect to as many input ports as you want (including several input ports of the same node), but each input port can handle at most one connection.
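The connection rules above can be sketched as a small graph model: an output port may fan out to many inputs, an input port holds at most one connection, and a connection that would close a cycle is refused. The Workflow class below is hypothetical. It ignores port type compatibility, and rejecting a second connection on an occupied input port is a choice of this sketch rather than documented KNIME behavior.

```python
class Workflow:
    """Toy workflow graph enforcing the connection rules (hypothetical)."""

    def __init__(self):
        # (destination node, input port) -> (source node, output port)
        self.incoming = {}

    def _reaches(self, start, target):
        """Return True if `target` can be reached from `start` downstream."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            for (dst, _in_port), (src, _out_port) in self.incoming.items():
                if src == node:
                    stack.append(dst)
        return False

    def connect(self, src, out_port, dst, in_port):
        # A new edge src -> dst closes a cycle if src is already downstream
        # of dst (this also covers connecting a node to itself).
        if self._reaches(dst, src):
            raise ValueError("connection would create a cycle")
        if (dst, in_port) in self.incoming:
            raise ValueError("input port already has a connection")
        self.incoming[(dst, in_port)] = (src, out_port)


if __name__ == "__main__":
    wf = Workflow()
    wf.connect("File Reader", 0, "Color Manager", 0)
    wf.connect("Color Manager", 0, "Scatter Plot", 0)
    wf.connect("Color Manager", 0, "Interactive Table", 0)  # fan-out is fine
    try:
        wf.connect("Scatter Plot", 0, "File Reader", 0)
    except ValueError as error:
        print(error)
```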

There are arrangement commands available on the toolbar (horizontal, vertical, and auto layout), and you can also configure the node snapping grid properties by navigating to Node | Grid Settings... (Ctrl + Alt + Shift + X) or by using the grid icon on the toolbar.

HiLite

As we mentioned previously, HiLite is a view-related feature of KNIME, which allows you to select a certain set of rows and make the selection visible across different views. The Example Flow is a good start to get familiar with this concept and see it in action. As you can see, there are four visual type nodes available: Color Manager, Scatter Plot, Parallel Coordinates, and Interactive Table. Please execute the last three nodes and open a view for each of them, in the same order.

The Interactive Table node shows data with different colors for different flowers. Select the first Iris-versicolor row, 51. Now from the HiLite menu, select HiLite selected (also available from the context menu in this view). As you can see, a point and a path, both representing row 51, have been highlighted in the other two views. You can highlight another row from the Interactive Table view, or select some dots from the scatter plot or paths from the parallel coordinates and highlight them the same way as in the first view. You also have the option to selectively unhighlight (UnHiLite Selected) or unhighlight all (Clear HiLite). You can also hide or keep only the highlighted rows (in the view only; the port content will not be changed) using the HiLite | Filter menu items.
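One way to picture what happens here is a shared set of highlighted row keys that every open view observes. The sketch below illustrates only the idea; the class names and methods are hypothetical and do not reproduce KNIME's actual API.

```python
class HiLiteHandler:
    """Shared set of highlighted row keys (hypothetical sketch)."""

    def __init__(self):
        self._keys = set()
        self._views = []

    def register(self, view):
        # A newly opened view immediately sees the current highlight state.
        self._views.append(view)
        view.on_hilite_changed(frozenset(self._keys))

    def hilite(self, *row_keys):
        self._keys.update(row_keys)
        self._notify()

    def unhilite(self, *row_keys):
        self._keys.difference_update(row_keys)
        self._notify()

    def clear(self):
        self._keys.clear()
        self._notify()

    def _notify(self):
        snapshot = frozenset(self._keys)
        for view in self._views:
            view.on_hilite_changed(snapshot)


class View:
    """Stand-in for a table, scatter plot, or parallel coordinates view."""

    def __init__(self, name, handler):
        self.name = name
        self.highlighted = frozenset()
        handler.register(self)

    def on_hilite_changed(self, row_keys):
        self.highlighted = row_keys


if __name__ == "__main__":
    handler = HiLiteHandler()
    views = [View(n, handler) for n in ("table", "scatter", "parallel")]
    handler.hilite("51")  # highlight row 51 from any one view
    print([sorted(v.highlighted) for v in views])
```

Because all views read the same set, highlighting a row in one view shows up in the others, while the data on the ports stays untouched.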

To store the HiLite information, you should add a HiLite Filter node (for example, connect it to the Color Manager node), execute it, and save the workflow. With the Interactive HiLite Collector node, you can add custom information to the currently highlighted rows, so that later you can identify multiple subsets (if you check the New Column box before clicking on Apply). Do not forget to execute the node, and later save the workflow once you are satisfied with your selection.

Variable flows

When you move your mouse cursor to the upper-left or upper-right corner of a node (a bit outside of it), you will get a different tooltip: Variable Inport or Variable Outport (Variables Connection), respectively. Something useful is hidden there. Select a node, and from the context menu, select Show Flow Variable Ports. Two red-filled circles will appear. You can connect them to other nodes' flow variable ports. These connections are red. This way, you can make sure the proper set of variables will be available at the right time (circular dependencies are not allowed this way either). The loops also use the workflow variables, and there are multiple nodes to create or change them. You seldom need these connections, as flow variables are also propagated through normal connections.

You can also specify workflow variables from the context menu of the workflow (Workflow Variables...), or by using the QuickForm nodes.

Meta nodes

We mentioned that the meta nodes are useful for encapsulating the parts of the workflow and to hide the distracting details. The quickstart.pdf file gives a nice introduction to meta nodes; you can find the content on the web too at the link http://tech.knime.org/metanodes.

An unmentioned option to create new meta nodes is by selecting a closed subset of non-executed nodes or meta nodes and invoking the Collapse into Meta Node action from the context menu. The opposite process (bringing the contents of the meta node to the current level) is also possible with the Expand Meta Node context menu item.

Opening a meta node is possible by double-clicking on it or selecting the Open Meta Node context menu item. Both ways, another workflow editor tab will appear, where you can continue the workflow design.

Workflow lifecycle

Once you have a workflow, you might want to save the changes you made and the computed data and models. That is really easy; navigate to File | Save (Ctrl + S) or use the toolbar's disc icon.

Note

You cannot save a workflow while its nodes are executing; save it before starting the execution, wait until the execution finishes, or stop the execution first.

Sometimes you want to execute the whole workflow. To do that, you can use the toolbar's Execute all executable nodes button (a fast forward icon with a green circle background, Shift + F8) or the Node | Execute All menu item.

Tip

Batch processing

To process workflows from the command line (or from another program), the KNIME FAQ gives a good description at the following link: http://tech.knime.org/faq#q12
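As a rough illustration of driving KNIME from another program, the following sketch only assembles a batch command line and does not run it. It assumes the batch application ID and the options (-nosplash, -workflowDir, -reset, -nosave) as I understand them from the KNIME FAQ, so please verify them against the FAQ for your version. The executable name, the workflow path, and the helper function are hypothetical.

```python
def build_batch_command(workflow_dir, executable="knime", reset=False, save=True):
    """Assemble an argument list for a KNIME batch run (hedged sketch).

    The option names are assumptions based on the KNIME FAQ; `executable`
    is a placeholder for the KNIME launcher of your installation.
    """
    command = [
        executable,
        "-nosplash",
        "-application",
        "org.knime.product.KNIME_BATCH_APPLICATION",
        f"-workflowDir={workflow_dir}",
    ]
    if reset:
        command.append("-reset")   # reset the workflow before executing it
    if not save:
        command.append("-nosave")  # do not save the workflow afterwards
    return command


if __name__ == "__main__":
    # Hypothetical workflow location inside a workspace.
    args = build_batch_command("workspace/ExampleFlow", reset=True, save=False)
    print(" ".join(args))
    # To actually launch it, you could pass `args` to subprocess.run(args).
```

Keeping the arguments as a list avoids quoting problems when the workflow path contains spaces.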

If there are multiple entry points to your workflow, it can be boring to reset all those nodes one by one, but the Reset command from the context menu of KNIME Explorer will reset all the nodes in the selected workflow.

Other views

The Server Workflow Projects view shows only the workflows (and groups) available on servers, but the Workflow Projects view shows only the local ones. If you do not need server workflows, this might be a better choice than the KNIME Explorer view, as this is more compact.

The KNIME Node Monitor view (View | Other... | KNIME Views) gives you information about the selected item's state and other parameters. I think you will find this useful, especially if you explore the drop-down menu that opens from the white triangle:

KNIME Node Monitor's possible contents