Getting Started with Talend Open Studio for Data Integration

By: Jonathan Bowen

Overview of this book

Talend Open Studio for Data Integration (TOS) is an open source graphical development environment for creating custom integrations between systems. It comes with over 600 pre-built connectors that make it quick and easy to connect databases, transform files, load data, move, copy, and rename files, and connect individual components in order to define complex integration processes. "Getting Started with Talend Open Studio for Data Integration" illustrates common uses and scenarios in a simple, practical manner and, building on knowledge as the book progresses, works towards more complex integration solutions. TOS is a code generator and so does a lot of the "heavy lifting" for you. As such, it is a suitable tool for experienced developers and non-developers alike. You'll start by learning how to construct some common integration tasks, such as transforming files and extracting data from a database. These building blocks form a "toolkit" of techniques that you will learn how to apply in many different situations. By the end of the book, once-complex integrations will appear easy and you will be your organization's integration expert! Best of all, TOS makes integrating systems fun!
Table of Contents (22 chapters)
Getting Started with Talend Open Studio for Data Integration
Credits
Foreword
Foreword
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Preface
Index

Modifying data in a database


It is not unusual for data integrations to be simply about getting data from one system, modifying it, and then passing it on to another system to consume: the classic Extract, Transform, and Load (ETL) scenario. However, sometimes we will also need to modify the data in the database that we are sourcing data from or, indeed, in the database that we are sending data to.

To illustrate this, imagine we have a database table containing customer orders. We need to extract the data from this table and send it to another system so that the orders can be fulfilled. However, most of the time we will need to be selective about the data that we send over. We cannot select all rows from the table, because presumably some will have already been sent. We could filter our result set by date/time, so that the job runs once per hour and each time sends only the data created in the last hour. This would probably work perfectly well for most scenarios, but even this kind of filter is not quite...
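To make the discussion concrete, the following is a minimal sketch of the kind of date/time filter described above, as it might be entered into the query field of a database input component. The table and column names (customer_orders, created_at, and so on) are hypothetical rather than taken from the book's sample database, and the MySQL-style interval syntax is only one possibility; both would need adjusting to match your own schema and database.

-- Select only the orders created in the last hour, so that rows already
-- extracted by earlier runs of the job are not sent again.
-- Table and column names are illustrative, not from the book's examples.
SELECT order_id,
       customer_id,
       order_total,
       created_at
FROM   customer_orders
WHERE  created_at >= NOW() - INTERVAL 1 HOUR;

Run hourly, a query like this picks up each batch of new orders exactly once, provided the job's schedule and the interval in the WHERE clause stay in step.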