Pentaho Data Integration 4 Cookbook

Overview of this book

Pentaho Data Integration (PDI, also called Kettle), one of the leading data integration tools, is broadly used for all kinds of data manipulation, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. Do you need quick solutions to the problems you face while using Kettle? Pentaho Data Integration 4 Cookbook explains Kettle features in detail through clear and practical recipes that you can quickly apply to your solutions. The recipes cover a broad range of topics including processing files, working with databases, understanding XML structures, integrating with Pentaho BI Suite, and more. Pentaho Data Integration 4 Cookbook shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized so that you can quickly find solutions to your needs. The initial chapters explain the details of working with databases, files, and XML structures. Then you will see different ways of searching data, executing and reusing jobs and transformations, and manipulating streams. Further, you will learn all the available options for integrating Kettle with other Pentaho tools. Pentaho Data Integration 4 Cookbook has plenty of recipes with easy step-by-step instructions to accomplish specific tasks. It includes examples and code that are ready to be adapted to individual needs.
Table of Contents (17 chapters)
Pentaho Data Integration 4 Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Validating well-formed XML files


PDI offers different options for validating XML documents, including checking that a document is well-formed. The structure of an XML document is formed by tags that begin with the character < and end with the character >. An XML document can contain start-tags such as <exampletag>, end-tags such as </exampletag>, and empty-element tags such as <exampletag/>, and these tags can be nested. An XML document is called well-formed when it satisfies the following rules:

  • It must contain at least one element

  • It must contain a unique root element, that is, a single pair of opening and closing tags enclosing the whole document

  • Tag names are case sensitive, so a start-tag and its end-tag must match exactly

  • All of the tags must be nested properly, without overlapping
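
The rules above can be tried out with a short script outside of Kettle. The following is only a minimal sketch in Python, not the PDI step this recipe uses: it assumes the standard-library parser xml.etree.ElementTree, which raises ParseError whenever a document is not well-formed. The helper name is_well_formed and the sample documents are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document.

    Hypothetical helper for illustration; it is not part of PDI.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # The parser rejects any document that breaks a well-formedness rule.
        return False


# One sample per rule listed above, plus a document that satisfies all of them.
samples = {
    "no element at all": "",
    "two root elements": "<museum/><museum/>",
    "case mismatch in tags": "<Museum></museum>",
    "overlapping tags": "<museum><name></museum></name>",
    "well-formed": "<museums><museum><name>Example</name></museum></museums>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the last sample passes; each of the others violates exactly one of the rules. Note that this checks well-formedness only, not validity against a DTD or schema.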

In this recipe, you will learn to validate whether a document is well-formed, which is the simplest kind of XML validation. Assume that you want to extract data from several XML documents containing museum information, but only want to process those files that...