19 June, 2010

Pentaho Data Integration 3.2 : Beginner's Guide


Pentaho Data Integration (PDI), a.k.a. Kettle, is undoubtedly one of the best ETL (Extract, Transform and Load) tools on the market and a favorite application in our organization. 

Kettle has helped us solve many difficult data processing cases involving a wide variety of data sources. As a data warehouse consultant and trainer, I have found virtually no case that this great application cannot solve. 

Despite Kettle / PDI's intuitive graphical environment, I keep encountering conceptual misunderstandings and misuses that lead to poor data processing designs and thus poor performance. That's why I had long been waiting for a PDI book that is both comprehensive and full of day-to-day usage samples. Finally, such a book has been published by Packt Publishing under the title "Pentaho Data Integration 3.2: Beginner's Guide".

The book is written by Maria Carina Roldan, who contributed the PDI tutorial page on the Pentaho wiki. Many thanks go to Packt for recently giving me the opportunity to review the e-book version.

A positive impression struck me as soon as I read the table of contents and finished the first chapter. Some basic and commonly asked questions are presented right away, with clear and concise explanations: 

"What is ETL?"

"Why in data warehouse do we need ETL tool?"

"What role can PDI do ? As an ETL ? And beyond ETL ?" 

That positive impression continued through the next chapter: a simple "Hello World" ETL sample. Through the example, Maria introduces Spoon - Kettle's GUI designer - and important basic concepts: 
- How to run Spoon 
- Steps / Hops
- Rows
- Running / previewing a data flow
- How to read the log

Readers can understand this introduction immediately and easily, because her explanations are enriched with many intuitive screenshots and graphical concept illustrations - something that can take a while for participants in the PDI training sessions I have conducted myself to grasp.
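If you are curious about what Spoon actually produces and how a finished transformation runs outside the GUI, here is a minimal sketch of executing a saved transformation file from Java using Kettle's embedded API. This is my own illustration, not something taken from the book: the class names follow the Kettle 3.x codebase, the exact initialization calls can differ between PDI versions, and hello.ktr is just a placeholder file name.

// Minimal sketch: run a "Hello World" transformation saved from Spoon
// (hello.ktr) outside the GUI, using the Kettle 3.x embedded API.
// Packages and initialization details may differ in other PDI versions.
import org.pentaho.di.core.util.EnvUtil;
import org.pentaho.di.trans.StepLoader;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunHelloWorld {
    public static void main(String[] args) throws Exception {
        EnvUtil.environmentInit();   // read kettle.properties / environment settings
        StepLoader.init();           // register the built-in transformation steps

        // Parse the .ktr XML file that Spoon saved
        TransMeta transMeta = new TransMeta("hello.ktr");

        // Execute the transformation and wait until all steps finish
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            System.out.println("The transformation finished with errors; check the log.");
        }
    }
}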

As I continued, I noticed that the delivery of the practical sessions is also very good. Each sample is a step-by-step guide to building an ETL flow, with a brief introduction to the PDI steps used, and a full explanation follows once the sample is complete. This makes the book easy to follow rather than a boring technical document.

This consistent delivery continues through the last chapter. The book stays rich with samples, screenshots, and concept illustrations.

The types of data sources handled are also discussed in a gradual "simple-to-sophisticated" fashion: starting with processing text files, XML, and spreadsheets / Excel, moving on to relational databases / SQL, and finally arriving at the creation of a datamart.

In conclusion, this book is highly recommended for readers who want to become familiar with Pentaho Data Integration easily and quickly. Even experienced users may benefit greatly from it.

Interested? You can buy the book from here, or read the free sample chapter "Developing and Implementing a Simple Datamart" first.