19 June, 2010

Pentaho Data Integration 3.2 : Beginner's Guide


Pentaho Data Integration (PDI) a.k.a Kettle is undoubtedly one of the best ETL (Extract, Transform and Load) tools in the market and a favorite application in our organization. 

Kettle has helped us solve many difficult data processing cases involving many varying sources of data. As a data warehouse consultant and trainer, there is virtually no cases that cannot be solved by this great application. 

Despite of intuitive Kettle / PDI graphical environment, there are so many concept misunderstandings and usages I encounter that led to a poor data treatment designs and thus poor performance gain. That's why I've been long waiting for a PDI book that is both comprehensive and has many day-to-day usage samples. And finally the book was published by Packt Publishing under title "Pentaho Data Integration 3.2: Beginner's Guide"

This book is written by Maria Carina Roldan who has contributed PDI tutorial page in Pentaho wiki. Many thanks goes to Packt for giving me the opportunity to review the e-book version recently.

Positive impression instantly alighted on myself when I've read Table of Contents and finished the first chapter. Some basic and commonly asked questions directly presented with clear and concise explanations: 

"What is ETL?"

"Why in data warehouse do we need ETL tool?"

"What role can PDI do ? As an ETL ? And beyond ETL ?" 

That positive impression continued throughout the next chapter : a simple "Hello World" ETL sample. From the example, Maria introduced Spoon - Kettle's GUI designer - and basic important concepts : 
- How to run Spoon 
- Steps /  Hops
- Rows
- Running / Previewing data flow process
- How to read log

Users can immediately and easily understand the introduction. This is possible because her explanation enriched with many intuitive screenshots and graphical concept illustrations. Something that may take a while from the participants of PDI training session I myself conducted.

As I continued on, I notice that the delivery of practical sessions are also very good. Each sample was a step-by-step guide on how to make an ETL flow - with brief introduction of PDI step's used. Full explanation followed when we finished creating the sample. It makes the book easily followed and not making it a boring technical document.

The delivery consistency continues until last chapter. The book stay rich with samples, screenshots and concept illustrations.

Types of data source handled are also being discussed in a gradual "simple-to-sophisticated" fashion. Starting from processing text files, XML, spreadsheets / Excel, relational databases / SQL, and finally to the creation of datamart.

As a conclusion, this book is highly recommended for readers who want to get familiar with Pentaho Data Integration easily and quickly. But even experienced users may also benefited greatly from the book.

Interested ? You can buy the book from here or read a free sample chapter "Developing and Implementing a Simple Datamart" first.


04 January, 2009

Mondrian Samples Configuration

Since my article on Mondrian with MySQL backend published in this blog, I've got several questions which are quite similar regarding Mondrian / JPivot samples.

For those of you who got difficulty in getting the samples work, I've dedicated a wiki section page detailing every sample's configuration. The wiki site is located at http://pentaho-en.phi-integration.com/mondrian/configuring-mondrian-sample.

Or you can click directly on these links - each link for each sample configuration :
Hopefully with these articles you can get much more understandings on how Mondrian a.k.a. Pentaho Analysis server works.

If you already following all the instructions and still got some errors or you have other issues that has to do with other database sources then you can post your problem to our Mondrian user group at http://groups.google.com/group/phi-mondrian-olap . Or you can send a blank email to phi-mondrian-olap-subscribe@googlegroups.com to join the group.


Feris Thia
Business Intelligence Consultant
http://www.phi-integration.com
http://pentaho.phi-integration.com
http://pentaho-en.phi-integration.com

03 January, 2009

Upcoming Pentaho Book by Roland Bouman

After years, at last there's a commitment for a Pentaho book. As announced in his blog, Roland Bouman - a former MySQL employee - and  Jos van Dongen from Tholis Consulting will write a book for Wiley with the tentative title "Building Pentaho Solutions".

As a Pentaho partner in Indonesia which sometimes struggling to integrate all Pentaho formerly independent projects (Mondrian, Kettle, and  Weka) with BI Platform, of course this is a good news for us. Hope this will provide a comprehensive benefit as a guide for both Pentaho customers and partners.

Good Luck Roland and Jos !


Feris
Business Intelligence Consultant

15 November, 2008

Getting Started With Spoon

Conventions Used

Bold Text : represent new / important concept such as type of step and also representing interface components such as button, menu item, etc.

[Bold in Bracket] : represent step name


What is Spoon ?

Spoon is graphical utility in Kettle suite to do these functionalities :
  • design and run job / transformation
  • job / transformation profiling
  • set up database connection, variable/object sharing, configuration, etc
This article will serve as introductory tutorial for those who are new to Spoon and Kettle.

Running Spoon

  • Spoon in Windows can be executed using these 2 files :
    • Binary : kettle.exe
    • Batch script : spoon.bat




  • Execute one of the two files





  • Upon starting, you will be presented a dialog to choose a repository - a relasional database system repository to store job/transfformation object detail. Clik on No repository button for now to choose our filesystem as our storage.







  • A tips dialog will pop up shortly. Click on close to continue.
  • You will be presented a "Welcome to Pentaho Data Integration: Kettle Project" page in Welcome tag.
  • Done

Spoon Interface

  • Pulldown Menu  : a main navigation for spoon. Comprising of all needed operations on designing, previewing and executing job / transformation.







  • Toolbar : contain icons for basic file operation, printing, repository view and job / transformation operations.






  • Left panel  : contain configuration, steps and favorite steps used splitted in Main Tree, Core Objects dan Favorite Steps sections.







  • Workspace Tabs : act as divider between your workspaces. Picture below show the tag collections of  1 welcome page, 2 transformations and 1 job.





Creating a Transformation File

Here I show you an example how to create a transformation that read a csv (comma separated value)  file containing 33 provinces of Indonesia and dump it to another csv file. The propinsi.csv sample file can be downloaded at the end of this article or directly through this link. Save the file in C:\contoh_kettle folder. 

propinsi.csv

Kode_Propinsi,Deskripsi
P01,Bali
P02,Bengkulu
P03,Banten
P04,Gorontalo
P05,Irian Jaya Barat
P06,Papua
P07,Jambi
P08,Jawa Barat
P09,Jawa Tengah
P10,Jawa Timur
P11,Kalimantan Barat
P12,Kalimantan Tengah
P13,Kalimantan Timur
P14,Kalimantan Selatan
P15,Kepulauan Bangka Belitung
P16,Kepulauan Riau
P17,Lampung
P18,Maluku
P19,Maluku Utara
P20,Nusa Tenggara Barat
P21,Nusa Tenggara Timur
P22,Riau
P23,Sulawesi Barat
P24,Sulawesi Tengah
P25,Sulawesi Tenggara
P26,Sulawesi Selatan
P27,Sulawesi Utara
P28,Sumatra Barat
P29,Sumatra Selatan
P30,Sumatra Utara
P31,DI Yogyakarta
P32,DKI Jakarta
P33,Nanggroe Aceh Darussalam


Now here are the steps to create a transformation handling the csv files :
  • To create a transformation you can do it with several ways :
    • Choose File | New | Transformation from the pulldown menu 







    • Click on  New | Transformation on the toolbar





    • With a CTRL + N shortcut key

  • A new Transformation 1 workspace tab will show up, you rename it by saving our newly created transformation. Press CTRL + S and save file as c:\contoh_kettle\baca_propinsi.ktr
  • Notice that now our tab is changed to baca_propinsi.

Transformation Steps

  • Transformation will consist of several steps. Now we will drop a step visually in our workspace to read the content of ourC:\contoh_kettle\propinsi.csv file.
  • In the left panel open up Core Objects | Input category. Here you will find several steps to read input from several file formats.
  • For our need, we will use CSV file input step.




  • Click on CSV file input icon and drag it into our workspace.






  • Double click on the [CSV file input] step until a dialog shows up.
    • In Filename section fill in c:\contoh_kettle\propinsi.csv. Left the other as they are now.
    • Click on Get Fields to retrieve our known fields. Click OK and Close subsequently for another dialogs that pop up.






    • You will a configuration with values shown like picture below. Click OK to return to our workspace.





  • Save our transfomation file

Data Preview

  • We can preview several records read from our step using preview facility in Spoon.
  • To demonstrate this, click on [CSV file input] and click Preview icon in the toolbar then click on Quick Launch button that shows up.





  • In seconds you will have Examine Preview Data dialog with a number data of records previewing in this window. Close it for now by clicking Close button.







  • Done

Joining Step with a Hop

  • From the left panel open Core Objects | Scripting and drag Modified Java Script Value step into your transformation workspace.
  • Hold CTRL key, click on both [CSV File Input] step and [Modified Java Script Value] then right click and choose New Hop. Click OK on dialog  that show up.






  • We just created a hop that bridging the two steps we created before.






Transform our Data

  • One of  Modified Java Script Value step functionality is to change our data using programmatically using several built in operator and functions. If you know Java very well, you can also embedded Java code in this step. Double click on the step and type following code in the editor that appears.





    var Deskripsi_lengkap = "Propinsi " + Deskripsi;
    var No_urut = getProcessCount("r");

  • Make sure that Compatibility mode ? is unchecked
  • Click on Get variables button.
  • You will have a dialog look as below.






  • Click OK.
  • Evaluate this step by previewing data on it.

Output Step

  • Now we will dump our result from [Modified Java Script Value] to a text file, C:\contoh_kettle\propinsi.txt.
  • Again, from the left panel open Core Objects | Output and drag Text file output type step to workspace.
  • Joining [Modified Java Script Value] and [Text file output] with a hop.





  • Double click on [Text file output]
  • In the pop up Text file output dialog  click on file tab and type  C:\contoh_kettle\propinsi in Filename section.
  • Still on the dialog, click on Fields tab and click Get Fields button to have 4 fields show up (Kode_Propinsi, Deskripsi,Deskripsi_lengkapText file output).
  • Click OK.

Running Transformation

  • Now our transformation already has our goals : read a csv text file, change some value and put it into another 2 fields,  and save the combining fields into a new csv text file.
  • Run the transformation by click Run button on the toolbar.








  • Click Launch on the dialog.
  • You will be redirected to a log workspace with running steps detail information, for example how many rows that are read and written in the step. I will not going further by explaining parts of this workspace but please notice at the bottom panel where there are  detailed logs output there. You see in the last lines that our transformation has been successfully executed.





    Execution logs

    2008/08/24 18:31:22 - baca_propinsi - Dispatching started for transformation [baca_propinsi]
    2008/08/24 18:31:23 - Spoon - The transformation has finished!!
    2008/08/24 18:52:54 - Spoon - Launching transformation [baca_propinsi]...
    2008/08/24 18:52:54 - Spoon - Started the transformation execution.
    2008/08/24 18:52:54 - baca_propinsi - Dispatching started for transformation [baca_propinsi]
    2008/08/24 18:52:55 - Spoon - The transformation has finished!!
  • Now take a look at our C:\contoh_kettle folder, we will have 1 more file there. A newly created propinsi.txt. Open the file with your favorite text editor and see the change from the original file.




    propinsi.txt

    Kode_Propinsi;Deskripsi;Deskripsi_lengkap;No_urut
    P01;Bali                     ;Propinsi Bali;00000000000001.00
    P02;Bengkulu                 ;Propinsi Bengkulu;00000000000002.00
    P03;Banten                   ;Propinsi Banten;00000000000003.00
    P04;Gorontalo                ;Propinsi Gorontalo;00000000000004.00
    P05;Irian Jaya Barat         ;Propinsi Irian Jaya Barat;00000000000005.00
    P06;Papua                    ;Propinsi Papua;00000000000006.00
    P07;Jambi                    ;Propinsi Jambi;00000000000007.00
    P08;Jawa Barat               ;Propinsi Jawa Barat;00000000000008.00
    P09;Jawa Tengah              ;Propinsi Jawa Tengah;00000000000009.00
    P10;Jawa Timur               ;Propinsi Jawa Timur;00000000000010.00
    P11;Kalimantan Barat         ;Propinsi Kalimantan Barat;00000000000011.00
    P12;Kalimantan Tengah        ;Propinsi Kalimantan Tengah;00000000000012.00
    P13;Kalimantan Timur         ;Propinsi Kalimantan Timur;00000000000013.00
    P14;Kalimantan Selatan       ;Propinsi Kalimantan Selatan;00000000000014.00
    P15;Kepulauan Bangka Belitung;Propinsi Kepulauan Bangka Belitung;00000000000015.00
    P16;Kepulauan Riau           ;Propinsi Kepulauan Riau;00000000000016.00
    P17;Lampung                  ;Propinsi Lampung;00000000000017.00
    P18;Maluku                   ;Propinsi Maluku;00000000000018.00
    P19;Maluku Utara             ;Propinsi Maluku Utara;00000000000019.00
    P20;Nusa Tenggara Barat      ;Propinsi Nusa Tenggara Barat;00000000000020.00
    P21;Nusa Tenggara Timur      ;Propinsi Nusa Tenggara Timur;00000000000021.00
    P22;Riau                     ;Propinsi Riau;00000000000022.00
    P23;Sulawesi Barat           ;Propinsi Sulawesi Barat;00000000000023.00
    P24;Sulawesi Tengah          ;Propinsi Sulawesi Tengah;00000000000024.00
    P25;Sulawesi Tenggara        ;Propinsi Sulawesi Tenggara;00000000000025.00
    P26;Sulawesi Selatan         ;Propinsi Sulawesi Selatan;00000000000026.00
    P27;Sulawesi Utara           ;Propinsi Sulawesi Utara;00000000000027.00
    P28;Sumatra Barat            ;Propinsi Sumatra Barat;00000000000028.00
    P29;Sumatra Selatan          ;Propinsi Sumatra Selatan;00000000000029.00
    P30;Sumatra Utara            ;Propinsi Sumatra Utara;00000000000030.00
    P31;DI Yogyakarta            ;Propinsi DI Yogyakarta;00000000000031.00
    P32;DKI Jakarta              ;Propinsi DKI Jakarta;00000000000032.00
    P33;Nanggroe Aceh Darussalam ;Propinsi Nanggroe Aceh Darussalam;00000000000033.00

  • Done

Conclusion

Congratulations, you have just created a simple transfomation designed and executed in Spoon. For more reading on transformation you can check on Pentaho wiki. We will also frequently update our samples in Kettle section. Soalways stay tuned :)

If you have any question at this article, feel free to drop us a note at info@phi-integration.com.

You can download the sample files from http://pentaho.phi-integration.com/kettle/spoon.