March 15, 2011

Pentaho PDI example

The PDI task is to get the historical numbers for some stocks from Yahoo's REST-ful service and format the data so the column headers are the ticker symbol. Each row will be the date and the prices for each ticker.

Part I: getting the data


The first part of the transformation reads the configuration from an XML file: a list of ticker symbols to get, and the parameters for the Yahoo service. The Yahoo REST parameters need to be the same for all the symbols, but are only in the config.xml file one. To combine the single set of parameters with the list of symbols, the config is read twice. The first stream gets the list of symbols, and a second stream gets the parameters. The PDI stream lookup step combines the two streams. Because there is only one set, all the symbols get the same parameters


The result of the stream lookup step is fed to the HTTP client step. This step does the REST lookup of the Yahoo server. The base URL is http://ichart.finance.yahoo.com/table.csv and with the parameters it looks something like this:
http://ichart.finance.yahoo.com/table.csv?s=VFINX&a=03&b=6&c=2006&d=02&e=29&f=2011&g=m&ignore=.csv
The result of the RESTfull lookup is a csv file, but the HTTP client step puts the entire result (and so the entire text file) in one field called "Result." To use the data, we need to parse up the csv file so that each row in the file is one row in the transformation, and each cell is a column. The first step is to split the one row into multiple rows, each with one line of the csv file. To do that we use the "split field to rows" step. I was not sure what the line delimiter was coming out of Yahoo (or even if it was set by Yahoo, or the Http Client Step), so I used the Java internal variable to set it.


Now we have a row in the transformation for each row in the CSV that came from Yahoo. We still need to get one field in the transformation for each cell in the CSV file. Using "Split field" step to split on a comma takes care of that.


We don't need the header from Yahoo, so that is filtered out.

The results from Yahoo has the opening price, closing price, high, and low but we only want one price. The one we want is set in the configuration file. The easiest way to select the price is to use a Javascript step. The script has a case statement based on the "value_to_use" field from the original XML config file.


Now we have field for the symbol (fron the config), the amount (from the Javascript step) and the date (from Yahoo). To make debugging easier, it would be nice to clear out extraneous info, so I filter out out all the unneeded columns. I have read that the Select/Rename step can be slow, but the REST calls are far slower, so that is not a worry here:


This step can also turn the date field into a DATE data type. This is needed because it came from Yahoo (and the splits) as a STRING:


Now we can take a look at the results using the PDI Preview (in Spoon):


All and all, it was pretty quick and easy to make. Next post I'll show how to do the seemingly impossible step of a data pivot in PDI with arbitrary columns.

click here for Part 2

No comments: