March 16, 2011

Pentaho PDI example Part II: Doing the impossible

Continuing from the last post, we now have a list in the format of "symbol", "date", "amount", and we want to pivot the data. The normal way to pivot would be to hard-code each column name and use the Pivot step. However, we need something a little more dynamic, because the list of ticker symbols is in the config file and so is not known until run time. So we will cheat (based on an idea from the Pentaho Users Forum). The output file will look like pivoted data, and the Kettle transformation will be (semi-)dynamic, but it will not be an actual dynamic pivot. But first, the data needs to be cleaned.

The same parameters were used for all symbols, so all the symbols should have the same set of dates. However, there are some dates that appear in the data of only a single symbol. This is usually the stock's inception date and first trade. A second date problem is that there are no prices for dates before a stock started trading. For example, if we ask for monthly amounts for 2000-2010, but a stock was first offered on April 8, 2008, there is no data at all from 2000 to 2008, and the first price is for April 8, 2008 instead of the first trading day of that month. No other stocks have prices on April 8, 2008. That date must be cut out, and nulls or 0 must be put in for all the dates from 2000 to April 2008.

To solve the first problem, we can do a cartesian join between all possible dates, and all symbols.


From the filtering step (last step discussed in the last post), two streams are created. One stream will be used to get the dates, the other for the symbols. The Pentaho "Group by" step can be used to get the distinct values:


The same is done for the symbols, and then the "cartesian join" step produces all possible combinations of dates and symbols. To get the prices, we do another stream lookup on a third stream coming from the filtering step. (I put a dummy step between the filter and the lookup for layout.) We will put a "0" in for the price of a stock on any unknown date.
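Roughly, the distinct/join/lookup combination behaves like the following in-memory sketch. This is not Kettle code, just an illustration of what the three streams produce; the rows and field names are made up:

```javascript
// Hypothetical filtered stream: "symbol", "date", "amount" rows.
const rows = [
  { symbol: "AAA", date: "2008-04-08", amount: 10.5 },
  { symbol: "AAA", date: "2008-05-01", amount: 11.0 },
  { symbol: "BBB", date: "2008-05-01", amount: 20.0 },
];

// "Group by" with no aggregate acts as a distinct-values step.
const dates = [...new Set(rows.map(r => r.date))];
const symbols = [...new Set(rows.map(r => r.symbol))];

// Cartesian join: every date paired with every symbol.
const joined = [];
for (const date of dates) {
  for (const symbol of symbols) {
    // Stream lookup back into the original rows; 0 when no price exists.
    const hit = rows.find(r => r.symbol === symbol && r.date === date);
    joined.push({ date, symbol, amount: hit ? hit.amount : 0 });
  }
}
// joined now has dates.length * symbols.length rows, so BBB gets a
// 0 amount on 2008-04-08 even though it had no price that day.
```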


Before we start with the real work of pivoting, we need a stream without the extraneous dates. To get rid of the unwanted dates, such as the inception trading date, we will take a copy of the stream of all dates, sort them, and use the "Group by" step.


The "Group by" step is used to get the minimum day for each month/year combination. This will cut out extra dates, such as the April 8 example above.
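In effect, the step keeps one row per year/month, carrying the smallest day. A sketch of that logic, with made-up rows (note that April keeps day 1 and the April 8 inception date is cut):

```javascript
// Hypothetical date stream, already split into year/month/day columns.
const allDates = [
  { year: 2008, month: 4, day: 1 },
  { year: 2008, month: 4, day: 8 }, // inception date to be cut
  { year: 2008, month: 5, day: 1 },
];

// "Group by" year and month, aggregating with Minimum on day.
const minDay = {};
for (const d of allDates) {
  const key = `${d.year}-${d.month}`;
  if (!(key in minDay) || d.day < minDay[key]) minDay[key] = d.day;
}
// minDay is { "2008-4": 1, "2008-5": 1 }: April 8 is gone.
```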


Next, a little Javascript to create a date from the year, month and day columns:
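A plausible version of that script is below; the field names are assumptions, since in the Modified Java Script step the incoming columns appear as variables. The one gotcha is that JavaScript months are zero-based:

```javascript
// Stand-ins for the year, month, and day fields on the stream
// (in the real step these come in from the previous "Group by").
var year = 2008, month = 4, day = 1;

// JavaScript Date months run 0-11, so subtract one from the month.
var goodDate = new Date(year, month - 1, day);
```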


We will return to this stream of good dates after we start the magic pivot.

The Pivot:

We can't really do a pivot with arbitrary columns, so we cheat a little. Once the stream is sorted by date and symbol, we can join all the symbols into one comma-delimited field, and the prices into another. Because we made sure that every symbol has a value for every date, and the stream is ordered by date and symbol, these combined fields will have the values for each date and the symbols for each value in the same order. Here is the step:
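As an in-memory sketch of the trick (hypothetical rows, not actual Kettle code): group by date, and for both the symbol and the price columns use "concatenate separated by ,":

```javascript
// Hypothetical stream, already sorted by date then symbol.
const stream = [
  { date: "2008-05-01", symbol: "AAA", amount: 11 },
  { date: "2008-05-01", symbol: "BBB", amount: 20 },
  { date: "2008-06-02", symbol: "AAA", amount: 12 },
  { date: "2008-06-02", symbol: "BBB", amount: 21 },
];

// Group by date, concatenating symbols and amounts with commas.
const grouped = {};
for (const r of stream) {
  const g = grouped[r.date] || (grouped[r.date] = { symbols: [], amounts: [] });
  g.symbols.push(r.symbol);
  g.amounts.push(String(r.amount));
}
const pivotRows = Object.keys(grouped).map(date => ({
  date,
  symbols: grouped[date].symbols.join(","),
  amounts: grouped[date].amounts.join(","),
}));
// Because the sort order is the same for every date, each row's
// "symbols" field lines up position-for-position with its "amounts".
```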


And here is a sample of the output:


If we align the symbol cell with the data cell, and then split them on the commas to create new fields, we almost have our pivot. The symbols will be the first row, and the values the rest. If the output file does not print the real headers, it will look like we performed an arbitrary pivot.


The split is done to create fields with meaningless names. The two streams can be combined as long as both the symbol stream and the value stream use the same field names. I used 60 fields, so I can use up to 60 symbols. Here is the split for the symbols:
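The idea behind the split, as an in-memory sketch (the generic names field_1 … field_60 are assumptions; any names work as long as both streams use the same ones):

```javascript
// Split a comma-delimited cell into up to 60 generically named fields.
const MAX_FIELDS = 60;
function splitToFields(csv) {
  const parts = csv.split(",");
  const row = {};
  for (let i = 0; i < MAX_FIELDS; i++) {
    row["field_" + (i + 1)] = i < parts.length ? parts[i] : null;
  }
  return row;
}

// Both the symbol stream and the value stream get the same field names,
// so the rows can later be appended into one output.
const headerRow = splitToFields("AAA,BBB"); // from the symbol stream
const dataRow = splitToFields("11,20");     // from the value stream
```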


Now that the prices are grouped by date, we can go back and do a stream lookup against the stream of good dates. This makes sure that we get values only for the dates we want.


Add a constant, and the "Group by" step can be used to limit the symbol stream to one row. Append the two streams together and output the results (without headers). If the resulting file is imported into a spreadsheet, the first row will act as headers for the data:
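The final append, sketched in memory with made-up field names (only two fields shown instead of 60): collapsing the symbol stream to one row, then writing it ahead of the data rows with no real header line:

```javascript
// Every row of the symbol stream is identical, so grouping by a
// constant and keeping the first value leaves a single header row.
const symbolRows = [
  { field_1: "AAA", field_2: "BBB" },
  { field_1: "AAA", field_2: "BBB" },
];
const headerRow = symbolRows[0];

// Hypothetical value stream, one row per good date.
const dataRows = [
  { field_1: "11", field_2: "20" },
  { field_1: "12", field_2: "21" },
];

// Append the streams and write without headers: the symbols become
// the first line of the file, acting as column headers.
const output = [headerRow, ...dataRows]
  .map(r => [r.field_1, r.field_2].join(","))
  .join("\n");
// output is "AAA,BBB\n11,20\n12,21"
```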


There is a Jira request to do an arbitrary pivot by changing the metadata. Until then, this little trick is a good workaround.
