March 17, 2011

Last note on the impossible PDI pivot

Here is a blog post by Matt Casters, the founder of Kettle (which was rebranded as Pentaho Data Integration). In it he is un-pivoting data, that is, doing the opposite of what I did in the last two posts. I solved the problem of knowing how many columns to create by hard coding 60. He uses a more elegant solution: analyzing the data in one transformation and then feeding the results of that analysis into a second transformation with the metadata injection step. I suspect that, following his lead, I could also use the metadata injection step and a separate analysis transformation to dynamically set the number of columns. But first I have to study the metadata injection step to see how it works.

March 16, 2011

Pentaho PDI example Part II: Doing the impossible

Continuing from the last post, we now have a list in the format of "symbol", "date", "amount", and we want to pivot the data. The normal way to pivot would be to hard code each column name and use the Pivot step. However, we need something a little more dynamic, because the list of ticker symbols is in the config file and so is not known until run time. So we will cheat (based on an idea from the Pentaho Users Forum). The output file will look like pivoted data, and the Kettle transformation will be (semi-)dynamic, but it will not be an actual dynamic pivot. But first, the data needs to be cleaned.

The same parameters were used for all symbols, so all the symbols should have the same set of dates. However, there are some dates that appear in the date set of only a single symbol. This is usually the stock's inception date, its first trade. A second date problem is that there are no prices for dates before a stock started trading. For example, if we ask for monthly amounts for 2000-2010, but a stock was first offered on April 8, 2008, there is no data at all before that date, and the first price is for April 8, 2008 instead of the first trading day of that month. No other stocks have prices on April 8, 2008. That date must be cut out, and nulls or 0 must be put in for all the dates from 2000 to April 2008.

To solve the first problem, we can do a cartesian join between all possible dates, and all symbols.


From the filtering step (the last step discussed in the previous post), two streams are created. One stream will be used to get the dates, the other the symbols. The Pentaho "Group by" step can be used to get the distinct values:


The same is done for the symbols, and then the "cartesian join" step produces all possible combinations of dates and symbols. To get the prices, we do another stream lookup on a third stream coming from the filtering step. (I put a dummy step between the filter and the lookup just for layout.) The lookup will put a "0" in for the price of a stock on any unknown date.


Before we start with the real work of pivoting, we need a stream without the extraneous dates. To get rid of the unwanted dates, such as the inception trading date, we will take a copy of the stream of all dates, sort them, and use the "Group by" step.


The "Group by" step is used to get the minimum day for each month/year combination. This will cut out extra dates, such as the April 8 example above.


Next, a little Javascript to create a date from the year, month and day columns:
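The script is only a few lines; something like this (a sketch, and the year, month and day field names are my guesses at what the columns are called):

    // Modified Java Script Value step -- a sketch, not the exact script from the transformation.
    // Assumes numeric input fields named year, month and day.
    var y = parseInt(year);
    var m = parseInt(month);
    var d = parseInt(day);

    // Javascript Date months are zero-based, so subtract one from the month.
    var good_date = new Date(y, m - 1, d);

The new field then gets added to the stream in the step's output fields grid, with its type set to Date.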


We will return to this stream of good dates after we start the magic pivot.

The Pivot:

We can't really do a pivot with arbitrary columns, so we cheat a little. Once the stream is sorted by date and symbol, we can join, for each date, all the symbols into one comma-delimited field and all the prices into another. Because we made sure that every symbol has a value for every date, and the stream is ordered by symbol within each date, the two combined fields list the symbols and their prices in the same order for every date. For example (with made-up symbols and prices), the row for a single date would carry "VFINX,VTSMX,VBMFX" in one field and "102.51,31.20,10.73" in the other. Here is the step:


And here is a sample of the output:


If we can align the symbol field with the price field, and then split both on the commas to create new fields, we will almost have our pivot. The symbols will be the first row, the values the rest. If the output file does not print the real headers, it will look like we performed an arbitrary pivot.


The split creates fields with meaningless names; the two streams can be combined as long as both the symbol stream and the value stream use the same field names. I used 60 fields, so I can use up to 60 symbols. Here is the split for the symbols:


Now that the prices are grouped by date, we can go back and do a stream lookup against the stream of good dates. This will make sure that we get values only for the dates we want.


Add a constant, and the "Group by" step can be used to limit the symbol stream to one row. Append the two streams together and output the results (without headers). If the resulting file is imported into a spreadsheet, the first row will act as headers for the data:


There is a Jira request to do an arbitrary pivot by changing the metadata. Until then, this little trick is a good workaround.

March 15, 2011

Pentaho PDI example

The PDI task is to get the historical numbers for some stocks from Yahoo's RESTful service and format the data so that the column headers are the ticker symbols. Each row will be the date and the prices for each ticker.

Part I: Getting the data


The first part of the transformation reads the configuration from an XML file: a list of ticker symbols to get, and the parameters for the Yahoo service. The Yahoo REST parameters need to be the same for all the symbols, but appear in the config.xml file only once. To combine the single set of parameters with the list of symbols, the config is read twice. The first stream gets the list of symbols, and a second stream gets the parameters. The PDI stream lookup step combines the two streams. Because there is only one set, all the symbols get the same parameters.


The result of the stream lookup step is fed to the HTTP client step. This step does the REST lookup against the Yahoo server. The base URL is http://ichart.finance.yahoo.com/table.csv and with the parameters added it looks something like this:
http://ichart.finance.yahoo.com/table.csv?s=VFINX&a=03&b=6&c=2006&d=02&e=29&f=2011&g=m&ignore=.csv
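For what it is worth, here is how I read the query string (a sketch of how the URL gets put together; the HTTP client step builds the real thing from the parameter fields, and the variable names below are mine, not the actual field names):

    // Sketch only -- how the request URL breaks down, as far as I can tell.
    // s = ticker, a/b/c = start month/day/year, d/e/f = end month/day/year,
    // g = interval (d, w or m); the months appear to be zero-based, so 03 means April.
    var url = "http://ichart.finance.yahoo.com/table.csv"
            + "?s=" + symbol
            + "&a=" + start_month + "&b=" + start_day + "&c=" + start_year
            + "&d=" + end_month   + "&e=" + end_day   + "&f=" + end_year
            + "&g=" + interval + "&ignore=.csv";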
The result of the RESTful lookup is a CSV file, but the HTTP client step puts the entire result (and so the entire text file) into one field called "Result." To use the data, we need to parse the CSV so that each row in the file is one row in the transformation, and each cell is a column. The first step is to split the one row into multiple rows, each holding one line of the CSV file. To do that we use the "Split field to rows" step. I was not sure what the line delimiter was coming out of Yahoo (or even if it was set by Yahoo or by the HTTP client step), so I used the Java internal line separator variable to set it.
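In script form the same logic would be just a split on the Java line separator (a sketch only; the transformation really does use the "Split field to rows" step):

    // Sketch only -- not part of the transformation, just the idea behind the split.
    var sep = java.lang.System.getProperty("line.separator");  // the Java internal line separator
    var lines = Result.split(sep);                             // one entry per line of the CSV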


Now we have a row in the transformation for each row in the CSV that came from Yahoo. We still need to get one field in the transformation for each cell in the CSV file. Using the "Split field" step to split on a comma takes care of that.


We don't need the header from Yahoo, so that is filtered out.

The results from Yahoo have the opening price, closing price, high, and low, but we only want one price. The one we want is set in the configuration file. The easiest way to select the price is to use a Javascript step. The script has a case statement based on the "value_to_use" field from the original XML config file.
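The script is roughly this (a sketch; I'm assuming the fields from the split were named Open, High, Low and Close, and that value_to_use holds one of those words in lower case):

    // Javascript step -- a sketch; the field names are assumptions.
    var which = "" + value_to_use;   // make sure we compare a plain string
    var amount;
    switch (which) {
      case "open":  amount = parseFloat(Open);  break;
      case "high":  amount = parseFloat(High);  break;
      case "low":   amount = parseFloat(Low);   break;
      default:      amount = parseFloat(Close); break;  // fall back to the closing price
    }

The amount field is then added to the stream in the step's fields grid.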


Now we have fields for the symbol (from the config), the amount (from the Javascript step) and the date (from Yahoo). To make debugging easier, it would be nice to clear out extraneous info, so I filter out all the unneeded columns. I have read that the Select/Rename step can be slow, but the REST calls are far slower, so that is not a worry here:


This step can also turn the date field into a DATE data type. This is needed because it came from Yahoo (and the splits) as a STRING:


Now we can take a look at the results using the PDI Preview (in Spoon):


All in all, it was pretty quick and easy to make. In the next post I'll show how to do the seemingly impossible: a data pivot in PDI with arbitrary columns.

click here for Part 2

March 11, 2011

Why I like Amazon AWS (Theoretically)

When I say I like Amazon AWS (the whole Amazon cloud services) "theoretically," I mean that I have not actually used it (yet). Instead, I like the design and the possibilities the service creates.

First, some background. I have been using a virtual computing service for many years. I had a virtual machine with Linode before they were using Xen (it was User-mode Linux back then, the poor man's virtualization: running a kernel in user mode). That is probably 7 years or more. I have always been happy with their service and uptime. I virtualized my personal mail server, and freed up space in my living room, the first chance I got. Having a virtual server is great, because I love concentrating on the OS and application while letting someone else worry about the hardware.

I have seen a number of performance tests comparing AWS, Linode and the Rackspace virtual machines. Linode often comes out on top, and Amazon often near the bottom. I never understood why anyone would choose the Amazon service over Linode to host a virtual machine. Plus, the virtual machine service EC2 did not save the server's state when it crashed or after a shutdown (though that is different now). That made no sense to me. But Amazon's service is really something more than virtual servers. AWS is not a virtual machine service; it is a virtual infrastructure service that contains machines.

It is like this: you can write a program to respond to HTTP and serve up web pages. Or, you can use a prewritten program (such as Apache) that is dedicated to responding to HTTP and give that program directions (HTML pages) for serving up a website. If the website is complicated and dynamic, you could create an application that talks to the web server and dynamically creates the HTML pages, or you could use a pre-written program (the PHP interpreter) and just give it directions for dynamically creating the website (a bunch of PHP files). The PHP files can be plopped on any Apache instance and run (you might need to configure Apache, but that is a separate job from the website creation).

In Java, you create a bunch of classes that you can bundle together as a JAR or WAR file and then hand that bundle to a Java virtual machine or a J2EE container to run. The JVM or JVM + J2EE container takes care of instantiating the classes and running the program. If you need more processing power, you can hand that same JAR or WAR file to many Java virtual machines or J2EE containers and have the program run many times without changing it. Complicated arrangements or clusters of Java programs can be created that communicate with each other to balance load. These programs do not hold data; they lose their state when they shut down or crash. The state of the program is stored externally to the Java program, in a database or on a file system.

Amazon EC2 is like the Java virtual machine. You can create an image of a system made up of a set of applications (shell scripts, Java programs, .Net files, Apache, Tomcat) plus a configuration of the OS (I like Linux), and then hand that image to Amazon. The important data is stored in another system (an outside database, some NoSQL thing such as SimpleDB, or Amazon's persistent storage EBS), but the rest of the state of the system is not really important. You don't patch an EC2 instance like I have patched my Linode boxes or my real live servers, just as you don't patch a running J2EE server with new libraries. You update the images (be they JAR files, WAR files, or AMIs, Amazon Machine Images) and roll them out.

Once I realized that, AWS made sense and seemed great. I can imagine building a system of interconnected services that run on different EC2 instances, configuring them as needed and rolling them out to be instantiated as needed.

March 10, 2011

Data Warehouse Toolkit

I started rereading Ralph Kimball's "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling." So far, it is as good on the second read as it was on the first. My only complaint is with the title: the subtitle, "The Complete Guide to Dimensional Modeling," is the more accurate name. The book is a great intro to dimensional modeling, with a lot of examples. Until I read it, I really did not get the point of dimensional modeling, as opposed to entity-relationship/3rd-normal-form style modeling. I had to re-read the first chapter a few times before dimensional modeling clicked, but once it did, the rest of the book was crystal clear. In fact, all the Kimball Group books I have read have been excellent--The Data Warehouse ETL Toolkit and The Data Warehouse Lifecycle Toolkit. (I read the first edition of the DW Lifecycle Toolkit, and that is a link to the 2nd edition, but I assume it is not substantially different.) I recommend them to everyone who asks about dimensional modeling. I have had such success using the Kimball model that I never got around to looking at the "opposing" data warehouse camp. I suppose I should read something by Bill Inmon. Maybe Corporate Information Factory?

Favorite Applications--Mirth

One of my favorite applications is Mirth.

In Healthcare IT, there is a messaging protocol called HL7. It can be used to send nearly any healthcare-related information between systems: appointments made, demographics updated, insurance information, lab results, etc. However, the protocol is not completely standard, and a connection between two systems (an interface) has to be configured by hand.

Mirth is an open source application built on the open source ESB (Enterprise Service Bus) Mule. It uses the HL7 HAPI library--thus the name. Mirth's main use is helping connect systems that use HL7, but it can do much more--plain text, email, JDBC, DICOM, and so on. Any message can be copied, altered, disassembled, reassembled, sent on, and so on. The best part is that the UI is nice and clean, and it makes setting up the connections and processing the messages easy. You can also use Javascript to process the messages, and even create pure Java classes to process the messages. So nearly anything can be done with the messages and the information in them. And that makes it great.
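As a small taste of the Javascript side, a transformer step that pulls a patient name out of an incoming HL7 message looks something like this (a sketch from memory, not from a real channel; msg is the parsed inbound message Mirth hands to the script):

    // Mirth transformer step -- a sketch; msg is the E4X view of the inbound HL7 message.
    var lastName  = msg['PID']['PID.5']['PID.5.1'].toString();
    var firstName = msg['PID']['PID.5']['PID.5.2'].toString();

    // Stash the full name in the channel map so later steps or the destination can use it.
    channelMap.put('patientName', firstName + ' ' + lastName);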

Here is a little example:
I had a system that sent appointment messages but was not set up to send laterality information (right side or left side) about the appointment, so it sent "X-ray Foot" instead of "X-ray Left Foot". The vendor could not or would not fix it. The information was locked up in a legacy system that did not use JDBC or ODBC--I think it was running on Cobol. So I wrote a Java class that could telnet to its terminal and query its SQL-like language. After adding the Java jar to the Mirth external library folder, I could have Mirth grab each HL7 message, extract the appointment info, query the original system (via my custom class), get the laterality, change the appointment type name, and send it on.
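Calling the custom class from the transformer Javascript is then straightforward; something like this (the class, method and HL7 field here are hypothetical stand-ins for my real ones):

    // Sketch only -- hypothetical class and method names.
    // The jar sits in Mirth's custom library folder, so it is on the classpath and
    // Rhino can reach it through the Packages object.
    var lookup = new Packages.com.example.LegacyLookup();

    // Pull an appointment id out of the inbound message (segment and field chosen for illustration).
    var apptId = msg['SCH']['SCH.1']['SCH.1.1'].toString();
    var laterality = lookup.getLaterality(apptId);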

I have numerous examples of Mirth letting me do the implausible; maybe I'll give more examples later. In all these cases Mirth made it easy, and that is why it is one of my favorite applications.