July 26, 2011

Pentaho and SSO Part 1: CAS

Single Sign On allows users to log in once and access multiple applications without logging in again. The paid version of Pentaho comes with a script that enables SSO for Pentaho, but the community edition does not. However, since Pentaho uses Spring Security, you can still set up Pentaho to use SSO. This post will describe how to set up Pentaho to use Jasig CAS, a popular, free, open-source SSO server.

(As an aside, let me say I am not a Pentaho Enterprise Edition customer, nor have I ever seen the Pentaho SSO script. This is a "clean-room" implementation of the configuration.)

Before getting into configuration, an overview of how CAS works would be helpful. When you try to access a protected webpage in Pentaho, the BI-Server checks to see if you are authenticated (and if you are authenticated, checks to see if you are authorized). If you are not authenticated, you are redirected to a login page. When Pentaho is CAS enabled, instead of beign redirected to the Pentaho login page, you are redirected to the CAS login page, which has an address like this:

http://cas.server.address/cas/login

After you login to CAS, you will be redirected back to the original page you tried to access, which probably is the Pentaho PUC. CAS needs to know what page you attempted to see, so it can redirect you there after you authenticate. To tell CAS, the address of the page you attempted to access is appended to the end of the CAS URL. When you are using the Pentaho User Console (PUC), that page is the Spring Security page that checks for authentication:

http://your.cas.address/cas/login?service=http://yourPentaho/pentaho/j_spring_cas_security_check

When it gets the request for the login page, the first thing CAS does is check to see if you have logged into CAS already. If you have not, you are presented with a login page. If you have a correct username and password you are redirected back to http://youPentaho/pentaho/j_spring_cas_security_check. If you have already logged in with CAS (maybe you have logged in before accessing another CAS enabled application), you are immediately redirected back to j_spring_cas_security_check without seeing the CAS login page. In both cases, once you are redirected back to j_spring_cas_security_check, you are assigned a one time use ticket, which is append to that URL. Like this:

http://yourPentaho/pentaho/j_spring_cas_security_check?ticket=ST-3555-McPZ4NKfx6S0EhnCEkHc

Spring Security takes the ticket and checks to see if it is valid by submitting it back to CAS. Like this:

http://your.cas.address/cas/serviceValidate?ticket=ST-3555-McPZ4NKfx6S0EhnCEkHc&service=http://yourPentaho/pentaho/j_spring_cas_security_check

If the ticket is good, CAS responds with the username.  Now that Pentaho knows who you are, the regular Pentaho authorization occurs (checking groups and user access rules). However, for the Pentaho authorization to work, you will either need CAS to point to the database where Pentaho stores its users, or make sure that Pentaho can get group information from the CAS database of users. In this example, we are going to create a new MySQL database of users and point both CAS and Pentaho to it.

The three tables that are needed are a Users table, a groups table (called Authorities) and a many to many table to assigning users to groups.

For the Users:

CREATE TABLE `USERS` (
  `USERNAME` varchar(50) NOT NULL,
  `PASSWORD` varchar(512) DEFAULT NULL,
  `DESCRIPTION` varchar(100) DEFAULT NULL,
  `ENABLED` bit(1) NOT NULL,
  PRIMARY KEY (`USERNAME`)
)


Groups:

CREATE TABLE `AUTHORITIES` (
  `AUTHORITY` varchar(50) NOT NULL,
  `DESCRIPTION` varchar(100) DEFAULT NULL,
  `AUTH_GROUP` varchar(255) NOT NULL DEFAULT 'NONE',
  PRIMARY KEY (`AUTHORITY`)
)


And group assignment:

CREATE TABLE `GRANTED_AUTHORITIES` (
  `USERNAME` varchar(50) NOT NULL,
  `AUTHORITY` varchar(50) NOT NULL,
  PRIMARY KEY (`USERNAME`,`AUTHORITY`),
  KEY `FK7471775DD9EDC77F` (`USERNAME`),
  KEY `FK7471775D41B6DA97` (`AUTHORITY`)
)




Installing CAS:

CAS runs a regular Java webapp. You can run it in the same Tomcat instance as Pentaho, or on another machine.  Once you download CAS, you will need to edit the deployerConfigContext.xml file in the WEB-INF directory to point CAS to the USERS table. In the authenticationHandlers section of the xml file, delete the SimpleTestUsernamePasswordAuthenticationHandler bean, and replace it with this bean:

<bean id="SearchModeSearchDatabaseAuthenticationHandler" class="org.jasig.cas.adaptors.jdbc.SearchModeSearchDatabaseAuthenticationHandler" abstract="false" scope="singeton">
<property name="tableUsers">

  <value>USERS</value>
</property>
<property name="fieldUser">

  <value>USERNAME</value>
</property>
<property name="fieldPassword">

  <value>PASSWORD</value>
</property>
<property name="dataSource" ref="dataSource"/>
<!-- Use the same PW Encoder as your db -->
  <property name="passwordEncoder">
  <bean class="org.jasig.cas.authentication.handler.DefaultPasswordEncoder">
    <constructor-arg value="SHA-256"/>
  </bean>
</property>
</bean>



Note that I am using SHA-256 to encode the password. You can use any encoding that you want, MD5 for example.

Finally, point CAS to the right database by adding this to deployerConfigContext.xml:

<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource">
 <property name="driverClassName"> <value>com.mysql.jdbc.Driver</value> </property>
 <property name="url"> <value>jdbc:mysql://your_mysqlhost:3306/YOUR_DB</value> </property>
 <property name="username"> <value>DB_USERNAME</value> </property>
 <property name="password"> <value>DB_PASSWORD</value> </property>
</bean>


Now CAS is setup. Start up Tomcat (or whatever Servlet container you like).

Part 2 will go over the Pentaho configuration. 

June 22, 2011

Touring CBF part 2

Long delayed, but here is part II:

Running the CBF is easy if you have defined all your build.properties correctly. Assuming you have ant installed, go to the root cbf directory and run this:

ant -Dproject=ProjectName build-all

Note that the name of the project is the part after "project." So if your directory is "project-test" your project name is "test". To see a list of all the ant targets to use you can do this:

ant -Dproject=ProjectName -p

The targets you are interested are probably:

ant -Dproject=ProjectName clean Which cleans the build (but not the deployed directory).

ant -Dproject=ProjectName all Which builds Pentaho.

and ant -Dproject=ProjectName deploy-all Which eploys the biserver and solution to the target.

If you want added flexibility, you can create custom properties files for development and production. The structure is the same as regular build.properties. So copy build.properties to a file named "dev.properties" and make any necessarily changes. Then you can do this:

ant -Dproject=client -Denv=dev dist-clean all run

That was a long wait for a little post....but that should be enough to get started using the CBF.

April 13, 2011

Touring Pentaho CBF part 1

What is it?

The Pentaho Community Build Framework (CBF) is an ant script that can compile, setup and deploy the Pentaho BI-server. The CBF allows the central management of multiple server configurations, such as development and production, for multiple projects. The local copy of the Pentaho bi-server source files are not edited or changed in any way and each bi-server project has its own directory of changes for the bi-server, such as the web.xml or context.xml files, and its own solution directory. When run, the CBF apples the project specific changes (called "patches") to the bi-server, and deploys the project. The CBF was first created by Pedro Alves.

How does it work?

The root of the CBF directory needs a copy of the Pentaho sources file, a copy of the Tomcat source files, and one or more directories for each bi-server project. Each of these projects must be in a directory called project-XXXX where XXXX is the name of the project. For example, if there are three projects named "client1", "client2", and "testing" there should be three directories called project-client1, project-client2, and project-testing. When the CBF is run, two additional directories are created: a build directory called target-build for the compiled bi-server, and a directory where the final bi-server is prepared for distribution called target-dist. Here is the directory structure of the CBF looks like:


cbf

├── pentaho
├── project-test
│   ├── config
│   ├── patches
│   └── solution
├── target-build
├── target-dist
│   ├── administration-console
│   ├── licenses
│   └── server
└── tomcat


Every project directory has three sub directions: one for the CBF configuration (config), one for the project solutions (solutions), and one for any changes to be made to the bi-server (patches) when it is deployed.

When the ant script runs, the patches get applied by copying the entire patches directory on top of the root CBF directory. Therefore, in order for the patches to get applied correctly, the patches directory must mirror the structure of the root CBF directory; changes to the final bi-server need to be in a folder called target-dist and changes to the project solution directory (such as changes in the solution's system or admin directory) need to be a solution directory under a directory with the same name as the project. For example:


project-test

├── config
├── patches
│   ├── project-test
│   │   └── solution
│   │       └── system
│   └── target-dist
│       └── server
│           ├── conf
│           └── webapps
│               └── pentaho
│                   ├── META-INF
│                   └── WEB-INF
└── solution

Setup

To download the Pentaho BI-server sources, you can use subversion:

svn co svn://source.pentaho.org/svnroot/bi-platform-v2/tags/3.8.0-stable/

I renamed the folder to "pentaho":

mv 3.8.0-stable pentaho

Download Tomcat, the CBF, and make the project directories.

The main configuration for a project is located in the config direcory and is called build.properties. Here is an example:

#######################################
## GENERIC PROPERTIES
######################################

javac.path = /usr/bin/javac

# Solution path. Required for compile time
solution.path = ~/cbf/project-test/solution

# Solution path required for runtime. Defaults to the above value but can be different if we're
# deploying to another place. Should be used in WEB-INF/web.xml in the solution-path parameter
solution.deploy.path = NOT SET

#accepted values: tomcat or jboss
server.name = tomcat

# Tomcat 5.5 path:
tomcat.path = ~/cbf/tomcat

# Copy the sample solutions to our project directory? true | false
copy.pentaho.samples = true

# Java options for the run command
java.opts = -Xmx512m -XX:MaxPermSize=512m -Xrunjdwp:transport=dt_socket,address=8765,server=y,suspend=n

# to allow 3.8 to build correctly
project.revision=3.8.0-stable
dependency.pentaho-reporting.revision=3.8.0-stable

#####################################
## PROJECT DIRECTORIES - The defaults usually work fine
######################################
pentaho.dir = pentaho/
pentaho.build.dir = target-build/

#####################################
## DEPLOY OPTIONS
######################################
deploy.mode = rsync
deploy.args = -avz --exclude '.svn/' --exclude '*.log' --exclude 'work/' --exclude 'temp_user/' --exclude 'temp/' --exclude 'tmp/'
deploy.dest.server = user@localhost:/Path/To/Deploy/
deploy.dest.solution = user@localhost:/Path/To/Solution/

#####################################
## DYNAMIC TOKENS TO BE USED AS FILTERS
#####################################

DB_LOCATION = 127.0.0.1
DB_USER = someone
DB_PASSWORD = bar
BASE_URL = http://127.0.0.1:8080/pentaho

Finally, there can be different configurations for the project. Each configuration needs its own file in teh config directory called build-XXX.properites. For example, build-testing.properties, or build-production.properties.

An Example

After making the appropriate changes to the context.xml for a project, it should go in this direcory:

project-test/patches/target-dist/server/webapps/pentaho/META-INF

And the web.xml should go here:

project-test/patches/target-dist/server/webapps/pentaho/WEB-INF

A Second Example

The Community Dashboard Editor (CDE) uses files with the .wcdf  extension. The bi-server cannot apply ACLs on those files without making a change to the pentaho.xml file. The pentaho.xml file is located in the solution's system folder, but when the CBF compiles the bi-server, the solution system folder is written over. Any project specific changes needed in pentaho.xml must be put in the patches directory. So make the diretory, make a copy of the pentaho.xml file:

mkdir patches/project-test/solution/system/
cp ./solution/system/pentaho.xml patches/project-test/solution/system/


and make the following change in patches/project-test/solution/system/pentaho.xml:

<acl-files> xaction,url,prpt,xdash,xcdf,wcdf</acl-files>

Next time: Running it

April 8, 2011

Playing with LucidDB: namespace and users

One of the challenges (and part of the fun) of using a relatively new open source project, is useing the software despite the lack of complete documentation. The nice thing about open source is that if you have a question, there usually is a discussion board or list-serve where you can ask the creator directly. In the spirit of openness, I am posting some of little pointers about using the column based open source database LucidDB especially comparing to to MySQL. There is nothing new here, but if you come to LucidDB from MySQL there are a few (minor) gotchas that is would be nice to avoid. So here is a post about the LucidDB namespace and users. (BTW, LucidDB's approach to namespace and users should be applicable to many databases)

LucidDB is a column-store database designed for data warehousing and business intelligence. It is open source, and has a growing community, including a new corporate sponsor: Dynamo BI. Column-store databases are faster than row-store databases when executing the kinds of queries that are common in data warehouses. There are a number of speed comparisons of LucidDB on the web, and it always does rather well. Additionally, Lucid as a number of interesting features such as the ability to connect to other databases (and even Hadoop!) and run queries on the remote database.

Navigation and basic interaction:

Installing LucidDB is pretty easy, and there are directions on the LudicDB site, so let's start with some basic navigation, and compare it to MySQL.

In MySQL, the highest level of a namespace is the "database." So a single project might have a single database, but there might be other databases running on the same MySQL instance. Any table can be referred to by its qualified name as in:

select * from databaseName.tableName

When you refer to a table without a database name, MySQL will assume you are referring to the presently used database. This is equivalent to the above query.

use databaseName;
select * from tableName;


The highest level of the LucidDB namespace is "catalog" and under catalog is "schema." (I think this is closer to the SQL standard than MySQL.) Here is an example of a LucidDB query with a qualified name:

select * from catalogName.schamaName.tableName

Like MySQL, LucidDB assumes the present catalog and schema when you refer to a table without a catalog and schema. You can change catalog or schema in LucidDB by using the "SET" command (Note the single quotes around the catalog and schema names!):

set catalog 'catalogName';
set schema 'schemaName';
select * from tableName;


If you get lost, MySQL has a function to tell you what database you are in:

select database();

LucidDB does not have an equivalent function (though you could write one if you want). Instead, you must query one of the special systems views. These views expose information about the state of the system. For the current catalog and current schema, you can query the user_session_parameters view to get both:

select * from LOCALDB.sys_root.user_session_parameters where param_name in ('schemaName', 'catalogName');

You can use the same system view to get a lot of information about your current session as well.

In Mysql, the SHOW command is useful to see structure of the database (and a list of available databases). For example:

SHOW databases;
SHOW TABLES;
SHOW VIEWS;


(PostgresQL has its own commands: \dt to show tables). To get similar information from LucidDB you must again go to the systems views. The view contain all the information about tables, views, triggers, databases, etc. Refer to this link for details about all the information you can see, but here are some examples:

SELECT * FROM sys_root.DBA_SCHEMAS;
SELECT * FROM sys_root.DBA_TABLES;
SELECT * FROM sys_root.DBA_VIEWS;


Creating users

When you install LucidDB, there is a default administrative user called "sa" with no password. You can log into LucidDB and create a new user, and set the user's default schema:

create or replace user "sa" identified by 'aGoodPassword';

Or create a new user:

create user "userName" identified by 'aBetterPassword' default schema test_schema;

Lastly, there are a two useful systems views that have information about users, DBA_USERS and DBA_ROLES. However, they are only available in the (yet) unreleased version 0.9.4.

As a side note, it appears that you can set up LudidDB to use JAAS, but the documentation is unclear (to me), and I have not tried it.

March 17, 2011

Last note on the impossible PDI pivot

Here is a blog post by Matt Casters, the founder of Kettle (which was rebranded as Pentata Data Integration). In it he is un-pivoting data. That is to say, doing the opposite of what I did in the last two posts. I solved the problem of knowing how many columns to create by hard coding 60. He uses a more elegant solution of analyzing the data in one transformation and then using the results of the analysis in a second one and using the metadata injection step. I suspect that following his lead I could also use the metadata injection step and a separate analysis transformation to dynamically set the number of columns. But first I have to study the metadata injection step to see how it works.

March 16, 2011

Pentaho PDI example Part II: Doing the impossible

Continuing from the last post, we now have a list in the format of "symbol", "date", "amount" and we want to pivot the data. The normal way to pivot would be to hard code each column name and use the Pivot step. However, we need something a little more dynamic because the list of ticker symbols are in the config file and so are not know until run time. So we will cheat (based on an idea from the Pentaho Users Forum). The output file will look like pivoted data, but the Kettle transformation, and will (semi)-dynamic, but will not be an actual dynamic pivot. But first, the data needs to cleaned.

The same parameters were used for all symbols, so all the symbols should have the same set of dates. However, there are some dates that appear in the date set of a single symbol list. This is usually the stock's inception date and first trade. A second date problem is that there are no prices for dates before a stock started trading. For example, if we ask for monthly amounts for 2000-2010, but a stock was first offered on April 8, 2008, there is no data at all from 2000-2008 and the first price is April 8, 2008 instead of the first trading day of that month. No other stocks have prices on April 8, 2008. That date must be cut out, and nulls or 0 must be put in for all the dates from 2000 to April 2008.

To solve the first problem, we can do a cartesian join between all possible dates, and all symbols.


From the filtering step (last step discussed in the last post), two streams are created. One stream will be used to get the dates, the other for the symbols. The Pentaho "Group by" step can be used to get the distinct values:


The same is done for the symbols, and then the "cartesian join" step produces all possible combinations of dates and symbols. To get the prices, we do another stream lookup on a third stream coming from the filtering step. (I put a dummy step between the filter and the lookup for layout) We will be putting a "0" for the price of a stock at any unknown date.


Before we start with the real work of pivoting, we need a stream without the extraneous dates. To get rid of the unwanted dates, such as the inception trading date, we will take a copy of the stream of all dates, sort them, and use the "Group by" step.


The "Group by" step is used to get the minimum day for each month/year combination. This will cut out extra dates, such as the April 8 example above.


Next, a little Javascript to create a date from the year, month and day columns:


We will return to this stream of good dates after we start the magic pivot.

The Pivot:

We can't really do a pivot with arbitrary columns, so we cheat a little. Once the stream is sorted by date and symbol, we can join all the symbols in one comma delimited field, and the prices in another. Because we made sure that every symbol has a value for every date, and the stream is ordered by symbols and date, these combined fields will have the values for each date and the symbols for each value in the same order. Here is the step:


And here is a sample of the output:


If we can align the symbol cell with the data cell, and then split them on the commas to create new fields, we would almost have our pivot. The symbols will be the first row, the values the rest. If the output file does not print the real headers it will look like we performed an arbitrary pivot.


The split is done to create fields with meaningless names. The two streams can be combined as long as both the symbol stream and the value streams use the same field names. I used 60 fields so I can use up to 60 symbols. Here is the split for the symbols:


Now that the prices are grouped by date, we can go back do a stream lookup with the stream of good dates. This will make sure that we get values only for dates that we want.


Add a constant and the "Group by" step can be used to limit the symbol stream to 1 row. Append the two streams together and output the results (without headers). If the resulting file it imported into a spreadsheet, the first row will act as headers for the data:


There is a Jira request to do a arbitrary pivot by changing the metadata. Until then, this little trick is a good work around.

March 15, 2011

Pentaho PDI example

The PDI task is to get the historical numbers for some stocks from Yahoo's REST-ful service and format the data so the column headers are the ticker symbol. Each row will be the date and the prices for each ticker.

Part I: getting the data


The first part of the transformation reads the configuration from an XML file: a list of ticker symbols to get, and the parameters for the Yahoo service. The Yahoo REST parameters need to be the same for all the symbols, but are only in the config.xml file one. To combine the single set of parameters with the list of symbols, the config is read twice. The first stream gets the list of symbols, and a second stream gets the parameters. The PDI stream lookup step combines the two streams. Because there is only one set, all the symbols get the same parameters


The result of the stream lookup step is fed to the HTTP client step. This step does the REST lookup of the Yahoo server. The base URL is http://ichart.finance.yahoo.com/table.csv and with the parameters it looks something like this:
http://ichart.finance.yahoo.com/table.csv?s=VFINX&a=03&b=6&c=2006&d=02&e=29&f=2011&g=m&ignore=.csv
The result of the RESTfull lookup is a csv file, but the HTTP client step puts the entire result (and so the entire text file) in one field called "Result." To use the data, we need to parse up the csv file so that each row in the file is one row in the transformation, and each cell is a column. The first step is to split the one row into multiple rows, each with one line of the csv file. To do that we use the "split field to rows" step. I was not sure what the line delimiter was coming out of Yahoo (or even if it was set by Yahoo, or the Http Client Step), so I used the Java internal variable to set it.


Now we have a row in the transformation for each row in the CSV that came from Yahoo. We still need to get one field in the transformation for each cell in the CSV file. Using "Split field" step to split on a comma takes care of that.


We don't need the header from Yahoo, so that is filtered out.

The results from Yahoo has the opening price, closing price, high, and low but we only want one price. The one we want is set in the configuration file. The easiest way to select the price is to use a Javascript step. The script has a case statement based on the "value_to_use" field from the original XML config file.


Now we have field for the symbol (fron the config), the amount (from the Javascript step) and the date (from Yahoo). To make debugging easier, it would be nice to clear out extraneous info, so I filter out out all the unneeded columns. I have read that the Select/Rename step can be slow, but the REST calls are far slower, so that is not a worry here:


This step can also turn the date field into a DATE data type. This is needed because it came from Yahoo (and the splits) as a STRING:


Now we can take a look at the results using the PDI Preview (in Spoon):


All and all, it was pretty quick and easy to make. Next post I'll show how to do the seemingly impossible step of a data pivot in PDI with arbitrary columns.

click here for Part 2