Sunday, October 21, 2012

Installing Cloudera CDH4 on Ubuntu 12.04 LTS

Presented are some short notes for installing Cloudera CDH4 on Ubuntu 12.04 LTS running as a guest OS on Oracle's VirtualBox. For those unfamiliar with Cloudera and CDH, CDH is Cloudera’s 100% open source Hadoop distribution. What is documented here is not a complete tutorial, but rather pieces of information to be used in conjunction with the product's documentation. Use these tips to make the installation of Cloudera on Ubuntu easier.

Prerequisites

Creating the VM

Create a new virtual machine using the new VM wizard and downloaded Ubuntu ISO. It is important to have the 64-bit LTS ISO or the Cloudera manager will not start.

VM Settings:

  • 4GB RAM (minimum)
  • 2 CPUs
  • 128MB Display Memory
  • 25GB Dynamic Disk

When finished, you should see something similar to the following:

Configuring Ubuntu 12.04 LTS

Once you have started the Ubuntu VM and logged in, set a password for root. The Cloudera manager will need the password to install the cluster. You are also free to use a passwordless sudo setup.

sudo passwd root

Next, you will need to install the SSH server and client. This is needed by the Cloudera manager for cluster installation:

sudo apt-get install openssh-client
sudo apt-get install openssh-server

Make the following changes to /etc/hosts. Not modifying the file will cause a number of cluster startup errors such as not being able to start hBase or creating a number of default directories:

127.0.0.1 KRDAVIS-CLOUDERA localhost
#127.0.0.1 localhost
#127.0.1.1 KRDAVIS-CLOUDERA

Install the GNOME session fallback package:

sudo apt-get install gnome-session-fallback

Logout and select "Gnome Classic (no effects)" for your session. This will prevent any weirdness with running Compiz under the VM. You can now log back in.

Install Cloudera CDH4

Start the cluster installation by running the Cloudera installation manager:

chmod 755 cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

Follow the instructions and accept and default values. When you are done, you should have a single node CDH4 cluster running in your VM!

Shutting down the cluster and VM

When it comes time to shutdown the VM, I found I have fewer problems if I shutdown the cluster by logging into the Cloudera management web app. Select "All Services" from the "Services" menu. For the cluster, select "Stop..." from the Actions dropdown menu. Wait for all services to come to a stop.

After verifying that all cluster services are stopped, shutdown the VM by opening a terminal and running the following command:

sudo /sbin/shutdown -h now

Selecting shutdown from the Ubuntu UI appears to only log out of the system without shutting it down. That is a problem for another day.

Thursday, October 4, 2012

PHP & MySQL - Search Multiple Fields and Terms

Background

I needed to find a simple way to implement a text search on a table without using MySQL's fulltext search functionality. I am currently limited to using a version of MySQL that only supports fulltext search on the MyISAM storage engine. My tables use the InnoDB storage engine, hence the problem. This post will show a quick and dirty way to search multiple database fields with multiple search terms using only SQL.

Requirements

  • Search for multiple terms across multiple database fields.
  • Specify which table fields are to be searched.
  • Search terms to be space delimited.
  • Find partial and exact terms.
  • Find several terms in a specific order preserving whitespace.
  • Identify and work with dates supplied in common formats.
  • Identify and work with numeric search terms.
  • Provide a few hints to control search behavior.
  • Generate SQL suitable for use in a WHERE clause.
  • Table independent for re-usability.

Note: High performance was not a consideration as the number of records searched would be relatively small.

Dates

While strings are straight forward to compare against, dates posed a bit of a problem. I needed to accept a few common date formats, check if the date was valid, and convert the search format to MySQL's internal date format to make a proper comparison.

The first step was to create an is_date() function to determine if a search term was valid date. The function below is straight forward. We break down the string into the appropriate date components and use the PHP checkdate() function to determine if the date is valid. The function is shown below.

    /**
     * Evaluates if a string is a valid date
     *
     * @param $str string - The date string to evaluate
     *
     * @return bool TRUE if the string is a valid date, FALSE otherwise 
     */
    public function is_date($str) { 
        $stamp = strtotime(str_replace(array('-', '.'), '/', $str)); 

        if (!is_numeric($stamp)) 
            return FALSE; 
        
        $month = date( 'm', $stamp ); 
        $day   = date( 'd', $stamp ); 
        $year  = date( 'Y', $stamp ); 
        
        if (checkdate($month, $day, $year)) 
            return TRUE; 
        
        return FALSE; 
    }

Monday, October 1, 2012

Tips for Running the Dynamic Data System (DDS)

The Dynamic Data System (DDS) makes loading a data warehouse much easier and faster. This system uses SSIS to load standard and partitioned tables via SSIS packages created on the fly using metadata. This eliminates the need to maintain complex SSIS packages. Only basic T-SQL skills are needed. The Dynamic Data System is available on CodePlex.

My current data warehouse project uses SQL Server 2008 R2. Out of the box I found a couple minor configuration issues that prevented me from getting started with using DDS. The following notes, in addition to the documentation provided with the product, should get you on your way.