Introduction to Hadoop

Here is a short presentation that I gave as a guest speaker to a group of graduate-level statisticians at the University of Utah. It is a very high-level view of what Hadoop is, who is using it, when you should use it, and a bit on how it works.

One Reply to “Introduction to Hadoop”

  1. When getting started, find an easy-to-install Hadoop distribution like Cloudera. It made it very simple to provision my 10-node test clusters. Make sure your DNS is set up properly for your nodes, as it will save you provisioning headaches later. I also found it much easier to install on Ubuntu Linux than on Red Hat Linux.

    As for optimal mapper/reducer ratios, it really seems to depend on your data, the number of cluster nodes, and the resources available on those nodes (memory, cores, storage, etc.). My tests showed performance increasing as I added more mappers to my Hive queries, up to the point where I began to see diminishing returns.
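
One way to spot the point of diminishing returns mentioned in the reply is to time the same query at several mapper counts and look at how much each increase actually helps. The sketch below is only an illustration under stated assumptions: the function name, the threshold values, and the timing figures are hypothetical placeholders, not measurements from the tests described above.

```python
def diminishing_returns_point(timings, min_gain=0.05):
    """Return the mapper count after which adding more mappers
    improves run time by less than `min_gain` (a fraction, e.g. 0.05 = 5%).

    `timings` is a list of (mapper_count, seconds) pairs.
    """
    if not timings:
        raise ValueError("timings must contain at least one measurement")

    ordered = sorted(timings)  # sort by mapper count
    for (prev_mappers, prev_secs), (_, next_secs) in zip(ordered, ordered[1:]):
        # Fractional improvement gained by moving to the next mapper count.
        gain = (prev_secs - next_secs) / prev_secs
        if gain < min_gain:
            return prev_mappers
    # Every step still helped by at least min_gain.
    return ordered[-1][0]


# Hypothetical placeholder timings: (mappers, query seconds).
sample_timings = [(2, 400.0), (4, 220.0), (8, 130.0), (16, 118.0), (32, 121.0)]

print(diminishing_returns_point(sample_timings, min_gain=0.10))
```

With these placeholder numbers and a 10% threshold, the step from 8 to 16 mappers is the first one that falls short. Real measurements will vary with data size, node count, and per-node resources, so it is worth repeating each run a few times before drawing conclusions.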
