I recently saw the announcement for Kali Linux on Vagrant. I have been a huge fan of Kali Linux for a very long time, and I am interested in virtualization (I currently use VirtualBox in an educational environment), so this combination was very interesting to me. I have now installed it on a few of my systems, and so far I am quite impressed with it.
The Internet of Things is the new frontier. However, generations of ERP systems were not designed to handle global networks of sensors and devices.
With the big three Hadoop vendors (Cloudera, Hortonworks and MapR) each providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop has become extremely easy. For a developer, it is very useful to download one of these VMs and start practicing data science on Hadoop right away. However, alongside core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which can otherwise be a pain because the Hadoop ecosystem consists of many scattered open-source projects. Hortonworks includes the open-source Ambari, while Cloudera includes its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters. Moreover, most of these distributions today require a 64-bit machine and sometimes a large amount of memory (for a laptop).
If you already have a running Cloudera Manager installation, this course follows on with the logic behind the placement of the Hadoop master/slave daemons across your cluster. We discuss the placement and then perform the installation of Hadoop. If you do not have a Cloudera Manager installation and you want to follow along hands-on, you can complete the course "Real World Vagrant - Automate a Cloudera Manager Build - Toyin Akin" beforehand. "Big Data" technology is a hot and highly valuable skill to have, and this course will teach you how to quickly deploy a Hadoop cluster using the Cloudera stack. Cloudera allows you to download a QuickStart virtual machine, which is great for developers but of no use to an operations team that needs to plan and build out DEV, UAT and PROD environments within its organization.
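To make the idea of daemon placement concrete, here is a minimal sketch in Python. It is not the course's actual layout: the host names, the choice of a single master host, and the `plan_placement` helper are all assumptions made for illustration. Only the daemon names (NameNode, ResourceManager, DataNode, NodeManager) are standard Hadoop roles.

```python
from typing import Dict, List

# Standard Hadoop daemon names; how they are grouped here is an assumption.
MASTER_ROLES = ["NameNode", "ResourceManager"]
WORKER_ROLES = ["DataNode", "NodeManager"]


def plan_placement(hosts: List[str]) -> Dict[str, List[str]]:
    """Assign master daemons to the first host and worker daemons to the rest.

    Hypothetical helper: a real cluster plan would also weigh memory,
    disks, and high-availability requirements.
    """
    if len(hosts) < 2:
        raise ValueError("need at least one master host and one worker host")
    placement: Dict[str, List[str]] = {}
    for index, host in enumerate(hosts):
        roles = MASTER_ROLES if index == 0 else WORKER_ROLES
        placement[host] = list(roles)  # copy so callers can modify safely
    return placement
```

Calling `plan_placement(["node1", "node2", "node3"])` puts the master daemons on `node1` and the worker daemons on the other two hosts. Keeping the plan as plain data makes it easy to review before any installation is attempted.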
Note: This course is built on top of the "Real World Vagrant - Automate a Cloudera Manager Build - Toyin Akin" course. Instruct Cloudera Manager to do the work! Here we use Python to instruct an already installed Cloudera Manager to deploy your Hadoop services. The API is served on the same host and port as the Cloudera Manager Admin Console, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Cloudera Manager Admin Console. Cloudera Manager supports HDFS, MapReduce, YARN, ZooKeeper, HBase, Hive, Oozie, Hue, Flume, Impala, Solr, Sqoop, Spark and Accumulo.
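As a rough illustration of the API access described above, the sketch below builds an HTTP Basic Authentication request using only the Python standard library. The host name, port 7180, the `admin`/`admin` credentials and the `v19/clusters` path are assumptions for the example, not values taken from the course; substitute whatever your own Cloudera Manager Admin Console uses. The helper names are hypothetical as well.

```python
import base64
import json
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(host: str, port: int, path: str,
                  username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request against the API.

    The '/api/' prefix and the path layout are assumptions; check the
    API documentation for your Cloudera Manager version.
    """
    url = f"http://{host}:{port}/api/{path.lstrip('/')}"
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    request.add_header("Accept", "application/json")
    return request


def fetch_json(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON body (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(build_request("cm.example.com", 7180, "v19/clusters", "admin", "admin"))` would list clusters if such a server existed. Because the API shares the Admin Console's host, port and credentials, no extra setup beyond a working Cloudera Manager login should be needed.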
A fully featured Hadoop environment has a number of pieces that need to be integrated. Vagrant and Ansible are just the tools to make things easier. When getting started with Hadoop, it is useful to have a test environment for quickly trying out programs on a small scale before submitting them to a real cluster (or before setting that cluster up). There are instructions on the Hadoop website that describe running Hadoop as a single Java process. However, I've found that running this way hides a lot of how Hadoop really works for large-scale applications, which makes it harder to understand what problems need to be solved for an implementation to work and perform well in a real cluster.