hadoop-test-cluster
This package provides a Docker setup and command-line tool for setting up and managing several realistic Hadoop clusters. These can be useful for demo purposes, but their main intended use is development and testing. The Hadoop documentation is pretty emphatic about the need for testing:
If you don't test your YARN application in a secure Hadoop cluster, it won't work.
Hadoop's security model is notoriously tricky to get correct. So hard, in fact, that one of the main contributors wrote an online book comparing its difficulties to Lovecraftian horrors. The clusters provided here have settings selected to expose potential issues in Hadoop and YARN applications, so you can fix your code early rather than in production. If your code runs on these clusters, it should (hopefully) run on any cluster.
Highlights
- Easy-to-use command-line tool
- Supports both Kerberos and simple authentication modes
- Comes with common development and testing dependencies pre-installed (miniconda, maven, git, ...)
- Designed to minimize resource usage; runs fine both on laptops and on common CI servers such as Travis CI
- Configured with a set of options designed to expose bugs in Hadoop applications. If your code runs on these clusters, it should (hopefully) run anywhere.
Installation
hadoop-test-cluster is available on PyPI:
$ pip install hadoop-test-cluster
You can also install from source on GitHub:
$ pip install git+https://github.com/jcrist/hadoop-test-cluster.git
Docker and Docker Compose must already be installed on your system; consult their documentation for installation instructions.
Overview
The main entry point for this package is the htcluster command-line tool. This tool can be used to start, stop, and interact with test Hadoop clusters.
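For example, a minimal end-to-end session using the subcommands covered below might look like:
$ htcluster startup --image cdh5 --config simple   # start a cluster
$ htcluster login                                  # work on it interactively
$ htcluster shutdown                               # tear everything down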
Startup a Cluster
To start a cluster, use the htcluster startup command. Two parameters are used to describe which cluster to run:
- --image: which docker image to use
- --config: which Hadoop configuration to run the cluster with
Images
Currently supported images are:
- cdh5: a CDH 5 installation of Hadoop 2.6 (image at jcrist/hadoop-testing-cdh5)
- cdh6: a CDH 6 installation of Hadoop 3.0 (image at jcrist/hadoop-testing-cdh6)
To determine which version of Hadoop a cluster is running, see the HADOOP_TESTING_VERSION environment variable (will be set to either cdh5 or cdh6).
Configurations
Currently two different configurations are supported:
- simple: uses simple authentication (unix user permissions)
- kerberos: uses kerberos for authentication
To determine which configuration a cluster is running, see the HADOOP_TESTING_CONFIG environment variable (will be set to either simple or kerberos).
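These variables make it easy to write tests that adapt to whichever cluster they run on. As a sketch, a script run on the cluster (e.g. via htcluster exec, described below) might branch on them like this:
# Acquire credentials only when the cluster requires kerberos
# (the testuser keytab is described under Users below)
if [ "$HADOOP_TESTING_CONFIG" = "kerberos" ]; then
    kinit -kt /home/testuser/testuser.keytab testuser
fi
# Branch on the Hadoop version under test
if [ "$HADOOP_TESTING_VERSION" = "cdh6" ]; then
    echo "testing against Hadoop 3.0"
else
    echo "testing against Hadoop 2.6"
fi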
Examples
Start a CDH 5 cluster with simple authentication:
$ htcluster startup --image cdh5 --config simple
Start a CDH 6 cluster with kerberos authentication:
$ htcluster startup --image cdh6 --config kerberos
Start a cluster, mounting the current directory to ~/workdir:
$ htcluster startup --image cdh5 --mount .:workdir
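The --mount flag supports a quick development loop: edit files locally, then run them inside the cluster with the htcluster exec command described below. A sketch (the test path here is illustrative):
$ htcluster startup --image cdh5 --mount .:workdir
$ htcluster exec -- py.test ~/workdir/tests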
Login to a Cluster
For interactive work, you can log in to a cluster using the htcluster login command.
Examples
Login to the edge node as the default user:
$ htcluster login
Login to the master node as root:
$ htcluster login --user root --service master
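Once logged in, the standard Hadoop command-line tools are available on the node. For example, you might sanity-check HDFS from the edge node (the paths here are illustrative):
$ hdfs dfs -ls /                            # list the filesystem root
$ hdfs dfs -mkdir -p /user/testuser/data    # create a scratch directory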
Execute a Command on a Cluster
Instead of logging in, you can also run a command on a cluster using the htcluster exec command.
Examples
Run py.test on a cluster:
$ htcluster exec -- py.test /path/to/my/tests
Follow the HDFS NameNode logs:
$ htcluster exec --user root --service master \
-- tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode.log
Shutdown a Cluster
To shut down a cluster, use the htcluster shutdown command.
Example
$ htcluster shutdown
Cluster Details
Here we provide more details on what each cluster supports.
Layout
Each cluster has three containers:
- One master node running the hdfs-namenode and yarn-resourcemanager, as well as the kerberos daemons. Hostname is master.example.com.
- One worker node running the hdfs-datanode and yarn-nodemanager. Hostname is worker.example.com.
- One edge node for interacting with the cluster. Hostname is edge.example.com.
Installed Packages
All clusters come with common development and testing dependencies pre-installed (miniconda, maven, and git, among others). Additional packages can be installed at runtime with yum, conda, or pip.
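For example, extra test dependencies could be added to a running cluster via htcluster exec (the package names here are only for illustration):
$ htcluster exec -- conda install -y numpy
$ htcluster exec -- pip install pytest-cov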
Users
One default user account has been created for testing purposes:
- Login: testuser
- Password: testpass
When using kerberos, a keytab for this user has been put at /home/testuser/testuser.keytab, so you can kinit easily:
$ kinit -kt /home/testuser/testuser.keytab testuser
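After running kinit, you can verify that a ticket was granted with the standard klist command:
$ klist   # should show a ticket for testuser@EXAMPLE.COM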
An admin kerberos principal has also been created for use with kadmin:
- Login: root/admin
- Password: adminpass
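As a sketch, assuming the kadmin client is installed locally and configured for the cluster's realm (the kerbenv command described below sets this up), you could connect and list principals:
$ kadmin -p root/admin    # prompts for the adminpass password
kadmin:  listprincs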
Ports
The following ports are exposed:
Master Node:
- NameNode RPC: 9000
- NameNode WebUI: 50070
- ResourceManager WebUI: 8088
- Kerberos KDC: 88
- Kerberos Kadmin: 749
Worker Node:
- DataNode WebUI: 50075
- NodeManager WebUI: 8042
Edge Node:
- User Defined: 8888
- User Defined: 8786
The full address for accessing these depends on the IP address of your docker-machine driver, which can be found with:
$ docker-machine inspect --format {{.Driver.IPAddress}}
If you frequently want access to the WebUIs, it's recommended to add the following lines to your /etc/hosts:
<docker-machine-ip> edge.example.com
<docker-machine-ip> master.example.com
<docker-machine-ip> worker.example.com
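With those entries in place, you can check connectivity to the WebUIs directly, for example (with the simple configuration; see the next section for kerberos):
$ curl http://master.example.com:50070   # NameNode WebUI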
Authenticating with Kerberos from outside Docker
With the kerberos configuration, the WebUIs are secured by kerberos, and so won't be accessible from your browser unless you configure things properly. This is doable, but takes a few steps:
1. Kerberos/SPNEGO requires that the requested url matches the host's domain. The easiest way to do this is to modify your /etc/hosts and add a line for master.example.com:
# Add a line to /etc/hosts pointing master.example.com to your docker-machine
# driver ip address.
# Note that you probably need to run this command as a super user.
$ echo "$(docker-machine inspect --format {{.Driver.IPAddress}}) master.example.com" >> /etc/hosts
2. You must have kinit installed locally. You may already have it; otherwise it's available through most package managers.
3. You need to tell kerberos where the krb5.conf is for this domain. This is done with an environment variable. To make this easy, htcluster has a command to do this:
$ eval $(htcluster kerbenv)
At this point you should be able to kinit as testuser:
$ kinit testuser@EXAMPLE.COM
To access kerberos-secured pages in your browser you'll need to do a bit of (simple) configuration. See Cloudera's documentation for information on what's needed for your browser.
Since environment variables are only available for processes started in the environment, you have three options here:
1. Restart your browser from the shell in which you added the environment variables.
2. Manually get a ticket for the HTTP/master.example.com principal. Note that this will delete your other tickets, but works fine if you just want to see the webpage:
$ kinit -S HTTP/master.example.com testuser
3. Use curl to authenticate the first time, at which point you'll already have the proper tickets in your cache, and the browser authentication will just work. Note that your version of curl must support the GSS-API:
$ curl -V  # Check your version of curl supports GSS-API
curl 7.59.0 (x86_64-apple-darwin17.2.0) libcurl/7.59.0 SecureTransport zlib/1.2.11
Release-Date: 2018-03-14
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz UnixSockets
$ curl --negotiate -u : http://master.example.com:50070  # get an HTTP ticket for master.example.com
After doing one of these, you should be able to access any of the pages from your browser.