For the last month or so, I've been working on a couple of projects that have required me to move files in and out of HDFS. It's pretty straightforward once you get the appropriate tools working, but it can be a bit counterintuitive to get started (at least it was when I was learning it). Here's how you get started:

Install your tools

In this tutorial, we are working with Cloudera 5.5.1, using an Ubuntu 14.04 (Trusty Tahr) instance to connect to it. First, we need to add Cloudera's repo to apt:

$ wget
$ sudo dpkg -i cdh5-repository_1.0_all.deb
$ sudo apt-get update

Since I use both the hdfs command and FUSE, I just install the hadoop-hdfs-fuse package, which pulls in both tools.

$ sudo apt-get install hadoop-hdfs-fuse
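A quick sanity check after the install can be sketched as follows. The helper name `require_cmds` is made up for illustration; it only checks that the two executables used later in this post (`hdfs` and `hadoop-fuse-dfs`) are on your PATH, not that they are configured correctly.

```shell
# Hypothetical helper: report any of the given commands that are not on PATH.
require_cmds() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            status=1
        fi
    done
    return "$status"
}

# The two tools used in this post; both should arrive with hadoop-hdfs-fuse.
if require_cmds hdfs hadoop-fuse-dfs; then
    echo "HDFS tools are installed"
fi
```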

One prerequisite that apt fails to install is Java. If you try running the hdfs command, you'll get the following error:

Error: JAVA_HOME is not set and could not be found.  

Let's put Java on there:

$ sudo apt-get install openjdk-7-jre
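Installing the JRE was enough for me, but if the JAVA_HOME error persists (say, because your Java lives somewhere the Hadoop scripts don't look), you can set the variable yourself. Here is a rough sketch, assuming GNU `readlink -f` is available and that the `java` binary sits in a `bin` directory directly under the installation root. The helper name `java_home_from_binary` is hypothetical.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from a java binary path.
java_home_from_binary() {
    # Resolve symlinks such as /usr/bin/java -> /etc/alternatives/java -> real JRE.
    real_path=$(readlink -f "$1") || return 1
    # Strip the trailing /bin/java to get the installation root.
    dirname "$(dirname "$real_path")"
}

if command -v java >/dev/null 2>&1; then
    JAVA_HOME=$(java_home_from_binary "$(command -v java)")
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "java not found on PATH; install a JRE first" >&2
fi
```

If the printed path looks right, the `export` line can go into the same shell profile you use for the rest of your environment.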

Set up your config

One little quirk of the Hadoop command-line tools is that they need local config files, so you can't just give them the URL of your NameNode and connect. One exception to this rule is a Go-based library/client written by Colin Marc called (drumroll please...) HDFS.

In Cloudera, you can download the client configuration as a zip file through the Cloudera Manager UI.

Once you download this zip file, put its contents into a subfolder of /etc/hadoop as follows:

$ sudo unzip -d /etc/hadoop
  inflating: /etc/hadoop/hadoop-conf/hdfs-site.xml
  inflating: /etc/hadoop/hadoop-conf/core-site.xml
  inflating: /etc/hadoop/hadoop-conf/
  inflating: /etc/hadoop/hadoop-conf/
  inflating: /etc/hadoop/hadoop-conf/
  inflating: /etc/hadoop/hadoop-conf/ssl-client.xml
  inflating: /etc/hadoop/hadoop-conf/
$ sudo mv /etc/hadoop/hadoop-conf /etc/hadoop/conf.cloudera.HDFS

For the HDFS tools to use your configuration, the HADOOP_CONF_DIR environment variable needs to be set. This can simply be added to your favorite shell profile config:

export HADOOP_CONF_DIR="/etc/hadoop/conf.cloudera.HDFS"  
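Before going further, it can save some head-scratching to confirm the directory really contains the two files shown in the unzip output above. A minimal sketch; the function name `check_hadoop_conf` is invented for this example, and it only tests that `core-site.xml` and `hdfs-site.xml` are readable, not that their contents are valid.

```shell
# Hypothetical helper: verify that a Hadoop conf directory has the core files.
check_hadoop_conf() {
    conf_dir=$1
    missing=0
    for f in core-site.xml hdfs-site.xml; do
        if [ ! -r "$conf_dir/$f" ]; then
            echo "missing or unreadable: $conf_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Fall back to the path used in this post when HADOOP_CONF_DIR is unset.
if check_hadoop_conf "${HADOOP_CONF_DIR:-/etc/hadoop/conf.cloudera.HDFS}"; then
    echo "Hadoop config looks complete"
fi
```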

Name Resolution

Now that you have your configuration in the right place, make sure you can actually resolve the hostnames it uses. For a cluster like this one, whose hostnames live under the .consul domain, ensure that one of your Consul DNS servers is listed in /etc/resolv.conf before your externally resolving DNS server.

nameserver <-- this would be consul  
nameserver <-- this is your default DNS server  
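Since the order of the nameserver lines is what matters here, a small check of which one comes first can be handy. This is only a sketch: `first_nameserver` is a made-up helper name, and it assumes the usual resolv.conf layout where each entry is a line starting with the word `nameserver`.

```shell
# Hypothetical helper: print the address on the first "nameserver" line.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "$1"
}

if [ -r /etc/resolv.conf ]; then
    # This should print the address of your Consul DNS server.
    echo "first nameserver: $(first_nameserver /etc/resolv.conf)"
fi
```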

Try a ping:

$ ping cdh-master-0.node.myclouderacluster.consul
PING cdh-master-0.node.myclouderacluster.consul ( 56(84) bytes of data.  
64 bytes from cdh-master-0.node.myclouderacluster.consul ( icmp_seq=1 ttl=64 time=1.25 ms  
64 bytes from cdh-master-0.node.myclouderacluster.consul ( icmp_seq=2 ttl=64 time=0.899 ms  

Try the HDFS client

To make sure your configuration works, let's use the hdfs command to list our top-level directories:

$ hdfs dfs -ls /
Found 4 items  
drwxr-xr-x   - hbase  hbase               0 2016-05-03 23:23 /hbase  
drwxr-xr-x   - cf     stark               0 2016-05-03 16:16 /org  
drwxrwxrwx   - hdfs   supergroup          0 2016-05-24 00:12 /tmp  
drwxr-xr-x   - mapred supergroup          0 2016-05-06 00:07 /user  

NOTE: If something is wrong, you will either get errors, OR the command will quietly fall back to your local filesystem and list local directories instead of HDFS.

From here, you can simply read the help for the hdfs command. Most operations are pretty simple.
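To give a flavor of those operations, here is a small sketch that chains three of the standard `hdfs dfs` subcommands (`-mkdir -p`, `-put`, `-ls`). The wrapper name `hdfs_upload` and the paths in the usage comment are placeholders of mine, not anything Cloudera ships.

```shell
# Hypothetical helper: upload a local file into an HDFS directory and list it.
hdfs_upload() {
    local_file=$1
    remote_dir=$2
    # Create the target directory (and any parents) if it does not exist yet.
    hdfs dfs -mkdir -p "$remote_dir" &&
    # Copy the local file into HDFS.
    hdfs dfs -put "$local_file" "$remote_dir/" &&
    # Show the result.
    hdfs dfs -ls "$remote_dir"
}

# Example usage (placeholder paths):
#   hdfs_upload big-file.csv /tmp/demo
```

Running `hdfs dfs -help` prints the full list of subcommands.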


Mount HDFS with FUSE

For the next level, let's try mounting HDFS as a usable filesystem. To do this, first create a mountpoint:

$ sudo mkdir -p /hdfs

If you set up everything correctly for the hdfs command as above, you should be able to mount and use your HDFS filesystem like this:

$ sudo hadoop-fuse-dfs dfs://cdh-master-0.node.myclouderacluster.consul:8020 /hdfs
$ ls -Alh /hdfs
total 16K  
drwxr-xr-x 10     99 99 4.0K May  3 23:23 hbase  
drwxr-xr-x  3     99 99 4.0K May  3 16:16 org  
drwxrwxrwx  6 hdfs   99 4.0K May 24 00:12 tmp  
drwxr-xr-x 11 mapred 99 4.0K May  6 00:07 user  
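If you script the mount, it helps to check whether something is already mounted at /hdfs before mounting again. One way to sketch that is to look for the mountpoint in the second column of /proc/mounts. The helper name `is_mounted` is my own, and the optional second argument exists only so the function can be pointed at a different mounts file.

```shell
# Hypothetical helper: succeed if the given path appears as a mountpoint.
is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line in /proc/mounts is the mountpoint.
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

if is_mounted /hdfs; then
    echo "/hdfs is mounted"
else
    echo "/hdfs is not mounted" >&2
fi
```

To unmount again, `sudo umount /hdfs` should do the trick.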

A Note About Permissions (Security by Obscurity!)

By default, HDFS security is very liberal: it simply trusts whatever username the client reports. Even so, as you browse the tree structure, you may notice that you cannot get to certain files:

$ ls /hdfs/org/some/restricted/folder
ls: cannot open directory /hdfs/org/some/restricted/folder: Permission denied  

The fix? Create a user with the same name as the folder's owner:

$ sudo useradd -m theboss
$ sudo su - theboss
$ ls /hdfs/org/some/restricted/folder

This may or may not work for you. Typically, if real security for HDFS is desired, one would enable Kerberos for the environment.
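To find out which username to create, you can ask HDFS who owns the folder. A sketch, assuming the usual `hdfs dfs -ls` column layout in which the owner is the third field (as in the listing earlier in this post); `hdfs_owner` is a hypothetical helper name.

```shell
# Hypothetical helper: print the HDFS owner of a path.
hdfs_owner() {
    # -d lists the directory entry itself rather than its contents.
    # Entry lines have 8 fields; the owner is field 3.
    hdfs dfs -ls -d "$1" | awk 'NF >= 8 { print $3; exit }'
}

# Example usage:
#   hdfs_owner /org/some/restricted/folder
```

For the hdfs command (as opposed to the FUSE mount), setting the HADOOP_USER_NAME environment variable has a similar effect on clusters without Kerberos, which is exactly why this counts as security by obscurity.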

Does it actually work?

To ensure that this does work before handing it off to a customer to upload their gargantuan files, I'd suggest uploading a large-ish file and checking that the checksums before and after the upload match:

$ openssl dgst -sha256 big-file.csv
SHA256(big-file.csv)= 646a45f3caed89d7303ae9240c0c3e45e9188e55cf8e65bda8980daa9855be3e  
$ cp big-file.csv /hdfs
$ openssl dgst -sha256 /hdfs/big-file.csv
SHA256(/hdfs/big-file.csv)= 646a45f3caed89d7303ae9240c0c3e45e9188e55cf8e65bda8980daa9855be3e
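If you would rather not compare long hex strings by eye, the same check can be scripted. This sketch swaps `openssl dgst` for coreutils' `sha256sum`, which computes the same digest; the function name `same_sha256` is made up for the example.

```shell
# Hypothetical helper: succeed only if both files have the same SHA-256 digest.
same_sha256() {
    # Bail out early if either file cannot be read.
    [ -r "$1" ] && [ -r "$2" ] || return 2
    sum_a=$(sha256sum < "$1" | awk '{ print $1 }')
    sum_b=$(sha256sum < "$2" | awk '{ print $1 }')
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}

# Example usage:
#   same_sha256 big-file.csv /hdfs/big-file.csv && echo "upload verified"
```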

That's it. At this point, you can interact with HDFS as you would any other Linux filesystem.