For the last month or so, I've been working on a couple of projects that have required me to move files in and out of HDFS. It's pretty straightforward once you get the appropriate tools working, but it can be a bit counterintuitive to get started (at least it was when I was learning it). Here's how you get started:
Install your tools
In this tutorial, we are working with Cloudera 5.5.1, using an Ubuntu (Trusty Tahr) instance to connect to it. First, we need to add Cloudera's repo to apt:
$ wget http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
$ sudo dpkg -i cdh5-repository_1.0_all.deb
$ sudo apt-get update
Since I use both the hdfs command and FUSE, I just install FUSE, which installs both tools.
$ sudo apt-get install hadoop-hdfs-fuse
One prerequisite that apt fails to install is Java. If you try running the hdfs command, you'll get the following error:
Error: JAVA_HOME is not set and could not be found.
Let's put Java on there:
$ sudo apt-get install openjdk-7-jre
Set up your config
One little quirk about working with the Hadoop command-line tools is that you need to use local config files, so you can't just provide the URL of your NameNode and connect. One exception to this rule is a Go-based library/client written by Colin Marc called (drumroll please...) HDFS.
In Cloudera, you can download the client config as a zip file through the Cloudera Manager UI.
Once you download this zip file, put its contents into a subfolder of /etc/hadoop as follows:
$ sudo unzip hdfs-clientconfig.zip -d /etc/hadoop
Archive: hdfs-clientconfig.zip
  inflating: /etc/hadoop/hadoop-conf/hdfs-site.xml
  inflating: /etc/hadoop/hadoop-conf/core-site.xml
  inflating: /etc/hadoop/hadoop-conf/topology.map
  inflating: /etc/hadoop/hadoop-conf/topology.py
  inflating: /etc/hadoop/hadoop-conf/log4j.properties
  inflating: /etc/hadoop/hadoop-conf/ssl-client.xml
  inflating: /etc/hadoop/hadoop-conf/hadoop-env.sh
$ sudo mv /etc/hadoop/hadoop-conf /etc/hadoop/conf.cloudera.HDFS
For the HDFS tools to use your configuration, the HADOOP_CONF_DIR environment variable needs to point at that directory (/etc/hadoop/conf.cloudera.HDFS in this case). You can set it in your favorite shell profile config.
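As a minimal sketch, the profile line could look like the following. It assumes the config was unpacked and renamed to /etc/hadoop/conf.cloudera.HDFS as shown earlier; adjust the path if you put it somewhere else.

```shell
# Point the Hadoop command-line tools at the client config directory.
# The path is the one used in this tutorial; change it if yours differs.
export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.HDFS
```

Add that line to whichever profile file your shell reads (for example ~/.bashrc if you use bash), then open a new shell or source the file.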
Now that you have your configuration in the right place, make sure you can actually resolve the names it uses. For this to happen in Cloudera, ensure that one of your Consul DNS servers is listed in /etc/resolv.conf before your externally resolving DNS server.
nameserver 10.10.10.250 <-- this would be consul
nameserver 10.10.0.2 <-- this is your default DNS server
Try a ping:
$ ping cdh-master-0.node.myclouderacluster.consul
PING cdh-master-0.node.myclouderacluster.consul (10.10.10.70) 56(84) bytes of data.
64 bytes from cdh-master-0.node.myclouderacluster.consul (10.10.10.70): icmp_seq=1 ttl=64 time=1.25 ms
64 bytes from cdh-master-0.node.myclouderacluster.consul (10.10.10.70): icmp_seq=2 ttl=64 time=0.899 ms
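If you'd rather not ping hosts one at a time, here is a rough sketch that pulls hostnames out of the config and checks each one against the system resolver. The function names hdfs_hosts and check_hosts are my own, and the sketch only catches names that appear in hdfs:// URIs, so it will miss hosts listed as bare host:port values. On an HA cluster the hdfs:// name may be a logical nameservice rather than a real host, in which case a FAIL is expected.

```shell
# Print the unique hostnames found in hdfs:// URIs in the given files.
hdfs_hosts() {
  grep -ohE 'hdfs://[A-Za-z0-9.-]+' "$@" | sed 's|^hdfs://||' | sort -u
}

# Read hostnames on stdin and report whether each one resolves.
check_hosts() {
  while read -r host; do
    if getent hosts "$host" >/dev/null; then
      echo "ok   $host"
    else
      echo "FAIL $host"
    fi
  done
}

# Usage (path as set up in this tutorial):
#   hdfs_hosts /etc/hadoop/conf.cloudera.HDFS/*.xml | check_hosts
```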
Try the HDFS client
To make sure your configuration works, let's use the hdfs command to list our top-level directories:
$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x - hbase hbase 0 2016-05-03 23:23 /hbase
drwxr-xr-x - cf stark 0 2016-05-03 16:16 /org
drwxrwxrwx - hdfs supergroup 0 2016-05-24 00:12 /tmp
drwxr-xr-x - mapred supergroup 0 2016-05-06 00:07 /user
NOTE: If something is wrong, you will either get errors, OR the command will simply return the results of ls in your current working directory.
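One way to tell whether the client picked up your cluster config is to ask it which filesystem it treats as the default. `hdfs getconf -confKey fs.defaultFS` prints that value, and without a cluster config Hadoop falls back to the local filesystem (file:///). The helper below is a small sketch around that idea; the name check_default_fs is made up for this example.

```shell
# Classify a fs.defaultFS value: anything that is not an hdfs:// URI
# suggests the client config was not loaded.
check_default_fs() {
  case "$1" in
    hdfs://*) echo "using HDFS: $1" ;;
    *)        echo "not HDFS ($1): check HADOOP_CONF_DIR" ;;
  esac
}

# Usage on a machine with the hdfs command installed:
#   check_default_fs "$(hdfs getconf -confKey fs.defaultFS)"
```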
From here, you can simply read the help for the hdfs command. Most operations are pretty simple.
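To give a feel for those operations, here is a sketch of a round trip: create a directory, copy a file in, list it, and copy it back out. The function name hdfs_roundtrip and the example paths are hypothetical; the subcommands (-mkdir, -put, -ls, -get) are standard `hdfs dfs` operations.

```shell
# Copy a local file into an HDFS directory, list it, and fetch it back.
# Usage: hdfs_roundtrip <local-file> <hdfs-dir>
hdfs_roundtrip() {
  name="$(basename "$1")"
  hdfs dfs -mkdir -p "$2" &&          # create the target directory if needed
  hdfs dfs -put "$1" "$2/" &&         # local -> HDFS
  hdfs dfs -ls "$2" &&                # confirm the file is there
  hdfs dfs -get "$2/$name" "./$name.copy"   # HDFS -> local
}

# Example (hypothetical paths):
#   hdfs_roundtrip ./big-file.csv /tmp/demo
```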
For the next level, let's try mounting HDFS as a usable filesystem. To do this, first create a mountpoint:
$ sudo mkdir -p /hdfs
If you set up everything correctly for the hdfs command as above, you should be able to mount and use your HDFS filesystem like this:
$ sudo hadoop-fuse-dfs dfs://cdh-master-0.node.myclouderacluster.consul:8020 /hdfs
$ ls -Alh /hdfs
total 16K
drwxr-xr-x 10 99 99 4.0K May 3 23:23 hbase
drwxr-xr-x 3 99 99 4.0K May 3 16:16 org
drwxrwxrwx 6 hdfs 99 4.0K May 24 00:12 tmp
drwxr-xr-x 11 mapred 99 4.0K May 6 00:07 user
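If you script the mount, it helps to check whether something is already mounted first. The sketch below uses the standard `mountpoint` utility for that; the function names is_mounted and mount_hdfs are my own, and the hostname, port and mountpoint are simply the ones from this tutorial.

```shell
# Succeeds if the given directory is currently a mountpoint.
is_mounted() {
  mountpoint -q "$1"
}

# Mount HDFS on /hdfs unless something is already mounted there.
mount_hdfs() {
  if is_mounted /hdfs; then
    echo "/hdfs is already mounted"
  else
    sudo hadoop-fuse-dfs dfs://cdh-master-0.node.myclouderacluster.consul:8020 /hdfs
  fi
}

# To unmount again:
#   sudo umount /hdfs
```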
A Note About Permissions (Security by Obscurity!)
HDFS permissions, by default, are very liberal: unless Kerberos is enabled, HDFS simply trusts the username the client reports. As you browse the tree structure, you may still notice that you cannot get to certain files:
$ ls /hdfs/org/some/restricted/folder
ls: cannot open directory /hdfs/org/some/restricted/folder: Permission denied
The fix? Create a user with the same name as the folder's owner:
$ sudo useradd -m theboss
$ sudo su - theboss -l
$ ls /hdfs/org/some/restricted/folder
resumes salaries torrents
This may or may not work for you. Typically, if real security for HDFS is desired, one would enable Kerberos for the environment.
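The same point can be made with the hdfs command itself. On a cluster without Kerberos, the Hadoop client honors the HADOOP_USER_NAME environment variable, so no local account is needed at all. This is only a sketch: the wrapper name as_hdfs_user is made up, and it will not work once Kerberos is enabled.

```shell
# Run an hdfs command while reporting a different username to the cluster.
# Usage: as_hdfs_user <user> <hdfs arguments...>
# The subshell keeps HADOOP_USER_NAME from leaking into the calling shell.
as_hdfs_user() {
  (
    HADOOP_USER_NAME="$1"
    export HADOOP_USER_NAME
    shift
    hdfs "$@"
  )
}

# Example, using the folder from above:
#   as_hdfs_user theboss dfs -ls /org/some/restricted/folder
```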
Does it actually work?
To ensure that this does work before handing it off to a customer to upload their gargantuan files, I'd suggest uploading a large-ish file and checking that the checksums before and after the upload match:
$ openssl dgst -sha256 big-file.csv
SHA256(big-file.csv)= 646a45f3caed89d7303ae9240c0c3e45e9188e55cf8e65bda8980daa9855be3e
$ cp big-file.csv /hdfs
$ openssl dgst -sha256 /hdfs/big-file.csv
SHA256(big-file.csv)= 646a45f3caed89d7303ae9240c0c3e45e9188e55cf8e65bda8980daa9855be3e
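Comparing 64-character hashes by eye gets old quickly, so here is a sketch that does the comparison for you. It sticks with openssl as above; the function names sha256_of and same_checksum are hypothetical. The hash is read from stdin so the output does not depend on the file path, and an empty hash (for example if openssl is missing) counts as a failure.

```shell
# Print only the SHA-256 hex digest of a file.
sha256_of() {
  openssl dgst -sha256 < "$1" | awk '{print $NF}'
}

# Succeeds only if both files hash to the same non-empty digest.
same_checksum() {
  a="$(sha256_of "$1")"
  b="$(sha256_of "$2")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage:
#   same_checksum big-file.csv /hdfs/big-file.csv && echo "match" || echo "MISMATCH"
```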
That's it. At this point, you can interact with HDFS as you would any other Linux filesystem.