Hortonworks HDP: Using the command line to manage files on HDFS

This guide assumes a working HDP installation.

Using the command line to manage files on HDFS

Perform a directory listing

[boston@compute-0-0 ~]$ /usr/lib/hadoop/bin/hadoop fs -ls /
Found 5 items
drwxr-xr-x   - hdfs   hdfs          0 2014-09-09 17:11 /apps
drwxr-xr-x   - mapred hdfs          0 2014-09-09 17:06 /mapred
drwxr-xr-x   - hdfs   hdfs          0 2014-09-09 17:06 /mr-history
drwxrwxrwx   - hdfs   hdfs          0 2014-09-09 17:12 /tmp
drwxr-xr-x   - hdfs   hdfs          0 2014-09-09 17:11 /user
[boston@compute-0-0 ~]$ /usr/lib/hadoop/bin/hadoop fs -ls /user 
Found 4 items
drwxrwx---   - ambari-qa hdfs          0 2014-09-09 17:16 /user/ambari-qa
drwxr-xr-x   - hcat      hdfs          0 2014-09-09 17:11 /user/hcat
drwx------   - hive      hdfs          0 2014-09-09 17:07 /user/hive
drwxrwxr-x   - oozie     hdfs          0 2014-09-09 17:08 /user/oozie
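
To list recursively in a single command, Hadoop 2.x also accepts an -R flag on ls (a hedged aside: older releases spell this as the deprecated -lsr subcommand, so confirm which form your HDP build ships):

# Recursively list everything under /user (Hadoop 2.x -ls -R form)
/usr/lib/hadoop/bin/hadoop fs -ls -R /user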

Create a directory

NOTE: By default, standard users won't be able to create a directory under /user. Use the 'hdfs' superuser to perform a chmod.

[root@compute-0-0 ~]# /usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x
[root@compute-0-0 ~]# su -l hdfs 
[hdfs@compute-0-0 ~]$ /usr/lib/hadoop/bin/hadoop fs -chmod 777 /user
[hdfs@compute-0-0 ~]$
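
Opening /user with 777 works for a lab box, but a tighter alternative (a sketch using standard hadoop fs subcommands; adjust the user and group names to your site) is to have the hdfs superuser create and hand over a per-user home directory instead. The walkthrough below sticks with the chmod route.

# As the hdfs superuser: create the home directory and chown it
# to the end user, rather than making /user world-writable
/usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston
/usr/lib/hadoop/bin/hadoop fs -chown boston:hdfs /user/boston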

Now the directory can be created as a standard user:

[boston@compute-0-0 ~]$ /usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston
[boston@compute-0-0 ~]$

Upload a file to HDFS

Here we create a local file and 'put' it into HDFS:

[boston@compute-0-0 hadoop]$ echo "Sample Text" > filename.txt
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -put filename.txt /user/boston/
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/
Found 1 items
-rw-r--r--   3 boston hdfs         12 2014-09-10 11:49 /user/boston/filename.txt
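
To confirm the upload round-trips, print the file straight from HDFS or copy it back locally (both are standard hadoop fs subcommands):

# Print the file's contents directly from HDFS
/usr/lib/hadoop/bin/hadoop fs -cat /user/boston/filename.txt
# Or pull it back to the local filesystem under a new name
/usr/lib/hadoop/bin/hadoop fs -get /user/boston/filename.txt filename-copy.txt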

Upload multiple files:

[boston@compute-0-0 hadoop]$ touch multiplefile{1..10}.txt
[boston@compute-0-0 hadoop]$ ls
filename.txt        multiplefile1.txt  multiplefile3.txt  multiplefile5.txt  multiplefile7.txt  multiplefile9.txt
multiplefile10.txt  multiplefile2.txt  multiplefile4.txt  multiplefile6.txt  multiplefile8.txt
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -put filename.txt multiplefile* /user/boston/
put: `/user/boston/filename.txt': File exists
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/
Found 11 items
-rw-r--r--   3 boston hdfs         12 2014-09-10 11:49 /user/boston/filename.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile1.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile10.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile2.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile3.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile4.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile5.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile6.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile7.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile8.txt
-rw-r--r--   3 boston hdfs          0 2014-09-10 12:37 /user/boston/multiplefile9.txt
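
The 'File exists' error above is put refusing to overwrite a file already in HDFS. Two ways around it (hedged: the -f overwrite flag is documented for copyFromLocal on Hadoop 2.x, so confirm it on your release):

# Overwrite the existing HDFS copy in place (Hadoop 2.x)
/usr/lib/hadoop/bin/hadoop fs -copyFromLocal -f filename.txt /user/boston/
# Or remove the old copy first, then put as usual
/usr/lib/hadoop/bin/hadoop fs -rm /user/boston/filename.txt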

Check the disk usage on HDFS

[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -du /user/boston 
12  /user/boston/filename.txt
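
For a single rolled-up figure, du also takes summary and human-readable flags on Hadoop 2.x (older shells spell the summary form as the deprecated -dus):

# Total space used under /user/boston, as one human-readable number
/usr/lib/hadoop/bin/hadoop fs -du -s -h /user/boston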


Some Advanced Features

Use getmerge to concatenate files

This example takes the contents of all files in an HDFS directory and merges them into a single file on your local filesystem (the merged file is not created on HDFS):

[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston/mergetest
[boston@compute-0-0 hadoop]$ touch merge{1..5}.txt
[boston@compute-0-0 hadoop]$ echo content1 > merge1.txt 
[boston@compute-0-0 hadoop]$ echo content2 > merge2.txt 
[boston@compute-0-0 hadoop]$ echo content3 > merge3.txt 
[boston@compute-0-0 hadoop]$ echo content4 > merge4.txt 
[boston@compute-0-0 hadoop]$ echo content5 > merge5.txt 
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -put merge*.txt /user/boston/mergetest
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/mergetest
Found 5 items
-rw-r--r--   3 boston hdfs          9 2014-09-10 12:43 /user/boston/mergetest/merge1.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 12:43 /user/boston/mergetest/merge2.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 12:43 /user/boston/mergetest/merge3.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 12:43 /user/boston/mergetest/merge4.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 12:43 /user/boston/mergetest/merge5.txt
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -getmerge /user/boston/mergetest/ ./LocalMergeFile.txt
[boston@compute-0-0 hadoop]$ cat LocalMergeFile.txt 
content1
content2
content3
content4
content5
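
A quick sanity check on the merge: each mergeN.txt holds "contentN" plus a newline (the 9 bytes shown in the listing above), so the merged local file should come to 5 x 9 = 45 bytes:

# 5 source files x 9 bytes each = 45 bytes of merged output
wc -c LocalMergeFile.txt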

Use distcp for large inter- and intra-cluster copies

  • Copies files or directories recursively
  • A tool for large inter- and intra-cluster copying
  • Uses MapReduce to carry out its distributed copy, error handling and recovery, and reporting

[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston/mergecopy
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop distcp /user/boston/mergetest /user/boston/mergecopy
14/09/10 13:15:16 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[/user/boston/mergetest], targetPath=/user/boston/mergecopy}
14/09/10 13:15:16 INFO client.RMProxy: Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
14/09/10 13:15:17 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
14/09/10 13:15:17 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
14/09/10 13:15:19 INFO client.RMProxy: Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
14/09/10 13:15:21 INFO mapreduce.JobSubmitter: number of splits:5
14/09/10 13:15:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410279084309_0005
14/09/10 13:15:22 INFO impl.YarnClientImpl: Submitted application application_1410279084309_0005
14/09/10 13:15:22 INFO mapreduce.Job: The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0005/
14/09/10 13:15:22 INFO tools.DistCp: DistCp job-id: job_1410279084309_0005
14/09/10 13:15:22 INFO mapreduce.Job: Running job: job_1410279084309_0005
14/09/10 13:15:27 INFO mapreduce.Job: Job job_1410279084309_0005 running in uber mode : false
14/09/10 13:15:27 INFO mapreduce.Job:  map 0% reduce 0%
14/09/10 13:15:32 INFO mapreduce.Job:  map 20% reduce 0%
14/09/10 13:15:34 INFO mapreduce.Job:  map 40% reduce 0%
14/09/10 13:15:35 INFO mapreduce.Job:  map 60% reduce 0%
14/09/10 13:15:36 INFO mapreduce.Job:  map 100% reduce 0%
14/09/10 13:15:41 INFO mapreduce.Job: Job job_1410279084309_0005 completed successfully
14/09/10 13:15:41 INFO mapreduce.Job: Counters: 33
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=520330
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2585
		HDFS: Number of bytes written=45
		HDFS: Number of read operations=95
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=21
	Job Counters 
		Launched map tasks=5
		Other local map tasks=5
		Total time spent by all maps in occupied slots (ms)=19798
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=19798
		Total vcore-seconds taken by all map tasks=19798
		Total megabyte-seconds taken by all map tasks=20273152
	Map-Reduce Framework
		Map input records=5
		Map output records=0
		Input split bytes=580
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=101
		CPU time spent (ms)=4270
		Physical memory (bytes) snapshot=1087881216
		Virtual memory (bytes) snapshot=14741442560
		Total committed heap usage (bytes)=5211422720
	File Input Format Counters 
		Bytes Read=1960
	File Output Format Counters 
		Bytes Written=0
	org.apache.hadoop.tools.mapred.CopyMapper$Counter
		BYTESCOPIED=45
		BYTESEXPECTED=45
		COPY=5
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/mergecopy                       
Found 1 items
drwxr-xr-x   - boston hdfs          0 2014-09-10 13:15 /user/boston/mergecopy/mergetest
[boston@compute-0-0 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/mergecopy/mergetest
Found 5 items
-rw-r--r--   3 boston hdfs          9 2014-09-10 13:15 /user/boston/mergecopy/mergetest/merge1.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 13:15 /user/boston/mergecopy/mergetest/merge2.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 13:15 /user/boston/mergecopy/mergetest/merge3.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 13:15 /user/boston/mergecopy/mergetest/merge4.txt
-rw-r--r--   3 boston hdfs          9 2014-09-10 13:15 /user/boston/mergecopy/mergetest/merge5.txt
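
Note that distcp nested the source directory inside the target (mergecopy/mergetest). In DistCp v2 the -update and -overwrite options change this behavior: with either flag, the contents of the source directory are copied into the target rather than the directory itself. A hedged sketch (/user/boston/mergecopy2 is a hypothetical fresh target):

# With -update, merge1.txt..merge5.txt land directly under
# /user/boston/mergecopy2 instead of a nested mergetest/ directory
/usr/lib/hadoop/bin/hadoop distcp -update /user/boston/mergetest /user/boston/mergecopy2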