Hortonworks HDP: Exploring Data with Apache Pig from the Grunt shell

Overview

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language, Pig Latin, for expressing data analysis programs, coupled with the infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.


In this tutorial, you will learn how to:


  1. Load a data file into HDFS.
  2. Use FILTER and FOREACH, with examples.
  3. Store values into HDFS.
  4. Use the Grunt shell's file commands.

Create a data file

Create a text file named "movies.txt" in your local file system and upload it to HDFS. Each row holds five comma-separated fields: id, name, year, rating and duration.

[boston@compute-0-13 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston/pig-grunt
[boston@compute-0-13 hadoop]$ vi movies.txt
[boston@compute-0-13 hadoop]$ cat movies.txt 
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
[boston@compute-0-13 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -put movies.txt /user/boston/pig-grunt/
[boston@compute-0-13 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/pig-grunt/            
Found 1 items
-rw-r--r--   3 boston hdfs        347 2014-09-10 15:27 /user/boston/pig-grunt/movies.txt
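
If you prefer not to type the file in an editor, the same rows can be written non-interactively and checked before the upload. The sketch below is not part of the original session: it assumes a POSIX shell with awk, writes movies.txt into the current directory, and counts rows that do not have exactly five comma-separated fields, which is what the five-column LOAD schema used in this tutorial expects.

```shell
# Write the ten sample rows to movies.txt in the current directory.
cat > movies.txt <<'EOF'
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
EOF

# Total number of rows in the file.
total_rows=$(awk 'END { print NR }' movies.txt)

# Number of rows that do NOT have exactly five comma-separated fields.
bad_rows=$(awk -F',' 'NF != 5 { n++ } END { print n + 0 }' movies.txt)

echo "rows: $total_rows, malformed rows: $bad_rows"
```

A non-zero malformed count would usually point to a stray comma inside a movie name, which PigStorage(',') would treat as a field separator.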

Load up the grunt shell

[boston@compute-0-13 hadoop]$ pig
2014-09-10 15:29:53,043 [main] INFO  org.apache.pig.Main - Apache Pig version 0.12.1.2.1.5.0-695 (rexported) compiled Aug 27 2014, 23:56:19
2014-09-10 15:29:53,043 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/boston/hadoop/pig_1410359393041.log
2014-09-10 15:29:53,060 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/boston/.pigbootup not found
2014-09-10 15:29:53,331 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-09-10 15:29:53,331 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:29:53,331 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://compute-0-13.local:8020
2014-09-10 15:29:53,870 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>

Load content into Pig/Grunt

You can load the file with or without declaring field types. Both forms are shown below: Movies is loaded without types, MoviesTYPE with them.

grunt> Movies = LOAD '/user/boston/pig-grunt/movies.txt' USING PigStorage(',') as (id,name,year,rating,duration);
grunt> MoviesTYPE = LOAD '/user/boston/pig-grunt/movies.txt' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:float, duration:int);

Dump contents of variables

grunt> DUMP Movies 
2014-09-10 15:33:16,015 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-09-10 15:33:16,042 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 15:33:16,194 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 15:33:16,226 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 15:33:16,226 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 15:33:16,293 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:33:16,415 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 15:33:16,419 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-09-10 15:33:16,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 15:33:16,419 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-09-10 15:33:16,420 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job1442802139177367906.jar
2014-09-10 15:33:18,976 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job1442802139177367906.jar created
2014-09-10 15:33:18,977 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-09-10 15:33:18,999 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 15:33:19,028 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 15:33:19,028 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-09-10 15:33:19,032 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:33:19,045 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:33:20,399 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:33:20,399 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 15:33:20,414 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 15:33:21,801 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 15:33:22,128 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:33:22,551 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0007
2014-09-10 15:33:22,682 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0007
2014-09-10 15:33:22,707 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0007/
2014-09-10 15:33:22,707 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0007
2014-09-10 15:33:22,707 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies
2014-09-10 15:33:22,707 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9] C:  R: 
2014-09-10 15:33:22,736 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 15:33:32,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 15:33:37,984 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-09-10 15:33:38,023 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 15:33:38,024 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.4.0.2.1.5.0-695	0.12.1.2.1.5.0-695	boston	2014-09-10 15:33:16	2014-09-10 15:33:38	UNKNOWN

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1410279084309_0007	1	0	2	2	2	2	n/a	n/a	n/a	n/a	Movies	MAP_ONLY	hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp1356916456,

Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"

Output(s):
Successfully stored 10 records (437 bytes) in: "hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp1356916456"

Counters:
Total records written : 10
Total bytes written : 437
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1410279084309_0007


2014-09-10 15:33:38,083 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-09-10 15:33:38,084 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:33:38,097 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:33:38,098 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)

Check the description of variables

Note the difference between the two variables: fields loaded without types default to bytearray, while fields loaded with types keep the types declared in the LOAD statement.

grunt> describe Movies
Movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}
grunt> describe MoviesTYPE
MoviesTYPE: {id: int,name: chararray,year: int,rating: float,duration: int}
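
A typed schema only helps if the text in each column can actually be read as the declared type. As a rough local check, assuming a POSIX shell with awk and a local copy of movies.txt (recreated here so the block is self-contained), the sketch below counts rows whose id, year or duration is not a plain integer, or whose rating is not a decimal number. The regular expressions are an illustrative approximation chosen for this sample, not Pig's own casting rules.

```shell
# Recreate the sample file so this check can run on its own.
cat > movies.txt <<'EOF'
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
EOF

# Fields: 1=id (int), 2=name (text), 3=year (int), 4=rating (float), 5=duration (int).
# A row is counted once if any numeric column fails its pattern.
mismatches=$(awk -F',' '
  $1 !~ /^[0-9]+$/            { bad++; next }
  $3 !~ /^[0-9]+$/            { bad++; next }
  $4 !~ /^[0-9]+(\.[0-9]+)?$/ { bad++; next }
  $5 !~ /^[0-9]+$/            { bad++; next }
  END { print bad + 0 }
' movies.txt)

echo "rows that do not match the typed schema: $mismatches"
```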

Filter Data

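In the example below we restrict the list to movies with a rating greater than 3.5. The comparison is strict, so movies rated exactly 3.5 are excluded. Because Movies was loaded without types, rating is a bytearray and Pig casts it for the comparison, which is why the IMPLICIT_CAST_TO_DOUBLE warning appears.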

grunt> movies_greater_than_three_point_five = FILTER Movies BY rating>3.5; 
2014-09-10 15:55:19,211 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump movies_greater_than_three_point_five 
2014-09-10 15:55:35,294 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 15:55:35,294 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2014-09-10 15:55:35,296 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 15:55:35,304 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 15:55:35,305 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 15:55:35,306 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 15:55:35,321 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:55:35,323 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 15:55:35,323 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 15:55:35,324 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job7077885832872681742.jar
2014-09-10 15:55:37,678 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job7077885832872681742.jar created
2014-09-10 15:55:37,689 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 15:55:37,706 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 15:55:37,708 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:55:37,713 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:55:38,870 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:55:38,870 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 15:55:38,873 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 15:55:40,197 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 15:55:40,797 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0008
2014-09-10 15:55:40,811 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0008
2014-09-10 15:55:40,814 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0008/
2014-09-10 15:55:40,814 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0008
2014-09-10 15:55:40,814 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies,movies_greater_than_three_point_five
2014-09-10 15:55:40,814 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9],movies_greater_than_three_point_five[3,39] C:  R: 
2014-09-10 15:55:40,834 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 15:55:50,746 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 15:55:55,987 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 15:55:55,987 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.4.0.2.1.5.0-695	0.12.1.2.1.5.0-695	boston	2014-09-10 15:55:35	2014-09-10 15:55:55	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1410279084309_0008	1	0	2	2	2	2	n/a	n/a	n/a	n/a	Movies,movies_greater_than_three_point_five	MAP_ONLY	hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp302895483,

Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"

Output(s):
Successfully stored 2 records (101 bytes) in: "hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp302895483"

Counters:
Total records written : 2
Total bytes written : 101
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1410279084309_0008


2014-09-10 15:55:56,041 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-09-10 15:55:56,041 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:55:56,042 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2014-09-10 15:55:56,046 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:55:56,046 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(6,One Magic Christmas,1985,3.8,5333)
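
The filter result can be cross-checked locally against the original text file. The following is a minimal sketch, assuming a POSIX shell with awk and a local copy of movies.txt (recreated here so the block is self-contained). awk is only a stand-in for illustration and is not how Pig evaluates the expression. It prints the rows matched by a strict "> 3.5" comparison, as in the FILTER statement above, and also counts the rows an inclusive ">= 3.5" comparison would keep.

```shell
# Recreate the sample file so this check can run on its own.
cat > movies.txt <<'EOF'
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
EOF

# Field 4 is the rating. Strict comparison, mirroring FILTER Movies BY rating>3.5.
strict_rows=$(awk -F',' '$4 > 3.5' movies.txt)
strict_count=$(awk -F',' '$4 > 3.5 { n++ } END { print n + 0 }' movies.txt)

# Inclusive comparison, which also keeps the movies rated exactly 3.5.
inclusive_count=$(awk -F',' '$4 >= 3.5 { n++ } END { print n + 0 }' movies.txt)

printf '%s\n' "$strict_rows"
echo "strict: $strict_count, inclusive: $inclusive_count"
```

On the sample data the strict comparison keeps the rows with ids 1 and 6, matching the two records Pig reported, while the inclusive comparison would keep five rows.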

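We can take the data stored in the variable above and apply further operations to it. Here FOREACH ... GENERATE selects three of its fields. Note that the output columns follow the order given in GENERATE (year, rating, name), not the order in the file.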

grunt> foreachexample= foreach movies_greater_than_three_point_five generate year,rating,name;
2014-09-10 16:02:27,299 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump foreachexample
2014-09-10 16:02:35,345 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 16:02:35,345 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2014-09-10 16:02:35,346 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 16:02:35,350 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for Movies: $0, $4
2014-09-10 16:02:35,358 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 16:02:35,359 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 16:02:35,359 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 16:02:35,374 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:02:35,375 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 16:02:35,376 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 16:02:35,376 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job3720292969611424078.jar
2014-09-10 16:02:37,725 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job3720292969611424078.jar created
2014-09-10 16:02:37,734 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 16:02:37,737 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-09-10 16:02:37,737 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2014-09-10 16:02:37,737 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-09-10 16:02:37,750 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 16:02:37,751 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:02:37,756 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 16:02:38,880 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 16:02:38,881 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 16:02:38,883 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 16:02:40,033 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 16:02:40,267 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0009
2014-09-10 16:02:40,280 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0009
2014-09-10 16:02:40,283 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0009/
2014-09-10 16:02:40,283 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0009
2014-09-10 16:02:40,283 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies,foreachexample,movies_greater_than_three_point_five
2014-09-10 16:02:40,284 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9],movies_greater_than_three_point_five[3,39],foreachexample[4,16] C:  R: 
2014-09-10 16:02:40,303 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 16:02:50,206 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 16:02:55,445 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 16:02:55,445 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.4.0.2.1.5.0-695	0.12.1.2.1.5.0-695	boston	2014-09-10 16:02:35	2014-09-10 16:02:55	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1410279084309_0009	1	0	2	2	2	2	n/a	n/a	n/a	n/a	Movies,foreachexample,movies_greater_than_three_point_five	MAP_ONLY	hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp-1681081927,

Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"

Output(s):
Successfully stored 2 records (83 bytes) in: "hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp-1681081927"

Counters:
Total records written : 2
Total bytes written : 83
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1410279084309_0009


2014-09-10 16:02:55,496 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-09-10 16:02:55,496 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 16:02:55,497 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-09-10 16:02:55,501 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 16:02:55,501 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1993,3.9,The Nightmare Before Christmas)
(1985,3.8,One Magic Christmas)
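
FOREACH ... GENERATE emits fields in the order they are listed, not the order in which they were loaded. As a rough local analogue (an illustration with awk, assuming a POSIX shell and a local copy of movies.txt recreated below, and not Pig's implementation), the sketch keeps the rows rated above 3.5 and prints year, rating and name in that order.

```shell
# Recreate the sample file so this example can run on its own.
cat > movies.txt <<'EOF'
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
EOF

# Fields: 1=id, 2=name, 3=year, 4=rating, 5=duration.
# Keep rows rated above 3.5, then print year, rating, name in that order.
projected=$(awk -F',' -v OFS=',' '$4 > 3.5 { print $3, $4, $2 }' movies.txt)

printf '%s\n' "$projected"
```

The id and duration fields never appear in the output, which corresponds to the "Columns pruned for Movies: $0, $4" line in the Pig log above.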

Store variable values into HDFS

We can store the values of variables in files in HDFS for later use.

grunt> STORE movies_greater_than_three_point_five INTO  '/user/boston/pig-grunt/movies_greater_than_three_point_five' USING PigStorage (',');
2014-09-10 16:07:39,741 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 16:07:39,762 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 16:07:39,771 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2014-09-10 16:07:39,771 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 16:07:39,772 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-09-10 16:07:39,775 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 16:07:39,776 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 16:07:39,776 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 16:07:39,789 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:07:39,791 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 16:07:39,792 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 16:07:39,792 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job1592945539577817105.jar
2014-09-10 16:07:42,106 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job1592945539577817105.jar created
2014-09-10 16:07:42,111 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 16:07:42,117 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 16:07:42,119 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:07:42,124 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 16:07:43,260 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 16:07:43,260 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 16:07:43,262 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 16:07:44,540 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 16:07:45,173 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0010
2014-09-10 16:07:45,186 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0010
2014-09-10 16:07:45,188 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0010/
2014-09-10 16:07:45,188 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0010
2014-09-10 16:07:45,188 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies,movies_greater_than_three_point_five
2014-09-10 16:07:45,188 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9],movies_greater_than_three_point_five[3,39] C:  R: 
2014-09-10 16:07:45,208 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 16:07:54,565 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 16:08:00,313 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 16:08:00,313 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.4.0.2.1.5.0-695	0.12.1.2.1.5.0-695	boston	2014-09-10 16:07:39	2014-09-10 16:08:00	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1410279084309_0010	1	0	2	2	2	2	n/a	n/a	n/a	n/a	Movies,movies_greater_than_three_point_five	MAP_ONLY	/user/boston/pig-grunt/movies_greater_than_three_point_five,

Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"

Output(s):
Successfully stored 2 records (83 bytes) in: "/user/boston/pig-grunt/movies_greater_than_three_point_five"

Counters:
Total records written : 2
Total bytes written : 83
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1410279084309_0010


2014-09-10 16:08:00,363 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

The command above creates a directory and stores the output in files in that directory: a part file holding the records and an empty _SUCCESS marker. We can use the Grunt shell for standard file access to HDFS.

grunt> ls /user/boston/pig-grunt/
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies.txt<r 3>	347
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies_greater_than_three_point_five	<dir>
grunt> ls /user/boston/pig-grunt/movies_greater_than_three_point_five                                                                        
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies_greater_than_three_point_five/_SUCCESS<r 3>	0
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies_greater_than_three_point_five/part-m-00000<r 3>	83
grunt> cat /user/boston/pig-grunt/movies_greater_than_three_point_five/part-m-00000
1,The Nightmare Before Christmas,1993,3.9,4568
6,One Magic Christmas,1985,3.8,5333

File Commands

Pig's Grunt shell has commands that can run on HDFS as well as on the local file system. The paths below use /user/hadoop as an example; substitute your own HDFS directory.

    grunt> cat /user/hadoop/movies.txt
    grunt> ls /user/hadoop/
    grunt> cd /user/
    grunt> ls
    grunt> cd /user/hadoop
    grunt> ls
    grunt> copyToLocal /user/hadoop/movies.txt /home/
    grunt> pwd