Hortonworks HDP: Exploring Data with Apache Pig from the Grunt shell
Overview
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language named 'Pig Latin' for expressing data analysis programs, coupled with the infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
In this tutorial, you will learn how to:
- Load a data file into HDFS.
- Use 'FILTER' and 'FOREACH', with examples.
- Store values in HDFS.
- Use the Grunt shell's file commands.
Create a data file
Create a text file named "movies.txt" in your local file system and upload it to HDFS:
[boston@compute-0-13 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -mkdir /user/boston/pig-grunt
[boston@compute-0-13 hadoop]$ vi movies.txt
[boston@compute-0-13 hadoop]$ cat movies.txt
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
[boston@compute-0-13 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -put movies.txt /user/boston/pig-grunt/
[boston@compute-0-13 hadoop]$ /usr/lib/hadoop/bin/hadoop fs -ls /user/boston/pig-grunt/
Found 1 items
-rw-r--r-- 3 boston hdfs 347 2014-09-10 15:27 /user/boston/pig-grunt/movies.txt
Load up the Grunt shell
[boston@compute-0-13 hadoop]$ pig
2014-09-10 15:29:53,043 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.1.2.1.5.0-695 (rexported) compiled Aug 27 2014, 23:56:19
2014-09-10 15:29:53,043 [main] INFO org.apache.pig.Main - Logging error messages to: /home/boston/hadoop/pig_1410359393041.log
2014-09-10 15:29:53,060 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/boston/.pigbootup not found
2014-09-10 15:29:53,331 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-09-10 15:29:53,331 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:29:53,331 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://compute-0-13.local:8020
2014-09-10 15:29:53,870 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
Load content into Pig/Grunt
You can load data with or without declared types; examples of both follow.
grunt> Movies = LOAD '/user/boston/pig-grunt/movies.txt' USING PigStorage(',') as (id,name,year,rating,duration);
grunt> MoviesTYPE = LOAD '/user/boston/pig-grunt/movies.txt' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:float, duration:int);
Dump contents of variables
grunt> DUMP Movies
2014-09-10 15:33:16,015 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-09-10 15:33:16,042 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 15:33:16,194 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 15:33:16,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 15:33:16,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 15:33:16,293 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:33:16,415 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 15:33:16,419 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-09-10 15:33:16,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 15:33:16,419 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-09-10 15:33:16,420 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job1442802139177367906.jar
2014-09-10 15:33:18,976 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job1442802139177367906.jar created
2014-09-10 15:33:18,977 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-09-10 15:33:18,999 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 15:33:19,028 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 15:33:19,028 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-09-10 15:33:19,032 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:33:19,045 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:33:20,399 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:33:20,399 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 15:33:20,414 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 15:33:21,801 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 15:33:22,128 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:33:22,551 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0007
2014-09-10 15:33:22,682 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0007
2014-09-10 15:33:22,707 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0007/
2014-09-10 15:33:22,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0007
2014-09-10 15:33:22,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies
2014-09-10 15:33:22,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9] C: R:
2014-09-10 15:33:22,736 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 15:33:32,680 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 15:33:37,984 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-09-10 15:33:38,023 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 15:33:38,024 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0.2.1.5.0-695 0.12.1.2.1.5.0-695 boston 2014-09-10 15:33:16 2014-09-10 15:33:38 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1410279084309_0007 1 0 2 2 2 2 n/a n/a n/a n/a Movies MAP_ONLY hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp1356916456,
Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"
Output(s):
Successfully stored 10 records (437 bytes) in: "hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp1356916456"
Counters:
Total records written : 10
Total bytes written : 437
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1410279084309_0007
2014-09-10 15:33:38,083 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-09-10 15:33:38,084 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:33:38,097 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:33:38,098 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
Check the description of variables
Note the difference between the two variables: fields loaded without declared types default to bytearray, while MoviesTYPE keeps the types specified in its LOAD statement.
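As a rough mental model of that difference, here is a plain-Python sketch. It is not how Pig is implemented, and the helper names load_untyped and load_typed are hypothetical; it only illustrates "every field stays raw bytes" versus "every field is cast to its declared type".

```python
# Plain-Python mental model only; Pig does not work this way internally.
# load_untyped / load_typed are hypothetical helper names.

LINE = b"1,The Nightmare Before Christmas,1993,3.9,4568"


def load_untyped(line):
    """No declared types: every field stays raw bytes (like Pig's bytearray)."""
    return tuple(line.split(b","))


def load_typed(line):
    """Declared types: each field is cast as the row is read."""
    id_, name, year, rating, duration = line.decode("utf-8").split(",")
    return (int(id_), name, int(year), float(rating), int(duration))


untyped = load_untyped(LINE)
typed = load_typed(LINE)

# Show the Python type of each field for both variants.
print([type(field).__name__ for field in untyped])
print([type(field).__name__ for field in typed])
```

The first list contains only bytes, the second mixes int, str, and float, which mirrors the two schemas that describe reports.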
grunt> describe Movies
Movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}
grunt> describe MoviesTYPE
MoviesTYPE: {id: int,name: chararray,year: int,rating: float,duration: int}
Filter Data
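In the example below we restrict the list to movies with a rating greater than 3.5 (the > comparison is strict, so movies rated exactly 3.5 are excluded). Because Movies was loaded without types, Pig casts rating to double for the comparison, which is what the IMPLICIT_CAST_TO_DOUBLE warning reports.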
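To see which rows the comparison selects, the same ten rows can be filtered in plain Python. This is only a sketch of the predicate, not of how Pig executes it; the names MOVIES_TXT, rows, strict, and inclusive are made up for the illustration.

```python
# Plain-Python sketch of the filter predicate on the tutorial's sample data.
# Variable names here are illustrative and not part of Pig.

MOVIES_TXT = """\
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""

# Untyped rows: every field is still text, as in the untyped Movies relation.
rows = [tuple(line.split(",")) for line in MOVIES_TXT.splitlines()]

# rating is field index 3; it must be cast to a number before comparing.
strict = [row for row in rows if float(row[3]) > 3.5]      # rating > 3.5
inclusive = [row for row in rows if float(row[3]) >= 3.5]  # rating >= 3.5

print(len(strict), [row[0] for row in strict])
print(len(inclusive), [row[0] for row in inclusive])
```

The strict comparison keeps only the movies with ids 1 and 6; using >= would also keep the three movies rated exactly 3.5.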
grunt> movies_greater_than_three_point_five = FILTER Movies BY rating>3.5;
2014-09-10 15:55:19,211 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump movies_greater_than_three_point_five
2014-09-10 15:55:35,294 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 15:55:35,294 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2014-09-10 15:55:35,296 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 15:55:35,304 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 15:55:35,305 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 15:55:35,306 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 15:55:35,321 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:55:35,323 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 15:55:35,323 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 15:55:35,324 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job7077885832872681742.jar
2014-09-10 15:55:37,678 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job7077885832872681742.jar created
2014-09-10 15:55:37,689 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 15:55:37,706 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 15:55:37,708 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 15:55:37,713 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:55:38,870 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:55:38,870 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 15:55:38,873 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 15:55:40,197 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 15:55:40,797 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0008
2014-09-10 15:55:40,811 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0008
2014-09-10 15:55:40,814 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0008/
2014-09-10 15:55:40,814 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0008
2014-09-10 15:55:40,814 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies,movies_greater_than_three_point_five
2014-09-10 15:55:40,814 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9],movies_greater_than_three_point_five[3,39] C: R:
2014-09-10 15:55:40,834 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 15:55:50,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 15:55:55,987 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 15:55:55,987 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0.2.1.5.0-695 0.12.1.2.1.5.0-695 boston 2014-09-10 15:55:35 2014-09-10 15:55:55 FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1410279084309_0008 1 0 2 2 2 2 n/a n/a n/a n/a Movies,movies_greater_than_three_point_five MAP_ONLY hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp302895483,
Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"
Output(s):
Successfully stored 2 records (101 bytes) in: "hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp302895483"
Counters:
Total records written : 2
Total bytes written : 101
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1410279084309_0008
2014-09-10 15:55:56,041 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-09-10 15:55:56,041 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 15:55:56,042 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2014-09-10 15:55:56,046 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 15:55:56,046 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
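(6,One Magic Christmas,1985,3.8,5333)
We can take the relation stored in the variable above and apply further operations to it. Here FOREACH ... GENERATE selects a subset of the columns; note that the output columns appear in the order listed after GENERATE, not in the order of the original file.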
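To make the column ordering concrete, here is a plain-Python sketch of a projection over the two filtered rows. The helper generate and the FIELDS tuple are hypothetical names for this illustration; Pig's own implementation is different.

```python
# Plain-Python sketch of projecting and reordering columns.
# FIELDS and generate() are hypothetical names, not Pig APIs.

FIELDS = ("id", "name", "year", "rating", "duration")

filtered = [
    ("1", "The Nightmare Before Christmas", "1993", "3.9", "4568"),
    ("6", "One Magic Christmas", "1985", "3.8", "5333"),
]


def generate(row, *names):
    """Return the named fields of row, in the order the names are given."""
    return tuple(row[FIELDS.index(name)] for name in names)


projected = [generate(row, "year", "rating", "name") for row in filtered]

for row in projected:
    print(row)
```

Each output tuple has three fields in the requested order (year, rating, name), and id and duration are dropped.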
grunt> foreachexample= foreach movies_greater_than_three_point_five generate year,rating,name;
2014-09-10 16:02:27,299 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump foreachexample
2014-09-10 16:02:35,345 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 16:02:35,345 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2014-09-10 16:02:35,346 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 16:02:35,350 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for Movies: $0, $4
2014-09-10 16:02:35,358 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 16:02:35,359 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 16:02:35,359 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 16:02:35,374 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:02:35,375 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 16:02:35,376 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 16:02:35,376 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job3720292969611424078.jar
2014-09-10 16:02:37,725 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job3720292969611424078.jar created
2014-09-10 16:02:37,734 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 16:02:37,737 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-09-10 16:02:37,737 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2014-09-10 16:02:37,737 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-09-10 16:02:37,750 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 16:02:37,751 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:02:37,756 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 16:02:38,880 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 16:02:38,881 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 16:02:38,883 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 16:02:40,033 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 16:02:40,267 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0009
2014-09-10 16:02:40,280 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0009
2014-09-10 16:02:40,283 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0009/
2014-09-10 16:02:40,283 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0009
2014-09-10 16:02:40,283 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies,foreachexample,movies_greater_than_three_point_five
2014-09-10 16:02:40,284 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9],movies_greater_than_three_point_five[3,39],foreachexample[4,16] C: R:
2014-09-10 16:02:40,303 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 16:02:50,206 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 16:02:55,445 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 16:02:55,445 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0.2.1.5.0-695 0.12.1.2.1.5.0-695 boston 2014-09-10 16:02:35 2014-09-10 16:02:55 FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1410279084309_0009 1 0 2 2 2 2 n/a n/a n/a n/a Movies,foreachexample,movies_greater_than_three_point_five MAP_ONLY hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp-1681081927,
Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"
Output(s):
Successfully stored 2 records (83 bytes) in: "hdfs://compute-0-13.local:8020/tmp/temp951764946/tmp-1681081927"
Counters:
Total records written : 2
Total bytes written : 83
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1410279084309_0009
2014-09-10 16:02:55,496 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-09-10 16:02:55,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 16:02:55,497 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-09-10 16:02:55,501 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 16:02:55,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1993,3.9,The Nightmare Before Christmas)
(1985,3.8,One Magic Christmas)
Store variable values into HDFS
We can store the contents of a variable in HDFS files for later use.
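As a rough picture of what a comma-delimited store leaves behind, the following plain-Python sketch writes the two filtered rows to a local temporary directory. It only mimics the layout seen later in this tutorial (a directory holding a part-m-00000 file and an empty _SUCCESS marker); the store function is hypothetical and the real files are written by Hadoop, not by code like this.

```python
# Plain-Python sketch of the output layout of a comma-delimited store.
# store() is a hypothetical helper; it writes to a local temp directory,
# not to HDFS.
import os
import tempfile

ROWS = [
    ("1", "The Nightmare Before Christmas", "1993", "3.9", "4568"),
    ("6", "One Magic Christmas", "1985", "3.8", "5333"),
]


def store(rows, out_dir, delimiter=","):
    """Write rows as delimited lines into out_dir/part-m-00000."""
    os.makedirs(out_dir)
    part = os.path.join(out_dir, "part-m-00000")
    with open(part, "w", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            handle.write(delimiter.join(row) + "\n")
    # Empty marker file, mirroring the _SUCCESS file in the listing.
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()
    return part


out_dir = os.path.join(tempfile.mkdtemp(), "movies_greater_than_three_point_five")
part = store(ROWS, out_dir)

print(sorted(os.listdir(out_dir)))
print(os.path.getsize(part))
```

The part file comes to 83 bytes, which agrees with the "Successfully stored 2 records (83 bytes)" line in the job output.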
grunt> STORE movies_greater_than_three_point_five INTO '/user/boston/pig-grunt/movies_greater_than_three_point_five' USING PigStorage (',');
2014-09-10 16:07:39,741 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 16:07:39,762 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-09-10 16:07:39,771 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2014-09-10 16:07:39,771 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-09-10 16:07:39,772 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-09-10 16:07:39,775 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-10 16:07:39,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-10 16:07:39,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-10 16:07:39,789 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:07:39,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-10 16:07:39,792 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-10 16:07:39,792 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job1592945539577817105.jar
2014-09-10 16:07:42,106 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job1592945539577817105.jar created
2014-09-10 16:07:42,111 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-10 16:07:42,117 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-10 16:07:42,119 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at compute-0-14.local/10.1.255.238:8050
2014-09-10 16:07:42,124 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 16:07:43,260 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-09-10 16:07:43,260 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-09-10 16:07:43,262 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-09-10 16:07:44,540 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-09-10 16:07:45,173 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1410279084309_0010
2014-09-10 16:07:45,186 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1410279084309_0010
2014-09-10 16:07:45,188 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://compute-0-14.local:8088/proxy/application_1410279084309_0010/
2014-09-10 16:07:45,188 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1410279084309_0010
2014-09-10 16:07:45,188 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases Movies,movies_greater_than_three_point_five
2014-09-10 16:07:45,188 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Movies[1,9],movies_greater_than_three_point_five[3,39] C: R:
2014-09-10 16:07:45,208 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-10 16:07:54,565 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-09-10 16:08:00,313 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-10 16:08:00,313 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0.2.1.5.0-695 0.12.1.2.1.5.0-695 boston 2014-09-10 16:07:39 2014-09-10 16:08:00 FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1410279084309_0010 1 0 2 2 2 2 n/a n/a n/a n/a Movies,movies_greater_than_three_point_five MAP_ONLY /user/boston/pig-grunt/movies_greater_than_three_point_five,
Input(s):
Successfully read 10 records (729 bytes) from: "/user/boston/pig-grunt/movies.txt"
Output(s):
Successfully stored 2 records (83 bytes) in: "/user/boston/pig-grunt/movies_greater_than_three_point_five"
Counters:
Total records written : 2
Total bytes written : 83
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1410279084309_0010
2014-09-10 16:08:00,363 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
The command above creates a directory and stores the output in files in that directory. We can use the Grunt shell for standard file access to HDFS:
grunt> ls /user/boston/pig-grunt/
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies.txt<r 3> 347
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies_greater_than_three_point_five <dir>
grunt> ls /user/boston/pig-grunt/movies_greater_than_three_point_five
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies_greater_than_three_point_five/_SUCCESS<r 3> 0
hdfs://compute-0-13.local:8020/user/boston/pig-grunt/movies_greater_than_three_point_five/part-m-00000<r 3> 83
grunt> cat /user/boston/pig-grunt/movies_greater_than_three_point_five/part-m-00000
1,The Nightmare Before Christmas,1993,3.9,4568
6,One Magic Christmas,1985,3.8,5333
File Commands
Pig's Grunt shell has commands that can run on HDFS as well as on the local file system.
grunt> cat /user/hadoop/movies.txt
grunt> ls /user/hadoop/
grunt> cd /user/
grunt> ls
grunt> cd /user/hadoop
grunt> ls
grunt> copyToLocal /user/hadoop/movies.txt /home/
grunt> pwd