-cmdenv and GP_VIEWDIR options
William Cohen committed Jul 16, 2015
1 parent 3c8e6ae commit d1cfc77
Showing 3 changed files with 30 additions and 12 deletions.
19 changes: 16 additions & 3 deletions TODO.txt
@@ -66,14 +66,27 @@ Default output format [None]: json
It seems the access key is for interactions with AWS, while the keypair will be for
interactions with the clusters you are going to create.

+4.5) create a security group:
+
+aws ec2 create-security-group --group-name MySecurityGroup --description "My security group"
+aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0

5) now use emr subcommand [docs: http://docs.aws.amazon.com/cli/latest/reference/emr/index.html] to
build and access the cluster

-% aws emr create-cluster --ami-version 3.8.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --ec2-attributes KeyName=MyKeyPair
+% aws emr create-cluster --ami-version 3.8.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --ec2-attributes KeyName=MyKeyPair --log-uri s3n://wcohen-gpig-log

{
"ClusterId": "j-1LF855E531Y16"
}
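The ClusterId in the JSON response above can be captured programmatically, e.g. to feed into later `aws emr` subcommands. A minimal sketch using Python's json module (the id shown is the one from the response above):

```python
import json

# Example response from `aws emr create-cluster`, as shown above.
response = '{"ClusterId": "j-1LF855E531Y16"}'

# Parse the JSON and pull out the cluster id.
cluster_id = json.loads(response)["ClusterId"]
print(cluster_id)  # -> j-1LF855E531Y16
```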
-- to add: logging output, since the example below fails

+should add a name: --name "foo"
+should add a bootstrap action script: --bootstrap-action Path="s3://wcohen-.../foo.sh" to pull in gpig and notify me with
+mkdir gpig
+cd gpig
+wget http://www.cs.cmu.edu/~wcohen/10-605/gpigtut.tgz
+tar -xzf gpigtut.tgz
+echo the cluster is ready now | mail -s test [email protected]

wait a bit then:

@@ -85,7 +98,7 @@ build and access the cluster
% unpack the tutorial...
%
% export GP_STREAMJAR=/home/hadoop/contrib/streaming/hadoop-streaming.jar
-% hadoop jar hadoop-examples.jar pi 10 10000000 #somehow this was needed to set up hdfs:/user/hadoop
+% hadoop jar ~/hadoop-examples.jar pi 10 10000000 #somehow this was needed to set up hdfs:/user/hadoop
% hadoop fs -mkdir /user/hadoop/gp_views
% python param-wordcount.py --opts target:hadoop,viewdir:/user/hadoop/gp_views,echo:1 --params corpus:s3%3A//wcohen-gpig-input/corpus.txt --store wc
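The `%3A` in the `--params` value above is URL-escaping for the colon in `s3://`, since `--params key:value,...` uses `:` and `,` as separators. A sketch of the escaping round-trip (using Python 3's `urllib.parse` for illustration, though guineapig.py itself is Python 2):

```python
from urllib.parse import quote, unquote

# A param value containing a colon must be escaped before being
# passed on the command line as "--params key:value,...".
raw = "s3://wcohen-gpig-input/corpus.txt"
escaped = quote(raw, safe="/")  # ':' becomes '%3A', '/' is kept
print(escaped)   # -> s3%3A//wcohen-gpig-input/corpus.txt
assert unquote(escaped) == raw  # unescaping recovers the original
```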

18 changes: 12 additions & 6 deletions guineapig.py
@@ -31,21 +31,23 @@ class GPig(object):
    #command-line option, and these are the default values
    #The location of the streaming jar is a special case,
    #in that it's also settable via an environment variable.
-    defaultJar = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-107.jar'
+    defaultJar = '/home/hadoop/contrib/streaming/hadoop-streaming.jar'
    envjar = os.environ.get('GP_STREAMJAR', defaultJar)
+    defaultViewDir = 'gpig_views'
+    envViewDir = os.environ.get('GP_VIEWDIR',defaultViewDir)
    DEFAULT_OPTS = {'streamJar': envjar,
                    'parallel':5,
                    'target':'shell',
                    'echo':0,
-                    'viewdir':'gpig_views',
+                    'viewdir': envViewDir,
                    }
    #These are the types of each option that has a non-string value
    DEFAULT_OPT_TYPES = {'parallel':int,'echo':int}
    #We need to pass non-default options in to mappers and reducers,
    #but since the remote worker's environment can be different from
    #the environment of this script, we also need to pass in options
    #computed from the environment
-    COMPUTED_OPTION_DEFAULTS = {'streamJar':defaultJar}
+    COMPUTED_OPTION_DEFAULTS = {'streamJar':defaultJar, 'viewdir':defaultViewDir}

    @staticmethod
    def getCompiler(target):
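The pattern in the hunk above is an environment variable overriding a hard-coded default, via `os.environ.get`. A minimal standalone sketch of the same idea, assuming only standard `os.environ` semantics:

```python
import os

# Mirror of the GP_VIEWDIR handling above: the environment
# variable, if set, overrides the hard-coded default view dir.
defaultViewDir = 'gpig_views'

def view_dir():
    return os.environ.get('GP_VIEWDIR', defaultViewDir)

print(view_dir())  # gpig_views, assuming GP_VIEWDIR is unset
os.environ['GP_VIEWDIR'] = '/user/hadoop/gp_views'
print(view_dir())  # -> /user/hadoop/gp_views
```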
@@ -1114,6 +1116,7 @@ def simpleMapCommands(self,task,gp,mapCom,src,dst):
    def simpleMapReduceCommands(self,task,gp,mapCom,reduceCom,src,dst):
        hcom = self.HadoopCommandBuf(gp,task)
        hcom.extendDef('-D','mapred.reduce.tasks=%d' % gp.opts['parallel'])
+        hcom.extend('-cmdenv','PYTHONPATH=.')
        hcom.extend('-input',src,'-output',dst)
        hcom.extend("-mapper '%s'" % mapCom)
        hcom.extend("-reducer '%s'" % reduceCom)
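The `-cmdenv PYTHONPATH=.` line added above puts the current directory on the Python path inside the streaming mappers and reducers. A hypothetical sketch of how such an invocation gets assembled into one command string; `StreamingCommand` below is an illustrative stand-in, not HadoopCommandBuf's real interface:

```python
# Illustrative stand-in for HadoopCommandBuf: collect the pieces
# of a hadoop streaming invocation, then join them into a command.
# (In real streaming, generic -D options must precede the others,
# which is why the original code has a separate extendDef method.)
class StreamingCommand:
    def __init__(self, stream_jar):
        self.parts = ['hadoop jar %s' % stream_jar]
    def extend(self, *args):
        self.parts.extend(args)
    def command(self):
        return ' '.join(self.parts)

hcom = StreamingCommand('/home/hadoop/contrib/streaming/hadoop-streaming.jar')
hcom.extend('-D', 'mapred.reduce.tasks=%d' % 5)
hcom.extend('-cmdenv', 'PYTHONPATH=.')   # the option this commit adds
hcom.extend('-input', 'in_dir', '-output', 'out_dir')
hcom.extend("-mapper 'python mapper.py'")
print(hcom.command())
```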
@@ -1519,11 +1522,14 @@ def runMain(self,argv):
        print ''
        print 'OPTIONS are specified as "--opts key:value,...", where legal keys for "opts", with default values, are:'
        for (key,val) in GPig.DEFAULT_OPTS.items():
-            print '  %s:%s' % (key,str(val))
-        print 'Values in the "opts" key/value pairs are assumed to be URL-escaped. (Note: %3A escapes a colon.)'
+            print '  %s:\t%s' % (key,str(val))
+        print 'The environment variables GP_STREAMJAR and GP_VIEWDIR, if defined, set two of these default values.'
+        print 'Options affect Guinea Pig\'s default behavior.'
        print ''
        print 'PARAMS are specified as "--params key:value,..." and the associated dictionary is accessible to'
-        print 'user programs via the function GPig.getArgvParams().'
+        print 'user programs via the function GPig.getArgvParams(). Params are used as program-specific inputs.'
+        print ''
+        print 'Values in the "opts" and "params" key/value pairs are assumed to be URL-escaped. (Note: %3A escapes a colon.)'
        print ''
        print 'There\'s more help at http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig'
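A hypothetical sketch of how the `--opts key:value,...` strings described in the help text above can be parsed into a dictionary. This is not guineapig.py's actual parsing code, and the URL-unescaping uses Python 3's `urllib.parse` for illustration:

```python
from urllib.parse import unquote

def parse_kv_list(s):
    """Parse 'key:value,key:value,...' where values are URL-escaped,
    so %3A can stand in for a literal colon inside a value."""
    d = {}
    for pair in s.split(','):
        key, _, val = pair.partition(':')  # split on the first ':'
        d[key] = unquote(val)
    return d

opts = parse_kv_list('target:hadoop,viewdir:/user/hadoop/gp_views,echo:1')
params = parse_kv_list('corpus:s3%3A//wcohen-gpig-input/corpus.txt')
print(opts['viewdir'])   # -> /user/hadoop/gp_views
print(params['corpus'])  # -> s3://wcohen-gpig-input/corpus.txt
```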

5 changes: 2 additions & 3 deletions tutorial/Makefile
@@ -1,13 +1,12 @@
update:
-	echo updates no longer needed
+	cp ../guineapig.py .

clean:
	rm -rf gpig_views
	rm -f total.gp
	rm *.pyc

-tar:
-	cp ../guineapig.py .
+tar: update
	echo created on `date` > marker.txt
	tar -cvzf tutorial.tgz marker.txt guineapig.py *corpus.txt id-parks.txt *.py phirl-naive.pig

