docs
William Cohen committed Jul 20, 2015
1 parent 9e720f3 commit aa42697
Showing 2 changed files with 17 additions and 138 deletions.
153 changes: 16 additions & 137 deletions TODO.txt
@@ -1,137 +1,38 @@
TODO - priorities

for 1.3
- python wordprob.py --store prob seems to fail on a fresh tutorial
- clean up null src issue - do I need it?
- test hadoop
- safer eval - test and document, strip out reprInverse (see the sketch below)
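
One possible direction for the safer-eval item, sketched below under the assumption that serialized rows are plain Python literals; safe_row_eval is a hypothetical name, not the current implementation:

  # Hypothetical sketch of a safer row deserializer, assuming rows are
  # serialized with repr() and contain only Python literals.
  import ast

  def safe_row_eval(s, restrictedBindings=None):
      """Parse a serialized row without handing the string to a bare eval()."""
      try:
          # literal_eval accepts strings, numbers, tuples, lists, dicts, booleans, None
          return ast.literal_eval(s)
      except (ValueError, SyntaxError):
          # fall back to eval() with no builtins and only whitelisted bindings
          return eval(s, {'__builtins__': {}}, restrictedBindings or {})

  if __name__ == "__main__":
      print(safe_row_eval("('word', 3)"))               # ('word', 3)
      print(safe_row_eval("{'token': 'foo', 'n': 2}"))  # {'token': 'foo', 'n': 2}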

FUNCTIONALITY

- add --dictSeps =, instead of the default :, (also for s3)
- safer eval
- a GPig.registerImport('foo.py') - ok, that's just ship? and a GPig.registerCompiler('key',factoryClass) - what is registerCompiler?
- option(storedIn=FILE) - so you can retrieve and store work on s3
- add Reuse(FILE) view
- add Concat(view1,...,viewK) - see the sketch after this list
- add Stream(view1, through='shell command', shipping=[f1,..,fk])
- add StreamingMapReduce(view1, mapper='shell command', reducer='shell command', combiner='shell command', shipping=[f1,..,fk])
- add user-defined Reuse(FILE) ? (why do I want this again?)
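
To make the intent of the proposed Concat and Stream views concrete, here is a small standalone sketch of their semantics; these are not existing Guinea Pig APIs, and the names concat_rows/stream_rows are hypothetical:

  # Standalone sketch of the intended semantics of the proposed views.
  import itertools, subprocess

  def concat_rows(*row_iterables):
      """Concat(view1,...,viewK): every row of every input view, in order."""
      return itertools.chain(*row_iterables)

  def stream_rows(rows, through):
      """Stream(view, through='shell command'): pipe serialized rows through
      a shell command and yield the command's output lines as new rows."""
      proc = subprocess.Popen(through, shell=True, text=True,
                              stdin=subprocess.PIPE, stdout=subprocess.PIPE)
      out, _ = proc.communicate('\n'.join(map(str, rows)) + '\n')
      for line in out.splitlines():
          yield line

  if __name__ == "__main__":
      print(list(concat_rows([('a', 1)], [('b', 2)])))               # [('a', 1), ('b', 2)]
      print(list(stream_rows(['hello world', 'foo bar'], 'wc -l')))  # e.g. ['2'] (whitespace may vary)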

- gpextras, for debugging:
-- Log
-- ReadBlocks
-- Wrap?
-- PPrint?
-- Describe?
-- Illustrate?

- efficiency
-- combiners: add a combiner as a combiningTo=.. option of Group (see the sketch after this list)
-- compression: -jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
-- hadoop options (parallel, hopts, ...)
-- compiler for marime.py map-reducer with ramdisks (note: diskutil erasevolume HFS+ 'RAMDisk' `hdiutil attach -nomount ram://10315776`; size is in 2048-byte blocks)
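
A sketch of what the combiningTo=.. option could mean operationally, written as plain Python under the assumption that a combiner just applies the Group's reducing function to each mapper's local output before the shuffle; combine_locally is a hypothetical name:

  # Hypothetical illustration of a combiner: pre-aggregate (key,value) pairs on
  # the map side with the same reducing function used after the shuffle.
  from collections import defaultdict

  def combine_locally(pairs, reducingTo, init):
      partial = defaultdict(lambda: init)
      for k, v in pairs:
          partial[k] = reducingTo(partial[k], v)
      return sorted(partial.items())

  if __name__ == "__main__":
      mapper_output = [('the', 1), ('cat', 1), ('the', 1)]
      # word count: only ('cat', 1) and ('the', 2) cross the network,
      # instead of three separate pairs
      print(combine_locally(mapper_output, lambda acc, x: acc + x, 0))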

- cleanup
-- standardize view.by argument
-- clean up .gpmo and other tmp files? could do this via analysis at the AbstractMapReduceTask list level
-- log created views so you can continue with --reuse `cat foo.log|grep ^created|cut -f2`
-- maybe add --config logging:warn,...

DOCS:
- howto for EC2 EMR
- some longer examples for the tutorial (phirl-naive?)
- document planner.ship, planner.setReprInverseFun, planner.setSerializer

NOTES - EC2

1) Install AWS CLI - see http://docs.aws.amazon.com/cli/latest/userguide/installing.html
% curl https://s3.amazonaws.com/aws-cli/awscli-bundle.zip > awscli-bundle.zip
% unzip awscli-bundle.zip
% ./awscli-bundle/install -i `pwd`/install
% export PATH=$PATH:/Users/wcohen/Documents/code/aws-cli/install/bin/
2) check install with
% aws --version
3) get an access key with https://console.aws.amazon.com/iam/home?#security_credential
and save in environment vars

% aws configure
AWS Access Key ID [None]: ...
AWS Secret Access Key [None]: ...
Default region name [None]: us-east-1
Default output format [None]: json

4) create a keypair: http://docs.aws.amazon.com/cli/latest/userguide/cli-ec2-keypairs.html

% aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem

it seems like the access key is for interactions with AWS itself, while the keypair will be for ssh interactions with
the clusters you are going to create.

4.5) create a security group:

aws ec2 create-security-group --group-name MySecurityGroup --description "My security group"
aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0

5) now use the emr subcommand [docs: http://docs.aws.amazon.com/cli/latest/reference/emr/index.html] to
build and access the cluster

% aws emr create-cluster --ami-version 3.8.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --ec2-attributes KeyName=MyKeyPair --log-uri s3n://wcohen-gpig-log --bootstrap-action Path="s3n://wcohen-gpig-input/emr-bootstrap.sh"

{
"ClusterId": "j-1LF855E531Y16"
}

hint: https://s3.amazonaws.com/bucket-name/path-to-file accesses a file

added tutorial/emr-bootstrap.sh as a bootstrap action; might modify it to run only on the master. I think I can
replace the "echo running on master node" with an s3n:// script... but that script should have a #!/bin/sh header.
--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","echo running on master node"]


should add a name: --name "foo"


% aws emr put --cluster-id j-1LF855E531Y16 --key-pair-file MyKeyPair.pem --src path/to/tutorial.tgz
% aws emr ssh --cluster-id j-1LF855E531Y16 --key-pair-file MyKeyPair.pem

you're now logged in, so run:

% unpack the tutorial... TODO: add the hadoop startup and mkdir to emr-bootstrap
%
% export GP_STREAMJAR=/home/hadoop/contrib/streaming/hadoop-streaming.jar
% hadoop jar ~/hadoop-examples.jar pi 10 10000000 #somehow this was needed to set up hdfs:/user/hadoop
% hadoop fs -mkdir /user/hadoop/gp_views
% python param-wordcount.py --opts target:hadoop,viewdir:/user/hadoop/gp_views,echo:1 --params corpus:s3%3A//wcohen-gpig-input/corpus.txt --store wc

-- this is where things fail now....
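
For reference, a parameterized wordcount along these lines might look roughly like the sketch below; this is a reconstruction from the tutorial-style Guinea Pig API (Planner, ReadLines, Flatten, Group, GPig.getArgvParams), not necessarily the actual param-wordcount.py:

  # Rough sketch of a parameterized wordcount, run with
  #   python param-wordcount.py --params corpus:... --store wc
  from guineapig import *
  import sys

  def tokens(line):
      for tok in line.split():
          yield tok.lower()

  class WordCount(Planner):
      D = GPig.getArgvParams()          # picks up --params corpus:...
      wc = ReadLines(D['corpus']) | Flatten(by=tokens) \
           | Group(by=lambda x: x, reducingTo=ReduceToCount())

  if __name__ == "__main__":
      WordCount().main(sys.argv)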

--------------------

this is obsolete?

1 follow: /Users/wcohen/Documents/code/elastic-mapreduce-cli
installed on eddy in /Users/wcohen/Documents/code/elastic-mapreduce-cli, keypair=wcohen
buckets: wcohen-gpig-input, wcohen-gpig-views
helpful: https://aws.amazon.com/articles/Elastic-MapReduce/3938

after installation:
$ ./elastic-mapreduce --create --alive --name "Testing streaming -- wcohen" --num-instances 5 --instance-type c1.medium
Created job flow j-1F8U85HWYBRBT
$ ./elastic-mapreduce --jobflow j-1F8U85HWYBRBT --put gpigtut.tgz
$ ./elastic-mapreduce --jobflow j-1F8U85HWYBRBT --ssh
# then I could copy data in from s3 by just reading it with a ReadLines view...

$ ./elastic-mapreduce --set-termination-protection false
$ ./elastic-mapreduce --terminate

# copying: see https://wiki.apache.org/hadoop/AmazonS3
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/

logs are by default on master in /mnt/var/log/ - see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-view-web-log-files.html

- document planner.ship, planner.setEvaluator

TODO - MAJOR

- test on EC2
- a GPig.registerCompiler('key',factoryClass), for adding new targets other than hadoop?
- compiler for marime.py map-reducer with ramdisks (note: diskutil erasevolume HFS+ 'RAMDisk' `hdiutil attach -nomount ram://10315776`; size is in 2048-byte blocks)

- multithreading ideas

@@ -156,32 +57,10 @@ TODO - MAJOR
K subprocesses, Ri, to run '... | RED > out/shard.k' -- or could use threads (subprocesses are more modular)
K threads to print from the shardBuffer to the Ri (see the sketch below)
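
A standalone sketch of this idea with hypothetical names (the reducer command RED here is a stand-in, since the real command is elided above): K reducer subprocesses each writing out/shard.k, fed by one thread per subprocess from a shared shard buffer.

  # Hypothetical sketch: K reducer subprocesses R_k writing out/shard.k,
  # each fed by a thread that drains its queue in shardBuffer.
  import os, queue, subprocess, threading

  K = 4
  RED = "sort | uniq -c"                  # stand-in reducer command
  os.makedirs("out", exist_ok=True)
  shardBuffer = [queue.Queue() for _ in range(K)]

  def feed(k):
      """Thread body: print buffered rows for shard k into subprocess R_k."""
      proc = subprocess.Popen("%s > out/shard.%d" % (RED, k), shell=True,
                              stdin=subprocess.PIPE, text=True)
      while True:
          row = shardBuffer[k].get()
          if row is None:                 # sentinel: shard is finished
              break
          proc.stdin.write(row + "\n")
      proc.stdin.close()
      proc.wait()

  threads = [threading.Thread(target=feed, args=(k,)) for k in range(K)]
  for t in threads: t.start()

  # "map" side: route each row to a shard by hashing its key
  for row in ["the cat", "the dog", "a cat"]:
      shardBuffer[hash(row.split()[0]) % K].put(row)
  for q in shardBuffer: q.put(None)
  for t in threads: t.join()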

- benchmark hadoop stuff
- working: time python phirl-naive.py --opts viewdir:/user/wcohen/gpig_views,target:hadoop --store flook | tee tmp.log (real 13m44.946s)
- benchmark vs PIG: pig took 8min, launched 14 jobs; guineapig took 13:45, launched 27 jobs.
- issues:
- problem: hadoop processes don't seem to know where the user home dir is; the workaround is a rooted path,
but maybe that's ok (TODO: warn if target=hadoop and the path is relative)
- can't run a program in a subdirectory, eg python demo/phirl-naive.py ... TODO: look for guineapig.py on the pythonpath,
or maybe figure out how the -file option works better (see the sketch after this list)
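
One way the "look for guineapig.py on pythonpath" idea could work, sketched with standard-library calls only (an assumption, not current behavior): locate the module file via the import machinery, then hand that path to the streaming job's -file option.

  # Hypothetical sketch: find guineapig.py so a script run from a subdirectory
  # can still ship it to Hadoop streaming with -file.
  import importlib.util

  def find_guineapig():
      spec = importlib.util.find_spec("guineapig")
      if spec is None or spec.origin is None:
          raise ImportError("guineapig.py not found on sys.path/PYTHONPATH")
      return spec.origin                  # absolute path to guineapig.py

  if __name__ == "__main__":
      print(find_guineapig())             # e.g. pass as: hadoop jar ... -file <this path>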

- COMBINERS: add combiner as combiningTo=.. option of Group.

- DESCRIBE(...) - could be just a pretty_print?

- ILLUSTRATE(view,[outputs]) - using the definition of view, select the
inputs from the inner views that produce those outputs; then do that
recursively to get a test case (see the toy sketch below).
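
A toy sketch of the ILLUSTRATE idea under simplifying assumptions (each view is a dict recording its inner views and rows, and a contributes(inner_row, output_row) predicate stands in for real provenance information; all names are hypothetical):

  # Toy sketch of ILLUSTRATE(view, outputs): walk the view graph backwards,
  # keeping only the input rows that contribute to the requested outputs.
  def illustrate(view, outputs, contributes):
      if not view.get('inner'):
          return {view['name']: [r for r in view['rows'] if r in outputs]}
      test_case = {}
      for inner in view['inner']:
          wanted = [r for r in inner['rows']
                    if any(contributes(r, o) for o in outputs)]
          # recurse: the selected inner rows become the outputs to explain
          test_case.update(illustrate(inner, wanted, contributes))
      return test_case

  if __name__ == "__main__":
      corpus = {'name': 'corpus', 'inner': [], 'rows': ['the cat', 'a dog']}
      words = {'name': 'words', 'inner': [corpus],
               'rows': ['the', 'cat', 'a', 'dog']}
      print(illustrate(words, ['cat'], lambda line, w: w in line.split()))
      # -> {'corpus': ['the cat']}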

TODO - SMALL

- add ReuseView(FILE) and option(location=FILE) so output views can be stored anywhere (eg s3 or s3n)
- log created views so you can continue with --reuse `cat foo.log|grep ^created|cut -f2`
- make safer version of 'eval'

- add --hopts to pass in to hadoop?
- maybe add --config logging:warn,...
- clean .gpmo outputs?
- find guineapig.py on sys.path
- jobconf mapred.job.name="...."
- compression: -jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
2 changes: 1 addition & 1 deletion guineapig.py
@@ -103,7 +103,7 @@ def onlyRowOf(view):
     @staticmethod
 class SafeEvaluator(object):
-    """Evaluates expressions that correzpond to serialized guinea pig rows."""
+    """Evaluates expressions that correspond to serialized guinea pig rows."""
     def __init__(self,restrictedBindings={}):
         self.restrictedBindings = restrictedBindings
     def eval(self,s):
