Commit 6d0eff1

Update retry docs with SGE failures (as opposed to workflow step
failures)
1 parent 525a53d commit 6d0eff1

1 file changed

+96
-109
lines changed

seqware-distribution/docs/site/content/docs/6-pipeline/user-configuration.md

@@ -64,112 +64,99 @@ After restarting Oozie, Oozie will use the listed error codes in combination wit
6464
be retried in case of a specific error. For example, in the above, jobs that return with an SGE error code of SGE137 will automatically be retried 30 or
6565
OOZIE_RETRY_MAX times, whichever is higher. The actual error codes will likely be dependent on your site.
6666

67-
## Pegasus Workflow Engine Configuration
68-
69-
The SeqWare Pipeline project can (currently) use two workflow engines: 1) the Pegasus/Condor/Globus/SGE engine or 2) the Oozie/Hadoop engine. Each requires a bit of additional information to make them work (and, obviously, the underlying cluster tools correctly installed and configured). For the Pegasus engine you need a few extra files, referenced by the SW_PEGASUS_CONFIG_DIR parameter above:
70-
71-
### sites.xml3
72-
73-
<!-- see http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx -->
74-
75-
<pre><code>#!xml
76-
&lt;sitecatalog xmlns=&quot;http://pegasus.isi.edu/schema/sitecatalog&quot; xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot; xsi:schemaLocation=&quot;http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-3.0.xsd&quot; version=&quot;3.0&quot;&gt;
77-
&lt;site handle=&quot;local&quot; arch=&quot;x86_64&quot; os=&quot;LINUX&quot; osrelease=&quot;&quot; osversion=&quot;&quot; glibc=&quot;&quot;&gt;
78-
&lt;grid type=&quot;gt5&quot; contact=&quot;seqwarevm/jobmanager-fork&quot; scheduler=&quot;Fork&quot; jobtype=&quot;auxillary&quot;/&gt;
79-
&lt;grid type=&quot;gt5&quot; contact=&quot;seqwarevm/jobmanager-sge&quot; scheduler=&quot;SGE&quot; jobtype=&quot;compute&quot;/&gt;
80-
&lt;head-fs&gt;
81-
&lt;scratch&gt;
82-
&lt;shared&gt;
83-
&lt;file-server protocol=&quot;gsiftp&quot; url=&quot;gsiftp://seqwarevm&quot; mount-point=&quot;/home/seqware/SeqWare/pegasus-working&quot;/&gt;
84-
&lt;internal-mount-point mount-point=&quot;/home/seqware/SeqWare/pegasus-working&quot;/&gt;
85-
&lt;/shared&gt;
86-
&lt;/scratch&gt;
87-
&lt;storage&gt;
88-
&lt;shared&gt;
89-
&lt;file-server protocol=&quot;gsiftp&quot; url=&quot;gsiftp://seqwarevm&quot; mount-point=&quot;/&quot;/&gt;
90-
&lt;internal-mount-point mount-point=&quot;/&quot;/&gt;
91-
&lt;/shared&gt;
92-
&lt;/storage&gt;
93-
&lt;/head-fs&gt;
94-
&lt;replica-catalog type=&quot;LRC&quot; url=&quot;rlsn://smarty.isi.edu&quot;/&gt;
95-
&lt;profile namespace=&quot;env&quot; key=&quot;GLOBUS_LOCATION&quot;&gt;/usr&lt;/profile&gt;
96-
&lt;profile namespace=&quot;env&quot; key=&quot;JAVA_HOME&quot;&gt;/usr/java/default&lt;/profile&gt;
97-
&lt;!--profile namespace=&quot;env&quot; key=&quot;LD_LIBRARY_PATH&quot;&gt;/.mounts/labs/seqware/public/globus/default/lib&lt;/profile--&gt;
98-
&lt;profile namespace=&quot;env&quot; key=&quot;PEGASUS_HOME&quot;&gt;/opt/pegasus/3.0&lt;/profile&gt;
99-
&lt;/site&gt;
100-
&lt;site handle=&quot;seqwarevm&quot; arch=&quot;x86_64&quot; os=&quot;LINUX&quot; osrelease=&quot;&quot; osversion=&quot;&quot; glibc=&quot;&quot;&gt;
101-
&lt;grid type=&quot;gt5&quot; contact=&quot;seqwarevm/jobmanager-fork&quot; scheduler=&quot;Fork&quot; jobtype=&quot;auxillary&quot;/&gt;
102-
&lt;grid type=&quot;gt5&quot; contact=&quot;seqwarevm/jobmanager-sge&quot; scheduler=&quot;SGE&quot; jobtype=&quot;compute&quot;/&gt;
103-
&lt;head-fs&gt;
104-
&lt;scratch&gt;
105-
&lt;shared&gt;
106-
&lt;file-server protocol=&quot;gsiftp&quot; url=&quot;gsiftp://seqwarevm&quot; mount-point=&quot;/home/seqware/SeqWare/pegasus-working&quot;/&gt;
107-
&lt;internal-mount-point mount-point=&quot;/home/seqware/SeqWare/pegasus-working&quot;/&gt;
108-
&lt;/shared&gt;
109-
&lt;/scratch&gt;
110-
&lt;storage&gt;
111-
&lt;shared&gt;
112-
&lt;file-server protocol=&quot;gsiftp&quot; url=&quot;gsiftp://seqwarevm&quot; mount-point=&quot;/&quot;/&gt;
113-
&lt;internal-mount-point mount-point=&quot;/&quot;/&gt;
114-
&lt;/shared&gt;
115-
&lt;/storage&gt;
116-
&lt;/head-fs&gt;
117-
&lt;replica-catalog type=&quot;LRC&quot; url=&quot;rlsn://smarty.isi.edu&quot;/&gt;
118-
&lt;profile namespace=&quot;env&quot; key=&quot;GLOBUS_LOCATION&quot;&gt;/usr&lt;/profile&gt;
119-
&lt;profile namespace=&quot;env&quot; key=&quot;JAVA_HOME&quot;&gt;/usr/java/default&lt;/profile&gt;
120-
&lt;!--profile namespace=&quot;env&quot; key=&quot;LD_LIBRARY_PATH&quot;&gt;/.mounts/labs/seqware/public/globus/default/lib&lt;/profile--&gt;
121-
&lt;profile namespace=&quot;env&quot; key=&quot;PEGASUS_HOME&quot;&gt;/opt/pegasus/3.0&lt;/profile&gt;
122-
&lt;/site&gt;
123-
&lt;/sitecatalog&gt;
124-
</code></pre>
125-
126-
This file is from Pegasus and the handle="clustername" is how you tell SeqWare which cluster to submit to. The setup of cluster resources in the sites.xml3 file is beyond the scope of SeqWare so we refer you to the [Pegasus documentation](http://pegasus.isi.edu/).
127-
128-
### properties
129-
130-
<pre><code>#!ini
131-
##########################
132-
# PEGASUS USER PROPERTIES
133-
##########################
134-
135-
## SELECT THE REPLICA CATALOG MODE AND URL
136-
pegasus.catalog.replica = SimpleFile
137-
pegasus.catalog.replica.file = /home/seqware/.seqware/pegasus/rc.data
138-
139-
## SELECT THE SITE CATALOG MODE AND FILE
140-
pegasus.catalog.site = XML3
141-
pegasus.catalog.site.file = /home/seqware/.seqware/pegasus/sites.xml3
142-
143-
144-
## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
145-
pegasus.catalog.transformation = File
146-
pegasus.catalog.transformation.file = /home/seqware/.seqware/pegasus/tc.data
147-
148-
## USE DAGMAN RETRY FEATURE FOR FAILURES
149-
dagman.retry=1
150-
151-
## STAGE ALL OUR EXECUTABLES OR USE INSTALLED ONES
152-
pegasus.catalog.transformation.mapper = All
153-
154-
## CHECK JOB EXIT CODES FOR FAILURE
155-
pegasus.exitcode.scope=all
156-
157-
## OPTIMIZE DATA & EXECUTABLE TRANSFERS
158-
pegasus.transfer.refiner=Bundle
159-
pegasus.transfer.links = true
160-
161-
# JOB Priorities
162-
pegasus.job.priority=10
163-
pegasus.transfer.*.priority=100
164-
165-
#JOB CATEGORIES
166-
pegasus.dagman.projection.maxjobs=2
167-
</code></pre>
168-
169-
The Pegasus properties file controls where the sites.xml3 file lives and a few
170-
other Pegasus parameters (our tc.data and rc.data files in SeqWare are empty).
171-
The most important parameter above is "dagman.retry=1" which controls how many
172-
attempts should be made before job is considered failed in a workflow. In this
173-
example "1" means it should be retried once before failing. There are other
174-
parameters that might be useful for Pegasus, see the [Pegasus
175-
documentation](http://pegasus.isi.edu/) for more information.
67+
For versions of the oozie-sge plugin from 1.0.3 onwards, two kinds of error codes are possible. Error codes of the form SGE[0-9]+ refer to the exit status of the actual Bash scripts that form steps in your workflows. Error codes of the form SGEF[0-9]+ refer to the failure code of the SGE infrastructure itself.
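
To make the two forms concrete, here is a minimal Python sketch that tells them apart using the patterns quoted above. The function name `classify_error_code` and the labels it returns are hypothetical and are not part of the oozie-sge plugin.

```python
import re

# Patterns taken from the text above. The names below are illustrative only
# and do not come from the oozie-sge plugin.
STEP_CODE = re.compile(r"SGE([0-9]+)")    # exit status of the workflow step's Bash script
INFRA_CODE = re.compile(r"SGEF([0-9]+)")  # failure code reported by SGE itself


def classify_error_code(code):
    """Return (kind, number) for an SGE-style error code, or None if it matches neither form."""
    match = INFRA_CODE.fullmatch(code)
    if match:
        return ("sge_failure", int(match.group(1)))
    match = STEP_CODE.fullmatch(code)
    if match:
        return ("step_exit_status", int(match.group(1)))
    return None


print(classify_error_code("SGE137"))
print(classify_error_code("SGEF26"))
```

Using `fullmatch` keeps the two forms from being confused: `SGEF26` cannot match the step pattern because `F` is not a digit.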
68+
69+
For example, the following output from "qacct -j" describes a workflow step that failed with an exit status of 1 (which would correspond to SGE1 for the Oozie XML parameter above).
70+
71+
$ qacct -j 3702
72+
==============================================================
73+
qname main.q
74+
hostname master
75+
group seqware
76+
owner seqware
77+
project NONE
78+
department defaultdepartment
79+
jobname annotate_5
80+
jobnumber 3702
81+
taskid undefined
82+
account sge
83+
priority 0
84+
qsub_time Fri Aug 29 16:40:08 2014
85+
start_time Fri Aug 29 16:40:20 2014
86+
end_time Fri Aug 29 16:40:21 2014
87+
granted_pe NONE
88+
slots 1
89+
failed 0
90+
exit_status 1
91+
ru_wallclock 1
92+
ru_utime 1.468
93+
ru_stime 0.072
94+
ru_maxrss 112212
95+
ru_ixrss 0
96+
ru_ismrss 0
97+
ru_idrss 0
98+
ru_isrss 0
99+
ru_minflt 42375
100+
ru_majflt 0
101+
ru_nswap 0
102+
ru_inblock 0
103+
ru_oublock 168
104+
ru_msgsnd 0
105+
ru_msgrcv 0
106+
ru_nsignals 0
107+
ru_nvcsw 726
108+
ru_nivcsw 269
109+
cpu 1.540
110+
mem 0.306
111+
io 0.006
112+
iow 0.000
113+
maxvmem 557.734M
114+
arid undefined
115+
116+
The following output from "qacct -j" refers to a workflow step where the actual qsub failed because a logging directory was unavailable (leading to an Eqw state). This would correspond to an Oozie error code of SGEF26.
117+
118+
$ qacct -j 3801
119+
==============================================================
120+
qname main.q
121+
hostname master
122+
group seqware
123+
owner seqware
124+
project NONE
125+
department defaultdepartment
126+
jobname start_0
127+
jobnumber 3801
128+
taskid undefined
129+
account sge
130+
priority 0
131+
qsub_time Fri Sep 12 15:03:02 2014
132+
start_time -/-
133+
end_time -/-
134+
granted_pe NONE
135+
slots 1
136+
failed 26 : opening input/output file
137+
exit_status 0
138+
ru_wallclock 0
139+
ru_utime 0.000
140+
ru_stime 0.000
141+
ru_maxrss 0
142+
ru_ixrss 0
143+
ru_ismrss 0
144+
ru_idrss 0
145+
ru_isrss 0
146+
ru_minflt 0
147+
ru_majflt 0
148+
ru_nswap 0
149+
ru_inblock 0
150+
ru_oublock 0
151+
ru_msgsnd 0
152+
ru_msgrcv 0
153+
ru_nsignals 0
154+
ru_nvcsw 0
155+
ru_nivcsw 0
156+
cpu 0.000
157+
mem 0.000
158+
io 0.000
159+
iow 0.000
160+
maxvmem 0.000
161+
arid undefined
162+
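
The two "qacct -j" listings above suggest a simple way to work out which error code to list for a given job. The Python sketch below is only an illustration: the helper names are hypothetical, and the rule that a non-zero `failed` field takes precedence over `exit_status` is an assumption inferred from the two examples, not taken from the plugin source.

```python
def parse_qacct(text):
    """Parse 'qacct -j' output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def oozie_error_code(fields):
    """Derive an SGE/SGEF code from parsed qacct fields.

    Assumption: a non-zero 'failed' value means SGE itself failed (SGEF<n>);
    otherwise the step's exit status is reported (SGE<n>).
    """
    # 'failed' may carry a description, e.g. "26 : opening input/output file".
    failed = int(fields["failed"].split()[0])
    if failed != 0:
        return "SGEF%d" % failed
    return "SGE%d" % int(fields["exit_status"].split()[0])


step_failure = "jobname annotate_5\nfailed 0\nexit_status 1\n"
sge_failure = "jobname start_0\nfailed 26 : opening input/output file\nexit_status 0\n"

print(oozie_error_code(parse_qacct(step_failure)))
print(oozie_error_code(parse_qacct(sge_failure)))
```

A job with `failed 0` and `exit_status 0` would yield `SGE0` here, which simply denotes success and would not normally be listed as a retryable error code.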
