To manage batch jobs, we use Google Cloud Dataproc Workflow templates.
The following commands create the workflow template with all of its parameters, ready to be triggered by the EventsAPI when needed.
- Log in with the gcloud CLI:
```sh
gcloud init
```
- Import the workflow template:
```sh
gcloud dataproc workflow-templates import epic-spark --source workflow.yaml
```
You can check the result in the Dataproc web interface.
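If you need to trigger the workflow manually for testing (normally the EventsAPI does this), the standard instantiate command should work; the parameter name and value below are placeholders for whatever the template defines:

```sh
# Run the template once; pass required template parameters as NAME=VALUE pairs.
gcloud dataproc workflow-templates instantiate epic-spark --parameters=PARAMETER_NAME=VALUE
```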
Requires: `mvn`
- Generate a new Maven project:
```sh
mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4
```
- Add the Spark dependency and the shading plugin to the generated `pom.xml` (see the media-spark `pom.xml` as an example).
- Create the Spark job. It should take an event name as its only argument; this event name tells the job which event to run on. Make sure results are also written to a separate output folder per event (see the sketch below).
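A minimal sketch of such a job, assuming `args[0]` is the bare event name (the workflow example below passes a full GCS glob as the argument, so adapt the parsing if your template does the same). The package, class name, results bucket, and the per-day count are placeholders, not the real analysis:

```java
package com.example.epic;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical skeleton of an event-driven Spark job.
// Bucket names and the analysis itself are placeholders.
public final class EventJob {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: EventJob <eventName>");
            System.exit(1);
        }
        String eventName = args[0];

        SparkSession spark = SparkSession.builder()
                .appName("epic-spark-" + eventName)
                .getOrCreate();

        // Assumed input layout: one folder per event in the historic-tweets bucket.
        Dataset<Row> tweets = spark.read().json("gs://epic-historic-tweets/" + eventName + "/*");

        // Placeholder analysis: tweet counts per day. Replace with the real job logic.
        Dataset<Row> counts = tweets.groupBy("created_at").count();

        // Write results into a separate folder per event.
        counts.write().mode("overwrite").json("gs://epic-spark-results/" + eventName + "/");

        spark.stop();
    }
}
```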
- Generate the jar:
```sh
mvn package
```
- Upload the jar from the `target` folder to the epic-spark-jars Google Cloud bucket.
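For example, with gsutil from the project root (the jar name is a placeholder):

```sh
gsutil cp target/YOUR_JAR_FILE.jar gs://epic-spark-jars/
```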
- Add your job to workflow.yaml under the `jobs` tag, using this template and replacing the jar file and step id with your own (the step id must be unique):
```yaml
- sparkJob:
    args:
    - gs://epic-historic-tweets/random/*
    mainJarFileUri: gs://epic-spark-jars/YOUR_JAR_FILE.jar
  stepId: YOUR_STEP_ID
```
- Add your step to the event parameter (see the `fields` list). Make sure to replace YOUR_STEP_ID with the step id you set in the previous step:
```yaml
- jobs['YOUR_STEP_ID'].sparkJob.args[0]
```
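For reference, the relevant section of workflow.yaml should end up looking roughly like this; the parameter name EVENT is an assumption, so keep whatever name the template already uses:

```yaml
parameters:
- name: EVENT                              # assumed name; keep the existing one
  fields:
  - jobs['YOUR_STEP_ID'].sparkJob.args[0]  # field that receives the event value
```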
- Update the workflow template:
```sh
gcloud dataproc workflow-templates import epic-spark --source workflow.yaml
```