Skip to content

Latest commit

 

History

History
124 lines (81 loc) · 4.34 KB

README.adoc

File metadata and controls

124 lines (81 loc) · 4.34 KB

TPC-DS Benchmark with Spark on OpenShift

The following instructions are derived from this project.

We will built tools and run TPS-DS benchmarks on Spark 3.0.1 using an S3 storage data lake.

If you want to directly Generate the Data and Run the Benchmark using pre-built images, you can directly go to the Run the benchmark section.

Prerequisite

You will need a pre-built Spark image with the S3 connector built in. Refer to this project.

Build Benchmark project image

Note
You will need the sbt tool to build the dependency and the benchmark tool.

Build dependencies

  • spark-sql-perf

Latest version in maven central repo is 0.3.2 which is too old, we need to build a new libary from source.

Get the source
git clone https://github.com/databricks/spark-sql-perf
cd spark-sql-perf

Check/Edit the version of Spark or Scala you want to use in the file build.sbt.

Build spark-sql-perf dependency
sbt +package
cp target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar <your-code-path>/benchmark/libs

Build Benchmark utility

From the benchmark folder:

sbt assembly

Build Benchmark container image

Build from our pre-built Spark Image
docker build -t spark-benchmark:s3.0.1-h3.3.0_v0.0.1 --build-arg SPARK_BASE_IMAGE=quay.io/guimou/spark-odh:s3.0.1-h3.3.0_v0.0.2 .
Tag and push the image
docker tag spark-benchmark:s3.0.1-h3.3.0_v0.0.1 quay.io/guimou/spark-benchmark:s3.0.1-h3.3.0_v0.0.1
docker push quay.io/guimou/spark-benchmark:s3.0.1-h3.3.0_v0.0.1

Run the benchmark

In the examples folder you will find different examples to create the Datasets or Run the benchmark. You can adapt them to your environment: data location, logs location,…​

Following instructions use an S3 bucket for the Data, as well as the History Server as described here.

Data Store

We’ll create a bucket to hold the TPC-DS Data using and ObjectBucketClaim.

Note
This OBC creates a bucket in the RGW from an OpenShift Data Foundation deployment. Adapt the instructions depending on your S3 provider.

From the test folder:

Create the OBC
oc apply -f obc-tpcds-data.yaml

History Server

To make it easier to retrieve all the data from the generation and the benchmark, point your history server to the same bucket.

Note
you must create a logs-dir folder in this bucket prior to launching the history server (populate it with a hidden empty file like .s3keep).

CPU and Memory limits adjustments

By default OpenShift places restrictions on the size of the pods you can launch, which may be an issue for intensive workloads. We must adjust what the Spark-Operator is allowed to do when launching the driver and the executors.

From the OpenShift UI, as a cluster-admin, go to Administration→LimitRanges and edit spark-operator-core-resource-limits according to your needs and the available resources. The parameters to change are the max for cpu and memory for both Containers and Pods.

Data Generation

From the examples folder, review and apply one of the file called tpcds-data-generation-SIZE.yaml. The SIZE in the filename indicates the size of the synthetic dataset created, in GB. In the file, choose the number or executors you want to run along with their sizing. You must also replace the placeholders with the values for your bucket namen, access key and secret key.

Run the data generation for a 1G dataset
oc apply -f tpcds-data-generation-1G.yaml

Benckmark

From the examples folder, review and apply one of the file called tpcds-benchmark-SIZE.yaml. The SIZE in the filename indicates the size of the synthetic dataset created, in GB. In the file, choose the number or executors you want to run along with their sizing. You must also replace the placeholders with the values for your bucket namen, access key and secret key.

Run the data generation for a 1G dataset
oc apply -f tpcds-benchmark-1G.yaml

Apart from the History Server, all the Results are saved in YOUR_BUCKET/TPCDS-TEST-1G-RESULT.

Note
Adapat all those instructions for different sizes of datasets.