
Commit 3ccb729

support Spark 3.2 and EMR 6.7 (#98)
* support Spark 3.2 and EMR 6.7
* fix previous commit format error by running black
* fix missing docstring in public module
* fix epel release error
* fix epel release error
* revert back epel change
* change for CR feedback
* delete setup.py
1 parent d425af9 commit 3ccb729

File tree

17 files changed: +1506 -12 lines changed

DEVELOPMENT.md

Lines changed: 32 additions & 2 deletions

@@ -126,14 +126,27 @@ docker push $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/$SPARK_REPOSITORY:$V
 make test-sagemaker
 ```
 
+6. Please run the following commands before you raise a CR:
+
+```
+make test-unit
+make install-container-library
+```
+
+
 ## Push the code
 1. You need to create a PR in order to merge the code. How to create a PR is described here: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
 2. You need GitHub access to the AWS organization. Please follow: https://w.amazon.com/?Open_Source/GitHub
 3. Get access to team-specific permissions; an example is here: https://github.com/orgs/aws/teams/sagemakerwrite/members
 4. Ask a person to review the code and merge it in. This repo needs at least one code reviewer.
 5. The code needs to be signed before pushing. More detail about signing is here: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits.
-Remember in your local, you need to set up: git config --global user.signingkey [key id] and also upload public key into your github account.
-6. The email you specify when you created public key must match github email in github settings.
+
+```
+$ git commit -S -m "your commit message"
+```
+
+6. Remember to set up git config --global user.signingkey [key id] locally, and upload your public key to your GitHub account.
+The email you specified when creating the public key must match the GitHub email in your GitHub settings.
 
 ### FAQ
 
@@ -168,3 +181,20 @@ make: *** [install-container-library] Error 255
 ```
 
 * You need to update the corresponding package version in smsparkbuild/py39/Pipfile.
+
+6. The code build may fail because of formatting,
+for example:
+```
+2 files would be reformatted, 13 files would be left unchanged.
+```
+
+You can fix it by running
+
+```
+black src/smspark/bootstrapper.py
+```
+See https://www.freecodecamp.org/news/auto-format-your-python-code-with-black/ for details.
+
+7. Remember to define a module docstring at the start of each Python file, otherwise flake8 reports a missing-docstring error.
+
+See more detail here: https://stackoverflow.com/questions/46192576/how-can-i-fix-flake8-d100-missing-docstring-error-in-atom-editor
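
For item 7 above (the flake8 missing-docstring error), here is a minimal sketch of what flake8-docstrings expects at the top of a public module; the module text and function name are hypothetical and not taken from this repository:

```
"""Hypothetical module docstring: describe in one line what this module does.

flake8-docstrings raises D100 when a public module has no docstring; a docstring
like this at the very top of the file resolves it. Public classes and functions
need their own docstrings (D101/D103).
"""


def bootstrap_step() -> None:
    """Illustrative public function carrying its own docstring."""
    return None
```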

Makefile

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ SHELL := /bin/sh
 
 # Set variables if testing locally
 ifeq ($(IS_RELEASE_BUILD),)
-SPARK_VERSION := 3.1
+SPARK_VERSION := 3.2
 PROCESSOR := cpu
 FRAMEWORK_VERSION := py39
 SM_VERSION := 1.0

Pipfile

Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
+[[source]]
+name = "pypi"
+url = "https://pypi.org/simple"
+verify_ssl = true
+
+[dev-packages]
+
+[packages]
+tenacity = "==8.0.1"
+psutil = "==5.9.0"
+click = "==8.1.2"
+watchdog = "==0.10.3"
+waitress = "==2.1.2"
+types-waitress = "==2.0.6"
+requests = "==2.27.1"
+types-requests = "==2.27.16"
+rsa = "==4.3"
+pyasn1 = "==0.4.8"
+boto3 = "==1.21.33"
+safety = "==1.10.3"
+black = "==22.3.0"
+mypy = "==0.942"
+flake8 = "==4.0.1"
+flake8-docstrings = "==1.5.0"
+pytest = "==7.1.1"
+pytest-cov = "==2.10.0"
+pytest-xdist = "==2.5.0"
+docker = "==5.0.3"
+docker-compose = "==1.29.2"
+cryptography = "==36.0.2"
+typing-extensions = "==4.1.1"
+sagemaker = "==2.83.0"
+smspark = {editable = true, path = "."}
+importlib-metadata = "==4.11.3"
+pytest-parallel = "==0.1.1"
+pytest-rerunfailures = "10.0"
+numpy = "==1.22.2"
+protobuf = "==3.20.1"
+
+[requires]
+python_version = "3.9"

Pipfile.lock

Lines changed: 1052 additions & 0 deletions
Generated file; diff not rendered.

new_images.yml

Lines changed: 2 additions & 2 deletions

@@ -1,7 +1,7 @@
 ---
 new_images:
-  - spark: "3.1.1"
+  - spark: "3.2"
     use-case: "processing"
    processors: ["cpu"]
    python: ["py39"]
-    sm_version: "1.3"
+    sm_version: "1.0"
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+echo "Not implemented"
Lines changed: 128 additions & 0 deletions

@@ -0,0 +1,128 @@
+FROM 137112412989.dkr.ecr.us-west-2.amazonaws.com/amazonlinux:2
+ARG REGION
+ENV AWS_REGION ${REGION}
+
+RUN yum clean all \
+    && yum update -y \
+    && yum install -y awscli bigtop-utils curl gcc gzip unzip zip gunzip tar wget liblapack* libblas* libopencv* libopenblas*
+
+# Install python 3.9
+ARG PYTHON_BASE_VERSION=3.9
+ARG PYTHON_WITH_BASE_VERSION=python${PYTHON_BASE_VERSION}
+ARG PIP_WITH_BASE_VERSION=pip${PYTHON_BASE_VERSION}
+ARG PYTHON_VERSION=${PYTHON_BASE_VERSION}.12
+RUN yum -y groupinstall 'Development Tools' \
+    && yum -y install openssl-devel bzip2-devel libffi-devel sqlite-devel xz-devel \
+    && wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
+    && tar xzf Python-${PYTHON_VERSION}.tgz \
+    && cd Python-*/ \
+    && ./configure --enable-optimizations \
+    && make altinstall \
+    && echo -e 'alias python3=python3.9\nalias pip3=pip3.9' >> ~/.bashrc \
+    && ln -s $(which ${PYTHON_WITH_BASE_VERSION}) /usr/local/bin/python3 \
+    && ln -s $(which ${PIP_WITH_BASE_VERSION}) /usr/local/bin/pip3 \
+    && cd .. \
+    && rm Python-${PYTHON_VERSION}.tgz \
+    && rm -rf Python-${PYTHON_VERSION}
+
+# Install nginx. amazonlinux:2.0.20200304.0 does not have nginx, so we need to install epel-release first.
+RUN wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
+RUN yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
+RUN yum install -y nginx
+
+RUN rm -rf /var/cache/yum
+
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
+ENV PYTHONHASHSEED 0
+ENV PYTHONIOENCODING UTF-8
+ENV PIP_DISABLE_PIP_VERSION_CHECK 1
+
+# Install EMR Spark/Hadoop
+ENV HADOOP_HOME /usr/lib/hadoop
+ENV HADOOP_CONF_DIR /usr/lib/hadoop/etc/hadoop
+ENV SPARK_HOME /usr/lib/spark
+
+COPY yum/emr-apps.repo /etc/yum.repos.d/emr-apps.repo
+
+# Install hadoop / spark dependencies from EMR's yum repository for Spark optimizations.
+# Replace the REGION placeholder with the region in the repository URL.
+RUN sed -i "s/REGION/${AWS_REGION}/g" /etc/yum.repos.d/emr-apps.repo
+RUN adduser -N hadoop
+
+# These packages are a subset of what EMR installs in a cluster with the
+# "hadoop", "spark", and "hive" applications.
+# They include EMR-optimized libraries and extras.
+RUN yum install -y aws-hm-client \
+    aws-java-sdk \
+    aws-sagemaker-spark-sdk \
+    emr-goodies \
+    emr-ruby \
+    emr-scripts \
+    emr-s3-select \
+    emrfs \
+    hadoop \
+    hadoop-client \
+    hadoop-hdfs \
+    hadoop-hdfs-datanode \
+    hadoop-hdfs-namenode \
+    hadoop-httpfs \
+    hadoop-kms \
+    hadoop-lzo \
+    hadoop-yarn \
+    hadoop-yarn-nodemanager \
+    hadoop-yarn-proxyserver \
+    hadoop-yarn-resourcemanager \
+    hadoop-yarn-timelineserver \
+    hive \
+    hive-hcatalog \
+    hive-hcatalog-server \
+    hive-jdbc \
+    hive-server2 \
+    s3-dist-cp \
+    spark-core \
+    spark-datanucleus \
+    spark-external \
+    spark-history-server \
+    spark-python
+
+
+# Point Spark at the proper python binary
+ENV PYSPARK_PYTHON=/usr/local/bin/python3.9
+
+# Set up the Spark/Yarn/HDFS user as root
+ENV PATH="/usr/bin:/opt/program:${PATH}"
+ENV YARN_RESOURCEMANAGER_USER="root"
+ENV YARN_NODEMANAGER_USER="root"
+ENV HDFS_NAMENODE_USER="root"
+ENV HDFS_DATANODE_USER="root"
+ENV HDFS_SECONDARYNAMENODE_USER="root"
+
+
+# Set up bootstrapping program and Spark configuration
+COPY hadoop-config /opt/hadoop-config
+COPY nginx-config /opt/nginx-config
+COPY aws-config /opt/aws-config
+COPY Pipfile Pipfile.lock setup.py *.whl /opt/program/
+ENV PIPENV_PIPFILE=/opt/program/Pipfile
+# Use the --system flag so packages are installed into the system python
+# rather than a virtualenv, since docker containers do not need virtualenvs.
+# pipenv > 2022.4.8 fails to build smspark.
+RUN /usr/local/bin/python3.9 -m pip install pipenv==2022.4.8 \
+    && pipenv install --system \
+    && /usr/local/bin/python3.9 -m pip install /opt/program/*.whl
+
+# Set up the container bootstrapper
+COPY container-bootstrap-config /opt/container-bootstrap-config
+RUN chmod +x /opt/container-bootstrap-config/bootstrap.sh \
+    && /opt/container-bootstrap-config/bootstrap.sh
+
+# With this config, the Spark history server will not run as a daemon; otherwise there
+# would be no server running and the container would terminate immediately.
+ENV SPARK_NO_DAEMONIZE TRUE
+
+WORKDIR $SPARK_HOME
+
+ENTRYPOINT ["smspark-submit"]

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<!-- Put site-specific property overrides in this file. -->
+
+<configuration>
+  <property>
+    <name>fs.defaultFS</name>
+    <value>hdfs://nn_uri/</value>
+    <description>NameNode URI</description>
+  </property>
+  <property>
+    <name>fs.s3a.aws.credentials.provider</name>
+    <value>com.amazonaws.auth.DefaultAWSCredentialsProviderChain</value>
+    <description>AWS S3 credential provider</description>
+  </property>
+  <property>
+    <name>fs.s3.impl</name>
+    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+    <description>s3a filesystem implementation</description>
+  </property>
+  <property>
+    <name>fs.AbstractFileSystem.s3a.imp</name>
+    <value>org.apache.hadoop.fs.s3a.S3A</value>
+    <description>s3a filesystem implementation</description>
+  </property>
+</configuration>
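
The properties above route s3:// and s3a:// URIs through Hadoop's S3AFileSystem and pick up credentials from the default AWS provider chain. A minimal sketch of what that enables inside the container, assuming PySpark is available; the bucket and prefix below are hypothetical placeholders, not part of this commit:

```
from pyspark.sql import SparkSession

# The S3A filesystem and DefaultAWSCredentialsProviderChain configured above
# let Spark resolve s3:// / s3a:// URIs without setting any keys here.
spark = SparkSession.builder.appName("s3a-smoke-test").getOrCreate()

# Hypothetical bucket and prefix, used purely for illustration.
df = spark.read.json("s3a://example-bucket/input/")
df.show(5)
```
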
Lines changed: 67 additions & 0 deletions

@@ -0,0 +1,67 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<!-- Put site-specific property overrides in this file. -->
+
+<configuration>
+  <property>
+    <name>dfs.datanode.data.dir</name>
+    <value>file:///opt/amazon/hadoop/hdfs/datanode</value>
+    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
+  </property>
+
+  <property>
+    <name>dfs.namenode.name.dir</name>
+    <value>file:///opt/amazon/hadoop/hdfs/namenode</value>
+    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
+  </property>
+
+  <!-- Fix for "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try".
+       From https://community.cloudera.com/t5/Support-Questions/Failed-to-replace-a-bad-datanode-on-the-existing-pipeline/td-p/207711
+       This issue can be caused by continuous network issues or repeated packet drops. It especially happens when data is
+       being written to a DataNode that is in the process of pipelining the data to the next DataNode, where any communication
+       issue may lead to pipeline failure. We only see this issue in small regions. -->
+  <property>
+    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
+    <value>true</value>
+    <description>
+      If there is a datanode/network failure in the write pipeline,
+      DFSClient will try to remove the failed datanode from the pipeline
+      and then continue writing with the remaining datanodes. As a result,
+      the number of datanodes in the pipeline is decreased. The feature is
+      to add new datanodes to the pipeline.
+
+      This is a site-wide property to enable/disable the feature.
+
+      When the cluster size is extremely small, e.g. 3 nodes or less, cluster
+      administrators may want to set the policy to NEVER in the default
+      configuration file or disable this feature. Otherwise, users may
+      experience an unusually high rate of pipeline failures since it is
+      impossible to find new datanodes for replacement.
+
+      See also dfs.client.block.write.replace-datanode-on-failure.policy
+    </description>
+  </property>
+
+  <property>
+    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
+    <value>ALWAYS</value>
+    <description>
+      This property is used only if the value of
+      dfs.client.block.write.replace-datanode-on-failure.enable is true.
+
+      ALWAYS: always add a new datanode when an existing datanode is
+      removed.
+
+      NEVER: never add a new datanode.
+
+      DEFAULT:
+        Let r be the replication number.
+        Let n be the number of existing datanodes.
+        Add a new datanode only if r is greater than or equal to 3 and either
+        (1) floor(r/2) is greater than or equal to n; or
+        (2) r is greater than n and the block is hflushed/appended.
+    </description>
+  </property>
+</configuration>
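
The DEFAULT policy described in that last property is easy to misread, so here is a small illustrative restatement of its rule; the function name and sample numbers are ours, not part of the commit, and this image ships with ALWAYS rather than DEFAULT:

```
import math


def default_policy_adds_datanode(r: int, n: int, hflushed_or_appended: bool) -> bool:
    """Restate the DEFAULT replace-datanode-on-failure rule quoted above.

    r is the block's replication factor, n the number of datanodes still in
    the write pipeline.
    """
    if r < 3:
        return False
    return math.floor(r / 2) >= n or (r > n and hflushed_or_appended)


# Replication 3 with one datanode left: floor(3/2) = 1 >= 1, so DEFAULT would
# still try to add a replacement datanode.
assert default_policy_adds_datanode(3, 1, False)
# Replication 2: DEFAULT never adds a replacement, which is why tiny clusters
# may prefer NEVER, as the description notes.
assert not default_policy_adds_datanode(2, 1, False)
```
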
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
+spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
+spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
+spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
+spark.driver.host=sd_host
+spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
+
+# Fix for "Uncaught exception: org.apache.spark.rpc.RpcTimeoutException: Cannot
+# receive any reply from 10.0.109.30:35219 in 120 seconds."
+spark.rpc.askTimeout=300s
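
These settings are baked into the image as Spark defaults. If a job needs to adjust the same kind of properties, the sketch below shows per-session overrides from PySpark, assuming PySpark is available; the values are illustrative only, not a recommendation from this commit:

```
from pyspark.sql import SparkSession

# Per-job overrides for two of the properties shipped in the image's Spark
# defaults; values set on the builder take precedence for this session.
spark = (
    SparkSession.builder.appName("override-example")
    .config("spark.rpc.askTimeout", "300s")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)
```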
