This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: 3hUOjpMNq-8n6rEFN3KBybzPot8ca9lpzT6QOOmoFVY

Building a Custom Docker Image for K8s Spark Operator to Fix Vulnerabilities

Written by @kirillkulikov | Published on 2024/10/11

TL;DR
We need Spark Operator in a K8s cluster to run Spark jobs, but the official image contains many vulnerabilities, largely from its Hadoop libraries. We build our own image, using a Hadoop-free Spark image as the base and a Golang image to compile Spark Operator itself.

There is a requirement to use Spark Operator in a K8s cluster to run a Spark job. The official image contains many vulnerabilities, including ones pulled in by the Hadoop libraries. Let's build our own Spark Operator image.

To do so, we'll need a Spark image as the base image and a Golang image to compile Spark Operator itself.

Spark image

First, build a Spark image without Hadoop, pinned to a specific Spark version:

RUN curl -L https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-without-hadoop.tgz -o spark-3.5.1-bin-without-hadoop.tgz \
    && tar -xvzf spark-3.5.1-bin-without-hadoop.tgz \
    && mv spark-3.5.1-bin-without-hadoop /opt/spark \
    && rm spark-3.5.1-bin-without-hadoop.tgz
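
In context, a minimal base-image Dockerfile around this step might look like the following sketch. The eclipse-temurin JDK base image, the SPARK_VERSION build arg, and the ENV lines are assumptions for illustration, not part of the original build:

```dockerfile
# Hypothetical minimal Spark base image; the JDK image and paths are assumptions.
FROM eclipse-temurin:11-jre

ARG SPARK_VERSION=3.5.1

# Download the Hadoop-free Spark distribution and unpack it to /opt/spark
RUN curl -L https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz -o spark.tgz \
    && mkdir -p /opt/spark \
    && tar -xzf spark.tgz --strip-components 1 -C /opt/spark \
    && rm spark.tgz

ENV SPARK_HOME=/opt/spark
ENV PATH=${PATH}:${SPARK_HOME}/bin
```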

Spark-operator image

To build the Spark Operator image, we will need several Hadoop libraries to run spark-submit commands.

As an example, a FIPS-compliant build is shown; it differs only in the build and run commands:

  • For the Go build, the GOEXPERIMENT=boringcrypto parameter is used.
  • For running spark-submit, the Java option -Djavax.net.ssl.trustStorePassword=password is passed for Bouncy Castle.

You can also build the image without the FIPS changes.
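
Side by side, the two variants differ only in these flags (a sketch based on the commands above; everything else stays the same):

```shell
# FIPS build: BoringCrypto-backed crypto in the Go binary
CGO_ENABLED=0 GOEXPERIMENT=boringcrypto go build -a -o spark-operator main.go

# Non-FIPS build: simply drop GOEXPERIMENT
CGO_ENABLED=0 go build -a -o spark-operator main.go

# FIPS run: pass the trust store password to the JVM for spark-submit
export SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} -Djavax.net.ssl.trustStorePassword=password"
```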

To run spark-submit, we will add Hadoop libraries during the build process:

  • hadoop-client-runtime
  • hadoop-client-api
  • slf4j-api
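
Instead of unpacking the full Hadoop distribution, these three jars could also be pulled straight from Maven Central. This is a sketch, not the article's method; the slf4j-api version pin in particular is an assumption and should match what your Spark build expects:

```shell
# Hypothetical alternative: fetch only the needed jars from Maven Central
HADOOP_VERSION=3.4.0
SLF4J_VERSION=2.0.13   # assumption: align with your Spark distribution
MAVEN=https://repo1.maven.org/maven2
curl -Lo /opt/spark/jars/hadoop-client-runtime-${HADOOP_VERSION}.jar \
    ${MAVEN}/org/apache/hadoop/hadoop-client-runtime/${HADOOP_VERSION}/hadoop-client-runtime-${HADOOP_VERSION}.jar
curl -Lo /opt/spark/jars/hadoop-client-api-${HADOOP_VERSION}.jar \
    ${MAVEN}/org/apache/hadoop/hadoop-client-api/${HADOOP_VERSION}/hadoop-client-api-${HADOOP_VERSION}.jar
curl -Lo /opt/spark/jars/slf4j-api-${SLF4J_VERSION}.jar \
    ${MAVEN}/org/slf4j/slf4j-api/${SLF4J_VERSION}/slf4j-api-${SLF4J_VERSION}.jar
```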

The entrypoint.sh is taken from the official Kubeflow repository: https://github.com/kubeflow/spark-operator/blob/master/entrypoint.sh

Example Dockerfile for building Spark Operator

ARG ECR_URL
ARG SPARK_IMAGE=spark-3.5.1-bin-without-hadoop
ARG GOLANG_IMAGE=golang-1.21
ARG SPARK_OPERATOR_VERSION=1.3.1
ARG HADOOP_VERSION_DEFAULT=3.4.0
ARG HADOOP_TMP_HOME="/opt/hadoop"
ARG TARGETARCH=amd64

# Prepare spark-operator build
FROM ${GOLANG_IMAGE} as builder
WORKDIR /app/spark-operator

ARG SPARK_OPERATOR_VERSION
RUN curl -Ls https://github.com/kubeflow/spark-operator/archive/refs/tags/spark-operator-chart-${SPARK_OPERATOR_VERSION}.tar.gz | tar -xz --strip-components 1 -C /app/spark-operator

RUN GOTOOLCHAIN=go1.22.3 go mod download

# Build
ARG TARGETARCH
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} GO111MODULE=on GOTOOLCHAIN=go1.22.3 GOEXPERIMENT=boringcrypto go build -a -o /app/spark-operator/spark-operator main.go

# Install Hadoop jars
ARG HADOOP_VERSION_DEFAULT
ARG HADOOP_TMP_HOME
RUN mkdir -p ${HADOOP_TMP_HOME}
RUN curl -Ls https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION_DEFAULT}/hadoop-${HADOOP_VERSION_DEFAULT}.tar.gz | tar -xz --strip-components 1 -C ${HADOOP_TMP_HOME}

# Prepare spark-operator image
FROM ${ECR_URL}:${SPARK_IMAGE}
WORKDIR /opt/spark-operator
USER root

ENV SPARK_HOME="/opt/spark"
ENV JAVA_HOME="/opt/jdk-11.0.21"
ENV SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} -Djavax.net.ssl.trustStorePassword=password"
ENV PATH=${PATH}:${SPARK_HOME}/bin:${JAVA_HOME}/bin

RUN yum update -y && \
    yum install --setopt=tsflags=nodocs -y openssl && \
    yum clean all

ARG HADOOP_TMP_HOME
COPY --from=builder ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-runtime-*.jar ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-api-*.jar ${HADOOP_TMP_HOME}/share/hadoop/common/lib/slf4j-api-*.jar /opt/spark/jars/

COPY --from=builder /app/spark-operator/spark-operator /opt/spark-operator/
COPY --from=builder /app/spark-operator/hack/gencerts.sh /usr/bin/

COPY entrypoint.sh /opt/spark-operator/
RUN chmod a+x /opt/spark-operator/entrypoint.sh
ENTRYPOINT ["/opt/spark-operator/entrypoint.sh"]
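
With this Dockerfile, the build is parameterized through build args. A usage sketch, where the registry URL and tags are placeholders to substitute with your own:

```shell
# Placeholders: substitute your own registry and image tags
docker build \
    --build-arg ECR_URL=123456789012.dkr.ecr.eu-west-1.amazonaws.com/spark \
    --build-arg SPARK_IMAGE=spark-3.5.1-bin-without-hadoop \
    --build-arg GOLANG_IMAGE=golang-1.21 \
    --build-arg SPARK_OPERATOR_VERSION=1.3.1 \
    --build-arg HADOOP_VERSION_DEFAULT=3.4.0 \
    --build-arg TARGETARCH=amd64 \
    -t spark-operator:1.3.1-nohadoop .
```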

Conclusion

After the build, we still have several vulnerabilities in the Hadoop library hadoop-client-runtime:

  • org.apache.avro:avro (hadoop-client-runtime-3.4.0.jar) – CVE-2023-39410
  • org.apache.commons:commons-compress – CVE-2024-25710, CVE-2024-26308

We cannot run spark-submit without this library. Still, the vast majority of the vulnerabilities are removed along with the main Hadoop libraries.
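
To verify exactly which vulnerabilities remain, the built image can be rescanned, for example with Trivy (the image tag here is a placeholder):

```shell
# Scan the built image, showing only high and critical findings
trivy image --severity HIGH,CRITICAL spark-operator:1.3.1-nohadoop
```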



Written by
@kirillkulikov
I believe in the power of automation

Topics and tags
devops|docker|kubernetes|kubernetes-guide|custom-docker-image|k8s-spark-operator|fixing-docker-vulnerabilities|spark-image