Usage with Apache Spark on YARN
venv-pack can be used to distribute virtual environments for use with Apache Spark jobs deployed on Apache YARN. By bundling your environment for use with Spark, you can use custom packages and ensure that they're provided consistently on every node. This makes use of YARN's resource localization: environments are distributed as archives and automatically unarchived on every node, which is why either the tar.gz or zip format must be used.
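On each node the archive is unpacked into a directory whose layout mirrors the packed virtual environment. Roughly (a sketch; exact contents vary by platform and Python version):

environment/
├── bin/
│   ├── activate
│   └── python           # the interpreter the Spark workers invoke
└── lib/
    └── python3.x/
        └── site-packages/   # numpy, pandas, etc.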
Example
Create an environment:
# Using venv (Python 3 only)
$ python -m venv example
# Or using virtualenv
$ virtualenv example
Activate the environment:
$ source example/bin/activate
Install some packages into the environment:
(example) $ pip install numpy pandas scikit-learn scipy
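For a reproducible build you may prefer to pin versions in a requirements file first; the file name and pins below are hypothetical:

(example) $ cat requirements.txt
numpy==1.24.4
pandas==2.0.3
scikit-learn==1.3.2
scipy==1.10.1
(example) $ pip install -r requirements.txt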
Package the environment into a tar.gz archive:
(example) $ venv-pack -o environment.tar.gz
Collecting packages...
Packing environment at '/home/jcrist/example' to 'environment.tar.gz'
[########################################] | 100% Completed | 16.6s
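To produce a zip archive instead, name the output accordingly; venv-pack infers the archive format from the output extension (it also exposes a --format option; check venv-pack --help for your version):

(example) $ venv-pack -o environment.zip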
Write a PySpark script, for example:
# script.py
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def some_function(x):
    # Packages are imported and available from your bundled environment.
    import sklearn
    import pandas
    import numpy as np

    # Use the libraries to do work
    return np.sin(x)**2 + 2

# take(10) returns a plain list of results, not an RDD
result = (sc.parallelize(range(1000))
            .map(some_function)
            .take(10))
print(result)
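If you prefer the newer SparkSession entry point, the same job could be written as follows (a sketch; the file name script_session.py is hypothetical):

# script_session.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark-yarn').getOrCreate()
sc = spark.sparkContext

def some_function(x):
    # numpy is provided by the bundled environment
    import numpy as np
    return np.sin(x)**2 + 2

print(sc.parallelize(range(1000)).map(some_function).take(10))
spark.stop()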
Submit the job to Spark using spark-submit. The #environment suffix on the --archives argument names the directory the archive is unpacked into on each node, which is why PYSPARK_PYTHON points at ./environment/bin/python. In YARN cluster mode:
$ PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode cluster \
--archives environment.tar.gz#environment \
script.py
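Equivalently, the archive can be passed through configuration instead of the --archives flag, using Spark's spark.yarn.dist.archives property (shown here with the same #environment alias):

$ PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.dist.archives=environment.tar.gz#environment \
--master yarn \
--deploy-mode cluster \
script.py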
Or in YARN client mode, where the driver runs locally with the local Python (set via PYSPARK_DRIVER_PYTHON) while the executors use the shipped environment:
$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment \
script.py
You can also start a PySpark interactive session using the following:
$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
pyspark \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment
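Once the session starts, you can confirm that the executors see the bundled packages by importing one inside a task; this hypothetical check, run at the >>> prompt, should print the numpy version installed in your packed environment:

>>> sc.parallelize(range(1)).map(lambda _: __import__('numpy').__version__).first()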