Usage with Apache Spark on YARN
===============================
``venv-pack`` can be used to distribute virtual environments for use with
`Apache Spark <https://spark.apache.org/>`_ jobs when `deploying on Apache
YARN <https://spark.apache.org/docs/latest/running-on-yarn.html>`_. By
bundling your environment for use with Spark, you can use custom packages and
ensure that they're provided consistently on every node. This makes use of
YARN's resource localization: environments are distributed as archives, which
are then automatically unarchived on every node. For this to work, either the
``tar.gz`` or ``zip`` format must be used.
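
The localization step can be pictured with plain Python. The sketch below
(standard library only, hypothetical paths, not YARN's actual localizer code)
packs a minimal ``bin/python`` layout into a ``tar.gz`` and unpacks it into an
alias directory, mirroring what ``--archives environment.tar.gz#environment``
does on each node:

.. code-block:: python

   import os
   import tarfile
   import tempfile

   workdir = tempfile.mkdtemp()

   # Stand-in for a packed environment: a tar.gz whose top level
   # contains ``bin/python`` (the layout venv-pack produces).
   env_src = os.path.join(workdir, "src")
   os.makedirs(os.path.join(env_src, "bin"))
   open(os.path.join(env_src, "bin", "python"), "w").close()

   archive = os.path.join(workdir, "environment.tar.gz")
   with tarfile.open(archive, "w:gz") as tar:
       tar.add(os.path.join(env_src, "bin"), arcname="bin")

   # YARN unpacks the archive into a directory named by the ``#`` alias,
   # so executors can reach ``./environment/bin/python``.
   alias_dir = os.path.join(workdir, "environment")
   with tarfile.open(archive) as tar:
       tar.extractall(alias_dir)

   print(os.path.exists(os.path.join(alias_dir, "bin", "python")))  # True

Because the alias directory sits in each container's working directory, the
relative path ``./environment/bin/python`` used below resolves on every node.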
Example
-------

Create an environment:

.. code-block:: bash

   # Using venv (Python 3 only)
   $ python -m venv example

   # Or using virtualenv
   $ virtualenv example
Activate the environment:

.. code-block:: bash

   $ source example/bin/activate
Install some packages into the environment:

.. code-block:: bash

   (example) $ pip install numpy pandas scikit-learn scipy
Package the environment into a ``tar.gz`` archive:

.. code-block:: bash

   (example) $ venv-pack -o environment.tar.gz
   Collecting packages...
   Packing environment at '/home/jcrist/example' to 'environment.tar.gz'
   [########################################] | 100% Completed | 16.6s
Write a PySpark script, for example:

.. code-block:: python

   # script.py
   from pyspark import SparkConf
   from pyspark import SparkContext

   conf = SparkConf()
   conf.setAppName('spark-yarn')
   sc = SparkContext(conf=conf)

   def some_function(x):
       # Packages are imported and available from your bundled environment.
       import sklearn
       import pandas
       import numpy as np

       # Use the libraries to do work
       return np.sin(x)**2 + 2

   rdd = (sc.parallelize(range(1000))
            .map(some_function)
            .take(10))
   print(rdd)
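
Since ``some_function`` is plain Python, you can sanity-check it locally
before submitting. A small sketch (standard library only, mirroring the first
ten elements of the RDD pipeline with ``math.sin`` in place of ``np.sin``):

.. code-block:: python

   import math

   def some_function(x):
       # Same computation as in script.py, without the bundled imports.
       return math.sin(x) ** 2 + 2

   # Local equivalent of sc.parallelize(range(1000)).map(some_function).take(10)
   result = [some_function(x) for x in range(10)]
   print(result[0])  # sin(0)**2 + 2 == 2.0

Catching a bug here is much cheaper than discovering it in executor logs
after the job has been localized and launched on the cluster.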
Submit the job to Spark using ``spark-submit``. In YARN cluster mode:

.. code-block:: bash

   $ PYSPARK_PYTHON=./environment/bin/python \
   spark-submit \
   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
   --master yarn \
   --deploy-mode cluster \
   --archives environment.tar.gz#environment \
   script.py
Or in YARN client mode:

.. code-block:: bash

   $ PYSPARK_DRIVER_PYTHON=`which python` \
   PYSPARK_PYTHON=./environment/bin/python \
   spark-submit \
   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
   --master yarn \
   --deploy-mode client \
   --archives environment.tar.gz#environment \
   script.py
You can also start an interactive PySpark session:

.. code-block:: bash

   $ PYSPARK_DRIVER_PYTHON=`which python` \
   PYSPARK_PYTHON=./environment/bin/python \
   pyspark \
   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
   --master yarn \
   --deploy-mode client \
   --archives environment.tar.gz#environment
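
If you submit many jobs, the long command lines above can be assembled
programmatically. A sketch (the ``build_submit_argv`` helper is hypothetical;
the flags mirror the cluster-mode command shown above):

.. code-block:: python

   def build_submit_argv(archive, alias, script, deploy_mode="cluster"):
       # Hypothetical helper assembling the spark-submit invocation
       # documented above; the alias names the unpacked directory.
       python = "./%s/bin/python" % alias
       return [
           "spark-submit",
           "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=%s" % python,
           "--master", "yarn",
           "--deploy-mode", deploy_mode,
           "--archives", "%s#%s" % (archive, alias),
           script,
       ]

   argv = build_submit_argv("environment.tar.gz", "environment", "script.py")
   print(" ".join(argv))

Passing the result to ``subprocess.run(argv, env=...)`` (with
``PYSPARK_PYTHON`` set in ``env``) keeps the archive name, alias, and
interpreter path consistent across jobs.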