Usage with Apache Spark on YARN
===============================

``venv-pack`` can be used to distribute virtual environments to be used with
`Apache Spark <https://spark.apache.org/>`_ jobs when `deploying on Apache
YARN <https://spark.apache.org/docs/latest/running-on-yarn.html>`_. By
bundling your environment for use with Spark, you can use custom packages and
ensure that they're consistently provided on every node. This makes use of
YARN's resource localization by distributing environments as archives, which
are then automatically unarchived on every node. In this case either the
``tar.gz`` or ``zip`` format must be used.

Example
-------

Create an environment:

.. code-block:: bash

    # Using venv (Python 3 only)
    $ python -m venv example

    # Or using virtualenv
    $ virtualenv example

Activate the environment:

.. code-block:: bash

    $ source example/bin/activate

Install some packages into the environment:

.. code-block:: bash

    (example) $ pip install numpy pandas scikit-learn scipy

Package the environment into a ``tar.gz`` archive:

.. code-block:: bash

    (example) $ venv-pack -o environment.tar.gz
    Collecting packages...
    Packing environment at '/home/jcrist/example' to 'environment.tar.gz'
    [########################################] | 100% Completed | 16.6s

Write a PySpark script, for example:

.. code-block:: python

    # script.py
    from pyspark import SparkConf
    from pyspark import SparkContext

    conf = SparkConf()
    conf.setAppName('spark-yarn')
    sc = SparkContext(conf=conf)

    def some_function(x):
        # Packages are imported and available from your bundled environment.
        import sklearn
        import pandas
        import numpy as np

        # Use the libraries to do work
        return np.sin(x)**2 + 2

    # take() collects the first 10 results back to the driver as a list
    result = (sc.parallelize(range(1000))
                .map(some_function)
                .take(10))

    print(result)

Submit the job to Spark using ``spark-submit``. In YARN cluster mode:

.. code-block:: bash

    $ PYSPARK_PYTHON=./environment/bin/python \
    spark-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
    --master yarn \
    --deploy-mode cluster \
    --archives environment.tar.gz#environment \
    script.py

Or in YARN client mode:

.. code-block:: bash

    $ PYSPARK_DRIVER_PYTHON=`which python` \
    PYSPARK_PYTHON=./environment/bin/python \
    spark-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
    --master yarn \
    --deploy-mode client \
    --archives environment.tar.gz#environment \
    script.py

You can also start a PySpark interactive session using the following:

.. code-block:: bash

    $ PYSPARK_DRIVER_PYTHON=`which python` \
    PYSPARK_PYTHON=./environment/bin/python \
    pyspark \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
    --master yarn \
    --deploy-mode client \
    --archives environment.tar.gz#environment
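
Once the session starts, you can check that tasks really run inside the
localized environment. A minimal sketch (``describe_worker`` is just an
illustrative helper, and it assumes ``numpy`` was installed as above):

.. code-block:: python

    >>> def describe_worker(_):
    ...     # Runs on the executors; reports which Python and numpy they use
    ...     import sys
    ...     import numpy
    ...     return (sys.executable, numpy.__version__)
    ...
    >>> sc.parallelize(range(4), 2).map(describe_worker).distinct().collect()

Each distinct result should report a Python interpreter located inside the
unpacked ``environment`` directory on the executors.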
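
As noted above, the ``zip`` format also works with YARN's resource
localization. ``venv-pack`` should infer the archive format from the output
file's extension, so a ``zip``-based workflow would mirror the ``tar.gz``
commands above:

.. code-block:: bash

    # Format inferred from the .zip extension
    (example) $ venv-pack -o environment.zip

Then pass ``--archives environment.zip#environment`` to ``spark-submit`` or
``pyspark`` in place of the ``tar.gz`` archive.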