Most applications require some number of files (executables, configuration, scripts, etc…) to run. When deploying these applications on YARN you have two options depending on the kind of access you want to assume:
1. IT Permissions: Install the files on every node in the cluster. This usually requires a cluster administrator, and is often only done for things like commonly used libraries. For example, you might install Python and common libraries on every node, but not your application code.
2. User Permissions: Distribute the files to the executing containers as part of the application. This is the common approach, and does not require administrator privileges.
Both options have their place, and can be mixed in a single application. Below we discuss the second option in detail.
Specifying Files for a Service¶
Files for distribution are specified per-service using the files field of the specification. This field takes a mapping of destination relative paths to skein.File objects describing the sources for these paths. Each skein.File object takes the following fields:
source: The path to the file/archive. If no scheme is specified, the path is assumed to be on the local filesystem (file:// scheme). Relative paths are supported, and are taken relative to the location of the specification file.
type: The type of file to distribute, either archive or file. Archives are automatically extracted by YARN into a directory with the same name as their destination (only .zip, .tar.gz, and .tgz supported). Optional; by default the type is inferred from the file extension.
visibility: The resource visibility. Describes how resources are shared between applications. Valid values are:
    application – Shared among containers of the same application on the node.
    private – Shared among all applications of the same user on the node.
    public – Shared by all users on the node.
  Optional; the default is application. In most cases the default is what you want.
size: The resource size in bytes. Optional; if not provided, it is determined by the file system. In most cases the default is what you want.
timestamp: The time the resource was last modified. Optional; if not provided, it is determined by the file system. In most cases the default is what you want.
As a shorthand, values may be the source path instead of a skein.File object.

```yaml
services:
  my_service:
    files:
      # /local/path/to/file.zip will be uploaded to HDFS, and extracted
      # into the directory path_on_container
      path_on_container:
        source: /local/path/to/file.zip
        type: archive

      # Can also specify only the source path - missing fields are inferred
      script_path.py: /path/to/script.py

      # Files on remote filesystems can be used by specifying the scheme.
      # If the specified filesystem matches the default filesystem
      # (typically HDFS), uploading is avoided, allowing for faster
      # application startup.
      script2_path.py: hdfs:///remote/path/to/script2.py
```
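The shorthand and inference rules described above can be sketched in plain Python. This is a minimal illustration, not Skein's implementation: `normalize_file`, `ARCHIVE_SUFFIXES`, and the `spec_dir` default are hypothetical names, and the set of archive extensions is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed archive extensions (illustrative; check the Skein docs for the real set).
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tgz")


def normalize_file(value, spec_dir="/path/to/spec"):
    """Expand a `files` entry into a dict with `source` and `type` filled in.

    Hypothetical helper for illustration only; it is not part of Skein.
    """
    # Shorthand: a bare string is treated as the source path.
    entry = {"source": value} if isinstance(value, str) else dict(value)

    source = entry["source"]
    if not urlparse(source).scheme:
        # No scheme: assume the local filesystem, resolving relative paths
        # against the directory holding the specification file.
        source = "file://" + str(PurePosixPath(spec_dir) / source)
    entry["source"] = source

    # Infer the type from the extension unless it was given explicitly.
    if "type" not in entry:
        is_archive = source.lower().endswith(ARCHIVE_SUFFIXES)
        entry["type"] = "archive" if is_archive else "file"
    return entry
```

Under these assumptions, a bare `/path/to/script.py` expands to a `file` entry with a `file://` source, while an entry whose source ends in `.zip` is inferred to be an `archive` unless `type` says otherwise.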
The File Distribution Process¶
When an application launches, the following process occurs:
An application staging directory is created in the user’s home directory on the default filesystem (typically HDFS). The path for this is the format
The missing fields for each file in the application specification are inferred. All files that are not already on the default filesystem are copied over. Note that if a file is used by more than one service, it is only copied over once for the application.
The application is started.
When a container is started, YARN resource localization is applied for every file specified by that service. During this process files are copied to the container and any archives are extracted to the destination directories. Based on the file visibility setting, this may be done once per node per application, or once per node (with an LRU cache for clearing old files). For more information on this process see this blogpost from Hortonworks.
When the application completes, the staging directory is deleted.
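The copy step of this process can be sketched as follows. This is an illustrative model only, not how Skein performs the upload: `plan_uploads` is a hypothetical helper, the staging directory is passed in as an opaque string, the `hdfs` default scheme is an assumption, and the numbered naming of staged files is invented for the example.

```python
from urllib.parse import urlparse


def plan_uploads(services, staging_dir, default_scheme="hdfs"):
    """Map each source that needs uploading to a path in the staging directory.

    Hypothetical helper, not Skein's implementation. `services` maps service
    names to {destination: source} dicts whose sources are fully qualified.
    """
    uploads = {}
    for files in services.values():
        for source in files.values():
            if urlparse(source).scheme == default_scheme:
                # Already on the default filesystem: no copy needed.
                continue
            if source not in uploads:
                # A source shared by several services is copied only once.
                name = source.rsplit("/", 1)[-1]
                uploads[source] = f"{staging_dir}/{len(uploads)}-{name}"
    return uploads
```

If two services both reference `file:///path/to/script.py`, the returned plan contains that source once; a source such as `hdfs:///remote/path/to/script2.py` does not appear in the plan at all.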
Distributing Python Environments¶
When deploying Python applications, one needs to figure out how to distribute any library dependencies. If Python and the required libraries are already installed on every node (option 1 above), you can use the local Python and avoid this problem completely. If they aren’t, then one needs to package the environment to distribute with the application. This is typically handled using conda-pack (for Conda environments) or venv-pack (for virtual environments).
Both are tools for taking an environment and creating an archive of it in a way that (most) absolute paths in any libraries or scripts are altered to be relocatable. This archive can then be distributed with your application, and will be automatically extracted during YARN resource localization.
Packaging a Conda Environment with Conda-Pack¶
```console
# Create a new conda environment
$ conda create -n example
...

# Activate the environment
$ conda activate example

# Install the needed packages
$ conda install conda-pack skein numpy scikit-learn numba -c conda-forge
...

# Package the environment into environment.tar.gz
$ conda pack -o environment.tar.gz
Collecting packages...
Packing environment at '/home/jcrist/miniconda/envs/example' to 'environment.tar.gz'
[########################################] | 100% Completed | 24.2s
```
Packaging a Virtual Environment with Venv-Pack¶
Here we create a virtual environment, and then use venv-pack to package it
into a tar.gz file named environment.tar.gz. This is what will be
distributed with our application. The virtual environment can be created using
either venv or virtualenv.

Note that the python linked to in the virtual environment must exist and be
accessible on every node in the YARN cluster. If the environment was created
with a different Python, you can change the link path using the
--python-prefix flag. For more information see the venv-pack documentation.
```console
# Create a virtual environment
$ python -m venv example        # Using venv
$ python -m virtualenv example  # Or using virtualenv
...

# Activate the environment
$ source example/bin/activate

# Install the needed packages
$ pip install venv-pack skein numpy scikit-learn numba
...

# Package the environment into environment.tar.gz
$ venv-pack -o environment.tar.gz
Collecting packages...
Packing environment at '/home/jcrist/environments/example' to 'environment.tar.gz'
[########################################] | 100% Completed | 12.4s
```
Using the Packaged Environment¶
To use the packaged environment in a service specification, you need to include
the archive in files, and activate the environment in the script. This
looks the same for environments packaged using either tool.
```yaml
services:
  my_service:
    files:
      # The environment archive will be uploaded to HDFS, and extracted
      # into a directory named ``environment`` in each container.
      environment: environment.tar.gz
    script: |
      # Activate the environment
      source environment/bin/activate

      # Run commands inside the environment. All executables or imported
      # python libraries will be from within the packaged environment.
      my-cool-application
```
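Because the script above sources `environment/bin/activate`, the archive needs a `bin/activate` entry at its top level. A quick local check before submitting might look like the sketch below. `has_activate_script` is a hypothetical helper, not part of Skein, conda-pack, or venv-pack, and it only handles tar-based archives.

```python
import posixpath
import tarfile


def has_activate_script(archive_path):
    """Return True if the tar archive has a top-level bin/activate entry.

    Hypothetical pre-flight check for illustration only.
    """
    # tarfile.open detects the compression (e.g. gzip) automatically.
    with tarfile.open(archive_path) as archive:
        # Normalize names so that "./bin/activate" matches "bin/activate".
        names = {posixpath.normpath(name) for name in archive.getnames()}
    return "bin/activate" in names
```

Calling `has_activate_script("environment.tar.gz")` on an archive produced by either packaging tool should return `True`; a `False` result suggests the archive was built with an extra leading directory or by some other means.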