hdfscm

A custom ContentsManager for Jupyter Notebooks that stores contents on HDFS.

Installation

hdfscm should be installed in the same Python environment as the notebook server.

Note that when used with JupyterHub, this means the environment of each user’s notebook server, which is not necessarily the environment the JupyterHub server itself runs in.

Install with Conda:

conda install -c conda-forge jupyter-hdfscm

Install with Pip:

pip install jupyter-hdfscm

Install from source:

pip install git+https://github.com/jcrist/hdfscm.git

Configuration

To enable, add the following line to your jupyter_notebook_config.py:

c.NotebookApp.contents_manager_class = 'hdfscm.HDFSContentsManager'

By default notebooks are stored on HDFS at '/user/{username}/notebooks'. To change this, configure either HDFSContentsManager.root_dir_template (a template string) or HDFSContentsManager.root_dir directly:

# Example: Store notebooks in /jupyter/notebooks/{username} instead
c.HDFSContentsManager.root_dir_template = '/jupyter/notebooks/{username}'
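To make the relationship between the template and the resulting directory concrete, here is a minimal sketch that expands a template string with Python's `str.format`. It assumes the `{username}` placeholder is filled with the name of the user running the notebook server. `resolve_root_dir` is a hypothetical helper written for illustration and is not part of hdfscm.

```python
import getpass


def resolve_root_dir(template, username=None):
    """Expand a root_dir_template-style string (hypothetical helper).

    Assumption: the only placeholder is {username}, and when no name is
    given it defaults to the user running the current process.
    """
    if username is None:
        username = getpass.getuser()
    return template.format(username=username)


# Expand the template from the example above for a user named "alice".
print(resolve_root_dir('/jupyter/notebooks/{username}', username='alice'))
# prints: /jupyter/notebooks/alice
```

If every user should share one fixed location, setting `c.HDFSContentsManager.root_dir` to a literal path skips the template step entirely.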

For most systems these parameters should be enough; other fields will be inferred from the environment. Note that if your Hadoop cluster has Kerberos enabled, you’ll need to have acquired credentials before starting the notebook server (either through kinit, or distributed as a delegation token).
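On a Kerberos-enabled cluster, it can help to confirm that a ticket is present before launching the notebook server. The sketch below is one possible check, assuming a `klist` command that supports the silent `-s` flag (as in MIT Kerberos) is on the PATH. `has_kerberos_ticket` is a hypothetical helper, and the check does not cover credentials supplied as a delegation token.

```python
import subprocess


def has_kerberos_ticket():
    """Return True if `klist -s` exits with status 0 (hypothetical helper).

    Assumption: `klist -s` runs silently and exits non-zero when the
    credential cache is missing or expired, as MIT Kerberos documents.
    """
    try:
        result = subprocess.run(
            ["klist", "-s"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # klist is not installed or not on PATH, so no ticket can be verified.
        return False
    return result.returncode == 0


if has_kerberos_ticket():
    print("Kerberos ticket found.")
else:
    print("No valid Kerberos ticket found; run kinit before starting the server.")
```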

If you encounter classpath issues when initializing the filesystem, refer to the pyarrow HDFS documentation. In most environments, setting the ARROW_LIBHDFS_DIR environment variable resolves these issues.
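ARROW_LIBHDFS_DIR is normally exported in the shell that starts the notebook server. As an alternative, the sketch below sets it from Python, for example near the top of jupyter_notebook_config.py, on the assumption that the config file runs before the filesystem is initialized. The path `/opt/hadoop/lib/native` is a placeholder for the directory that contains libhdfs.so on your system, and `ensure_libhdfs_dir` is a hypothetical helper.

```python
import os

# Placeholder path: replace with the directory that holds libhdfs.so.
ASSUMED_LIBHDFS_DIR = "/opt/hadoop/lib/native"


def ensure_libhdfs_dir(path=ASSUMED_LIBHDFS_DIR, environ=os.environ):
    """Set ARROW_LIBHDFS_DIR unless it is already defined.

    Returns the value in effect afterwards, so a value exported in the
    shell always takes precedence over the placeholder.
    """
    return environ.setdefault("ARROW_LIBHDFS_DIR", path)


print("ARROW_LIBHDFS_DIR =", ensure_libhdfs_dir())
```

Using `setdefault` keeps any value already exported by an administrator, so the placeholder only applies when nothing else is configured.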

For more information on all configuration options, see Configuration Options.

Additional Resources

If you’re interested in hdfscm, you may also be interested in a few other libraries: