hdfscm
A custom ContentsManager for Jupyter Notebooks that stores contents on HDFS.
Installation
hdfscm should be installed in the same Python environment as the notebook server.
Note that when used with JupyterHub, this means the user’s environment, which is not necessarily the environment that the JupyterHub server itself runs in.
Install with Conda:
conda install -c conda-forge jupyter-hdfscm
Install with Pip:
pip install jupyter-hdfscm
Install from source:
pip install git+https://github.com/jcrist/hdfscm.git
Configuration
To enable, add the following line to your jupyter_notebook_config.py:
c.NotebookApp.contents_manager_class = 'hdfscm.HDFSContentsManager'
By default notebooks are stored on HDFS at '/user/{username}/notebooks'. To
change this, configure either HDFSContentsManager.root_dir_template (a
template string) or HDFSContentsManager.root_dir directly:
# Example: Store notebooks in /jupyter/notebooks/{username} instead
c.HDFSContentsManager.root_dir_template = '/jupyter/notebooks/{username}'
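To preview the path a given template would produce, the sketch below expands a template with Python's str.format. It assumes the only field in the template is {username}. The helper resolve_root_dir is hypothetical and not part of hdfscm, and getpass.getuser() is only a stand-in for however hdfscm determines the username.

```python
import getpass


def resolve_root_dir(template, username=None):
    """Expand a root_dir_template-style string.

    Hypothetical helper for illustration only; not part of hdfscm.
    """
    # Assumption: the template contains a single {username} field.
    if username is None:
        # Stand-in for the username lookup; hdfscm may resolve it differently.
        username = getpass.getuser()
    return template.format(username=username)
```

For example, resolve_root_dir('/jupyter/notebooks/{username}', username='alice') returns '/jupyter/notebooks/alice'.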
For most systems these parameters should be enough; other fields will be
inferred from the environment. Note that if your Hadoop cluster has Kerberos
enabled, you’ll need to have acquired credentials before starting the notebook
server (either through kinit, or distributed as a delegation token).
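As a rough pre-flight check for the kinit case, the sketch below asks klist whether a valid ticket cache exists before the notebook server is started. It assumes MIT Kerberos's klist is on the PATH, where the -s flag suppresses output and reports the result through the exit status. The function name has_kerberos_ticket is hypothetical, and the check does not detect credentials distributed as delegation tokens.

```python
import subprocess


def has_kerberos_ticket(command=("klist", "-s")):
    """Return True if the given command exits with status 0.

    Assumption: `klist -s` (MIT Kerberos) exits non-zero when no valid
    credentials cache is found. Hypothetical helper, not part of hdfscm.
    """
    try:
        result = subprocess.run(list(command), check=False)
    except FileNotFoundError:
        # The command is not installed or not on PATH.
        return False
    return result.returncode == 0
```

A wrapper script could call has_kerberos_ticket() and print a reminder to run kinit when it returns False.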
If you encounter classpath issues initializing the filesystem, refer to the
pyarrow hdfs documentation. In most environments setting ARROW_LIBHDFS_DIR
resolves these issues.
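One possible way to set the variable is from Python, before the filesystem is first initialized (for example, near the top of jupyter_notebook_config.py). This is a sketch under assumptions: the directory below is a hypothetical placeholder, and exporting the variable in the shell that launches the notebook server works equally well.

```python
import os

# Hypothetical location; replace with the directory on your system
# that actually contains the libhdfs shared library.
LIBHDFS_DIR = "/opt/hadoop/lib/native"

# setdefault leaves any value already exported in the shell untouched.
os.environ.setdefault("ARROW_LIBHDFS_DIR", LIBHDFS_DIR)
```

Using setdefault rather than plain assignment means a value set by the cluster administrator in the environment takes precedence over the fallback in the config file.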
For more information on all configuration options, see Configuration Options.
Additional Resources
If you’re interested in hdfscm, you may also be interested in a few
other libraries:
yarnspawner: A JupyterHub Spawner for launching notebook servers on YARN. This can be used in tandem with hdfscm, providing a way to persist notebooks between sessions.
pgcontents: A Jupyter ContentsManager for storing contents in a Postgres database.
s3contents: A Jupyter ContentsManager for storing contents in an object store like S3 or GCS.
pyarrow: Among other things, this Python library provides the HDFS client used by hdfscm.