Usage¶
Knit
can be used in several novel ways. Our primary concern is supporting easy deployment of
distributed Python runtimes; though, we can also consider other languages (R, Julia, etc) should
interest develop. Below are a few novels ways we can currently use Knit
Python¶
The example below use any Python found in the $PATH
. This is usually the system Python (i.e.,
on a cluster where it has already been installed for you).
>>> import knit
>>> k = knit.Knit()
>>> cmd = "python -c 'import sys; print(sys.path); import socket; print(socket.gethostname())'"
>>> appId = k.start(cmd)
Zipped Conda Envs¶
Often nodes managed under YARN may not have desired Python libraries or the Python binary at all! In these cases,
we want to package up an environment to be shipped along with the command. knit
allows us to declare a
zipped directory with the following structure typical of Python environments:
$ ll dev/
drwxr-xr-x+ 23 ubuntu ubuntu 782B Jan 30 17:55 bin
drwxr-xr-x+ 20 ubuntu ubuntu 680B Jan 30 17:55 include
drwxr-xr-x+ 39 ubuntu staff 1.3K Jan 30 17:55 lib
drwxr-xr-x+ 4 ubuntu staff 136B Jan 30 17:55 share
drwxr-xr-x+ 6 ubuntu ubuntu 204B Jan 30 17:55 ssl
>>> appId = k.start(cmd, env='<full-path>/dev.zip')
When we ship <full-path>/dev.zip
, knit uploads dev.zip
to a temporary directory within the
user’s home HDFS space e.g. /users/ubuntu/.knitDeps
and the following bash ENVIRONMENT variables
will be available:
$CONDA_PREFIX
: full path to prefix location of zipped directory$PYTHON_BIN
: full path to Python binary
With the ENVIRONMENT variables available users can build more nuanced commands like the following:
>>> cmd = '$PYTHON_BIN $CONDA_PREFIX/bin/dask-worker 8786'
knit
also provides a convenience method with conda
to help build zipped environments. The following
builds an environment env.zip
with Python 3.5 and a variety of popular data Python libraries:
>>> env_zip = k.create_env(env_name='dev', packages=['python=3', 'distributed',
... 'dask', 'pandas', 'scikit-learn'])
Adding Files¶
Knit
can also pass local files to each container.
>>> files = ['creds.txt', 'data.csv']
>>> k.start(cmd, files=files)
With the above, we are send files creds.txt
and data.csv
to each container and can reference
them as local file paths in the cmd
command.
Dask Clusters¶
The previous methods can be combined to launch a full distributed dask cluster on YARN with code like the following
from dask_yarn import DaskYARNCluster
cluster = DaskYARNCluster(env='my/conda/env.zip')
cluster.start(8, cpu=2, memory=2048)
The object cluster
starts a dask scheduler, and can also be used to start or stop more
containers than the original 8 referenced above. The same set of config options apply as for a
Knit
object, in addition to conda creation options, which will define the environment in
which the workers run.
To start a dask client in the same session, you can simply do
from dask.distributed import Client
c = Client(cluster)
and use as usual, or look at cluster.scheduler_address
for clients connecting from other sessions.
Note that DaskYARNCluster can also be used as a context manager, which will ensure that it gets
closed (and the corresponding YARN application killed) when the with
context finishes.
Instance Connections¶
The main instances that you will handle in this library have attributes which
are instances of other classes, and expose functionality. Generally, parameters are
passed down, so that the constructor parameters for DaskYarnCluster will also be used
for Knit (e.g., replication_factor
), CondaCreator (e.g., channels
) and YARNAPI
(e.g., rm
).
DaskYarnCluster:
.knit
is an instance of Knit, and exposes methods to check the yarn application
state, logs and to increase/decrease the container count
- an instance of CondaCreator
is created on-the-fly if making/zipping a conda environment
- .local_cluster
is an instance of dask.distributed.LocalCluster
, with no local
workers. The only parameter passed in is ip
.
Knit:
.yarn_api
is an instance of YARNAPI, which provides commands to be directly
executed by the ResourceManager, including several informational calls, mostly via REST
- an instance of CondaCreator
is created on-the-fly if making/zipping a conda environment