Quickstart

Preprequisites

The quickstart assumes user have access to Kubeflow Pipelines deployment. Pipelines can be dedployed on any Kubernetes cluster, including local cluster.

Install the toy project with Kubeflow Pipelines support

It is a good practice to start by creating a new virtualenv before installing new packages. Therefore, use virtalenv command to create new env and activate it:

$ virtualenv venv-demo
created virtual environment CPython3.8.5.final.0-64 in 145ms
  creator CPython3Posix(dest=/home/mario/kedro/venv-demo, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/mario/.local/share/virtualenv)
    added seed packages: pip==20.3.1, setuptools==51.0.0, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
$ source venv-demo/bin/activate

Then, kedro must be present to enable cloning the starter project, along with the latest version of kedro-kubeflow plugina and kedro-docker (required to build docker images with the Kedro pipeline nodes):

$ pip install 'kedro<0.17' kedro-kubeflow kedro-docker

With the dependencies in place, let’s create a new project:

$ kedro new --starter=git+https://github.com/getindata/kedro-starter-spaceflights.git --checkout allow_nodes_with_commas
Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
 [New Kedro Project]: Kubeflow Plugin Demo

Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
 [kubeflow-plugin-demo]: 

Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter or underscore.
 [kubeflow_plugin_demo]: 

Change directory to the project generated in /home/mario/kedro/kubeflow-plugin-demo

A best-practice setup includes initialising git and creating a virtual environment before running `kedro install` to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

TODO: switch to the official spaceflights starter after https://github.com/quantumblacklabs/kedro-starter-spaceflights/pull/10 is merged

Finally, go the demo project directory and ensure that kedro-kubeflow plugin is activated:

$ cd kubeflow-plugin-demo/
$ kedro install
(...)
Requirements installed!
$ kedro kubeflow --help
Usage: kedro kubeflow [OPTIONS] COMMAND [ARGS]...

  Interact with Kubeflow Pipelines

Options:
  -e, --env TEXT  Environment to use.
  -h, --help      Show this message and exit.

Commands:
  compile          Translates Kedro pipeline into YAML file with Kubeflow...
  init             Initializes configuration for the plugin
  list-pipelines   List deployed pipeline definitions
  mlflow-start
  run-once         Deploy pipeline as a single run within given experiment.
  schedule         Schedules recurring execution of latest version of the...
  ui               Open Kubeflow Pipelines UI in new browser tab
  upload-pipeline  Uploads pipeline to Kubeflow server

Build the docker image to be used on Kubeflow Pipelines runs

First, initialize the project with kedro-docker configuration by running:

$ kedro docker init

This command creates a several files, including .dockerignore. This file ensures that transient files are not included in the docker image and it requires small adjustment. Open it in your favourite text editor and extend the section # except the following by adding there:

!data/01_raw

This change enforces raw data existence in the image. Also, one of the limitations of running the Kedro pipeline on Kubeflow (and not on local environemt) is inability to use MemoryDataSets, as the pipeline nodes do not share memory, so every artifact should be stored as file. The spaceflights demo configures four datasets as in-memory, so let’s change the behaviour by adding these lines to conf/base/catalog.yml:

X_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/X_train.pickle
  layer: model_input

y_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_train.pickle
  layer: model_input

X_test:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/X_test.pickle
  layer: model_input

y_test:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_test.pickle
  layer: model_input

Finally, build the image:

kedro docker build

When execution finishes, your docker image is ready. If you don’t use local cluster, you should push the image to the remote repository:

docker tag kubeflow_plugin_demo:latest remote.repo.url.com/kubeflow_plugin_demo:latest
docker push remote.repo.url.com/kubeflow_plugin_demo:latest

Run the pipeline on Kubeflow

First, run init script to create the sample configuration. A parameter value should reflect the kubeflow base path as seen from the system (so no internal Kubernetes IP unless you run the local cluster):

kedro kubeflow init https://kubeflow.cluster.com
(...)
Configuration generated in /home/mario/kedro/kubeflow-plugin-demo/conf/base/kubeflow.yaml

Then, if needed, adjust the conf/base/kubeflow.yaml. For example, the image: key should point to the full image name (like remote.repo.url.com/kubeflow_plugin_demo:latest if you pushed the image at this name). Depending on the storage classes availability in Kubernetes cluster, you may want to modify volume.storageclass and volume.access_modes (please consult with Kubernetes admin what values should be there).

Finally, everything is set to run the pipeline on Kubeflow. Run upload-pipeline:

$ kedro kubeflow upload-pipeline
2021-01-12 09:47:35,132 - kedro_kubeflow.kfpclient - INFO - No IAP_CLIENT_ID provided, skipping custom IAP authentication
2021-01-12 09:47:35,209 - kedro_kubeflow.kfpclient - INFO - Pipeline created
2021-01-12 09:47:35,209 - kedro_kubeflow.kfpclient - INFO - Pipeline link: https://kubeflow.cluster.com/#/pipelines/details/9a3e4e16-1897-48b5-9752-d350b1d1faac/version/9a3e4e16-1897-48b5-9752-d350b1d1faac

As you can see, the pipeline was compiled and uploaded into Kubeflow. Let’s visit the link:

../../_images/uploaded_pipeline.png Uploaded pipeline

The Kubeflow pipeline reflects the Kedro pipeline with two extra steps:

data-volume-create - creates an empty volume in Kubernetes cluster as a persistence layer for inter-steps data access
data-volume-init - initialized the volume with 01_raw data when the pipeline starts

By using Create run button you can start a run of the pipeline on the cluster. A run behaves like kedro run command, but the steps are executed on the remote cluster. The outputs are stored on the persistent volume, and passed as the inputs accordingly to how Kedro nodes need them.

../../_images/pipeline_run.gif Pipeline run

From the UI you can access the logs of the execution. If everything seems fine, use `schedule to create a recurring run:

$ kedro kubeflow schedule --cron-expression '0 0 4 * * *'
(...)
2021-01-12 12:37:23,086 - kedro_kubeflow.kfpclient - INFO - No IAP_CLIENT_ID provided, skipping custom IAP authentication
2021-01-12 12:37:23,096 - root - INFO - Creating experiment Kubeflow Plugin Demo.
2021-01-12 12:37:23,103 - kedro_kubeflow.kfpclient - INFO - New experiment created: 2123c082-b336-4093-bf3f-ce73f68b66b4
2021-01-12 12:37:23,147 - kedro_kubeflow.kfpclient - INFO - Pipeline scheduled to 0 0 4 * * *

You can see that the new experiment was created (that will group the runs) and the pipeline was scheduled. Please note, that Kubeflow uses 6-places cron expression (as opposite to Linux’s cron with 5-places), where first place is the second indicator.

../../_images/scheduled_run.png Scheduled run