davidgortegacmltest v0.1.0
DVC Github action for continuous delivery for machine learning
Introduction
DVC is a data versioning system that also works well as a build tool for ML experimentation. This repo lets you define your ML pipeline with DVC and run it on Github Actions runners or Gitlab runners.
You can also deploy your own Github or Gitlab runners with special capabilities such as GPUs.
Major benefits of using DVC-CML in your ML projects include:
- Reproducibility: DVC keeps track of your experiment and all of its dependencies, so you don't have to. Additionally, your experiment always runs under the same constraints, so you don't have to worry about replicating the same environment again.
- Observability: DVC offers you metrics to be tracked. DVC-action makes those metrics more human-friendly and also gives direct access to other experiment runs through the DVC Report, which is published as a check in Github or as a Release in Gitlab.
- Releases: DVC-action tags every experiment that runs with repro, generating the report. Aside from that, DVC-CML is just a step in your Github Workflow or Gitlab Pipeline, which could go on to generate your model releases or deployments according to your business requirements.
- Teaming: Give your teammates visibility into your experiments and releases as you work together.
DVC-CML performs the following on your pushes or pull requests:
- DVC repro
- Push changes into DVC remote and Git remote
- In Github generates a Github check displaying the DVC Report
- In Gitlab generates a Tag/Release displaying the DVC Report
Usage
:eyes: Knowledge of Github Actions and DVC pipelines is very useful for a full understanding.
Example of a simple DVC-CML workflow:
:eyes: Note the use of the container
name: your-workflow-name

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml:latest

    steps:
      - uses: actions/checkout@v2

      - name: dvc_cml_run
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Install your project dependencies.
          # An example for Python3:
          apt-get install -y python3 python3-pip
          pip3 install --upgrade pip
          update-alternatives --install /usr/bin/python python $(which python3) 10
          update-alternatives --install /usr/bin/pip pip $(which pip3) 10
          test -f requirements.txt && pip3 install -r requirements.txt

          # needed to be able to do dvc metrics and dvc diff
          git fetch --prune --unshallow

          # -f is needed
          dvc pull -f
          dvc repro
          dvc push

          BASELINE=origin/master

          echo "# CML report" >> report.md
          dvc metrics diff --show-json "$BASELINE" | cml-metrics >> report.md
          dvc diff --show-json "$BASELINE" | cml-files >> report.md

          # publish image
          cml-publish my-file.png --md --title 'my-file' >> report.md
          # publish pdf
          cml-publish my-file.pdf --md --title 'my-file' >> report.md
          # pipe example
          vl2png vega.json | cml-publish --md --title 'my image' >> report.md

          cml-send-comment report.md
:eyes: Knowledge of Gitlab CI/CD Pipelines and DVC pipelines is very useful for a full understanding.
Example of a simple DVC-CML workflow in Gitlab:
:eyes: Some required environment variables like remote credentials and GITLAB_TOKEN are set as CI/CD environment variables in Gitlab's UI
:warning: tag_prefix should be set in order to have DVC Reports, for example dvc_. This will generate tags in your repo with the report as release notes.
# .gitlab-ci.yml
stages:
  - dvc_cml_run

dvc:
  stage: dvc_cml_run
  image: dvcorg/cml:latest
  script:
    # Install your project dependencies.
    # An example for Python3:
    - apt-get install -y python3 python3-pip
    - pip3 install --upgrade pip
    - update-alternatives --install /usr/bin/python python $(which python3) 10
    - update-alternatives --install /usr/bin/pip pip $(which pip3) 10
    - test -f requirements.txt && pip3 install -r requirements.txt

    # needed to be able to do dvc metrics and dvc diff
    - git fetch --prune --unshallow

    # -f is needed
    - dvc pull -f
    - dvc repro train.dvc
    - dvc push

    - BASELINE=origin/master

    - echo "# CML report" >> report.md
    - dvc metrics diff --show-json "$BASELINE" | cml-metrics >> report.md
    - dvc diff --show-json "$BASELINE" | cml-files >> report.md

    # publish image
    - cml-publish my-file.png --md --title 'my-file' >> report.md
    # publish pdf
    - cml-publish my-file.pdf --md --title 'my-file' >> report.md
    # pipe example
    - vl2png vega.json | cml-publish --md --title 'my image' >> report.md

    - cml-send-comment report.md
This workflow will run every time that you push code or do a Pull/Merge Request.
When triggered, DVC-CML will set up the runner and DVC will run the pipelines specified by repro_targets. Two scenarios may happen:
- DVC repro is up to date and there is nothing to do. This means that the commit you made is not related to your DVC pipelines.
- The DVC pipeline has changed and DVC will run repro, updating the outputs it generates (models, data...) in your DVC remote storage and then committing, tagging and pushing the changes to the Git remote.
Additionally, you may extend your CI/CD Pipeline/Workflow to generate your releases or even deploy your models automatically.
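As a rough illustration of the two scenarios above, a CI step could tell them apart by checking whether the Git working tree is dirty after `dvc repro`. This is only a sketch: the helper name `pipeline_changed` is hypothetical, and it assumes that a repro which actually re-ran a stage rewrites DVC files tracked by Git, which may not hold for every project layout.

```shell
# Hypothetical helper, not part of DVC-CML.
# Returns 0 when the Git working tree in the given directory (default: the
# current one) has uncommitted changes, and 1 when it is clean.
# Assumption: a dirty tree right after `dvc repro` means a stage was re-run.
pipeline_changed() {
  [ -n "$(git -C "${1:-.}" status --porcelain 2>/dev/null)" ]
}
```

In a job it might be called right after `dvc repro`, for example `if pipeline_changed; then echo "pipeline changed"; else echo "up to date"; fi`.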
Variables
:warning: In Github Actions they are set via env:, not inputs:
:eyes: In Gitlab pipelines they are set via variables:
| Variable | Type | Required | Default | Info |
|---|---|---|---|---|
| repo_token | string | yes | | In Github you can set the default autogenerated GITHUB_TOKEN. In Gitlab we scan for GITLAB_TOKEN. See Tensorflow Mnist in Gitlab example for a complete walkthrough. |
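Since repo_token is required, a job may want to fail early with a clear message when it is missing. The sketch below assumes the variable is exported to the job's environment, as CI variables usually are; `require_var` is a hypothetical helper name, not something DVC-CML provides.

```shell
# Hypothetical helper, not part of DVC-CML.
# Succeeds when the named environment variable is exported and non-empty;
# otherwise prints a message to stderr and returns 1.
require_var() {
  if [ -z "$(printenv "$1")" ]; then
    echo "Missing required variable: $1" >&2
    return 1
  fi
}
```

A job could run `require_var repo_token || exit 1` as its first command, so a missing token is reported before any pipeline work starts.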
How to use GPUs
Our DVC-CML GPU docker image is based on Ubuntu 18.04 and already supports:
- cuda 10.1
- libcudnn 7
- cublas 10
- libinfer 10
Setup
- You need to properly set up your nvidia drivers and nvidia-docker on your host machine.
sudo ubuntu-drivers autoinstall
sudo apt-get install nvidia-docker2
sudo systemctl restart docker
- Launch your own runner following your CI vendor instructions.
Github: Repo settings -> Actions -> Add Runner button
Gitlab: Repo settings -> CI/CD -> Runners -> Specific Runners
# Gitlab self-hosted runner with cml and GPU
gitlab-runner register \
--non-interactive \
--run-untagged="true" \
--locked="false" \
--access-level="not_protected" \
--executor "docker" \
--docker-runtime "nvidia" \
--docker-image "dvcorg/cml-gpu:latest" \
--url "https://gitlab.com/" \
--tag-list "cml" \
--registration-token "here_goes_your_gitlab_runner_token"
gitlab-runner start
- Modify your CI pipeline / Workflow to set up your GPU in your DVC job.
# Github
dvc:
  runs-on: [self-hosted]
  container:
    image: docker://dvcorg/cml-gpu:latest
    options: --runtime "nvidia" -e NVIDIA_VISIBLE_DEVICES=all
# Gitlab
dvc:
  tags:
    - cml
  stage: dvc_cml_run
  image: dvcorg/cml-gpu:latest
  variables:
    NVIDIA_VISIBLE_DEVICES: all
  # ...
Pitfalls
- "My runner says: Got permission denied while trying to connect to the Docker daemon socket". You need to add your user to the docker group. Check your OS configuration for further details. Recipe for Ubuntu:
sudo groupadd docker
sudo usermod -aG docker ${USER}
su - ${USER}
- "With Github runners I can't specify custom tags to reach different runners". We know; it's a Github limitation.
- "I have followed all the steps and I could not make it work". Try running nvidia-smi in the run section of your workflow to see if the GPU is available to your docker container.
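The nvidia-smi check from the last pitfall could be wrapped so that a missing binary gives a readable hint instead of a bare "command not found". This is a minimal sketch; `check_gpu` is a hypothetical helper name, and the hint it prints is only a likely cause, not a confirmed diagnosis.

```shell
# Hypothetical helper, not part of DVC-CML.
# Runs the given GPU status tool (default: nvidia-smi) when it is on PATH;
# otherwise prints a hint to stderr and returns 1.
check_gpu() {
  gpu_tool="${1:-nvidia-smi}"
  if command -v "$gpu_tool" >/dev/null 2>&1; then
    "$gpu_tool"
  else
    echo "$gpu_tool not found; the nvidia runtime may not be enabled for this container" >&2
    return 1
  fi
}
```

Calling `check_gpu` as the first command of the job prints the usual nvidia-smi table when the GPU is visible and fails the step otherwise.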
Working with DVC remotes
DVC supports different kinds of remote storage. To set them up properly, you have to store the credentials (if needed) as Github secrets or Gitlab masked environment variables to keep them secure. Additionally, in Github you need to add them as env variables in the workflow file.
# Github
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
:point_right: AWS_SESSION_TOKEN is optional.
env:
  AZURE_STORAGE_CONNECTION_STRING: ${{ secrets.AZURE_STORAGE_CONNECTION_STRING }}
  AZURE_STORAGE_CONTAINER_NAME: ${{ secrets.AZURE_STORAGE_CONTAINER_NAME }}

env:
  OSS_BUCKET: ${{ secrets.OSS_BUCKET }}
  OSS_ACCESS_KEY_ID: ${{ secrets.OSS_ACCESS_KEY_ID }}
  OSS_ACCESS_KEY_SECRET: ${{ secrets.OSS_ACCESS_KEY_SECRET }}
  OSS_ENDPOINT: ${{ secrets.OSS_ENDPOINT }}
:warning: Normally, GOOGLE_APPLICATION_CREDENTIALS points to the path of the json file that contains the credentials. However in the action this variable CONTAINS the content of the file. Copy that json and add it as a secret.
env:
  GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
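If some other tool in the same job expects a path to the credentials file rather than its content, the JSON could be written to a private temporary file first. This is a sketch under that assumption only; `write_gcp_credentials` is a hypothetical helper and is not needed for DVC-CML itself, which reads the content directly as described above.

```shell
# Hypothetical helper, not part of DVC-CML.
# Writes the JSON held in GOOGLE_APPLICATION_CREDENTIALS to a temporary file
# readable only by the current user, then prints the path of that file.
write_gcp_credentials() {
  [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] || return 1
  cred_file="$(mktemp)" || return 1
  chmod 600 "$cred_file"
  printf '%s' "$GOOGLE_APPLICATION_CREDENTIALS" > "$cred_file"
  echo "$cred_file"
}
```

The printed path can be stored under a different, illustrative name, for example `GCP_KEY_FILE="$(write_gcp_credentials)"`, so that the original variable keeps holding the content.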
:warning: After configuring your Google Drive credentials you will find a json file at your_project_path/.dvc/tmp/gdrive-user-credentials.json. Copy that json and add it as a secret.
env:
  GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
:warning: Not supported yet
:warning: Not supported yet
:warning: Not supported yet
Examples