0.1.0 • Published 4 years ago

davidgortegacmltest v0.1.0

License
Apache-2.0

DVC Github action for continuous delivery for machine learning

  1. Introduction
  2. Usage
  3. How to use GPUs
  4. Working with DVC remotes
  5. Examples

Introduction

DVC is a data versioning system, and it also works well as a build tool for ML experimentation. This repo lets you use DVC to define an ML pipeline that runs on Github Actions runners or Gitlab runners.

You can also deploy your own Github or Gitlab runners with special capabilities such as GPUs.

Major benefits of using DVC-CML in your ML projects include:

  • Reproducibility: DVC maintains your experiment and tracks all of its dependencies, so you don't have to. Additionally, your experiment always runs under the same constraints, so you don't have to worry about replicating the environment.
  • Observability: DVC lets you track metrics. DVC-CML makes those metrics more human-friendly and also gives direct access to other experiment runs through the DVC Report, offered as checks in Github or Releases in Gitlab.
  • Releases: DVC-CML tags every experiment that runs with repro and generates the report. Beyond that, DVC-CML is just a step in your Github Workflow or Gitlab Pipeline, which could generate your model releases or deployments according to your business requirements.
  • Teaming: Give your teammates visibility into your experiments and releases so you can work together.

On every push or pull request, DVC-CML performs:

  1. DVC repro
  2. Push changes into DVC remote and Git remote
    • In Github generates a Github check displaying the DVC Report
    • In Gitlab generates a Tag/Release displaying the DVC Report


Usage

:eyes: Knowledge of Github Actions and DVC pipelines is very useful for full comprehension.

Example of a simple DVC-CML workflow:

:eyes: Note the use of the container

name: your-workflow-name

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml:latest

    steps:
      - uses: actions/checkout@v2

      - name: dvc_cml_run
        # env and run belong to the dvc_cml_run step, so they are indented under it
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Install your project dependencies.
          # An example for Python3:
          apt-get install -y python3 python3-pip
          pip3 install --upgrade pip
          update-alternatives --install /usr/bin/python python $(which python3) 10
          update-alternatives --install /usr/bin/pip pip $(which pip3) 10
          test -f requirements.txt && pip3 install -r requirements.txt

          # needed to be able to do dvc metrics and dvc diff
          git fetch --prune --unshallow

          # -f is needed
          dvc pull -f
          dvc repro
          dvc push

          BASELINE=origin/master
          echo "# CML report" >> report.md
          dvc metrics diff --show-json "$BASELINE" | cml-metrics >> report.md
          dvc diff --show-json "$BASELINE" | cml-files >> report.md

          # publish image
          cml-publish my-file.png --md --title 'my-file' >> report.md

          # publish pdf
          cml-publish my-file.pdf --md --title 'my-file' >> report.md

          # pipe example
          vl2png vega.json | cml-publish --md --title 'my image' >> report.md

          cml-send-comment report.md
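
The `git fetch --prune --unshallow` step above assumes the checkout is shallow. If the CI system is configured to clone the full history, Git rejects `--unshallow` on an already complete repository. A minimal sketch of a guard, assuming Git 2.15 or newer (for `git rev-parse --is-shallow-repository`); the function name `fetch_full_history` is hypothetical and not part of DVC-CML:

```shell
# Hypothetical helper: fetch the full history only when the checkout is shallow.
# Assumes Git >= 2.15, which provides `git rev-parse --is-shallow-repository`.
fetch_full_history() {
  if [ "$(git rev-parse --is-shallow-repository 2>/dev/null)" = "true" ]; then
    # Shallow clone: convert it into a complete one.
    git fetch --prune --unshallow
  else
    # Already complete (or not detectable): a plain fetch is enough.
    git fetch --prune
  fi
}
```

Calling `fetch_full_history` in place of the plain `git fetch --prune --unshallow` line keeps the rest of the script unchanged.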

:eyes: Knowledge of Gitlab CI/CD Pipelines and DVC pipelines is very useful for full comprehension.

Example of a simple DVC-CML workflow in Gitlab:

:eyes: Some required environment variables like remote credentials and GITLAB_TOKEN are set as CI/CD environment variables in Gitlab's UI

:warning: tag_prefix should be set in order to have DVC Reports, e.g. dvc_ . This will generate tags in your repo with the report as release notes.

# .gitlab-ci.yml
stages:
  - dvc_cml_run

dvc:
  stage: dvc_cml_run
  image: dvcorg/cml:latest

  script:
    # Install your project dependencies.
    # An example for Python3:
    - apt-get install -y python3 python3-pip
    - pip3 install --upgrade pip
    - update-alternatives --install /usr/bin/python python $(which python3) 10
    - update-alternatives --install /usr/bin/pip pip $(which pip3) 10
    - test -f requirements.txt && pip3 install -r requirements.txt

    # needed to be able to do dvc metrics and dvc diff
    - git fetch --prune --unshallow

    # -f is needed
    - dvc pull -f
    - dvc repro train.dvc
    - dvc push

    - BASELINE=origin/master
    - echo "# CML report" >> report.md
    - dvc metrics diff --show-json "$BASELINE" | cml-metrics >> report.md
    - dvc diff --show-json "$BASELINE" | cml-files >> report.md

    # publish image
    - cml-publish my-file.png --md --title 'my-file' >> report.md

    # publish pdf
    - cml-publish my-file.pdf --md --title 'my-file' >> report.md

    # pipe example
    - vl2png vega.json | cml-publish --md --title 'my image' >> report.md

    - cml-send-comment report.md

This workflow runs every time you push code or open a Pull/Merge Request. When triggered, DVC-CML sets up the runner and DVC runs the pipelines specified by repro_targets. Two scenarios may happen:

  1. DVC repro is up to date and there is nothing to do. This means that your commit is not related to your DVC pipelines.
  2. The DVC pipeline has changed, so DVC runs repro, updates any outputs it generates (models, data...) in your DVC remote storage, and then commits, tags and pushes the changes to the Git remote.
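
One way to picture the two scenarios is the sketch below. It assumes that an empty `git status --porcelain` after `dvc repro` means nothing changed; that is an illustrative assumption, not necessarily how DVC-CML detects changes, and `next_action` is a hypothetical name.

```shell
# Hypothetical helper: decide what to do after `dvc repro`.
# $1 is expected to hold the output of `git status --porcelain`.
next_action() {
  if [ -z "$1" ]; then
    # Scenario 1: repro was up to date, the working tree is clean.
    echo "nothing-to-do"
  else
    # Scenario 2: outputs changed, so they must be committed, tagged and pushed.
    echo "commit-tag-push"
  fi
}
```

In a job this could be called as `next_action "$(git status --porcelain)"` right after `dvc repro`.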

Additionally, you may extend your CI/CD Pipeline/Workflow to generate your releases or even deploy your models automatically.

Variables

:warning: In Github Actions they are set via env: not inputs:

:eyes: In Gitlab pipeline they are set via variables:

| Variable | Type | Required | Default | Info |
| --- | --- | --- | --- | --- |
| repo_token | string | yes | | In Github you can set the default autogenerated GITHUB_TOKEN. In Gitlab we scan for GITLAB_TOKEN. See Tensorflow Mnist in Gitlab example for a complete walkthrough. |
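
Since a missing variable usually only surfaces late in the job, it can help to fail early. The following is a minimal sketch, assuming a POSIX shell; `missing_vars` is a hypothetical helper, not something DVC-CML provides, and it should only be given trusted variable names because it uses `eval`.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
# Prints "ok" and returns 0 when all are set, otherwise lists them and returns 1.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

For example, `missing_vars repo_token AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1` at the top of the script stops the job before any DVC command runs.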

How to use GPUs

Our DVC-CML GPU docker image is based on Ubuntu 18.04 and already supports:

  • cuda 10.1
  • libcudnn 7
  • cublas 10
  • libinfer 10

Setup

  1. Set up your nvidia drivers and nvidia-docker properly on your host machine.
     sudo ubuntu-drivers autoinstall
     sudo apt-get install nvidia-docker2
     sudo systemctl restart docker
  2. Launch your own runner following your CI vendor's instructions.

Github: Repo settings -> Actions -> Add Runner button

Gitlab: Repo settings -> CI/CD -> Runners -> Specific Runners

# Gitlab self-hosted runner with cml and GPU
gitlab-runner register \
    --non-interactive \
    --run-untagged="true" \
    --locked="false" \
    --access-level="not_protected" \
    --executor "docker" \
    --docker-runtime "nvidia" \
    --docker-image "dvcorg/cml-gpu:latest" \
    --url "https://gitlab.com/" \
    --tag-list "cml" \
    --registration-token "here_goes_your_gitlab_runner_token"

gitlab-runner start
  3. Modify your CI pipeline / workflow to set up the GPU in your DVC job.
# Github
dvc:
  runs-on: [self-hosted]
  container:
    image: docker://dvcorg/cml-gpu:latest
    options: --runtime "nvidia" -e NVIDIA_VISIBLE_DEVICES=all
# Gitlab
dvc:
 tags:
   - cml
 stage: dvc_cml_run
 image: dvcorg/cml-gpu:latest

 variables:
   NVIDIA_VISIBLE_DEVICES: all
   ...

Pitfalls

  • "My runner says: Got permission denied while trying to connect to the Docker daemon socket". You need to add your user to the docker group. Check your OS configuration for further details. Recipe for Ubuntu:
sudo groupadd docker
sudo usermod -aG docker ${USER}
# start a new login shell so the new group membership takes effect
su - ${USER}
  • "With Github runners I can't specify custom tags to reach different runners". We know; it's a Github limitation.

  • "I have followed all the steps and I could not make it work". Try running nvidia-smi in the run section of your workflow to see whether the GPU is available to your docker container.
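
The last check can be wrapped in a small diagnostic step. This is only a sketch: `check_gpu` is a hypothetical name, and the hint about the nvidia runtime is an assumption about the most likely cause, not a definitive diagnosis.

```shell
# Hypothetical diagnostic: report whether a GPU is visible inside the container.
check_gpu() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the nvidia runtime is probably not enabled"
    return 1
  fi
  # `nvidia-smi -L` lists the GPUs the driver can see.
  if nvidia-smi -L; then
    echo "GPU visible"
  else
    echo "nvidia-smi is installed but no GPU is reachable"
    return 1
  fi
}
```

Running `check_gpu` as the first command of the job makes the failure show up before dependencies are installed or `dvc repro` starts.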

Working with DVC remotes

DVC supports different kinds of remote storage. To set them up properly, store the credentials (if needed) as Github secrets or Gitlab masked environment variables to keep them secure. Additionally, in Github you need to add them as env variables in the workflow file.

# Github
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}

:point_right: AWS_SESSION_TOKEN is optional.

# Azure
env:
  AZURE_STORAGE_CONNECTION_STRING:
    ${{ secrets.AZURE_STORAGE_CONNECTION_STRING }}
  AZURE_STORAGE_CONTAINER_NAME: ${{ secrets.AZURE_STORAGE_CONTAINER_NAME }}

# Aliyun OSS
env:
  OSS_BUCKET: ${{ secrets.OSS_BUCKET }}
  OSS_ACCESS_KEY_ID: ${{ secrets.OSS_ACCESS_KEY_ID }}
  OSS_ACCESS_KEY_SECRET: ${{ secrets.OSS_ACCESS_KEY_SECRET }}
  OSS_ENDPOINT: ${{ secrets.OSS_ENDPOINT }}

:warning: Normally, GOOGLE_APPLICATION_CREDENTIALS points to the path of the JSON file that contains the credentials. However, in the action this variable CONTAINS the content of the file. Copy that JSON and add it as a secret.

env:
  GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
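
To illustrate the difference between the two conventions: if you run a tool yourself that expects the usual path semantics, the JSON held in the variable can be written to a file first. This is a minimal sketch under that assumption; `materialize_gcp_credentials` is a hypothetical helper and is not needed for the DVC-CML steps shown in this document.

```shell
# Hypothetical helper: if GOOGLE_APPLICATION_CREDENTIALS holds JSON content
# (it starts with "{"), write it to a private temp file and point the
# variable at that file. A value that already looks like a path is left alone.
materialize_gcp_credentials() {
  case "${GOOGLE_APPLICATION_CREDENTIALS:-}" in
    "{"*)
      creds_file="$(mktemp)"
      chmod 600 "$creds_file"
      printf '%s' "$GOOGLE_APPLICATION_CREDENTIALS" > "$creds_file"
      GOOGLE_APPLICATION_CREDENTIALS="$creds_file"
      export GOOGLE_APPLICATION_CREDENTIALS
      ;;
  esac
}
```

After calling `materialize_gcp_credentials`, the variable follows the standard convention of pointing to a credentials file.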

:warning: After configuring your Google Drive credentials you will find a json file at your_project_path/.dvc/tmp/gdrive-user-credentials.json. Copy that json and add it as a secret.

env:
  GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}

:warning: Not supported yet

:warning: Not supported yet

:warning: Not supported yet

Examples