Machine Learning Experimentation with DVC and VS Code

Author: The FourthBrain Team • November 7, 2022

FourthBrain hosted an event on Machine Learning Experimentation with DVC and VS Code, led by Alex Kim, Solutions Engineer at Iterative. Would you like to follow the demo and try DVC and VS Code yourself? Download the code and instructions from the following GitHub repo.

Based on the event, we will explore DVC as follows:

  1. What is DVC?
  2. What problem does DVC solve?
  3. How can DVC be used with VS Code?
  4. Case study: Analyzing customer churn using DVC via VS Code

What is DVC? 

Data Version Control (DVC) is a data versioning, ML workflow automation, and experiment management tool that takes advantage of the existing software engineering toolset you’re already familiar with (Git, your IDE, CI/CD, etc.). DVC helps data science and machine learning teams manage large datasets, make projects reproducible, and better collaborate.

DVC is a product of Iterative, a company based in San Francisco that produces several open-source tools to streamline the workflow of data scientists.

What problem does DVC solve?

Data versioning: DVC lets you capture the versions of your data and models in Git commits, while storing them on-premises or in cloud storage. It also provides a mechanism to switch between these different data contents. The result is a single history for data, code, and ML models.
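To make the idea concrete, here is a minimal sketch of content-addressed versioning: the data file is stored in a cache under its hash, and only a small pointer file (the part that would go into a Git commit) records which hash belongs to the current version. This is a simplified illustration of the general principle, not DVC's actual implementation or file format. The file name `customers.csv` and the `.pointer.json` suffix are assumptions made up for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its hash and write a pointer file.

    The pointer file stands in for what would be committed to Git; the cache
    copy stands in for on-premises or cloud storage.
    """
    digest = file_md5(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / digest)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "md5": digest}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> None:
    """Restore the data version named by a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    shutil.copyfile(cache_dir / meta["md5"], pointer.with_name(meta["path"]))


def demo() -> tuple[str, str]:
    """Track two versions of a file, then switch back to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        work, cache = Path(tmp) / "work", Path(tmp) / "cache"
        work.mkdir()
        data = work / "customers.csv"  # hypothetical file name

        data.write_text("id,churned\n1,0\n")
        pointer_v1 = track(data, cache).read_text()  # pointer for version 1

        data.write_text("id,churned\n1,0\n2,1\n")
        track(data, cache)  # pointer now describes version 2
        latest = data.read_text()

        # Switching versions: restore the old pointer, then check out its data.
        pointer = data.with_name(data.name + ".pointer.json")
        pointer.write_text(pointer_v1)
        checkout(pointer, cache)
        return latest, data.read_text()


if __name__ == "__main__":
    latest, restored = demo()
    print(restored)
```

Because the pointer file is tiny, Git can version it alongside the code, while the large data files stay in the cache.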

Figure 1: Exponential complexity of data science projects (source)

ML pipelines: We often find ourselves repeating actions to get or update the results of our project. For example, a data science workflow could involve:

  • Gathering data for training and validation
  • Extracting useful features from the training dataset
  • (Re)training an ML model
  • Evaluating the results against the validation set

Figure 2: A typical ML data pipeline (source)

A data pipeline is a series of data processing steps. DVC helps you define these steps in a standard YAML format (.dvc and dvc.yaml files), which makes your pipeline easier to manage and to reproduce consistently.
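The following sketch shows, in plain Python, how a pipeline described as stages with dependencies and outputs determines the order in which the steps must run. The `cmd`, `deps`, and `outs` keys follow the general shape of a stage in a dvc.yaml file, but every stage, script, and file name here is hypothetical, and the ordering logic is our own illustration rather than DVC's code.

```python
from graphlib import TopologicalSorter

# Hypothetical stages mirroring the workflow above (all names are assumptions).
STAGES = {
    "gather": {
        "cmd": "python gather.py",
        "deps": ["gather.py"],
        "outs": ["raw.csv"],
    },
    "featurize": {
        "cmd": "python featurize.py",
        "deps": ["featurize.py", "raw.csv"],
        "outs": ["features.csv"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["train.py", "features.csv"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["evaluate.py", "features.csv", "model.pkl"],
        "outs": ["metrics.json"],
    },
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # A stage depends on another stage when one of its deps is that stage's output.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in execution_order(STAGES):
        print(name, "->", STAGES[name]["cmd"])
```

Declaring dependencies and outputs explicitly is what lets a tool work out the order on its own, instead of relying on someone remembering which script to run first.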

Experiment management: Data science and machine learning are iterative processes that often require many attempts before a metric reaches its target level. Each machine learning experiment tests a given combination of code, data, and model configuration parameters.

Experimentation is part of the development of data features, hyperspace exploration, deep learning optimization, etc. DVC experiment management features help you organize, execute, manage, and share ML experiments. 
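As a rough sketch of what "many attempts" looks like in practice, the snippet below enumerates every combination of a small parameter grid and builds one `dvc exp run` command line per combination, using the `--queue` and `--set-param` options. The parameter names and values are hypothetical. The commands are only constructed, not executed, and we assume a DVC version that provides these options.

```python
from itertools import product

# Hypothetical parameter names and values; a real project defines its own.
GRID = {
    "train.n_estimators": [50, 100],
    "train.max_depth": [3, 5],
}


def experiment_grid(grid: dict) -> list[dict]:
    """Return one parameter dictionary per combination of grid values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def queue_command(params: dict) -> list[str]:
    """Build (but do not run) a command that queues one experiment."""
    args = ["dvc", "exp", "run", "--queue"]
    for name, value in params.items():
        args += ["--set-param", f"{name}={value}"]
    return args


if __name__ == "__main__":
    for params in experiment_grid(GRID):
        print(" ".join(queue_command(params)))
```

Even this two-by-two grid already produces four experiments, which is why keeping track of them by hand quickly stops working.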

How can DVC be used with VS Code?

DVC is available as a Visual Studio Code extension, as a command-line tool for any system terminal, and as a Python library. VS Code is free for private or commercial use. See the product license for details.
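For the Python library route, a minimal sketch could look like the following. It reads a DVC-tracked CSV file through `dvc.api.open`, which returns a file-like object usable in a `with` statement. The `read_rows` helper, the `opener` parameter, the `fake_open` stand-in, and the file path are our own assumptions, added so the function can be tried without a DVC repository at hand.

```python
import csv
import io
from contextlib import contextmanager


def read_rows(path, repo=None, rev=None, opener=None):
    """Read a DVC-tracked CSV file and return its rows as dictionaries.

    `opener` is a hypothetical hook for trying the function offline; by
    default the real dvc.api.open is used.
    """
    if opener is None:
        import dvc.api  # imported lazily so the sketch loads without DVC installed

        opener = dvc.api.open
    with opener(path, repo=repo, rev=rev) as handle:
        return list(csv.DictReader(handle))


@contextmanager
def fake_open(path, repo=None, rev=None):
    """Stand-in for dvc.api.open that yields a small in-memory CSV."""
    yield io.StringIO("id,churned\n1,0\n2,1\n")


if __name__ == "__main__":
    # Hypothetical path; with a real repo, drop the opener argument.
    print(read_rows("data/customers.csv", opener=fake_open))
```

With a real project, `rev` can name a Git commit, tag, or branch, so the same call can load the data as it existed at that revision.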

Case study: Analyzing customer churn using DVC via VS Code

In the demo, Alex showed a typical use case for machine learning models: customer churn, which can be defined as the percentage of customers that stopped using a company’s product or service during a certain time frame. We could also talk about churn when an employee leaves a company. It is a critical prediction for many businesses because acquiring new clients (or employees) often costs more than retaining existing ones.
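The definition above translates into a one-line calculation. The sketch below assumes churn is measured against the number of customers at the start of the period; other conventions exist (for example, using the average number of customers), and the function name is our own.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return churn as a percentage of the customers present at the period start."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return 100.0 * customers_lost / customers_at_start


if __name__ == "__main__":
    # 10 of 200 customers left during the period.
    print(churn_rate(200, 10))
```

A churn model typically predicts, per customer, whether that customer will be among those who leave, which is what makes this a classification problem.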

When we start using machine learning to solve a problem, we usually write our code in a Jupyter notebook. Although this approach is flexible and easy to use, it has many disadvantages if we want to iteratively improve the performance of our model by running many experiments (i.e., different combinations of code, data, and parameters).

Relying only on Jupyter notebooks isn’t the best solution if we try to answer these questions:

  • What exactly was used to produce a particular model?
  • Can you easily compare many ML experiments?
  • Will you be able to reproduce them later?

We ask these questions because we want to achieve certain goals:

  1. Achieve high predictive performance from our model, which usually implies running many experiments.
  2. Ensure reproducibility
    1. You can’t improve what you can’t reproduce
    2. Transparency and team collaboration (everybody knows exactly what the team is doing)
    3. Auditability, so that your model's results can be checked against laws and regulations.
  3. Minimal setup and dependency on 3rd party services
    1. Do not depend on a specific vendor
    2. Keep the maintenance and cost as low as possible
    3. Do not send sensitive data outside your company

According to Alex, some solutions that are similar to DVC (MLflow, W&B) fall short on at least one of these goals, and it is difficult to achieve all three at the same time. His recommendation is to try a combination of open-source tools:

Figure 3: Open-source tools recommended by Alex in his talk

DVC helps you run many experiments quickly and version everything (using Git together with other storage). VS Code, in turn, is a convenient UI to manage the whole process.

Figure 4: Data versioning (source)

You are now invited to follow the demo, step by step, by watching the video on this link. You will learn how to define pipelines, run many experiments and retrieve their results later. 

Where can I go to learn more about FourthBrain, VS Code, DVC, and future events like this?

If you are interested in learning more about the Machine Learning programs at FourthBrain, check out our website for the next cohort start dates and follow us on LinkedIn to be notified of future events like this one. If you want to experiment yourself with DVC and VS Code, refer to the official documentation here and here.