SURF works on improving data management for AI-based projects

SURF is exploring ways to provide researchers with standards, best practices, consultancy and a well-integrated set of tools to enable reproducible machine learning workflows. We are currently experimenting with available tools and gaining insights.

Data management for AI

Many open-source tools exist, but each solves only a part of the challenge. Where one tool provides data versioning, another is needed to run experiments at scale, and yet another for inspecting experiment results. Each requires different levels of expertise, knowledge of programming languages, and types of infrastructure. On the other hand, commercial platforms provide integrated solutions but lock the user in a specific infrastructure.

The importance of reproducibility

Reproducibility of scientific research is vital for maintaining transparency and trust among scientists, as well as between the scientific community and the public. In computational research, one of the first steps towards reproducibility is version control: when software changes are tracked, other researchers can replicate experiments exactly. For machine learning research, however, tracking software changes alone is often not enough. With data changing as frequently as models, how can machine learning workflows remain transparent and reproducible?

Experimenting with available tools

In the past few months, SURF has run experiments with several available tools to understand which solutions already exist and where they fall short. With Data Version Control (DVC), both machine learning model code and data are versioned, allowing researchers to automatically keep an exact history of all relevant aspects of their workflow. With Ray Tune or Optuna, hyperparameter tuning jobs are distributed across the Lisa cluster to run experiments at scale, and researchers can keep track of those experiments through the web interface of MLflow.
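The combination of hyperparameter search and experiment tracking described above can be sketched in a few lines of plain Python. This is an illustrative, library-free sketch, not the actual Optuna, Ray Tune, or MLflow APIs: the `objective` function, parameter ranges, and the in-memory `trials` list are all assumptions standing in for a real training run and a real tracking backend.

```python
import random

def objective(lr, batch_size):
    # Stand-in for a real training run: returns a mock validation loss.
    # (Assumed formula, chosen only so the search has a minimum to find.)
    return (lr - 0.01) ** 2 + 0.001 * batch_size

def random_search(n_trials, seed=0):
    """Try random hyperparameter combinations and record every trial,
    mimicking what a tuner (Optuna / Ray Tune) plus a tracker (MLflow)
    provide: a reproducible, inspectable history of all runs."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    trials = []
    for i in range(n_trials):
        params = {
            "lr": rng.uniform(1e-4, 1e-1),
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = objective(**params)
        # A real experiment tracker would persist this record (parameters
        # plus metric) so the run can be inspected and reproduced later.
        trials.append({"trial": i, **params, "loss": loss})
    best = min(trials, key=lambda t: t["loss"])
    return best, trials

best, history = random_search(n_trials=20)
print(f"best trial: lr={best['lr']:.4f}, "
      f"batch_size={best['batch_size']}, loss={best['loss']:.4f}")
```

In the tools SURF experimented with, the tracked history is not an in-memory list but a persistent store with a web UI, and the search itself can be distributed across cluster nodes rather than run in a single loop.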

Investigating services of public cloud providers

In addition, SURF is investigating the machine learning services of public cloud providers, such as AWS SageMaker, Azure Machine Learning, and Google’s Vertex AI. These providers claim to offer a unified platform for preprocessing data and for training, tuning, and deploying machine learning models. Do they live up to that claim, how do these platforms differ from existing tools, and what functionality do they provide for reproducible machine learning research projects?

Get in touch

Are you working on a machine learning project and do you recognize any of these challenges? Please get in touch with us. We are currently preparing resources to help users implement this workflow on our infrastructure, and we hope to provide more generally available support in the future.