
Agile Methods in Data Science

  • Josef Korte
  • February 19, 2021
Agile has its roots in technology, in the Manifesto for Agile Software Development. The manifesto is a short yet comprehensive statement whose four values explain to a broader public what developers mean by agile.
 
Can agile also be beneficial in data science? The relatively new field of data science revolves around the analysis of data and uses methods linked to development and programming (e.g. statistics, machine learning and data processing). Its actual essence, however, is the delivery of data-based insights that empower the business.
 
Agile methods build on short iteration loops that may also enable your data science team to respond to new insights and requirements from multiple stakeholders, while visualizing the progress. Frequent updates increase the transparency of the efforts and the results they yield. Retrospectives performed by the team will support continuous improvement of the process while ensuring a steep learning curve for all stakeholders.
 
However, agile methods cannot be applied to data science unconditionally. The challenges stem mainly from agile planning instruments: data science tasks often have to be done in sequence, and models are hard to fit to frequently changing datasets. This is why we have adapted the agile manifesto.
 
Hypotheses and experiments
over processes
Data science teams are often ultimately measured by the success of their prediction and decision models. Building those models requires testing hypotheses and generating insights, both areas where agile approaches work well. However, we have seen resource and capacity conflicts arise when teams try to follow agile methods for process-oriented tasks, such as collecting and maintaining data.
Defined quality objectives
over striving for optimization
We often observe academic working habits in data science, where optimizing a model takes priority over a pragmatic focus on results. This is why clearly defined quality criteria such as “confidence level of X%” or “stop after X iterations” should be used to ensure the on-time execution needed in a business setting.
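As an illustration, the two kinds of criteria named above (a quality target and an iteration budget) can be written down as an explicit stopping rule. The following is a minimal Python sketch, not a prescribed implementation: the names QualityObjective and run_until_good_enough and the list of scores are hypothetical and only stand in for a real training and validation loop.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class QualityObjective:
    """Hypothetical container for the agreed quality criteria."""
    target_score: float   # e.g. 0.90 for "90% accuracy on validation data"
    max_iterations: int   # iteration budget agreed with the stakeholders


def run_until_good_enough(improve_once: Callable[[int], float],
                          objective: QualityObjective) -> Tuple[str, int, float]:
    """Call improve_once(i) until the target is met or the budget is spent.

    Returns (reason, iterations_used, best_score).
    """
    best = float("-inf")
    for i in range(1, objective.max_iterations + 1):
        best = max(best, improve_once(i))
        if best >= objective.target_score:
            return ("target reached", i, best)
    return ("budget exhausted", objective.max_iterations, best)


# Made-up validation scores standing in for successive modelling iterations.
example_scores = [0.72, 0.80, 0.86, 0.91, 0.93]


def example_step(i: int) -> float:
    return example_scores[i - 1]


print(run_until_good_enough(example_step, QualityObjective(0.90, 5)))
print(run_until_good_enough(example_step, QualityObjective(0.90, 3)))
```

With a budget of five iterations the loop stops at the fourth, as soon as the target is met; with a budget of three it stops without reaching the target and reports the best score so far, which gives the team a clear point at which to discuss the result rather than keep optimizing.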
 
Applicability and problem understanding
over methodological excellence
Data is rarely structured, correct or complete. A model that used to work well for some datasets might no longer deliver results once the data scope changes (in statistics, this is often a symptom of overfitting). We recommend applying simple models that work easily in many environments, rather than the ‘optimum’ model that may only function under lab conditions.
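The effect can be shown with a small, deliberately artificial Python sketch. It compares a simple straight-line fit with a model that merely memorizes its training points (a nearest-neighbour lookup). The data, the noise pattern and all function names are made up for illustration; real datasets will behave less neatly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbour(xs, ys, x_new):
    """Memorizing model: return the y of the closest training x."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[closest]


def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up training data: true relation y = 2x plus alternating +1/-1 noise.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]

# "Changed data scope": new inputs between the training points, noise-free.
new_x = [x + 0.25 for x in range(9)]
new_y = [2 * x for x in new_x]

slope, intercept = fit_line(train_x, train_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_new = mse([slope * x + intercept for x in new_x], new_y)
memo_train = mse([nearest_neighbour(train_x, train_y, x) for x in train_x], train_y)
memo_new = mse([nearest_neighbour(train_x, train_y, x) for x in new_x], new_y)

print("line:      train", round(line_train, 3), "new", round(line_new, 3))
print("memorizer: train", round(memo_train, 3), "new", round(memo_new, 3))
```

In this toy setup the memorizing model has zero error on the data it was built on, yet a clearly larger error than the straight line on the new inputs, because it has learned the noise along with the signal.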
 
Periodic review and learning
over finished software and models

Once implemented, many models are not regularly reviewed and improved. This is frequently justified by the higher consistency and better comparability over time that an unchanged model offers. However, because technology develops quickly, a regular and perhaps even automated review should be part of every data science team's calendar.
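One possible shape for an automated review is a small check that compares a model's recent prediction errors with the errors recorded when it went live and flags the model once they have drifted apart. The Python sketch below assumes such error logs exist; the function name needs_review, the 10% tolerance and the sample numbers are illustrative assumptions, not recommendations.

```python
from statistics import mean


def needs_review(baseline_errors, recent_errors, tolerance=0.10):
    """Flag a model whose recent mean error exceeds the baseline mean error
    by more than `tolerance` (relative, so 0.10 means 10% worse)."""
    baseline = mean(baseline_errors)
    recent = mean(recent_errors)
    return recent > baseline * (1 + tolerance)


# Made-up error logs: errors at go-live versus errors from the latest period.
errors_at_go_live = [2.0, 2.0, 2.0]
print(needs_review(errors_at_go_live, [2.1, 2.1, 2.1]))  # 5% worse: within tolerance
print(needs_review(errors_at_go_live, [3.0, 3.0, 3.0]))  # 50% worse: flag for review
```

Run on a schedule, a check like this turns the periodic review into a standing calendar item: the team only needs to step in when the flag is raised, while the regular comparison still happens every period.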

We are convinced that by following this adjusted manifesto, data science can finally become agile – and be even more fun to work in.

This piece was written by Josef Korte and Jan Ortmann