Problem Formulation in Data Science

2. Problem Formulation in Data Science

[Figure: Data Science Flower]

Most applications of data science are novel and open-ended, which means the first challenge for any data scientist is not coding or modeling, but understanding the problem itself. Before jumping into data collection or analysis, it is essential to grasp the context in which the problem exists.

The first step, therefore, is to understand the “physics” of the domain: the underlying mechanisms, constraints, and relationships that define how things work in that particular field. This may involve talking to domain experts to gain qualitative insights, reviewing existing literature or research studies, exploring previously used datasets, or studying the business processes or scientific principles relevant to the domain.

Once this foundational understanding is built, the next step is to translate a broad, often ambiguous idea into a well-defined, data-driven problem statement. This involves breaking down the general question into smaller, actionable components that can be explored using data. For example, instead of asking “How can we improve customer satisfaction?”, we might narrow it to “Can we predict customer churn based on behavioral and transactional data?”

This stage typically requires multiple iterations, moving back and forth between domain understanding, hypothesis generation, and feasibility assessment. Each iteration helps refine the problem further, narrowing it down to something that is both meaningful and measurable using available data.

A well-formulated data science problem should answer three key questions:

  1. What are we trying to achieve? Define the core objective: prediction, classification, detection, optimization, etc.

  2. Why does it matter? Understand the value or impact of solving the problem for the organization, community, or research domain.

  3. What data or evidence can help us get there? Identify the type and scope of data required, and whether it exists or needs to be collected.
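As a minimal sketch, the three questions above can be written down as a small record so that an unanswered one is easy to spot during an iteration. The `ProblemStatement` class, its field names, and the `unanswered` helper are hypothetical names chosen for illustration; they are not part of any library or of the framework discussed later.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ProblemStatement:
    """Hypothetical container for the three questions above."""
    objective: str  # What are we trying to achieve?
    value: str      # Why does it matter?
    evidence: str   # What data or evidence can help us get there?

    def unanswered(self) -> List[str]:
        """Return the names of the questions that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


draft = ProblemStatement(
    objective="Predict which customers are likely to churn in the next 30 days",
    value="",  # not yet articulated: a cue for another iteration
    evidence="Historical transaction and support data",
)
print(draft.unanswered())  # ['value']
```

A blank field here does not mean the project should stop; it only flags which question deserves attention in the next round of refinement.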

At the end of this stage, you should have a clear and refined problem statement, such as:

“Given historical transaction and support data, predict which customers are likely to churn in the next 30 days.”
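One way to check that a statement like this is measurable is to try writing down its target label. The sketch below uses an invented toy transaction log and assumes one possible definition of churn: a customer who has no transaction in the 30 days after a cutoff date. The function name `label_churn`, the data, and the definition itself are illustrative assumptions; a real project would settle the definition with domain experts.

```python
from datetime import date, timedelta

# Toy transaction log: (customer_id, transaction_date). Invented for illustration.
transactions = [
    ("A", date(2024, 1, 5)),
    ("A", date(2024, 2, 10)),
    ("B", date(2024, 1, 20)),
    ("C", date(2024, 1, 28)),
    ("C", date(2024, 3, 15)),
]


def label_churn(transactions, cutoff, horizon_days=30):
    """Label each customer seen on or before `cutoff`.

    Assumed definition: a customer churns if they have no transaction
    in the window (cutoff, cutoff + horizon_days].
    """
    window_end = cutoff + timedelta(days=horizon_days)
    # Customers we already know about at prediction time.
    known = {cid for cid, day in transactions if day <= cutoff}
    # Customers with at least one transaction inside the horizon.
    active = {cid for cid, day in transactions if cutoff < day <= window_end}
    return {cid: cid not in active for cid in sorted(known)}


labels = label_churn(transactions, cutoff=date(2024, 1, 31))
print(labels)  # {'A': False, 'B': True, 'C': True}
```

Customer C is labeled as churned even though they return on March 15, because that date falls outside the 30-day window. Spelling out the cutoff, the horizon, and the churn rule in this way exposes choices that the one-sentence statement leaves implicit.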

This clarity forms the foundation for every subsequent step, from data acquisition and wrangling to modeling and deployment. A well-defined problem helps keep the entire data science process focused, interpretable, and impactful.

To arrive at a well-defined problem, we will explore a structured framework for the problem formulation stage. It is meant as a guide, not as a procedure to be followed rigidly.