1.1. Novel Framework#
This framework provides a step-by-step approach to tackling any data science problem, from understanding the domain to measuring real-world impact and iterating on solutions.
1.1.1. Framework Overview#
| Step | Stage | Description | Example |
|---|---|---|---|
| 0 | Domain Knowledge | Build an understanding of the problem domain, constraints, and stakeholders. | Speak with pediatricians to understand factors affecting diabetes risk in children. |
| 1 | Hypothesis | Form a testable assumption that gives direction to the analysis. | Children with irregular follow up visits are at higher risk of late diabetes diagnosis. |
| 2 | Information Gathering | Collect relevant data and contextual knowledge to test the hypothesis. | Gather EMR data including visit frequency, lab results, age, and family history. |
| 3 | Intelligent Solution Design | Apply appropriate analytical or modeling techniques to address the problem. | Build a risk scoring model using patient visit patterns and clinical features. |
| 4 | Impact Evaluation | Measure whether the solution improves outcomes relative to a baseline. | Compare early detection rates before and after deploying the model. |
| 5 | Reflection and Iteration | Reassess assumptions and refine the solution based on results and feedback. | Identify missing social factors and update the model accordingly. |
1.1.2. Domain Knowledge#
Data science can be applied across a wide range of domains such as healthcare, finance, education, manufacturing, and sports. Successful application depends heavily on understanding the domain in which the problem exists: how the system works, what constraints exist, and what outcomes matter. Without this foundation, even technically strong solutions can be misleading or unusable.
Domain knowledge helps answer questions such as which variables are meaningful, how they interact, and what real-world constraints apply. For example, in healthcare, a variable may appear predictive only because it is recorded after the outcome occurs. Using such a variable would introduce data leakage and invalidate the model. Strong domain understanding helps avoid these pitfalls and ensures that models are both valid and useful.
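The leakage pitfall described above can be sketched as a simple timestamp check. This is a minimal illustration, assuming each feature carries the date on which it was recorded; the field names, dates, and the helper `find_leaky_features` are hypothetical and not taken from any real EMR schema.

```python
from datetime import date

# Hypothetical feature log for one patient: when each field was recorded.
# Names and dates are invented for illustration only.
feature_recorded_on = {
    "visit_count_last_year": date(2023, 3, 1),
    "family_history": date(2022, 11, 15),
    "insulin_prescription": date(2023, 6, 20),  # written after the diagnosis
}
diagnosis_date = date(2023, 6, 5)


def find_leaky_features(recorded_on, outcome_date):
    """Return names of features recorded on or after the outcome date."""
    return sorted(
        name for name, recorded in recorded_on.items() if recorded >= outcome_date
    )


leaky = find_leaky_features(feature_recorded_on, diagnosis_date)
print(leaky)  # ['insulin_prescription']
```

A date comparison like this only catches leakage that shows up in timestamps. Knowing that a field such as a prescription is a consequence of the diagnosis still requires domain understanding.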
1.1.2.1. Reviewing Existing Work#
Reviewing existing research, industry practices, and open datasets helps establish context and avoid redundant efforts. Understanding what has already been attempted allows you to build upon prior work rather than starting from scratch. This step also helps set realistic expectations about what data science can and cannot achieve in the domain.
1.1.2.2. Expert Consultation#
Engaging with domain experts early accelerates learning and improves solution quality. Experts can highlight common misconceptions, explain real-world workflows, and point out constraints that are not obvious from data alone.
In healthcare, conversations with clinicians can reveal how diagnoses are made in practice, which signals are trusted, and which data fields are unreliable or inconsistently recorded.
1.1.2.3. Outside Perspective#
Seeking outside perspectives strengthens problem formulation. These perspectives may come from domain experts, peers, mentors, or personal experience.
For instance, when designing a healthcare solution, personal experience as a patient can help identify pain points that are not visible in structured data. Combining multiple viewpoints often leads to more grounded and impactful solutions.
1.1.3. Hypothesis#
Every data science project should be guided by a hypothesis. A hypothesis provides direction and prevents unfocused exploration. A strong hypothesis connects domain understanding to a measurable question.
A clear hypothesis defines the problem at a high level. This helps maintain focus on the broader objective rather than getting lost in minor technical details. A well-formulated hypothesis also states the intended outcome of the analysis, which guides subsequent steps such as data collection and solution design and gives the project a goal that can be evaluated.
Example hypothesis
Children with fewer routine follow-up visits are more likely to receive a delayed diabetes diagnosis.
This hypothesis clearly states a relationship that can be tested using data.
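One way such a relationship could be tested is to compare the delayed-diagnosis rate between children with few visits and children with regular visits. The sketch below uses a two-proportion z-test as one possible choice, not a method prescribed by the framework. The counts are invented purely for illustration, and the function name `two_proportion_z_test` is hypothetical.

```python
import math


def two_proportion_z_test(delayed_a, total_a, delayed_b, total_b):
    """Return (rate_a, rate_b, z, two_sided_p) for a pooled two-proportion z-test."""
    rate_a = delayed_a / total_a
    rate_b = delayed_b / total_b
    pooled = (delayed_a + delayed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Invented counts: group A has few follow-up visits, group B has regular visits.
rate_few, rate_regular, z, p_value = two_proportion_z_test(
    delayed_a=30, total_a=120, delayed_b=20, total_b=200
)
print(f"few visits: {rate_few:.0%} delayed, regular visits: {rate_regular:.0%} delayed")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

A difference in rates supports the hypothesis but does not by itself show that missed visits cause delayed diagnosis; confounders such as access to care would need to be considered.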
1.1.4. Information Gathering#
Once a hypothesis is defined, the next step is to gather the information needed to test it. This includes identifying relevant datasets, understanding how the data is generated, and recognizing the limitations that apply in the problem domain.
This often involves comparing candidate datasets on both their quality and the information they contain. A high-quality dataset might lack some of the necessary fields, while a lower-quality dataset might still provide useful signals for testing the hypothesis.
For example, in healthcare, this may involve collecting electronic medical records, understanding how visit frequency is recorded, and identifying missing or unreliable fields.
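Identifying missing fields can start with a per-field missingness report. The sketch below assumes records are available as plain dictionaries with `None` marking a missing value; the field names, values, and the helper `missing_fraction` are hypothetical.

```python
# Hypothetical EMR extract; field names and values are illustrative only.
records = [
    {"patient_id": 1, "age": 9, "visits_last_year": 4, "hba1c": 5.4, "family_history": "yes"},
    {"patient_id": 2, "age": 12, "visits_last_year": None, "hba1c": 6.1, "family_history": None},
    {"patient_id": 3, "age": 7, "visits_last_year": 1, "hba1c": None, "family_history": None},
    {"patient_id": 4, "age": 10, "visits_last_year": 2, "hba1c": 5.9, "family_history": "no"},
]


def missing_fraction(rows):
    """Return {field: fraction of rows where the field is None or absent}."""
    if not rows:
        return {}
    fields = sorted({key for row in rows for key in row})
    return {
        field: sum(row.get(field) is None for row in rows) / len(rows)
        for field in fields
    }


report = missing_fraction(records)
# List the least complete fields first.
for field, fraction in sorted(report.items(), key=lambda item: -item[1]):
    print(f"{field}: {fraction:.0%} missing")
```

A report like this shows where data is absent, not whether the recorded values are reliable; that judgment still depends on how each field is captured in practice.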
Information gathering is not limited to numerical data. Documentation, interviews, and process understanding are equally important.
1.1.5. Intelligent Solution Design#
With data and context in place, we can design an appropriate solution for the problem at hand. This may involve statistical analysis, machine learning models, or rule-based systems when ease of interpretation matters.
We have seen the different types of problems that can be solved using data science techniques in the Problem Formulation section. The choice of solution method depends on the problem type, data characteristics, and deployment constraints.
In our example, a risk scoring model could be built using patient demographics, visit history, and lab measurements.
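As a rough illustration of the rule-based end of that spectrum, the sketch below scores a patient with a few additive rules. The `Patient` fields, thresholds, and point values are placeholders chosen for the example and are not clinical guidance; a real model would be fitted to data and validated with clinicians.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int                 # years
    visits_last_year: int    # routine follow-up visits
    hba1c: float             # lab measurement, percent
    family_history: bool     # family history of diabetes


def risk_score(patient: Patient) -> int:
    """Additive rule-based score; all thresholds and points are placeholders."""
    score = 0
    if patient.age >= 10:
        score += 1
    if patient.visits_last_year < 2:
        score += 2
    if patient.hba1c >= 5.7:
        score += 3
    if patient.family_history:
        score += 1
    return score


high = risk_score(Patient(age=11, visits_last_year=1, hba1c=6.0, family_history=True))
low = risk_score(Patient(age=6, visits_last_year=4, hba1c=5.2, family_history=False))
print(high, low)  # 7 0
```

Each rule can be read and challenged by a domain expert, which is the interpretability benefit of this kind of design; the trade-off is that hand-set rules may miss interactions a learned model would capture.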
1.1.6. Impact Evaluation#
After building a solution, it is critical to evaluate its impact. This involves comparing outcomes against a baseline and determining whether the solution creates measurable value.
For the healthcare example, evaluation might involve measuring improvements in early diagnosis rates or reductions in emergency admissions.
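A baseline comparison of this kind can be sketched as follows. The counts are invented for illustration, and `detection_rate` is a hypothetical helper; the example only shows the arithmetic of absolute and relative change.

```python
def detection_rate(early_detected: int, total_diagnosed: int) -> float:
    """Share of diagnoses that were made early."""
    if total_diagnosed <= 0:
        raise ValueError("total_diagnosed must be positive")
    return early_detected / total_diagnosed


# Invented counts for the periods before and after deployment.
baseline_rate = detection_rate(early_detected=40, total_diagnosed=100)
post_rate = detection_rate(early_detected=55, total_diagnosed=100)

absolute_change = post_rate - baseline_rate
relative_change = absolute_change / baseline_rate

print(f"baseline: {baseline_rate:.1%}, after deployment: {post_rate:.1%}")
print(f"absolute change: {absolute_change:.1%}, relative change: {relative_change:.1%}")
```

A before-and-after comparison cannot rule out other changes that happened over the same period, so a control group or a staged rollout gives stronger evidence where it is feasible.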
1.1.7. Reflection and Iteration#
Data science is inherently iterative. Results often reveal new insights, flaws in assumptions, or opportunities for improvement.
Reflection allows you to reassess earlier decisions and refine the problem definition or solution. Iteration ensures that the system improves over time as new data and feedback become available.
Revisiting the Reflective Prompts throughout this process helps maintain clarity and focus.
1.1.8. Summary#
This framework provides a structured and repeatable approach to data science problem formulation. Each stage builds upon the previous one, guiding you from vague ideas to well-scoped, impactful solutions.
By following this process, you ensure that your work remains grounded in context, driven by evidence, and focused on real-world value.
This framework aligns with the Introduction - Novel Lifecycle, which outlines the overall lifecycle of a data science project from problem formulation to deployment and monitoring. We will revisit it throughout the book.