Project Overview#
Along with the book and lab exercises, we recommend that every reader build a project of their own. The upcoming sections provide guidelines to help you get started.
The project is designed in such a way that you can iterate on it as you progress through the book. Each chapter will build upon the previous one, allowing you to gradually enhance and refine your project.
The project will not only help you apply the concepts learned in the book but also provide you with a tangible outcome that you can showcase in your portfolio.
The goal of the project is to give you hands-on experience in data science and help you develop the skills necessary to tackle real-world data challenges.
Introduction to the project#
In this project, the reader is invited to take on the challenge of designing a complete data science workflow from start to finish. The aim is not only to practice technical skills, but also to think critically about how problems are defined, how data is used to address them, and how results can be shared in a meaningful way.
The project is structured around the data science cycle, which includes the following stages:
Problem Formulation – identifying and framing a problem that can be approached with data.
Data Acquisition – locating, collecting, or generating relevant datasets.
Data Exploration – understanding the dataset through descriptive analysis and visualization.
Data Wrangling – cleaning, transforming, and organizing the data for analysis.
Data Mining – extracting patterns or features that shed light on the problem.
Modeling – developing and evaluating predictive or explanatory models.
Model Deployment – showing how the model might be put into use.
Visualization – creating graphics that reveal insights and make results interpretable.
Dashboard – presenting results in an interactive format.
Tool Development – wrapping your work into a simple, usable prototype.
Tool Deployment – demonstrating how others might access or interact with your solution.
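To make the cycle concrete, the stages can be sketched as a minimal pipeline skeleton. This is only an illustration under assumptions: the function names, the toy dataset, and the majority-class "model" are placeholders for whatever your own project requires, not a prescribed structure.

```python
import pandas as pd

def acquire_data() -> pd.DataFrame:
    # Data Acquisition placeholder: a real project would load from a file, API, or database.
    return pd.DataFrame({"feature": [1.0, 2.0, None, 4.0], "target": [0, 1, 0, 1]})

def explore(df: pd.DataFrame) -> None:
    # Data Exploration: summary statistics and missing-value counts.
    print(df.describe())
    print(df.isna().sum())

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    # Data Wrangling: here, simply drop rows with missing values.
    return df.dropna()

def build_model(df: pd.DataFrame):
    # Modeling stand-in: "predict" the majority class for every row.
    majority = df["target"].mode()[0]
    return lambda X: [majority] * len(X)

# Run the pipeline end to end.
clean = wrangle(acquire_data())
explore(clean)
predict = build_model(clean)
print(predict(clean))
```

Each later chapter can then replace one placeholder with a real implementation while the overall flow stays the same.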
Technical Environment#
For this project, we recommend using Python with a lightweight prototyping stack. In particular:
Streamlit for building interactive dashboards and prototypes.
pandas for data handling.
matplotlib and seaborn for visualization.
plotly for interactive visualization.
Additional libraries (e.g., scikit-learn, NumPy) as needed for modeling and analysis.
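A few lines are enough to exercise the core of this stack. The snippet below is a minimal sketch, assuming a toy dataset and an arbitrary output filename (`trend.png`); it uses pandas for data handling and matplotlib for a static plot, with the non-interactive `Agg` backend so it also runs headlessly.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; works without a display
import matplotlib.pyplot as plt

# Toy dataset standing in for your project's data.
df = pd.DataFrame({"year": [2020, 2021, 2022, 2023],
                   "value": [10, 14, 13, 18]})

# A basic line chart, saved to disk instead of shown interactively.
fig, ax = plt.subplots()
ax.plot(df["year"], df["value"], marker="o")
ax.set_xlabel("year")
ax.set_ylabel("value")
fig.savefig("trend.png")
```

The same DataFrame could be fed to seaborn, plotly, or a Streamlit app as the project grows; starting from a script like this keeps experimentation fast.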
This setup is intentionally approachable and designed for rapid experimentation. Depending on the reader’s background and ambitions, a more scalable architecture (e.g., Docker, cloud deployment, larger databases) could be explored. However, the guidance provided here assumes the basic prototyping environment described above.
The project may be developed as a single Jupyter Notebook or divided into smaller parts, depending on the reader’s preference. What matters most is that each stage of the cycle is addressed, documented, and connected to the overarching problem formulation.