Milestone 1#
Along with the book and lab exercises, readers are strongly encouraged to build a project of their own. This project is designed to be developed incrementally as you progress through the book. Each chapter builds upon the previous one, allowing you to gradually enhance and refine your work.
The project serves two main purposes:
To help you apply the concepts learned in the book in a realistic context
To produce a tangible artifact that can be showcased in a professional portfolio
The overarching goal is to provide hands-on experience with the end-to-end data science workflow, preparing you to tackle real-world data challenges.
This repository focuses on Milestone 1, which corresponds to the early stages of this workflow.
Technical Environment#
The recommended technical stack for this project is intentionally lightweight and suitable for rapid prototyping:
Python
pandas for data manipulation
matplotlib and seaborn for static visualizations
plotly for interactive visualizations
Streamlit for dashboards and prototypes
Additional libraries (e.g., NumPy, scikit-learn) as needed
The project may be implemented as a single Jupyter Notebook or split into multiple scripts. What matters most is that each stage is clearly documented and logically connected to the problem formulation.
Scope and Expectations#
Milestone 1 is designed to take approximately five weeks of work and should be developed in parallel with reading the book.
By the end of this milestone, the reader should:
Clearly define a data-driven problem statement that is:
Concrete
Measurable
Explicit about its inputs, outputs, and time horizon
Acquire the data and:
Document all data sources
Ensure the data is accessible within the repository or reproducible via scripts
Provide a data dictionary describing variables, measurements, and sources
Design a database schema that:
Integrates all data sources into a single SQL database
Is implemented using the acquired data
Explore the data using:
Descriptive statistics
Simple visualizations
Notes on missing values, anomalies, or behaviors requiring future attention
In summary:
By the end of Milestone 1, the reader will know what their problem is, how the data was acquired, and how it is stored, and will have a deep understanding of the dataset.
Required Deliverables#
The repository must include the following:
Diary Folder
A folder named diary/ containing five .txt files documenting:
Problem formulation
Data acquisition (sources and relevance)
Data acquisition II (database storage)
Data exploration
Reflection and next steps
Database Schema
One .png image showing the database schema
Data Dictionary
One .pdf describing variables, measurements, and sources
Code Implementation
One organized .ipynb or multiple .py files containing:
Problem formulation
Data acquisition
Data acquisition II (database)
Data exploration
README
Explanation of the repository structure and instructions to reproduce the work
requirements.txt
A list of project dependencies with versions
Database File or Access Instructions
The .db file or clear instructions on how to generate/access it
Version Control
At least four meaningful commits over the five-week period
Diary files naturally account for at least five commits
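One way to produce the versioned dependency list for requirements.txt is to read the versions installed in the working environment. The sketch below is a minimal illustration, not a required method. The helper name pin_requirements and the package list are assumptions chosen to match the recommended stack, and the code relies only on importlib.metadata from the standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_requirements(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment: skip it rather than guess a version.
            continue
    return lines


# Hypothetical package list mirroring the recommended stack.
stack = ["pandas", "matplotlib", "seaborn", "plotly", "streamlit"]
pinned = pin_requirements(stack)

# Print the lines; redirect the output to requirements.txt if desired.
print("\n".join(pinned))
```

Running pip freeze is a common alternative, though it lists every package installed in the environment rather than only those the project uses.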
Milestone 1 Grading Rubric (200 Points Total)#
1. Diary Documentation (40 points)#
Clear, concise entries for all five stages
Reflective, thoughtful, and connected to decisions made
Professional tone and consistent formatting
2. Problem Formulation (30 points)#
Data-driven, concrete, and measurable
Clearly defined inputs, outputs, and time horizon
Strong alignment between question and data
3. Data Acquisition & Documentation (30 points)#
Data sources clearly identified and justified
Data reproducible or included in repository
Data dictionary is complete and well-organized
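To illustrate what a complete and well-organized data dictionary might contain, the sketch below records one entry per variable and renders the entries as a Markdown table. Everything here is hypothetical: the Variable class, the to_markdown helper, and the example variables and file names are assumptions made for illustration, not part of the project requirements.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """One row of the data dictionary."""
    name: str
    description: str
    unit: str
    source: str


def to_markdown(variables):
    """Render the data dictionary entries as a Markdown table."""
    header = "| Variable | Description | Unit | Source |"
    divider = "| --- | --- | --- | --- |"
    rows = [
        f"| {v.name} | {v.description} | {v.unit} | {v.source} |"
        for v in variables
    ]
    return "\n".join([header, divider, *rows])


# Hypothetical entries; replace them with the variables of your own dataset.
variables = [
    Variable("station_id", "Identifier of the weather station", "n/a", "stations.csv"),
    Variable("temp_c", "Daily mean air temperature", "degrees Celsius", "observations.csv"),
]

table = to_markdown(variables)
print(table)
```

Keeping the entries in code or in a plain-text table makes the dictionary easy to update as the data changes. The result can then be exported to the required .pdf with whichever tool you prefer.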
4. Database Schema & Implementation (30 points)#
Logical schema design
Successful integration of all data sources
Clear schema visualization and functioning database
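As a rough sketch of how two data sources could be integrated into a single functioning SQL database, the example below uses SQLite through Python's built-in sqlite3 module. The table names, columns, and sample rows are assumptions made for illustration, and an in-memory database stands in for the .db file. Passing a file path to build_database would write the database to disk instead.

```python
import sqlite3

# Hypothetical schema: two sources linked by a shared station_id key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stations (
    station_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS observations (
    observation_id INTEGER PRIMARY KEY,
    station_id     INTEGER NOT NULL REFERENCES stations(station_id),
    observed_on    TEXT NOT NULL,   -- ISO 8601 date
    temp_c         REAL             -- NULL when the reading is missing
);
"""


def build_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clause
    conn.executescript(SCHEMA)
    return conn


conn = build_database()

# Load a few made-up rows from each "source".
conn.executemany("INSERT INTO stations VALUES (?, ?)", [(1, "North"), (2, "South")])
conn.executemany(
    "INSERT INTO observations (station_id, observed_on, temp_c) VALUES (?, ?, ?)",
    [(1, "2024-01-01", 3.5), (1, "2024-01-02", None), (2, "2024-01-01", 7.0)],
)
conn.commit()

# A join across both tables confirms that the sources are integrated.
rows = conn.execute(
    """
    SELECT s.name, COUNT(o.observation_id) AS n_obs
    FROM stations AS s
    JOIN observations AS o ON o.station_id = s.station_id
    GROUP BY s.name
    ORDER BY s.name
    """
).fetchall()
print(rows)  # [('North', 2), ('South', 1)]
```

A query that joins every table is a quick check that the keys line up before you draw the schema diagram.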
5. Data Exploration (30 points)#
Appropriate descriptive statistics
Clear, readable visualizations
Explicit notes on missing values and anomalies
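The exploration criteria above can be sketched with pandas, which is part of the recommended stack. The example below is a minimal illustration under stated assumptions: the column names and values are made up, and the 1.5 x IQR rule is only one common heuristic for flagging possible anomalies, not a method prescribed by this milestone. Plots are left out to keep the sketch short.

```python
import pandas as pd

# Hypothetical data with one missing reading and one suspicious value.
df = pd.DataFrame(
    {
        "station": ["North", "North", "North", "South", "South", "South"],
        "temp_c": [3.5, 4.0, None, 4.5, 5.0, 41.0],
    }
)

# Descriptive statistics (count, mean, std, min, quartiles, max).
summary = df["temp_c"].describe()

# Number of missing values per column.
missing = df.isna().sum()

# Flag values outside 1.5 * IQR of the quartiles as candidates for review.
q1, q3 = df["temp_c"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temp_c"] < q1 - 1.5 * iqr) | (df["temp_c"] > q3 + 1.5 * iqr)]

print(summary)
print(missing)
print(outliers)
```

Flagged rows are candidates for a closer look rather than errors to delete. Recording them in the diary covers the explicit notes on missing values and anomalies.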
6. Code Quality & Organization (20 points)#
Well-structured files and sections
Readable, commented, and reproducible code
Clear separation of project stages
7. Repository Organization & Professionalism (10 points)#
Clean directory structure
Clear README instructions
Consistent naming conventions
8. Version Control Practices (10 points)#
Regular, meaningful commits
Clear commit messages reflecting progress
Total: 200 points