Milestone 1

Alongside the book and lab exercises, you are strongly encouraged to build a project of your own. The project is designed to be developed incrementally as you progress through the book: each chapter builds on the previous one, so you can gradually enhance and refine your work.

The project serves two main purposes:

  • To help you apply the concepts learned in the book in a realistic context

  • To produce a tangible artifact that can be showcased in a professional portfolio

The overarching goal is to provide hands-on experience with the end-to-end data science workflow, preparing you to tackle real-world data challenges.


This repository focuses on Milestone 1, which corresponds to the early stages of this workflow.


Technical Environment

The recommended technical stack for this project is intentionally lightweight and suitable for rapid prototyping:

  • Python

  • pandas for data manipulation

  • matplotlib and seaborn for static visualizations

  • plotly for interactive visualizations

  • Streamlit for dashboards and prototypes

  • Additional libraries (e.g., NumPy, scikit-learn) as needed

The project may be implemented as a single Jupyter Notebook or split into multiple scripts. What matters most is that each stage is clearly documented and logically connected to the problem formulation.


Scope and Expectations

Milestone 1 is designed to take approximately five weeks of work and should be developed in parallel with reading the book.

By the end of this milestone, the reader should:

  1. Clearly define a data-driven problem statement that is:

    • Concrete

    • Measurable

    • Defined in terms of its inputs, outputs, and time horizon

  2. Acquire the data and:

    • Document all data sources

    • Ensure the data is accessible within the repository or reproducible via scripts

    • Provide a data dictionary describing variables, measurements, and sources

  3. Design a database schema that:

    • Integrates all data sources into a single SQL database

    • Is implemented using the acquired data

  4. Explore the data and document the findings with:

    • Descriptive statistics

    • Simple visualizations

    • Notes on missing values, anomalies, or behaviors requiring future attention
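Step 3 above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the two tables (stations and readings), their columns, and the sample rows are hypothetical stand-ins for whatever sources your own project acquires, and the function names are made up for this example.

```python
import sqlite3

# Hypothetical schema: two sources (a station list and daily readings)
# integrated into one database through a shared station_id key.
SCHEMA = """
CREATE TABLE stations (
    station_id INTEGER PRIMARY KEY,
    city       TEXT NOT NULL
);
CREATE TABLE readings (
    reading_id INTEGER PRIMARY KEY,
    station_id INTEGER NOT NULL REFERENCES stations(station_id),
    date       TEXT NOT NULL,   -- ISO 8601 date string
    temp_c     REAL             -- NULL when the measurement is missing
);
"""


def build_database(path=":memory:"):
    """Create the schema and load the (hypothetical) acquired data."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO stations (station_id, city) VALUES (?, ?)",
        [(1, "Lima"), (2, "Quito")],
    )
    conn.executemany(
        "INSERT INTO readings (station_id, date, temp_c) VALUES (?, ?, ?)",
        [
            (1, "2024-01-01", 24.5),
            (1, "2024-01-02", None),  # a missing value, kept as NULL
            (2, "2024-01-01", 14.0),
        ],
    )
    conn.commit()
    return conn


def mean_temp_by_city(conn):
    """Join both sources; COUNT and AVG skip NULL temperatures."""
    query = """
        SELECT s.city, COUNT(r.temp_c), AVG(r.temp_c)
        FROM stations AS s
        LEFT JOIN readings AS r ON r.station_id = s.station_id
        GROUP BY s.city
        ORDER BY s.city
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_database()
    print(mean_temp_by_city(conn))
    # [('Lima', 1, 24.5), ('Quito', 1, 14.0)]
```

Passing a file path such as "project.db" instead of ":memory:" would write the database to disk, which is one way to make the .db file reproducible from a script.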

In summary:
By the end of Milestone 1, the reader will know what the problem is, how the data was acquired, and how it is stored, and will have a deep understanding of the dataset.
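As a rough illustration of the exploration stage, the sketch below uses pandas to produce descriptive statistics, count missing values, and flag possible anomalies. The sample data, the function names, and the 1.5 × IQR rule are assumptions chosen for this example; your own project may call for different checks.

```python
import pandas as pd

# Hypothetical sample; a real project would load this from its own database.
readings = pd.DataFrame(
    {
        "station_id": [1, 1, 1, 2, 2, 2],
        "temp_c": [24.5, None, 25.1, 14.0, 13.6, 95.0],
    }
)


def summarize(df):
    """Descriptive statistics per numeric column, plus a missing-value count."""
    summary = df.describe().T
    summary["n_missing"] = df.isna().sum()
    return summary


def flag_iqr_outliers(series, k=1.5):
    """Mark values more than k * IQR outside the quartiles (NaN is never flagged)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


if __name__ == "__main__":
    print(summarize(readings))
    suspicious = readings[flag_iqr_outliers(readings["temp_c"])]
    print(suspicious)  # rows worth a note in the exploration diary entry
```

With matplotlib installed, readings["temp_c"].plot.hist() would give a simple first visualization of the same column.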


Required Deliverables

The repository must include the following:

  1. Diary Folder
    A folder named diary/ containing five .txt files documenting:

    • Problem formulation

    • Data acquisition (sources and relevance)

    • Data acquisition II (database storage)

    • Data exploration

    • Reflection and next steps

  2. Database Schema

    • One .png image showing the database schema

  3. Data Dictionary

    • One .pdf describing variables, measurements, and sources

  4. Code Implementation

    • One organized .ipynb or multiple .py files containing:

      1. Problem formulation

      2. Data acquisition

      3. Data acquisition II (database)

      4. Data exploration

  5. README.md

    • Explanation of the repository structure and instructions to reproduce the work

  6. requirements.txt

    • A list of project dependencies with versions

  7. Database File or Access Instructions

    • The .db file or clear instructions on how to generate/access it

  8. Version Control

    • At least four meaningful commits over the five-week period

    • Committing each of the five diary files as it is written already yields at least five commits
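Before submitting, it may help to check the repository against the list above automatically. The following is a minimal sketch, not an official checker: the function name is hypothetical, it assumes the .png and .pdf may live anywhere in the repository, and it does not verify the database file or the commit history.

```python
from pathlib import Path


def check_deliverables(repo):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    repo = Path(repo)
    problems = []

    # Deliverable 1: diary/ with five .txt files.
    diary = repo / "diary"
    n_diary = len(list(diary.glob("*.txt"))) if diary.is_dir() else 0
    if n_diary != 5:
        problems.append(f"diary/ should contain five .txt files, found {n_diary}")

    # Deliverables 5 and 6: files expected at the repository root.
    for name in ("README.md", "requirements.txt"):
        if not (repo / name).is_file():
            problems.append(f"missing {name}")

    # Deliverables 2 and 3: searched recursively (location is an assumption).
    for pattern, label in (
        ("*.png", "schema image (.png)"),
        ("*.pdf", "data dictionary (.pdf)"),
    ):
        if not any(repo.rglob(pattern)):
            problems.append(f"missing {label}")

    # Deliverable 4: a notebook or at least one script.
    if not any(repo.rglob("*.ipynb")) and not any(repo.rglob("*.py")):
        problems.append("missing code (.ipynb or .py)")

    return problems


if __name__ == "__main__":
    for problem in check_deliverables("."):
        print("-", problem)
```

Run from the repository root, the script prints one line per missing item and nothing when every check passes.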


Milestone 1 Grading Rubric (200 Points Total)

1. Diary Documentation (40 points)

  • Clear, concise entries for all five stages

  • Reflective, thoughtful, and connected to decisions made

  • Professional tone and consistent formatting

2. Problem Formulation (30 points)

  • Data-driven, concrete, and measurable

  • Clearly defined inputs, outputs, and time horizon

  • Strong alignment between question and data

3. Data Acquisition & Documentation (30 points)

  • Data sources clearly identified and justified

  • Data reproducible or included in repository

  • Data dictionary is complete and well-organized

4. Database Schema & Implementation (30 points)

  • Logical schema design

  • Successful integration of all data sources

  • Clear schema visualization and functioning database

5. Data Exploration (30 points)

  • Appropriate descriptive statistics

  • Clear, readable visualizations

  • Explicit notes on missing values and anomalies

6. Code Quality & Organization (20 points)

  • Well-structured files and sections

  • Readable, commented, and reproducible code

  • Clear separation of project stages

7. Repository Organization & Professionalism (10 points)

  • Clean directory structure

  • Clear README instructions

  • Consistent naming conventions

8. Version Control Practices (10 points)

  • Regular, meaningful commits

  • Clear commit messages reflecting progress

Total: 200 points