Milestone 1#

Along with the book and lab exercises, readers are strongly encouraged to build a project of their own. This project is designed to be developed incrementally as you progress through the book. Each chapter builds upon the previous one, allowing you to gradually enhance and refine your work.

The project serves two main purposes:

To help you apply the concepts learned in the book in a realistic context
To produce a tangible artifact that can be showcased in a professional portfolio

The overarching goal is to provide hands-on experience with the end-to-end data science workflow, preparing you to tackle real-world data challenges.

This repository focuses on Milestone 1, which covers the early stages of that workflow: problem formulation, data acquisition, database storage, and data exploration.

Technical Environment#

The recommended technical stack for this project is intentionally lightweight and suitable for rapid prototyping:

Python
pandas for data manipulation
matplotlib and seaborn for static visualizations
plotly for interactive visualizations
Streamlit for dashboards and prototypes
Additional libraries (e.g., NumPy, scikit-learn) as needed

The project may be implemented as a single Jupyter Notebook or split into multiple scripts. What matters most is that each stage is clearly documented and logically connected to the problem formulation.

Scope and Expectations#

Milestone 1 is designed to take approximately five weeks of work and should be developed in parallel with reading the book.

By the end of this milestone, the reader should:

Clearly define a data-driven problem statement that is:
- Concrete
- Measurable
- Specified with defined inputs, outputs, and a time horizon
Acquire the data and:
- Document all data sources
- Ensure the data is accessible within the repository or reproducible via scripts
- Provide a data dictionary describing variables, measurements, and sources
Design a database schema that:
- Integrates all data sources into a single SQL database
- Is implemented using the acquired data
Explore the data using:
- Descriptive statistics
- Simple visualizations
- Notes on missing values, anomalies, or behaviors requiring future attention
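
The exploration step above can be sketched with pandas. This is a minimal illustration, not a required implementation: the DataFrame, its column names, and the choice of the 1.5 * IQR rule for flagging anomalies are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for the project's dataset.
df = pd.DataFrame(
    {
        "price": [10.0, 12.5, None, 11.0, 250.0],
        "units": [3, 4, 2, None, 5],
        "region": ["north", "south", "south", None, "north"],
    }
)

# Descriptive statistics for the numeric columns.
summary = df.describe()

# Count and percentage of missing values per column.
missing = pd.DataFrame(
    {"n_missing": df.isna().sum(), "pct_missing": df.isna().mean() * 100}
)

# Flag potential anomalies with the 1.5 * IQR rule (one common heuristic,
# chosen here only for illustration).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

print(summary)
print(missing)
print(outliers)
```

The `missing` and `outliers` tables are the kind of evidence that can be copied into the exploration notes, alongside a short comment on what each finding means for later stages.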

In summary:
By the end of Milestone 1, the reader will know what the problem is, how the data was acquired, and how it is stored, and will have a deep understanding of the dataset.

Required Deliverables#

The repository must include the following:

Diary Folder
A folder named diary/ containing five .txt files documenting:
- Problem formulation
- Data acquisition (sources and relevance)
- Data acquisition II (database storage)
- Data exploration
- Reflection and next steps
Database Schema
- One .png image showing the database schema
Data Dictionary
- One .pdf describing variables, measurements, and sources
Code Implementation
- One organized .ipynb or multiple .py files containing:
  1. Problem formulation
  2. Data acquisition
  3. Data acquisition II (database)
  4. Data exploration
README.md
- Explanation of the repository structure and instructions to reproduce the work
requirements.txt
- A list of project dependencies with versions
Database File or Access Instructions
- The .db file or clear instructions on how to generate/access it
Version Control
- At least four meaningful commits over the five-week period
- Committing each of the five diary files as it is written naturally accounts for five commits
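
For the requirements.txt deliverable, one possible way to obtain pinned versions is to read them from the current environment with the standard-library `importlib.metadata` module. The package list below is an assumption based on the recommended stack; the helper name `pinned_requirements` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical list of packages; adjust it to what the project imports.
PACKAGES = ["pandas", "matplotlib", "seaborn", "plotly", "streamlit"]


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages installed here."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Skip packages that are not installed in this environment.
            continue
    return lines


if __name__ == "__main__":
    print("\n".join(pinned_requirements(PACKAGES)))
```

Redirecting the printed lines into requirements.txt gives a file that lists only the packages the project names explicitly, which tends to be easier to read than a full dump of the environment.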

Milestone 1 Grading Rubric (200 Points Total)#

1. Diary Documentation (40 points)#

Clear, concise entries for all five stages
Reflective, thoughtful, and connected to decisions made
Professional tone and consistent formatting

2. Problem Formulation (30 points)#

Data-driven, concrete, and measurable
Clearly defined inputs, outputs, and time horizon
Strong alignment between question and data

3. Data Acquisition & Documentation (30 points)#

Data sources clearly identified and justified
Data reproducible or included in repository
Data dictionary is complete and well-organized
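
A data dictionary can be started from the data itself. The sketch below builds one row per column with pandas and leaves the description and unit fields blank to be written by hand; the function name, the example DataFrame, and the source label are hypothetical, and turning the resulting table into the required PDF is left to the reader.

```python
import pandas as pd


def data_dictionary_skeleton(df, source):
    """Build one row per column; descriptions are left blank to fill in by hand."""
    rows = []
    for column in df.columns:
        non_null = df[column].dropna()
        rows.append(
            {
                "variable": column,
                "dtype": str(df[column].dtype),
                "n_missing": int(df[column].isna().sum()),
                "example": non_null.iloc[0] if not non_null.empty else None,
                "description": "",  # written manually
                "unit": "",  # written manually
                "source": source,
            }
        )
    return pd.DataFrame(rows)


# Hypothetical example data and source name.
sales = pd.DataFrame({"price": [10.0, None], "region": ["north", "south"]})
dictionary = data_dictionary_skeleton(sales, source="sales.csv (hypothetical)")
print(dictionary.to_string(index=False))
```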

4. Database Schema & Implementation (30 points)#

Logical schema design
Successful integration of all data sources
Clear schema visualization and functioning database
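
As an illustration of integrating two sources into a single SQL database, the sketch below uses Python's built-in `sqlite3` module. The table names, columns, and sample rows are invented for the example; a real project would derive them from its own data and pass a `.db` file path instead of `:memory:`.

```python
import sqlite3

# Hypothetical schema: two sources linked by a foreign key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stations (
    station_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS readings (
    reading_id  INTEGER PRIMARY KEY,
    station_id  INTEGER NOT NULL REFERENCES stations(station_id),
    measured_on TEXT NOT NULL,
    value       REAL
);
"""


def build_database(path, stations, readings):
    """Create the tables and load rows from both sources."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign key
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO stations VALUES (?, ?)", stations)
    conn.executemany(
        "INSERT INTO readings (station_id, measured_on, value) VALUES (?, ?, ?)",
        readings,
    )
    conn.commit()
    return conn


# In-memory database for the sketch; a real project would pass a .db path.
conn = build_database(
    ":memory:",
    stations=[(1, "Central"), (2, "Harbor")],
    readings=[(1, "2024-01-01", 3.5), (2, "2024-01-01", None)],
)
rows = conn.execute(
    "SELECT s.name, r.value FROM readings r JOIN stations s USING (station_id) "
    "ORDER BY s.name"
).fetchall()
print(rows)
```

A join query like the last one is a quick check that the sources are actually connected through the schema rather than merely stored side by side.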

5. Data Exploration (30 points)#

Appropriate descriptive statistics
Clear, readable visualizations
Explicit notes on missing values and anomalies

6. Code Quality & Organization (20 points)#

Well-structured files and sections
Readable, commented, and reproducible code
Clear separation of project stages
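
One way to keep project stages clearly separated in a script is to give each stage its own function and connect them in a short `main`. The skeleton below is only a sketch: the function names and the in-memory stand-ins for the data and the database are hypothetical placeholders.

```python
def acquire():
    """Data acquisition: return raw records (hypothetical in-memory stand-in)."""
    return [{"id": 1, "value": 3.5}, {"id": 2, "value": None}]


def store(records):
    """Data acquisition II: placeholder for loading records into the database."""
    return {record["id"]: record for record in records}


def explore(table):
    """Data exploration: report simple facts about the stored data."""
    missing = sum(1 for row in table.values() if row["value"] is None)
    return {"rows": len(table), "missing_values": missing}


def main():
    """Run the stages in order so each one feeds the next."""
    records = acquire()
    table = store(records)
    return explore(table)


if __name__ == "__main__":
    print(main())
```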

7. Repository Organization & Professionalism (10 points)#

Clean directory structure
Clear README instructions
Consistent naming conventions

8. Version Control Practices (10 points)#

Regular, meaningful commits
Clear commit messages reflecting progress

Total: 200 points