The Data Science Novel
Text Preprocessing
TODO