6.4.4. Scikit-learn Pipelines#

Here’s a common pattern in beginner code:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

This works in notebooks. But what happens at inference time, six months later, when a new row of data arrives? You need to remember to apply the same scaler, in the same way, with the same fitted parameters. If you forget — or apply it incorrectly — your model silently produces wrong predictions.

Pipelines solve this by bundling preprocessing and modelling into a single object that behaves exactly like a model.

6.4.4.1. The Core Idea#

A Pipeline is a sequence of steps. Each step (except the last) must be a transformer that implements fit and transform. The last step is typically an estimator (model) that implements fit and predict.
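The transformer contract is small enough to implement by hand. As a hedged sketch (the class name and percentile bounds here are illustrative, not part of the running example), a custom step that satisfies it via fit and transform:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Illustrative transformer: clip each column to percentile bounds
    learned during fit, so new data is clipped with TRAINING statistics."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column bounds from the training data only
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the learned bounds; no re-fitting at transform time
        return np.clip(X, self.lo_, self.hi_)

X_demo = np.array([[0.0], [1.0], [2.0], [100.0]])
clipper = ClipOutliers().fit(X_demo)
clipped = clipper.transform(np.array([[500.0]]))
```

Because the class inherits from TransformerMixin, it gets fit_transform for free and can be dropped into a Pipeline like any built-in transformer.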

When you call pipeline.fit(X_train, y_train), it runs fit_transform on each transformer in order, then fit on the final estimator. When you call pipeline.predict(X_new), it runs transform at each step, then predict — all in one call.
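That equivalence can be checked directly: the pipeline's fit and predict produce exactly what the manual chain produces. A minimal sketch on a synthetic dataset (the data and variable names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Pipeline: one object, one fit, one predict
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=0)),
])
pipe.fit(X, y)
pred_pipe = pipe.predict(X)

# Equivalent manual chain: fit_transform then fit; transform then predict
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
model = LogisticRegression(max_iter=1000, random_state=0)
model.fit(Xs, y)
pred_manual = model.predict(scaler.transform(X))

# Identical computations, identical predictions
assert (pred_pipe == pred_manual).all()
```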


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

6.4.4.2. Building a Simple Pipeline#

Start with the most common pattern: scale features, then fit a model.

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Without pipeline (manual, error-prone)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
lr_manual = LogisticRegression(max_iter=1000, random_state=42)
lr_manual.fit(X_train_scaled, y_train)
score_manual = lr_manual.score(X_test_scaled, y_test)

# With pipeline (clean, safe)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000, random_state=42)),
])
pipeline.fit(X_train, y_train)
score_pipeline = pipeline.score(X_test, y_test)

# Inference
new_patient  = X_test[0:1]
prediction   = pipeline.predict(new_patient)
probability  = pipeline.predict_proba(new_patient)

display(pd.DataFrame([
    {'Approach': 'Manual (fit scaler separately)', 'Test Accuracy': round(score_manual,   4)},
    {'Approach': 'Pipeline (scaler + model)',       'Test Accuracy': round(score_pipeline, 4)},
]))

display(pd.DataFrame([
    {'Sample': 1, 'Predicted class': int(prediction[0]),
     'P(class 0)': round(float(probability[0, 0]), 3),
     'P(class 1)': round(float(probability[0, 1]), 3)},
]))
  Approach                        Test Accuracy
0 Manual (fit scaler separately)         0.9825
1 Pipeline (scaler + model)              0.9825

  Sample  Predicted class  P(class 0)  P(class 1)
0      1                0         1.0         0.0

6.4.4.3. Why Pipelines Prevent Data Leakage#

This is the most important reason to use pipelines when doing cross-validation.

The wrong way (data leakage):

# BUG: scaler sees ALL data before the split!
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=5)

The scaler has been fitted on the full dataset, including what will be used as validation data in each fold. Information from the validation set has leaked into the training process. The evaluation is overly optimistic.

The right way (with a pipeline):

# Correct: scaler is re-fitted inside each CV fold, on training data only
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
scores = cross_val_score(pipeline, X, y, cv=5)

model = LogisticRegression(max_iter=1000, random_state=42)

# Leaky: fit scaler on all data first
scaler_leaky = StandardScaler()
X_leaky      = scaler_leaky.fit_transform(X)
leaky_scores = cross_val_score(model, X_leaky, y, cv=5, scoring='accuracy')

# Correct: pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
correct_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

display(pd.DataFrame([
    {'Approach': 'Leaky (scaler fit on all data)', 'Mean CV Accuracy': round(leaky_scores.mean(), 4),   'Std': round(leaky_scores.std(), 4)},
    {'Approach': 'Pipeline (correct)',             'Mean CV Accuracy': round(correct_scores.mean(), 4), 'Std': round(correct_scores.std(), 4)},
]))
  Approach                        Mean CV Accuracy     Std
0 Leaky (scaler fit on all data)            0.9807  0.0065
1 Pipeline (correct)                        0.9807  0.0065

Here the two rows agree to four decimal places: standard scaling is a mild, target-blind transform, so the leak barely moves the estimate on this dataset. Do not let that reassure you. With transforms that consult the target, such as feature selection or target encoding, leakage can inflate cross-validation scores dramatically.
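Standard scaling never looks at the labels, so it leaks only weakly. To watch leakage bite, pair cross-validation with a transform that does consult the target. A classic demonstration, sketched here on pure noise with illustrative sizes, uses univariate feature selection:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))  # pure noise: no real signal
y = rng.integers(0, 2, 100)           # random binary labels

# Leaky: pick the 20 features most correlated with y using ALL rows,
# then cross-validate on the already-selected features
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5)

# Correct: selection happens inside each fold, on training rows only
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('model', LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5)

print(f'leaky: {leaky.mean():.2f}, honest: {honest.mean():.2f}')
# Expect the leaky estimate well above the honest one, which should
# sit near chance level, since there is no signal to find
```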

6.4.4.4. Handling Mixed Data: ColumnTransformer#

Real datasets have numeric and categorical columns that need different preprocessing. ColumnTransformer lets you apply different transformers to different column subsets, and Pipeline wraps the whole thing.

np.random.seed(42)
n = 500

df = pd.DataFrame({
    'age':       np.random.randint(18, 70, n).astype(float),
    'income':    np.random.exponential(50000, n),
    'education': np.random.choice(['HighSchool', 'Bachelor', 'Masters', 'PhD'], n),
    'city':      np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
})
df.loc[np.random.choice(n, 30, replace=False), 'age']    = np.nan
df.loc[np.random.choice(n, 20, replace=False), 'income'] = np.nan
y_mixed = (df['income'].fillna(df['income'].median()) > 50000).astype(int)

numeric_features     = ['age', 'income']
categorical_features = ['education', 'city']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer,     numeric_features),
    ('cat', categorical_transformer, categorical_features),
])
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(df, y_mixed, test_size=0.2, random_state=42)
full_pipeline.fit(X_tr, y_tr)
cv_scores = cross_val_score(full_pipeline, df, y_mixed, cv=5, scoring='accuracy')

glue('ct-test-acc', round(full_pipeline.score(X_te, y_te), 3), display=False)
glue('ct-cv-mean',  round(cv_scores.mean(), 3),                 display=False)
glue('ct-cv-std',   round(cv_scores.std(),  3),                 display=False)

display(df.head(3))
    age         income   education     city
0  56.0  154905.174928  HighSchool       LA
1  69.0   66950.870244    Bachelor  Chicago
2  46.0   40411.524097     Masters      NYC

Pipeline test accuracy: 1.0. 5-fold CV: 0.998 ± 0.004. The near-perfect scores are an artifact of this toy setup: the label was derived directly from the income column, so the model only has to rediscover a threshold.

6.4.4.5. Tuning Pipeline Hyperparameters#

Pipelines integrate directly with GridSearchCV. Parameter names follow the pattern stepname__parametername: the step name, a double underscore, then the parameter name. For example, model__C targets the C parameter of the step named 'model'.

pipe_tune = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000, random_state=42)),
])
param_grid = {
    'model__C':       [0.01, 0.1, 1.0, 10.0],
    'model__penalty': ['l1', 'l2'],
    'model__solver':  ['liblinear'],
}
search = GridSearchCV(pipe_tune, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

glue('pipe-best-C',    search.best_params_['model__C'],  display=False)
glue('pipe-cv-score',  round(search.best_score_, 3),     display=False)
glue('pipe-test-score',round(search.score(X_test, y_test), 3), display=False)

display(pd.DataFrame([{
    'Best C':        search.best_params_['model__C'],
    'Best penalty':  search.best_params_['model__penalty'],
    'Best CV score': round(search.best_score_, 3),
    'Test score':    round(search.score(X_test, y_test), 3),
}]))
   Best C Best penalty  Best CV score  Test score
0     0.1           l2           0.98       0.982

6.4.4.6. Using the Pipeline at Inference Time#

The whole point of a pipeline is that inference is clean. The fitted object handles everything.

best_pipe   = search.best_estimator_
new_data    = pd.DataFrame(X_test[:3], columns=cancer.feature_names)
predictions = best_pipe.predict(new_data.values)
probs       = best_pipe.predict_proba(new_data.values)

display(new_data.iloc[:, :4].round(2))

display(pd.DataFrame([
    {'Sample': i+1,
     'Predicted label': cancer.target_names[pred],
     'Confidence': f'{prob.max():.2%}'}
    for i, (pred, prob) in enumerate(zip(predictions, probs))
]))
   mean radius  mean texture  mean perimeter  mean area
0        19.55         28.77          133.60     1207.0
1        11.13         16.62           70.47      381.1
2        13.82         24.49           92.33      595.9

   Sample Predicted label Confidence
0       1       malignant    100.00%
1       2          benign     99.92%
2       3       malignant     95.88%

Tip

When you save and reload a trained pipeline (covered in the next section), preprocessing is included automatically. You hand a raw row of data to the loaded pipeline, and it produces a prediction — exactly as during training.
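As a preview (the next section covers persistence properly), a minimal save/load sketch with joblib; the filename and data here are illustrative:

```python
import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(X, y)

joblib.dump(pipe, 'pipeline.joblib')  # scaler AND model, one file
loaded = joblib.load('pipeline.joblib')

# The loaded object accepts raw, unscaled rows directly
same = np.array_equal(loaded.predict(X[:5]), pipe.predict(X[:5]))
```

Because the scaler travels inside the pickled object, there is no separate preprocessing artifact to version, ship, or forget.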

6.4.4.7. Pipeline Inspection#

You can access any step in a fitted pipeline by name.

steps_df = pd.DataFrame([
    {'Step name': name, 'Class': type(step).__name__}
    for name, step in best_pipe.steps
])
display(steps_df)

fitted_scaler = best_pipe.named_steps['scaler']
fitted_model  = best_pipe.named_steps['model']

display(pd.DataFrame([
    {'Property': 'Scaler mean (first 3 features)', 'Value': str(fitted_scaler.mean_[:3].round(3))},
    {'Property': 'Model C',                        'Value': fitted_model.C},
    {'Property': 'Coefficient shape',              'Value': str(fitted_model.coef_.shape)},
]))
  Step name               Class
0    scaler      StandardScaler
1     model  LogisticRegression

  Property                        Value
0 Scaler mean (first 3 features)  [14.067 19.247 91.557]
1 Model C                         0.1
2 Coefficient shape               (1, 30)

6.4.4.8. Summary#

Pattern                                      When to Use
Pipeline([('scaler', ...), ('model', ...)])  Any time you preprocess before modelling
ColumnTransformer                            Mixed numeric and categorical columns
Pipeline + cross_val_score                   Leak-free cross-validation
Pipeline + GridSearchCV                      Hyperparameter tuning across preprocessing and model
pipeline.predict(raw_data)                   Clean inference in production

A pipeline is not just a convenience — it is a contract. It guarantees that the exact same sequence of operations that was applied during training will be applied to every new sample at prediction time. Build pipelines from day one.