6.4.4. Scikit-learn Pipelines#
Here’s a common pattern in beginner code:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
This works in notebooks. But what happens at inference time, six months later, when a new row of data arrives? You need to remember to apply the same scaler, in the same way, with the same fitted parameters. If you forget — or apply it incorrectly — your model silently produces wrong predictions.
Pipelines solve this by bundling preprocessing and modelling into a single object that behaves exactly like a model.
6.4.4.1. The Core Idea#
A Pipeline is a sequence of steps. Each step (except the last) must be a transformer that implements fit and transform. The last step is typically an estimator (model) that implements fit and predict.
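The transformer contract is easy to satisfy yourself. Here is a minimal sketch of a hypothetical custom step (ClipOutliers is invented for illustration; any class with fit and transform works as a non-final pipeline step):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip each feature to percentile bounds learned during fit."""

    def fit(self, X, y=None):
        # Learn per-feature bounds from the training data only.
        self.lo_ = np.percentile(X, 1, axis=0)
        self.hi_ = np.percentile(X, 99, axis=0)
        return self  # fit must return self so fit_transform chains

    def transform(self, X):
        # Apply the bounds learned at fit time.
        return np.clip(X, self.lo_, self.hi_)

X = np.array([[0.0], [1.0], [2.0], [100.0]])
clipper = ClipOutliers().fit(X)
print(clipper.transform(X).max())  # the extreme value 100.0 is clipped down
```

Inheriting from BaseEstimator and TransformerMixin gives the class get_params/set_params and fit_transform for free, so it drops into a Pipeline like any built-in transformer.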
When you call pipeline.fit(X_train, y_train), it runs fit_transform on each transformer in order, then fit on the final estimator. When you call pipeline.predict(X_new), it runs transform at each step, then predict — all in one call.
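That chaining can be verified directly: calling predict on a fitted pipeline gives exactly the same result as manually transforming with the fitted scaler and then predicting with the fitted model. A small sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Replay the chain by hand: transform with the fitted scaler,
# then predict with the fitted model.
manual = pipe.named_steps['model'].predict(
    pipe.named_steps['scaler'].transform(X)
)
assert np.array_equal(pipe.predict(X), manual)
```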
6.4.4.2. Building a Simple Pipeline#
Start with the most common pattern: scale features, then fit a model.
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Without pipeline (manual, error-prone)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr_manual = LogisticRegression(max_iter=1000, random_state=42)
lr_manual.fit(X_train_scaled, y_train)
score_manual = lr_manual.score(X_test_scaled, y_test)
# With pipeline (clean, safe)
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000, random_state=42)),
])
pipeline.fit(X_train, y_train)
score_pipeline = pipeline.score(X_test, y_test)
# Inference
new_patient = X_test[0:1]
prediction = pipeline.predict(new_patient)
probability = pipeline.predict_proba(new_patient)
display(pd.DataFrame([
{'Approach': 'Manual (fit scaler separately)', 'Test Accuracy': round(score_manual, 4)},
{'Approach': 'Pipeline (scaler + model)', 'Test Accuracy': round(score_pipeline, 4)},
]))
display(pd.DataFrame([
{'Sample': 1, 'Predicted class': int(prediction[0]),
'P(class 0)': round(float(probability[0, 0]), 3),
'P(class 1)': round(float(probability[0, 1]), 3)},
]))
| | Approach | Test Accuracy |
|---|---|---|
| 0 | Manual (fit scaler separately) | 0.9825 |
| 1 | Pipeline (scaler + model) | 0.9825 |
| | Sample | Predicted class | P(class 0) | P(class 1) |
|---|---|---|---|---|
| 0 | 1 | 0 | 1.0 | 0.0 |
6.4.4.3. Why Pipelines Prevent Data Leakage#
This is the most important reason to use pipelines when doing cross-validation.
The wrong way (data leakage):
# BUG: scaler sees ALL data before the split!
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=5)
The scaler has been fitted on the full dataset, including what will be used as validation data in each fold. Information from the validation set has leaked into the training process. The evaluation is overly optimistic.
The right way (with a pipeline):
# Correct: scaler is re-fitted inside each CV fold, on training data only
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
scores = cross_val_score(pipeline, X, y, cv=5)
model = LogisticRegression(max_iter=1000, random_state=42)
# Leaky: fit scaler on all data first
scaler_leaky = StandardScaler()
X_leaky = scaler_leaky.fit_transform(X)
leaky_scores = cross_val_score(model, X_leaky, y, cv=5, scoring='accuracy')
# Correct: pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000, random_state=42)),
])
correct_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
display(pd.DataFrame([
{'Approach': 'Leaky (scaler fit on all data)', 'Mean CV Accuracy': round(leaky_scores.mean(), 4), 'Std': round(leaky_scores.std(), 4)},
{'Approach': 'Pipeline (correct)', 'Mean CV Accuracy': round(correct_scores.mean(), 4), 'Std': round(correct_scores.std(), 4)},
]))
| | Approach | Mean CV Accuracy | Std |
|---|---|---|---|
| 0 | Leaky (scaler fit on all data) | 0.9807 | 0.0065 |
| 1 | Pipeline (correct) | 0.9807 | 0.0065 |

On this dataset the two numbers happen to coincide: standardization leaks only per-feature means and variances, which barely shifts these folds. With transformers that use more information, such as feature selection or target encoding, the leaky estimate can be substantially optimistic.
6.4.4.4. Handling Mixed Data: ColumnTransformer#
Real datasets have numeric and categorical columns that need different preprocessing. ColumnTransformer lets you apply different transformers to different column subsets, and Pipeline wraps the whole thing.
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age': np.random.randint(18, 70, n).astype(float),
'income': np.random.exponential(50000, n),
'education': np.random.choice(['HighSchool', 'Bachelor', 'Masters', 'PhD'], n),
'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
})
df.loc[np.random.choice(n, 30, replace=False), 'age'] = np.nan
df.loc[np.random.choice(n, 20, replace=False), 'income'] = np.nan
y_mixed = (df['income'].fillna(df['income'].median()) > 50000).astype(int)
numeric_features = ['age', 'income']
categorical_features = ['education', 'city']
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
X_tr, X_te, y_tr, y_te = train_test_split(df, y_mixed, test_size=0.2, random_state=42)
full_pipeline.fit(X_tr, y_tr)
cv_scores = cross_val_score(full_pipeline, df, y_mixed, cv=5, scoring='accuracy')
glue('ct-test-acc', round(full_pipeline.score(X_te, y_te), 3), display=False)
glue('ct-cv-mean', round(cv_scores.mean(), 3), display=False)
glue('ct-cv-std', round(cv_scores.std(), 3), display=False)
display(df.head(3))
| | age | income | education | city |
|---|---|---|---|---|
| 0 | 56.0 | 154905.174928 | HighSchool | LA |
| 1 | 69.0 | 66950.870244 | Bachelor | Chicago |
| 2 | 46.0 | 40411.524097 | Masters | NYC |
Pipeline test accuracy: 1.0. 5-fold CV: 0.998 ± 0.004. (The near-perfect scores are expected here: the synthetic target is a threshold on the income column, so the model only has to recover that rule.)
6.4.4.5. Tuning Pipeline Hyperparameters#
Pipelines integrate directly with GridSearchCV. Parameter names follow the pattern stepname__parametername.
pipe_tune = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000, random_state=42)),
])
param_grid = {
'model__C': [0.01, 0.1, 1.0, 10.0],
'model__penalty': ['l1', 'l2'],
'model__solver': ['liblinear'],
}
search = GridSearchCV(pipe_tune, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
glue('pipe-best-C', search.best_params_['model__C'], display=False)
glue('pipe-cv-score', round(search.best_score_, 3), display=False)
glue('pipe-test-score', round(search.score(X_test, y_test), 3), display=False)
display(pd.DataFrame([{
'Best C': search.best_params_['model__C'],
'Best penalty': search.best_params_['model__penalty'],
'Best CV score': round(search.best_score_, 3),
'Test score': round(search.score(X_test, y_test), 3),
}]))
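The step__parameter names accepted by the grid are not guessed; every valid name can be listed from the pipeline itself via get_params(). A quick sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Keys containing '__' are the nested step parameters GridSearchCV accepts.
tunable = [k for k in pipe.get_params() if '__' in k]
print('model__C' in tunable)           # the grid key used above
print('scaler__with_mean' in tunable)  # preprocessing is tunable too
```

Note that preprocessing parameters can be searched in the same grid as model parameters, since both live in one estimator.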
| | Best C | Best penalty | Best CV score | Test score |
|---|---|---|---|---|
| 0 | 0.1 | l2 | 0.98 | 0.982 |
6.4.4.6. Using the Pipeline at Inference Time#
The whole point of a pipeline is that inference is clean. The fitted object handles everything.
best_pipe = search.best_estimator_
new_data = pd.DataFrame(X_test[:3], columns=cancer.feature_names)
predictions = best_pipe.predict(new_data.values)
probs = best_pipe.predict_proba(new_data.values)
display(new_data.iloc[:, :4].round(2))
display(pd.DataFrame([
{'Sample': i+1,
'Predicted label': cancer.target_names[pred],
'Confidence': f'{prob.max():.2%}'}
for i, (pred, prob) in enumerate(zip(predictions, probs))
]))
| | mean radius | mean texture | mean perimeter | mean area |
|---|---|---|---|---|
| 0 | 19.55 | 28.77 | 133.60 | 1207.0 |
| 1 | 11.13 | 16.62 | 70.47 | 381.1 |
| 2 | 13.82 | 24.49 | 92.33 | 595.9 |
| | Sample | Predicted label | Confidence |
|---|---|---|---|
| 0 | 1 | malignant | 100.00% |
| 1 | 2 | benign | 99.92% |
| 2 | 3 | malignant | 95.88% |
Tip
When you save and reload a trained pipeline (covered in the next section), preprocessing is included automatically. You hand a raw row of data to the loaded pipeline, and it produces a prediction — exactly as during training.
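As a preview, persisting a fitted pipeline is a single call with joblib (a minimal sketch; the filename is arbitrary):

```python
import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Save the whole pipeline: scaler statistics and model weights together.
joblib.dump(pipe, 'pipeline.joblib')

# Later (or in another process): load and predict on raw, unscaled data.
loaded = joblib.load('pipeline.joblib')
assert np.array_equal(loaded.predict(X), pipe.predict(X))
```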
6.4.4.7. Pipeline Inspection#
You can access any step in a fitted pipeline by name.
steps_df = pd.DataFrame([
{'Step name': name, 'Class': type(step).__name__}
for name, step in best_pipe.steps
])
display(steps_df)
fitted_scaler = best_pipe.named_steps['scaler']
fitted_model = best_pipe.named_steps['model']
display(pd.DataFrame([
{'Property': 'Scaler mean (first 3 features)', 'Value': str(fitted_scaler.mean_[:3].round(3))},
{'Property': 'Model C', 'Value': fitted_model.C},
{'Property': 'Coefficient shape', 'Value': str(fitted_model.coef_.shape)},
]))
| | Step name | Class |
|---|---|---|
| 0 | scaler | StandardScaler |
| 1 | model | LogisticRegression |
| | Property | Value |
|---|---|---|
| 0 | Scaler mean (first 3 features) | [14.067 19.247 91.557] |
| 1 | Model C | 0.1 |
| 2 | Coefficient shape | (1, 30) |
6.4.4.8. Summary#
| Pattern | When to Use |
|---|---|
| Pipeline | Any time you preprocess before modelling |
| ColumnTransformer inside a Pipeline | Mixed numeric and categorical columns |
| Pipeline passed to cross_val_score | Leak-free cross-validation |
| GridSearchCV with step__parameter names | Hyperparameter tuning across preprocessing and model |
| A fitted pipeline as a single artifact | Clean inference in production |
A pipeline is not just a convenience — it is a contract. It guarantees that the exact same sequence of operations that was applied during training will be applied to every new sample at prediction time. Build pipelines from day one.