6.4.4. Scikit-learn Pipelines#

Here’s a common pattern in beginner code:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

This works in notebooks. But what happens at inference time, six months later, when a new row of data arrives? You need to remember to apply the same scaler, in the same way, with the same fitted parameters. If you forget — or apply it incorrectly — your model silently produces wrong predictions.

Pipelines solve this by bundling preprocessing and modelling into a single object that behaves exactly like a model.

6.4.4.1. The Core Idea#

A Pipeline is a sequence of steps. Each step (except the last) must be a transformer that implements fit and transform. The last step is typically an estimator (model) that implements fit and predict.
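The transformer contract is small enough to implement by hand. As a hedged sketch (the class name and percentile bounds here are illustrative, not part of the running example), a custom step that satisfies it via fit and transform:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Illustrative transformer: clip each column to percentile bounds
    learned during fit, so new data is clipped with TRAINING statistics."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column bounds from the training data only
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the learned bounds; no re-fitting at transform time
        return np.clip(X, self.lo_, self.hi_)

X_demo = np.array([[0.0], [1.0], [2.0], [100.0]])
clipper = ClipOutliers().fit(X_demo)
clipped = clipper.transform(np.array([[500.0]]))
```

Because the class inherits from TransformerMixin, it gets fit_transform for free and can be dropped into a Pipeline like any built-in transformer.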

When you call pipeline.fit(X_train, y_train), it runs fit_transform on each transformer in order, then fit on the final estimator. When you call pipeline.predict(X_new), it runs transform at each step, then predict — all in one call.
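That equivalence can be checked directly: the pipeline's fit and predict produce exactly what the manual chain produces. A minimal sketch on a synthetic dataset (the data and variable names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Pipeline: one object, one fit, one predict
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=0)),
])
pipe.fit(X, y)
pred_pipe = pipe.predict(X)

# Equivalent manual chain: fit_transform then fit; transform then predict
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
model = LogisticRegression(max_iter=1000, random_state=0)
model.fit(Xs, y)
pred_manual = model.predict(scaler.transform(X))

# Identical computations, identical predictions
assert (pred_pipe == pred_manual).all()
```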


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myst_nb import glue
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

6.4.4.2. Building a Simple Pipeline#

Start with the most common pattern: scale features, then fit a model.

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Without pipeline (manual, error-prone)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
lr_manual = LogisticRegression(max_iter=1000, random_state=42)
lr_manual.fit(X_train_scaled, y_train)
score_manual = lr_manual.score(X_test_scaled, y_test)

# With pipeline (clean, safe)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000, random_state=42)),
])
pipeline.fit(X_train, y_train)
score_pipeline = pipeline.score(X_test, y_test)

# Inference
new_patient  = X_test[0:1]
prediction   = pipeline.predict(new_patient)
probability  = pipeline.predict_proba(new_patient)

display(pd.DataFrame([
    {'Approach': 'Manual (fit scaler separately)', 'Test Accuracy': round(score_manual,   4)},
    {'Approach': 'Pipeline (scaler + model)',       'Test Accuracy': round(score_pipeline, 4)},
]))

display(pd.DataFrame([
    {'Sample': 1, 'Predicted class': int(prediction[0]),
     'P(class 0)': round(float(probability[0, 0]), 3),
     'P(class 1)': round(float(probability[0, 1]), 3)},
]))
  Approach                        Test Accuracy
0 Manual (fit scaler separately)         0.9825
1 Pipeline (scaler + model)              0.9825

  Sample  Predicted class  P(class 0)  P(class 1)
0      1                0         1.0         0.0

6.4.4.3. Why Pipelines Prevent Data Leakage#

This is the most important reason to use pipelines when doing cross-validation.

The wrong way (data leakage):

# BUG: scaler sees ALL data before the split!
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=5)

The scaler has been fitted on the full dataset, including what will be used as validation data in each fold. Information from the validation set has leaked into the training process. The evaluation is overly optimistic.

The right way (with a pipeline):

# Correct: scaler is re-fitted inside each CV fold, on training data only
pipeline = Pipeline([('scaler', StandardScaler()), ('model', model)])
scores = cross_val_score(pipeline, X, y, cv=5)

model = LogisticRegression(max_iter=1000, random_state=42)

# Leaky: fit scaler on all data first
scaler_leaky = StandardScaler()
X_leaky      = scaler_leaky.fit_transform(X)
leaky_scores = cross_val_score(model, X_leaky, y, cv=5, scoring='accuracy')

# Correct: pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
correct_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

display(pd.DataFrame([
    {'Approach': 'Leaky (scaler fit on all data)', 'Mean CV Accuracy': round(leaky_scores.mean(), 4),   'Std': round(leaky_scores.std(), 4)},
    {'Approach': 'Pipeline (correct)',             'Mean CV Accuracy': round(correct_scores.mean(), 4), 'Std': round(correct_scores.std(), 4)},
]))
  Approach                        Mean CV Accuracy     Std
0 Leaky (scaler fit on all data)            0.9807  0.0065
1 Pipeline (correct)                        0.9807  0.0065

Here the two rows agree to four decimal places: standard scaling is a mild, target-blind transform, so the leak barely moves the estimate on this dataset. Do not let that reassure you. With transforms that consult the target, such as feature selection or target encoding, leakage can inflate cross-validation scores dramatically.
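Standard scaling never looks at the labels, so it leaks only weakly. To watch leakage bite, pair cross-validation with a transform that does consult the target. A classic demonstration, sketched here on pure noise with illustrative sizes, uses univariate feature selection:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))  # pure noise: no real signal
y = rng.integers(0, 2, 100)           # random binary labels

# Leaky: pick the 20 features most correlated with y using ALL rows,
# then cross-validate on the already-selected features
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5)

# Correct: selection happens inside each fold, on training rows only
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('model', LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5)

print(f'leaky: {leaky.mean():.2f}, honest: {honest.mean():.2f}')
# Expect the leaky estimate well above the honest one, which should
# sit near chance level, since there is no signal to find
```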

6.4.4.4. Handling Mixed Data: ColumnTransformer#

Real datasets have numeric and categorical columns that need different preprocessing. ColumnTransformer lets you apply different transformers to different column subsets, and Pipeline wraps the whole thing.

np.random.seed(42)
n = 500

df = pd.DataFrame({
    'age':       np.random.randint(18, 70, n).astype(float),
    'income':    np.random.exponential(50000, n),
    'education': np.random.choice(['HighSchool', 'Bachelor', 'Masters', 'PhD'], n),
    'city':      np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
})
df.loc[np.random.choice(n, 30, replace=False), 'age']    = np.nan
df.loc[np.random.choice(n, 20, replace=False), 'income'] = np.nan
y_mixed = (df['income'].fillna(df['income'].median()) > 50000).astype(int)

numeric_features     = ['age', 'income']
categorical_features = ['education', 'city']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer,     numeric_features),
    ('cat', categorical_transformer, categorical_features),
])
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(df, y_mixed, test_size=0.2, random_state=42)
full_pipeline.fit(X_tr, y_tr)
cv_scores = cross_val_score(full_pipeline, df, y_mixed, cv=5, scoring='accuracy')

glue('ct-test-acc', round(full_pipeline.score(X_te, y_te), 3), display=False)
glue('ct-cv-mean',  round(cv_scores.mean(), 3),                 display=False)
glue('ct-cv-std',   round(cv_scores.std(),  3),                 display=False)

display(df.head(3))
    age         income   education     city
0  56.0  154905.174928  HighSchool       LA
1  69.0   66950.870244    Bachelor  Chicago
2  46.0   40411.524097     Masters      NYC

Pipeline test accuracy: 1.0. 5-fold CV: 0.998 ± 0.004. The near-perfect scores are an artifact of this toy setup: the label was derived directly from the income column, so the model only has to rediscover a threshold.

6.4.4.5. Tuning Pipeline Hyperparameters#

Pipelines integrate directly with GridSearchCV. Parameter names follow the pattern stepname__parametername: the step name, a double underscore, then the parameter name. For example, model__C targets the C parameter of the step named 'model'.

pipe_tune = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000, random_state=42)),
])
param_grid = {
    'model__C':       [0.01, 0.1, 1.0, 10.0],
    'model__penalty': ['l1', 'l2'],
    'model__solver':  ['liblinear'],
}
search = GridSearchCV(pipe_tune, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

glue('pipe-best-C',    search.best_params_['model__C'],  display=False)
glue('pipe-cv-score',  round(search.best_score_, 3),     display=False)
glue('pipe-test-score',round(search.score(X_test, y_test), 3), display=False)

display(pd.DataFrame([{
    'Best C':        search.best_params_['model__C'],
    'Best penalty':  search.best_params_['model__penalty'],
    'Best CV score': round(search.best_score_, 3),
    'Test score':    round(search.score(X_test, y_test), 3),
}]))
   Best C Best penalty  Best CV score  Test score
0     0.1           l2           0.98       0.982

6.4.4.6. Using the Pipeline at Inference Time#

The whole point of a pipeline is that inference is clean. The fitted object handles everything.

best_pipe   = search.best_estimator_
new_data    = pd.DataFrame(X_test[:3], columns=cancer.feature_names)
predictions = best_pipe.predict(new_data.values)
probs       = best_pipe.predict_proba(new_data.values)

display(new_data.iloc[:, :4].round(2))

display(pd.DataFrame([
    {'Sample': i+1,
     'Predicted label': cancer.target_names[pred],
     'Confidence': f'{prob.max():.2%}'}
    for i, (pred, prob) in enumerate(zip(predictions, probs))
]))
   mean radius  mean texture  mean perimeter  mean area
0        19.55         28.77          133.60     1207.0
1        11.13         16.62           70.47      381.1
2        13.82         24.49           92.33      595.9

   Sample Predicted label Confidence
0       1       malignant    100.00%
1       2          benign     99.92%
2       3       malignant     95.88%

Tip

When you save and reload a trained pipeline (covered in the next section), preprocessing is included automatically. You hand a raw row of data to the loaded pipeline, and it produces a prediction — exactly as during training.
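As a preview (the next section covers persistence properly), a minimal save/load sketch with joblib; the filename and data here are illustrative:

```python
import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(X, y)

joblib.dump(pipe, 'pipeline.joblib')  # scaler AND model, one file
loaded = joblib.load('pipeline.joblib')

# The loaded object accepts raw, unscaled rows directly
same = np.array_equal(loaded.predict(X[:5]), pipe.predict(X[:5]))
```

Because the scaler travels inside the pickled object, there is no separate preprocessing artifact to version, ship, or forget.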

6.4.4.7. Pipeline Inspection#

You can access any step in a fitted pipeline by name.

steps_df = pd.DataFrame([
    {'Step name': name, 'Class': type(step).__name__}
    for name, step in best_pipe.steps
])
display(steps_df)

fitted_scaler = best_pipe.named_steps['scaler']
fitted_model  = best_pipe.named_steps['model']

display(pd.DataFrame([
    {'Property': 'Scaler mean (first 3 features)', 'Value': str(fitted_scaler.mean_[:3].round(3))},
    {'Property': 'Model C',                        'Value': fitted_model.C},
    {'Property': 'Coefficient shape',              'Value': str(fitted_model.coef_.shape)},
]))
  Step name               Class
0    scaler      StandardScaler
1     model  LogisticRegression

  Property                        Value
0 Scaler mean (first 3 features)  [14.067 19.247 91.557]
1 Model C                         0.1
2 Coefficient shape               (1, 30)

6.4.4.8. Summary#

Pattern                                      When to Use
Pipeline([('scaler', ...), ('model', ...)])  Any time you preprocess before modelling
ColumnTransformer                            Mixed numeric and categorical columns
Pipeline + cross_val_score                   Leak-free cross-validation
Pipeline + GridSearchCV                      Hyperparameter tuning across preprocessing and model
pipeline.predict(raw_data)                   Clean inference in production

A pipeline is not just a convenience — it is a contract. It guarantees that the exact same sequence of operations that was applied during training will be applied to every new sample at prediction time. Build pipelines from day one.