8.1.1. Pickle and Joblib#
Python provides two primary libraries for serializing objects: pickle (built-in) and joblib (from the scikit-learn ecosystem). Both allow you to save Python objects to disk and reload them later, but they differ in how efficiently they handle the large numerical arrays that machine learning models rely on.
8.1.1.1. Pickle#
pickle is Python’s standard serialization module, available in the standard library with no installation required. It converts a Python object into a byte stream that can be written to a file, then later read back and reconstructed as an identical object. This works for most Python objects, including custom classes, functions, and complex data structures.
Basic Usage#
```python
import pickle

# Save an object
data = {'model': 'trained_model', 'accuracy': 0.95}
with open('model.pkl', 'wb') as f:
    pickle.dump(data, f)

# Load an object
with open('model.pkl', 'rb') as f:
    loaded_data = pickle.load(f)
```
Saving a Simple Model#
```python
from sklearn.linear_model import LogisticRegression
import pickle

# Train a model (X_train, y_train assumed to be defined)
model = LogisticRegression()
model.fit(X_train, y_train)

# Save the model
with open('logistic_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load the model
with open('logistic_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Use the loaded model
predictions = loaded_model.predict(X_test)
```
Advantages of Pickle#
- **Built-in**: No additional installation required
- **Versatile**: Works with most Python objects
- **Simple API**: Straightforward to use
Limitations of Pickle#
- **Security risk**: Never unpickle data from untrusted sources; unpickling can execute arbitrary code
- **Python-specific**: Pickled objects cannot easily be used from other languages
- **Version sensitivity**: Objects pickled under one Python or library version may fail to load under another
- **Large arrays**: Less efficient than joblib for large NumPy arrays
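The security limitation deserves emphasis. During unpickling, pickle will call whatever callable an object's `__reduce__` method specifies, which means a crafted byte stream can run arbitrary code the moment it is loaded. The following minimal, harmless sketch (the class name is invented for illustration) shows the mechanism:

```python
import pickle

# A harmless demonstration of why unpickling untrusted bytes is dangerous:
# pickle invokes whatever callable __reduce__ hands it during loading.
class NotWhatItSeems:
    def __reduce__(self):
        # A real attacker could return (os.system, ("<shell command>",)) here.
        return (print, ("this ran during unpickling",))

payload = pickle.dumps(NotWhatItSeems())
result = pickle.loads(payload)  # executes print(...) instead of rebuilding the object
```

Loading the payload never reconstructs a `NotWhatItSeems` instance at all; it simply executes the callable, which is exactly why untrusted pickle files must never be loaded.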
8.1.1.2. Joblib#
joblib is part of the scikit-learn ecosystem and was designed specifically for the kinds of objects that arise in scientific computing. The critical difference from pickle is how it handles large NumPy arrays: joblib serializes them as contiguous binary chunks rather than generic Python objects, and can optionally memory-map them from disk at load time. For most machine learning models—which store their parameters as NumPy arrays under the hood—joblib is noticeably faster than pickle and, with compression enabled, produces smaller files.
Basic Usage#
```python
import joblib

# Save an object
joblib.dump(model, 'model.joblib')

# Load an object
loaded_model = joblib.load('model.joblib')
```
Saving Complex Models#
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import joblib

# Train a model with preprocessing (X_train, y_train assumed to be defined)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_scaled, y_train)

# Save both the scaler and the model
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(model, 'random_forest.joblib')

# Load and use
loaded_scaler = joblib.load('scaler.joblib')
loaded_model = joblib.load('random_forest.joblib')
X_test_scaled = loaded_scaler.transform(X_test)
predictions = loaded_model.predict(X_test_scaled)
```
Advantages of Joblib#
- **Efficient array handling**: Better performance for large NumPy arrays
- **Faster I/O**: Optimized for scientific-computing data structures
- **Compression support**: Built-in compression options
- **Memory mapping**: Can load large arrays without reading the entire file into memory
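The memory-mapping advantage is exposed through the `mmap_mode` parameter of `joblib.load`. A brief sketch, using an array as a stand-in for model weights (the filename and sizes are illustrative):

```python
import joblib
import numpy as np

# Hypothetical stand-in for large model weights
weights = np.arange(100000, dtype=np.float64).reshape(1000, 100)
joblib.dump(weights, 'weights.joblib')

# mmap_mode='r' maps the stored array from disk instead of copying it into RAM;
# pages are read lazily as the array is accessed
mapped = joblib.load('weights.joblib', mmap_mode='r')
```

The returned object is a read-only `numpy.memmap` that behaves like an ordinary array, which is useful when a model's parameters are larger than available memory.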
Compression with Joblib#
```python
# Save with compression level 3 (0-9; higher = more compression, slower I/O)
joblib.dump(model, 'model.joblib', compress=3)

# Compression inferred automatically from the file extension
joblib.dump(model, 'model.joblib.gz')
```
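To see the effect of compression, you can compare file sizes on disk. The sketch below uses an array of zeros, which compresses extremely well; real models will see smaller (sometimes negligible) savings depending on how compressible their parameters are:

```python
import os
import joblib
import numpy as np

# Illustrative only: savings depend on how compressible the data is
data = np.zeros((1000, 1000))  # highly compressible stand-in for model weights
joblib.dump(data, 'model_raw.joblib')
joblib.dump(data, 'model_packed.joblib', compress=3)

raw_size = os.path.getsize('model_raw.joblib')
packed_size = os.path.getsize('model_packed.joblib')
print(f"uncompressed: {raw_size} bytes, compressed: {packed_size} bytes")
```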
8.1.1.3. Pickle vs Joblib: When to Use Each#
| Criterion | Pickle | Joblib |
|---|---|---|
| Small models | Good | Good |
| Large models with NumPy arrays | Slower | Faster |
| Compression needed | Manual | Built-in |
| Standard library | Built-in | Requires installation |
| scikit-learn models | Works | Optimized |
| Memory efficiency | Loads all | Memory mapping |
8.1.1.4. Best Practices#
Saving the model object alone is rarely sufficient. In practice, saving a model incorrectly—without its preprocessing steps, without metadata, or without version information—is a common source of silent bugs at deployment time. The following patterns address these concerns.
Always Save Preprocessing Components#
```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create and train a pipeline (X_train, y_train assumed to be defined)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Save the entire pipeline
joblib.dump(pipeline, 'model_pipeline.joblib')

# Load and use
pipeline = joblib.load('model_pipeline.joblib')
predictions = pipeline.predict(X_new)  # Scaling is applied automatically
```
Include Metadata#
A model file without context is difficult to audit. When you load a model weeks or months after training, you need to know when it was trained, what data it was trained on, how it was evaluated, and which library versions were used. Storing this alongside the model prevents a great deal of confusion.
```python
import joblib
from datetime import datetime

# Create a model package with metadata
model_package = {
    'model': trained_model,
    'scaler': scaler,
    'feature_names': feature_names,
    'training_date': datetime.now(),
    'accuracy': 0.95,
    'hyperparameters': {'max_depth': 5, 'n_estimators': 100}
}
joblib.dump(model_package, 'model_package.joblib')

# Load and inspect
package = joblib.load('model_package.joblib')
print(f"Model trained on: {package['training_date']}")
print(f"Accuracy: {package['accuracy']}")
predictions = package['model'].predict(package['scaler'].transform(X_new))
```
Version Your Models#
Including a version number and timestamp in the filename costs nothing and saves considerable confusion in production systems where multiple model versions may coexist or be rolled back.
```python
import joblib
from datetime import datetime

# Version naming convention: semantic version plus a sortable timestamp
version = "v1.2.0"
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"model_{version}_{timestamp}.joblib"
joblib.dump(model, filename)
```
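One payoff of this convention is that the `%Y%m%d_%H%M%S` timestamp sorts lexicographically in chronological order, so finding the newest model is a one-liner. A hypothetical helper (the function name and pattern are invented for illustration; note that multi-digit semantic versions like `v1.10.0` would not sort correctly by plain string order):

```python
import glob

# Hypothetical helper: pick the most recent file matching the naming
# convention above. The timestamp portion sorts lexicographically, so
# sorted() order is chronological within a given version.
def latest_model_path(pattern="model_v*.joblib"):
    candidates = sorted(glob.glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"No model files match {pattern!r}")
    return candidates[-1]
```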
Handle Errors Gracefully#
```python
import joblib
import os

def save_model(model, filepath, overwrite=False):
    """Safely save a model with error handling."""
    if os.path.exists(filepath) and not overwrite:
        raise FileExistsError(f"{filepath} already exists. Set overwrite=True to replace.")
    try:
        joblib.dump(model, filepath)
        print(f"Model saved successfully to {filepath}")
    except Exception as e:
        print(f"Error saving model: {e}")
        raise  # Re-raise: a failed save should never pass silently

def load_model(filepath):
    """Safely load a model with error handling."""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"{filepath} not found.")
    try:
        model = joblib.load(filepath)
        print(f"Model loaded successfully from {filepath}")
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        return None
```
8.1.1.5. Summary#
- Use `pickle` for simple objects and when you need Python's built-in library
- Use `joblib` for scikit-learn models and large NumPy arrays
- Always save preprocessing components along with models
- Include metadata for tracking and debugging
- Implement proper version control
- Never load serialized files from untrusted sources
In the next section, we’ll explore scikit-learn’s specific persistence patterns and how to properly serialize complex pipelines.