Python Machine Learning by Example, Fourth Edition
I forked this from the original repo on 2025-04-15 and added the following:
- converted it to use
uvfor package installation - most dependencies are installed via
uv add <package_name> ... pytorchwas not able to be installed this way, instead:uv pip install torch torchvision
Datasets are stored externally and accessed via a global configuration system using environment variables.
- Copy datasets to your local machine (from OneDrive or other source)
- Create a
.envfile in the repository root with your dataset paths:# Dataset paths configuration DATA_ROOT=/path/to/your/datasets/Python-Machine-Learning-by-Example-Fourth-Edition/data MOVIELENS_PATH=${DATA_ROOT}/ch02/ml-1m CLICK_RATE_PATH=${DATA_ROOT}/ch03
- Install dependencies including python-dotenv and xgboost:
uv sync
For Interactive Sessions (Quarto/Jupyter Notebooks) - Copy/Paste Template:
Use this code block at the start of any new .qmd or .ipynb file:
# Setup for dataset access in interactive environments
import pandas as pd
import sys
import os
from pathlib import Path
# Add repository root to Python path for config import
# This automatically finds the repo root containing config.py
repo_root = Path.cwd()
while repo_root != repo_root.parent:
if (repo_root / 'config.py').exists():
break
repo_root = repo_root.parent
sys.path.insert(0, str(repo_root))
# Import dataset configuration functions
from config import get_movielens_file, get_click_rate_file, get_data_path
# Examples of loading datasets:
# data_path = get_movielens_file('ratings.dat') # MovieLens data
# data_path = get_click_rate_file('click-rate-train.csv') # Click rate data
# data_path = get_data_path(5, 'some-file.csv') # Chapter 5 data
# df = pd.read_csv(data_path)
# Optional: Import model caching utility for expensive operations
# from model_cache import cached_grid_search, ModelCacheFor Python Scripts (.py files):
import sys
from pathlib import Path
sys.path.append(str(Path(__file__).parent.parent))
from config import get_movielens_file, get_data_path
data_path = get_movielens_file('ratings.dat')
df = pd.read_csv(data_path)Available Dataset Functions:
get_movielens_file(filename)- MovieLens dataset files (ch02)get_click_rate_file(filename)- Click rate dataset files (ch03)get_data_path(chapter, filename)- Any dataset file by chapter number
Multiprocessing Warnings in GridSearchCV/RandomizedSearchCV:
For the Flexible Hyperparameter Search Script: The script now automatically handles multiprocessing warnings on macOS. No user action needed!
For Interactive Environments (Quarto/Jupyter): If you encounter multiprocessing warnings, add this code before your grid search:
# Fix for multiprocessing warnings in interactive environments (Quarto/Jupyter)
import os
import warnings
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
# Option 1: Disable parallel processing to avoid multiprocessing issues
grid_search = GridSearchCV(
estimator=classifier,
param_grid=param_grid,
cv=3,
scoring='f1',
n_jobs=1, # Use single core instead of n_jobs=-1
verbose=1
)
# Option 2: Set multiprocessing start method (add at the top of your notebook)
import multiprocessing as mp
if __name__ == '__main__':
try:
mp.set_start_method('spawn', force=True)
except RuntimeError:
pass # Already set
# Option 3: Suppress warnings (not recommended for debugging)
warnings.filterwarnings('ignore', category=UserWarning, module='joblib')Recommended approach: Use n_jobs=1 in interactive environments to avoid multiprocessing issues while maintaining functionality.
For long-running operations like GridSearchCV/RandomizedSearchCV, consider using Python scripts instead of interactive notebooks:
Benefits of Python scripts:
- ✅ Full multiprocessing support (
n_jobs=-1works properly) - ✅ Better memory management and performance
- ✅ No risk of losing work if notebook crashes
- ✅ Can run in background while you work on other tasks
- ✅ Better error handling and logging
Workflow:
- Develop/prototype in Quarto notebooks with small parameter grids
- Run expensive searches in Python scripts with full parameter grids
- Analyze results back in interactive notebooks
Example script structure:
# hyperparameter_search.py
import sys
from pathlib import Path
sys.path.append(str(Path(__file__).parent.parent))
from config import get_click_rate_file
from model_cache import cached_grid_search
from sklearn.tree import DecisionTreeClassifier
# ... other imports
if __name__ == "__main__":
# Load and prepare data
# Run expensive search with n_jobs=-1
# Results automatically cached for notebook analysisRun the script:
cd ch3/
python hyperparameter_search.pySave time on repeated GridSearchCV/RandomizedSearchCV runs:
The repository includes a model_cache.py utility to automatically cache expensive search results:
# Easy way - use the convenience function
from model_cache import cached_grid_search
# This will cache results and reuse them if parameters haven't changed
grid_search = cached_grid_search(
estimator=DecisionTreeClassifier(random_state=42),
param_grid={'max_depth': [3, 5, 7], 'min_samples_split': [2, 5]},
X_train=X_train,
y_train=y_train,
model_name="decision_tree_clickrate", # Unique identifier
cv=5,
scoring='f1',
n_jobs=1
)
# Manual way - more control
from model_cache import ModelCache
cache = ModelCache("my_models") # Cache directory
# Try to load cached result
cached_result = cache.load_search_result("my_model", params_dict)
if cached_result is None:
# Run expensive search
grid_search = GridSearchCV(...)
grid_search.fit(X_train, y_train)
# Save result
cache.save_search_result("my_model", grid_search, params_dict)
else:
grid_search = cached_resultCache Management:
cache = ModelCache()
cache.list_cached_models() # Show all cached models
cache.clear_cache("model_name") # Clear specific model
cache.clear_cache() # Clear all modelsThe repository includes several working examples demonstrating these features:
ch3/decision-tree-exercise.qmd- Interactive development with cachingch3/decision-tree-with-caching-example.qmd- Comprehensive caching democh3/hyperparameter_search.py- Production script for expensive searchesch3/analyze_search_results.qmd- Analysis of cached search results
1. Quick Prototyping (Interactive):
# In Quarto notebook - fast iteration with small parameter grids
from model_cache import cached_grid_search
grid_search = cached_grid_search(
estimator=DecisionTreeClassifier(random_state=42),
param_grid={'max_depth': [3, 5], 'min_samples_split': [2, 5]}, # Small grid
X_train=X_train, y_train=y_train,
model_name="dt_prototype",
cv=3, n_jobs=1 # Fast for experimentation
)2. Production Search (Script):
# Run comprehensive search in background
cd ch3/
python hyperparameter_search.py
# This runs extensive parameter grids with n_jobs=-1
# Results automatically cached3. Results Analysis (Interactive):
# In Quarto notebook - load cached results for analysis
from model_cache import ModelCache
cache = ModelCache()
cached_model = cache.load_search_result("decision_tree_comprehensive_grid", params)
# Rich visualizations and analysis without re-computation├── config.py # Global dataset configuration
├── model_cache.py # Caching utility for expensive operations
├── flexible_hyperparameter_search.py # NEW: Flexible multi-classifier tuning
├── analyze_flexible_search_results.qmd # NEW: Analysis for flexible searches
├── FLEXIBLE_SEARCH_GUIDE.md # NEW: Comprehensive usage guide
├── .env # Local dataset paths (not committed)
├── ch2/ # Naive Bayes examples
├── ch3/ # Decision Trees with caching examples
│ ├── decision-tree-exercise.qmd # Interactive development
│ ├── decision-tree-with-caching-example.qmd # Caching demonstration
│ ├── hyperparameter_search.py # Production hyperparameter search
│ └── analyze_search_results.qmd # Results analysis
├── data/ # Local datasets (gitignored)
└── model_cache/ # Cached models (gitignored)
✅ Global Dataset Configuration - Access OneDrive datasets from anywhere
✅ Interactive Environment Support - Works in Quarto, Jupyter, and Python scripts
✅ Model Caching System - Save expensive GridSearchCV/RandomizedSearchCV results
✅ Production-Ready Scripts - Full multiprocessing support for expensive operations
✅ Comprehensive Examples - Working code for all common scenarios
NEW: Advanced hyperparameter tuning for multiple classifiers with custom parameter selection
# Decision Tree with default parameter ranges
python flexible_hyperparameter_search.py \
--classifier decision_tree \
--dataset click-rate-train.csv \
--target click \
--drop id,hour,device_id,device_ip \
--tune-params max_depth,criterion
# Random Forest with custom parameter values (JSON format)
python flexible_hyperparameter_search.py \
--classifier random_forest \
--dataset my_data.csv \
--target outcome \
--tune-params n_estimators,max_depth \
--param-values '{"n_estimators": [100, 200, 300], "max_depth": [10, 15, 20]}'
# XGBoost with custom values (key-value format)
python flexible_hyperparameter_search.py \
--classifier xgboost \
--dataset my_data.csv \
--target outcome \
--tune-params learning_rate,reg_alpha \
--param-values "learning_rate=0.05,0.1,0.15 reg_alpha=0,0.5,1.0"- ✅ 3 Classifiers Supported: DecisionTree, RandomForest, XGBoost
- ✅ User-Selected Parameters: Only tune what you specify (no forced defaults)
- ✅ Custom Parameter Values: Specify your own ranges or use built-in defaults
- ✅ Flexible Dataset Support: Any CSV with configurable target/drop columns
- ✅ Smart Caching: Compatible with existing analysis tools
- ✅ Comprehensive Help: Built-in parameter reference and examples
# Show all available parameters for any classifier
python flexible_hyperparameter_search.py --help-params decision_tree
python flexible_hyperparameter_search.py --help-params random_forest
python flexible_hyperparameter_search.py --help-params xgboost
# Show usage examples (including custom parameter formats)
python flexible_hyperparameter_search.py --examplesThe script supports custom parameter value ranges in two formats:
JSON Format:
--param-values '{"max_depth": [5, 10, 15], "criterion": ["gini", "entropy"]}'Key-Value Format:
--param-values "max_depth=5,10,15 criterion=gini,entropy"Mixed Usage (Custom + Default):
# Custom learning_rate values, default ranges for reg_alpha and n_estimators
python flexible_hyperparameter_search.py \
--classifier xgboost \
--dataset my_data.csv \
--target outcome \
--tune-params learning_rate,reg_alpha,n_estimators \
--param-values '{"learning_rate": [0.05, 0.1, 0.15]}'analyze_flexible_search_results.qmd- Comprehensive analysis of any cached searchFLEXIBLE_SEARCH_GUIDE.md- Complete guide with examples, custom parameter usage, and best practices
Note: The .env file is not committed to the repository for security and portability reasons.