Regularization and Hyperparameter Tuning for Model Optimization
Optimizing machine learning models in Python requires both regularization to prevent overfitting and hyperparameter tuning to achieve peak performance. In this guide, we cover scikit-learn regularization techniques, including L1 (Lasso), L2 (Ridge), and ElasticNet, and explore hyperparameter tuning with GridSearchCV, with practical examples including a specific case using SVC. Each section includes clear Python code to help you implement these strategies and build robust, high-performing models.
In machine learning, a common challenge is creating models that perform well on new, unseen data. Regularization methods such as L1 and L2 help control model complexity, while hyperparameter tuning fine-tunes settings such as learning rates and tree depths to achieve the best results. By combining these approaches, data scientists can build more accurate and reliable models, making both techniques indispensable for modern machine learning projects.
Before getting started, make sure scikit-learn is properly installed and configured in your Python environment. If you haven't done this yet, refer to our step-by-step guide on installing and setting up scikit-learn to ensure a smooth learning experience.
To help you navigate this guide, the following table of contents outlines the key topics on regularization, hyperparameter tuning, and techniques to enhance machine learning model performance.
A. Regularization
B. Hyperparameter tuning
   a. GridSearchCV
      GridSearchCV Example
      GridSearchCV Example with SVC
Regularization
Regularization is a technique used in machine learning to minimize overfitting. A good model should perform well on both training and test data; overfitting occurs when a model performs well on training data but fails on unseen data. Lasso, Ridge, and ElasticNet are the most common regularization techniques. Lasso (L1, the least absolute shrinkage and selection operator) is a linear regression technique that applies the L1 penalty: the sum of the absolute values of the coefficients multiplied by alpha, the regularization strength. Ridge regression applies the L2 penalty: the sum of the squared coefficients multiplied by alpha. In both cases, the magnitude of the coefficients shrinks during training, but with Lasso coefficients can shrink all the way to zero, while Ridge shrinks coefficients toward zero without reaching it. ElasticNet is a linear regression model that combines the L1 and L2 penalties. We will create a model with Lasso using sklearn.linear_model:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

# alpha controls the strength of the L1 penalty
lasso = Lasso(alpha=0.04)
lasso.fit(X_train, y_train)

train_pred = lasso.predict(X_train)
test_pred = lasso.predict(X_test)

training_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)
We used sklearn's California housing dataset for the example above. alpha is the constant that multiplies the L1 term; you can test different values of alpha to improve performance. The performance here is not as good as in the previous examples, but it is a good illustration of how Lasso works. You can use the same syntax for Ridge regression and ElasticNet, both of which perform slightly better on this dataset. The mean squared error on the training data is close to the mean squared error on the test data, and both are high: the model does not learn enough from the data, so it does not make accurate predictions.
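As noted above, Ridge and ElasticNet use the same syntax as Lasso. The sketch below shows the same workflow with both estimators; to keep it self-contained without the California housing download, it uses the diabetes dataset bundled with scikit-learn, and the alpha value is illustrative rather than tuned:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# The diabetes dataset ships with scikit-learn, so no download is needed.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=46)

# Ridge applies the L2 penalty; ElasticNet combines L1 and L2.
# alpha=0.04 mirrors the Lasso example above and is illustrative, not tuned.
for model in (Ridge(alpha=0.04), ElasticNet(alpha=0.04)):
    model.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: test MSE = {test_mse:.2f}")
```

ElasticNet additionally accepts an l1_ratio parameter that sets the mix between the L1 and L2 penalties (0.5 by default).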
Regularization is applied by default in LogisticRegression. Scikit-learn's LogisticRegression implements regularized logistic regression using the "liblinear" library and the "newton-cg", "sag", "saga", and "lbfgs" solvers. The default penalty is l2. However, some penalties do not work with some solvers; for details, see the scikit-learn documentation. The syntax for logistic regression with an l1 penalty is:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', solver='liblinear')
You can use the liblinear solver with both the l1 and l2 penalties. You can change the solver and the penalty, but as mentioned earlier, some combinations are invalid; for example, the default solver, "lbfgs", supports only the l2 penalty (or no penalty at all). See the Logistic Regression tutorial for more information about LogisticRegression parameters.
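To see the practical difference between the two penalties, the sketch below fits l1- and l2-penalized logistic regression with the liblinear solver and counts zeroed coefficients. The breast cancer dataset and the C value are illustrative choices, not part of the examples above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# A binary classification dataset bundled with scikit-learn
# (an illustrative stand-in; any classification data works the same way).
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# liblinear supports both penalties; C is the inverse of the
# regularization strength, so a small C means strong regularization.
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

# The l1 penalty drives some coefficients exactly to zero,
# while l2 only shrinks them toward zero.
print("zero coefficients with l1:", (l1.coef_ == 0).sum())
print("zero coefficients with l2:", (l2.coef_ == 0).sum())
```

This mirrors the Lasso/Ridge contrast from the regularization section: l1 produces sparse models, l2 does not.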
Hyperparameter tuning
Hyperparameters are machine learning parameters chosen before the training process begins; they control how the model learns. K in K-nearest neighbors and alpha in Lasso regression are good examples of hyperparameters. Bias is the difference between a model's predictions and the correct values; high bias means the model cannot make successful predictions. Variance refers to the model's dependence on the training data: high variance occurs when the model performs well on training data but poorly on unseen data, which leads to overfitting. The bias-variance tradeoff describes the relationship between the two, and we should aim to minimize both. Hyperparameter tuning is the process of finding optimal hyperparameters for an algorithm. You can search for the optimal hyperparameters manually, or use sklearn's GridSearchCV method.
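A manual search can be as simple as looping over candidate values and keeping the one with the best cross-validation score. The sketch below tunes Lasso's alpha this way on the bundled diabetes dataset; the candidate list is an illustrative assumption:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Score each candidate alpha by its mean 5-fold cross-validation
# score and keep the best one. The candidate list is illustrative.
best_alpha, best_score = None, float("-inf")
for alpha in [0.01, 0.04, 0.1, 0.5, 1.0]:
    score = cross_val_score(Lasso(alpha=alpha), X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best alpha:", best_alpha)
```

GridSearchCV automates exactly this loop, including the cross-validation, over every combination in a parameter grid.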
GridSearchCV
GridSearchCV performs an exhaustive search over a specified parameter grid to find the best hyperparameters for a given estimator. It uses cross-validation to evaluate each combination of parameters and selects the one that optimizes model performance. param_grid is a dictionary (or a list of dictionaries) where the keys are parameter names and the values are lists of settings to try. The scoring parameter defines the evaluation metric used to measure model performance on the validation folds during the grid search; the full list of scorer names is in the scikit-learn documentation, and you should choose one appropriate for your estimator. GridSearchCV can be hard to grasp in the abstract, so let's explore how it works through the example below.
GridSearchCV Example
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=20)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
classifier = LogisticRegression()
model = GridSearchCV(
    estimator=classifier,
    param_grid=[{'solver': ["lbfgs", "newton-cg"], "penalty": ["l2", None]}],
    scoring="accuracy",
    cv=5,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.best_params_)
print(model.best_score_)
best_params_ is the parameter setting that gave the best results on the held-out data; in the example above it is {'penalty': 'l2', 'solver': 'lbfgs'}. best_score_ is the mean cross-validated score of best_estimator_, here 0.96. In other words, to get the best score, the solver should be "lbfgs" and the penalty should be "l2". Because the scoring is accuracy, GridSearchCV looks for the most accurate results. The param_grid in the example above contains solvers and penalties: the solver options are "lbfgs" and "newton-cg", and the penalty can be either "l2" or None, because "lbfgs" and "newton-cg" do not support "l1".
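Beyond best_params_ and best_score_, a fitted GridSearchCV also stores the score of every combination it tried in its cv_results_ attribute. The sketch below inspects it on the same wine dataset, using a small illustrative grid over C (the inverse regularization strength) rather than the solver/penalty grid above:

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# A small illustrative grid over C, the inverse regularization strength.
model = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='accuracy',
    cv=5,
)
model.fit(X, y)

# cv_results_ records the mean validation score of every combination,
# not just the winning one.
for params, score in zip(model.cv_results_['params'],
                         model.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

Inspecting cv_results_ is useful for seeing how sensitive the model is to each hyperparameter, not just which value won.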
GridSearchCV Example with SVC
Let's see another GridSearchCV example using SVC:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=40)
model = GridSearchCV(
    estimator=SVC(),
    param_grid={'C': [1, 5, 10, 20],
                'kernel': ['rbf', 'linear', 'poly', 'sigmoid']},
    scoring='accuracy',
    cv=5,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.best_params_)
print(model.best_score_)
The best_params_ are {'C': 5, 'kernel': 'rbf'}, and the best_score_ is 0.99. The param_grid here has the "C" and "kernel" parameters; the kernel values are "rbf", "linear", "poly", and "sigmoid". You can read about all GridSearchCV parameters and attributes in sklearn's documentation.
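Note that best_score_ is a cross-validation score on the training data, so it is worth confirming the winner on the held-out test set. With refit=True (the default), GridSearchCV retrains the best combination on the whole training set and exposes it as best_estimator_; a sketch of the same SVC search with a final test evaluation:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40)

model = GridSearchCV(
    estimator=SVC(),
    param_grid={'C': [1, 5, 10, 20],
                'kernel': ['rbf', 'linear', 'poly', 'sigmoid']},
    scoring='accuracy',
    cv=5,
)
model.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is the winning model
# retrained on the whole training set; model.predict delegates to it.
test_acc = accuracy_score(y_test, model.best_estimator_.predict(X_test))
print("test accuracy:", test_acc)
```

Calling model.predict(X_test) directly gives the same result, since the GridSearchCV object delegates prediction to best_estimator_.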
If you want to learn more about SVC, see the Support Vector Machines tutorial.
To learn more about scikit-learn concepts, installation, and core machine learning algorithms, visit our complete scikit-learn tutorial.