
Scikit-learn Pipeline Examples: GridSearchCV, Preprocessing & VotingClassifier (Python)

Scikit-learn pipelines provide a clean, reliable way to chain data preprocessing and model training steps into a single, reproducible workflow. Instead of manually transforming data and risking leakage or inconsistency, pipelines ensure that every step—from scaling and encoding to fitting the estimator—is applied in the correct order and reused consistently during training and prediction. They make machine learning code easier to read, debug, and maintain, while also integrating seamlessly with cross-validation and hyperparameter tuning tools. In practice, pipelines help you focus less on glue code and more on building models that generalize well. For detailed instructions on how to install and set up Scikit-learn in Python, visit the Scikit-learn installation guide.

A pipeline applies a sequence of data preprocessing steps (transformers), optionally followed by a final estimator (predictive model). The main purpose of a pipeline is to streamline the workflow by combining multiple steps into a single object, which makes it easier to perform tasks like cross-validation and hyperparameter tuning consistently across all stages. We will explore Scikit-learn's Pipelines through several examples. Let's start with a Pipeline containing a single estimator:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("rfc", RandomForestClassifier())])
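A single-step pipeline behaves just like the wrapped estimator: you can call fit and predict on it directly. A minimal sketch, using the wine dataset as an illustrative choice:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Load a small example dataset (illustrative choice)
X, y = load_wine(return_X_y=True)

# A single-step pipeline wraps the estimator and exposes the same API
pipeline = Pipeline([("rfc", RandomForestClassifier(random_state=0))])
pipeline.fit(X, y)

predictions = pipeline.predict(X)
```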

Pipeline with GridSearchCV

We can use a pipeline combined with GridSearchCV to optimize our model:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

pipeline = Pipeline([("rfc", RandomForestClassifier())])

grid = [{"rfc": [RandomForestClassifier()], "rfc__n_estimators": [10, 20]}]
gs = GridSearchCV(pipeline, param_grid=grid, scoring="accuracy", cv=5)
gs.fit(X_train, y_train)

You need to define the parameter grid before creating the GridSearchCV. To address a parameter of a pipeline step, join the step name and the parameter name with '__' (a double underscore). The rfc__n_estimators parameter in the example above sets the n_estimators parameter of the RandomForestClassifier step named "rfc", not of GridSearchCV itself. If you want to learn more about GridSearchCV, see the GridSearchCV guide.
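After fitting, GridSearchCV exposes the winning configuration through best_params_ (keyed by the same step__parameter names) and the best mean cross-validated score through best_score_. A minimal, self-contained sketch:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([("rfc", RandomForestClassifier(random_state=0))])

# Parameter names follow the step__parameter convention
grid = {"rfc__n_estimators": [10, 20]}

gs = GridSearchCV(pipeline, param_grid=grid, scoring="accuracy", cv=5)
gs.fit(X, y)

best_params = gs.best_params_  # which n_estimators value won
best_score = gs.best_score_    # mean cross-validated accuracy of the winner
```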

Pipeline with StandardScaler

You can use StandardScaler or other scalers with Pipeline to scale data:

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=20)

scaled_pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])
scaled_pipe.fit(X_train, y_train)
y_pred = scaled_pipe.predict(X_test)
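Once fitted, the pipeline applies the scaler automatically whenever you predict or score, and named_steps gives access to each fitted step. A sketch reusing the same wine setup (max_iter is raised here so LogisticRegression converges; an illustrative tweak):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=20)

scaled_pipe = Pipeline([("scaler", StandardScaler()), ("lr", LogisticRegression(max_iter=1000))])
scaled_pipe.fit(X_train, y_train)

# score() runs X_test through the scaler before the classifier predicts
accuracy = scaled_pipe.score(X_test, y_test)

# named_steps exposes each fitted step, e.g. the scaler's learned per-feature means
means = scaled_pipe.named_steps["scaler"].mean_
```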

Pipeline with OneHotEncoder, SimpleImputer, ColumnTransformer

We will use the Seaborn library's Titanic dataset for the example below. Our model will predict the "survived" column. The Titanic dataset has both categorical and numerical features, and we will use Logistic Regression as the estimator, so we need to convert categorical data to numerical data. We will use Scikit-learn's OneHotEncoder to encode the categorical features. ColumnTransformer allows different columns or column subsets of the input to be transformed separately; the features generated by each transformer are concatenated to form a single feature space. SimpleImputer is a univariate imputer that completes missing values with simple strategies, and categorical and numerical data need different strategies. We will use the "most_frequent" strategy for categorical data and the "mean" strategy for numerical data. The "most_frequent" strategy replaces missing values with the most frequent value in each column (it can be used for numerical data as well), while the "mean" strategy fills in missing values with the mean of each column. To check whether your data has missing values, you can print df.isnull().sum().
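As a quick check before choosing imputation strategies, df.isnull().sum() counts the missing values per column. A minimal sketch on a toy DataFrame (the column names are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric value and one missing category
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "embarked": ["S", "C", None],
})

# isnull() marks missing cells; sum() counts them per column
missing_counts = df.isnull().sum()
print(missing_counts)
```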

Pipeline with OneHotEncoder, SimpleImputer, ColumnTransformer Example

import numpy as np
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the Titanic dataset (already a pandas DataFrame)
titanic = sns.load_dataset('titanic')
X = titanic.drop(columns="survived")
y = titanic["survived"]

num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=["object", "bool", "category"]).columns

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

num_vals = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scale", StandardScaler())])
cat_vals = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("onehotencoder", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer(
    transformers=[
       ("num_preprocess", num_vals, num_cols),
       ("cat_preprocess", cat_vals, cat_cols)])

pipeline = Pipeline([("preprocess", preprocess), ("regr", LogisticRegression())])
pipeline.fit(x_train, y_train)
pred = pipeline.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

The cat_cols variable represents the categorical columns, while the num_cols variable represents the numerical columns. SimpleImputer and StandardScaler will be applied to numerical data. SimpleImputer and OneHotEncoder will be applied to categorical data. We need to declare ColumnTransformer before the final pipeline.

Advanced Example: Using Pipelines with Ensemble Methods

Pipelines in Scikit-learn are not limited to single models like Logistic Regression. You can also integrate ensemble models, such as RandomForestClassifier or VotingClassifier, as the final estimator in your workflow. This allows you to apply the same preprocessing steps—scaling, encoding, or imputation—to more complex models without changing your code structure. For example, a pipeline can first scale numeric features and then feed the processed data into a VotingClassifier combining Logistic Regression and Random Forest, making your workflow both robust and modular. For a deeper dive into ensemble methods like Voting, Bagging, and Boosting, check out our detailed guide on Ensemble Methods in Scikit-learn.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# VotingClassifier
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42))],
    voting='soft')

# Pipeline (only scaling needed)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", voting)])

# Train
pipeline.fit(x_train, y_train)

# Predict
pred = pipeline.predict(x_test)

print("Test accuracy:", accuracy_score(y_test, pred))

The accuracy score of the example above is 1.0. Here, LogisticRegression and RandomForestClassifier serve as the base models for the VotingClassifier.
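Because the VotingClassifier is just another pipeline step, its sub-estimators can also be tuned with GridSearchCV by chaining the double-underscore separator: pipeline step, then estimator name, then parameter. A sketch with an illustrative grid:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)

pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", voting)])

# Nested parameters chain the separators: step__estimator__parameter
grid = {
    "classifier__lr__C": [0.1, 1.0],
    "classifier__rf__n_estimators": [50, 100],
}

gs = GridSearchCV(pipeline, param_grid=grid, cv=3)
gs.fit(X, y)
```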

Let's see a more advanced example:

Pipeline with VotingClassifier Example

import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score

# Load Titanic dataset
titanic = sns.load_dataset('titanic')
X = titanic.drop(columns="survived")
y = titanic["survived"]

# Split numeric and categorical columns
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=["object", "category", "bool"]).columns

# Convert categorical columns (including booleans) to strings for OneHotEncoder
X[cat_cols] = X[cat_cols].astype(str)

# Split data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Preprocessing pipelines
num_vals = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler())])

cat_vals = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehotencoder", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num_preprocess", num_vals, num_cols),
    ("cat_preprocess", cat_vals, cat_cols)])

# Ensemble model (VotingClassifier)
voting = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42))], voting='soft')

# Full pipeline: preprocessing + ensemble
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", voting)])

# Fit the pipeline
pipeline.fit(x_train, y_train)

# Predict and evaluate
pred = pipeline.predict(x_test)
print("Test accuracy:", accuracy_score(y_test, pred))

The accuracy score of the example above is 1.0. Be aware that this perfect score comes from target leakage: the Titanic dataset's "alive" column is a yes/no copy of the "survived" target, so the model can read the answer directly from the features. Dropping "alive" alongside "survived" (for example, X = titanic.drop(columns=["survived", "alive"])) yields a more realistic accuracy.

In the advanced example above, we brought everything together: handling missing values with SimpleImputer, encoding categorical features using OneHotEncoder, organizing transformations through a ColumnTransformer, wrapping models inside Pipelines, and finally combining them with a VotingClassifier. If some parts of this workflow felt challenging — especially how preprocessing, modeling, and ensembling connect into one clean system — that's completely normal. These components are powerful individually, but their real strength appears when they're used together correctly.
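One practical payoff of wrapping everything in a single pipeline is leakage-free cross-validation: cross_val_score refits the preprocessing on each training fold, so the test folds never influence the scaler. A minimal sketch on the wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([("scaler", StandardScaler()), ("lr", LogisticRegression(max_iter=1000))])

# The scaler is refit on each training fold; no test-fold statistics leak in
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
```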

If you'd like to reinforce these concepts step by step, you can explore the complete Scikit-learn course, where we break down preprocessing techniques, pipelines, model evaluation, and advanced workflows in a structured and practical way.

Since this example also introduced ensemble learning through voting, you may also want to read the dedicated article on Ensemble Methods in Scikit-learn, where we dive deeper into bagging, boosting, stacking, and voting classifiers — including when and why to use each approach in real-world projects.