%matplotlib inline
import warnings

# Ignore the first specific warning
warnings.filterwarnings(
    "ignore",
    message="plotting functions contained within `_documentation_utils` are intended for nemos's documentation.",
    category=UserWarning,
)

# Ignore the second specific warning
warnings.filterwarnings(
    "ignore",
    message="Ignoring cached namespace 'core'",
    category=UserWarning,
)

warnings.filterwarnings(
    "ignore",
    message="invalid value encountered in div",
    category=RuntimeWarning,
)

Selecting a basis by cross-validation with scikit-learn#

In this demo, we will show how to select an appropriate basis and its hyperparameters using cross-validation. In particular, we will learn:

  1. What a scikit-learn pipeline is.

  2. Why pipelines are useful.

  3. How to combine NeMoS Basis and GLM objects in a pipeline.

  4. How to select the number of basis functions and the basis type (or any other hyperparameter in the pipeline) through cross-validation.

  5. How to use a custom scoring metric to quantify the performance of each configuration.

What is a scikit-learn pipeline#

Schematic of a scikit-learn pipeline.

A pipeline is a sequence of data transformations leading up to a model. Each step before the final one transforms the input data into a different representation, and then the final model step fits, predicts, or scores based on the previous step’s output and some observations. Setting up such machinery can be simplified using the Pipeline class from scikit-learn.

To set up a scikit-learn Pipeline, ensure that:

  1. Each intermediate step is a scikit-learn transformer object with a transform and/or fit_transform method.

  2. The final step is an estimator object with a fit method, or a model with fit, predict, and score methods.

Each transformation step takes a 2D array X of shape (num_samples, num_original_features) as input and outputs another 2D array of shape (num_samples, num_transformed_features). The final step takes a pair (X, y), where X is as before, and y is a 1D array of shape (num_samples,) containing the observations to be modeled.
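For illustration, a minimal custom transformer satisfying this contract could look like the following sketch (SquareFeatures is a hypothetical example, not part of NeMoS or scikit-learn):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SquareFeatures(BaseEstimator, TransformerMixin):
    """Toy transformer that appends the square of each input feature."""

    def fit(self, X, y=None):
        # Stateless transformation: nothing to learn from the data.
        return self

    def transform(self, X):
        # (num_samples, n) -> (num_samples, 2 * n)
        return np.concatenate([X, X**2], axis=1)

TransformerMixin provides fit_transform automatically, given fit and transform.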

You can define a pipeline as follows:

from sklearn.pipeline import Pipeline

# Assume each transformer_i is a transformer object and model is a model object
pipe = Pipeline(
    [
        ("label_1", transformer_1), 
        ("label_2", transformer_2),
        ...,
        ("label_n", transformer_n),
        ("label_model", model)
    ]
)

Note that you have to assign a label to each step of the pipeline.

Tip

Here we used a placeholder "label_i" for demonstration; you should choose a more descriptive name depending on the type of transformation step.

Calling pipe.fit(X, y) will perform the following computations:

# Chain of transformations
X1 = transformer_1.fit_transform(X)
X2 = transformer_2.fit_transform(X1)
# ...
Xn = transformer_n.fit_transform(Xn_1)

# Fit step
model.fit(Xn, y)

And the same holds for pipe.score and pipe.predict.
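Concretely, once the pipeline is fit, pipe.predict(X) is equivalent to the following sketch (steps is the standard list of (label, estimator) pairs held by a Pipeline):

# Equivalent expansion of pipe.predict(X): every intermediate step
# transforms, and only the final step predicts
Xi = X
for label, step in pipe.steps[:-1]:
    Xi = step.transform(Xi)
y_pred = pipe.steps[-1][1].predict(Xi)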

Why pipelines are useful#

Pipelines not only streamline and simplify your code but also offer several other advantages. The real power of pipelines becomes evident when combined with the scikit-learn model_selection module, which includes cross-validation and similar methods. This combination allows you to tune hyperparameters at each step of the pipeline in a straightforward manner.
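For example, because every step is labeled, any hyperparameter in the pipeline can be addressed by name. A sketch using the placeholder labels from above (some_param is a hypothetical parameter name):

# Update a hyperparameter of one step via "<step label>__<parameter name>"
pipe.set_params(label_model__some_param=0.1)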

In the following sections, we will showcase this approach with a concrete example: selecting the appropriate basis type and number of bases for a GLM regression in NeMoS.

Combining basis transformations and GLM in a pipeline#

Let’s start by creating some toy data.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

import nemos as nmo

# some helper plotting functions
from nemos import _documentation_utils as doc_plots

# predictors, shape (n_samples, n_features)
X = np.random.uniform(low=0, high=1, size=(1000, 1))
# observed counts, shape (n_samples,)
rate = 2 * (
    scipy.stats.norm.pdf(X, scale=0.1, loc=0.25)
    + scipy.stats.norm.pdf(X, scale=0.1, loc=0.75)
)
y = np.random.poisson(rate).astype(float).flatten()

Let’s now plot the simulated neuron’s tuning curve, which is bimodal, Gaussian-shaped, and has peaks at 0.25 and 0.75.

fig, ax = plt.subplots()
ax.scatter(X.flatten(), y, alpha=0.2)
ax.set_xlabel("input")
ax.set_ylabel("spike count")
sns.despine(ax=ax)
[Figure: scatter plot of the simulated spike counts against the input.]

Converting NeMoS Basis to a transformer#

In order to use NeMoS Basis in a pipeline, we need to convert it into a scikit-learn transformer.

bas = nmo.basis.RaisedCosineLinearConv(5, window_size=5)

# initialize using the constructor
trans_bas = nmo.basis.TransformerBasis(bas)

# equivalent initialization via "to_transformer"
trans_bas = bas.to_transformer()

# setup the transformer
trans_bas.set_input_shape(1)
Transformer(RaisedCosineLinearConv(n_basis_funcs=5, window_size=5, width=2.0))

Learn More about TransformerBasis

To learn more about scikit-learn transformers and TransformerBasis, check out this note.

Creating and fitting a pipeline#

We may want to first transform the input data with our basis functions, and then fit a GLM on the transformed data.

This is exactly what Pipeline is for!

pipeline = Pipeline(
    [
        (
            "transformerbasis",
            nmo.basis.RaisedCosineLinearEval(6).set_input_shape(1).to_transformer(),
        ),
        (
            "glm",
            nmo.glm.GLM(regularizer_strength=0.5, regularizer="Ridge"),
        ),
    ]
)

pipeline.fit(X, y)
Pipeline(steps=[('transformerbasis',
                 Transformer(RaisedCosineLinearEval(n_basis_funcs=6, width=2.0))),
                ('glm',
                 GLM(
    observation_model=PoissonObservations(inverse_link_function=exp),
    regularizer=Ridge(),
    regularizer_strength=0.5,
    solver_name='GradientDescent'
))])

Note how NeMoS models are already scikit-learn compatible and can be used directly in the pipeline.
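Because each step is labeled, you can also pull the fitted GLM back out of the pipeline to inspect it. A small sketch (indexing a Pipeline by step label is standard scikit-learn):

# Retrieve the fitted GLM from the pipeline by its label
glm = pipeline["glm"]
# one weight per basis function, i.e. shape (6,)
print(glm.coef_.shape)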

Visualize the fit:

# Predict the rate.
# Note that the pipeline expects a 2D input; X already has shape (n_samples, 1).
# We sort the samples so the predicted rate can be plotted as a smooth curve.
x = np.sort(X, axis=0)
predicted_rate = pipeline.predict(x)
fig, ax = plt.subplots()

ax.scatter(X.flatten(), y, alpha=0.2, label="generated spike counts")
ax.set_xlabel("input")
ax.set_ylabel("spike count")


ax.plot(
    x,
    predicted_rate,
    label="predicted rate",
    color="tab:orange",
)

ax.legend()
sns.despine(ax=ax)
[Figure: spike counts with the pipeline’s predicted rate overlaid.]

The current model captures the bimodal distribution of responses, appropriately picking out the peaks. However, it doesn’t do a good job capturing the actual firing rate: the peaks are too low and the valleys are not low enough. This might be because of our choice of basis and/or regularizer strength, so let’s see if tuning those parameters results in a better fit! We could do this manually, but the scikit-learn pipeline makes it much easier!

Select the number of basis functions by cross-validation#

Warning

Please keep in mind that while GLM.score supports different ways of evaluating goodness-of-fit through the score_type argument, pipeline.score(X, y, score_type="...") does not propagate this, and uses the default value of log-likelihood.

To evaluate a pipeline, please create a custom scorer (e.g. pseudo_r2 below) and call my_custom_scorer(pipeline, X, y).

Define the parameter grid#

Let’s define candidate values for the parameters of each step of the pipeline we want to cross-validate. In this case, these are the number of basis functions in the transformation step and the strength of the ridge regularization in the GLM fit:

param_grid = dict(
    glm__regularizer_strength=(0.1, 0.01, 0.001, 1e-6),
    transformerbasis__n_basis_funcs=(3, 5, 10, 20, 100),
)

Grid definition

In order to define a parameter grid dictionary for a pipeline, you must structure the dictionary keys as follows:

  • Start with the pipeline label ("glm" or "transformerbasis" for us). This determines which pipeline step has the relevant hyperparameter.

  • Add "__" followed by the hyperparameter name (for example, "n_basis_funcs").

  • If the hyperparameter is itself an object with attributes, add another "__" followed by the attribute name. For instance, "glm__observation_model__inverse_link_function" would be a valid key for cross-validating over the link function of the GLM’s observation_model attribute inverse_link_function. The values in the dictionary are the parameters to be tested.
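Run the grid search#

With the parameter grid defined, we can fit the grid search. A minimal sketch, assuming the pipeline and param_grid defined above (the same 5-fold grid search is used in the rest of this demo):

gridsearch = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
)

# run the 5-fold cross-validation grid search
gridsearch.fit(X, y)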

Visualize the scores#

Let’s extract the scores from gridsearch and take a look at how the different parameter values of our pipeline influence the test score:

cvdf = pd.DataFrame(gridsearch.cv_results_)

cvdf_wide = cvdf.pivot(
    index="param_transformerbasis__n_basis_funcs",
    columns="param_glm__regularizer_strength",
    values="mean_test_score",
)

doc_plots.plot_heatmap_cv_results(cvdf_wide)
[Figure: heatmap of mean test scores for each combination of regularizer strength and number of basis functions.]

The plot displays the model’s log-likelihood for each parameter combination in the grid. The parameter combination with the highest score, which is the one selected by the procedure, is highlighted with a blue rectangle. We can thus see that we need 10 or more basis functions, and that the tested regularization strengths perform comparably. In general, we want the fewest basis functions required to get a good fit, so we’ll choose 10 here.
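You can also read the selected configuration off the fitted grid search directly (best_params_ and best_score_ are standard GridSearchCV attributes):

# The winning parameter combination and its mean cross-validated score
print(gridsearch.best_params_)
print(gridsearch.best_score_)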

Visualize the predicted rate#

Finally, visualize the predicted firing rates using the best model found by our grid-search, which gives a better fit than the randomly chosen parameter values we tried in the beginning:

# Predict the rate using the best configuration.
x = np.sort(X, axis=0)
predicted_rate = gridsearch.best_estimator_.predict(x)
fig, ax = plt.subplots()

ax.scatter(X.flatten(), y, alpha=0.2, label="generated spike counts")
ax.set_xlabel("input")
ax.set_ylabel("spike count")


ax.plot(
    x,
    predicted_rate,
    label="predicted rate",
    color="tab:orange",
)

ax.legend()
sns.despine(ax=ax)
[Figure: spike counts with the best model’s predicted rate overlaid.]
# save image for thumbnail
from pathlib import Path
import os

root = os.environ.get("READTHEDOCS_OUTPUT")
if root:
    path = Path(root) / "html/_static/thumbnails/how_to_guide"
# if run locally, store in ../_build/html/...
else:
    path = Path("../_build/html/_static/thumbnails/how_to_guide")

# make sure the folder exists if run from build
if root or Path("../assets/stylesheets").exists():
    path.mkdir(parents=True, exist_ok=True)

if path.exists():
    fig.savefig(path / "plot_06_sklearn_pipeline_cv_demo.svg")

🚀🚀🚀 Success! 🚀🚀🚀

We are now able to capture the distribution of the firing rate appropriately: both peaks and valleys in the spiking activity are matched by our model predictions.

Evaluating different bases directly#

In the previous example we set the number of basis functions of the Basis wrapped in our TransformerBasis. However, if, for example, we are unsure about the type of basis functions we want to use, or we have already defined some basis functions of our own, we can use cross-validation to evaluate those directly as well.

Here we include transformerbasis__basis in the parameter grid to try different values for TransformerBasis.basis:

param_grid = dict(
    glm__regularizer_strength=(0.1, 0.01, 0.001, 1e-6),
    transformerbasis__basis=(
        nmo.basis.RaisedCosineLinearEval(5).set_input_shape(1),
        nmo.basis.RaisedCosineLinearEval(10).set_input_shape(1),
        nmo.basis.RaisedCosineLogEval(5).set_input_shape(1),
        nmo.basis.RaisedCosineLogEval(10).set_input_shape(1),
        nmo.basis.MSplineEval(5).set_input_shape(1),
        nmo.basis.MSplineEval(10).set_input_shape(1),
    ),
)

Then run the grid search:

gridsearch = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
)

# run the 5-fold cross-validation grid search
gridsearch.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('transformerbasis',
                                        Transformer(RaisedCosineLinearEval(n_basis_funcs=6, width=2.0))),
                                       ('glm',
                                        GLM(
    observation_model=PoissonObservations(inverse_link_function=exp),
    regularizer=Ridge(),
    regularizer_strength=0.5,
    solver_name='GradientDescent'
))]),
             param_grid={'glm__regularizer_strength': (0.1, 0.01, 0.001, 1e-06),
                         'tra...inearEval(n_basis_funcs=5, width=2.0),
                                                     RaisedCosineLinearEval(n_basis_funcs=10, width=2.0),
                                                     RaisedCosineLogEval(n_basis_funcs=5, width=2.0, time_scaling=50.0, enforce_decay_to_zero=True),
                                                     RaisedCosineLogEval(n_basis_funcs=10, width=2.0, time_scaling=50.0, enforce_decay_to_zero=True),
                                                     MSplineEval(n_basis_funcs=5, order=4),
                                                     MSplineEval(n_basis_funcs=10, order=4))})

Wrangling the output data a bit and looking at the scores:

cvdf = pd.DataFrame(gridsearch.cv_results_)

# Label each configuration with the basis type and number of basis functions
cvdf["transformerbasis_config"] = [
    f"{b.__class__.__name__} - {b.n_basis_funcs}"
    for b in cvdf["param_transformerbasis__basis"]
]

cvdf_wide = cvdf.pivot(
    index="transformerbasis_config",
    columns="param_glm__regularizer_strength",
    values="mean_test_score",
)

doc_plots.plot_heatmap_cv_results(cvdf_wide)
[Figure: heatmap of mean test scores for each combination of regularizer strength and basis configuration.]

As shown in the heatmap, the model with the highest score, highlighted in blue, used a RaisedCosineLinearEval basis (as used above), which appears to be a suitable choice for our toy data. We can confirm that by plotting the firing rate predictions:

# Predict the rate using the optimal configuration
x = np.sort(X, axis=0)
predicted_rate = gridsearch.best_estimator_.predict(x)
fig, ax = plt.subplots()

ax.scatter(X.flatten(), y, alpha=0.2, label="generated spike counts")
ax.set_xlabel("input")
ax.set_ylabel("spike count")

ax.plot(
    x,
    predicted_rate,
    label="predicted rate",
    color="tab:orange",
)

ax.legend()
sns.despine(ax=ax)
[Figure: spike counts with the predicted rate of the best basis configuration overlaid.]

The plot confirms that the firing rate distribution is accurately captured by our model predictions.

Warning

Please note that mixing the two ways of defining values for the parameter grid is not allowed, since it would lead to unexpected behavior. The following would raise an error:

param_grid = dict(
    glm__regularizer_strength=(0.1, 0.01, 0.001, 1e-6),
    transformerbasis__n_basis_funcs=(3, 5, 10, 20, 100),
    transformerbasis__basis=(
        nmo.basis.RaisedCosineLinearEval(5).set_input_shape(1),
        nmo.basis.RaisedCosineLinearEval(10).set_input_shape(1),
        nmo.basis.RaisedCosineLogEval(5).set_input_shape(1),
        nmo.basis.RaisedCosineLogEval(10).set_input_shape(1),
        nmo.basis.MSplineEval(5).set_input_shape(1),
        nmo.basis.MSplineEval(10).set_input_shape(1),
    ),
)
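If you do need to search over both, one possible workaround (a sketch, not taken from this guide) is to pass GridSearchCV a list of separate grids, so that each grid uses only one of the two styles:

# Illustrative only: each dictionary is searched independently,
# so the two ways of specifying the basis are never mixed
param_grids = [
    dict(
        glm__regularizer_strength=(0.1, 0.01),
        transformerbasis__n_basis_funcs=(5, 10),
    ),
    dict(
        glm__regularizer_strength=(0.1, 0.01),
        transformerbasis__basis=(
            nmo.basis.MSplineEval(5).set_input_shape(1),
        ),
    ),
]
# GridSearchCV(pipeline, param_grid=param_grids, cv=5) would search both grids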

Create a custom scorer#

By default, the GLM score method returns the model log-likelihood. If you want to try a different metric, such as the pseudo-R2, you can create a custom scorer and pass it to the cross-validation object:

from sklearn.metrics import make_scorer

pseudo_r2 = make_scorer(
    nmo.observation_models.PoissonObservations().pseudo_r2
)

We can now run the grid search, providing the custom scorer:

gridsearch = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
    scoring=pseudo_r2,
)

# Run the 5-fold cross-validation grid search
gridsearch.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('transformerbasis',
                                        Transformer(RaisedCosineLinearEval(n_basis_funcs=6, width=2.0))),
                                       ('glm',
                                        GLM(
    observation_model=PoissonObservations(inverse_link_function=exp),
    regularizer=Ridge(),
    regularizer_strength=0.5,
    solver_name='GradientDescent'
))]),
             param_grid={'glm__regularizer_strength': (0.1, 0.01, 0.001, 1e-06),
                         'tra...
                                                     RaisedCosineLinearEval(n_basis_funcs=10, width=2.0),
                                                     RaisedCosineLogEval(n_basis_funcs=5, width=2.0, time_scaling=50.0, enforce_decay_to_zero=True),
                                                     RaisedCosineLogEval(n_basis_funcs=10, width=2.0, time_scaling=50.0, enforce_decay_to_zero=True),
                                                     MSplineEval(n_basis_funcs=5, order=4),
                                                     MSplineEval(n_basis_funcs=10, order=4))},
             scoring=make_scorer(pseudo_r2, response_method='predict'))

And finally, we can plot each model’s score.

Plot the pseudo-R2 scores

cvdf = pd.DataFrame(gridsearch.cv_results_)

# Label each configuration with the basis type and number of basis functions
cvdf["transformerbasis_config"] = [
    f"{b.__class__.__name__} - {b.n_basis_funcs}"
    for b in cvdf["param_transformerbasis__basis"]
]

cvdf_wide = cvdf.pivot(
    index="transformerbasis_config",
    columns="param_glm__regularizer_strength",
    values="mean_test_score",
)

doc_plots.plot_heatmap_cv_results(cvdf_wide, label="pseudo-R2")
[Figure: heatmap of pseudo-R2 scores for each combination of regularizer strength and basis configuration.]

As you can see, the results with pseudo-R2 agree with those of the log-likelihood. Note that this new metric is normalized between 0 and 1, with a higher score indicating better performance.
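As a final check, you can apply the custom scorer to the best pipeline directly, as suggested in the warning above (make_scorer returns a callable with signature scorer(estimator, X, y)):

# Score the selected model with the custom metric
best_pseudo_r2 = pseudo_r2(gridsearch.best_estimator_, X, y)
print(f"pseudo-R2 of the best configuration: {best_pseudo_r2:.3f}")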