---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Construct Design Matrices for Categorical Features

## Design Matrix with `patsy`

:::{dropdown} Additional requirements
:color: warning
:icon: alert

To run this notebook, you may need to install the `patsy` package:

```bash
pip install patsy
```

Alternatively, to install all dependencies required for running any of our notebooks, execute:

```bash
pip install "nemos[examples]"
```
:::

If you have categorized your trials and want to capture a change in the firing rate with the trial category, you can create your design matrix using [`patsy`](https://patsy.readthedocs.io/en/latest/).

Let's assume we have a simple dataset with four samples and two labels assigned to each sample. You can think of these labels as characteristics of the trial to which the samples belong, such as stimulus 1 vs stimulus 2 or context 1 vs context 2.

To make it concrete, let's assume that the stimulus can be the image of a red triangle (Tri) a blue square (Sq), and the context defines the task, for example report either the color (C) or the shape (S) of the stimulus.

```{code-cell} ipython3
# Example data
import pandas as pd
from patsy import dmatrix
import numpy as np

# Example data
data = pd.DataFrame({
    'stimulus': ['Tri', 'Tri', 'Sq', 'Sq'],  # Stimulus
    'context': ['C', 'C', 'S', 'S']   # Context
})

# Spike counts
counts = np.array([10, 5, 2, 0])
```

You can use a `patsy` formula to structure the design matrix:

```{code-cell} ipython3
# (a:b) indicates interaction between a and b
formula = "stimulus + context + stimulus:context"
```

To construct the design matrix:

```{code-cell} ipython3
# Generate design matrix
design_df = dmatrix(formula, data, return_type="dataframe")

design_df
```

The feature in the design matrix are 1-hot encoded categories.
Note that `patsy` adds an intercept and drops the reference `Tri`. This is done by design to avoid perfect collinearity (the sum of the `Tri` and the `Sq` column would be equal to the intercept).

NeMoS GLMs, however, implicitly specify an intercept term, making the first column of this dataframe redundant. Let's drop it.

```{code-cell} ipython3
design_df.drop(columns=["Intercept"], inplace=True)

design_df
```

:::{dropdown} Understanding `patsy`'s Output
:color: info
:icon: info

In the design matrix:
- `T.` indicates **treatment coding**, which is used to encode categorical variables. For example, `stimulus[T.Sq]` represents the presence (`1`) or absence (`0`) of the category `Sq` compared to the reference category `Tri`.
- The reference category (`Tri`) is dropped to avoid collinearity, as including all categories would result in redundancy.
- Similarly, for `context`, columns `context[C]`, and `context[T.S]` is dropped to avoid collinearity.
- The interaction term `stimulus[T.Sq]:context[T.S]` represents the combined effect of `stimulus = Sq` and `context = S`. Only one column is needed for the interaction, as the other combinations are implicitly represented by the reference categories (`Tri` and `C`).

See [`patsy` docs](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language) for more details.
:::


### Fit the GLM

We are now ready to fit a GLM model using the 1-hot encoded categories as predictor. The model will learn a different
firing rate for each condition.

```{code-cell} ipython3
import nemos as nmo

# Fit the GLM model
model = nmo.glm.GLM().fit(design_df, counts)
```

:::{note}

You can directly pass a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to the
`fit` method, as long as it can be internally converted to a floating-point array (that means, for example, that you can't have a column whose values are strings). The conversion is handled automatically, by calling `design_df.values`,
so no additional steps are required on your part.
:::

## Categorical Predictors and Basis Functions

Assume you have an additional time series that you want to process with a basis:

```{code-cell} ipython3
# Example time series
speed = np.array([10., 3., 2., 20.])
```

You can create a composite basis to combine predictors:

```{code-cell} ipython3
# Identity basis combined with a B-spline basis
bas = (
    nmo.basis.IdentityEval(label="context") +
    nmo.basis.RaisedCosineLinearEval(3, label="speed")
)
bas
```

And you can compute a design matrix including both predictors.

```{code-cell} ipython3
# Compute features
X = bas.compute_features(design_df, speed)

# convert to DataFrame for readability
# bi(speed) indicates that the speed is passed through the i-th raised cosine basis function
pd.DataFrame(X, columns=design_df.columns.to_list() + [f"b{i}(speed)" for i in range(3)])
```

Here the `IdentityEval` basis just passes through the design_df (converting it to an array), while the `speed` is processed by the `RaisedCosineLinearEval`.
The output of both bases is concatenated to form a single design matrix.

:::{admonition} Basis Composition

Check out this [background note](composing_basis_function) to learn more about basis composition.
:::

### Interaction with Basis algebra

To add the interaction To add the interaction between the `S` context and speed to your design matrix, you can take advantage of basis multiplication:

```{code-cell} ipython3
# Interaction basis
bas = (
    nmo.basis.IdentityEval(label="context") *
    nmo.basis.RaisedCosineLinearEval(3, label="speed")
)

# Compute features for interaction
X2 = bas.compute_features(design_df["context[T.S]"], speed)

# convert to DataFrame for readability
pd.DataFrame(X2, columns=[f"context[T.S] * b{i}(speed)" for i in range(3)])
```