Download
Download this notebook: categorical_identifiability.ipynb!
Technical Note: Redundancy in Categorical Designs#
In this notebook, we’re discussing a basic issue with categorical predictors in linear (and related) models. Thus, we’ll discuss this problem using the following notation:
\(X \in \mathbb{R}^{N \times F}\) is the N-samples by F-features design matrix.
\(w \in \mathbb{R}^F\) are the model coefficients.
\(c \in \mathbb{R}\) is the model intercept.
Why Does Redundancy Arise?#
When the Category basis is used as a standalone main-effect predictor
together with a NeMoS GLM (which always includes an intercept), there are infinitely many coefficient vectors that produce the same linear predictor \(X \cdot \mathbf{w} + c\). Let’s see why:
import numpy as np
import nemos as nmo
category = np.array(["L", "L", "L", "L", "R", "R", "R", "R"])
cat_basis = nmo.basis.Category(["L", "R"])
X = cat_basis.compute_features(category)
print("One-hot encoding:\n", X)
print("Sum over columns:", X.sum(axis=1))
One-hot encoding:
[[1. 0.]
[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]]
Sum over columns: [1. 1. 1. 1. 1. 1. 1. 1.]
Because the columns of our design matrix sum to a vector of ones, we can find coefficients \(\mathbf{w}\) and intercept \(c\) for which \(X \cdot \mathbf{w} + c = 0\):
c = -1
w = np.array([1, 1])
print("X @ w + c:", X @ w + c)
X @ w + c: [0. 0. 0. 0. 0. 0. 0. 0.]
This is bad! It means that our design matrix has a null space, vectors we can add to the parameters that have no effect on the model predictions. This is what gives rise to the problem: our model cannot distinguish between different sets of parameters if their only differences lie in the null space. This both makes optimization more difficult and prevents us from interpreting the parameters.
Let’s unpack this a bit more.
Stacking \(c\) and \(\mathbf{w}\) into a single vector \(\mathbf{v} = [c, \mathbf{w}]\) and prepending an all-ones column to \(X\) gives the compact form \(X_\text{aug} \cdot \mathbf{v} = 0\), where \(X_\text{aug} = \left[1 | X \right]\). Although equivalent to the formulation above, this augmented representation is more mathematically convenient, so we will use it for the remainder of the notebook.
v = np.hstack([c, w])
X_aug = np.column_stack([np.ones(len(category)), X]) # [1 | X]
print("X_aug @ v:", X_aug @ v) # = 0 → v is in the null space of X_aug
X_aug @ v: [0. 0. 0. 0. 0. 0. 0. 0.]
In mathematical terms, we can say that \(\mathbf{v}\) is in the null space of the augmented matrix \(X_\text{aug}\). This happens when at least one column of \(X_\text{aug}\) can be expressed as a weighted sum of the other columns—that is, the columns are linearly dependent. The representation is therefore redundant, and we can add any multiple of \(\mathbf{v}\) to the parameters without changing the predictions:
params = np.array([2., 0.5, -0.3]) # arbitrary parameters [intercept, coef]
alpha = 5.0 # arbitrary shift along the null direction
print("X_aug @ params: ", X_aug @ params)
print("X_aug @ (params + alpha*v): ", X_aug @ (params + alpha * v))
X_aug @ params: [2.5 2.5 2.5 2.5 1.7 1.7 1.7 1.7]
X_aug @ (params + alpha*v): [2.5 2.5 2.5 2.5 1.7 1.7 1.7 1.7]
Models with parameters \([c, \mathbf{w}]\) and \([c, \mathbf{w}] + \alpha \cdot \mathbf{v}\) predict the same firing rate for any \(\alpha \in \mathbb{R}\). This is non-identifiability: the data alone cannot distinguish between them, and there is no unique optimal solution.
A practical way to detect this redundancy is to check whether the rank of \(X_\text{aug}\) is strictly less than number of its columns. The rank is the number of linearly independent columns, so if it is smaller than the total number of columns, some columns must be redundant. In that case, a non-trivial null space exists, i.e., there are non-zero vectors \(\mathbf{v}\) such that \(X_\text{aug} \cdot \mathbf{v} = 0\):
print(f"Rank of [1 | X]: {np.linalg.matrix_rank(X_aug)}")
print(f"Number of columns: {X_aug.shape[1]}")
Rank of [1 | X]: 2
Number of columns: 3
As we can see, in this example, the rank is less than the number of columns, and therefore there is a non-empty null space and the model is non-identifiable.
Reference Coding: A Simple Way to Recover Identifiability#
One way to solve this problem, known as “reference coding”, is to drop one column from the full encoding. The retained columns become contrasts against the dropped (reference) category. All columns are now linearly independent:
X_ref = X[:, 1:] # drop "L"; the remaining column codes "R" vs "L"
X_ref_aug = np.column_stack([np.ones(len(category)), X_ref])
# rank == n_cols → full rank
print(f"Rank of [1 | X_ref]: {np.linalg.matrix_rank(X_ref_aug)}")
print(f"Number of columns: {1 + X_ref.shape[1]}")
Rank of [1 | X_ref]: 2
Number of columns: 2
Which column you drop is arbitrary from a model-fit perspective, but determines coefficient interpretation: every retained coefficient is the effect of that category relative to the reference. See [1] for a full discussion of contrast coding schemes.
When Redundancy Does Not Arise#
Multiplying a Category basis with a continuous basis produces category-specific tuning curves.
The intercept is not involved — the interaction columns are already linearly independent — so no
column needs to be dropped:
speed = np.array([10., 3., 2., 20., 5., 8., 15., 1.])
speed_by_context = nmo.basis.Category(["L", "R"]) * nmo.basis.RaisedCosineLinearEval(3)
X_interact = speed_by_context.compute_features(category, speed)
# Full-rank even without dropping anything
X_interact_aug = np.column_stack([np.ones(len(category)), X_interact])
print("Number of columns:", X_interact_aug.shape[1])
print("Rank: ", np.linalg.matrix_rank(X_interact_aug))
Number of columns: 7
Rank: 7
See Design Matrices Construction for worked examples of this pattern.
Effect of Regularization#
If all columns are retained (no reference dropped), whether the fitted coefficients are unique depends on the regularizer:
Regularizer |
Unique solution? |
Notes |
|---|---|---|
None (unregularized) |
No |
Any redistribution of the total effect across redundant columns that preserves predictions is equally valid. Do not interpret individual coefficients. |
Ridge (L2) |
Yes |
The L2 penalty is symmetric; it selects a unique solution by shrinking coefficients, effectively imposing a specific linear relationship between the intercept and category coefficients. The solution is well-defined but the coefficients do not have a contrast interpretation. |
Lasso (L1) |
No |
The L1 penalty does not restore uniqueness along the degenerate directions. Multiple solutions can achieve the same objective value. Coefficient values are solver-dependent. |
Elastic net |
Yes |
The L2 component restores strict convexity [2]. Unique, but less interpretable than pure Ridge; dropping a reference is recommended whenever interpretation matters. |
In practice, we advise to always drop a reference column when using Category as a standalone
predictor, regardless of regularizer. This makes coefficients interpretable as contrasts and
avoids solver-dependent results.