Brain Dump on Parameter Decompositions, Transcoders, etc.

Sep 27, 2025

Might be interesting, might be useless. As far as I know, the ideas here are new, but I cannot be sure, as they seem rather simple to arrive at. So some may be folklore or tried-and-failed (though, in my experience, trying harder with extra tricks can have incredible returns).

This post’s main objective is to discuss approaches that I feel could help in scaling SPD (Stochastic Parameter Decomposition, Bushnaq et al. (2025)). I also meander into related and overlapping topics, such as transcoder training, and speculate on how different parameter decomposition is from traditional sparse dictionary learning (SDL) approaches. I assume a certain level of familiarity with parameter decomposition and SDL methods (SAEs, transcoders, etc.). Finally, and most importantly, none of the claims discussed here has been tested, and they might fail miserably. Think of it as a project proposal rather than a finished project.

Outline of this draft:

Drawing connections between vanilla gated MLPs used in modern transformers and the SPD architecture (also some thoughts on the interpretability of Bilinear MLPs)
Drawing on this connection to initialize SPD based on the original weights
Same point can also hold for the relationship between traditional MLPs and their transcoders

Part 1: Connecting Llama MLPs & Stochastic Parameter Decomposition

Stochastic Parameter Decomposition (SPD) is designed to replace weight matrices in an existing model with an interpretable neural module that preserves the model’s behavior after replacement. Unlike most other interpretable replacement models, it is applied to the weights, rather than activations. It creates a dynamic weight matrix composed of a sum of rank-1 subcomponent matrices, modulated by input-dependent coefficients g_i:

\(W(x) = \sum_i g_i(x) u_i v_i^T = \sum_i g_i(x)\ u_i \otimes v_i\)

with g_i being a neural network that returns a scalar in [0,1]. We use the outer product notation here, as it generalizes to 3rd order tensors, which will be useful to us soon:

\((a \otimes b \otimes c)_{ijk} = a_i b_j c_k\)
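To make the two formulas above concrete, here is a minimal NumPy sketch. The sizes, the random matrices, and the `gates` function (a sigmoid of a random linear map) are illustrative assumptions only; in SPD the gates are learned networks, and nothing here is taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_sub = 4, 3, 5           # hypothetical sizes

U = rng.normal(size=(n_sub, d_out))    # row i is u_i
V = rng.normal(size=(n_sub, d_in))     # row i is v_i
G = rng.normal(size=(n_sub, d_in))     # parameters of the stand-in gate function

def gates(x):
    """Stand-in for g_i(x): one scalar in [0, 1] per subcomponent (assumption)."""
    return 1.0 / (1.0 + np.exp(-(G @ x)))

def dynamic_weight(x):
    """W(x) = sum_i g_i(x) * outer(u_i, v_i), written as a single einsum."""
    return np.einsum("i,io,ij->oj", gates(x), U, V)

x = rng.normal(size=d_in)
W_x = dynamic_weight(x)

# Same thing as an explicit sum of gated rank-1 matrices.
W_loop = sum(g * np.outer(u, v) for g, u, v in zip(gates(x), U, V))
assert np.allclose(W_x, W_loop)

# Third-order outer product: T[i, j, k] = a[i] * b[j] * c[k].
a, b, c = rng.normal(size=2), rng.normal(size=3), rng.normal(size=4)
T = np.einsum("i,j,k->ijk", a, b, c)
assert np.isclose(T[1, 2, 3], a[1] * b[2] * c[3])
```

Writing the sum as an einsum keeps the rank-1 structure explicit: the gate vector simply rescales each subcomponent before they are added.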

Aside on jargon: It is worth mentioning that the rank-1 matrices are called subcomponents, and one would probably assume that the weight matrix they are replacing is what we call the component. This is not the case. The “sub-” prefix goes back to the original parameter decomposition paper (Attribution-based Parameter Decomposition, Braun et al. (2025)), where components are network-wide. This is because the goal of parameter decomposition is to identify and categorize all the mechanisms the network uses, rather than a single layer. SPD makes a compromise of finding the layer-wise mechanisms first, and leaves for future work gluing them together, probably by clustering. Similarly, we will discuss only layerwise stuff here and assume gluing is orthogonal to our work.

SPD architecture and objective function are designed to fulfill the desiderata of parameter decomposition:

Faithfulness: The rank-1 matrices are optimized to sum to the original matrix they are replacing. This somewhat reduces their freedom to diverge from the original module’s computation.1
“Simplicity”: Simplicity is incorporated into SPD via sparsity constraints on the gate functions (i.e., per input, gates are optimized to be sparsely activated) and rank-1 subcomponent matrices.
Minimality: This is achieved by having a small number of active gates per input. I put sparsely-activated gates under simplicity, but the authors separate minimality and simplicity.

As you have probably noticed, I have discussed the objective function of SPD only in passing and neglected some parts of the training process, because they are not very relevant to the analysis I do here. Once the architectures match, we can apply the same training process used in SPD. For more details on the objective and training process, refer to the original paper.

Now, let’s see how Llama MLPs (SwiGLU MLPs) are connected to the SPD architecture. The gated MLP in Llama, SwiGLU, is of the form:

\(S(x) = P \big((U x) \odot \text{SiLU}(Vx)\big)\)

where P, U, V are parameter matrices, the circle represents elementwise multiplication, and the SiLU activation is:

\(\text{SiLU}(x) = x \odot \sigma(x)\)

with sigma representing the sigmoid function (applied elementwise).

Ignoring P for the moment, let’s look at the i-th coordinate of the inner part of S(x):

\((Ux)_i \cdot (Vx)_i \cdot \sigma((Vx)_i)\)

We can simplify notation by noting that the i-th coordinate of Ux is the inner product of the i-th row of U with x, and similarly for V:

\((u_i^T x) \cdot (v_i^T x) \cdot \sigma(v_i^T x)\)

denoting the nonlinear part as g_i:

\(g_i(x) = \sigma(v_i^T x)\)

and using some algebraic manipulation, we get that each coordinate is in fact a gated bilinear map of the form:

\(g_i(x)\ x^T (u_i \otimes v_i) x\)
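For completeness, the manipulation can be spelled out in one line. It only uses the facts that a scalar equals its own transpose and that the outer product of two vectors is the matrix u_i v_i^T:

\((u_i^T x)\,(v_i^T x)\,\sigma(v_i^T x) = g_i(x)\,(x^T u_i)\,(v_i^T x) = g_i(x)\ x^T (u_i v_i^T)\, x = g_i(x)\ x^T (u_i \otimes v_i)\, x\)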

Some more algebraic work leads us to the following expression for the entire S(x) function (the expression now reflects the projection with P, and represents all coordinates, not just the i-th):

\(\sum_i g_i(x)\ x^T (u_i \otimes v_i \otimes p_{:i}) x\)

where p_:i is the i-th column of P. Since the expression in brackets is a 3rd order tensor, we need to specify the dimensions along which we contract x^T and x to be precise,2 but I leave it out for brevity.
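A quick numerical sanity check of this identity, as a sketch with made-up sizes and random weights (not Llama's). It also makes the contraction explicit: x^T is contracted with the axis coming from u_i, x with the axis coming from v_i, and the remaining axis (from p_:i) is the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 6                    # hypothetical sizes

U = rng.normal(size=(d_hidden, d_model))    # row i is u_i
V = rng.normal(size=(d_hidden, d_model))    # row i is v_i
P = rng.normal(size=(d_model, d_hidden))    # column i is p_{:i}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swiglu(x):
    """S(x) = P((Ux) * SiLU(Vx)), with SiLU(z) = z * sigmoid(z)."""
    Vx = V @ x
    return P @ ((U @ x) * Vx * sigmoid(Vx))

def swiglu_expanded(x):
    """sum_i g_i(x) x^T (u_i (x) v_i (x) p_{:i}) x with explicit contraction axes."""
    g = sigmoid(V @ x)                      # g_i(x) = sigmoid(v_i^T x)
    # T[i, a, b, k] = u_i[a] * v_i[b] * p_{:i}[k]
    T = np.einsum("ia,ib,ki->iabk", U, V, P)
    # Contract x with axes a and b, weight by the gates, and sum over i.
    return np.einsum("i,a,iabk,b->k", g, x, T, x)

x = rng.normal(size=d_model)
assert np.allclose(swiglu(x), swiglu_expanded(x))
```

Materializing the full tensor T is only for illustration; at realistic sizes one would never build it explicitly.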

So, it turns out that architecturally the SwiGLU MLP is an SPD of another architecture. That architecture happens to be the Bilinear MLP (Pearce et al., 2025). This is simply the same structure but without the SiLU:

\(S(x) = P \big((U x) \odot (Vx)\big)\)

Following the same analysis as above, we would arrive at the same expanded expression for the Bilinear MLP, only without g_i:

\(\sum_i x^T (u_i \otimes v_i \otimes p_{:i}) x\)

So, the SwiGLU MLP is even faithful to its corresponding Bilinear MLP, in the sense used in SPD.3
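The faithfulness claim can be checked in the same spirit. The sketch below (random weights and hypothetical sizes again) treats the third-order tensor B of the Bilinear MLP as the "original weights" and the per-neuron rank-1 tensors as the subcomponents; that reading of faithfulness for this setting is my assumption about how the SPD notion carries over.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden = 3, 5                    # hypothetical sizes

U = rng.normal(size=(d_hidden, d_model))    # row i is u_i
V = rng.normal(size=(d_hidden, d_model))    # row i is v_i
P = rng.normal(size=(d_model, d_hidden))    # column i is p_{:i}

def bilinear_mlp(x):
    """Bilinear MLP: P((Ux) * (Vx)), i.e. SwiGLU without the sigmoid gate."""
    return P @ ((U @ x) * (V @ x))

# Full tensor of the Bilinear MLP: B[a, b, k] = sum_i U[i, a] * V[i, b] * P[k, i].
B = np.einsum("ia,ib,ki->abk", U, V, P)

# One rank-1 third-order subcomponent per hidden neuron.
subcomponents = [
    np.einsum("a,b,k->abk", U[i], V[i], P[:, i]) for i in range(d_hidden)
]

# Faithfulness in this expanded form: the subcomponents sum to B.
assert np.allclose(sum(subcomponents), B)

# And B reproduces the Bilinear MLP when contracted with x on both input axes.
x = rng.normal(size=d_model)
assert np.allclose(np.einsum("a,abk,b->k", x, B, x), bilinear_mlp(x))
```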

Side Note: In fact, while the Bilinear MLP is touted as an interpretable variant of the SwiGLU MLP, as we can see, SwiGLU is very closely related to it, and it also has an additional Privileged Basis induced by the gates, which is generally good for interpretability. Further, in the Appendix, I will show that, under certain conditions and setup, SwiGLU is inextricably linked with its corresponding Bilinear MLP, which suggests that the same analysis methods presented in Pearce et al. (2025) can be used as-is on SwiGLU.

Part 2: Can We Use This?

I think we can. My intuition is similar to the data initialization of Peigné (2023). There, he initialized the features of an SAE with samples from the data distribution that the SAEs are trained to reconstruct. The assumption is that since the data is expected to be a superposition of SAE features, we can use it to initialize the features, and hope training would disentangle the features. This should be easier than learning them from scratch, theoretically speaking. It seems to provide some mild speedup (~10% with some variant that searches for datapoints with rare features) from the author’s experiments.

I think that in our case, we are in a much better position. Data points can be correlated, they are bountiful, and it might be hard to find a subset of them that represents all the features well. Here, in contrast, we have (1) a small set of already existing features (2) that serve a purpose not dissimilar to the one we expect of the features learned by the parameter decomposition (by this I mean that sparsity is encouraged to some degree by most standard activations). The training only requires improving the makeshift disentanglement already present in the MLP (due to things like the Privileged Basis). This is the general approach.

My other suggestion is that we can cheat a little: instead of SPD’ing the parameters of the Llama MLP, we can use the Llama MLP as a starting point, as it already has the architecture of an SPD. In a way, what I am suggesting is SPD’ing the Bilinear Map, treating the Llama MLP as its “makeshift SPD”. Once we make this “mental shift”, the solution is clear: we can apply the SPD training process directly to the Llama MLP (though we probably need to weigh the pros and cons of this mental shift, and whether it fits within the parameter decomposition framework. I think it should).

As discussed above, we can simply use the existing Llama matrices as a good initialization for the SPD, hoping training would be smart enough to find the true disentanglement of the features (if such a decomposition really exists, and to the extent to which it exists). But actually, we don’t need to stop there. We might be bottlenecked by the dimensionality of the MLP, so we can also expand the MLP’s weights. This Expansion Initialization is easy to implement. We duplicate the columns of P, and the rows of U and V. Of course, we need to rescale the weights to compensate for the multiplicative factor we used to expand the network, so that the expanded MLP still computes the same function (for example, if every hidden unit is duplicated k times, dividing the duplicated columns of P by k). Finally, to break the symmetry between duplicates of the same row, we need to add some random noise to the expanded weights.
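Expansion Initialization for the SwiGLU case might look like the sketch below. The function name `expand_swiglu`, the choice to put the whole 1/factor rescaling on P, and the Gaussian noise are my assumptions; other ways of splitting the rescaling across U, V, and P are possible but interact with the nonlinearity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swiglu(x, U, V, P):
    """S(x) = P((Ux) * SiLU(Vx))."""
    Vx = V @ x
    return P @ ((U @ x) * Vx * sigmoid(Vx))

def expand_swiglu(U, V, P, factor, noise_std=0.0, rng=None):
    """Duplicate every hidden unit `factor` times (hypothetical helper).

    Rows of U and V and columns of P are repeated; P is divided by `factor`
    so that, without noise, the expanded MLP computes the same function.
    """
    rng = np.random.default_rng() if rng is None else rng
    U_big = np.repeat(U, factor, axis=0)
    V_big = np.repeat(V, factor, axis=0)
    P_big = np.repeat(P, factor, axis=1) / factor
    if noise_std > 0:
        # Break the symmetry between duplicates of the same hidden unit.
        U_big = U_big + noise_std * rng.normal(size=U_big.shape)
        V_big = V_big + noise_std * rng.normal(size=V_big.shape)
        P_big = P_big + noise_std * rng.normal(size=P_big.shape)
    return U_big, V_big, P_big

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 6                    # hypothetical sizes
U = rng.normal(size=(d_hidden, d_model))
V = rng.normal(size=(d_hidden, d_model))
P = rng.normal(size=(d_model, d_hidden))
x = rng.normal(size=d_model)

# Without noise, the expanded MLP is functionally identical to the original.
U_big, V_big, P_big = expand_swiglu(U, V, P, factor=3)
assert np.allclose(swiglu(x, U, V, P), swiglu(x, U_big, V_big, P_big))

# With noise, duplicates are no longer identical, so training can separate them.
U_n, V_n, P_n = expand_swiglu(U, V, P, factor=3, noise_std=1e-3, rng=rng)
```

The noise scale is a free parameter here; how large it should be relative to the weights is exactly the kind of thing that would need to be tested.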

Part 3: Expansion Initialization for Transcoders

We can generalize the approach here in two ways. First, we can apply a similar Expansion Initialization to traditional MLPs, translating them to transcoders (Dunefsky et al., 2024).4 Transcoders are trained to be interpretable replacement modules for MLPs. They are often designed as a shallow (traditional) MLP architecture:

\(F(x) = W_\text{dec}\ \tau(W_\text{enc} x + b_\text{enc}) + b_\text{dec}\)

where tau is some sparsity-encouraging activation function (e.g., ReLU, though better alternatives exist). A related concept, Sparse Auto-Encoders (SAEs), tries to reconstruct activations, rather than predict how they are transformed by the MLP. Transcoders have been shown to work better than SAEs (Paulo et al., 2025).

Transcoders are structurally similar to the module they replace (in the case of slightly older architectures). These MLPs take the same form structurally, with tau often being ReLU or GELU, which are not optimized for sparsity. This, by the way, might be one reason that transcoders are better than SAEs: an okay solution to the transcoder instantiation readily exists, namely the original MLP. Of course, that’s a simplification; the activation functions are usually not exactly the same, but it is a reasonable approximation in my eyes. SAEs, in contrast, use a network to reconstruct the activations, which is likely foreign to the way the model makes its computations, so it adds extra complication to their application.

Just as before, we can use the original weights as the initialization of the transcoder. And similarly, we will need to expand the network by duplicating and adding noise.
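For a traditional MLP, the same idea could be sketched as follows. I take tau to be ReLU purely as an example, and `expand_to_transcoder` is a hypothetical helper name; if the transcoder uses a different activation than the original MLP, the noise-free equivalence checked here holds only approximately.

```python
import numpy as np

def mlp(x, W_enc, b_enc, W_dec, b_dec):
    """F(x) = W_dec tau(W_enc x + b_enc) + b_dec, with tau = ReLU (assumption)."""
    return W_dec @ np.maximum(W_enc @ x + b_enc, 0.0) + b_dec

def expand_to_transcoder(W_enc, b_enc, W_dec, b_dec, factor, noise_std=0.0, rng=None):
    """Initialize a wider transcoder from the original MLP weights (sketch).

    Each hidden unit is duplicated `factor` times; the decoder is divided by
    `factor` so the noise-free transcoder matches the original MLP exactly.
    """
    rng = np.random.default_rng() if rng is None else rng
    W_enc_big = np.repeat(W_enc, factor, axis=0)
    b_enc_big = np.repeat(b_enc, factor)
    W_dec_big = np.repeat(W_dec, factor, axis=1) / factor
    if noise_std > 0:
        # Break the symmetry between duplicated latents.
        W_enc_big = W_enc_big + noise_std * rng.normal(size=W_enc_big.shape)
        W_dec_big = W_dec_big + noise_std * rng.normal(size=W_dec_big.shape)
    return W_enc_big, b_enc_big, W_dec_big, b_dec.copy()

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8                    # hypothetical sizes
W_enc = rng.normal(size=(d_hidden, d_model))
b_enc = rng.normal(size=d_hidden)
W_dec = rng.normal(size=(d_model, d_hidden))
b_dec = rng.normal(size=d_model)
x = rng.normal(size=d_model)

params_big = expand_to_transcoder(W_enc, b_enc, W_dec, b_dec, factor=4)
assert np.allclose(mlp(x, W_enc, b_enc, W_dec, b_dec), mlp(x, *params_big))
```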

The second way we can generalize Expansion Initialization is more philosophical. The strategy of initialize-and-pray can work well in theory. But I don’t want this to be taken as just a specific algorithm, but a general direction. Taking features out of superposition, as SDL and PD methods attempt to do, should gain tremendously from access to the original weights of the module. These weights are a small set of vectors that capture everything that this module can produce. We know that, by definition, the output of this module is a combination of these weight vectors. It is almost impossible for the original weights not to contain the information we need. Theoretically, it is possible that this information alone is enough, without any data, but I think that we probably need the data to disentangle them correctly, because the way these weights interact might not be trivial from just the static representation of the weights.

We might also need to be judicious about the training process. The algorithm might not work just because of training instabilities and sensitivities, even if all the information is in there. In summary, all the above discussion leads me to the hypothesis that, unless we are wrong about the way we think about superposition, access to original weights through initialization or some other way should almost necessarily be relevant. Through occasional conversations with friends, many related implementation strategies have come up, capitalizing on access to original weights. I will expand on them in the Appendix.

Again, none of these are tested, so at this point, we are just talking a good game. (^_^)

Finally, more thoughts and explorations will also be shared in the Appendix, alongside the other things I promised to discuss throughout the post. Consider giving it a read too!

1. Though this is a one-dimensional linear constraint, which on its own isn’t a very strong constraint. It might interact with the other constraints, such as sparsity and rank-1, to constrain the solution less trivially. Need to think.

2. Contraction is the generalization of matrix multiplication for tensors, which requires specifying the axes along which we multiply.

3. In this expanded form. Faithfulness is not invariant to the way a neural module’s weights are represented.

4. Though the idea originally appeared, but was not implemented, in other works.

The Residual Stream
