Appendix to "Brain Dump on Parameter Decompositions, Transcoders, etc."

Sep 29, 2025

This post is an extension of the Brain Dump post on Parameter Decompositions and Transcoders. It has three parts:

Other approaches with the same conceptual strategy from the original post
A nice connection between Llama MLPs and Bilinear MLPs
Proofs of the algebraic transitions in the original post

The last part is included only for completeness; I suggest reading at least the first part. The second part is also fun, though not essential for anything.

Methods to Utilize Original Weights in Interpretable Modules

In the brain dump, we discussed the potential of using the original weights in the module that replaces them, be it SPD or a transcoder. The immediate approach proposed was Expansion Initialization: use the original weights as the initialization of the replacement model, duplicating vectors from the weights (with some additional noise), since replacement modules tend to be wider than the original modules. However, I am reluctant to stop there: on the one hand, I am not sure it would be good enough, and on the other, I feel that this approach follows the correct general strategy in an intrinsic way. Here, I discuss other ways to use the original weights in the initialization or training of the replacement module.
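
To make the duplication step concrete, here is a minimal NumPy sketch of expansion initialization. The function name `expansion_init`, the `factor` and `noise_std` parameters, and the row layout (copy k of the result comes from original row k % d_hidden) are assumptions made for illustration, not something fixed by the original post.

```python
import numpy as np

def expansion_init(W, factor, noise_std=1e-2, seed=0):
    """Tile the rows of W `factor` times and perturb every copy slightly.

    W: (d_hidden, d_in) original weights.
    Returns an array of shape (factor * d_hidden, d_in) in which row k
    is a noisy copy of original row k % d_hidden.
    """
    rng = np.random.default_rng(seed)
    W_rep = np.tile(W, (factor, 1))  # stack `factor` copies of W
    return W_rep + noise_std * rng.standard_normal(W_rep.shape)

W = np.arange(6, dtype=float).reshape(3, 2)  # toy original weights
W_rep = expansion_init(W, factor=4)          # 4x wider replacement
```

The noise only serves to break the symmetry between copies, so that training can pull them apart.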

The first idea was proposed by Ariel Bereslavsky when I discussed transcoder expansion initialization with him. The idea relies on an elegant assumption: since the original model itself has some sparsified behavior, we can use it during training, not just at initialization. We rely on expansion initialization’s one-to-many mapping between original and replacement weights, and allow updates to propagate only into weights that correspond to positions not zeroed out in the original weights. Many variants are possible: instead of exactly zero, we can use a small threshold, a top-k cutoff, etc. Another option is to scale each update in proportion to the corresponding gate in the original model, instead of simply choosing a subset to update.
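
One possible reading of this idea, sketched in NumPy. Everything here is an assumption for illustration: the function name `masked_update`, the use of the original model's gate values on the current input as the "not zeroed out" signal, and the layout in which replacement row k is a copy of original neuron k % d_hidden.

```python
import numpy as np

def masked_update(W_rep, grad, orig_gate, factor, lr=0.1,
                  threshold=0.0, proportional=False):
    """Gradient step that only reaches copies of active original neurons.

    W_rep, grad: (factor * d_hidden, d_in); row k copies neuron k % d_hidden.
    orig_gate:   (d_hidden,) gate values of the original model on this input.
    """
    if proportional:
        weight = np.abs(orig_gate)                              # soft variant
    else:
        weight = (np.abs(orig_gate) > threshold).astype(float)  # hard mask
    weight = np.tile(weight, factor)[:, None]  # one weight per replacement row
    return W_rep - lr * weight * grad

W_rep = np.zeros((4, 3))
grad = np.ones((4, 3))
orig_gate = np.array([0.0, 2.0])  # neuron 0 is off, neuron 1 is on
hard = masked_update(W_rep, grad, orig_gate, factor=2)
soft = masked_update(W_rep, grad, orig_gate, factor=2, proportional=True)
```

In the hard variant the copies of the inactive neuron stay frozen; in the proportional variant the step size scales with the gate value.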

Another idea, which came up during a conversation with Dan Barzilay, is to add a mixing matrix E on top of the original weights: instead of learning the replacement weights directly, we represent them as combinations of the original weights (expansion initialization then corresponds to E being a block matrix of identity matrices). The reason this is potentially powerful, from a training perspective, is that “moving information” might be simpler this way, since meaningful information is now moved by adjusting single coefficients. Another approach, which might work even better: instead of learning E to translate original weights to replacement weights, we can use it to translate replacement weights to original weights. Then, we can add a sparsity penalty (if we really believe there’s a sparse superposition of features in the weights, this should work well). Of course, we can allow some slack, as there is likely noise in the weights besides the true features.
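
A small NumPy sketch of both directions, under assumed shapes and names (`E`, `E_rev`, `reverse_loss`, and the L1 coefficient `l1` are all hypothetical). It shows that stacking identity blocks reproduces plain duplication, and what a reconstruction-plus-sparsity objective for the reverse direction could look like.

```python
import numpy as np

d_hidden, d_in, factor = 3, 5, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_hidden, d_in))  # original weights (rows = neurons)

# Forward direction: replacement rows are mixtures of original rows.
# Stacked identity blocks give back plain expansion initialization.
E = np.tile(np.eye(d_hidden), (factor, 1))  # (factor * d_hidden, d_hidden)
W_rep = E @ W

# Reverse direction: original rows are sparse mixtures of replacement rows.
def reverse_loss(E_rev, W_rep, W, l1=1e-3):
    recon = np.sum((W - E_rev @ W_rep) ** 2)  # slack for noise in the weights
    return recon + l1 * np.abs(E_rev).sum()   # sparsity penalty on the mixing

E_rev = np.zeros((d_hidden, factor * d_hidden))
E_rev[:, :d_hidden] = np.eye(d_hidden)  # each original row uses one copy
loss = reverse_loss(E_rev, W_rep, W)
```

In an actual training setup both `E_rev` and `W_rep` would be learned; here they are fixed just to show the shapes and the objective.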

Relationship Between Bilinear and SwiGLU MLPs

As discussed in the Brain Dump post, Bilinear MLPs and SwiGLU MLPs with the same P, U, V parameters are related. The SwiGLU MLP architecture can, in fact, be interpreted as the SPD companion of the Bilinear MLP architecture. Since Pearce et al. (2025) have explored bilinear MLPs for their interpretability, the SPD connection we proved raises the question of how much of this carries over to SwiGLU. In this section, we will discuss another connection between principal directions in bilinear MLPs and their corresponding SwiGLU MLPs.

Specifically, Pearce et al. analyze the network through singular value decompositions, which identify important directions in activation space. These directions correspond to the inputs that activate the model most strongly, in a sense that can be formalized mathematically.

Let’s say we want to find the input that most activates a given output direction (see the paper for the details of this setup). This is done using tensor contraction. It is easy to verify that contracting with a given output direction yields functions of the form:

\(S_\text{swi}(x) = \sum_i g_i(x)\ x^T A_i\ x; \ \ S_\text{bil}(x) = \sum_i \ x^T A_i\ x\)

where the A_i’s are simply matrices (tensor contraction of a 3rd-order tensor with a vector results in a matrix). Importantly, A_i’s are the same for both expressions.

The left expression is a gated sum of quadratic forms, non-linear in x because of the gates g_i(x), and the right one is a plain sum of quadratic forms (therefore, a quadratic form itself).

The important directions can be thought of as vectors of unit norm that, when plugged into S_bil, produce values that are either very positive or very negative. More generally, when the vector is not of unit norm, we normalize by the norm squared, because scaling the input by some factor scales the output by that factor squared. In the case of quadratic forms, these definitions are equivalent, and their maximum is the largest eigenvalue (in absolute value) of the form’s symmetrized matrix (A + A^T)/2. In the general case, non-linearity breaks the equivalence. In symbols, we define strength as:

\(\text{strength} = \frac{|S(x)|}{x^T x}\)

You can verify that the strength is indeed a directional quantity in quadratic forms, i.e., if you multiply x by some scalar alpha, both the numerator and denominator are multiplied by alpha^2, and these factors cancel each other.
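
As a quick numerical sanity check of these two facts (scale invariance, and the maximum being the largest absolute eigenvalue), here is a NumPy sketch on a random matrix. The helper name `strength` and the toy matrix `A` are assumptions for illustration; the eigenvalues are taken from the symmetrized matrix (A + A^T)/2, since x^T A x only depends on the symmetric part of A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))  # toy matrix of a quadratic form

def S_bil(x):
    return x @ A @ x

def strength(S, x):
    return abs(S(x)) / (x @ x)

# Scale invariance: rescaling x (even by a negative factor) changes nothing.
x = rng.standard_normal(4)
s1 = strength(S_bil, x)
s2 = strength(S_bil, -3.7 * x)

# The strongest direction is the top eigenvector of the symmetric part.
A_sym = (A + A.T) / 2
eigvals, eigvecs = np.linalg.eigh(A_sym)
k = np.argmax(np.abs(eigvals))
top_direction = eigvecs[:, k]
max_strength = strength(S_bil, top_direction)
```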

We will show that the strongest directions in S_bil remain strong in the SwiGLU variant, with at least 1/2 the strength of the bilinear variant. Unfortunately, for the theorem to hold in its full glory, we must allow the norm of the input x to grow arbitrarily large, which is not the case in most architectures that use these MLPs. Another caveat is that the theorem does not preclude the possibility of stronger vectors in S_swi, due to non-linear behaviors. Still, it is plausible that the original directions remain significant, even if they are not optimal, and retain their importance in analyzing the network.

The proof is relatively simple. Let x be a strong direction with S_bil(x) = y, and assume for simplicity that it has unit norm, so its strength is ||y||. Now, plug a large multiple of x into S_swi:

\(S_\text{swi}(\alpha x) = \alpha^2 \sum_i g_i(\alpha x)\ x^T A_i\ x\)

The alpha^2 factor cancels when dividing by the squared norm of alpha * x. As alpha tends to infinity, the remaining part converges in an interesting way: each g_i converges to either 0 or 1. This is a direct consequence of the structure g_i(x) = sigmoid(v_i^T x). So the expression becomes a sub-sum of the sum in S_bil. Another interesting point is that, due to g_i’s structure, if g_i converges to 0 for x, it converges to 1 for -x, and vice versa. So the sub-sum for alpha * x is the exact complement of the sub-sum for -alpha * x, and the two add up to S_bil(x) = y. Therefore, by the triangle inequality, at least one of them must have strength at least ||y||/2 (in the limit alpha → infinity).

An annoying edge case is when x is orthogonal to some of the v_i, in which case g_i stays at sigmoid(0) = 1/2 for every alpha instead of converging to 0 or 1. To fix that, add to x an arbitrarily small epsilon multiple of a vector that isn’t orthogonal to any of the v_i. A randomly drawn vector has this property with probability one; therefore, such a vector exists. This results in an epsilon change in the internal expressions, but it restores the required property inside the g_i. QED
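
The argument can be checked numerically. The following NumPy sketch uses random toy matrices `A` (standing in for the A_i) and gate vectors `V` (rows standing in for the v_i); all names, shapes, and the choice alpha = 100 are assumptions for illustration. It relies on sigmoid(t) + sigmoid(-t) = 1, which is why the gated sums at alpha * x and -alpha * x add up to the bilinear value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 6
A = rng.standard_normal((h, d, d))  # one matrix A_i per hidden unit
V = rng.standard_normal((h, d))     # rows are the gate vectors v_i

def sigmoid(t):
    return 0.5 * (1.0 + np.tanh(t / 2.0))  # overflow-free form of 1/(1+e^-t)

def S_bil(x):
    return np.einsum('iab,a,b->', A, x, x)

def S_swi(x):
    g = sigmoid(V @ x)  # g_i(x) = sigmoid(v_i^T x)
    return np.einsum('i,iab,a,b->', g, A, x, x)

def strength(S, x):
    return abs(S(x)) / (x @ x)

# Strongest unit-norm direction of S_bil: top eigenvector of the symmetric part.
A_sum = A.sum(axis=0)
w, Q = np.linalg.eigh((A_sum + A_sum.T) / 2)
x = Q[:, np.argmax(np.abs(w))]

alpha = 100.0
s_bil = strength(S_bil, x)
s_pos = strength(S_swi, alpha * x)
s_neg = strength(S_swi, -alpha * x)
best = max(s_pos, s_neg)  # expected to be at least s_bil / 2
```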

Maybe some extra work can reveal better bounds or other interesting links. It is quite nice that such a cute link between these two architectures exists.

Proofs of Algebraic Transitions in the Brain Dump Post

Claim 1 —

\((x^T u_i) \cdot (x^T v_i) = x^T (u_i\otimes v_i) x \)

Proof. Due to commutativity of inner product, and associativity:

\((x^T u_i) \cdot (x^T v_i) = (x^T u_i) \cdot (v_i^T x)=\)

\(x^T (u_i v_i^T) x = x^T (u_i \otimes v_i) x\)
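
For readers who prefer a numerical check over algebra, here is a tiny NumPy sketch of Claim 1 on random vectors (the vectors and their dimension are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x, u, v = rng.standard_normal((3, 5))  # three random vectors in R^5

lhs = (x @ u) * (x @ v)       # product of two inner products
rhs = x @ np.outer(u, v) @ x  # quadratic form with matrix u v^T
```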

Claim 2 — We need to prove that:

\(S(x) = \sum_i g_i(x)\ x^T (u_i \otimes v_i \otimes p_{:i}) x\)

where it is given that

\(S(x) = P z(x); \quad z_i(x) = g_i(x)\ x^T (u_i \otimes v_i) x\)

More accurately, since x^T A x is not really well-defined for a 3rd-order tensor, the exact expression is, in PyTorch:

S = torch.einsum('abc,a,b->c', A, x, x)

with the tensor A being:

\(\mathbf{A} = \sum_i g_i(x) \ u_i \otimes v_i \otimes p_{:i}\)

(If you prefer mathematical symbols, we can use the notation of tensor contraction, and get the equivalent expression to the einsum:

\(S(x) = \mathbf{A} \times_1 x \times_2 x\)

)

Proof.

It is a well-known fact, and easily verifiable, that matrix-vector product is the sum of the columns of the matrix, weighted by the coordinates of the vector:

\(S(x) = Pz(x) = \sum_i z_i(x) p_{:i} = \)

\(\sum_i g_i(x)\ x^T (u_i\otimes v_i)x\ p_{:i}\)

The c-th coordinate of this expression is:

\(\sum_i g_i(x)\ x^T (u_i\otimes v_i)x\ p_{ci}=\)

\(x^T \big( \sum_i g_i(x)\ (u_i\otimes v_i)\ p_{ci} \big) x \qquad (*)\)

We used commutativity of scalars with vectors and associativity. So, we get a quadratic form with the expression in parentheses as its representing matrix. I denote this expression with (*) for future reference.

Now, let’s read out the c-th coordinate of the einsum. Einsum tells us that we need to sum over all a and b, and fix c to get the c-th coordinate of the result:

\(\sum_{ab} \mathbf{A}_{abc} x_a x_b = x^T \mathbf{A}_{: : c} x\)

where the right-hand side is expressed as a quadratic form whose matrix A_::c is obtained by fixing the last coordinate of A to c. The transition from LHS to RHS is simple if you ignore the last coordinate and think in matrices, but the additional dimension might make it look overwhelming at first. Imagine c is not there, by denoting B = A_::c, and it reads:

\(\sum_{ij} B_{ij} x_i x_j = x^T B x\)

We need to prove now that the representing matrix of the quadratic form in (*) is the same as A_::c. Concretely, this means we need to prove that its (a,b)-th entry is A_abc. To get the (a, b) entry, multiply by e_a from the left and e_b from the right, where e_i denotes the i-th standard basis element:

\(e_a^T \big( \sum_i g_i(x)\ (u_i\otimes v_i)\ p_{ci} \big) e_b = \)

\(\sum_i g_i(x)\ e_a^T(u_i\otimes v_i) e_b\ p_{ci} = \)

\(\sum_i g_i(x) \ u_{ia} v_{ib}\ p_{ci} = \mathbf{A}_{abc}\)

Here, the first transition is due to linearity (scalars commute with vectors), the second is the definition of the outer product of two vectors, and the last is the definition of the three-way outer product in A. QED

Finally, notice we haven’t used any property of g_i(x), so it can be any function. Specifically, if we set g_i(x) = 1, we get the definition of a bilinear MLP, and therefore the entire theorem is true for bilinear MLPs, where we remove g_i terms.
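
Claim 2 can also be checked numerically. The sketch below uses NumPy's `einsum` (which takes the same subscript string as the PyTorch call above) on random toy parameters; the names `mlp` and `tensor_form`, the shapes, and the convention that rows of `U` and `V` hold the u_i and v_i while columns of `P` hold the p_{:i} are assumptions for illustration. It tests both a sigmoid gate and the constant gate g_i(x) = 1 of the bilinear case.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o = 4, 6, 3
U = rng.standard_normal((h, d))  # rows u_i
V = rng.standard_normal((h, d))  # rows v_i
P = rng.standard_normal((o, h))  # columns p_{:i}
x = rng.standard_normal(d)

def mlp(x, gate):
    """S(x) = P z(x), with z_i(x) = g_i(x) (x^T u_i)(x^T v_i)."""
    z = gate(V @ x) * (U @ x) * (V @ x)
    return P @ z

def tensor_form(x, gate):
    """Same S(x), computed by contracting the 3rd-order tensor A with x twice."""
    g = gate(V @ x)
    A = np.einsum('i,ia,ib,ci->abc', g, U, V, P)  # sum_i g_i u_i (x) v_i (x) p_{:i}
    return np.einsum('abc,a,b->c', A, x, x)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))  # gated (SwiGLU-style) case
ones = lambda t: np.ones_like(t)              # bilinear case, g_i = 1

gated_direct, gated_tensor = mlp(x, sigmoid), tensor_form(x, sigmoid)
bil_direct, bil_tensor = mlp(x, ones), tensor_form(x, ones)
```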
