Localization By Design via Semantic Dropout Masks
A sketch of an idea for a novel, stronger form of localization by design
Introduction
Many interpretability methods try to localize parts of a language model responsible for a certain behavior. Unlike mere interpretability, localization identifies all parameters that are responsible for that behavior, and allows us to perform causal interventions. We can, for example, identify and remove unintended toxic tendencies. Unfortunately, models are not designed to be intervened upon, and intervention methods often have undesired side effects (Yang et al., 2024).
We will start by briefly exploring localization by design — pre-training models such that model behaviors are traceable and manipulable. Localization by design produces models whose behaviors can be tracked down, removed, and modified without severely affecting other behaviors. Several works (Hewitt et al., 2023, Park et al., 2024, Cloud et al., 2024) have proposed designated architectures that enable localization to some degree. One recurring issue with localized-by-design methods is their inflexibility: they cannot localize arbitrary behaviors, only ones pre-defined by the user (Cloud et al., 2024) or a closed set of behaviors found during training (Hewitt et al., 2023, Park et al., 2024).
In this article, I will propose a sketch of an idea for a new method to train transformers with localization by design. The new method, Localization By Design via Semantic Dropout Masks, aims for good post hoc localization of arbitrary behaviors. The method also preserves a low parameter count and requires minimal changes to the architecture.
The method presented here shares many features with Gradient Routing (Cloud et al., 2024), in that both define an input-dependent mask on the model parameters, and only the input-relevant parameters are updated. However, Semantic Dropout Masks do not require supplying a partition over the inputs. The method is not restricted to a rigid partition but rather employs a fuzzy approach, and it allows the user to localize arbitrary behaviors post hoc! We show how our method can extend beyond regular localization to training set localization — intervening upon the training set in retrospect!
Localization By Design
Throughout this article, we will use the term “behavior” as a general term for facts, capabilities, skills, tasks, and other forms of knowledge the model learns during training. Definitions of localization might vary on small technical details, but in this article, we will define localization as the ability to isolate all and only the entries of the weight matrices that correspond to a certain behavior.
This is an idealized definition. In reality, we will allow some small influence on other behaviors, as well as some remnant of the original behavior remaining even after intervention. Overlap between the neurons of different behaviors is inevitable, as models employ superposition.
Localization allows:
Tracing: Being able to trace a behavior to a certain portion of the model, for interpretability, debugging, and downstream applications.
Removing: Being able to remove a certain behavior without severely interfering with other behaviors.
Substitution: Being able to substitute a behavior (e.g., a fact) for another behavior (another fact).
Steering: Being able to alter the network’s behavior temporarily during the forward pass.
Training Set Localization
With a sufficiently flexible localized-by-design model, we can study how the model would have behaved if we had trained it with a counterfactual training set. This leads us to the following extended (and unrealistic) definition of a localized-by-design model. A model M is localized by design if it is equipped with an algorithm A(S,S′) such that the following holds:
Given the model's training set T, a subset S⊆T, and another set of training samples S′, the algorithm A(S,S′) returns a new model M′ such that M′ is equivalent to a model that was trained on T′=(T∖S)∪S′.
We will call this property training set localization: the ability to track down, debug, remove, and substitute parts of the training data retrospectively. This is a dramatic extension of Training Data Attribution (TDA) — identifying the training examples responsible for the emergence of a certain behavior. Despite much work, TDA is hard enough as it is, and notoriously hard to scale (Chang et al., 2024). A method that allows (training set) localization by design should have innate support for TDA as a corollary. A key novelty of the semantic dropout mask method is training set localization.
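To make the definition concrete, here is a minimal toy sketch of the counterfactual training set T′ that A(S,S′) is meant to emulate retraining on (the list-of-strings representation and all sample values are purely illustrative):

```python
def counterfactual_training_set(T, S, S_prime):
    """T′ = (T ∖ S) ∪ S′: the set a training-set-localized model should behave as if trained on."""
    assert set(S) <= set(T), "S must be a subset of the original training set"
    return [x for x in T if x not in set(S)] + list(S_prime)

# Toy usage: retract one unwanted sample without replacing it.
T = ["water boils at 100C", "Paris is in France", "some toxic sample"]
S = ["some toxic sample"]       # samples we wish the model had never seen
S_prime = []                    # nothing is added in their place
print(counterfactual_training_set(T, S, S_prime))
```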
Challenges
There are several challenges to localized-by-design architectures. Localization and interpretability by design often require substantial changes to the architecture (e.g., Tamkin et al., 2023, Hewitt et al., 2023). Another pain point of such architectures is that they are often confined to a restricted set of possible interventions, defined ahead of time. Finally, a major issue is behavior isolation: to achieve perfect separation, the number of parameters needs to grow linearly with the number of different behaviors. To avoid that, model parameters must be reused to some extent, risking leakage.
Localization By Design via Semantic Dropout Masks
Setup
We will work with a vanilla transformer and divide our network’s training into two stages: preliminary pre-training and intensive pre-training. Both are trained on a pre-training corpus, so they do not correspond to pre-training and fine-tuning — both are part of the pre-training stage. We have two sets of parameters: W_frozen and W_adapt — the frozen and adaptation parameters, respectively. The frozen parameters are trained during the preliminary stage and are frozen thereafter. During the intensive stage, the adaptation parameters are added to the frozen parameters and only they are trained; they are initialized to zero.
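As a minimal PyTorch sketch of the two parameter sets (all names, shapes, and hyperparameters here are illustrative assumptions), a matrix trained in the preliminary stage is frozen, and a zero-initialized adaptation matrix of the same shape is the only thing the optimizer touches in the intensive stage:

```python
import torch

d_in, d_out = 512, 512

# Stage 1: W_frozen is trained on the preliminary corpus (training loop omitted) ...
W_frozen = torch.nn.Parameter(torch.randn(d_in, d_out) / d_in**0.5)

# ... and then frozen for the rest of training.
W_frozen.requires_grad_(False)

# Stage 2: a zero-initialized adaptation matrix is added on top of the frozen one;
# only W_adapt is passed to the optimizer during intensive pre-training.
W_adapt = torch.nn.Parameter(torch.zeros(d_in, d_out))
optimizer = torch.optim.AdamW([W_adapt], lr=1e-4)
```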
In the preliminary pre-training, we train on a simple and clean dataset that will allow the model to learn the basics of natural language. The natural candidates are the BabyLM datasets (Warstadt et al., 2023). A model that was trained on BabyLM can serve as a firm basis for more complicated pre-training. Since the preliminary stage is common to all inputs, basic behaviors will be available across all inputs; otherwise, they would have to be learned over and over.
In the intensive pre-training step, the model is trained on a fully-fledged pre-training dataset, such as The Pile (Gao et al., 2020) or RefinedWeb (Penedo et al., 2023). Here, the model learns complex behaviors and world knowledge, in relative isolation, by using only an input-specific portion of the adaptation parameters. The intensive training step is where we will later expect to localize behaviors. We will not localize behaviors that originated in the preliminary training step.
Semantic Dropout Mask
For the sake of simplicity, we will use the term dropout rate to refer to the leave rate, i.e., the probability that a neuron is active (unmasked) in the mask. The semantic dropout mask is applied during the intensive pre-training step.
The solution we will explore in this section revolves around a simple idea. Similar inputs should be influenced by overlapping subsets of parameters. During training, we will use a dropout mask that depends on the input. Semantically related inputs will have substantial overlap in their dropout masks. Consequently, they will have similar parameters active.
The semantic dropout mask is computed in two steps: we first encode the input and then expand the encoding into a binary dropout mask. We start with semantic encoding. Intuitively, the semantic encoder, E(x), plays a similar role to the encoder in RAG (Lewis et al., 2020): the similarity between encoded inputs is high when the inputs are similar, in analogy to the way RAG retrieves relevant sequences based on semantic similarity.
Once the semantic representations are obtained, we employ an expansion algorithm that expands them into a dropout mask. The semantic dropout mask is founded on the concept of Locality-Sensitive Hashing (LSH). LSH is designed to send continuous vectors to discrete objects (e.g., binary vectors), such that close vectors will, with high probability, be mapped to the same LSH vector. In our case, we are interested in sending close vectors to similar (rather than identical) binary vectors, but the same concept holds.
A standard construction is multiplying by a randomly chosen matrix R and binarizing each coordinate of the output according to its sign. The idea is that if two vectors are close, their projections onto a random vector will likely have the same sign (this can be made rigorous with a short calculation). Another difference in our case is that we want the dropout rate to be a small value p, whereas the construction above gives p = 1/2, but this is easy to solve too.
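Below is a minimal sketch of one possible expansion step, assuming some external sentence encoder has already produced an embedding of the input. The AND-of-sign-bits trick for pushing the leave rate down to p, as well as all shapes and toy numbers, are my own assumptions rather than part of the proposal:

```python
import numpy as np

def semantic_dropout_mask(embedding, n_params, p=1/256, seed=0):
    """Expand a semantic embedding into a binary mask over the adaptation parameters.

    Each mask entry is the AND of k independent sign-LSH bits, so for an unrelated
    input each entry is kept with probability ~2^-k = p, while similar embeddings
    agree on most sign bits and therefore on most mask entries.
    """
    k = int(round(np.log2(1.0 / p)))              # sign bits per mask entry
    rng = np.random.default_rng(seed)             # projections are shared across all inputs
    R = rng.standard_normal((n_params, k, embedding.shape[-1]))
    bits = (R @ embedding) > 0                    # (n_params, k) sign-LSH bits
    return bits.all(axis=-1)                      # entry active iff all its bits are positive

# Toy check: a slightly perturbed embedding should reuse most of the same mask entries.
e1 = np.random.default_rng(1).standard_normal(64)
e2 = e1 + 0.05 * np.random.default_rng(2).standard_normal(64)   # a "semantically close" input
m1 = semantic_dropout_mask(e1, n_params=10_000)
m2 = semantic_dropout_mask(e2, n_params=10_000)
print(m1.mean(), (m1 & m2).sum() / max(m1.sum(), 1))            # ≈ leave rate p, high overlap
```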
After applying the above steps, inputs that are very different will behave like independent variables, and the probability that a given neuron is co-activated by both will be p^2, where p is the dropout rate. If p is small, e.g., p=1/10,000, the probability of co-activation will be around 1/100,000,000 — only a few dozen co-activated neurons for models with billions of parameters! One can even design algorithms that encourage unrelated inputs to activate different neurons and thereby get an even lower rate of co-activation. This in turn allows us to use more lenient values of p.
Training
The semantic dropout mask decides which of the adaptation parameters are active during each pass of the model. Only the unmasked neurons are updated in the backward pass, so in every pass only a fraction p of the adaptation parameters is updated. Given an input x and a semantic mask M(x), the input-dependent weights are:
W(x) = StopGrad(W_frozen) + M(x) ⊙ W_adapt,
where StopGrad means that no gradient flows through that term, and the multiplication between M(x) and W_adapt is entry-wise.
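A minimal PyTorch sketch of this forward rule (names and shapes are illustrative; detach() plays the role of StopGrad):

```python
import torch

def masked_forward(x, W_frozen, W_adapt, mask):
    """Computes x @ (StopGrad(W_frozen) + M(x) ⊙ W_adapt); detach() acts as StopGrad."""
    # Gradients reach only the unmasked entries of W_adapt; masked entries get zero gradient.
    W = W_frozen.detach() + mask * W_adapt
    return x @ W

# Toy usage with a random stand-in for the semantic mask M(x).
d = 8
W_frozen = torch.randn(d, d)
W_adapt = torch.zeros(d, d, requires_grad=True)
mask = (torch.rand(d, d) < 0.25).float()
y = masked_forward(torch.randn(4, d), W_frozen, W_adapt, mask)
y.sum().backward()
print((W_adapt.grad != 0).float().mean())   # ≈ 0.25: only unmasked entries received gradient
```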
Notice that the active parameters of unrelated inputs may still overlap, so there might still be some leakage between unrelated inputs. There might also be some leakage between concepts that occurred together in multiple inputs. Hopefully, this leakage is manageable; otherwise, some modifications should be applied to mitigate it.
Post Hoc Localization
Once the model is trained, we are poised to find arbitrary behaviors in the parameters. Given a behavior, we will equate it with the set of training examples that produced it, or more realistically, a small number of representative examples exhibiting the behavior. To localize the behavior as faithfully as possible, we can average the example set’s dropout masks. We then keep every neuron whose average activation passes a certain threshold, thus obtaining a behavior-specific semantic dropout mask.
Depending on the precision-recall tradeoff we want, we can choose the threshold to be high or low. To localize as many neurons relating to the behavior as possible, we need a low threshold. To maximally exclude irrelevant neurons, we need a high threshold.
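A minimal sketch of this averaging-and-thresholding step (the threshold value, the array layout, and the stand-in masks are my assumptions):

```python
import numpy as np

def behavior_mask(example_masks, threshold=0.5):
    """Average the dropout masks of representative examples and keep the neurons
    whose average activation passes the threshold.

    example_masks: (n_examples, n_params) binary array of per-example semantic masks.
    A lower threshold favors recall (more of the behavior's neurons); a higher
    threshold favors precision (fewer unrelated neurons).
    """
    return example_masks.mean(axis=0) >= threshold

# Toy usage with random stand-in masks for a handful of representative examples.
rng = np.random.default_rng(0)
example_masks = rng.random((5, 10_000)) < 0.01
print(behavior_mask(example_masks, threshold=0.4).sum())   # neurons attributed to the behavior
```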
Once we have pinpointed a set of neurons that should embody the behavior, we can apply any of the above-discussed features. We can remove a behavior, we can retrain this portion of the model to represent a counterfactual world, and we can test counterfactual assumptions on the training set.
Concluding Remarks
In this article, I presented a concept called “Semantic Dropout Mask” and showed how it can serve as a basis for a more capable localization by design method. In the appendix, you can find possible challenges and ways forward. It is important to note that the method is currently somewhat vague and theoretical. I am cautiously optimistic that this would work, especially given the successes of gradient routing in the non-fuzzy case. Fuzziness, of course, introduces a plethora of new challenges that will require handling down the line.
Appendix
Challenges
The method has several challenges. Primarily, design choices might severely influence the downstream performance of the method. Two major problems we face are generalization from other behaviors and leakage.
Leakage is most troublesome in the case of co-occurring behaviors. Since such inputs require the parameters of both behaviors to be active, the parameters of one behavior “see” information present in the other, which might break the isolation required for counterfactual interventions later. This is not unique to our method; gradient routing has it even worse, as all inputs “see” all parameters, and the mask only decides which parameters are updated.
Another problem is generalization. Because we want isolation, we would like the parameters of a behavior to be active alongside as few unrelated parameters as possible, since otherwise leakage increases. This encourages us to increase segregation between tasks, including related ones, which might stunt the ability to generalize from similar tasks.
Extensions
Multiple adaptation layers: We don’t have to couple the number of adaptation and frozen parameters. Adaptation layers can be added on top of each other, decoupling the number of trainable parameters in the first and second stages. Conversely, we can restrict an adaptation layer to overlay only a subset of the frozen parameters.
Two-tier adaptation parameters: Adaptation parameters can be useful even if not updated. We can define two sets of semantic dropouts. One chooses the parameters that will be updated, and the other chooses a larger set that is relevant to behavior context. A behavior that depends on another more general behavior can have the general behavior’s parameters as context parameters that are not trained. This way, we don’t contaminate the general behavior’s parameters with specifics of sub-behaviors.
Each semantic encoder will capture a different notion of relevance. The context parameters will be chosen to include behaviors that are allowed to influence the behavior (a superset of the behavior) but not to be influenced. The trainable parameters represent a narrower notion (the behavior itself).
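A minimal sketch of how the two-tier idea could enter the forward pass (this is my own guess at one concrete reading of the extension; the names, the masks m_train and m_ctx, and the read-only treatment of context parameters are assumptions):

```python
import torch

def two_tier_forward(x, W_frozen, W_adapt, m_train, m_ctx):
    """Context parameters influence the output but receive no gradient;
    only the entries selected by the narrower trainable mask m_train are updated.
    m_train is intended to select a subset of the entries selected by m_ctx."""
    context_only = (m_ctx * (1 - m_train)) * W_adapt     # general-behavior parameters, read-only
    W = W_frozen.detach() + context_only.detach() + m_train * W_adapt
    return x @ W
```

With this split, the general behavior's parameters shape the computation for the sub-behavior, but the sub-behavior's gradients never flow back into them.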

