<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Emilio Cantú</title>
<link>https://ecntu.com/posts/</link>
<atom:link href="https://ecntu.com/posts/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<generator>quarto-1.9.36</generator>
<lastBuildDate>Sat, 11 Apr 2026 04:00:00 GMT</lastBuildDate>
<item>
  <title>Tiny Recursive Models Pt. 1</title>
  <link>https://ecntu.com/posts/trm/</link>
  <description><![CDATA[ 





<section id="trms-in-a-nutshell" class="level2">
<h2 class="anchored" data-anchor-id="trms-in-a-nutshell">TRMs in a nutshell</h2>
<p>A few months ago, Hierarchical Reasoning Models (HRMs) showed remarkable ARC performance for their relatively tiny (27M) parameter count. While HRMs introduced a lot of tricks, ablations performed by ARC’s team showed that one of them (<em>“deep supervision”</em>) accounted for most of the gains. By focusing on <em>“deep supervision”</em>, TRMs greatly simplify HRMs and outperform them with a quarter of the parameters.</p>
<p>TRMs can get away with so few parameters because they emulate much bigger models by recursively applying the network to its own outputs. They maintain a pair of latent states, <img src="https://latex.codecogs.com/png.latex?z"> and <img src="https://latex.codecogs.com/png.latex?y">, and refine them until the answer to the input puzzle is predicted by decoding <img src="https://latex.codecogs.com/png.latex?y">. Hence we can think of <img src="https://latex.codecogs.com/png.latex?y"> as maintaining the embedded answer, which frees up <img src="https://latex.codecogs.com/png.latex?z"> to do the “latent reasoning”.</p>
<p>A naive way to train such a model is to backpropagate the loss after all recursions. However, you quickly run out of memory as the model and number of recursions grow. Instead, deep supervision performs a chunk of the recursions (one <code>deep_recursion</code> call in the paper), computes the loss and performs an optimizer step. Then, the latents are detached and used in the next chunk of recursions.</p>
<p>The whole algorithm can be described with 20 lines:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/trm-algo.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="The full TRM training loop"><img src="https://ecntu.com/posts/trm/images/trm-algo.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500" alt="The full TRM training loop"></a></p>
</figure>
</div>
<figcaption>The full TRM training loop</figcaption>
</figure>
</div>
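<p>As a rough sketch of that loop (not the paper’s actual code; <code>deep_recursion</code> and <code>loss_and_step</code> are placeholder names), a single training example is processed like this:</p>

```python
def deep_supervision(x, y, z, deep_recursion, loss_and_step, N_sup=16):
    """Deep supervision: N_sup chunks of recursion, each followed by its
    own loss/optimizer step, with latents detached between chunks."""
    for _ in range(N_sup):
        # deep_recursion returns detached latents plus the two heads' outputs
        (y, z), y_hat, q_halt = deep_recursion(x, y, z)
        loss_and_step(y_hat, q_halt)  # backprop through this chunk only
    return y, z
```

<p>Because the latents come back detached, memory stays constant in the number of supervision steps: gradients never flow across chunk boundaries.</p>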
<p>I also made a diagram which omits details but helped me internalize the different recursion levels:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/trm-diagram.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="A diagram of TRM’s recursion levels"><img src="https://ecntu.com/posts/trm/images/trm-diagram.svg" class="img-fluid figure-img" alt="A diagram of TRM’s recursion levels"></a></p>
<figcaption>A diagram of TRM’s recursion levels</figcaption>
</figure>
</div>
<p>Note that since deep supervision drives most of the performance, the algorithm could have been simplified even further by defining:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> deep_recursion(x, y, z, K):</span>
<span id="cb1-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(K):</span>
<span id="cb1-3">    y, z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> net(x, y, z)</span>
<span id="cb1-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (y.detach(), z.detach()), output_head(y), Q_head(y)</span></code></pre></div></div>
<p>However, there are a few reasons why the paper does not:</p>
<ul>
<li>Because the network is kept single-headed for simplicity, it cannot update <img src="https://latex.codecogs.com/png.latex?z"> and <img src="https://latex.codecogs.com/png.latex?y"> in a single forward pass.</li>
<li>Relatedly, <code>latent_recursion</code> allows <img src="https://latex.codecogs.com/png.latex?z"> to be updated more frequently than <img src="https://latex.codecogs.com/png.latex?y">. We can think of the network as getting a few “scratchpad” iterations before having to commit to a revised answer. (Note that while intuitive and probably correct, the paper doesn’t perform ablations directly changing the latents’ relative update frequency.)</li>
<li>Lastly, the paper recurses a few times (<img src="https://latex.codecogs.com/png.latex?T-1">) without gradients before the final call to <code>latent_recursion</code> with gradients. The argument here is that since deep supervision trains the model to be a “local improver”, it should benefit from running a few extra recursions – even without gradients. However, too many recursions without gradients seem to hurt, as Table 3 shows.</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/trm-table-3.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Table 3 from the TRM paper (my highlights)"><img src="https://ecntu.com/posts/trm/images/trm-table-3.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="400" alt="Table 3 from the TRM paper (my highlights)"></a></p>
</figure>
</div>
<figcaption>Table 3 from the TRM paper (my highlights)</figcaption>
</figure>
</div>
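<p>To make the recursion levels concrete, here is a sketch of the two inner functions (names follow the paper’s <code>latent_recursion</code>/<code>deep_recursion</code>; <code>net</code> is a stand-in for the actual two-layer network):</p>

```python
def latent_recursion(net, x, y, z, n=6):
    """n "scratchpad" updates of z, then one committed update of y."""
    for _ in range(n):
        z = net(x, y, z)    # refine the reasoning latent
    y = net(None, y, z)     # revise the answer (the paper omits x here)
    return y, z

def deep_recursion(net, x, y, z, n=6, T=3):
    """T recursion blocks: the first T - 1 run without gradients in
    practice, and only the final block is backpropagated through."""
    for _ in range(T - 1):
        y, z = latent_recursion(net, x, y, z, n)  # would be stop-gradient'ed
    return latent_recursion(net, x, y, z, n)      # gradients flow here
```

<p>So one <code>deep_recursion</code> call costs <img src="https://latex.codecogs.com/png.latex?T(n%20+%201)"> network evaluations, but gradients only touch the last <img src="https://latex.codecogs.com/png.latex?n%20+%201"> of them.</p>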
<p>The last important detail is that deep supervision stops when the network has &gt;50% confidence its predicted solution is correct. This “early-stopping” is turned off during testing for performance but makes training more efficient:</p>
<blockquote class="blockquote">
<p>“ACT greatly diminishes the time spent per example (on average spending less than 2 steps on the Sudoku-Extreme dataset rather than the full <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D%20=%2016"> steps), allowing more coverage of the dataset given a fixed number of training iterations.”</p>
</blockquote>
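<p>The halting check itself is tiny (a sketch; <code>q_halt_logit</code> is my name for the halt head’s scalar output, not the paper’s):</p>

```python
import math

def should_halt(q_halt_logit):
    """Stop supervising this example once the model assigns more than
    50% probability that its current prediction is correct."""
    return 1.0 / (1.0 + math.exp(-q_halt_logit)) > 0.5

# equivalently: halt whenever the logit is positive
```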
<p>So, that is TRM in a nutshell. The paper does experiments on several tasks (sudokus, mazes, and ARC) and performs lots of ablations (number of latents, architecture, early-stopping, etc.) to show that the choices they landed on are optimal. It also builds up the final design by simplifying HRMs step-by-step and it’s a very pleasant read.</p>
<p>After its publication, others have found simple tweaks that improve performance:</p>
<ul>
<li><a href="https://x.com/ritteradam/status/1982190450711642300?s=20">These</a> <a href="https://x.com/huskydogewoof/status/1982503109042831472?s=20">posts</a> found that simply increasing <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D"> at test time increases sudoku performance from ~87% to ~96%.</li>
<li><a href="https://arxiv.org/abs/2511.08653">This</a> paper trains the network about 2x faster by using a curriculum on the number of recursions. Instead of the paper’s <img src="https://latex.codecogs.com/png.latex?(n,T)%20=%20(6,3)">, they do <img src="https://latex.codecogs.com/png.latex?(2,1)%5Crightarrow(4,2)%5Crightarrow(6,3)">.</li>
<li><a href="https://arxiv.org/pdf/2511.16886v4">This</a> paper found that if your puzzle admits contractive “milestones” that progressively zero in on the solution, supervising sequentially on those milestones performs better than supervising only on the final solution.</li>
</ul>
<p>I’m sure there are many more, but those caught my eye. Now, let’s try a few simple experiments!</p>
</section>
<section id="experiments" class="level2">
<h2 class="anchored" data-anchor-id="experiments">Experiments</h2>
<p>We’ll use a very simple JAX <a href="https://github.com/ecntu/trm-jax">implementation</a> to take advantage of the compute provided by <a href="https://sites.research.google/trc/about/">TRC</a> (thank you!), and focus on the sudoku-extreme dataset for now.</p>
<section id="random-latent-inits-for-best-of-k-at-inference" class="level3">
<h3 class="anchored" data-anchor-id="random-latent-inits-for-best-of-k-at-inference">Random latent inits for best-of-k at inference</h3>
<p>Since increasing the reasoning “depth” (<img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D">) at test time worked so well, could increasing “breadth” help?</p>
<p>The paper’s implementation initializes <img src="https://latex.codecogs.com/png.latex?z"> and <img src="https://latex.codecogs.com/png.latex?y"> to random values that are chosen and fixed <em>at model initialization</em>. That is, every forward pass during training and testing starts with the same latents. A simple way to try to get “breadth” is to see if different starting latents are mapped to diverse predictions.</p>
<p>We could then combine or choose among the diverse predictions to make a final one. Normally you would take the majority vote (as <a href="https://arxiv.org/abs/1912.02781">test time augmentation (TTA)</a> in vision, or <a href="https://arxiv.org/abs/2203.11171">self-consistency</a> in LLMs), but using the model’s own confidence in its prediction (using the halting head) should make more sense here. The paper actually uses both for the TTA they do for ARC, taking the majority vote and breaking ties with model confidence.</p>
<p>Note that it’s not clear that different initializations <em>will</em> get mapped to diverse solutions. It could be that they are ignored because the model maps them to the correct (unique) solution. Or the latents could get numerically “washed away” in the recursion, especially the first <img src="https://latex.codecogs.com/png.latex?T-1"> latent recursions without gradient.</p>
<p>We train networks with random latent inits and reuse the same forward passes to track three prediction methods: the cell-wise majority vote, the whole puzzle the model is most confident in, and a single forward pass as a baseline. Recycling the forward passes makes this the least noisy experiment we’ll run, since training run-to-run variance is really high. However, we also separately train networks with static latent inits as another baseline.</p>
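<p>The two combination rules amount to something like this (a sketch; the <code>(k, cells)</code> shape for <code>preds</code> is my assumption):</p>

```python
import numpy as np

def most_confident(preds, halt_logits):
    """Keep the whole candidate puzzle the halt head is most confident in."""
    return preds[int(np.argmax(halt_logits))]

def cellwise_mode(preds):
    """Per-cell majority vote over the k candidate grids."""
    preds = np.asarray(preds)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```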
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/randomized-models-validation-curves.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://ecntu.com/posts/trm/images/randomized-models-validation-curves.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></a></p>
</figure>
</div>
<p>After training we evaluate models with a chunk of the test set using each method’s best validation checkpoint and let <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D"> and <img src="https://latex.codecogs.com/png.latex?k"> grow.</p>
<p><a href="images/randomized-models-test-curves.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://ecntu.com/posts/trm/images/randomized-models-test-curves.png" class="img-fluid"></a></p>
<p>Some observations:</p>
<ul>
<li>In retrospect, cell-wise mode was a silly thing to try, since cells in sudokus are not independent. As the paper did for ARC, I should have taken puzzle-wise modes. I’ll try to run these at some point.</li>
<li>Random inits paired with model confidence do yield gains and they are most pronounced around (or slightly below) the recursion budget used during training.</li>
<li>As with most test-time scaling methods, increasing <img src="https://latex.codecogs.com/png.latex?k"> (and <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D">) has diminishing returns.</li>
<li>Methods converge in performance at large recursion budgets.</li>
</ul>
<p>However, bigger <img src="https://latex.codecogs.com/png.latex?k">’s require more compute and it seems that, at least for this problem and model size, it’s more efficient to scale <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D"> than <img src="https://latex.codecogs.com/png.latex?k">.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/randomized-models-test-curves-compute-normalized.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://ecntu.com/posts/trm/images/randomized-models-test-curves-compute-normalized.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="600"></a></p>
</figure>
</div>
<p>There might be other settings where random initial latents pay off: problems with multiple solutions, say, or models with underfit predictions but calibrated halt heads. Anyway, I was mostly surprised that initializations don’t collapse onto the same prediction (at least at low depths) and want to come back to this in the future.</p>
</section>
<section id="more-randomization-and-path-independence" class="level3">
<h3 class="anchored" data-anchor-id="more-randomization-and-path-independence">More randomization and path independence</h3>
<p>Randomizing initializations does not only enable test-time augmentation; it has also been shown to stabilize recurrence by helping the model converge to fixed points. <a href="https://arxiv.org/abs/2211.09961"><em>Path independent</em></a> models are those whose latents converge to the same fixed points regardless of the path taken to reach them. The argument is that path independent models are more likely to take advantage of additional iterations by converging to the solution, unlike path dependent models, which diverge.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/path-independence.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7" title="Figure 1 from Path Independent Equilibrium Models Can Better Exploit Test-Time Computation."><img src="https://ecntu.com/posts/trm/images/path-independence.png" class="img-fluid figure-img" alt="Figure 1 from Path Independent Equilibrium Models Can Better Exploit Test-Time Computation."></a></p>
<figcaption>Figure 1 from <em>Path Independent Equilibrium Models Can Better Exploit Test-Time Computation</em>.</figcaption>
</figure>
</div>
<p>That paper also randomized the number of iterations during training to increase path independence. In our case we could vary <img src="https://latex.codecogs.com/png.latex?n"> (the number of iterations in <code>latent_recursion</code> that we spend refining <img src="https://latex.codecogs.com/png.latex?z"> before updating <img src="https://latex.codecogs.com/png.latex?y">), <img src="https://latex.codecogs.com/png.latex?T"> (the total number of <code>latent_recursion</code> calls, all but the last without gradients), and <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D"> (the number of full <img src="https://latex.codecogs.com/png.latex?T(n%20+%201)">-sized recursion blocks, i.e.&nbsp;the number of supervision steps).</p>
<p>We could also randomize two or more of these at a time. For example, varying both <img src="https://latex.codecogs.com/png.latex?T"> and <img src="https://latex.codecogs.com/png.latex?n"> slightly resembles the truncated BPTT with random start and end steps that <a href="https://arxiv.org/abs/2202.05826">this</a> paper proposes. However, we keep it simple for now.</p>
<p>In each training forward pass we sample from uniforms centered at the default values (<img src="https://latex.codecogs.com/png.latex?n=6">, <img src="https://latex.codecogs.com/png.latex?T=3">, <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D=16">) to match the expected compute of deterministic baselines. We try different ranges for these uniforms and keep inference deterministic.</p>
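<p>Concretely, each training forward pass samples something like the following (a sketch; the clipping to keep values at least 1 is my choice):</p>

```python
import random

def sample_hyper(center, w, rng=random):
    """Integer uniform on [center - w, center + w], clipped to stay >= 1.
    Centring at the default keeps expected compute equal to the
    deterministic baseline."""
    return max(1, rng.randint(center - w, center + w))

# per training forward pass, with half-width w as in the plots:
# n, T, N_sup = (sample_hyper(c, w) for c in (6, 3, 16))
```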
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/randomized-models-performance.png" class="lightbox" data-gallery="quarto-lightbox-gallery-8" title="We display mean ± SE and the uniforms’ half-width w."><img src="https://ecntu.com/posts/trm/images/randomized-models-performance.png" class="img-fluid figure-img" alt="We display mean ± SE and the uniforms’ half-width w."></a></p>
<figcaption>We display mean ± SE and the uniforms’ half-width <img src="https://latex.codecogs.com/png.latex?w">.</figcaption>
</figure>
</div>
<p>On the right we show the <em>Asymptotic Alignment Score</em> which tries to capture how path independent a network is. It measures the similarity between the final latents of two predictions made for the same data point. The first prediction is initialized normally, but the second uses the first’s final latents. If the similarity is high, the network maps different initializations to the same fixed points and is more path independent. Since we have both latents, we <code>roll</code> both <img src="https://latex.codecogs.com/png.latex?z"> and <img src="https://latex.codecogs.com/png.latex?y"> for the second forward pass.</p>
<p>Above we plotted the similarity of final <em>predictions</em> instead since we found those to be more consistent. Below we include scores for latents and plot against performance.</p>
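<p>A sketch of the score computation (assuming cosine similarity as the similarity measure; <code>forward</code> stands in for a full deterministic forward pass returning the final latents):</p>

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_scores(forward, x, y0, z0):
    """Run once from the standard init, run again from ("rolling") the
    first pass's final latents, and compare the two final states."""
    y1, z1 = forward(x, y0, z0)
    y2, z2 = forward(x, y1, z1)   # second pass reuses the final latents
    return cos_sim(z1, z2), cos_sim(y1, y2)
```

<p>A perfectly path independent network returns the same fixed point both times, scoring 1 for each latent.</p>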
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/randomized-models-correlation.png" class="lightbox" data-gallery="quarto-lightbox-gallery-9" title="Large dots are per-method means; small ones individual seeds. Spearman \rho above each panel."><img src="https://ecntu.com/posts/trm/images/randomized-models-correlation.png" class="img-fluid figure-img" alt="Large dots are per-method means; small ones individual seeds. Spearman \rho above each panel."></a></p>
<figcaption>Large dots are per-method means; small ones individual seeds. Spearman <img src="https://latex.codecogs.com/png.latex?%5Crho"> above each panel.</figcaption>
</figure>
</div>
<p>Even with all the noise from sensitive training and only 5 seeds, we get a few observations:</p>
<ul>
<li>Like the paper that introduced path independence, we find that it is correlated with performance in this setting.</li>
<li>Reasoning latents (<img src="https://latex.codecogs.com/png.latex?z">’s) saturate and cluster around 1 more so than latent predictions (<img src="https://latex.codecogs.com/png.latex?y">’s). This might be simply because <img src="https://latex.codecogs.com/png.latex?z">’s are updated much more than <img src="https://latex.codecogs.com/png.latex?y">’s.</li>
<li>Surprisingly, random initializations seem to <em>decrease</em> path independence and performance. This doesn’t contradict the path independence paper, since their randomization is different: they initialize with zero vectors, only add Gaussian noise to half the entries during training, and then use all zeros during testing. I reused the code from the previous experiments and randomize during testing too. I’ll try to explore this in a future post.</li>
<li>Even though we increase <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D"> at test time, it’s randomizing <img src="https://latex.codecogs.com/png.latex?T"> during training that yields the biggest gains.</li>
<li>Increasing randomization during training (increasing the uniforms’ range) generally seems to improve performance, with a curious blip in <img src="https://latex.codecogs.com/png.latex?N_%5Ctext%7Bsup%7D%20(w=4)">. To investigate: is the blip real? do gains stack by combining methods? how do gains scale with model capacity?</li>
</ul>
</section>
</section>
<section id="to-be-continued" class="level2">
<h2 class="anchored" data-anchor-id="to-be-continued">To be continued</h2>
<p>There’s definitely a lot more to learn about and with TRMs. The random initialization idea likely belongs in the paper’s great “Ideas that failed” section and the experiments need polishing. But I thought releasing a first post and iterating would be more fun.</p>


</section>

 ]]></description>
  <guid>https://ecntu.com/posts/trm/</guid>
  <pubDate>Sat, 11 Apr 2026 04:00:00 GMT</pubDate>
  <media:content url="https://ecntu.com/posts/trm/images/trm-diagram.svg" medium="image" type="image/svg+xml"/>
</item>
<item>
  <title>Fuzzy matching professors to their reviews</title>
  <link>https://ecntu.com/posts/fuzzy-name-matching/</link>
  <description><![CDATA[ 





<section id="context" class="level2">
<h2 class="anchored" data-anchor-id="context">Context</h2>
<p>During my undergrad, I built a simple schedule-planning <a href="https://horariositam.com">site</a> for my university. By the third semester I’d grown tired of using Excel and had learned enough JavaScript to code up a site that ranked every possible schedule by my preferences. One of these preferences was to upweight schedules based on professors’ ratings extracted from a site like RateMyProfessor.</p>
<p>This is where I first encountered fuzzy profile matching (also called record linkage, entity resolution, etc.): deciding whether profiles in different databases refer to the same person. The problem arises because student reviewers create a professor’s profile, which means the same professor can appear under different names across sites:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Nombre (School)</th>
<th>Nombre (Review)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Juan Pérez</td>
<td>J. Perez</td>
</tr>
<tr class="even">
<td>María González</td>
<td>Maria Glez</td>
</tr>
<tr class="odd">
<td>Carlos Ramírez</td>
<td>C. Ramirez</td>
</tr>
</tbody>
</table>
<p>At the time I decided to link profiles using a very simple approach based on the normalized string similarity (<a href="https://www.digitalocean.com/community/tutorials/levenshtein-distance-python">levenshtein ratio</a>) between names:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> lev_ratio(a, b):</span>
<span id="cb1-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> edit_dist(a, b) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(a)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(b))</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n_a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> school_names:</span>
<span id="cb1-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n_b <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> review_names:</span>
<span id="cb1-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> lev_ratio(n_a, n_b) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>:</span>
<span id="cb1-7">            link(n_a, n_b)</span>
<span id="cb1-8">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span></code></pre></div></div>
<p>While this worked okay at the time, in retrospect it was a bad approach for several reasons. Since a lot of students still use the site, I thought I could do better after 6 years and a master’s in stats.</p>
</section>
<section id="how-to-do-better" class="level2">
<h2 class="anchored" data-anchor-id="how-to-do-better">How to do better</h2>
<p>I thought I would simply:</p>
<ol type="1">
<li>Gather data by manually matching profiles. I need data to tune and, more importantly, to evaluate the approach.</li>
<li>Engineer features and train a pairwise binary classifier to predict the probability of a true match given two profiles.</li>
<li>Use the pairwise classifier to predict matches (or ‘no match’) for every professor on the school’s site.</li>
</ol>
<p>Each of these steps involved a few subtleties. I’ll quickly go over them, along with what I would have done differently or will improve in the future.</p>
<section id="data" class="level3">
<h3 class="anchored" data-anchor-id="data">Data</h3>
<p>I spent a couple of hours manually matching 150 profiles: 100 for training and the rest for evaluation. Professors on the school site had one, multiple, or no matching profiles on the review site.</p>
<p>Note that since we only have manually annotated <em>matches</em>, our training set only contains positives. So how to generate negative <img src="https://latex.codecogs.com/png.latex?(%5Ctext%7Bprofile%7D_A,%20%5Ctext%7Bprofile%7D_B)"> pairs to train our classifier?</p>
<p>A naive option would be to label every pair that was not manually matched as negative, ending up with <img src="https://latex.codecogs.com/png.latex?N_%7Bschool%7D%20%5Ctimes%20(N_%7Breview%7D%20-%201)"> negative pairs. However, I did not want to deal with the high class imbalance and opted to randomly sample <img src="https://latex.codecogs.com/png.latex?k_%5Ctext%7Bneg%7D"> such pairs for every positive one.</p>
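<p>A sketch of that sampling (names are mine; <code>matches</code> holds the hand-labelled positive pairs):</p>

```python
import random

def build_training_pairs(matches, review_profiles, k_neg=10, rng=random):
    """Positives: the hand-labelled matches. Negatives: for each positive,
    k_neg randomly sampled non-matching (school, review) pairs."""
    matched = set(matches)
    pairs = [(a, b, 1) for a, b in matches]
    for a, _ in matches:
        sampled = 0
        while sampled < k_neg:
            b = rng.choice(review_profiles)
            if (a, b) not in matched:
                pairs.append((a, b, 0))
                sampled += 1
    return pairs
```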
<p>As we’ll see below, this worked alright. But most of the negatives were really “easy”, since two randomly sampled profiles are likely to have wildly different names (and other features). Thus, this first approach will likely not perform well on the rare but tough cases where truly distinct professors have similar features.</p>
<p>What I wish I had thought of earlier is active learning. I could have collected a few matches (say 20 instead of 100), trained a model, and used its predictions to find harder examples. To find them you can use low model confidence or a number of other heuristics (<a href="https://github.com/koaning/doubtlab">doubtlab</a> collects several). After labelling a few of the hardest examples, you retrain and repeat (making sure not to overfit).</p>
<p>I think this approach could have saved me some labelling and resulted in higher quality negatives. I’ll try to do the experiment sometime soon.</p>
</section>
<section id="features-model" class="level3">
<h3 class="anchored" data-anchor-id="features-model">Features &amp; model</h3>
<p>Besides the names from both platforms, I also included professors’ departments:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 23%">
<col style="width: 24%">
<col style="width: 26%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Nombre (S)</th>
<th>Nombre (R)</th>
<th>Depto (S)</th>
<th>Depto (R)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Juan Pérez</td>
<td>J. Perez</td>
<td>Matemáticas, Actuaria</td>
<td>Mate</td>
</tr>
<tr class="even">
<td>María González</td>
<td>Maria Glez</td>
<td>Física, Matemáticas</td>
<td>Mate</td>
</tr>
<tr class="odd">
<td>Carlos Ramírez</td>
<td>C. Ramirez</td>
<td>Computación, Actuaria</td>
<td>Compu</td>
</tr>
</tbody>
</table>
<p>After playing around a little, I settled on using only two features for simplicity:</p>
<ul>
<li>A string similarity between the profile names. Went for <a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#token-sort-ratio"><code>token_sort_ratio</code></a> since it handles missing words and is not sensitive to order. For example, “Juan Perez” and “J. Perez” score a perfect 1.</li>
<li>A string similarity between the departments. Because a professor could have multiple school departments (extracted from the classes they teach that semester), I simply took the maximum of the similarities (<a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#token-set-ratio"><code>token_set_ratio</code></a> here).</li>
</ul>
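<p>Putting the two features together (a sketch using <code>difflib</code> as a rough stand-in for rapidfuzz’s ratios, which behave similarly but are not identical):</p>

```python
from difflib import SequenceMatcher

def token_sort_sim(a, b):
    """Rough stand-in for token_sort_ratio: compare alphabetically
    sorted, lowercased tokens so word order doesn't matter."""
    key = lambda s: " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, key(a), key(b)).ratio()

def pair_features(name_s, name_r, depts_s, dept_r):
    """Name similarity, plus the max similarity over the (possibly
    multiple) school departments."""
    return (token_sort_sim(name_s, name_r),
            max(token_sort_sim(d, dept_r) for d in depts_s))
```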
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/fuzzy-name-matching/images/feature_scatter.png" class="img-fluid figure-img"></p>
<figcaption>Features per class for the training set. There are <code>k=10</code> times as many negatives as positives. You can see how the negative examples are too easy because the string similarity for randomly sampled pairs is really low.</figcaption>
</figure>
</div>
<p>In keeping with the simplicity theme, the model was a logistic regressor. Later on it might be fun to train a character-level LLM or BERT model to see if they can beat these handcrafted features.</p>
</section>
<section id="final-matching" class="level3">
<h3 class="anchored" data-anchor-id="final-matching">Final matching</h3>
<p>Once the pairwise classifier is trained, how do we use it to make matches?</p>
<p>Should we do something similar to my previous approach from years ago?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prof_a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> school_profiles:</span>
<span id="cb2-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prof_b <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> review_profiles:</span>
<span id="cb2-3">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> model.predict_proba([prof_a, prof_b]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>:</span>
<span id="cb2-4">            link(prof_a, prof_b)</span>
<span id="cb2-5">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span></code></pre></div></div>
<p>No.&nbsp;In hindsight this was terrible for a couple of reasons. First, I tuned the threshold by skimming the output, since I didn’t have any manually matched profiles. Second, we link to the <em>first</em> review profile above the threshold, not the most probable one, which makes the result sensitive to ordering. A simple improvement, and what I ended up using, is to consider only the most probable predicted match:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prof_a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> school_profiles:</span>
<span id="cb3-2">    argmax_prof_b, max_prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> best_match(prof_a)</span>
<span id="cb3-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> max_prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> threshold:</span>
<span id="cb3-4">        link(prof_a, argmax_prof_b)</span></code></pre></div></div>
<p>This is still not perfect, since two school profiles can be mapped to the same review profile. A more principled approach (which I might get to later) is to frame this as a bipartite graph matching problem: school profiles and review profiles form the two node sets, and the classifier’s match probabilities are the weighted edges between them. You can then choose the edges that maximize the total edge weight while respecting one-to-one constraints.</p>
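<p>For illustration, here is a minimal sketch of that assignment formulation using <code>scipy.optimize.linear_sum_assignment</code>. The <code>prob_matrix</code> of pairwise match probabilities is a hypothetical stand-in for running the classifier over every profile pair; it is not part of the code above.</p>

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_profiles(prob_matrix, threshold=0.5):
    """One-to-one matching that maximizes the total match probability.

    prob_matrix[i, j] is the classifier's probability that school
    profile i and review profile j refer to the same person
    (hypothetical: built by scoring all pairs).
    """
    # maximize=True selects the assignment with the largest total weight
    rows, cols = linear_sum_assignment(prob_matrix, maximize=True)
    # Keep only the pairs the classifier is reasonably confident about
    return [(i, j) for i, j in zip(rows, cols) if prob_matrix[i, j] > threshold]

# Toy example: greedy linking could pair school profile 0 with review 0,
# but the optimal assignment pairs it with review 1 instead.
probs = np.array([[0.6, 0.95],
                  [0.9, 0.1]])
matches = match_profiles(probs)  # [(0, 1), (1, 0)]
```

<p>For datasets of this size (about 1k profiles each), the cubic-time assignment solve is effectively instant.</p>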
</section>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>We still have to tune the <code>threshold</code> and <code>k</code> (the number of negatives to sample per positive during training) parameters. I initially set them to 0.5 and 10 respectively, but while writing this I thought it would be nice to cross-validate and see how good that guess was.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/fuzzy-name-matching/images/cv_results.png" class="img-fluid figure-img"></p>
<figcaption>Cross-validation results (5 folds) showing the percentage of correct matches, incorrect matches, missed matches (abstained when a correct match existed), and the overall number of matches made.</figcaption>
</figure>
</div>
<p>You can see the tradeoffs. A bigger <code>k</code> means more negatives and hence a bigger training set, but also greater class imbalance. Seeing many more negatives than positives makes the model less confident when predicting positives, which is why the <code>k=100</code> model needs the threshold lowered to <code>0.25</code> to achieve good performance.</p>
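<p>The imbalance effect is easy to reproduce in isolation. The sketch below is not from the original pipeline: it trains a logistic regression on a toy two-Gaussian task with <code>k</code> negatives per positive and reports the mean probability assigned to held-out positives. Raising <code>k</code> pushes those probabilities down, which is why the threshold has to follow.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def mean_positive_prob(k, n_pos=500):
    """Train with k negatives per positive; return the mean predicted
    probability assigned to fresh positive examples."""
    X_pos = rng.normal(1.0, 1.0, size=(n_pos, 1))        # true matches
    X_neg = rng.normal(-1.0, 1.0, size=(k * n_pos, 1))   # sampled non-matches
    X = np.vstack([X_pos, X_neg])
    y = np.r_[np.ones(n_pos), np.zeros(k * n_pos)]
    clf = LogisticRegression().fit(X, y)
    fresh_pos = rng.normal(1.0, 1.0, size=(n_pos, 1))
    return clf.predict_proba(fresh_pos)[:, 1].mean()

p1, p100 = mean_positive_prob(k=1), mean_positive_prob(k=100)
# p100 is substantially lower than p1: the k=100 model sees far more
# negatives, so its predicted probabilities shrink and a fixed 0.5
# cutoff would reject many true matches.
```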
<p>The metrics we care about are inherently a trade-off, but I would say the initial guess (green) fared quite well. Two other configurations (blue and red) performed comparably, and the blue (0.75, 1) option could have saved a few seconds during training.</p>
<p>Anyway, the new method is better on our test set than what I implemented years ago:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Metric</th>
<th>New</th>
<th>Previous</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Correct</td>
<td>0.9</td>
<td>0.82</td>
</tr>
<tr class="even">
<td>Incorrect match</td>
<td>0.04</td>
<td>0.04</td>
</tr>
<tr class="odd">
<td>Incorrect no match</td>
<td>0.06</td>
<td>0.14</td>
</tr>
<tr class="even">
<td>Has Match</td>
<td>0.76</td>
<td>0.66</td>
</tr>
</tbody>
</table>
<p>The deployed model now links about 74% of professors to a review profile, up from 60% before.</p>
</section>
<section id="final-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="final-thoughts">Final thoughts</h2>
<p>I had fun coming back to this site and fiddling around with an ML problem that is not just “standard” supervised learning. I tried tackling the problem without a literature review for fun, but this is obviously a well-studied problem. I should mention that since my datasets were pretty small (about 1k profiles each), I didn’t run into the usual compute challenges. <a href="https://www.science.org/doi/10.1126/sciadv.abi8021">Here</a> is a good overview survey if you are interested in what they are and how to alleviate them.</p>


</section>

 ]]></description>
  <guid>https://ecntu.com/posts/fuzzy-name-matching/</guid>
  <pubDate>Tue, 09 Sep 2025 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Engression</title>
  <link>https://ecntu.com/posts/engression/</link>
  <description><![CDATA[ 





<p>Traditional regression models predict the conditional mean <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%20E%5BY%E2%88%A3X=x%5D">, or <a href="https://en.wikipedia.org/wiki/Quantile_regression">sometimes</a> a few quantiles. In contrast, distributional regression attempts to learn the <em>entire</em> conditional distribution <img src="https://latex.codecogs.com/png.latex?Y%7CX=x">. Having access to the full distribution gives us calibrated uncertainty estimates, probabilistic forecasts, etc.</p>
<p>In a recent seminar I learned about engression, a lightweight and principled approach to distributional regression. Instead of predicting a parametric distribution or optimizing a likelihood, engression trains models to transform noise into samples from <img src="https://latex.codecogs.com/png.latex?Y%7CX=x">, using a <a href="https://en.wikipedia.org/wiki/Scoring_rule">proper scoring rule</a> called the <a href="https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf">energy score</a>. It’s implicit, generative, and remarkably straightforward to implement.</p>
<p>Here, I try to explain the main idea, reproduce some of the <a href="https://arxiv.org/abs/2307.00835">paper</a>’s results, and discuss a few of its properties.</p>
<section id="in-a-nutshell" class="level2">
<h2 class="anchored" data-anchor-id="in-a-nutshell">In a nutshell</h2>
<p>We consider a general class of models <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D%20=%20%5C%7Bg(x,%20%5Cvarepsilon)%5C%7D">, where each model takes covariates <img src="https://latex.codecogs.com/png.latex?x"> and a random vector <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> as input. The noise vector <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> is drawn from a fixed distribution independent of <img src="https://latex.codecogs.com/png.latex?x">. We imagine each function <img src="https://latex.codecogs.com/png.latex?g(x,%20%5Cvarepsilon)%20%5Cmapsto%20y"> in <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> as defining a conditional distribution of <img src="https://latex.codecogs.com/png.latex?Y%20%5Cmid%20X%20=%20x">. The “best” <img src="https://latex.codecogs.com/png.latex?g"> is found by minimizing the <strong>engression loss</strong>: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%20E%20%5Cleft%5B%20%5Cleft%20%5C%7C%20Y%20-%20g(X,%20%5Cvarepsilon)%20%5Cright%20%5C%7C%20-%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%20%5C%7C%20g(X,%20%5Cvarepsilon)%20-%20g(X,%20%5Cvarepsilon')%20%5Cright%20%5C%7C%20%5Cright%5D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> and <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon'"> are independent draws from the noise distribution. This is the negative of the energy score, a proper scoring rule. As a result, the authors show that the minimizer recovers the true conditional distribution <img src="https://latex.codecogs.com/png.latex?Y%20%5Cmid%20X"> — assuming the model class <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> is expressive enough. In practice, this means using neural networks with sufficient capacity.</p>
<p>Intuitively, the loss encourages two things. The first term ensures that the generated samples are close to the observed target. It pulls the predicted distribution toward the actual data. The second term penalizes collapsing all samples to a single point. It forces the model to generate diverse samples, reflecting the variability in the data.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/images/loss.gif" class="img-fluid figure-img"></p>
<figcaption>If we only used the first term in the engression loss, our estimated distribution would collapse. Toy example where <img src="https://latex.codecogs.com/png.latex?Y%7CX%20%5Csim%20N(X,%200.1)"></figcaption>
</figure>
</div>
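<p>We can sanity-check this intuition numerically. The Monte Carlo sketch below (my own illustration, not from the paper) compares the empirical loss of a generator that collapses to the median of <img src="https://latex.codecogs.com/png.latex?Y%20%5Csim%20N(0,%201)"> against one that samples from the true distribution. The collapsed generator wins on the first term alone, yet the full loss, being a proper score, prefers the honest one.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.standard_normal(n)  # observations of Y ~ N(0, 1)

def energy_loss(y, g1, g2):
    """Empirical engression loss in 1-D: E|Y - g| - 0.5 * E|g - g'|,
    with g1, g2 two independent batches of generator samples."""
    return np.abs(y - g1).mean() - 0.5 * np.abs(g1 - g2).mean()

# Degenerate generator: always outputs the median, 0.0
collapsed = energy_loss(y, np.zeros(n), np.zeros(n))
# Generator that matches the true distribution N(0, 1)
honest = energy_loss(y, rng.standard_normal(n), rng.standard_normal(n))
# collapsed ≈ E|Y| = sqrt(2/pi) ≈ 0.80, honest ≈ 1/sqrt(pi) ≈ 0.56,
# so the loss correctly rewards matching the whole distribution.
```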
<p>To minimize the empirical version of the loss using a dataset <img src="https://latex.codecogs.com/png.latex?%5C%7B(X_i,%20Y_i):%20i=1,%20%5Cdots,%20n%5C%7D">, we sample <img src="https://latex.codecogs.com/png.latex?m"> noise vectors <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> <em>per</em> <img src="https://latex.codecogs.com/png.latex?i">-th observation and minimize</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7Bn%7D%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%20%5B%20%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bj=1%7D%5Em%20%20%5C%7C%20Y_i%20-%20g(X_i,%20%5Cvarepsilon_%7Bi,j%7D)%20%5C%7C%20-%20%5Cfrac%7B1%7D%7B2m(m-1)%7D%20%5Csum_%7Bj=1%7D%5Em%20%5Csum_%7Bj'=1%7D%5Em%20%5C%7Cg(X_i,%20%5Cvarepsilon_%7Bi,%20j%7D)%20-%20g(X_i,%20%5Cvarepsilon_%7Bi,%20j'%7D)%20%20%5C%7C%5Cright%5D%0A"></p>
<p>Once trained, <img src="https://latex.codecogs.com/png.latex?g"> acts as an <em>implicit</em> and <em>generative</em> model for <img src="https://latex.codecogs.com/png.latex?Y%20%5Cmid%20X">. That is, <img src="https://latex.codecogs.com/png.latex?g"> won’t give us an explicit density <img src="https://latex.codecogs.com/png.latex?P(Y=y%20%5Cmid%20X=x)">, but we can use it to obtain samples from <img src="https://latex.codecogs.com/png.latex?Y%20%5Cmid%20X=x">. We independently draw as many <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon">’s as the number of samples we want and feed them into <img src="https://latex.codecogs.com/png.latex?g"> to get <img src="https://latex.codecogs.com/png.latex?%5C%7B%20g(x,%20%5Cvarepsilon_j)%20%5C%7D_%7Bj=1%7D%5EK">. With these samples, we can estimate means, medians, confidence intervals, etc. as usual.</p>
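<p>Sampling from the fitted model is then just repeated forward passes with fresh noise. Here is a self-contained sketch; instead of a trained network it uses a closed-form stand-in generator <code>g(x, eps) = x + 0.1 * eps</code>, whose conditional law is <img src="https://latex.codecogs.com/png.latex?N(x,%200.1%5E2)"> by construction, so we can check the estimates.</p>

```python
import numpy as np

def sample_conditional(g, x, n_samples=5000, seed=None):
    """Draw n_samples from the learned Y | X = x by pushing
    independent fresh noise draws through the generator g."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # one eps per desired sample
    return g(x, eps)

# Stand-in generator with known conditional distribution N(x, 0.1^2)
g = lambda x, eps: x + 0.1 * eps

samples = sample_conditional(g, x=2.0, seed=0)
estimate_mean = samples.mean()                 # close to 2.0
lo, hi = np.quantile(samples, [0.025, 0.975])  # close to (1.80, 2.20)
```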
<p>Note: While we focused on the main loss explored in the paper, the authors mention several generalizations in Appendix D. Specifically, the energy score is one example of a broader class of kernel scores, and engression can in principle use any proper scoring rule that characterizes a distribution (see Section 2). I wouldn’t be surprised if future work develops variants that emphasize the tails, which could be especially useful in risk-sensitive applications.</p>
</section>
<section id="pre-additive-noise-and-extrapolation" class="level2">
<h2 class="anchored" data-anchor-id="pre-additive-noise-and-extrapolation">Pre-additive noise and extrapolation</h2>
<p>While minimizing the loss above is sufficient to learn the conditional distribution within the training range (assuming <img src="https://latex.codecogs.com/png.latex?g"> is expressive enough), the authors show that, under certain assumptions about the noise structure, engression can also support limited extrapolation.</p>
<p>Most regression and generative models assume that the noise is post-additive: the noise <img src="https://latex.codecogs.com/png.latex?%5Ceta"> is added after applying a nonlinear transformation to the covariates, so <img src="https://latex.codecogs.com/png.latex?Y%20=%20g(X)%20+%20%5Ceta">. Pre-additive noise instead assumes <img src="https://latex.codecogs.com/png.latex?Y%20=%20g(X%20+%20%5Ceta)">. This helps with extrapolation because, as the authors note:</p>
<blockquote class="blockquote">
<p>“As such, if the data are generated according to a post-ANM, the observations for the response variable are perturbed values of the true function evaluated at covariate values within the support. We hence generally have no data-driven information about the behaviour of the true function outside the support. In contrast, data generated from a pre-ANM contain response values that reveal some information beyond the support”.</p>
</blockquote>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/images/noise.svg" class="img-fluid figure-img"></p>
<figcaption>With a pre-additive noise model (right) we gain information beyond the training support. Illustrated here with only two points. The blue and orange dots represent the possible values for <img src="https://latex.codecogs.com/png.latex?x_1"> and <img src="https://latex.codecogs.com/png.latex?x_2"> respectively due to noise from <img src="https://latex.codecogs.com/png.latex?%5Ceta">.</figcaption>
</figure>
</div>
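<p>To make the quote concrete: under post-additive noise the true function is only ever evaluated inside the covariate support, while under pre-additive noise the responses involve <img src="https://latex.codecogs.com/png.latex?g"> evaluated at <img src="https://latex.codecogs.com/png.latex?x%20+%20%5Ceta">, which spills outside it. A toy sketch of the two data-generating processes, using the softplus function from the paper's experiments:</p>

```python
import numpy as np

def softplus(t):
    return np.log1p(np.exp(t))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=1000)    # training support is [-2, 2]
eta = rng.standard_normal(1000)

y_post = softplus(x) + eta           # post-ANM: Y = g(X) + eta
y_pre = softplus(x + eta)            # pre-ANM:  Y = g(X + eta)

# Post-additive responses only involve g evaluated inside [-2, 2].
# Pre-additive responses evaluate g at x + eta, well beyond the support:
inputs_seen = x + eta                # extends past both edges of [-2, 2]
```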
<p>The authors formalize this and show that under certain structural assumptions — like smoothness or monotonicity of <img src="https://latex.codecogs.com/png.latex?g">, and symmetric pre-additive noise — engression can recover aspects of <img src="https://latex.codecogs.com/png.latex?g"> beyond the training range. A key idea is that larger input noise gives you more indirect coverage of nearby regions, so the model can “see” a bit past the edge of the data. I won’t go into the technical details here, but the extrapolation results are interesting and worth checking out if you’re curious.</p>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>We walk through how simple engression is to implement and attempt to reproduce Figure 4 from the paper, which highlights its extrapolation capabilities using synthetic data.</p>
<p>We implement <img src="https://latex.codecogs.com/png.latex?g(x,%20%5Cvarepsilon)%20%5Cmapsto%20y"> as an MLP. In the paper, the authors feed the concatenated vector <img src="https://latex.codecogs.com/png.latex?%5Bx,%20%5Cvarepsilon%5D"> into a standard (deterministic) network. To keep things flexible and modular — and since <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> must be sampled independently of <img src="https://latex.codecogs.com/png.latex?x"> — we can instead:</p>
<div id="485baf7e" class="cell">
<details open="" class="code-fold">
<summary>Defining g(x, eps)</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> gConcatenate(nn.Module):</span>
<span id="cb1-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">""" g(x, eps) = g([x, eps]), where eps ~ N(0, 1) or Unif(0, 1) """</span></span>
<span id="cb1-3"></span>
<span id="cb1-4">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, model, noise_dim, noise_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'normal'</span>, scale <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>):</span>
<span id="cb1-5">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>(gConcatenate, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>).<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb1-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model</span>
<span id="cb1-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.noise_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> noise_dim</span>
<span id="cb1-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.noise_f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> noise_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'normal'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> torch.rand</span>
<span id="cb1-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.scale <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scale</span>
<span id="cb1-10"></span>
<span id="cb1-11">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb1-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x.shape) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x must have at least 2 dims, where batch is the first'</span></span>
<span id="cb1-13">        eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.noise_f(x.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.noise_dim).to(x.device)</span>
<span id="cb1-14">        eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.scale <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> eps</span>
<span id="cb1-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model(torch.cat([x, eps], dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb1-16">    </span>
<span id="cb1-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># example</span></span>
<span id="cb1-18">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MLP(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-19">g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gConcatenate(model, noise_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, noise_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'uniform'</span>)</span></code></pre></div></div>
</details>
</div>
<p>Of course, there are many different ways one could define <img src="https://latex.codecogs.com/png.latex?g"> beyond concatenating the noise with the input. I’m sure future work will explore them.</p>
<p>Now, define the loss function:</p>
<div id="24ad86f1" class="cell" data-execution_count="87">
<details open="" class="code-fold">
<summary>engression loss</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> engression_loss(y, preds, return_terms <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb2-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Args:</span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        y (torch.Tensor): True target values (batch_size, output_dim).</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        preds (torch.Tensor): Predicted target with independently sampled noise (batch_size, m_samples, output_dim).</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        return_terms (bool): If True, return the individual terms of the loss.</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb2-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> y.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> preds.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> y.shape[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> preds.shape[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb2-9">    b, d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y.shape</span>
<span id="cb2-10">    b, m, d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> preds.shape</span>
<span id="cb2-11"></span>
<span id="cb2-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Term 1: the absolute error between the predicted and true values</span></span>
<span id="cb2-13">    term1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.linalg.vector_norm(preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> y[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, :], <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">ord</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>).mean()</span>
<span id="cb2-14"></span>
<span id="cb2-15">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Term 2: pairwise absolute differences between the predicted values</span></span>
<span id="cb2-16">    term2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> preds.device, dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> preds.dtype)</span>
<span id="cb2-17"></span>
<span id="cb2-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb2-19">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># cdist is convenient. The result shape before mean is (b, m, m).</span></span>
<span id="cb2-20">        mean_pairwise_l2_dists <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cdist(preds, preds, p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>).mean()</span>
<span id="cb2-21">        term2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> mean_pairwise_l2_dists <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-22">        </span>
<span id="cb2-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (term1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> term2, term1, term2) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> return_terms <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> term1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> term2</span></code></pre></div></div>
</details>
</div>
<p>And a simple training loop:</p>
<div id="f6ca290b" class="cell" data-execution_count="88">
<details open="" class="code-fold">
<summary>Training loop</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train(g, dl, m, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, verbose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb3-2"></span>
<span id="cb3-3">    optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(g.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr)</span>
<span id="cb3-4"></span>
<span id="cb3-5">    losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb3-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(epochs):</span>
<span id="cb3-7">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x, y <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dl:</span>
<span id="cb3-8"></span>
<span id="cb3-9">            g.zero_grad()</span>
<span id="cb3-10"></span>
<span id="cb3-11">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate m samples from the model</span></span>
<span id="cb3-12">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shape: (batch_size, m, output_dim)</span></span>
<span id="cb3-13">            preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.stack([g(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(m)], dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb3-14"></span>
<span id="cb3-15">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> engression_loss(y, preds)</span>
<span id="cb3-16"></span>
<span id="cb3-17">            loss.backward()</span>
<span id="cb3-18">            optimizer.step()</span>
<span id="cb3-19"></span>
<span id="cb3-20">            losses.append(loss.item())</span>
<span id="cb3-21">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> verbose: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(loss.item())</span>
<span id="cb3-22">    </span>
<span id="cb3-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> losses</span></code></pre></div></div>
</details>
</div>
<p>Finally, we attempt to replicate Figure 4 from the paper using simulated data with pre-additive noise. For the <code>softplus</code> case, for example, we generate <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Ctext%7BUnif%7D%5B-2,%202%5D">, <img src="https://latex.codecogs.com/png.latex?%5Ceta%20%5Csim%20%5Cmathcal%20N(0,%201)"> and set <img src="https://latex.codecogs.com/png.latex?Y%20=%20g%5E%5Cast(X%20+%20%5Ceta)"> with <img src="https://latex.codecogs.com/png.latex?g%5E%5Cast(t)%20=%20%5Clog(1+%5Cexp(t))">. We then train simple MLPs using the engression loss and evaluate their ability to recover the true median, given by <img src="https://latex.codecogs.com/png.latex?g%5E%5Cast(x)">, especially outside the range of the training data.</p>
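<p>As a quick sanity check (not part of the post's code below): because <code>softplus</code> is strictly increasing, the median commutes with it, so the conditional median of <code>softplus(x + eta)</code> at a fixed point <code>x</code> is exactly <code>softplus(x)</code>. A few lines of simulation confirm this:</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.tensor(1.0)        # a fixed evaluation point
eta = torch.randn(100_000)   # eta ~ N(0, 1)
y = F.softplus(x + eta)      # pre-additive noise: Y = g*(x + eta)

# softplus is strictly increasing, so the median passes through it:
# median(g*(x + eta)) = g*(x + median(eta)) = g*(x)
emp_median = y.median().item()
true_median = F.softplus(x).item()
```

<p>This is the target the engression-trained sampler is graded against: the empirical median of its samples at each <code>x</code> should track <code>g*(x)</code>.</p>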
<div id="63f80c15" class="cell" data-execution_count="90">
<details class="code-fold">
<summary>Synthetic data</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">g_stars <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb4-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus'</span>: F.softplus,</span>
<span id="cb4-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'square'</span>: <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: (torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(x, torch.tensor(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb4-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cubic'</span>: <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: (x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb4-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>: <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: (x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> math.log(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (torch.log(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) </span>
<span id="cb4-6">}</span>
<span id="cb4-7"></span>
<span id="cb4-8">x_train_lims <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb4-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus'</span>: (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb4-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'square'</span>: (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb4-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cubic'</span>: (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb4-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>: (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb4-13">}</span>
<span id="cb4-14"></span>
<span id="cb4-15">x_test_lims <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb4-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus'</span>: (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>),</span>
<span id="cb4-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'square'</span>: (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>),</span>
<span id="cb4-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cubic'</span>: (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>),</span>
<span id="cb4-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>: (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb4-20">}</span>
<span id="cb4-21"></span>
<span id="cb4-22"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train_data(n, input_dim, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus'</span>, pre_additive_noise <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb4-23"></span>
<span id="cb4-24">    g_star <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> g_stars[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>]</span>
<span id="cb4-25"></span>
<span id="cb4-26">    a, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x_train_lims[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>]</span>
<span id="cb4-27">    X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.rand((n, input_dim)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> a) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> a <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Unif(a, b)</span></span>
<span id="cb4-28"></span>
<span id="cb4-29">    sd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cubic'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span></span>
<span id="cb4-30">    eta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn((n, input_dim)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sd</span>
<span id="cb4-31">    </span>
<span id="cb4-32">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> pre_additive_noise:</span>
<span id="cb4-33">        Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> g_star(X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eta)</span>
<span id="cb4-34">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-35">        Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> g_star(X) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> eta</span>
<span id="cb4-36"></span>
<span id="cb4-37">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> X, Y</span>
<span id="cb4-38"></span>
<span id="cb4-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># example</span></span>
<span id="cb4-40">X, Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_data(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus'</span>, pre_additive_noise <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
</details>
</div>
<div id="e01519ce" class="cell" data-execution_count="94">
<details class="code-fold">
<summary>Figure 4</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">input_dim, hidden_dim, output_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb5-2">noise_dim, m_train, m_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">512</span></span>
<span id="cb5-3"></span>
<span id="cb5-4">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1024</span></span>
<span id="cb5-5">epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb5-6"></span>
<span id="cb5-7">ds_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> name: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50_000</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> name <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus/square'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100_000</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 50k, 100k</span></span>
<span id="cb5-8">depth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> name: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> name <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'softplus/square'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb5-9">hidden_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb5-10">lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> name: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5e-2</span></span>
<span id="cb5-11"></span>
<span id="cb5-12">n_runs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb5-13">pre_additive_noise <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb5-14"></span>
<span id="cb5-15">saved_g_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb5-16">saved_l1_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb5-17"></span>
<span id="cb5-18">extra <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_post'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> pre_additive_noise <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span></span>
<span id="cb5-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># saved_g_preds = torch.load(f'logs/saved_g_preds{extra}.pt')</span></span>
<span id="cb5-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># saved_l1_preds = torch.load(f'logs/saved_l1_preds{extra}.pt')</span></span>
<span id="cb5-21"></span>
<span id="cb5-22"></span>
<span id="cb5-23">f, axs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>))</span>
<span id="cb5-24"></span>
<span id="cb5-25"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (name, g_star), ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(g_stars.items(), axs.flatten()):</span>
<span id="cb5-26"></span>
<span id="cb5-27">    t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.linspace(x_test_lims[name][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], x_test_lims[name][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>)[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb5-28"></span>
<span id="cb5-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Cache to fix plot, etc</span></span>
<span id="cb5-30">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> name <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> saved_g_preds:</span>
<span id="cb5-31">        g_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> saved_g_preds[name]</span>
<span id="cb5-32"></span>
<span id="cb5-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-34">        g_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb5-35">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> seed <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_runs):</span>
<span id="cb5-36"></span>
<span id="cb5-37">            set_seed(seed)</span>
<span id="cb5-38">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># model = MLP(input_dim + noise_dim, [hidden_dim] * depth(name), output_dim)</span></span>
<span id="cb5-39">            model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ResMLP(input_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> noise_dim, hidden_dim, depth(name), output_dim)</span>
<span id="cb5-40">            g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gConcatenate(model, noise_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> noise_dim, noise_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'normal'</span>)</span>
<span id="cb5-41"></span>
<span id="cb5-42">            X, Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_data(ds_size(name), input_dim, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> name, pre_additive_noise <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pre_additive_noise)</span>
<span id="cb5-43">            dl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.utils.data.DataLoader(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(X, Y)), batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> batch_size <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(Y), shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, drop_last <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-44"></span>
<span id="cb5-45">            losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(g, dl, m_train, epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> epochs, verbose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr(name))</span>
<span id="cb5-46"></span>
<span id="cb5-47">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Predict the median using g: sample m_pred per point and take the median</span></span>
<span id="cb5-48">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb5-49">                g.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb5-50">                g_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.quantile(</span>
<span id="cb5-51">                    torch.stack([g(t) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(m_pred)], dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb5-52">                )</span>
<span id="cb5-53">                g_preds.append(g_pred)</span>
<span id="cb5-54">        </span>
<span id="cb5-55">        g_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.stack(g_preds)</span>
<span id="cb5-56">        saved_g_preds[name] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> g_preds</span>
<span id="cb5-57"></span>
<span id="cb5-58">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L1 baseline</span></span>
<span id="cb5-59">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> name <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> saved_l1_preds:</span>
<span id="cb5-60">        l1_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> saved_l1_preds[name]</span>
<span id="cb5-61">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-62">        l1_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb5-63">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> seed <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_runs):</span>
<span id="cb5-64">            set_seed(seed)</span>
<span id="cb5-65">            model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ResMLP(input_dim, hidden_dim, depth(name), output_dim)</span>
<span id="cb5-66">            X, Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_data(ds_size(name), input_dim, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> name, pre_additive_noise <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-67">            dl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.utils.data.DataLoader(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(X, Y)), batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> batch_size <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(Y), shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, drop_last <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-68">            l1_losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_l1(model, dl, epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> epochs, verbose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr(name))</span>
<span id="cb5-69"></span>
<span id="cb5-70">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb5-71">                model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb5-72">                l1_preds.append(model(t))</span>
<span id="cb5-73"></span>
<span id="cb5-74">        l1_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.stack(l1_preds)</span>
<span id="cb5-75">        saved_l1_preds[name] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> l1_preds</span>
<span id="cb5-76"></span>
<span id="cb5-77"></span>
<span id="cb5-78">    X, Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_data(ds_size(name), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> name, pre_additive_noise <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-79">    ax.scatter(X[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>], Y[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>], color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gray'</span>,  alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb5-80">    ax.plot(t, g_star(t), label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'True g'</span>, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>)</span>
<span id="cb5-81"></span>
<span id="cb5-82">    ax.plot(t, l1_preds.mean(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tab:blue'</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'L1'</span>)</span>
<span id="cb5-83">    ax.fill_between(t.flatten(),</span>
<span id="cb5-84">        l1_preds.quantile(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.10</span>, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).flatten(),</span>
<span id="cb5-85">        l1_preds.quantile(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.90</span>, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).flatten(),</span>
<span id="cb5-86">        alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tab:blue'</span></span>
<span id="cb5-87">    )</span>
<span id="cb5-88"></span>
<span id="cb5-89">    ax.plot(t, g_preds.mean(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tab:orange'</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'engression'</span>)</span>
<span id="cb5-90">    ax.fill_between(t.flatten(),</span>
<span id="cb5-91">        g_preds.quantile(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.10</span>, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).flatten(),</span>
<span id="cb5-92">        g_preds.quantile(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.90</span>, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).flatten(),</span>
<span id="cb5-93">        alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tab:orange'</span></span>
<span id="cb5-94">    )</span>
<span id="cb5-95">    ax.set_title(name)</span>
<span id="cb5-96"></span>
<span id="cb5-97"></span>
<span id="cb5-98">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].legend()</span>
<span id="cb5-99">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_ylim((<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb5-100">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_ylim((<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80</span>))</span>
<span id="cb5-101"></span>
<span id="cb5-102">f.tight_layout()</span>
<span id="cb5-103">f.savefig(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'images/figure4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>extra<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.png'</span>, dpi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, bbox_inches <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tight'</span>)</span>
<span id="cb5-104"></span>
<span id="cb5-105"></span>
<span id="cb5-106">torch.save(saved_g_preds, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'logs/saved_g_preds</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>extra<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.pt'</span>)</span>
<span id="cb5-107">torch.save(saved_l1_preds, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'logs/saved_l1_preds</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>extra<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.pt'</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/index_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></p>
<figcaption>Comparing engression and L1 regression’s extrapolation performance on synthetic data. Lines are predicted and true (red) conditional medians. Bands are 10-90 percentiles over 20 runs.</figcaption>
</figure>
</div>
</div>
</div>
<p>We find, as the authors do, that while both methods perform similarly in-domain, engression extrapolates much better than L1 regression, at least for monotone functions and data generated with pre-additive noise.</p>
<div class="callout callout-style-default callout-note no-icon callout-titled" title="What if the noise is not pre-additive?">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>What if the noise is not pre-additive?
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>If we generate the synthetic data with post-additive noise instead, we see that engression loses its extrapolation capability — in accordance with the theory presented in the paper.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/images/figure4_post.png" class="img-fluid figure-img"></p>
<figcaption>Figure 4 repeated with data generated with post-additive noise instead.</figcaption>
</figure>
</div>
<p>A natural question, then, is how to diagnose whether the noise in a real dataset is pre- or post-additive. The paper doesn’t address this directly, but it’s an important practical question.</p>
</div>
</div>
</div>
<p>While reproducing Figure 4, we also noted a few practical details that matter more than expected. Despite the simplicity of the functions, the authors used between 50k and 100k samples, depending on the function, and relatively large networks. In our experiments, the cubic and logarithmic scenarios struggled with extrapolation until we added residual connections. Also, the noise dimension was set to 100, which seems surprisingly high but turned out to be important.</p>
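<p>For concreteness, here is one possible <code>ResMLP</code> along the lines we used (a sketch; the exact block structure is our assumption): a linear input projection followed by residual fully-connected blocks.</p>

```python
import torch
import torch.nn as nn

class ResMLP(nn.Module):
    # Sketch of a residual MLP: input projection, residual FC blocks,
    # output projection. The exact block structure is our assumption.
    def __init__(self, in_dim, hidden_dim, n_blocks, out_dim):
        super().__init__()
        self.inp = nn.Linear(in_dim, hidden_dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
            for _ in range(n_blocks)
        ])
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.inp(x)
        for block in self.blocks:
            h = h + block(h)  # residual connection
        return self.out(h)

model = ResMLP(in_dim=3, hidden_dim=16, n_blocks=2, out_dim=1)
out = model(torch.randn(4, 3))
```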
</section>
<section id="hyperparameters" class="level2">
<h2 class="anchored" data-anchor-id="hyperparameters">Hyperparameters</h2>
<p>Engression introduces a couple of hyperparameters related to the noise vectors (<img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon">) sampled during training: namely, <img src="https://latex.codecogs.com/png.latex?m">, the number of noise samples drawn per example, and the distribution from which <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> is drawn. For example, in their synthetic data experiments, the authors set <img src="https://latex.codecogs.com/png.latex?m%20=%202"> and sample <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%20%5Csim%20%5Ctext%7BUnif%7D%5B0,%201%5D%5E%7B100%7D">.</p>
<p>The parameter <img src="https://latex.codecogs.com/png.latex?m"> controls a compute-variance trade-off during training. Increasing <img src="https://latex.codecogs.com/png.latex?m"> means we obtain a cleaner (lower variance) Monte Carlo estimate of the population engression loss and hence gradients. However, the cost of computing the loss grows linearly in <img src="https://latex.codecogs.com/png.latex?m"> for the first term but quadratically for the second, since we compute pairwise distances.</p>
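<p>To make the two terms concrete, here is a minimal sketch of an energy-score engression loss (our own implementation, assuming <code>preds</code> stacks the <img src="https://latex.codecogs.com/png.latex?m"> model samples along dimension 1): the first term touches each sample once, while the second compares all pairs.</p>

```python
import torch

def engression_loss(y, preds, return_terms=False):
    # y: [batch, dim]; preds: [batch, m, dim] -- m model samples per example.
    # Loss = E||g(x, eps) - y|| - (1/2) E||g(x, eps) - g(x, eps')||.
    m = preds.shape[1]
    # First term: O(m) distances from each sample to the observed target.
    term1 = (preds - y[:, None, :]).norm(dim=-1).mean()
    # Second term: O(m^2) pairwise distances between samples (the diagonal
    # is zero, so dividing by m * (m - 1) averages the off-diagonal pairs).
    diffs = preds[:, :, None, :] - preds[:, None, :, :]
    term2 = diffs.norm(dim=-1).sum(dim=(1, 2)).div(m * (m - 1)).mean() / 2
    loss = term1 - term2
    return (loss, term1, term2) if return_terms else loss

# Degenerate check: identical samples make the pairwise term vanish.
loss, t1, t2 = engression_loss(
    torch.zeros(4, 1), torch.ones(4, 3, 1), return_terms=True
)
```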
<p>That said, since these are Monte Carlo estimates, we expect diminishing returns as <img src="https://latex.codecogs.com/png.latex?m"> increases. Another mitigating factor is that we can offset the quadratic cost by training for more epochs: the model then revisits each example with fresh noise vectors, effectively averaging over more noisy estimates of the loss.</p>
<p>To test this, we vary <img src="https://latex.codecogs.com/png.latex?m"> on a toy example. We observe that, given enough data and epochs, all runs eventually reach similar loss values. However, models with higher <img src="https://latex.codecogs.com/png.latex?m"> converge faster and more stably. Perhaps not surprisingly, the conditional median tends to be learned first, with the tails filling in later.</p>
<div id="b2df4995" class="cell">
<details class="code-fold">
<summary>Vary <img src="https://latex.codecogs.com/png.latex?m"> on toy data</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5_000</span></span>
<span id="cb6-2">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.rand(n) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>               <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># X ~ Unif(-1, 1)</span></span>
<span id="cb6-3">Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (torch.randn_like(X) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Y|X ~ N(X, 1)</span></span>
<span id="cb6-4">X, Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>], Y[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb6-5"></span>
<span id="cb6-6">t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>)[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb6-7"></span>
<span id="cb6-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># True Y|X quantiles if we assume Y|X ~ N(X, 1)</span></span>
<span id="cb6-9">qs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>]</span>
<span id="cb6-10">true_quantiles <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {q: t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> norm.ppf(q) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> q <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> qs}</span>
<span id="cb6-11"></span>
<span id="cb6-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Estimated Y|X quantiles</span></span>
<span id="cb6-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># [m_train, quantile level] = list of [est quantile], one per batch</span></span>
<span id="cb6-14">quantiles <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb6-15"></span>
<span id="cb6-16">input_dim, hidden_dim, output_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb6-17">noise_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb6-18">mini_batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span></span>
<span id="cb6-19">m_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">512</span></span>
<span id="cb6-20"></span>
<span id="cb6-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># noise_scale = 1</span></span>
<span id="cb6-22"></span>
<span id="cb6-23">stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> defaultdict(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>)</span>
<span id="cb6-24"></span>
<span id="cb6-25">m_trains <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>]</span>
<span id="cb6-26">noise_scales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]</span>
<span id="cb6-27"></span>
<span id="cb6-28"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> m_train, noise_scale <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> product(m_trains, noise_scales):</span>
<span id="cb6-29"></span>
<span id="cb6-30">    set_seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">69</span>)</span>
<span id="cb6-31">    dl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.utils.data.DataLoader(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(X, Y)), batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mini_batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb6-32">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ResMLP(in_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> input_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> noise_dim, hidden_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hidden_dim, n_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, out_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> output_dim)</span>
<span id="cb6-33">    g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gConcatenate(model, noise_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> noise_dim, scale <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> noise_scale)</span>
<span id="cb6-34"></span>
<span id="cb6-35">    optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(g.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-4</span>)</span>
<span id="cb6-36"></span>
<span id="cb6-37"></span>
<span id="cb6-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>):</span>
<span id="cb6-39">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x, y <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dl:</span>
<span id="cb6-40"></span>
<span id="cb6-41">            g.zero_grad()</span>
<span id="cb6-42"></span>
<span id="cb6-43">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate m samples from the model</span></span>
<span id="cb6-44">            preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.stack([g(x) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(m_train)], dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-45">            loss, loss_1, loss_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> engression_loss(y, preds, return_terms <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-46">            loss.backward()</span>
<span id="cb6-47">            optimizer.step()</span>
<span id="cb6-48"></span>
<span id="cb6-49">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb6-50"></span>
<span id="cb6-51">                q_losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb6-52">                preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.stack([g(t) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(m_pred)], dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-53">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> q <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> qs:</span>
<span id="cb6-54">                    pred_quantile <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.quantile(preds, q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> q, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-55">                    quantiles[m_train, noise_scale, q] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> quantiles.get((m_train, noise_scale, q), []) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [pred_quantile]</span>
<span id="cb6-56">                    </span>
<span id="cb6-57">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> q <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> true_quantiles:</span>
<span id="cb6-58">                        stats[m_train, noise_scale, q].append(</span>
<span id="cb6-59">                            (true_quantiles[q] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pred_quantile.mean(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)).flatten().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>().mean().item()</span>
<span id="cb6-60">                        )</span>
<span id="cb6-61"></span>
<span id="cb6-62">                stats[m_train, noise_scale, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'loss'</span>].append(loss.item())</span>
<span id="cb6-63">                stats[m_train, noise_scale, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'loss_1'</span>].append(loss_1.item())</span>
<span id="cb6-64">                stats[m_train, noise_scale, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'loss_2'</span>].append(loss_2.item())</span>
<span id="cb6-65"></span>
<span id="cb6-66">torch.save((quantiles, stats), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/m_noise_scale.pt'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="fd9269f9" class="cell" data-execution_count="9">
<div class="cell-output cell-output-display">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/index_files/figure-html/cell-14-output-1.png" class="img-fluid figure-img"></p>
<figcaption>Effects of varying <img src="https://latex.codecogs.com/png.latex?m"> on a simple example. Increasing <img src="https://latex.codecogs.com/png.latex?m"> yields more stable losses and faster reduction in the error estimating <img src="https://latex.codecogs.com/png.latex?Y%7CX"> quantiles.</figcaption>
</figure>
</div>
</div>
</div>
<p>While the only formal requirement on the noise distribution is that <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> be independent of <img src="https://latex.codecogs.com/png.latex?x">, its specific choice can have important practical implications.</p>
<p>For instance, in our imperfect toy setup <img src="https://latex.codecogs.com/png.latex?%5Csigma_%5Cvarepsilon"> seems to act as a smoothing or locality parameter: large values encourage distant <img src="https://latex.codecogs.com/png.latex?x">’s to map to the same <img src="https://latex.codecogs.com/png.latex?y">, producing smoother estimates. Imagining a one-dimensional <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> with <img src="https://latex.codecogs.com/png.latex?g(x,%20%5Cvarepsilon)%20=%20g(x%20+%20%5Cvarepsilon)"> should help build this intuition.</p>
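<p>A tiny numeric illustration of this intuition (ours, not from the paper): averaging a step function over additive gaussian noise shows how a larger <img src="https://latex.codecogs.com/png.latex?%5Csigma_%5Cvarepsilon"> acts like a smoothing bandwidth.</p>

```python
import torch

torch.manual_seed(0)

def smoothed(f, x, sigma, m=100_000):
    # Monte Carlo estimate of E[f(x + eps)] with eps ~ N(0, sigma^2).
    eps = torch.randn(m) * sigma
    return f(x + eps).mean().item()

step = lambda x: (x > 0).float()

# Just right of the discontinuity at 0: small sigma keeps the step sharp,
# large sigma blurs targets from both sides of the jump together.
sharp = smoothed(step, torch.tensor(0.1), sigma=0.01)   # near 1
blurry = smoothed(step, torch.tensor(0.1), sigma=1.0)   # near 0.54
```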
<div id="917c82b5" class="cell">
<div class="cell-output cell-output-display">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/index_files/figure-html/cell-15-output-1.png" class="img-fluid figure-img"></p>
<figcaption>Conditional medians learned after training with gaussian noise with different <img src="https://latex.codecogs.com/png.latex?%5Csigma_%5Cvarepsilon">, with <img src="https://latex.codecogs.com/png.latex?m=2">.</figcaption>
</figure>
</div>
</div>
</div>
<p>So, while a sufficiently expressive model should in principle learn to correctly map any i.i.d. noise, in practice the choice of noise distribution may affect convergence speed and the quality of the learned conditional distribution.</p>
</section>
<section id="real-data" class="level2">
<h2 class="anchored" data-anchor-id="real-data">Real data</h2>
<p>So far, we’ve focused on toy problems and synthetic experiments, but the authors also evaluate the method on real datasets in Section 4. They benchmark both univariate and multivariate tasks — including point prediction, interval estimation, and full distributional modeling — and compare it against standard approaches like L2 and L1 regression, as well as <a href="https://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf">quantile regression forests</a>. Engression consistently outperforms these baselines, especially when extrapolating.</p>
<p>In their experiments, the authors use an MLP trained directly on the tabular inputs. But it’s <a href="https://arxiv.org/abs/2207.08815">well known</a> that deep nets are not particularly strong on tabular data (although this might be <a href="https://www.nature.com/articles/s41586-024-08328-6">changing</a>). That said, there’s nothing inherent in engression that requires us to use a neural network end-to-end. We could just as easily treat the engression-trained MLP as a modular head and stack it on top of any strong tabular model.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/engression/images/stacking_engression.svg" class="img-fluid figure-img"></p>
<figcaption>We can stack engression on top of any base model</figcaption>
</figure>
</div>
<p>I tried <a href="_temperature.ipynb">this</a> idea on a probabilistic forecasting <a href="https://www.kaggle.com/competitions/probabilistic-forecasting-i-temperature/overview">competition</a>, where the goal was to predict a set of conditional quantiles (0.05 through 0.95). The winning submission had used <a href="https://catboost.ai/">catboost</a> with a multi-quantile loss, followed by <a href="https://en.wikipedia.org/wiki/Conformal_prediction">conformal prediction</a>. Keeping the base model and replacing the conformalization step with a small engression-trained MLP improved on the winning model’s <a href="https://en.wikipedia.org/wiki/Scoring_rule#Conditional_continuous_ranked_probability_score">CRPS score</a> by about 0.04 on the private test set<sup>1</sup> – without tuning any of the engression hyperparameters. I thought this was impressive, as conformalization is the go-to calibration method right now. Of course, if you wanted finite-sample guarantees, you could still conformalize on top.</p>
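<p>A minimal sketch of what this stacking can look like (all names and sizes are illustrative, not my actual submission): the base model’s point prediction is concatenated with noise, fed through a small MLP trained with the engression loss, and the conditional quantiles are then read off empirically from samples.</p>

```python
import torch
import torch.nn as nn

class EngressionHead(nn.Module):
    """Stochastic head stacked on a frozen base model's predictions."""
    def __init__(self, in_dim=1, noise_dim=8, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def sample(self, base_pred, m=100):
        # Draw m samples of Y conditioned on the base model's output
        x = base_pred.unsqueeze(0).expand(m, -1, -1)
        eps = torch.randn(m, base_pred.shape[0], self.noise_dim)
        return self.net(torch.cat([x, eps], dim=-1)).squeeze(-1)  # (m, batch)

head = EngressionHead()
base_pred = torch.randn(16, 1)  # stand-in for the base model's point predictions
samples = head.sample(base_pred, m=200)
quantiles = torch.quantile(samples, torch.tensor([0.05, 0.5, 0.95]), dim=0)
```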
<p>The broader point is that the pre-additive noise assumption underlying engression appears to hold up on real-world datasets — and that stacking engression as a flexible head could make it a practical upgrade to existing tabular pipelines.</p>
</section>
<section id="final-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="final-thoughts">Final thoughts</h2>
<p>Engression’s main strength lies in its simplicity and flexibility. It requires no parametric assumptions on the output distribution, no likelihood computations, no adversarial training, and no architectural constraints like invertibility. It scales naturally to multivariate <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y">, and at test time, sampling is fast and easily parallelizable. These properties make it particularly appealing for tasks like forecasting, simulation, or structured prediction, where calibrated uncertainty is important but explicit density evaluation is not.</p>
<p>That said, engression comes with tradeoffs. Because it models distributions implicitly — without yielding closed-form densities — it’s less suitable for inference tasks that rely on likelihoods. Its one-shot sampling may also struggle with complex, multimodal distributions, as the energy score tends to cover modes rather than <a href="https://sander.ai/2020/03/24/audio-generation.html#mode-covering-vs-mode-seeking-behaviour">seek them</a>. In such cases, methods like diffusion models or normalizing flows might offer better performance, albeit at higher computational and implementation cost.</p>
<p>Encouragingly, the authors have begun to explore extensions. A recent <a href="https://arxiv.org/abs/2502.13747v1">paper</a> proposes a multi-step version of engression that improves performance on challenging tasks. Another work introduces <a href="https://arxiv.org/abs/2404.13649">distributional autoencoders</a>, combining engression with dimensionality reduction.</p>
<p>Overall, engression is a clever and lightweight approach to distributional regression. I’m excited to see where future research and applications take it. The <a href="https://arxiv.org/abs/2307.00835">paper</a> is very readable, and the authors have released their code <a href="https://github.com/xwshen51/engression/">here</a> if you want to play with it. I’m also learning about python packaging — here’s a <a href="https://github.com/emiliocantuc/engression-pytorch">small one</a> with the loss and a few wrappers for convenience.</p>
<hr>
<p>Thanks for reading! If you spot any errors, or have comments or suggestions, feel free to reach out.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>For context, the difference between first and second place was 0.06.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>paper</category>
  <guid>https://ecntu.com/posts/engression/</guid>
  <pubDate>Tue, 06 May 2025 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Distilling the Knowledge in a Neural Network</title>
  <link>https://ecntu.com/posts/distilling-knowledge/</link>
  <description><![CDATA[ 





<section id="idea" class="level1">
<h1>Idea</h1>
<p>This classic paper introduced <em>distillation</em> as a way of transferring knowledge from a big teacher network into a small one. The core observation is that we should use the big model’s output distribution as soft labels to train the small model.</p>
<p>Remember that in classification we measure the <a href="https://en.wikipedia.org/wiki/Cross-entropy">cross-entropy</a> loss, given the predicted <img src="https://latex.codecogs.com/png.latex?%5Chat%20y_c"> and correct <img src="https://latex.codecogs.com/png.latex?y_c"> class probabilities of an example, with:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL(%5Chat%20y,y)%20=%20-%5Csum_c%20y_c%20%5Clog%20%5Chat%20y_c%0A"></p>
<p>To use soft labels we just set <img src="https://latex.codecogs.com/png.latex?y%20=%20f_%7B%5Ctext%7Bbig%7D%7D(x)">.</p>
<p>These soft labels provide a much richer training signal for the smaller model, especially when the larger model distributes its probability mass across multiple classes (i.e.&nbsp;when the labels have high entropy). To force this high entropy, the authors propose increasing the temperature <img src="https://latex.codecogs.com/png.latex?T"> of the softmax layer in the larger model to produce the soft labels. The small model trains with this same temperature but then sets it to 1 during testing.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/distilling-knowledge/images/demo.png" class="img-fluid figure-img"></p>
<figcaption>Increasing the temperature of the big model produces softer and more informative labels.</figcaption>
</figure>
</div>
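<p>A quick way to see this numerically (with made-up logits; the exact values are just for illustration) is to apply the softmax at different temperatures:</p>

```python
import torch
import torch.nn.functional as F

def soften(logits, T):
    # Higher T flattens the distribution, raising its entropy
    return F.softmax(logits / T, dim=-1)

logits = torch.tensor([5.0, 2.0, 1.0])
hard = soften(logits, T=1.0)  # sharply peaked on the first class
soft = soften(logits, T=4.0)  # mass spread over the other classes too
```

The softened output exposes the relative probabilities the teacher assigns to the wrong classes — information a one-hot label throws away.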
<p>They also had better results by adding a small term to the loss function with the regular hard-labeled cross-entropy. The reasoning is that the model may not have enough capacity to learn the soft targets, so <em>“erring in the direction of the correct answer turns out to be helpful”</em>. If we write the output of a model with temperature <img src="https://latex.codecogs.com/png.latex?T"> as <img src="https://latex.codecogs.com/png.latex?f(x;%20T)">, then the complete loss is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL_%7B%5Ctext%7Bdistill%7D%7D(x,y)%20=%20a%20T%5E2%20%5Ccdot%20L%5Cleft%5B%20f_%7B%5Ctext%7Bsmall%7D%7D(x;%20T),%20f_%7B%5Ctext%7Bbig%7D%7D(x;%20T)%20%5Cright%5D%20+%20(1-a)%20%5Ccdot%20L%20%5Cleft%20%5B%20f_%7B%5Ctext%7Bsmall%7D%7D(x;%201),%20y%20%5Cright%20%5D%0A"></p>
<p>The first term is scaled by <img src="https://latex.codecogs.com/png.latex?T%5E2"> because the magnitudes of the gradients scale as <img src="https://latex.codecogs.com/png.latex?T%5E%7B-2%7D"> and we want to control the contribution of each term by changing only <img src="https://latex.codecogs.com/png.latex?a">.</p>
<div class="callout callout-style-default callout-note no-icon callout-titled" title="Why do the gradient magnitudes scale as $T^{-2}$?">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Why do the gradient magnitudes scale as <img src="https://latex.codecogs.com/png.latex?T%5E%7B-2%7D">?
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Let <img src="https://latex.codecogs.com/png.latex?z"> be the logits; then the <img src="https://latex.codecogs.com/png.latex?i">th output entry of the softmax layer with temperature <img src="https://latex.codecogs.com/png.latex?T"> is <img src="https://latex.codecogs.com/png.latex?%0A%5Csigma_T(z)_i%20=%20%5Cfrac%7Be%5E%7Bz_i/T%7D%7D%7B%5Csum_j%20e%5E%7Bz_j/T%7D%7D%20=%20%5Chat%20y_i%0A"> Plugging into the loss (and using <img src="https://latex.codecogs.com/png.latex?%5Csum_i%20y_i%20=%201">) <img src="https://latex.codecogs.com/png.latex?%0AL(%5Chat%20y,%20y)%20=%20-%5Csum_i%20y_i%20%5Clog%20%5Cleft(%5Cfrac%7Be%5E%7Bz_i/T%7D%7D%7B%5Csum_j%20e%5E%7Bz_j/T%7D%7D%20%5Cright)%20=%20-%5Cfrac%7B1%7D%7BT%7D%5Csum_i%20y_i%20z_i%20+%20%5Clog%20%5Cleft(%20%5Csum_j%20e%5E%7Bz_j/T%7D%20%5Cright)%0A"> and differentiating w.r.t. <img src="https://latex.codecogs.com/png.latex?z_i"> (don’t forget the chain rule), we get <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z_i%7D%20=%20-%5Cfrac%7B1%7D%7BT%7D%20y_i%20+%20%5Cfrac%7B1%7D%7B%20%5Csum_j%20e%5E%7Bz_j/T%7D%7D%20%5Ctimes%20e%5E%7Bz_i/T%7D%20(1/T)%20=%20%5Cfrac%7B1%7D%7BT%7D(%5Csigma_T(z)_i%20-%20y_i)%0A"> This accounts for one factor of <img src="https://latex.codecogs.com/png.latex?1/T">. The soft targets <img src="https://latex.codecogs.com/png.latex?y"> are themselves produced at temperature <img src="https://latex.codecogs.com/png.latex?T">, so for large <img src="https://latex.codecogs.com/png.latex?T"> the differences <img src="https://latex.codecogs.com/png.latex?%5Csigma_T(z)_i%20-%20y_i"> also shrink like <img src="https://latex.codecogs.com/png.latex?1/T">, and the gradient magnitudes scale as <img src="https://latex.codecogs.com/png.latex?T%5E%7B-2%7D"> overall.</p>
</div>
</div>
</div>
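<p>The per-coordinate gradient formula is easy to sanity-check with autograd; here is a small sketch with toy logits of my own choosing:</p>

```python
import torch
import torch.nn.functional as F

T = 4.0
z = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)    # student logits
y = F.softmax(torch.tensor([1.5, 0.8, -0.5]) / T, dim=0)  # teacher soft targets

# Cross-entropy at temperature T
loss = -(y * F.log_softmax(z / T, dim=0)).sum()
loss.backward()

# Analytic gradient: (1/T) * (sigma_T(z) - y)
analytic = (F.softmax(z.detach() / T, dim=0) - y) / T
assert torch.allclose(z.grad, analytic, atol=1e-6)
```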
</section>
<section id="mnist" class="level1">
<h1>MNIST</h1>
<p>We try out distillation on the small-scale MNIST experiment that the authors describe. They use a ReLU network with two hidden layers, regularized with dropout, a jitter image augmentation, and a max-norm constraint on the weights.</p>
<div id="cell-5" class="cell" data-execution_count="51">
<details class="code-fold">
<summary>Model definition</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Model(nn.Module):</span>
<span id="cb1-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'''</span></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Used in MNIST experiments.</span></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    A two-layer linear ReLU network with dropout and max norm regularization.</span></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    '''</span></span>
<span id="cb1-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, hidden_size, max_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, drop_rate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>):</span>
<span id="cb1-7">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>(Model, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>).<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb1-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.max_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> max_norm</span>
<span id="cb1-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Sequential(</span>
<span id="cb1-10">            nn.Flatten(),</span>
<span id="cb1-11">            nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, hidden_size),</span>
<span id="cb1-12">            nn.ReLU(),</span>
<span id="cb1-13">            nn.Dropout(drop_rate),</span>
<span id="cb1-14">            nn.Linear(hidden_size, hidden_size),</span>
<span id="cb1-15">            nn.ReLU(),</span>
<span id="cb1-16">            nn.Dropout(drop_rate),</span>
<span id="cb1-17">            nn.Linear(hidden_size, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb1-18">        )</span>
<span id="cb1-19"></span>
<span id="cb1-20">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb1-21">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Clip the weights to the maximum allowed norm</span></span>
<span id="cb1-22">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.max_norm <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb1-23">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb1-24">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> layer <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modules():</span>
<span id="cb1-25">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(layer, nn.Linear):</span>
<span id="cb1-26">                        norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> layer.weight.data.norm(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, keepdim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-27">                        desired <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.clamp(norm, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.max_norm)</span>
<span id="cb1-28">                        layer.weight.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*=</span> (desired <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> norm)</span>
<span id="cb1-29">                    </span>
<span id="cb1-30">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layers(x)</span></code></pre></div></div>
</details>
</div>
<p>We define the distillation loss:</p>
<div id="cell-8" class="cell" data-execution_count="329">
<details open="" class="code-fold">
<summary>Define training losses</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Regular cross-entropy loss</span></span>
<span id="cb2-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> hard_loss(outputs, labels, criterion, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args):</span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> criterion(outputs, labels)</span>
<span id="cb2-4"></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Distillation loss</span></span>
<span id="cb2-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> soft_loss(outputs, labels, criterion, examples, big_model, T, a):</span>
<span id="cb2-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb2-8">        big_model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb2-9">        soft_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.softmax(big_model(examples) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> T, dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-10"></span>
<span id="cb2-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> criterion(outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> T, soft_labels) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> a) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> criterion(outputs, labels)</span></code></pre></div></div>
</details>
</div>
<p>The hidden dimensions of the big and small networks are 1200 and 800 respectively. To train the networks we use a validation set for early stopping, and choose <img src="https://latex.codecogs.com/png.latex?T%20=%204.0"> and <img src="https://latex.codecogs.com/png.latex?a%20=%200.5"> (since the authors don’t report the values they used).</p>
<div id="cell-11" class="cell" data-execution_count="291">
<details class="code-fold">
<summary>Train the big model</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Hyperparameters</span></span>
<span id="cb3-2">num_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb3-3">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span></span>
<span id="cb3-4">lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span></span>
<span id="cb3-5">patience <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span></span>
<span id="cb3-6">big_model_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1200</span></span>
<span id="cb3-7">small_model_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">800</span></span>
<span id="cb3-8"></span>
<span id="cb3-9">loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> ds, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>: DataLoader(ds, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shuffle)</span>
<span id="cb3-10">val_loader   <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loader(val_dataset)</span>
<span id="cb3-11">test_loader  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loader(test_dataset)</span>
<span id="cb3-12"></span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train the big model</span></span>
<span id="cb3-14">train_dataset.dataset.transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> aug_transform</span>
<span id="cb3-15">train_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loader(train_dataset, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb3-16"></span>
<span id="cb3-17">big_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Model(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1200</span>).to(device)</span>
<span id="cb3-18">big_train_history <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(big_model, hard_loss, train_loader, val_loader, num_epochs, lr, patience)</span>
<span id="cb3-19">test_loss, test_accuracy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> evaluate_model(big_model, test_loader)</span>
<span id="cb3-20"></span>
<span id="cb3-21">save_results(big_model, big_train_history, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'big_model'</span>)</span>
<span id="cb3-22">test_loss, test_accuracy</span></code></pre></div></div>
</details>
</div>
<div id="cell-12" class="cell" data-execution_count="434">
<details class="code-fold">
<summary>Train the smaller model on hard labels</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train_small_model(train_dataset, val_dataset, seed, loss, model_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> small_model_size):</span>
<span id="cb4-2">    set_seed(seed)</span>
<span id="cb4-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># no augmentation</span></span>
<span id="cb4-4">    train_dataset.dataset.transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reg_transform</span>
<span id="cb4-5">    train_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loader(train_dataset, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb4-6">    val_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loader(val_dataset)</span>
<span id="cb4-7"></span>
<span id="cb4-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or regularization</span></span>
<span id="cb4-9">    small_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Model(model_size, max_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, drop_rate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>).to(device)</span>
<span id="cb4-10">    small_train_history <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(small_model, loss, train_loader, val_loader, num_epochs, lr, patience)</span>
<span id="cb4-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> small_model, small_train_history</span>
<span id="cb4-12"></span>
<span id="cb4-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># small_model, small_train_history = train_small_model(train_dataset, val_dataset, seed = 42, loss = hard_loss)</span></span>
<span id="cb4-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># save_results(small_model, small_train_history, 'small_model')</span></span>
<span id="cb4-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># evaluate_model(small_model, test_loader)</span></span></code></pre></div></div>
</details>
</div>
<div id="cell-13" class="cell">
<details class="code-fold">
<summary>Train the distilled model</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">temperature, a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span></span>
<span id="cb5-2"></span>
<span id="cb5-3">distilled_model, distilled_train_history <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_small_model(</span>
<span id="cb5-4">    train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_dataset, val_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_dataset, seed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>,</span>
<span id="cb5-5">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> functools.partial(soft_loss, big_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> big_model, T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> temperature, a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> a)</span>
<span id="cb5-6">)</span>
<span id="cb5-7">save_results(distilled_model, distilled_train_history, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'distilled_model'</span>)</span>
<span id="cb5-8">evaluate_model(distilled_model, test_loader)</span></code></pre></div></div>
</details>
</div>
<p>We get the following test accuracies:</p>
<div id="cell-16" class="cell" data-execution_count="379">
<div class="cell-output cell-output-stdout">
<pre><code>big: 0.9901, small: 0.9833, distilled: 0.9891</code></pre>
</div>
</div>
</section>
<section id="mystical-3" class="level1">
<h1>Mythical 3</h1>
<p>The authors then remove all 3s from the transfer set the distilled model is trained on, to test its generalization to unseen classes: <em>“So from the perspective of the distilled model, 3 is a mythical digit that it has never seen.”</em> When we evaluate on the test set, which still contains 3s, the distilled model performs much better than a small model trained with hard labels:</p>
<div id="cell-18" class="cell">
<details class="code-fold">
<summary>Train without 3s in transfer set</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Remove all 3s from the dataset</span></span>
<span id="cb7-2">train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datasets.MNIST(root <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DATA_DIR, train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  download <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> aug_transform)</span>
<span id="cb7-3">train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.utils.data.Subset(train_dataset, np.where(train_dataset.targets <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb7-4"></span>
<span id="cb7-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Split training data into train and validation sets</span></span>
<span id="cb7-6">train_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(train_dataset))</span>
<span id="cb7-7">val_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(train_dataset) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> train_size</span>
<span id="cb7-8">train_dataset, val_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.utils.data.random_split(</span>
<span id="cb7-9">    train_dataset, [train_size, val_size]</span>
<span id="cb7-10">)</span>
<span id="cb7-11"></span>
<span id="cb7-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train small without distillation</span></span>
<span id="cb7-13">small_no_3, small_no_3_history <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_small_model(</span>
<span id="cb7-14">    train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_dataset, val_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_dataset,</span>
<span id="cb7-15">    seed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hard_loss, model_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">800</span></span>
<span id="cb7-16">)</span>
<span id="cb7-17"></span>
<span id="cb7-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train small with distillation</span></span>
<span id="cb7-19">distilled_no_3, distilled_no_3_history <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_small_model(</span>
<span id="cb7-20">    train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_dataset, val_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_dataset,</span>
<span id="cb7-21">    seed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> functools.partial(soft_loss, big_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> big_model, T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.0</span>, a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>),</span>
<span id="cb7-22">    model_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">800</span></span>
<span id="cb7-23">)</span></code></pre></div></div>
</details>
</div>
<div id="cell-19" class="cell">
<div class="cell-output cell-output-stdout">
<pre><code>Not distilled: 0.8882, distilled: 0.9869</code></pre>
</div>
</div>
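<p>Why does removing 3s barely hurt the distilled model? The teacher’s softened outputs leak probability mass onto classes that never appear in the transfer set. Here is a quick sketch of that “dark knowledge” effect, with made-up logits for illustration (not taken from the trained models):</p>

```python
import torch

# Hypothetical teacher logits for an image of an 8; class 8 dominates,
# but class 3 (visually similar) gets a noticeably larger logit than the rest
logits = torch.tensor([0.5, 0.2, 1.0, 3.0, 0.1, 0.4, 0.3, 0.2, 6.0, 0.6])

hard = torch.softmax(logits, dim=0)        # T = 1: almost all mass on class 8
soft = torch.softmax(logits / 4.0, dim=0)  # T = 4: mass spreads to similar classes

# The softened target tells the student that 8s look a bit like 3s,
# even though the student never sees a single image of a 3
assert soft[3] > hard[3]
```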
<p>In the paper, the authors take this to the extreme and show that a distilled model trained only on 7s and 8s still achieves impressive performance. They also run experiments on a much larger speech recognition dataset, and discuss training specialist models on a computer vision dataset with distillation from a generalist model acting as a regularizer.</p>
</section>
<section id="final-thoughts" class="level1">
<h1>Final thoughts</h1>
<p>It was very fun to return to this classic paper. It introduced a simple yet powerful idea that is still widely used today. Like most papers of its era (circa 2015), it is very clear and readable. And, as is a Hinton staple, it is slightly bio-inspired, in this case by larvae.</p>
<p>Here are some pointers to papers that extend this idea: <a href="https://arxiv.org/abs/1805.04770">self-distillation</a> makes the teacher (“big”) and student (“small”) models the same size, and in <a href="https://arxiv.org/abs/1706.00384">mutual learning</a> two or more networks learn collaboratively. Most extensions, however, build on the paper’s central theme of training on a richer signal: you might train the student to imitate the teacher’s <a href="https://arxiv.org/abs/1412.6550">intermediate</a> (or <a href="https://openreview.net/forum?id=ZzwDy_wiWv">last</a>) representations, <a href="https://arxiv.org/abs/1612.03928">attention maps</a>, etc.</p>


</section>

 ]]></description>
  <category>deep learning</category>
  <category>paper</category>
  <guid>https://ecntu.com/posts/distilling-knowledge/</guid>
  <pubDate>Tue, 28 Jan 2025 05:00:00 GMT</pubDate>
</item>
<item>
  <title>TENT: Fully Test-Time Adaptation By Entropy Minimization</title>
  <link>https://ecntu.com/posts/tent/</link>
  <description><![CDATA[ 





<p>Once a model is deployed, the feature (covariate) distribution might shift from the one seen during training. These shifts push the model out of distribution and degrade its predictions. This paper proposes a simple method to help models adapt to such shifts: minimize the entropy of your predictions.</p>
<p>That is, before making test-time predictions for a batch, you nudge (SGD) the model to predict peakier (less entropic) class distributions.</p>
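<p>A single adaptation step, stripped to its essentials, looks roughly like this (a sketch, not the paper’s reference implementation; it assumes <code>model</code> and <code>optimizer</code> have already been restricted to the parameters that should adapt):</p>

```python
import torch
import torch.nn as nn

def tent_step(model, optimizer, x):
    # Minimize the mean entropy of the batch's predicted class distributions
    logits = model(x)
    entropy = -(logits.softmax(dim=1) * logits.log_softmax(dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()  # the pre-update logits double as the batch's predictions

# Toy usage with a stand-in model (not the post's ResNet)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(8, 3, 32, 32)
preds = tent_step(model, optimizer, x).argmax(dim=1)
```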
<p>Why minimize entropy?</p>
<p>First, because it is convenient. In contrast to other methods, you neither need to modify the training procedure nor have access to test-time labels. Because labels are rarely available at test time, this makes TENT <em>“fully test-time”</em>.</p>
<p>Second, the authors argue that entropy is related to both error and shifts:</p>
<blockquote class="blockquote">
<p>“Entropy is related to error, as more confident predictions are all-in-all more correct (Figure 1). Entropy is related to shifts due to corruption, as more corruption results in more entropy, with a strong rank correlation to the loss for image classification as the level of corruption increases (Figure 2).”</p>
</blockquote>
<p>To reproduce Figures 1 &amp; 2 we train a ResNet on CIFAR-10 and evaluate its predictions on corrupted versions of the test set to simulate test-time shifts.</p>
<p>(Note: while the authors also show results for CIFAR-100 and ImageNet, we’ll only deal with this small dataset and model for convenience.)</p>
<div id="cell-4" class="cell" data-execution_count="2">
<details class="code-fold">
<summary>Datasets</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">corruption_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gaussian_noise'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shot_noise'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'impulse_noise'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'defocus_blur'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'glass_blur'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'motion_blur'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'zoom_blur'</span>,</span>
<span id="cb1-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'snow'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'frost'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'fog'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'brightness'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'contrast'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'elastic_transform'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pixelate'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'jpeg_compression'</span></span>
<span id="cb1-4">]</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># train_set = datasets.CIFAR10('../data', download = True, train = True,  transform = transforms.ToTensor())</span></span>
<span id="cb1-7">test_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  datasets.CIFAR10(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../data'</span>, download <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transforms.ToTensor())</span>
<span id="cb1-8">n_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_set.classes)</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_test_set(corr_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'brightness'</span>, data_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data'</span>, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>):</span>
<span id="cb1-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb1-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> corr_type <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corruption_types:</span>
<span id="cb1-13">        X_test, y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_cifar10c(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10_000</span>, severity, data_path, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, [corr_type])</span>
<span id="cb1-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> TensorDataset(X_test, y_test)</span>
<span id="cb1-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> test_set</span>
<span id="cb1-16"></span>
<span id="cb1-17"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_cifar10_model(model_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models/cifar10_pretrained'</span>):</span>
<span id="cb1-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb1-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> torch.load(model_path)</span>
<span id="cb1-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000; background-color: null; font-style: inherit;">Exception</span>:</span>
<span id="cb1-21">        m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_model(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Standard'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cifar10'</span>, ThreatModel.corruptions)</span>
<span id="cb1-22">        torch.save(m, model_path)</span>
<span id="cb1-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> torch.load(model_path)</span>
<span id="cb1-24"></span>
<span id="cb1-25"></span>
<span id="cb1-26">get_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span>: load_model(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Standard'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cifar10'</span>, ThreatModel.corruptions)</span>
<span id="cb1-27">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_model()</span></code></pre></div></div>
</details>
</div>
<div id="cell-5" class="cell" data-execution_count="19">
<div class="cell-output cell-output-display">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/tent/index_files/figure-html/cell-4-output-1.png" class="img-fluid figure-img"></p>
<figcaption>4 of 15 corruption types included in CIFAR-10-C, shown at the highest severity (5/5) level</figcaption>
</figure>
</div>
</div>
</div>
<div id="cell-6" class="cell" data-execution_count="24">
<details class="code-fold">
<summary>Reproduce Figs 1 &amp; 2</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_model()</span>
<span id="cb2-2"></span>
<span id="cb2-3">c, e, l <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], [], []</span>
<span id="cb2-4">corruptions, severities <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], []</span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> corr, severity <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> itertools.product([<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> corruption_types, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>)): <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 'gaussian_noise', 'gaussian_blur', 'jpeg_compression', 'snow'</span></span>
<span id="cb2-7"></span>
<span id="cb2-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> corr <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>: <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">continue</span></span>
<span id="cb2-9">    corrupted_test_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_test_set(corr_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> corr, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> severity)</span>
<span id="cb2-10">   </span>
<span id="cb2-11">    test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(corrupted_test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb2-12"></span>
<span id="cb2-13">    model.to(device)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb2-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb2-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test:</span>
<span id="cb2-16">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb2-17">            logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb2-18">            _, pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(logits, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-19">            correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>()</span>
<span id="cb2-20">            entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(logits.softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> logits.log_softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-21">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss(reduction <span class="op" style="color: #5E5E5E; background-color: null; font-style: inherit;">=</span> <span class="st" style="color: #20794D; background-color: null; font-style: inherit;">'none'</span>)(logits, labels)</span>
<span id="cb2-22">            c.append(correct)</span>
<span id="cb2-23">            e.append(entropy)</span>
<span id="cb2-24">            l.append(loss)</span>
<span id="cb2-25">            corruptions.extend([corr] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(correct))</span>
<span id="cb2-26">            severities.extend([severity] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(correct))</span>
<span id="cb2-27"></span>
<span id="cb2-28">correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cat(c).cpu().numpy()</span>
<span id="cb2-29">entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cat(e).cpu().numpy()</span>
<span id="cb2-30">loss    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cat(l).cpu().numpy()</span></code></pre></div></div>
</details>
</div>
<div id="cell-fig-1-2" class="cell quarto-layout-panel" data-layout-ncol="2" data-execution_count="30">
<div class="quarto-layout-row">
<div class="cell-output cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-1-2" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ecntu.com/posts/tent/index_files/figure-html/fig-1-2-output-1.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Predictions with lower entropy have lower error rates.
</figcaption>
</figure>
</div>
</div>
<div class="cell-output cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-1-2" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ecntu.com/posts/tent/index_files/figure-html/fig-1-2-output-2.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: More corruption (shown as alpha) leads to higher loss and entropy.
</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>The intuition here, as far as I can tell, is that entropy encodes the model’s confidence. If the model’s prediction is confident, it is all-in-all more probable to be correct (it might have seen similar examples during training, the example might be “easy”, etc). Corruptions take the model OOD and decrease its confidence. Since cross-entropy is lowest when all probability mass is assigned to the correct label, increasing entropy (all-in-all) dilutes that mass and increases loss.</p>
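<p>To make the confidence-entropy link concrete, compare the entropy of a peaked prediction with a uniform one (illustrative logits, not actual model outputs):</p>

```python
import math
import torch

def prediction_entropy(logits):
    # Shannon entropy (in nats) of the softmax distribution
    p = logits.softmax(dim=-1)
    return -(p * logits.log_softmax(dim=-1)).sum(dim=-1)

confident = torch.tensor([12.0] + [0.0] * 9)  # nearly all mass on one class
uniform = torch.zeros(10)                     # equal mass on all 10 classes

# A confident prediction sits near zero entropy; the uniform one hits
# the maximum for 10 classes, ln(10) ≈ 2.303
assert prediction_entropy(confident) < 0.01
assert abs(prediction_entropy(uniform) - math.log(10)) < 1e-5
```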
<p>Two important notes on <em>how</em> entropy is minimized:</p>
<p>First, the authors note that once we switch the model to entropy minimization we run the risk of causing it to deviate from its training. While you could choose a sufficiently small learning rate or add KL regularization to alleviate this, the authors opt for freezing most of the model and only updating the learnable parameters in the batch norm layers.</p>
<p>Second, we must adapt on batches. If we minimized the entropy of single examples, the model might just learn to assign more mass to whichever class it currently predicts, instead of learning something more generalizable.</p>
<div id="cell-fig-demo" class="cell" data-execution_count="25">
<details class="code-fold">
<summary>TENT example</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Only update BN layers</span></span>
<span id="cb3-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> prepare_for_test_time(module, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb3-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(module, nn.BatchNorm2d):</span>
<span id="cb3-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> reset_stats: module.reset_running_stats()</span>
<span id="cb3-5">        <span class="cf" style="color: #003B4F; background-color: null; font-weight: bold; font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F; background-color: null; font-weight: bold; font-style: inherit;">in</span> module.parameters(recurse <span class="op" style="color: #5E5E5E; background-color: null; font-style: inherit;">=</span> <span class="va" style="color: #111111; background-color: null; font-style: inherit;">False</span>): p.requires_grad <span class="op" style="color: #5E5E5E; background-color: null; font-style: inherit;">=</span> <span class="va" style="color: #111111; background-color: null; font-style: inherit;">True</span></span>
<span id="cb3-6">    <span class="cf" style="color: #003B4F; background-color: null; font-weight: bold; font-style: inherit;">else</span>:</span>
<span id="cb3-6-1">        <span class="cf" style="color: #003B4F; background-color: null; font-weight: bold; font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F; background-color: null; font-weight: bold; font-style: inherit;">in</span> module.parameters(recurse <span class="op" style="color: #5E5E5E; background-color: null; font-style: inherit;">=</span> <span class="va" style="color: #111111; background-color: null; font-style: inherit;">False</span>): p.requires_grad <span class="op" style="color: #5E5E5E; background-color: null; font-style: inherit;">=</span> <span class="va" style="color: #111111; background-color: null; font-style: inherit;">False</span></span>
<span id="cb3-7"></span>
<span id="cb3-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> m <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> module.children(): prepare_for_test_time(m, reset_stats)</span>
<span id="cb3-9"></span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Init the model &amp; optimizer</span></span>
<span id="cb3-12">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_cifar10_model()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.to(device)</span>
<span id="cb3-13">corr_test_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_test_set(corr_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gaussian_noise'</span>, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb3-14">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(functools.partial(prepare_for_test_time, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>))</span>
<span id="cb3-15">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.AdamW(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.00001</span>)</span>
<span id="cb3-16"></span>
<span id="cb3-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get a batch of corrupted images</span></span>
<span id="cb3-18">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">iter</span>(DataLoader(corr_test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>)))</span>
<span id="cb3-19">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb3-20"></span>
<span id="cb3-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Minimize entropy</span></span>
<span id="cb3-22">model.train()</span>
<span id="cb3-23">optimizer.zero_grad()</span>
<span id="cb3-24">preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb3-25">entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(preds.softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> preds.log_softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb3-26">loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> entropy.mean()</span>
<span id="cb3-27">loss.backward()</span>
<span id="cb3-28">optimizer.step()</span>
<span id="cb3-29"></span>
<span id="cb3-30">new_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb3-31"></span>
<span id="cb3-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot</span></span>
<span id="cb3-33">f, axs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb3-34">ix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">89</span></span>
<span id="cb3-35">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].imshow(test_set[ix][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).cpu().numpy())</span>
<span id="cb3-36">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_title(test_set.classes[test_set[ix][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span>
<span id="cb3-37">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_xticks([])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_yticks([])</span>
<span id="cb3-38"></span>
<span id="cb3-39">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].imshow(images[ix].permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).cpu().numpy())</span>
<span id="cb3-40">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'corrupted'</span>)</span>
<span id="cb3-41">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_xticks([])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_yticks([])</span>
<span id="cb3-42"></span>
<span id="cb3-43">order <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.argsort(preds.softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)[ix]).detach().cpu().numpy()[::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb3-44">rows <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'class'</span>: i, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prob'</span>:p, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'type'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'unadapted'</span>} <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(preds.softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)[ix].detach().cpu().numpy()[order])]</span>
<span id="cb3-45">rows.extend([{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'class'</span>: i, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prob'</span>:p, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'type'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'TENT'</span>} <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(new_preds.softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)[ix].detach().cpu().numpy()[order])])</span>
<span id="cb3-46">sns.barplot(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'class'</span>, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prob'</span>, hue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'type'</span>, data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(rows), ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb3-47">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>].set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'class distribution'</span>)</span>
<span id="cb3-48">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>].legend()</span>
<span id="cb3-49">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>].set_xticks([])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>].set_yticks([])</span>
<span id="cb3-50"></span>
<span id="cb3-51">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div id="fig-demo" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-demo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ecntu.com/posts/tent/index_files/figure-html/fig-demo-output-1.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-demo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: TENT adapts the model to output lower-entropy class distributions, in the hope of performing better in out-of-distribution (here, corrupted) settings.
</figcaption>
</figure>
</div>
</div>
</div>
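<p>The quantity being minimized above is just the Shannon entropy <em>H(p) = −∑ pᵢ log pᵢ</em> of each predicted class distribution, so confident (peaked) predictions score lower. A minimal pure-Python sketch (the helper name is ours, not from the paper):</p>

```python
import math

def prediction_entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i) of one class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Uniform over 4 classes attains the maximum entropy log(4) ~ 1.386;
# a peaked prediction scores far lower, so minimizing entropy rewards confidence.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked  = [0.97, 0.01, 0.01, 0.01]
print(prediction_entropy(uniform), prediction_entropy(peaked))
```

<p>This is the per-image quantity the <code>entropy</code> line in the cell above computes, batched via <code>softmax</code>/<code>log_softmax</code> for numerical stability.</p>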
<p>Now for evaluation. While the authors consider other baselines, for simplicity we compare TENT only against the unadapted source model and a test-time normalization <a href="https://arxiv.org/abs/2006.16971">method</a> (“Norm”), which simply re-estimates the BN statistics on the test data without updating any weights.</p>
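<p>Why is a bare training-mode forward pass enough for “Norm”? BatchNorm layers update their running statistics during any forward in train mode, with no gradient step required. A small sketch of this behaviour (PyTorch's default BN momentum of 0.1 assumed):</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A fresh BN layer starts with running_mean = 0
bn = nn.BatchNorm2d(3)
before = bn.running_mean.clone()

# A "test" batch drawn from a shifted distribution
x = torch.randn(16, 3, 8, 8) + 5.0

# Forward in train mode: no loss, no backward, no optimizer step
bn.train()
with torch.no_grad():
    bn(x)

# running_mean moved towards the batch mean via the momentum rule:
# running = (1 - momentum) * running + momentum * batch_mean
after = bn.running_mean.clone()
print(before, after)
```

<p>Note the stats update happens even under <code>torch.no_grad()</code>; it is part of the forward pass, not of autograd.</p>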
<div id="cell-11" class="cell">
<details class="code-fold">
<summary>Eval unadapted model</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> eval_source(init_model_fn, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>):</span>
<span id="cb4-2">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_model_fn()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.to(device)</span>
<span id="cb4-3">    results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb4-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> corr_type <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corruption_types:</span>
<span id="cb4-5">        corr_test_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_test_set(corr_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> corr_type, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> severity)</span>
<span id="cb4-6">        _, source_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, DataLoader(corr_test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size))</span>
<span id="cb4-7">        results[corr_type] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> source_acc</span>
<span id="cb4-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> results</span>
<span id="cb4-9"></span>
<span id="cb4-10">source_results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_source(get_cifar10_model)</span>
<span id="cb4-11">torch.save(source_results, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/source_results'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-12" class="cell">
<details class="code-fold">
<summary>Eval Norm</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> reset_bn_stats(module):</span>
<span id="cb5-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(module, nn.BatchNorm2d):</span>
<span id="cb5-3">        module.reset_running_stats()</span>
<span id="cb5-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> m <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> module.children(): reset_bn_stats(m)</span>
<span id="cb5-5"></span>
<span id="cb5-6"></span>
<span id="cb5-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@torch.no_grad</span>()</span>
<span id="cb5-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> eval_norm(init_model_fn, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, corr_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb5-9"></span>
<span id="cb5-10">    results_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb5-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> corr_types <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>: corr_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> corruption_types</span>
<span id="cb5-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> corr_type <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corr_types:</span>
<span id="cb5-13">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(corr_type)</span>
<span id="cb5-14">        </span>
<span id="cb5-15">        model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_model_fn()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.to(device)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Re-init the model</span></span>
<span id="cb5-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> reset_stats: model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(reset_bn_stats)</span>
<span id="cb5-17">        corr_test_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_test_set(corr_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> corr_type, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> severity)</span>
<span id="cb5-18"></span>
<span id="cb5-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (images, labels) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(DataLoader(corr_test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size)):</span>
<span id="cb5-20">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb5-21"></span>
<span id="cb5-22">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Update the BN stats</span></span>
<span id="cb5-23">            model.train()</span>
<span id="cb5-24">            preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb5-25"></span>
<span id="cb5-26">            err <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(preds, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().item() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> labels.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb5-27">            results_acc[corr_type] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> results_acc.get(corr_type, []) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [err]</span>
<span id="cb5-28">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(err)</span>
<span id="cb5-29">        </span>
<span id="cb5-30">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> results_acc</span>
<span id="cb5-31"></span>
<span id="cb5-32">norm_results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_norm(get_cifar10_model, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-33">torch.save(norm_results, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/norm_results_all'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-13" class="cell" data-execution_count="10">
<details class="code-fold">
<summary>Eval TENT</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Only update BN layers</span></span>
<span id="cb6-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> prepare_for_test_time(module, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb6-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(module, nn.BatchNorm2d):</span>
<span id="cb6-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> reset_stats: module.reset_running_stats()</span>
<span id="cb6-5">        module.requires_grad_(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>: module.requires_grad_(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb6-7"></span>
<span id="cb6-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> m <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> module.children(): prepare_for_test_time(m, reset_stats)</span>
<span id="cb6-9"></span>
<span id="cb6-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> eval_tent(init_model_fn, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, corr_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb6-11"></span>
<span id="cb6-12">    results_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb6-13">    results_e <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb6-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> corr_types <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>: corr_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> corruption_types</span>
<span id="cb6-15"></span>
<span id="cb6-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> corr_type <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corr_types:</span>
<span id="cb6-17">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(corr_type)</span>
<span id="cb6-18"></span>
<span id="cb6-19">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Re-init the model &amp; optimizer</span></span>
<span id="cb6-20">        model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_model_fn()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.to(device)</span>
<span id="cb6-21">        corr_test_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_test_set(corr_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> corr_type, severity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> severity)</span>
<span id="cb6-22">        model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(functools.partial(prepare_for_test_time, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reset_stats))</span>
<span id="cb6-23">        optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.AdamW(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr)</span>
<span id="cb6-24"></span>
<span id="cb6-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (images, labels) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(DataLoader(corr_test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size)):</span>
<span id="cb6-28">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb6-29"></span>
<span id="cb6-30">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Minimize entropy</span></span>
<span id="cb6-31">            model.train()</span>
<span id="cb6-32">            optimizer.zero_grad()</span>
<span id="cb6-33">            preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb6-34">            entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(preds.softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> preds.log_softmax(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-35">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> entropy.mean()</span>
<span id="cb6-36">            loss.backward()</span>
<span id="cb6-37">            optimizer.step()</span>
<span id="cb6-38"></span>
<span id="cb6-39">            err <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(preds, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().item() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> labels.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb6-40">            results_acc[corr_type] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> results_acc.get(corr_type, []) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [err]</span>
<span id="cb6-41">            results_e[corr_type] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> results_e.get(corr_type, []) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [loss.item()]</span>
<span id="cb6-42">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(err)</span>
<span id="cb6-43"></span>
<span id="cb6-44">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> results_acc, results_e</span>
<span id="cb6-45"></span>
<span id="cb6-46"></span>
<span id="cb6-47"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tent_acc, tent_entropy = eval_tent(get_cifar10_model, reset_stats = False, lr = 0.00001)</span></span>
<span id="cb6-48"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># torch.save(tent_acc, 'logs/tent_acc'); torch.save(tent_entropy, 'logs/tent_entropy')</span></span>
<span id="cb6-49"></span>
<span id="cb6-50"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tent_acc_r, tent_entropy_r = eval_tent(get_model, reset_stats = True, lr = 0.00001)</span></span>
<span id="cb6-51"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># torch.save(tent_acc_r, 'logs/tent_acc_r'); torch.save(tent_entropy_r, 'logs/tent_entropy_r')</span></span></code></pre></div></div>
</details>
</div>
<div id="cell-fig-results" class="cell" data-execution_count="26">
<div class="cell-output cell-output-display">
<div id="fig-results" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-results-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ecntu.com/posts/tent/index_files/figure-html/fig-results-output-1.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-results-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: TENT &amp; Norm consistently outperform the unadapted model, with TENT (lr = 1e-5, batch_size = 128) taking a slight lead.
</figcaption>
</figure>
</div>
</div>
</div>
<div id="cell-16" class="cell" data-execution_count="11">
<details class="code-fold">
<summary>Hyperparam grid</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">grid_results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb7-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> lr, b <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> itertools.product([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-6</span>], <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>)):</span>
<span id="cb7-3">    batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> b)</span>
<span id="cb7-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(lr, batch_size)</span>
<span id="cb7-5">    tent_acc, tent_entropy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_tent(get_cifar10_model, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, corr_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gaussian_noise'</span>])</span>
<span id="cb7-6">    grid_results[(lr, batch_size)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (tent_acc, tent_entropy)</span>
<span id="cb7-7"></span>
<span id="cb7-8">torch.save(grid_results, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/grid_results_all'</span>)</span>
<span id="cb7-9"></span>
<span id="cb7-10">grid_results_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb7-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> b <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>):</span>
<span id="cb7-12">    batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> b)</span>
<span id="cb7-13">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(batch_size)</span>
<span id="cb7-14">    acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_norm(get_cifar10_model, reset_stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, corr_types <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gaussian_noise'</span>])</span>
<span id="cb7-15">    grid_results_norm[(batch_size)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> acc</span>
<span id="cb7-16"></span>
<span id="cb7-17">torch.save(grid_results_norm, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/grid_results_norm'</span>)</span></code></pre></div></div>
</details>
</div>
<p>The paper shows TENT having more of a lead on this dataset, but this is the best I could do.</p>
<p>How sensitive is it to hyperparameters? TENT has two: the test-time learning rate and batch size. We vary these and show results for the <code>gaussian_noise</code> corruption.</p>
<div id="cell-fig-grid" class="cell quarto-layout-panel" data-layout-ncol="2" data-execution_count="9">
<div class="quarto-layout-row">
<div class="cell-output cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-grid" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-grid-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ecntu.com/posts/tent/index_files/figure-html/fig-grid-output-1.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-grid-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;5: TENT is sensitive to learning rate and batch size. Note that in practice these are scaled together.
</figcaption>
</figure>
</div>
</div>
<div class="cell-output cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-grid" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-grid-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ecntu.com/posts/tent/index_files/figure-html/fig-grid-output-2.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-grid-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;6: TENT reaches lower entropy than Norm.
</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>You can see that TENT seems quite sensitive to hyperparameters, which is a <a href="https://arxiv.org/abs/2306.03536">common challenge</a> to all Test-Time Adaptation methods. There definitely seems to be an entropy sweet-spot – presumably specific to the dataset and shift – controlled by the learning rate and batch size.</p>
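<p>For intuition on the entropy axis in these figures: the objective is just the Shannon entropy of the softmax output, which is zero for a one-hot prediction and maximal for a uniform one. A quick numeric sanity check (toy probabilities, not values from the experiments):</p>

```python
import math

def entropy(probs):
    # Shannon entropy of a probability vector, in nats.
    return -sum(p * math.log(p) for p in probs if p)

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: the model is sure
uniform = [0.25, 0.25, 0.25, 0.25]     # maximal entropy: ln(4), about 1.386 nats
```

<p>TENT pushes predictions toward the confident end of this scale; the sweet-spot observation above says that pushing too hard (large learning rate) collapses predictions rather than adapting them.</p>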
<p>So? TENT has some drawbacks. It is limited to classification (as far as I can tell), can <a href="https://arxiv.org/pdf/2410.10894v1">make models overconfident</a>, cannot be applied online (we need batches), and is sensitive to hyperparameters. However, you do not need test-time labels and can use a pretrained model. Most importantly, the technique is simple, intuitive, and seems to work.</p>
<p><em>All in all</em>, it was an interesting paper that introduced me to the test-time adaptation literature, and it was worth the read.</p>



 ]]></description>
  <category>deep learning</category>
  <category>paper</category>
  <guid>https://ecntu.com/posts/tent/</guid>
  <pubDate>Sun, 29 Dec 2024 05:00:00 GMT</pubDate>
  <media:content url="https://ecntu.com/posts/tent/index_files/figure-html/fig-demo-output-1.png" medium="image" type="image/png" height="53" width="144"/>
</item>
<item>
  <title>A Closer Look at Memorization in Deep Networks</title>
  <link>https://ecntu.com/posts/nn-memorization/</link>
  <description><![CDATA[ 





<p>This paper argues that memorization is a behavior exhibited by networks trained on random data, as, in the absence of patterns, they can only rely on remembering examples. The authors investigate this phenomenon and make three key claims:</p>
<ol type="1">
<li>Networks do not exclusively memorize data.</li>
<li>Networks initially learn simple patterns before resorting to memorization.</li>
<li>Regularization prevents memorization and promotes generalization.</li>
</ol>
<p>Here we aim to reproduce Figures 1, 7, and 8 from the paper.</p>
<section id="fig-1" class="level2">
<h2 class="anchored" data-anchor-id="fig-1">Fig 1</h2>
<p>To support the first claim, the authors argue that if networks simply memorized inputs, they would perform equally well on all training examples. If, however, networks learn patterns, some examples should be easy to learn because they fit those patterns better than others. To test this, they train an MLP for a single epoch from 100 different initializations and data shufflings, and log the percentage of runs in which each example was correctly classified.</p>
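<p>The per-run agreement statistic described above reduces to averaging boolean correctness masks across runs. A toy sketch (the <code>example_difficulty</code> helper is hypothetical, not the paper's code):</p>

```python
import numpy as np

def example_difficulty(correct_masks):
    # correct_masks: one boolean array per training run, marking which
    # examples were classified correctly after a single epoch.
    stacked = np.stack(correct_masks)   # shape (runs, n_examples)
    return stacked.mean(axis=0)         # per-example fraction of correct runs

# Toy illustration: 3 runs over 4 examples.
masks = [np.array([1, 1, 0, 0], dtype=bool),
         np.array([1, 0, 0, 1], dtype=bool),
         np.array([1, 1, 0, 0], dtype=bool)]
freqs = example_difficulty(masks)   # [1.0, 2/3, 0.0, 1/3]
```

<p>Under pure memorization these frequencies should be roughly flat across examples; a spread-out distribution indicates some examples fit the learned patterns better than others.</p>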
<p>The experiment is performed with the CIFAR10 dataset, a noisy input version <em>RandX</em>, and a noisy label version <em>RandY</em>. We first define dataset wrappers to implement the noisy variants. Note that for epoch-to-epoch consistency we determine which examples to corrupt at initialization.</p>
<div id="cell-6" class="cell" data-execution_count="265">
<details class="code-fold">
<summary>Random dataset wrappers</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> RandX(Dataset):</span>
<span id="cb1-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Injects input noise into the dataset by replacing a fraction x of inputs with random Gaussian N(0, 1) noise"""</span></span>
<span id="cb1-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, dataset, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>):</span>
<span id="cb1-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dataset</span>
<span id="cb1-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x</span>
<span id="cb1-6"></span>
<span id="cb1-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb1-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx, (img, _) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset):</span>
<span id="cb1-9">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> np.random.rand() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> x:</span>
<span id="cb1-10">                <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified[idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn_like(img)</span>
<span id="cb1-11">        torch.save(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified, os.path.join(dataset.root, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'randX_modified'</span>))</span>
<span id="cb1-12"></span>
<span id="cb1-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__len__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>): <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset)</span>
<span id="cb1-14"></span>
<span id="cb1-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__getitem__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, idx):</span>
<span id="cb1-16">        X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset[idx]</span>
<span id="cb1-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified.get(idx, X), y </span>
<span id="cb1-18"></span>
<span id="cb1-19"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> RandY(Dataset):</span>
<span id="cb1-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Injects label noise into the dataset by replacing a fraction y of labels with random labels"""</span></span>
<span id="cb1-21">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, dataset, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>):</span>
<span id="cb1-22">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dataset</span>
<span id="cb1-23">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y</span>
<span id="cb1-24"></span>
<span id="cb1-25">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb1-26">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset)):</span>
<span id="cb1-27">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> np.random.rand() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> y:</span>
<span id="cb1-28">                <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified[idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset.classes))</span>
<span id="cb1-29">        torch.save(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified, os.path.join(dataset.root, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'randY_modified'</span>))</span>
<span id="cb1-30"></span>
<span id="cb1-31">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__len__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>): <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset)</span>
<span id="cb1-32"></span>
<span id="cb1-33">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__getitem__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, idx):</span>
<span id="cb1-34">        X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset[idx]</span>
<span id="cb1-35">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> X, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.modified.get(idx, y)</span></code></pre></div></div>
</details>
</div>
<p>Now we define a standard training loop, initialization functions, and the MLP specified in the paper.</p>
<div id="cell-8" class="cell" data-execution_count="8">
<details class="code-fold">
<summary>Model initing and training functions</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train(model, train, val, optimizer, criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss(), epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, save_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models/tmp'</span>):</span>
<span id="cb2-2">    model.to(device)</span>
<span id="cb2-3">    train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(train, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb2-4">    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(val, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb2-5">    best_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.inf</span>
<span id="cb2-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(epochs):</span>
<span id="cb2-7">        model.train()</span>
<span id="cb2-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> train:</span>
<span id="cb2-9">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb2-10">            optimizer.zero_grad()</span>
<span id="cb2-11">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(model(images), labels)</span>
<span id="cb2-12">            loss.backward()</span>
<span id="cb2-13">            optimizer.step()</span>
<span id="cb2-14">        val_loss, val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, val)</span>
<span id="cb2-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> val_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> best_loss:</span>
<span id="cb2-16">            best_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_loss</span>
<span id="cb2-17">            torch.save(model.state_dict(), save_path)</span>
<span id="cb2-18">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_acc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb2-19"></span>
<span id="cb2-20"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _init_weights(m):</span>
<span id="cb2-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Linear'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(m)):</span>
<span id="cb2-22">        nn.init.xavier_uniform_(m.weight)</span>
<span id="cb2-23">        nn.init.zeros_(m.bias)</span>
<span id="cb2-24">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, nn.Conv2d):</span>
<span id="cb2-25">        nn.init.kaiming_normal_(m.weight, mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fan_out"</span>, nonlinearity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"relu"</span>)</span>
<span id="cb2-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, (nn.BatchNorm2d, nn.GroupNorm)):</span>
<span id="cb2-27">        nn.init.constant_(m.weight, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-28">        nn.init.constant_(m.bias, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-29"></span>
<span id="cb2-30"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> initialize_model(model, data_loader, in_device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb2-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb2-32">        imgs, _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">iter</span>(data_loader))</span>
<span id="cb2-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> in_device: model.to(device)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> imgs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imgs.to(device)</span>
<span id="cb2-34">        _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(imgs)</span>
<span id="cb2-35">        model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(_init_weights)</span></code></pre></div></div>
</details>
</div>
<div id="cell-9" class="cell" data-execution_count="248">
<details class="code-fold">
<summary>MLP definition</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> MLP(nn.Module):</span>
<span id="cb3-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, n_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>):</span>
<span id="cb3-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb3-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Sequential(</span>
<span id="cb3-5">            nn.Flatten(),</span>
<span id="cb3-6">            nn.LazyLinear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4096</span>),</span>
<span id="cb3-7">            nn.ReLU(),</span>
<span id="cb3-8">            nn.LazyLinear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4096</span>),</span>
<span id="cb3-9">            nn.ReLU(),</span>
<span id="cb3-10">            nn.LazyLinear(n_classes)</span>
<span id="cb3-11">        )</span>
<span id="cb3-12"></span>
<span id="cb3-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb3-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model(x)</span></code></pre></div></div>
</details>
</div>
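<p>As an aside on <code>initialize_model</code> above: it runs a dummy forward pass before applying <code>_init_weights</code> because the MLP is built from <code>nn.LazyLinear</code> layers, whose weights do not exist until a first batch fixes their input sizes. A minimal standalone sketch (toy shapes, not the experiment's data):</p>

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Flatten(), nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(2))

# Before any forward pass the lazy layers hold UninitializedParameters,
# so init functions like xavier_uniform_ cannot be applied yet.
with torch.no_grad():
    _ = mlp(torch.randn(4, 3, 5, 5))  # dummy batch materializes the weight shapes

# Now every Linear has concrete weights and can be re-initialized.
for m in mlp.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

print(mlp[1].weight.shape)  # input features inferred from the dummy batch: 3*5*5 = 75
```

<p>Calling <code>nn.init.xavier_uniform_</code> before the dummy forward would raise an error, since the lazy parameters have no shape yet.</p>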
<p>And run the experiment, training models for a single epoch as in the paper, but also for 10 epochs to investigate how the results vary.</p>
<div id="cell-11" class="cell">
<details class="code-fold">
<summary>Get estimated P(correct)</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add_missclassified(missclassified, model, test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>):</span>
<span id="cb4-2">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.to(device)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb4-3">    test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb4-4">    i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb4-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test:</span>
<span id="cb4-7">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb4-8">            _, pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(model(images), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb4-9">            missclassified[i:i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> test.batch_size] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> (pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>()</span>
<span id="cb4-10">            i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> test.batch_size</span>
<span id="cb4-11"></span>
<span id="cb4-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> gen_fig_1(epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n_inits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>):</span>
<span id="cb4-13">    training_sets <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [train_set, RandX(train_set, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>), RandY(train_set, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>)]</span>
<span id="cb4-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> training_set <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> training_sets:</span>
<span id="cb4-15">        missclassified <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.zeros(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_set)).to(device)</span>
<span id="cb4-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_inits):</span>
<span id="cb4-17">            m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MLP()</span>
<span id="cb4-18">            initialize_model(m, DataLoader(train_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>))</span>
<span id="cb4-19">            train(m, training_set, test_set, optim.SGD(m.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>), epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> epochs)</span>
<span id="cb4-20">            add_missclassified(missclassified, m, test_set)</span>
<span id="cb4-21">        missclassified <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> n_inits</span>
<span id="cb4-22">        torch.save(missclassified, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'logs/missclassified_epochs=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> training_set.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__class__</span>.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb4-23"></span>
<span id="cb4-24">gen_fig_1(epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n_inits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb4-25">gen_fig_1(epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, n_inits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-12" class="cell" data-execution_count="249">
<details class="code-fold">
<summary>Plot results</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-2"></span>
<span id="cb5-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> ax, epochs <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>([ax1, ax2], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>]):</span>
<span id="cb5-4"></span>
<span id="cb5-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> fname <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>([f <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> f <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> os.listdir(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'missclassified'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> f <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'epochs=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> f]):</span>
<span id="cb5-6">        missclassified <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.load(os.path.join(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span>, fname))</span>
<span id="cb5-7">        p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> missclassified).sort().values.to(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cpu'</span>).numpy()</span>
<span id="cb5-8">        ax.plot(p, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fname.split(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_'</span>)[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb5-9">    </span>
<span id="cb5-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot binomially sampled points</span></span>
<span id="cb5-11">    randX_mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> torch.load(os.path.join(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span>, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'missclassified_epochs=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_RandX'</span>)).mean().item()</span>
<span id="cb5-12">    bin_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.binomial(n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> randX_mean, size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb5-13">    bin_data.sort()</span>
<span id="cb5-14">    ax.plot(bin_data, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Binomial_X'</span>)</span>
<span id="cb5-15">    ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Epoch = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb5-16"></span>
<span id="cb5-17">ax1.legend()</span>
<span id="cb5-18">f.supylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'P(correct)'</span>)</span>
<span id="cb5-19">f.supxlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Example(sorted by P(correct))'</span>)</span>
<span id="cb5-20">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/figure-html/cell-8-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Observe that the left panel is very similar to the paper’s figure. Whereas real data has easy patterns that can be learned in a single epoch, random data does not, and networks must resort to memorization. After 10 epochs we observe that the networks trained on random data manage to improve performance on a few points at the expense of the rest, whose performance becomes worse than random.</p>
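<p>As a sanity check on the binomial baseline plotted above: if every test example were equally hard, per-example accuracy over independent runs would be pure binomial noise, with a much tighter spread than the curves on real data. A small sketch with illustrative numbers (<code>n_inits</code>, <code>p_mean</code> here are assumptions, not the experiment's values):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_inits, p_mean, n_examples = 100, 0.1, 10000

# If every example had the same P(correct) = p_mean, per-example accuracy over
# n_inits runs would be distributed as Binomial(n_inits, p_mean) / n_inits.
per_example_acc = rng.binomial(n=n_inits, p=p_mean, size=n_examples) / n_inits

# The spread is tight: std = sqrt(p(1-p)/n_inits) ~ 0.03 here, so a wide sorted
# curve (as on real data) signals genuinely different per-example difficulty.
print(per_example_acc.mean(), per_example_acc.std())
```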
<p>Out of curiosity, here are the 10 hardest and easiest examples.</p>
<div id="cell-15" class="cell" data-execution_count="199">
<details class="code-fold">
<summary>Plot hardest and easiest examples</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb6-2">f, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, n, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb6-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, idx <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(torch.sort(p).indices[:n]):</span>
<span id="cb6-4">    img, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test_set[idx]</span>
<span id="cb6-5">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][i].imshow(img.permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).numpy())</span>
<span id="cb6-6">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][i].axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'off'</span>)</span>
<span id="cb6-7">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][i].set_title(test_set.classes[label].replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mobile'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>))</span>
<span id="cb6-8"></span>
<span id="cb6-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, idx <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(torch.sort(p).indices[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>n:]):</span>
<span id="cb6-10">    img, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test_set[idx]</span>
<span id="cb6-11">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>][i].imshow(img.permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).numpy())</span>
<span id="cb6-12">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>][i].axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'off'</span>)</span>
<span id="cb6-13">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>][i].set_title(test_set.classes[label].replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mobile'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>))</span>
<span id="cb6-14"></span>
<span id="cb6-15">f.suptitle(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Hardest (top) and easiest (bottom) examples'</span>)</span>
<span id="cb6-16">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="fig-2" class="level2">
<h2 class="anchored" data-anchor-id="fig-2">Fig 2</h2>
<p>The fact that networks learn patterns when trained on real data and don’t when trained on noise can also be visualized by plotting the first layer weights of a convolutional network. We show the weights for networks trained for 10 epochs on real and random data.</p>
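<p>The kind of plot meant here can be sketched as follows. <code>conv1</code> stands in for the first layer of a trained network (untrained here, just to keep the snippet self-contained), with 200 filters of kernel size 5 as in the ConvNet defined below:</p>

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Stand-in for the first conv layer of a trained network.
conv1 = nn.Conv2d(3, 200, kernel_size=5)

# Normalize each 3x5x5 filter to [0, 1] so it can be shown as an RGB patch.
w = conv1.weight.detach()
lo = w.amin(dim=(1, 2, 3), keepdim=True)
hi = w.amax(dim=(1, 2, 3), keepdim=True)
w = (w - lo) / (hi - lo)

f, axes = plt.subplots(10, 20, figsize=(10, 5))
for ax, filt in zip(axes.flat, w):
    ax.imshow(filt.permute(1, 2, 0).numpy())  # CHW -> HWC for imshow
    ax.axis('off')
f.suptitle('First-layer filters')
f.tight_layout()
```

<p>On real data the filters typically develop visible structure (edges, color blobs); on random labels they tend to stay noisy.</p>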
<div id="cell-18" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>ConvNet definition</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> ConvNet(nn.Module):</span>
<span id="cb7-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, num_classes):</span>
<span id="cb7-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>(ConvNet, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>).<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb7-4"></span>
<span id="cb7-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Sequential(</span>
<span id="cb7-6">            nn.LazyConv2d(out_channels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>, kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># LazyConv2d to infer input channels</span></span>
<span id="cb7-7">            nn.BatchNorm2d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>),</span>
<span id="cb7-8">            nn.ReLU(),</span>
<span id="cb7-9">            nn.MaxPool2d(kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb7-10"></span>
<span id="cb7-11">            nn.LazyConv2d(out_channels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>, kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Another LazyConv2d</span></span>
<span id="cb7-12">            nn.BatchNorm2d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>),</span>
<span id="cb7-13">            nn.ReLU(),</span>
<span id="cb7-14">            nn.MaxPool2d(kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb7-15"></span>
<span id="cb7-16">            nn.Flatten(),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Flatten for the fully connected layer</span></span>
<span id="cb7-17">            nn.LazyLinear(out_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">384</span>),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># LazyLinear to infer input features</span></span>
<span id="cb7-18">            nn.BatchNorm1d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">384</span>),</span>
<span id="cb7-19">            nn.ReLU(),</span>
<span id="cb7-20"></span>
<span id="cb7-21">            nn.LazyLinear(out_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">192</span>),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Another LazyLinear</span></span>
<span id="cb7-22">            nn.BatchNorm1d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">192</span>),</span>
<span id="cb7-23">            nn.ReLU(),</span>
<span id="cb7-24"></span>
<span id="cb7-25">            nn.LazyLinear(out_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>num_classes)</span>
<span id="cb7-26">        )</span>
<span id="cb7-27"></span>
<span id="cb7-28">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb7-29">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model(x)</span>
<span id="cb7-30">    </span>
<span id="cb7-31"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train_conv(model, train, val, optimizer, criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss(), epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>):</span>
<span id="cb7-32">    model.to(device)</span>
<span id="cb7-33">    train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(train, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb7-34">    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(val, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb7-35">    scheduler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.lr_scheduler.StepLR(optimizer, step_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb7-36">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(epochs):</span>
<span id="cb7-37">        model.train()</span>
<span id="cb7-38">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> train:</span>
<span id="cb7-39">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb7-40">            optimizer.zero_grad()</span>
<span id="cb7-41">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(model(images), labels)</span>
<span id="cb7-42">            loss.backward()</span>
<span id="cb7-43">            optimizer.step()</span>
<span id="cb7-44">        scheduler.step()</span>
<span id="cb7-45">        val_loss, val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, val)</span>
<span id="cb7-46">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_acc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-19" class="cell" data-execution_count="165">
<details class="code-fold">
<summary>Train models</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">cifar10_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ConvNet(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> n_classes)</span>
<span id="cb8-2">randY_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ConvNet(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> n_classes)</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> model, dataset <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>([cifar10_model, randY_model], [train_set, RandY(train_set)]):</span>
<span id="cb8-5">    initialize_model(model, DataLoader(dataset, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>))</span>
<span id="cb8-6">    train_conv(model, dataset, test_set, optim.SGD(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, momentum <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>), epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb8-7">    torch.save(model.state_dict(), <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'models/fig2_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>model<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__class__</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-20" class="cell" data-execution_count="182">
<details class="code-fold">
<summary>Plot filters</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, model, ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Cifar 10'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'RandY'</span>], [cifar10_model, randY_model], [ax1, ax2]):</span>
<span id="cb9-4"></span>
<span id="cb9-5">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb9-6">    kernel_weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.model[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].weight[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>].detach().cpu().clone()</span>
<span id="cb9-7">    kernel_weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (kernel_weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> kernel_weights.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>()) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (kernel_weights.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> kernel_weights.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>())</span>
<span id="cb9-8">    filter_img <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> utils.make_grid(kernel_weights, nrow <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, padding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb9-9">    ax.imshow(filter_img.permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))</span>
<span id="cb9-10">    ax.set_title(name)</span>
<span id="cb9-11">    ax.axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'off'</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/figure-html/cell-12-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>We can see that the filters learned by the network trained on real data are visibly structured and plausibly useful, in contrast to the noisy-looking filters learned by training on random labels.</p>
</section>
<section id="fig-9" class="level2">
<h2 class="anchored" data-anchor-id="fig-9">Fig 9</h2>
<p>To support the claim that networks trained on real data learn simpler hypotheses because they capture patterns, the authors introduce the <em>Critical Sample Ratio</em> (CSR) as a complexity measure. The idea is to</p>
<blockquote class="blockquote">
<p>“estimate the complexity by measuring how densely points on the data manifold are present around the model’s decision boundaries. Intuitively, if we were to randomly sample points from the data distribution, a smaller fraction of points in the proximity of a decision boundary suggests that the learned hypothesis is simpler.”</p>
</blockquote>
<p>A simple sketch illustrates:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/images/fig9_sketch.jpg" class="img-fluid figure-img" width="300"></p>
<figcaption>CSR intuition sketch</figcaption>
</figure>
</div>
<p>To estimate the density of points close to decision boundaries, we can perturb each data point within a box of size <img src="https://latex.codecogs.com/png.latex?r"> and check whether any perturbed copy crosses a decision boundary. If one does, we call the original point “critical”. The <em>Critical Sample Ratio</em> is then the proportion of critical points, and we expect simpler hypotheses to have lower CSRs.</p>
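This naive estimate can be sketched with purely random box perturbations. The toy model, data, and all names below (`toy_model`, `points`, `naive_csr`) are illustrative stand-ins, not from the paper, which uses a guided search instead of pure noise:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier and stand-in "data manifold" samples, for illustration only
toy_model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
points = torch.randn(100, 2)

def naive_csr(model, x, r=0.3, n_samples=20):
    """Fraction of points whose prediction flips under random box perturbations."""
    model.eval()
    with torch.no_grad():
        base = model(x).argmax(dim=1)
        critical = torch.zeros(len(x), dtype=torch.bool)
        for _ in range(n_samples):
            noise = (torch.rand_like(x) * 2 - 1) * r  # uniform in [-r, r]
            critical |= model(x + noise).argmax(dim=1) != base
    return critical.float().mean().item()

csr = naive_csr(toy_model, points)
print(f"naive CSR estimate: {csr:.2f}")
```

Random perturbations rarely find nearby boundary crossings efficiently, which motivates the guided search described next.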
<p>The perturbation applied to the data points is not purely random. The paper’s technique, presented in Algorithm 1, borrows ideas from adversarial attacks and is called Langevin Adversarial Sample Search (LASS). Here is my implementation.</p>
<div id="cell-25" class="cell" data-execution_count="195">
<details class="code-fold">
<summary>LASS implementation</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> standard_normal(shape):</span>
<span id="cb10-2">    r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn(shape)</span>
<span id="cb10-3">    r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r.to(device)</span>
<span id="cb10-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> r</span>
<span id="cb10-5"></span>
<span id="cb10-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> lass(model, x, alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">255</span>, beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">255</span>, r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">255</span>, eta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> standard_normal, max_iter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>):</span>
<span id="cb10-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb10-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Langevin Adversarial Sample Search (LASS).</span></span>
<span id="cb10-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Finds a perturbation of x that changes the model's prediction.</span></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb10-11"><span class="co" style="color: #5E5E5E;
background-color: null;
        labels: Tensor">
font-style: inherit;">        x: Input tensor to perturb.</span></span>
<span id="cb10-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        alpha: Step size for the gradient sign method.</span></span>
<span id="cb10-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        beta: Scaling factor for the noise.</span></span>
<span id="cb10-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        r: Clipping radius for adversarial perturbations.</span></span>
<span id="cb10-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        eta: Noise process.</span></span>
<span id="cb10-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb10-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
# Orignal prediction">
font-style: inherit;"># Original prediction</span></span>
<span id="cb10-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb10-19">        pred_on_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(x).argmax(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb10-20">    </span>
<span id="cb10-21"></span>
<span id="cb10-22">    x_adv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.clone().detach().requires_grad_(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb10-23">    converged <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb10-24">    iter_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb10-25"></span>
<span id="cb10-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> converged <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> iter_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> max_iter:</span>
<span id="cb10-27">        iter_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-28"></span>
<span id="cb10-29">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Forward pass to get model output</span></span>
<span id="cb10-30">        x_adv.requires_grad_(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb10-31">        output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(x_adv)</span>
<span id="cb10-32"></span>
<span id="cb10-33">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute gradient of the output with respect to input</span></span>
<span id="cb10-34">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.cross_entropy(output, pred_on_x)  <span class="co" style="color: #5E5E5E;
background-color: null;
# Use actual labels">
font-style: inherit;"># Target: the model's original prediction</span></span>
<span id="cb10-35">        loss.backward()</span>
<span id="cb10-36"></span>
<span id="cb10-37">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute the perturbation</span></span>
<span id="cb10-38">        gradient_sign <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x_adv.grad.sign()</span>
<span id="cb10-39">        delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> gradient_sign <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> eta(x_adv.shape)</span>
<span id="cb10-40"></span>
<span id="cb10-41">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb10-42">            </span>
<span id="cb10-43">            x_adv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> delta</span>
<span id="cb10-44"></span>
<span id="cb10-45">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Apply the clipping to each dimension so that each pixel is in the range [x - r, x + r]</span></span>
<span id="cb10-46">            x_adv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.clamp(x_adv, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> r, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> r)</span>
<span id="cb10-47"></span>
<span id="cb10-48">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check if the adversarial example has changed the model's prediction</span></span>
<span id="cb10-49">            new_output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(x_adv)</span>
<span id="cb10-50">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> torch.equal(output.argmax(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), new_output.argmax(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)):</span>
<span id="cb10-51">                converged <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb10-52">                x_hat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x_adv.clone().detach()</span>
<span id="cb10-53"></span>
<span id="cb10-54">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Zero the gradients for the next iteration</span></span>
<span id="cb10-55">        model.zero_grad()</span>
<span id="cb10-56">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> x_adv.grad <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>: x_adv.grad.zero_()</span>
<span id="cb10-57"></span>
<span id="cb10-58">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> converged, x_hat <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> converged <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb10-59"></span>
<span id="cb10-60"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> compute_csr(model, test_set, n_examples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>lass_kwargs):</span>
<span id="cb10-61">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> n_examples <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>: n_examples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_set)</span>
<span id="cb10-62">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.to(device)</span>
<span id="cb10-63">    csr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb10-64">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (images, labels) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(DataLoader(test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shuffle)):</span>
<span id="cb10-65">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> n_examples: <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb10-66">        images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb10-67">        converged, _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lass(model, images, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>lass_kwargs)</span>
<span id="cb10-68">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> converged: csr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-69">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> csr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n_examples</span></code></pre></div></div>
</details>
</div>
<p>The paper sets the radius within which we search for adversarial examples to <img src="https://latex.codecogs.com/png.latex?r%20=%2030/255"> because perturbations of that size are small enough to go unnoticed by a human evaluator. Here is an example.</p>
<div id="cell-27" class="cell" data-execution_count="335">
<details class="code-fold">
<summary>Adversarial example</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cifar10_model</span>
<span id="cb11-2">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> model.to(device)</span>
<span id="cb11-3">set_seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb11-4"></span>
<span id="cb11-5">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb11-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (x, y) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(DataLoader(test_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)):</span>
<span id="cb11-7">    x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.to(device)</span>
<span id="cb11-8">    y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y.to(device)</span>
<span id="cb11-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb11-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> model(x).argmax() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> y: <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">continue</span></span>
<span id="cb11-11">    converged, adv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lass(model, x)</span>
<span id="cb11-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> converged:</span>
<span id="cb11-13">        ax1.imshow(x.squeeze().cpu().permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))</span>
<span id="cb11-14">        ax2.imshow(adv.squeeze().cpu().permute(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))</span>
<span id="cb11-15">        ax1.axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'off'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax2.axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'off'</span>)</span>
<span id="cb11-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb11-17">            ax1.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Original. Predicted: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>test_set<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>classes[model(x).argmax().item()]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-18">            ax2.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Adversarial. Predicted: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>test_set<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>classes[model(adv).argmax().item()]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/figure-html/cell-14-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Now we compute the Critical Sample Ratio as we train models, aiming to reproduce Figure 9.</p>
<div id="cell-29" class="cell">
<details class="code-fold">
<summary>ConvNet training loop</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train_fig9(model, train, val, optimizer, criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss(), epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>):</span>
<span id="cb12-2">    val_accs, csrs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], []</span>
<span id="cb12-3">    model.to(device)</span>
<span id="cb12-4">    train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(train, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-5">    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(val, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb12-6">    scheduler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.lr_scheduler.StepLR(optimizer, step_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb12-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(epochs):</span>
<span id="cb12-8">        model.train()</span>
<span id="cb12-9">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> train:</span>
<span id="cb12-10">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb12-11">            optimizer.zero_grad()</span>
<span id="cb12-12">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(model(images), labels)</span>
<span id="cb12-13">            loss.backward()</span>
<span id="cb12-14">            optimizer.step()</span>
<span id="cb12-15">        scheduler.step()</span>
<span id="cb12-16">        val_loss, val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, val)</span>
<span id="cb12-17">        val_accs.append(val_acc)</span>
<span id="cb12-18">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_acc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb12-19"></span>
<span id="cb12-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb12-21">            csr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_csr(model, val.dataset, n_examples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">255</span>)</span>
<span id="cb12-22">            csrs.append(csr)</span>
<span id="cb12-23">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'CSR: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>csr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb12-24">    </span>
<span id="cb12-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> val_accs, csrs</span>
<span id="cb12-26"></span>
<span id="cb12-27"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> dataset <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [train_set, RandX(train_set, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>), RandY(train_set, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>)]:</span>
<span id="cb12-28">    set_seed(seed  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb12-29">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(dataset.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__class__</span>.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb12-30">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ConvNet(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> n_classes)</span>
<span id="cb12-31">    initialize_model(model, DataLoader(dataset, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>))</span>
<span id="cb12-32">    val_accs, csrs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_fig9(model, dataset, test_set, optim.SGD(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, momentum <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>), epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">141</span>)</span>
<span id="cb12-33">    torch.save({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_accs'</span>: val_accs, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'csrs'</span>: csrs}, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'logs/fig9_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dataset<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__class__</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-30" class="cell" data-execution_count="343">
<details class="code-fold">
<summary>Plot results</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">plt.plot(np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">141</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), torch.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/fig9_CIFAR10'</span>)[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'csrs'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CIFAR10'</span>)</span>
<span id="cb13-2">plt.plot(np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">141</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r--'</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'RandX (?)'</span>)</span>
<span id="cb13-3">plt.plot(np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">141</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), torch.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/fig9_RandY'</span>)[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'csrs'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'RandY'</span>)</span>
<span id="cb13-4"></span>
<span id="cb13-5">plt.legend()</span>
<span id="cb13-6">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Epoch'</span>)</span>
<span id="cb13-7">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Critical Sample Ratio (CSR)'</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="343">
<pre><code>Text(0, 0.5, 'Critical Sample Ratio (CSR)')</code></pre>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/figure-html/cell-16-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/nn-memorization/index_files/images/fig9.png" class="img-fluid figure-img" width="300"></p>
<figcaption>Original Fig 9</figcaption>
</figure>
</div>
<p>We observe roughly the same trend as in the paper (shown above): while the network trained on real data maintains a roughly constant CSR, the one trained on random labels has an increasing CSR as training progresses. However, I could not reproduce RandX’s behavior and obtained a constant CSR of 0. I tried different seeds, values of <img src="https://latex.codecogs.com/png.latex?r">, and datasets (training and validation) without luck. My suspicion is that the model’s capacity, and therefore its performance, was too low (around 10% validation accuracy). I decided to stick with the paper’s architecture and move on.</p>
</section>
<section id="what-i-learned-practiced" class="level2">
<h2 class="anchored" data-anchor-id="what-i-learned-practiced">What I learned / practiced</h2>
<ul>
<li>How to visualize 1st layer kernel weights</li>
<li>A bit about adversarial attacks</li>
<li>A creative proxy for model complexity (CSR)</li>
</ul>


</section>

 ]]></description>
  <category>deep learning</category>
  <category>paper</category>
  <guid>https://ecntu.com/posts/nn-memorization/</guid>
  <pubDate>Sat, 07 Sep 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Approximate Nearest Cosine Neighbors</title>
  <link>https://ecntu.com/posts/lsh/</link>
  <description><![CDATA[ 





<p>Suppose you have some vectors and wish to find, for each point, the <img src="https://latex.codecogs.com/png.latex?k"> nearest points. While you could compute pairwise distances using a naïve quadratic algorithm for small datasets, this approach becomes infeasible with millions or billions of points. If the points are in a low-dimensional space, clever data structures like <a href="https://en.wikipedia.org/wiki/K-d_tree">kd-trees</a>, ball trees, and M-trees can achieve substantial speedups. However, in high dimensions, performance degrades, and you may need to sacrifice exactness and turn to Approximate Nearest Neighbors (<a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods">ANN</a>) techniques.</p>
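<p>For concreteness, here is a minimal sketch of the naïve quadratic baseline under cosine similarity in PyTorch. This is my own illustration, not code from the post, and the function name <code>knn_cosine_bruteforce</code> is made up:</p>

```python
# Naive O(n^2 d) time, O(n^2) memory k-nearest-neighbor search
# under cosine similarity.
import torch

def knn_cosine_bruteforce(X, k):
    # Normalize rows so dot products equal cosine similarities
    Xn = X / X.norm(dim=1, keepdim=True)
    sims = Xn @ Xn.T                       # (n, n) pairwise cosine similarities
    sims.fill_diagonal_(float('-inf'))     # exclude each point itself
    return sims.topk(k, dim=1).indices     # (n, k) indices of nearest neighbors

X = torch.randn(1000, 64)
neighbors = knn_cosine_bruteforce(X, k=5)
```

<p>Both time and memory scale quadratically in the number of points, which is exactly the cost LSH is designed to avoid.</p>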
<p>Locality-sensitive hashing (<a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">LSH</a>) is a family of ANN algorithms that aim to group similar points into the same (or nearby) buckets efficiently using specialized hash functions. Recall that traditional hashing tries to map items to a set of buckets uniformly and minimize collisions. Thus, traditionally, slightly changing a point results in a vastly different hash and assigned bucket. LSH uses different hashing functions that often depend on the distance metric employed. Here, we explore a simple approach using cosine distance based on <a href="https://en.wikipedia.org/wiki/Random_projection">Random Projection</a>.</p>
<section id="how-it-works" class="level3">
<h3 class="anchored" data-anchor-id="how-it-works">How it works</h3>
<p>The mechanics are not very complicated. We first generate <img src="https://latex.codecogs.com/png.latex?N_h"> random hyperplanes. Let’s visualize this in two dimensions with <img src="https://latex.codecogs.com/png.latex?n=5"> points and <img src="https://latex.codecogs.com/png.latex?N_h=2">:</p>
<div id="853827e1-6a27-4fff-bbb5-031c666bab0d" class="cell" data-execution_count="48">
<details class="code-fold">
<summary>Random data and visualisation</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Note that we sample from a standard normal distribution</span></span>
<span id="cb1-2">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb1-3">rand_planes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> lims(X, ax, eps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>):</span>
<span id="cb1-6">    m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[:, ax].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>().item()</span>
<span id="cb1-7">    m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> eps</span>
<span id="cb1-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>m, m)</span>
<span id="cb1-9"></span>
<span id="cb1-10">plt.scatter(X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb1-11">plt.xlim(lims(X, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> plt.ylim(lims(X, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb1-12"></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Axes</span></span>
<span id="cb1-14">plt.axline((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, linewidth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb1-15">plt.axline((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, linewidth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb1-16"></span>
<span id="cb1-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Random planes</span></span>
<span id="cb1-18"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> rand_planes:</span>
<span id="cb1-19">    x, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> v.tolist()</span>
<span id="cb1-20">    plt.axline((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), (y, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>x), c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>, linewidth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/lsh/index_files/figure-html/cell-3-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Note that the plane is effectively partitioned into four regions. The main idea is to treat the regions as buckets and assign all the points in a region to the same bucket. When a new point comes along, we simply find out which region it belongs to and then search for points in that region and nearby regions until we accumulate <img src="https://latex.codecogs.com/png.latex?k"> of them.</p>
<p>Great! How do we do it computationally? We mainly need to remember that a hyperplane is characterized by its normal vector <img src="https://latex.codecogs.com/png.latex?%5Cvec%7Bv%7D"> and that we can determine on which side of it a point <img src="https://latex.codecogs.com/png.latex?x"> lies by the sign of their dot product, <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bsign%7D(%5Cvec%7Bv%7D%20%5Ccdot%20%5Cvec%7Bx%7D)">. For example, points 0 and 2 are on opposite sides of hyperplane 0:</p>
<div id="72836dfd-a7c5-4551-83f0-bbb4926a60fb" class="cell" data-execution_count="49">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rand_planes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-2">v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="49">
<pre><code>(tensor(True), tensor(False))</code></pre>
</div>
</div>
<div id="fd0c59c0-cff9-4e4b-9854-6c42d3a719ae" class="cell" data-execution_count="50">
<details class="code-fold">
<summary>Dot product example</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">plt.scatter(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].tolist())</span>
<span id="cb4-2">plt.scatter(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>].tolist())</span>
<span id="cb4-3">plt.xlim(lims(X, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> plt.ylim(lims(X, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb4-4">plt.axline((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, linewidth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb4-5">plt.axline((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, linewidth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb4-6"></span>
<span id="cb4-7">x, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rand_planes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].tolist()</span>
<span id="cb4-8">plt.axline((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), (y, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>x), c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>, linewidth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hyperplane'</span>)</span>
<span id="cb4-9">plt.quiver(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]), x, y, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0.5'</span>, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'v'</span>)</span>
<span id="cb4-10">plt.legend()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/lsh/index_files/figure-html/cell-5-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Thus, to find a point’s region, we repeat this process for each of the hyperplanes (two, in our case). For example:</p>
<div id="7d3f3cf4-e33b-4209-b1ea-2d67f6df20fd" class="cell" data-execution_count="51">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> rand_planes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> rand_planes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="51">
<pre><code>(tensor(True), tensor(False))</code></pre>
</div>
</div>
<div id="2a916e1c-f547-4f30-92ce-fd57d84e6b9b" class="cell" data-execution_count="52">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> rand_planes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> rand_planes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="52">
<pre><code>(tensor(False), tensor(False))</code></pre>
</div>
</div>
<p>I.e., points 0 and 2 are in different regions. Using matrix notation, we can obtain every point’s region succinctly:</p>
<div id="68abe665-4d08-4867-a559-efe9025f822c" class="cell" data-execution_count="53">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">regions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> rand_planes.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb9-2">regions</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="53">
<pre><code>tensor([[ True, False],
        [ True, False],
        [False, False],
        [False,  True],
        [ True, False]])</code></pre>
</div>
</div>
<p>We can now place each point in a region into a bucket:</p>
<div id="d3c9c1a6-e5b2-422e-865d-f5468b87c3b5" class="cell" data-execution_count="54">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">buckets <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb11-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, reg <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(regions):</span>
<span id="cb11-3">    reg <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">tuple</span>(reg.tolist())</span>
<span id="cb11-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> reg <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> buckets: buckets[reg] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb11-5">    buckets[reg].append(i)</span>
<span id="cb11-6">buckets</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="54">
<pre><code>{(True, False): [0, 1, 4], (False, False): [2], (False, True): [3]}</code></pre>
</div>
</div>
<p>And that’s all the preprocessing we need to do. Now, to find the nearest neighbors of a query point <img src="https://latex.codecogs.com/png.latex?%5Cvec%7Bq%7D">, we find its region:</p>
<div id="0d860b8c-3afa-401c-ad3c-f4939975b2b2" class="cell" data-execution_count="55">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,))</span>
<span id="cb13-2">q_reg <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">tuple</span>((q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> rand_planes.T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).tolist())</span>
<span id="cb13-3">q, q_reg, buckets.get(q_reg, [])</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="55">
<pre><code>(tensor([-0.9890,  0.9580]), (True, False), [0, 1, 4])</code></pre>
</div>
</div>
<p>In this case, we had three points in the query’s bucket. When the number of elements in the bucket is less than <img src="https://latex.codecogs.com/png.latex?k">, the common approach is to look to nearby buckets in terms of Hamming distance. I.e., we first add points in buckets one bit flip away, then two flips away, and so on until we accumulate <img src="https://latex.codecogs.com/png.latex?k"> (or slightly more) points.</p>
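<p>As a sketch, the bit-flip expansion can be written in a few lines of plain Python (hypothetical helper names, operating on a <code>buckets</code> dict like the one built above):</p>

```python
from itertools import combinations

def flips(region, d):
    # all regions exactly d bit flips away from `region`
    for idxs in combinations(range(len(region)), d):
        r = list(region)
        for i in idxs:
            r[i] = not r[i]
        yield tuple(r)

def candidates(q_reg, buckets, k):
    # expand outward by Hamming distance until we accumulate at least k candidates
    found = []
    for d in range(len(q_reg) + 1):
        for reg in flips(q_reg, d):
            found += buckets.get(reg, [])
        if len(found) >= k:
            break
    return found
```

<p>With the buckets from earlier, asking for four neighbors of our query first takes points 0, 1, and 4 from its own bucket and then pulls in point 2 from a bucket one flip away.</p>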
</section>
<section id="considerations" class="level3">
<h3 class="anchored" data-anchor-id="considerations">Considerations</h3>
<p>To help root out false positives (points that are not among the top <img src="https://latex.codecogs.com/png.latex?k"> nearest but happened to land in or near the query’s buckets), we can compute the actual cosine distance to each candidate point we retrieved and return only those below a specified threshold.</p>
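<p>A minimal sketch of this refinement step (plain Python with hypothetical names; with tensors you could instead use <code>torch.nn.functional.cosine_similarity</code>):</p>

```python
import math

def cosine_dist(u, v):
    # 1 - cosine similarity between u and v
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norms

def refine(q, points, cand_ids, max_dist):
    # keep only the retrieved candidates within the distance threshold
    return [i for i in cand_ids if cosine_dist(q, points[i]) <= max_dist]
```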
<p>You can imagine that the number of hyperplanes presents an accuracy-speed tradeoff: more hyperplanes imply (exponentially) more buckets, which means fewer false positives but also more computation to calculate regions. You can find details in the references.</p>
<p>If you get unlucky with the random generation of the hyperplanes, you might have a high false negative rate: genuinely close points that end up in far-away buckets. In that case, we can generate several independent sets of hyperplanes, collect each set’s candidates, and take their union as the candidate pool. See the <a href="http://infolab.stanford.edu/~bawa/Pub/similarity.pdf">LSH Forest paper</a>.</p>
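<p>A sketch of the multi-table idea (plain Python, hypothetical names; just independent tables whose candidate sets we union, not the full LSH Forest algorithm):</p>

```python
import random

def region(p, planes):
    # which side of each hyperplane the point falls on
    return tuple(sum(a * b for a, b in zip(p, v)) >= 0 for v in planes)

def build_tables(points, n_planes, n_tables, seed=0):
    # each table gets its own random hyperplanes and resulting buckets
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        buckets = {}
        for i, p in enumerate(points):
            buckets.setdefault(region(p, planes), []).append(i)
        tables.append((planes, buckets))
    return tables

def union_candidates(q, tables):
    # a close point missed by one table is likely caught by another
    out = set()
    for planes, buckets in tables:
        out |= set(buckets.get(region(q, planes), []))
    return out
```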
<p>As a final note, how we generate the hyperplanes matters somewhat. We generally want their directions evenly distributed on the unit sphere, which is why we sample from a normal distribution. We’d also like them to be spread out relative to one another, which is why some methods generate (expensive) random orthogonal matrices. Or you might not care in practice, remembering that in high dimensions, <a href="https://math.stackexchange.com/questions/995623/why-are-randomly-drawn-vectors-nearly-perpendicular-in-high-dimensions">pairs of sampled vectors are nearly orthogonal with high probability</a>.</p>
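<p>A quick numerical check of that last claim (a plain-Python sketch):</p>

```python
import math
import random

def avg_abs_cos(dim, trials=200, seed=0):
    # average |cosine| between independent pairs of Gaussian random vectors
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        total += abs(dot) / norms
    return total / trials
```

<p>In 2 dimensions the average is sizable (around 0.6), while in 1000 dimensions it drops to a few hundredths: random directions are close to perpendicular.</p>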
</section>
<section id="resources" class="level3">
<h3 class="anchored" data-anchor-id="resources">Resources</h3>
<ul>
<li><a href="https://towardsdatascience.com/similarity-search-part-6-random-projections-with-lsh-forest-f2e9b31dcc47">Blog post w/some probability details</a></li>
<li><a href="http://www.mmds.org/">Mining of Massive Datasets</a> (book and course, chapter 3)</li>
<li><a href="https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf">Similarity Estimation Techniques from Rounding Algorithms</a></li>
</ul>


</section>

 ]]></description>
  <category>cs</category>
  <category>quick intro</category>
  <guid>https://ecntu.com/posts/lsh/</guid>
  <pubDate>Fri, 09 Aug 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Understanding Batch Normalization</title>
  <link>https://ecntu.com/posts/understanding-bn/</link>
  <description><![CDATA[ 





<p>The paper investigates the cause of batch norm’s benefits experimentally. The authors show that its main benefit is allowing for larger learning rates during training. In particular:</p>
<blockquote class="blockquote">
<p>“We show that the activations and gradients in deep neural networks without BN tend to be heavy-tailed. In particular, during an early on-set of divergence, a small subset of activations (typically in deep layer) “explode”. The typical practice to avoid such divergence is to set the learning rate to be sufficiently small such that no steep gradient direction can lead to divergence. However, small learning rates yield little progress along flat directions of the optimization landscape and may be more prone to convergence to sharp local minima with possibly worse generalization performance.”</p>
</blockquote>
<p>We attempt to reproduce figures 1-3, 5, and 6.</p>
<section id="convolutional-bn-layer" class="level3">
<h3 class="anchored" data-anchor-id="convolutional-bn-layer">Convolutional BN Layer</h3>
<p>As a reminder, the input <img src="https://latex.codecogs.com/png.latex?I"> and output <img src="https://latex.codecogs.com/png.latex?O"> tensors of a batch norm layer are 4-dimensional. The dimensions <img src="https://latex.codecogs.com/png.latex?(b,%20c,%20x,%20y)"> correspond to the batch example, the channel, and the spatial <img src="https://latex.codecogs.com/png.latex?x"> and <img src="https://latex.codecogs.com/png.latex?y"> positions, respectively. Batch norm (BN) applies a channel-wise normalization:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AO_%7Bb,%20c,%20x,%20y%7D%20%5Cleftarrow%20%5Cgamma_c%20%5Cfrac%7BI_%7Bb,%20c,%20x,%20y%7D%20-%20%5Chat%20%5Cmu_c%7D%7B%5Csqrt%7B%5Chat%20%5Csigma_c%5E2%20+%20%5Cepsilon%7D%7D%20+%20%5Cbeta_c%0A"></p>
<p>Where <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Cmu_c"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Csigma_c%5E2"> are estimates of channel <img src="https://latex.codecogs.com/png.latex?c">’s mean and variance, computed on the minibatch <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%20B">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Cmu_c%20=%20%5Cfrac%7B1%7D%7B%7C%5Cmathcal%20B%7C%7D%5Csum_%7Bb,%20x,%20y%7D%20I_%7Bb,%20c,%20x,%20y%7D%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Csigma_c%5E2%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%20%7CB%7C%7D%20%5Csum_%7Bb,%20x,%20y%7D%20(I_%7Bb,%20c,%20x,%20y%7D%20-%20%5Chat%20%5Cmu_c)%20%5E%202%0A"></p>
<p>To make sure the layer does not lose expressive power, we introduce learned parameters <img src="https://latex.codecogs.com/png.latex?%5Cgamma_c"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_c">. <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> is a small constant added for numerical stability. In PyTorch, we can simply use the <code>BatchNorm2d</code> layer.</p>
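<p>To make the formula concrete, here is a tiny pure-Python sketch of the channel-wise normalization above (with <img src="https://latex.codecogs.com/png.latex?%5Cgamma_c"> = 1 and <img src="https://latex.codecogs.com/png.latex?%5Cbeta_c"> = 0; <code>BatchNorm2d</code> additionally learns those parameters and tracks running statistics for inference):</p>

```python
def batch_norm(I, eps=1e-5):
    # channel-wise normalization of a (B, C, X, Y) nested-list "tensor"
    B, C, X, Y = len(I), len(I[0]), len(I[0][0]), len(I[0][0][0])
    n = B * X * Y
    O = [[[[0.0] * Y for _ in range(X)] for _ in range(C)] for _ in range(B)]
    for c in range(C):
        # mean and (biased) variance over all batch and spatial positions
        vals = [I[b][c][x][y] for b in range(B) for x in range(X) for y in range(Y)]
        mu = sum(vals) / n
        var = sum((v - mu) ** 2 for v in vals) / n
        for b in range(B):
            for x in range(X):
                for y in range(Y):
                    O[b][c][x][y] = (I[b][c][x][y] - mu) / (var + eps) ** 0.5
    return O
```

<p>For a single channel holding the values 1 and 3, this maps them to roughly −1 and 1, as expected.</p>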
</section>
<section id="experimental-setup" class="level3">
<h3 class="anchored" data-anchor-id="experimental-setup">Experimental setup</h3>
<p>Let’s set up our data loaders, model, and training loop as described in Appendix B of the paper.</p>
<div id="cell-6" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Imports and model evaluation function</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nn</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.optim <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> optim</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchvision <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> datasets, transforms, models</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch.utils.data <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DataLoader, Dataset</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> PIL <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Image</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-11"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb1-12"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-13"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os, itertools, time</span>
<span id="cb1-14"></span>
<span id="cb1-15">os.makedirs(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span>, exist_ok <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-16">os.makedirs(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models'</span>, exist_ok <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-17"></span>
<span id="cb1-18">seed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span></span>
<span id="cb1-19">np.random.seed(seed)</span>
<span id="cb1-20">torch.manual_seed(seed)</span>
<span id="cb1-21"></span>
<span id="cb1-22">device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.device(</span>
<span id="cb1-23">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cuda'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.cuda.is_available() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span></span>
<span id="cb1-24">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mps'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.backends.mps.is_available() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span></span>
<span id="cb1-25">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cpu'</span>)</span>
<span id="cb1-26">)</span>
<span id="cb1-27"></span>
<span id="cb1-28"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> eval_model(model, test, criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()):</span>
<span id="cb1-29">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb1-30">    correct, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span></span>
<span id="cb1-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb1-32">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test:</span>
<span id="cb1-33">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb1-34">            _, pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(model(images), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-35">            correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> (pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().item()</span>
<span id="cb1-36">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> criterion(model(images), labels).item()</span>
<span id="cb1-37">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test.dataset), correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test.dataset)</span>
<span id="cb1-38"></span>
<span id="cb1-39">device</span></code></pre></div></div>
</details>
</div>
<p>The paper trains ResNet-110s on CIFAR-10, with channel-wise normalization, random horizontal flipping, and 32-by-32 cropping with 4-pixel zero padding. We’ll train the ResNet-101 included in torchvision but keep everything else the same.</p>
<p>We first get the datasets and compute the channel-wise means and standard deviations. Note: we normalize both the training and validation sets with the same (training-set) statistics.</p>
<div id="cell-8" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Datasets and channel-wise means and stds</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">train_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datasets.CIFAR10(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'./data'</span>, download <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transforms.ToTensor())</span>
<span id="cb2-2">val_set <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datasets.CIFAR10(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'./data'</span>, download <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transforms.ToTensor())</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> channel_means_stds(dataset):</span>
<span id="cb2-5">    imgs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.stack([img <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> img, _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dataset])</span>
<span id="cb2-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> imgs.mean(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]), imgs.std(dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>])</span>
<span id="cb2-7"></span>
<span id="cb2-8">means, stds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> channel_means_stds(train_set)</span>
<span id="cb2-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Training channel-wise</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">means: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>means<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">stds: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>stds<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb2-10"></span>
<span id="cb2-11">val_means, val_stds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> channel_means_stds(val_set)</span>
<span id="cb2-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Validation channel-wise</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">means: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_means<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">stds: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_stds<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div></div>
</details>
</div>
<p>We now define the transforms with data augmentation and data loaders with batch size <img src="https://latex.codecogs.com/png.latex?128">.</p>
<div id="cell-10" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Data transforms and data loaders</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">train_transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transforms.Compose([</span>
<span id="cb3-2">    transforms.RandomCrop(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, padding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>),</span>
<span id="cb3-3">    transforms.RandomHorizontalFlip(),</span>
<span id="cb3-4">    transforms.ToTensor(),</span>
<span id="cb3-5">    transforms.Normalize(means, stds),</span>
<span id="cb3-6">])</span>
<span id="cb3-7"></span>
<span id="cb3-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We do not perform data augmentation on the validation set</span></span>
<span id="cb3-9">val_transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transforms.Compose([</span>
<span id="cb3-10">    transforms.ToTensor(),</span>
<span id="cb3-11">    transforms.Normalize(means, stds),</span>
<span id="cb3-12">])</span>
<span id="cb3-13"></span>
<span id="cb3-14">train_set.transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_transform</span>
<span id="cb3-15">val_set.transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_transform</span>
<span id="cb3-16"></span>
<span id="cb3-17">train_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(train_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb3-18">val_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(val_set, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span></code></pre></div></div>
</details>
</div>
<p>We’ll use <code>torchvision</code>’s implementation of ResNet-101, Xavier initialization, SGD with momentum <img src="https://latex.codecogs.com/png.latex?0.9"> and weight decay <img src="https://latex.codecogs.com/png.latex?5%5Ctimes%2010%5E%7B-4%7D">, and cross-entropy loss. We follow the paper’s training details and learning-rate schedule as closely as we can:</p>
<blockquote class="blockquote">
<p>“Initially, all models are trained for 165 epochs and as in [17] we divide the learning rate by 10 after epoch 50% and 75%, at which point learning has typically plateaued. If learning doesn’t plateau for some number of epochs, we roughly double the number of epochs until it does.”</p>
</blockquote>
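<p>50% and 75% of 165 epochs are epochs 82 and 123, which is exactly what the <code>MultiStepLR</code> milestones in the code below encode. As a quick sanity check, the resulting step schedule can be reproduced in a few lines of plain Python; the <code>lr_at</code> helper is ours, for illustration only:</p>

```python
# Milestones from the paper's "50% and 75%" of 165 epochs
init_lr, init_epochs = 0.1, 165
milestones = [int(0.5 * init_epochs), int(0.75 * init_epochs)]
assert milestones == [82, 123]

def lr_at(epoch, init_lr=init_lr, milestones=milestones, gamma=0.1):
    """Learning rate after `epoch` completed epochs under a step schedule:
    multiply by gamma once for every milestone already passed."""
    return init_lr * gamma ** sum(epoch >= m for m in milestones)

assert lr_at(0) == 0.1
assert lr_at(82) == 0.1 * 0.1           # first drop
assert abs(lr_at(123) - 0.001) < 1e-12  # second drop
```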
<div id="cell-12" class="cell" data-execution_count="9">
<details class="code-fold">
<summary>Init, and train functions</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> xavier_init(m):</span>
<span id="cb4-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, nn.Conv2d) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">or</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, nn.Linear):</span>
<span id="cb4-3">        nn.init.xavier_uniform_(m.weight)</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train_epoch(model, train, optimizer, criterion):</span>
<span id="cb4-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Trains the model for one epoch</span></span>
<span id="cb4-7">    model.train()</span>
<span id="cb4-8">    train_loss, correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> train:</span>
<span id="cb4-10">        images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb4-11">        optimizer.zero_grad()</span>
<span id="cb4-12">        output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb4-13">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(output, labels)</span>
<span id="cb4-14">        loss.backward()</span>
<span id="cb4-15">        optimizer.step()</span>
<span id="cb4-16">        train_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> loss.item()</span>
<span id="cb4-17">        _, pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(output, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb4-18">        correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> (pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().item()</span>
<span id="cb4-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> train_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(train.dataset), correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(train.dataset)</span>
<span id="cb4-20"></span>
<span id="cb4-21"></span>
<span id="cb4-22"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train(model, train, val, init_lr, plateau_patience <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>):</span>
<span id="cb4-23"></span>
<span id="cb4-24">    optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.SGD(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_lr, momentum <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, weight_decay <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5e-4</span>)</span>
<span id="cb4-25">    scheduler  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.lr_scheduler.MultiStepLR(optimizer, milestones <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">82</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>], gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb4-26">    criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()</span>
<span id="cb4-27"></span>
<span id="cb4-28">    model.to(device)</span>
<span id="cb4-29"></span>
<span id="cb4-30">    init_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">165</span></span>
<span id="cb4-31"></span>
<span id="cb4-32">    epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-33">    plateau_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-34">    best_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb4-35"></span>
<span id="cb4-36">    train_losses, train_accs, val_losses, val_accs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], [], [], []</span>
<span id="cb4-37"></span>
<span id="cb4-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> init_epochs <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> plateau_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> plateau_patience:</span>
<span id="cb4-39"></span>
<span id="cb4-40">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train the model for an epoch</span></span>
<span id="cb4-41">        loss, acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_epoch(model, train, optimizer, criterion)</span>
<span id="cb4-42">        train_losses.append(loss)</span>
<span id="cb4-43">        train_accs.append(acc)</span>
<span id="cb4-44"></span>
<span id="cb4-45">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Evaluate the model on the validation set</span></span>
<span id="cb4-46">        val_loss, val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, val, criterion)</span>
<span id="cb4-47">        val_losses.append(val_loss)</span>
<span id="cb4-48">        val_accs.append(val_acc)</span>
<span id="cb4-49"></span>
<span id="cb4-50">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Update the learning rate</span></span>
<span id="cb4-51">        scheduler.step()</span>
<span id="cb4-52"></span>
<span id="cb4-53">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check for a plateau</span></span>
<span id="cb4-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> best_loss <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">or</span> val_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> best_loss:</span>
<span id="cb4-55">            best_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_loss</span>
<span id="cb4-56">            plateau_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-57">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-58">            plateau_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-59">        </span>
<span id="cb4-60">        epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-61"></span>
<span id="cb4-62">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># "If learning doesn’t plateau for some number of epochs,</span></span>
<span id="cb4-63">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># we roughly double the number of epochs until it does."</span></span>
<span id="cb4-64">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> init_epochs <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> plateau_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> plateau_patience:</span>
<span id="cb4-65">            init_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb4-66"></span>
<span id="cb4-67">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>init_epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> | Learning Rate: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>optimizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>param_groups[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lr"</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> | '</span></span>
<span id="cb4-68">              <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Training loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>train_losses[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> | '</span></span>
<span id="cb4-69">              <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Validation loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_losses[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> | '</span></span>
<span id="cb4-70">              <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Validation accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>val_accs[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb4-71">        </span>
<span id="cb4-72">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> train_losses, train_accs, val_losses, val_accs</span></code></pre></div></div>
</details>
</div>
<p>And we define a function to disable batch norm layers in a model by replacing them with identity layers:</p>
<div id="cell-14" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> disable_bn(model):</span>
<span id="cb5-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, module <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.named_children():</span>
<span id="cb5-3">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(module, nn.BatchNorm2d):</span>
<span id="cb5-4">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">setattr</span>(model, name, nn.Identity())</span>
<span id="cb5-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-6">            disable_bn(module)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Recursively replace in child modules</span></span></code></pre></div></div>
</div>
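<p>Because <code>named_children</code> only yields direct children, the recursive call is what reaches the BatchNorm layers buried inside ResNet’s bottleneck blocks. A quick self-contained sanity check on a toy network (not the post’s ResNet):</p>

```python
import torch
from torch import nn

def disable_bn(model):
    # Same helper as above: swap BatchNorm2d layers for no-op Identity modules
    for name, module in model.named_children():
        if isinstance(module, nn.BatchNorm2d):
            setattr(model, name, nn.Identity())
        else:
            disable_bn(module)  # Recursively replace in child modules

net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.BatchNorm2d(8)),  # nested
)
disable_bn(net)

# No BatchNorm2d left anywhere, including nested submodules
assert not any(isinstance(m, nn.BatchNorm2d) for m in net.modules())
# The model still runs on a dummy batch
out = net(torch.randn(2, 3, 32, 32))
assert out.shape == (2, 8, 32, 32)
```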
</section>
<section id="fig-1" class="level3">
<h3 class="anchored" data-anchor-id="fig-1">Fig 1</h3>
<p>Figure 1 aims to demonstrate that batch norm’s primary benefit is enabling training with larger learning rates.</p>
<p>The authors find the highest (<em>initial</em>) learning rate at which an unnormalized model can be trained (<img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200.0001">) and compare its performance against normalized models trained with <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20%5C%7B0.0001,%200.003,%200.1%5C%7D">. To save on compute, we train each model once instead of five times, and present train and test accuracy curves.</p>
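<p>The paper doesn’t spell out how that threshold is found; a simple approach is to probe candidate rates from largest to smallest and keep the first one whose short training run doesn’t diverge. A hypothetical sketch, where <code>largest_stable_lr</code> and the <code>diverges</code> probe are ours, for illustration only:</p>

```python
def largest_stable_lr(candidates, diverges):
    """Return the largest candidate learning rate whose probe run does not
    diverge, or None if all of them do. `diverges` stands in for a real
    short training run that checks whether the loss blows up."""
    for lr in sorted(candidates, reverse=True):
        if not diverges(lr):
            return lr
    return None

# Toy stand-in: pretend training diverges above 0.0001 without batch norm
assert largest_stable_lr([0.1, 0.003, 0.0001], lambda lr: lr > 0.0001) == 0.0001
```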
<div id="cell-16" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Train models</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">MODELS_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models'</span></span>
<span id="cb6-2">LOGS_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span></span>
<span id="cb6-3"></span>
<span id="cb6-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> lr, bn <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.003</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)]:</span>
<span id="cb6-5"></span>
<span id="cb6-6">    s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'lr=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_bn'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> bn <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb6-7">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(s)</span>
<span id="cb6-8"></span>
<span id="cb6-9">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb6-10">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(xavier_init)</span>
<span id="cb6-11"></span>
<span id="cb6-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn: disable_bn(model)</span>
<span id="cb6-13"></span>
<span id="cb6-14">    torch.save(model, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>MODELS_DIR<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_init.pth'</span>)</span>
<span id="cb6-15">    data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(model, train_loader, val_loader, init_lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr)</span>
<span id="cb6-16">    torch.save(model, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>MODELS_DIR<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_end.pth'</span>)</span>
<span id="cb6-17">    torch.save(data, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>LOGS_DIR<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.pth'</span>)</span></code></pre></div></div>
</details>
</div>
<div id="cell-17" class="cell" data-execution_count="108">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># code-summary: Plot results</span></span>
<span id="cb7-2"></span>
<span id="cb7-3">get_x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x)</span>
<span id="cb7-4"></span>
<span id="cb7-5">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb7-6"></span>
<span id="cb7-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> fname <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> os.listdir(LOGS_DIR):</span>
<span id="cb7-8"></span>
<span id="cb7-9">    train_losses, train_accs, val_losses, val_accs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.load(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>LOGS_DIR<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>fname<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb7-10">    ax1.plot(get_x(train_accs), train_accs, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fname[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>])</span>
<span id="cb7-11">    ax2.plot(get_x(val_accs), val_accs, label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fname[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>])</span>
<span id="cb7-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>fname[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> took </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(train_accs)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> epochs'</span>)</span>
<span id="cb7-13"></span>
<span id="cb7-14"></span>
<span id="cb7-15">ax1.legend()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax2.legend()</span>
<span id="cb7-16">ax1.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Training accuracy'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax2.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Validation accuracy'</span>)</span>
<span id="cb7-17">ax1.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">% o</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">f training'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax2.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">% o</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">f training'</span>)</span>
<span id="cb7-18">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>lr=0.1_bn took 83 epochs
lr=0.0001 took 263 epochs
lr=0.003_bn took 69 epochs
lr=0.0001_bn took 211 epochs</code></pre>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-8-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>We observe the same general trends as the original paper: similar learning rates yield roughly the same performance (red and orange), while raising the rate improves performance for normalized networks (blue) and makes training diverge for non-normalized ones (not shown).</p>
</section>
<section id="fig-2" class="level3">
<h3 class="anchored" data-anchor-id="fig-2">Fig 2</h3>
<p>In Figure 2 the authors begin to investigate <em>“why BN facilitates training with higher learning rates in the first place”</em>. They claim that batch normalization (BN) prevents divergence during training, which usually stems from large gradients in the first few mini-batches.</p>
<p>To test this, they analyze the gradients at initialization of a midpoint layer (layer 55) with and without batch norm, and find that gradients in unnormalized networks are consistently larger, with heavier-tailed distributions.</p>
<p>I had trouble replicating this figure; I could not reproduce the general shape and scale of their histograms:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/images/fig2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>At first, I thought it was because I was</p>
<ul>
<li>looking at the wrong layer, off by ±1 (I then found it made little difference)</li>
<li>logging the gradient magnitudes incorrectly, since the plot has negative values (I then found that the authors plot the raw gradients)</li>
<li>misunderstanding the whole process</li>
</ul>
<p>As I understood it, we initialize the model (using Xavier’s initialization), do a forward and backward pass on a single batch, and log the gradients at roughly the middle layer:</p>
<div id="cell-20" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb9-2">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">iter</span>(val_loader))</span>
<span id="cb9-3">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb9-4"></span>
<span id="cb9-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> bn, ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>([<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>], [ax1, ax2]):</span>
<span id="cb9-6"></span>
<span id="cb9-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Init</span></span>
<span id="cb9-8">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb9-9">    model.to(device)</span>
<span id="cb9-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn: disable_bn(model)</span>
<span id="cb9-11">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(xavier_init)</span>
<span id="cb9-12"></span>
<span id="cb9-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Forward and backward pass</span></span>
<span id="cb9-14">    model.train()</span>
<span id="cb9-15">    output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb9-16">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()(output, labels)</span>
<span id="cb9-17">    loss.backward()</span>
<span id="cb9-18"></span>
<span id="cb9-19">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb9-20">    grads <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.layer3[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>].conv1.weight.grad.view(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).cpu().detach().numpy()</span>
<span id="cb9-21">    sns.histplot(grads, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax)</span>
<span id="cb9-22"></span>
<span id="cb9-23">ax1.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'With Batch Normalization'</span>)</span>
<span id="cb9-24">ax2.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Without Batch Normalization'</span>)</span>
<span id="cb9-25">ax2.set_ylim((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2500</span>))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax2.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb9-26">f.tight_layout()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Although the unnormalized gradients are heavy-tailed, in my replication they are still much smaller than the normalized ones, not larger as in the paper. I was stuck on this for a few days, until I experimented with different initializations:</p>
<div id="cell-22" class="cell" data-execution_count="24">
<details class="code-fold">
<summary>Trying other inits</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> init_func(f, also_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb10-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> init(m):</span>
<span id="cb10-3">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, nn.Conv2d):</span>
<span id="cb10-4">            f(m.weight)</span>
<span id="cb10-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> also_linear <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, nn.Linear):</span>
<span id="cb10-6">            f(m.weight)</span>
<span id="cb10-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> init</span>
<span id="cb10-8"></span>
<span id="cb10-9">criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()</span>
<span id="cb10-10">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">iter</span>(val_loader))</span>
<span id="cb10-11">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb10-12"></span>
<span id="cb10-13">with_bn, without_bn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}, {}</span>
<span id="cb10-14"></span>
<span id="cb10-15"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> init_f <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [</span>
<span id="cb10-16">    nn.init.xavier_uniform_, nn.init.kaiming_uniform_,</span>
<span id="cb10-17">    nn.init.kaiming_normal_, <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: nn.init.kaiming_normal_(x, mode <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'fan_out'</span>, nonlinearity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'relu'</span>)]:</span>
<span id="cb10-18"></span>
<span id="cb10-19">    init_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_f.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb10-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lambda'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> init_name: init_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'(default) kaiming_normal fan_out'</span></span>
<span id="cb10-21">    </span>
<span id="cb10-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> linear <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb10-23"></span>
<span id="cb10-24">        lin_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">' w/linear lyrs'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> linear <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span></span>
<span id="cb10-25">        </span>
<span id="cb10-26">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> bn, d <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>((<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>), (with_bn, without_bn)):</span>
<span id="cb10-27">                </span>
<span id="cb10-28">                model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb10-29">                model.to(device)</span>
<span id="cb10-30">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn: disable_bn(model)</span>
<span id="cb10-31">                model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(init_func(init_f, linear))</span>
<span id="cb10-32"></span>
<span id="cb10-33">                model.train()</span>
<span id="cb10-34">                output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb10-35">                loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(output, labels)</span>
<span id="cb10-36">                loss.backward()</span>
<span id="cb10-37"></span>
<span id="cb10-38">                model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb10-39">                d[init_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> lin_name] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.layer3[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>].conv1.weight.grad.view(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).cpu().detach().numpy()</span></code></pre></div></div>
</details>
</div>
<div id="cell-23" class="cell" data-execution_count="25">
<details class="code-fold">
<summary>Plotting the results</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> init_name, v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> with_bn.items():</span>
<span id="cb11-2">    f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb11-3">    sns.histplot(v, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax1)</span>
<span id="cb11-4">    ax1.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'with BN'</span>)</span>
<span id="cb11-5">    sns.histplot(without_bn[init_name], ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax2)</span>
<span id="cb11-6">    ax2.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'without BN'</span>)</span>
<span id="cb11-7">    ax2.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb11-8">    ax2.set_ylim(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>)</span>
<span id="cb11-9">    f.suptitle(init_name)</span>
<span id="cb11-10">    f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-3.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-4.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-5.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-6.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-7.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-11-output-8.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>As you can see, the general shapes (roughly normal vs.&nbsp;heavy-tailed) don’t depend much on the initialization scheme, but the scales do. We could only match the gradient scales reported in the paper by using the <code>kaiming_normal</code> scheme with <code>mode='fan_out'</code> (which preserves the variance of the gradients in the backward pass rather than that of the activations in the forward pass) and applying it only to <code>Conv2d</code> layers. This is the default used by torchvision’s ResNets.</p>
<p>Note: <code>xavier_normal</code> produced shapes and scales very similar to <code>xavier_uniform</code>, so we don’t show it.</p>
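<p>To see concretely why the two Kaiming modes differ in scale, here is a small standalone check (my own sketch, not code from the paper or the experiments above): the target standard deviation is <code>sqrt(2 / fan)</code>, and for a conv layer with more input than output channels, <code>fan_out</code> is smaller than <code>fan_in</code>, so the initialized weights come out larger.</p>

```python
import math
import torch
import torch.nn as nn

# Hypothetical layer for illustration: a 3x3 conv going from 256 to 64
# channels, as might appear late in a ResNet stage.
conv = nn.Conv2d(256, 64, kernel_size=3, bias=False)
fan_in = 256 * 3 * 3   # in_channels * kernel area
fan_out = 64 * 3 * 3   # out_channels * kernel area

nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
std_in = conv.weight.std().item()

nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
std_out = conv.weight.std().item()

# Kaiming targets std = sqrt(2 / fan): fan_in preserves activation
# variance in the forward pass, fan_out preserves gradient variance in
# the backward pass.
print(std_in, math.sqrt(2 / fan_in))    # ~0.029
print(std_out, math.sqrt(2 / fan_out))  # ~0.059
```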
<p>For the rest of the figures we’ll use the default init scheme:</p>
<div id="cell-25" class="cell" data-execution_count="26">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> init_net(m):</span>
<span id="cb12-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(m, nn.Conv2d):</span>
<span id="cb12-3">        nn.init.kaiming_normal_(m.weight, mode <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'fan_out'</span>, nonlinearity <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'relu'</span>)</span></code></pre></div></div>
</div>
</section>
<section id="fig-3" class="level3">
<h3 class="anchored" data-anchor-id="fig-3">Fig 3</h3>
<p>The authors then investigate the loss landscape along the gradient direction for the first few mini-batches for models with BN (trained with <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200.1">) and without BN (<img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200.0001">). For each network and mini-batch they compute the gradient and plot the relative change in the loss (new_loss/old_loss).</p>
<p>We save the model’s and optimizer’s states (<code>state_dict</code>) before taking tentative steps to explore the landscape, and restore them before taking the actual optimization step for each batch.</p>
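<p>The save/restore pattern itself can be sketched on a toy setup (a simplified stand-in, not the <code>fig3</code> code below): compute the gradient once, snapshot both <code>state_dict</code>s with <code>copy.deepcopy</code> (a plain <code>state_dict()</code> holds references to the live tensors), try several step sizes along that gradient, and restore after each tentative step.</p>

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

# Toy model and mini-batch standing in for the real network/data.
torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

# Compute gradients once for this mini-batch.
optimizer.zero_grad()
old_loss = criterion(model(x), y)
old_loss.backward()

# Snapshot: deepcopy so the tentative updates don't mutate the copies.
model_state = copy.deepcopy(model.state_dict())
opt_state = copy.deepcopy(optimizer.state_dict())

relative_losses = {}
for lr in (1e-3, 1e-2, 1e-1):
    for g in optimizer.param_groups:
        g['lr'] = lr
    optimizer.step()  # tentative step along the stored gradient
    with torch.no_grad():
        relative_losses[lr] = (criterion(model(x), y) / old_loss).item()
    # Restore before trying the next step size (gradients on the
    # parameters are untouched by load_state_dict, so they are reused).
    model.load_state_dict(model_state)
    optimizer.load_state_dict(opt_state)
```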
<div id="cell-28" class="cell" data-execution_count="57">
<details class="code-fold">
<summary>Explore loss landscape</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fig3(model, init_lr, log_lrs, log_batches <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>]):</span>
<span id="cb13-2"></span>
<span id="cb13-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># batch -&gt; list of relative losses for each lr</span></span>
<span id="cb13-4">    out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb13-5"></span>
<span id="cb13-6">    criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()</span>
<span id="cb13-7">    optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.SGD(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_lr, momentum <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, weight_decay <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5e-4</span>)</span>
<span id="cb13-8"></span>
<span id="cb13-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (images, labels) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(train_loader):</span>
<span id="cb13-10"></span>
<span id="cb13-11">        images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb13-12"></span>
<span id="cb13-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> log_batches:</span>
<span id="cb13-14">            </span>
<span id="cb13-15">            torch.save(model.state_dict(), <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'models/model_state.tmp'</span>)</span>
<span id="cb13-16">            torch.save(optimizer.state_dict(), <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'models/optimizer_state.tmp'</span>)</span>
<span id="cb13-17"></span>
<span id="cb13-18">            rel_losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb13-19">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> lr <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> log_lrs:</span>
<span id="cb13-20"></span>
<span id="cb13-21">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> param_group <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> optimizer.param_groups: param_group[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lr'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr</span>
<span id="cb13-22"></span>
<span id="cb13-23">                optimizer.zero_grad()</span>
<span id="cb13-24">                output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb13-25">                current_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(output, labels)</span>
<span id="cb13-26">                current_loss.backward()</span>
<span id="cb13-27">                optimizer.step()</span>
<span id="cb13-28"></span>
<span id="cb13-29">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb13-30">                    output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb13-31">                    tmp_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(output, labels)</span>
<span id="cb13-32">                    rel_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (tmp_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> current_loss).item()</span>
<span id="cb13-33">                </span>
<span id="cb13-34">                    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># print learning rate, current loss, tmp loss, relative loss (at 4 decimal places)</span></span>
<span id="cb13-35">                    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.5f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>current_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.5f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>tmp_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.5f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>rel_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.5f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb13-36">                    rel_losses.append(rel_loss)</span>
<span id="cb13-37">            </span>
<span id="cb13-38">                model.load_state_dict(torch.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models/model_state.tmp'</span>))</span>
<span id="cb13-39">                optimizer.load_state_dict(torch.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models/optimizer_state.tmp'</span>))   </span>
<span id="cb13-40"></span>
<span id="cb13-41">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If loss is nan or inf, break; unlikely to recover.</span></span>
<span id="cb13-42">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.isnan(tmp_loss).item() <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">or</span> torch.isinf(tmp_loss).item():</span>
<span id="cb13-43">                    rel_losses.pop()</span>
<span id="cb13-44">                    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'breaking'</span>)</span>
<span id="cb13-45">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb13-46"></span>
<span id="cb13-47">            out[i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rel_losses</span>
<span id="cb13-48">        </span>
<span id="cb13-49">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(log_batches): <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb13-50"></span>
<span id="cb13-51">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take the actual step</span></span>
<span id="cb13-52">        optimizer.zero_grad()</span>
<span id="cb13-53">        output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb13-54">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(output, labels)</span>
<span id="cb13-55">        loss.backward()</span>
<span id="cb13-56">        optimizer.step()</span>
<span id="cb13-57">    </span>
<span id="cb13-58">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> out</span>
<span id="cb13-59"></span>
<span id="cb13-60"></span>
<span id="cb13-61">lrs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.logspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80</span>)</span>
<span id="cb13-62"></span>
<span id="cb13-63"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># With BN</span></span>
<span id="cb13-64">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb13-65">model.to(device)</span>
<span id="cb13-66">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(init_net)</span>
<span id="cb13-67">with_bn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fig3(model, init_lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, log_lrs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lrs)</span>
<span id="cb13-68"></span>
<span id="cb13-69"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Without BN</span></span>
<span id="cb13-70">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb13-71">model.to(device)</span>
<span id="cb13-72">disable_bn(model)</span>
<span id="cb13-73">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(init_net)</span>
<span id="cb13-74">without_bn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fig3(model, init_lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001</span>, log_lrs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lrs)</span></code></pre></div></div>
</details>
</div>
<div id="cell-29" class="cell" data-execution_count="65">
<details class="code-fold">
<summary>Plotting the results</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">f, axs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb14-2"></span>
<span id="cb14-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (batch, rel_losses), ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(with_bn.items(), axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]):</span>
<span id="cb14-4">    ax.plot(lrs, rel_losses)</span>
<span id="cb14-5">    ax.set_xscale(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>)</span>
<span id="cb14-6">    ax.set_yscale(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>)</span>
<span id="cb14-7">    ax.set_ylim(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span>)</span>
<span id="cb14-8">    ax.set_xlim(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>)</span>
<span id="cb14-9">    ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Batch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>batch<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> with BN'</span>)</span>
<span id="cb14-10">    ax.axhline(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'grey'</span>, linestyle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'--'</span>)</span>
<span id="cb14-11"></span>
<span id="cb14-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (batch, rel_losses), ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(without_bn.items(), axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb14-13">    ax.plot(lrs[:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(rel_losses)], rel_losses)</span>
<span id="cb14-14">    ax.set_xscale(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>)</span>
<span id="cb14-15">    ax.set_yscale(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>)</span>
<span id="cb14-16">    ax.set_ylim(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb14-17">    ax.set_xlim(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>)</span>
<span id="cb14-18">    ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Batch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>batch<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> w/o BN'</span>)</span>
<span id="cb14-19">    ax.axhline(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, color <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'grey'</span>, linestyle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'--'</span>)</span>
<span id="cb14-20"></span>
<span id="cb14-21">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Relative loss'</span>)</span>
<span id="cb14-22">axs[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Relative loss'</span>)</span>
<span id="cb14-23">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-14-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Although our scales differ somewhat from the paper’s, we observe the same qualitative behavior: unnormalized networks reduce the loss only for small steps, while normalized ones improve over a much wider range of learning rates, as in the paper.</p>
</section>
<section id="fig-5" class="level3">
<h3 class="anchored" data-anchor-id="fig-5">Fig 5</h3>
<p>Figures 5 and 6 explore the behavior of networks at initialization. Figure 5 displays the channel means and variances of the network as a function of depth. We initialize <img src="https://latex.codecogs.com/png.latex?10"> networks and use forward hooks to log each channel’s mean and standard deviation.</p>
<div id="cell-33" class="cell" data-execution_count="45">
<details class="code-fold">
<summary>Log activation stats</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb15-2"></span>
<span id="cb15-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> log_activation_stats(layer_name, key):</span>
<span id="cb15-4">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> hook(module, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span>, output):</span>
<span id="cb15-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb15-6">            df.append({</span>
<span id="cb15-7">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bn'</span>: key,</span>
<span id="cb15-8">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'layer'</span>: layer_name,</span>
<span id="cb15-9">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mean'</span>: output[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, :, :].mean().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>().item(),</span>
<span id="cb15-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'std'</span>: output[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, :, :].std().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>().item()</span>
<span id="cb15-11">            })</span>
<span id="cb15-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> hook</span>
<span id="cb15-13"></span>
<span id="cb15-14"></span>
<span id="cb15-15"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>):</span>
<span id="cb15-16"></span>
<span id="cb15-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> bn <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>]:</span>
<span id="cb15-18"></span>
<span id="cb15-19">        model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb15-20">        model.to(device)</span>
<span id="cb15-21">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn: disable_bn(model)</span>
<span id="cb15-22">        model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(init_net)</span>
<span id="cb15-23"></span>
<span id="cb15-24">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Layers to log activations from</span></span>
<span id="cb15-25">        log_layers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [] <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (n, name, layer)</span></span>
<span id="cb15-26">        n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb15-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, layer <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.named_modules():</span>
<span id="cb15-28">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'conv'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> name:</span>
<span id="cb15-29">                n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb15-30">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> n <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">101</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)):</span>
<span id="cb15-31">                    log_layers.append((n, name, layer))</span>
<span id="cb15-32"></span>
<span id="cb15-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n, _, layer <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> log_layers: layer.register_forward_hook(log_activation_stats(n, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(bn)))</span>
<span id="cb15-34">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> val_loader: model(images.to(device))</span>
<span id="cb15-35"></span>
<span id="cb15-36">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(df)</span></code></pre></div></div>
</details>
</div>
<div id="cell-34" class="cell" data-execution_count="46">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb16-2"></span>
<span id="cb16-3">sns.lineplot(data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'layer'</span>, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mean'</span>, hue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bn'</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax1)</span>
<span id="cb16-4">ax1.set_yscale(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>)</span>
<span id="cb16-5"></span>
<span id="cb16-6">sns.lineplot(data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df, x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'layer'</span>, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'std'</span>, hue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bn'</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax2)</span>
<span id="cb16-7">ax2.set_yscale(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'log'</span>)</span>
<span id="cb16-8"></span>
<span id="cb16-9">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-16-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>We observe, consistent with the findings in the paper, that activation means and standard deviations increase almost exponentially in non-normalized networks, whereas they remain nearly constant in normalized networks.</p>
</section>
<section id="fig-6" class="level3">
<h3 class="anchored" data-anchor-id="fig-6">Fig 6</h3>
<p>The large activations in the final layers of unnormalized networks in the previous figure suggest that the network is biased towards a particular class. The authors investigate whether this is the case by looking at the gradients of the final (output) layer across the images in a mini-batch and the classes.</p>
<p>Note: Don’t confuse this with the last fully connected layer of the network. We are looking at the gradients of the output logits themselves. We need to use <code>retain_grad</code> on the output (non-leaf node) to calculate its gradient on the backward pass.</p>
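As a minimal standalone sketch (separate from the notebook's model) of why `retain_grad` is needed, consider a tiny computation where the intermediate tensor is a non-leaf node:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf: .grad populated by default
out = x * 3                                       # non-leaf: .grad stays None unless retained
out.retain_grad()                                 # ask autograd to keep out's gradient
loss = out.sum()
loss.backward()

print(out.grad)  # tensor([1., 1.]) -- d(loss)/d(out)
print(x.grad)    # tensor([3., 3.]) -- d(loss)/d(x)
```

Without the `retain_grad()` call, `out.grad` would be `None` after the backward pass, since PyTorch frees intermediate gradients by default.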
<div id="cell-37" class="cell" data-execution_count="69">
<details class="code-fold">
<summary>Generate heatmaps</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, sharex <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb17-2"></span>
<span id="cb17-3">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">iter</span>(val_loader))</span>
<span id="cb17-4">images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb17-5"></span>
<span id="cb17-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> bn, ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>([<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>], [ax1, ax2]):</span>
<span id="cb17-7"></span>
<span id="cb17-8">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb17-9">    model.to(device)</span>
<span id="cb17-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn: disable_bn(model)</span>
<span id="cb17-11">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(init_net)</span>
<span id="cb17-12"></span>
<span id="cb17-13">    out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb17-14">    out.retain_grad()</span>
<span id="cb17-15">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()(out, labels)</span>
<span id="cb17-16">    loss.backward()</span>
<span id="cb17-17"></span>
<span id="cb17-18">    ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sns.heatmap(out.grad.cpu().detach().numpy(), cmap <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'viridis'</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax)</span>
<span id="cb17-19">    ax.set_xticks([])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax.set_yticks([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span>])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax.set_yticklabels([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span>])</span>
<span id="cb17-20">    ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Classes'</span>)</span>
<span id="cb17-21">    ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Images in batch'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb17-22">    ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'With BN'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> bn <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Without BN'</span>)</span>
<span id="cb17-23"></span>
<span id="cb17-24">f.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-17-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>We observe essentially the same results as the paper:</p>
<blockquote class="blockquote">
<p>“A yellow entry indicates that the gradient is positive, and the step along the negative gradient would decrease the prediction strength of this class for this particular image. A dark blue entry indicates a negative gradient, indicating that this particular class prediction should be strengthened. Each row contains one dark blue entry, which corresponds to the true class of this particular image (as initially all predictions are arbitrary). A striking observation is the distinctly yellow column in the left heatmap (network without BN). This indicates that after initialization the network tends to almost always predict the same (typically wrong) class, which is then corrected with a strong gradient update. In contrast, the network with BN does not exhibit the same behavior, instead positive gradients are distributed throughout all classes.”</p>
</blockquote>
<p>Running the above code multiple times, however, sometimes yields two or three yellow columns. We suspect this is due to variation between mini-batches or to initialization randomness. Below, we log and average the gradients over a whole epoch and find much more consistent behavior.</p>
<div id="cell-40" class="cell" data-execution_count="370">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fig6(init_func <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_net):</span>
<span id="cb18-2"></span>
<span id="cb18-3">    f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), sharex <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb18-4"></span>
<span id="cb18-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> bn, ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>([<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>], [ax1, ax2]):</span>
<span id="cb18-6"></span>
<span id="cb18-7">        model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> models.resnet101(num_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb18-8">        model.to(device)</span>
<span id="cb18-9">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> bn: disable_bn(model)</span>
<span id="cb18-10">        model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(init_func)</span>
<span id="cb18-11"></span>
<span id="cb18-12">        avg_grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.zeros((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> device)</span>
<span id="cb18-13"></span>
<span id="cb18-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> val_loader:</span>
<span id="cb18-15">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb18-16">            out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(images)</span>
<span id="cb18-17">            out.retain_grad()</span>
<span id="cb18-18">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()(out, labels)</span>
<span id="cb18-19">            loss.backward()</span>
<span id="cb18-20">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> out.grad.shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> avg_grad.shape: avg_grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> out.grad</span>
<span id="cb18-21">        </span>
<span id="cb18-22">        avg_grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(val_loader)</span>
<span id="cb18-23"></span>
<span id="cb18-24">        ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sns.heatmap(avg_grad.cpu().detach().numpy(), cmap <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'viridis'</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax)</span>
<span id="cb18-25">        ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Classes'</span>)</span>
<span id="cb18-26">        ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Images in batch'</span>)</span>
<span id="cb18-27">        ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'With BN'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> bn <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Without BN'</span>)</span>
<span id="cb18-28"></span>
<span id="cb18-29">    f.tight_layout()</span>
<span id="cb18-30"></span>
<span id="cb18-31">fig6()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/understanding-bn/index_files/figure-html/cell-18-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>And that’s about it.</p>
</section>
<section id="what-i-learned-practiced" class="level3">
<h3 class="anchored" data-anchor-id="what-i-learned-practiced">What I learned / practiced</h3>
<p>I gained a better understanding of, and intuition for, why Batch Normalization (BN) works. More importantly, I got comfortable with PyTorch and with debugging training runs.</p>
<p>PyTorch-specific:</p>
<ul>
<li>Basics of image augmentation: basically use <a href="https://pytorch.org/vision/stable/transforms.html">transforms</a> and compose them.</li>
<li>Learning rate schedulers: they exist, are really useful, and PyTorch has a <a href="https://pytorch.org/docs/stable/optim.html">good assortment of them</a>.</li>
<li><code>state_dict</code> preserves not only the optimizer’s param groups and args (learning rates, etc.) but also its momentum buffers.</li>
<li><a href="https://blog.paperspace.com/pytorch-hooks-gradient-clipping-debugging/">hooks</a> as useful debugging and visualization tools.</li>
<li><code>retain_grad</code> is required to get gradients of non-leaf nodes like the output logits.</li>
</ul>
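A quick standalone sketch (not from the post's notebook) of the optimizer <code>state_dict</code> point: after a single step, SGD with momentum carries both the param-group args and per-parameter momentum buffers.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one step so the momentum buffers get created
loss = model(torch.randn(8, 4)).sum()
loss.backward()
opt.step()

sd = opt.state_dict()
print(sd.keys())                    # dict_keys(['state', 'param_groups'])
print(sd['param_groups'][0]['lr'])  # 0.1
# Per-parameter momentum buffers live under 'state'
print(any('momentum_buffer' in s for s in sd['state'].values()))  # True
```

This is why checkpointing the optimizer alongside the model matters: restoring only the model weights silently resets momentum.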
<p>For the large training runs, I also experimented with <a href="https://jarvislabs.ai">jarvislabs.ai</a> as a provider. The in-browser notebooks, VS Code, and direct SSH/FTP access were pretty nice, though I could not work out some funkiness with VS Code remote windows. I used</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb19-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nohup</span> jupyter nbconvert <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--execute</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--to</span> notebook <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--inplace</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--allow-errors</span> main.ipynb <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;</span></span></code></pre></div></div>
<p>to run the notebook, write results, and be able to close the Jupyter / VS Code tab.</p>


</section>

 ]]></description>
  <category>deep learning</category>
  <category>paper</category>
  <guid>https://ecntu.com/posts/understanding-bn/</guid>
  <pubDate>Wed, 17 Jul 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Deep Learning is Robust to Massive Label Noise</title>
  <link>https://ecntu.com/posts/dl-massive-label-noise/</link>
  <description><![CDATA[ 





<p>The paper shows that neural networks can keep generalizing when large numbers of (non-adversarially) incorrectly labeled examples are added to datasets (MNIST, CIFAR, and ImageNet). It also appears that larger networks are more robust and that higher noise levels lead to lower optimal (fixed) learning rates.</p>
<p>We’ll focus on the uniform label noise experiment and attempt to reproduce Figure 1:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ecntu.com/posts/dl-massive-label-noise/images/fig1.png" class="img-fluid figure-img" width="300"></p>
<figcaption>Figure 1. As we increase the amount of noise in the dataset the performance drops. However, note that even when there are 100 noisy labels <em>per</em> clean label performance is still acceptable. For example, the Convnet still achieves 91% accuracy.</figcaption>
</figure>
</div>
<p>Note: As far as I can tell the paper has no accompanying code, so I’ll be filling in the details to the best of my ability.</p>
<div id="cell-2" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Imports and model evaluation function</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nn</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.optim <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> optim</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchvision <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> datasets, transforms</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch.utils.data <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DataLoader, Dataset</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> PIL <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Image</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-11"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb1-12"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-13"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os, itertools, time</span>
<span id="cb1-14"></span>
<span id="cb1-15">os.makedirs(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span>, exist_ok <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-16">os.makedirs(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models'</span>, exist_ok <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-17"></span>
<span id="cb1-18">seed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span></span>
<span id="cb1-19">np.random.seed(seed)</span>
<span id="cb1-20">torch.manual_seed(seed)</span>
<span id="cb1-21"></span>
<span id="cb1-22">device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.device(</span>
<span id="cb1-23">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cuda'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.cuda.is_available() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span></span>
<span id="cb1-24">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mps'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.backends.mps.is_available() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span></span>
<span id="cb1-25">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cpu'</span>)</span>
<span id="cb1-26">)</span>
<span id="cb1-27">device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cpu'</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># faster for the small models we are using</span></span>
<span id="cb1-28"></span>
<span id="cb1-29"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> eval_model(model, test, criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()):</span>
<span id="cb1-30">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb1-31">    correct, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span></span>
<span id="cb1-32">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb1-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test:</span>
<span id="cb1-34">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb1-35">            _, pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(model(images), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-36">            correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> (pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> labels).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().item()</span>
<span id="cb1-37">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> criterion(model(images), labels).item()</span>
<span id="cb1-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test.dataset), correct <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test.dataset)</span></code></pre></div></div>
</details>
</div>
<p>To generate uniform label noise, the paper augments the original dataset with <img src="https://latex.codecogs.com/png.latex?%5Calpha"> additional <img src="https://latex.codecogs.com/png.latex?(X_i,%20Y')"> pairs per example, where <img src="https://latex.codecogs.com/png.latex?Y'"> is a class sampled uniformly at random with replacement.</p>
<p>To minimize disk use I opted for a custom dataset that wraps the original. PyTorch only requires that we override <code>__len__</code> and <code>__getitem__</code>. The length is simply the original size plus the number of noisy labels. When queried for data, the noisy pairs for each original example appear immediately after it. For example, when <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%202">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(X_1,%20Y_1),%20(X_1,%20Y'),%20(X_1,%20Y'),%20(X_2,%20Y_2),%20...%0A"></p>
<p>Note that to guarantee noisy labels are consistent between epochs, i.e.&nbsp;<code>data[1]</code> returns the same class when called again, we can’t sample the labels at query time. To avoid storing all the randomly sampled labels (<img src="https://latex.codecogs.com/png.latex?60,%20000%20%5Ctimes%20100"> in the worst case), we simply return a shifted index’s label. We can do this with MNIST because it’s reasonably class-balanced and shuffled.</p>
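The index arithmetic behind this layout can be sketched in isolation (the helper name and the flag are illustrative, not from the post): each original index <code>i</code> owns a block of <code>alpha + 1</code> consecutive positions, the first clean and the rest noisy.

```python
# For alpha = 2, original index i owns positions 3i, 3i+1, 3i+2:
# position 3i is the clean pair; the other two reuse X_i with a noisy label.
alpha = 2

def base_and_noisy(idx, alpha):
    """Map a wrapped-dataset index to (original index, is_noisy)."""
    return idx // (alpha + 1), idx % (alpha + 1) != 0

layout = [base_and_noisy(i, alpha) for i in range(6)]
print(layout)
# [(0, False), (0, True), (0, True), (1, False), (1, True), (1, True)]
```

Because the mapping is a pure function of the index, every epoch sees the same clean/noisy assignment without storing any extra labels.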
<div id="cell-4" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> NoisyLabelDataset(Dataset):</span>
<span id="cb2-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Adds alpha noisy labels per original example"""</span></span>
<span id="cb2-3">    </span>
<span id="cb2-4">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, dataset, alpha):</span>
<span id="cb2-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dataset</span>
<span id="cb2-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha</span>
<span id="cb2-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.shift <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(dataset))</span>
<span id="cb2-8"></span>
<span id="cb2-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__len__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb2-10">        n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset)</span>
<span id="cb2-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> n)</span>
<span id="cb2-12">    </span>
<span id="cb2-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__getitem__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, idx):</span>
<span id="cb2-14">        x, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset[idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)]</span>
<span id="cb2-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb2-16">            y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset[(idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.shift) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dataset)][<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb2-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> x, y</span></code></pre></div></div>
</div>
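<p>To make the index arithmetic above concrete, here is a toy walkthrough of how each original example expands into one clean slot plus <code>alpha</code> noisy-label slots (a standalone sketch with hypothetical sizes, mirroring the <code>//</code> and <code>%</code> logic in <code>__getitem__</code>):</p>

```python
# Toy walkthrough of the NoisyLabelDataset index arithmetic
# (hypothetical dataset of n = 3 examples, alpha = 2 noisy copies each).
alpha, n = 2, 3
mapping = []
for idx in range(n * (1 + alpha)):          # matches __len__: n + alpha * n
    base = idx // (alpha + 1)               # original example this slot maps to
    is_noisy = idx % (alpha + 1) != 0       # True for the alpha shifted-label copies
    mapping.append((base, is_noisy))
print(mapping)
```

<p>Every original index thus appears once with its true label, followed by <code>alpha</code> copies whose labels come from a shifted example.</p>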
<p>Although the paper appears to use only a test set, we also split off a validation set for early stopping.</p>
<div id="cell-6" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Datasets and loaders</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span></span>
<span id="cb3-2"></span>
<span id="cb3-3">transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transforms.Compose([transforms.ToTensor(), transforms.Normalize((<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,), (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,))])</span>
<span id="cb3-4">train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> datasets.MNIST(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data'</span>, download <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,   transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transform)</span>
<span id="cb3-5">test_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  datasets.MNIST(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data'</span>, download <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,  transform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> transform)</span>
<span id="cb3-6"></span>
<span id="cb3-7">noisy_train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> NoisyLabelDataset(train_dataset, alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb3-8">val_dataset, test_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.utils.data.random_split(test_dataset, (<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>), generator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.Generator().manual_seed(seed))</span>
<span id="cb3-9"></span>
<span id="cb3-10">train_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(noisy_train_dataset, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb3-11">val_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(val_dataset, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb3-12">test_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(test_dataset, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span></code></pre></div></div>
</details>
</div>
<p>Our training loop is pretty standard. We use the <code>Adadelta</code> optimizer as per the paper. Although the paper does not mention it, we assume early stopping was used: stop training once the validation accuracy has not improved for <code>patience</code> epochs, and return the model with the highest validation accuracy.</p>
<div id="cell-8" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>Training function</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train(model, train_loader, val_loader, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, patience <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, max_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, verbose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb4-2">    </span>
<span id="cb4-3">    model.to(device)</span>
<span id="cb4-4">    criterion <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.CrossEntropyLoss()</span>
<span id="cb4-5">    optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> optim.Adadelta(model.parameters(), lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr)</span>
<span id="cb4-6"></span>
<span id="cb4-7">    log <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'train_loss'</span>: [], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_loss'</span>: [], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_acc'</span>: []}</span>
<span id="cb4-8"></span>
<span id="cb4-9">    best_val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'-inf'</span>)</span>
<span id="cb4-10">    best_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb4-11"></span>
<span id="cb4-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(max_epochs):</span>
<span id="cb4-13"></span>
<span id="cb4-14">        model.train()</span>
<span id="cb4-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> images, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> train_loader:</span>
<span id="cb4-16">            images, labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images.to(device), labels.to(device)</span>
<span id="cb4-17">            optimizer.zero_grad()</span>
<span id="cb4-18">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> criterion(model(images), labels)</span>
<span id="cb4-19">            loss.backward()</span>
<span id="cb4-20">            optimizer.step()</span>
<span id="cb4-21"></span>
<span id="cb4-22">        val_loss, val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, val_loader)</span>
<span id="cb4-23">        log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'train_loss'</span>].append(loss.item())</span>
<span id="cb4-24">        log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_loss'</span>].append(val_loss)</span>
<span id="cb4-25">        log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_acc'</span>].append(val_acc)</span>
<span id="cb4-26"></span>
<span id="cb4-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> verbose: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">', '</span>.join([<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>k<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>v[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k, v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> log.items()]))</span>
<span id="cb4-28"></span>
<span id="cb4-29">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> best_val_acc:</span>
<span id="cb4-30">            best_val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_acc</span>
<span id="cb4-31">            best_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.state_dict()</span>
<span id="cb4-32"></span>
<span id="cb4-33">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Early stopping: stop if val acc has not increased in the last `patience` epochs</span></span>
<span id="cb4-34">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> epoch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> patience <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> val_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_acc'</span>][<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>patience<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]): <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span> </span>
<span id="cb4-35">    </span>
<span id="cb4-36">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> best_model: model.load_state_dict(best_model)</span>
<span id="cb4-37">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> model, log</span></code></pre></div></div>
</details>
</div>
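<p>The slice in the patience check is easy to misread, so here is the same condition in isolation (a standalone sketch; <code>should_stop</code> is a name we introduce for illustration):</p>

```python
# Standalone sketch of the patience check from the training loop:
# stop once the latest accuracy fails to beat the best of the previous
# `patience` epochs (and only after more than `patience` epochs have run).
def should_stop(val_accs, patience=3):
    epoch = len(val_accs) - 1  # 0-indexed, as in the training loop
    return epoch > patience and val_accs[-1] <= max(val_accs[-patience - 1:-1])

print(should_stop([0.1, 0.2, 0.3, 0.4, 0.5]))       # accuracy still rising
print(should_stop([0.5, 0.9, 0.9, 0.9, 0.9, 0.9]))  # plateaued for 4 epochs
```

<p>Note that the slice <code>[-patience-1:-1]</code> deliberately excludes the current epoch's accuracy, which was appended to the log just before the check.</p>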
<p>We train with learning rates <img src="https://latex.codecogs.com/png.latex?%5C%7B0.01,%200.05,%200.1,%200.5%5C%7D"> as per the paper, and <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20%5C%7B0,%2025,%2050%5C%7D"> to save some compute; the trend should still be clear. Below we define our perceptron, MLPs with 1, 2, and 4 layers, and a 4-layer ConvNet. Since the paper does not specify hidden dims, activations, or the ConvNet architecture, we choose them ourselves.</p>
<div id="cell-10" class="cell" data-execution_count="6">
<details class="code-fold">
<summary>Hyperparam and model definitions</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">learning_rates <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>]</span>
<span id="cb5-2">alphas <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>)</span>
<span id="cb5-3"></span>
<span id="cb5-4">lin_relu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> n_in, n_out: nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU())</span>
<span id="cb5-5">models <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb5-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'perceptron'</span>:nn.Sequential(nn.Flatten(), nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)),</span>
<span id="cb5-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MLP1'</span>:nn.Sequential(nn.Flatten(), lin_relu(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>), nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)),</span>
<span id="cb5-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MLP2'</span>:nn.Sequential(nn.Flatten(), lin_relu(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>), lin_relu(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>), nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)),</span>
<span id="cb5-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MLP4'</span>:nn.Sequential(nn.Flatten(), lin_relu(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>), lin_relu(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>), lin_relu(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>), nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)),</span>
<span id="cb5-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Conv4'</span>:nn.Sequential(</span>
<span id="cb5-11">        nn.Conv2d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb5-12">        nn.ReLU(),</span>
<span id="cb5-13">        nn.MaxPool2d(kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb5-14">        </span>
<span id="cb5-15">        nn.Conv2d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>, kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb5-16">        nn.ReLU(),</span>
<span id="cb5-17">        nn.MaxPool2d(kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb5-18">        </span>
<span id="cb5-19">        nn.Conv2d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb5-20">        nn.ReLU(),</span>
<span id="cb5-21">        nn.MaxPool2d(kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb5-22">        </span>
<span id="cb5-23">        nn.Conv2d(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb5-24">        nn.ReLU(),</span>
<span id="cb5-25">        nn.MaxPool2d(kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb5-26">        </span>
<span id="cb5-27">        nn.Flatten(),</span>
<span id="cb5-28">        nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>),</span>
<span id="cb5-29">        nn.ReLU(),</span>
<span id="cb5-30">        nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb5-31">    )</span>
<span id="cb5-32">}</span></code></pre></div></div>
</details>
</div>
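<p>The <code>256 * 1 * 1</code> input size of Conv4's first linear layer follows from four stride-2 max-pools applied to a 28&#215;28 MNIST image; a quick arithmetic check:</p>

```python
# Each MaxPool2d(kernel_size=2, stride=2, padding=0) floors the spatial side in half.
side = 28
for _ in range(4):
    side //= 2  # 28 -> 14 -> 7 -> 3 -> 1
print(side)
```

<p>With a final 1&#215;1 spatial map and 256 channels, <code>Flatten</code> yields exactly 256 features.</p>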
<div id="cell-11" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Train and save models</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> alpha, (name, model), lr <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> itertools.product(alphas, models.items(), learning_rates):</span>
<span id="cb6-2"></span>
<span id="cb6-3">    noisy_train_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> NoisyLabelDataset(train_dataset, alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha)</span>
<span id="cb6-4">    train_loader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(noisy_train_dataset, batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_size, shuffle <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-5">    </span>
<span id="cb6-6">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb6-7">    model, log <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(model, train_loader, val_loader, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr, verbose <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-8">    test_loss, test_acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eval_model(model, test_loader)</span>
<span id="cb6-9">    log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'test_loss'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test_loss</span>
<span id="cb6-10">    log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'test_acc'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test_acc</span>
<span id="cb6-11"></span>
<span id="cb6-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> - alpha: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>alpha<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, lr: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, test acc: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>test_acc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, took: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s'</span>)</span>
<span id="cb6-13">    torch.save(log, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'logs/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>alpha<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.pt'</span>)</span>
<span id="cb6-14">    torch.save(model, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'models/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>alpha<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.pt'</span>)</span></code></pre></div></div>
</details>
</div>
<p>Finally, we plot the accuracies on both the validation and test sets:</p>
<div id="cell-13" class="cell" data-execution_count="44">
<details class="code-fold">
<summary>Plot results</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load results into dataframe</span></span>
<span id="cb7-2">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb7-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> fname <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> os.listdir(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs'</span>):</span>
<span id="cb7-4">    name, alpha, lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fname.split(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_'</span>)</span>
<span id="cb7-5">    lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>(lr.replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.pt'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>))</span>
<span id="cb7-6">    alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(alpha)</span>
<span id="cb7-7">    </span>
<span id="cb7-8">    log <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logs/'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> fname)</span>
<span id="cb7-9">    tmp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb7-10">    tmp[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Model'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> name</span>
<span id="cb7-11">    tmp[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alpha'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha</span>
<span id="cb7-12">    tmp[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lr'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr</span>
<span id="cb7-13">    tmp[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Prediction Accuracy'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'test_acc'</span>]</span>
<span id="cb7-14">    tmp[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Validation Accuracy'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(log[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_acc'</span>])</span>
<span id="cb7-15">    df.append(tmp)</span>
<span id="cb7-16">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(df)</span>
<span id="cb7-17"></span>
<span id="cb7-18">hue_order <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'perceptron'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MLP1'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MLP2'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'MLP4'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Conv4'</span>]</span>
<span id="cb7-19"></span>
<span id="cb7-20">f, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>), sharey <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, sharex <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb7-21"></span>
<span id="cb7-22">sns.lineplot(</span>
<span id="cb7-23">    df.groupby([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Model'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alpha'</span>]).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(), x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alpha'</span>, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Prediction Accuracy'</span>,</span>
<span id="cb7-24">    hue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Model'</span>, hue_order <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hue_order, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax1</span>
<span id="cb7-25">)</span>
<span id="cb7-26">ax1.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'On Test Set'</span>)</span>
<span id="cb7-27">ax1.grid()</span>
<span id="cb7-28"></span>
<span id="cb7-29">sns.lineplot(</span>
<span id="cb7-30">    df.groupby([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Model'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alpha'</span>]).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(), x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alpha'</span>, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Validation Accuracy'</span>,</span>
<span id="cb7-31">    hue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Model'</span>, hue_order <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hue_order, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax2, legend <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb7-32">)</span>
<span id="cb7-33">ax2.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'On Validation Set'</span>)</span>
<span id="cb7-34">ax2.grid()</span>
<span id="cb7-35">plt.tight_layout()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/dl-massive-label-noise/index_files/figure-html/cell-8-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>We observe that the general trends hold: as we add noise, performance drops, and larger models tend to be more robust.</p>
<p>However, our models generally perform worse than the paper’s. At <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%2050"> most of our models have accuracies below <img src="https://latex.codecogs.com/png.latex?60%5C%25">, whereas the paper’s are in the high <img src="https://latex.codecogs.com/png.latex?80%5C%25">s. In addition, our Conv4 model is already below 90% at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%20100">, where the paper achieves 91%.</p>
<p>This might be due to differences in training (the use and implementation of early stopping), differences in architecture details, random seeds (we did not average over multiple seeds due to compute constraints), and so on.</p>
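<p>To make the early-stopping point concrete, here is a minimal sketch of the kind of patience-based criterion such a setup typically uses (the patience value and the choice of metric are illustrative assumptions, not necessarily what our <code>train</code> function does):</p>

```python
# Patience-based early stopping (illustrative sketch; the patience value
# and the accuracy metric are assumptions, not our exact implementation).
def should_stop(val_accs, patience=5):
    """Stop when the best validation accuracy is older than `patience` epochs."""
    if len(val_accs) <= patience:
        return False
    # Stop if no entry in the last `patience` epochs matches the best so far.
    return max(val_accs) not in val_accs[-patience:]
```

<p>Small differences here (the patience value, stopping on accuracy vs. loss, restoring the best checkpoint or not) could plausibly account for part of the gap.</p>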
<p>We also observe the trend the paper points out in Section 5, namely that higher noise levels lead to smaller effective batch sizes and thus lower optimal learning rates:</p>
<div id="cell-16" class="cell" data-execution_count="71">
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://ecntu.com/posts/dl-massive-label-noise/index_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
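<p>A rough back-of-the-envelope calculation (our own simplification, not the paper’s exact derivation, with a batch size of 256 assumed for illustration) helps see why: with <code>alpha</code> noisy labels per clean label, only about a <code>1 / (1 + alpha)</code> fraction of each batch carries a coherent gradient signal, so a batch behaves like a much smaller clean one:</p>

```python
# Back-of-the-envelope sketch (a simplification, not the paper's formula):
# with alpha noisy labels per clean label, a batch of size B contains on
# average B / (1 + alpha) clean examples, so the coherent gradient signal
# resembles that of a much smaller batch -- favoring lower learning rates.
def effective_batch_size(batch_size, alpha):
    return batch_size / (1 + alpha)

for alpha in (0, 10, 50, 100):
    n = effective_batch_size(256, alpha)
    print(f"alpha={alpha}: ~{n:.1f} clean samples per batch of 256")
```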
<p>A few closing thoughts: although our results did not fully match the paper’s, we still find the robustness to label noise impressive.</p>
<p>The paper studies the effect much further: under non-uniform noise, with different datasets, batch sizes, and more. It is worth a <a href="https://arxiv.org/pdf/1705.10694">read</a>.</p>
<p>I might revisit this notebook to perform further experiments and try to answer some lingering questions:</p>
<ul>
<li>Does within-epoch sample ordering matter? Intuitively, if we place all clean examples before the noisy ones, we would expect worse performance (catastrophic forgetting?).</li>
<li>What effect does early stopping have? We used a clean validation set to decide when to stop, which is not realistic. What happens if we use the loss instead, or no early stopping at all?</li>
</ul>
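<p>For the first question, the epoch ordering could be controlled with a sketch like the following (the boolean <code>is_clean</code> mask is an assumption: our <code>NoisyLabelDataset</code> would need to expose which labels were left uncorrupted):</p>

```python
import random

def clean_first_order(is_clean, seed=0):
    """Return an epoch's sample order with all clean examples first.

    is_clean: boolean mask marking samples whose label was not corrupted
    (hypothetical attribute; our NoisyLabelDataset would need to expose it).
    """
    rng = random.Random(seed)
    clean = [i for i, c in enumerate(is_clean) if c]
    noisy = [i for i, c in enumerate(is_clean) if not c]
    rng.shuffle(clean)  # shuffle within each group so only the
    rng.shuffle(noisy)  # clean/noisy boundary is deterministic
    return clean + noisy
```

<p>Feeding this order to a <code>DataLoader</code> via a custom sampler, and comparing against fully shuffled epochs, would let us test the forgetting hypothesis directly.</p>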



 ]]></description>
  <category>deep learning</category>
  <category>paper</category>
  <guid>https://ecntu.com/posts/dl-massive-label-noise/</guid>
  <pubDate>Tue, 18 Jun 2024 04:00:00 GMT</pubDate>
</item>
</channel>
</rss>
