Episode 21: Deep Dive: ReinboT and the Fusion of RL with Vision-Language-Action
Introduction: A New Twist in Robot Learning
Hello and welcome to Robotics Unwrapped, where we explore cutting-edge advances in robot learning. Today, we’re diving into ReinboT, a model fresh out of ICML 2025 that promises to amplify robot manipulation by weaving reinforcement learning (RL) ideas into vision-language-action (VLA) models. Imagine a robot that not only interprets what it sees and the instructions it’s given, but also has a sense of how rewarding its actions will be. ReinboT does exactly that – it’s a “Reinforced Robot GPT” that treats future returns as first-class citizens in its decision-making. In this monologue, we’ll unpack ReinboT’s architectural innovations and why they matter. We’ll explore how it predicts dense returns for each step, uses expectile regression to favor high-reward behaviors, and even adds a special token for Return-To-Go (RTG) – essentially giving the model an internal voice that whispers “maximize the long-term reward”. We’ll also see how ReinboT stacks up against other recent approaches like Decision Transformer, Reinformer, Google’s RT-1/RT-2, the GR-1 generalist robot model, the PIDM world-model policy, and RoboFlamingo. By the end, we’ll understand not just what makes ReinboT unique, but also the trade-offs it navigates in pursuit of smarter, more generalizable robot control.
Background: From Imitation to Return Maximization
Over the past few years, VLA models – those that take in visual observations and language instructions and output robot actions – have made impressive strides in general robotic tasks via imitation learning. Models like RT-1 (Robotics Transformer 1) showed that with enough demonstration data, a transformer policy can learn to mimic human teleoperator behavior across many tasks. GR-1 (Generalist Robot 1) went a step further, leveraging large-scale video pre-training to imbue a GPT-style policy with rich visual understanding before fine-tuning on robot demos. And RoboFlamingo, built on the Flamingo family of vision-language models, demonstrated the power of huge vision-language models for robotics by fine-tuning them (with a lightweight LSTM policy head) to follow instructions one control step at a time. These imitation-based models significantly improved robots’ semantic understanding and multi-task learning. However, they all share a common limitation: they replicate the behaviors in their data, without an explicit notion of which actions are better or worse. They treat all demonstrated actions more or less equally, learning the average behavior. When the training data quality is uneven – containing both efficient and clumsy demonstrations or even some failures – pure imitation learning struggles. It lacks a mechanism to distinguish the good trajectories from the bad, often resulting in suboptimal manipulation fidelity on tough tasks.
On the other hand, the RL community has developed ways to learn from mixed-quality data by emphasizing cumulative reward. Offline RL algorithms can, in principle, extract an optimal policy from a static dataset by maximizing returns – effectively cherry-picking the best parts of the data. Initial attempts to bring RL into VLA models have appeared (e.g. augmenting vision-language policies with value functions or Q-learning), but until now it’s been tricky to seamlessly merge these paradigms. Classic RL losses (like Q-learning or policy gradients) are hard to bolt onto large Transformers without destabilizing training. The key question is: Can we get the best of both worlds? – the generalization and flexibility of big VLA models and the performance-boosting focus on return maximization from RL.
This is where ReinboT enters the scene. The name riffs on being a “reinforced robot”, and the core idea is exactly that: inject the concept of return maximization directly into a Transformer-based VLA model, via supervised learning. Instead of relying purely on mimicking demonstrations, ReinboT trains the model to predict and maximize future rewards as it generates actions. By doing so, it aims to amplify good behaviors and downplay poor ones, effectively learning a higher-quality policy from the same data. Let’s break down how it works, piece by piece.
Key Innovations in ReinboT’s Architecture
Dense Return Prediction – Valuing Every Step
One of ReinboT’s most striking features is its use of dense return prediction. In conventional imitation learning, a robot might only know if it succeeded at the very end of a task (sparse success reward) or not at all. ReinboT, however, is taught to value every step along the way by providing a shaped reward signal throughout each trajectory. The team behind ReinboT introduced a reward densification scheme that automatically decomposes a long-horizon task into smaller sub-goals and assigns a reward at high frequency (dense reward) for progress toward those sub-goals. In their implementation, the dense reward is carefully designed to capture four aspects of manipulation skill: (1) Sub-goal achievement – did the robot achieve the immediate objective (e.g. moved an object to the target area?), (2) Task progress – how far along the overall task it is, (3) Behavior smoothness – is the motion efficient and not jerky or wasteful, and (4) Task completion – whether the final goal was accomplished. Each of these factors contributes to a numeric reward at each time step, so even before a task is fully done, the model gets feedback about how well it’s doing.
This dense reward provides rich supervision during training: every state-action step in the demonstration data is labeled with a return-to-go, i.e. the cumulative reward from that step onward. Intuitively, this means the model doesn’t just see “what action was taken” but also “how successful was it expected to be from here.” By predicting these returns, ReinboT gains a “feel” for the quality of trajectories in the dataset. Good demonstrations will have high returns at each step, poor ones will have lower returns – and the model learns to tell them apart. In essence, ReinboT learns a built-in value function over its visual and language inputs, but without ever explicitly running a traditional RL algorithm during training. Instead, it treats return prediction as an auxiliary task in a supervised learning setup.
Concretely, during training, each demonstration trajectory (which might be a few tens of seconds of robot manipulation) is broken into segments centered on sub-goals. For each segment, ReinboT calculates the dense rewards and sums them, obtaining a Return-to-Go (RTG) for every step. These RTGs – effectively the dense returns – serve as training targets for the model’s return prediction head. The result is that ReinboT doesn’t just copy actions; it also learns to forecast the payoff of those actions. This approach is broadly applicable: the reward design they chose is generic enough (tracking progress, smoothness, etc.) that it can be adapted to many manipulation tasks, providing a common yardstick for “good behavior”. The payoff of this design is evident in experiments – models that leverage dense returns achieved higher success rates on long, chained tasks than those using only sparse success signals. By valuing every step, ReinboT attains a deeper understanding of data quality, which translates to more robust performance.
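To make the bookkeeping concrete, here is a minimal Python sketch of how dense rewards and return-to-go targets could be computed for one trajectory segment. The component arrays, weights, and discount factor are illustrative assumptions, not the paper’s exact definitions.

```python
import numpy as np

def dense_reward(subgoal, progress, smoothness, completion, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine four per-step reward components into one dense reward signal.
    Each argument is an array of length T (one value per time step); the
    component definitions and weights here are placeholders, not ReinboT's."""
    w1, w2, w3, w4 = weights
    return w1 * subgoal + w2 * progress + w3 * smoothness + w4 * completion

def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: the (optionally discounted) sum of future rewards."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# A 5-step segment: steady progress, a sub-goal hit at step 2, completion bonus at the end.
r = dense_reward(
    subgoal=np.array([0.0, 0.0, 1.0, 0.0, 0.0]),
    progress=np.array([0.1, 0.2, 0.2, 0.3, 0.2]),
    smoothness=np.array([-0.01, -0.02, -0.01, -0.01, -0.01]),
    completion=np.array([0.0, 0.0, 0.0, 0.0, 1.0]),
)
print(returns_to_go(r))  # these values become supervision targets for the [RTG] head
```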
Return-To-Go as a Modality Token – Reward in the Input Stream
Perhaps the boldest architectural move in ReinboT is treating the Return-To-Go as a modality token in the model’s input-output sequence. If you’re familiar with the Decision Transformer (DT), you’ll recall that DT conditions a GPT-like policy on a desired return value by feeding that return in as part of the input sequence. ReinboT takes inspiration from that idea, but flips it on its head: instead of requiring an external return input, it learns to predict the Return-To-Go internally and uses it as a guiding signal for action generation.
In practice, ReinboT’s architecture adds a special token embedding [RTG] to the model, analogous to how one might add an [IMAGE] token for an image patch or an [ACTION] token for an action output. This [RTG] token is designed to carry information about the cumulative reward. During a forward pass, the model ingests the current observation (e.g. camera image encoded by a ViT, proprioceptive state via an MLP, and the task’s language instruction via a CLIP text encoder). All these modality embeddings – image, language, robot state – are fed into a GPT-style causal Transformer backbone as a sequence. Alongside them, the model includes the [RTG] token in the sequence. Initially (at the start of inference), this [RTG] token doesn’t come with a fixed value – unlike DT, we don’t provide a number. Instead, the model will fill it in with a predicted return value on the fly.
How does that work? ReinboT uses a modular decoding approach: after the Transformer backbone processes the inputs, it produces latent features corresponding to each special token (including [RTG] and [ACTION]). The feature for [RTG] is passed into a small ReturnToGo decoder which outputs a predicted return value (or vector of return components). That decoder’s final hidden layer – essentially the model’s estimate of “how much return can we get from here” – is then fed back into the action decoder as additional context. In other words, the model’s prediction of the return literally influences the next action it decides. The action decoder takes the latent features for [ACTION] from the backbone, concatenates the [RTG] decoder’s hidden representation, and then outputs the actual motor command (like the joint velocities or end-effector movement and gripper open/close).
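As a rough picture of that wiring, here is a minimal PyTorch sketch. The module names, dimensions, and use of simple MLP decoders are assumptions for illustration, and the Transformer backbone that produces the [RTG] and [ACTION] token features is omitted. The point it shows is the data flow: the return decoder’s hidden state is concatenated into the action decoder’s input, so the action is conditioned on the model’s own return estimate in a single forward pass.

```python
import torch
import torch.nn as nn

class ReturnAwareHeads(nn.Module):
    """Illustrative sketch (not the official implementation) of return-conditioned action decoding."""

    def __init__(self, d_model=512, d_hidden=256, action_dim=7, n_reward_terms=4):
        super().__init__()
        # Small MLP decoder for the [RTG] token: hidden features plus a return estimate.
        self.rtg_hidden = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.rtg_out = nn.Linear(d_hidden, n_reward_terms)   # e.g. one value per reward component
        # Action decoder consumes the [ACTION] feature concatenated with the RTG hidden state.
        self.action_out = nn.Sequential(
            nn.Linear(d_model + d_hidden, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, action_dim),
        )

    def forward(self, rtg_token_feat, action_token_feat):
        h = self.rtg_hidden(rtg_token_feat)   # the model's internal "how well can I do from here?"
        rtg_pred = self.rtg_out(h)            # supervised with dense returns during training
        action = self.action_out(torch.cat([action_token_feat, h], dim=-1))
        return action, rtg_pred

# Dummy backbone features for a batch of 2 (in reality these come from the causal Transformer).
heads = ReturnAwareHeads()
action, rtg = heads(torch.randn(2, 512), torch.randn(2, 512))
print(action.shape, rtg.shape)  # torch.Size([2, 7]) torch.Size([2, 4])
```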
This design elegantly ensures that the policy’s decisions are aware of predicted returns. If the model foresees a higher return from one course of action, the [RTG] token will carry that information into the action head, nudging the output in that direction. During training, of course, we supervise the [RTG] predictions with the ground-truth dense returns from the dataset. But here’s the kicker: at inference time, no human needs to set any return value. The model generates its own RTG estimate on the fly. This is a big practical win. In the original Decision Transformer, you had to manually specify a target return (e.g. the maximum possible score) to coax the policy to be ambitious – an awkward hyperparameter to tune for each new deployment. ReinboT removes that burden. The authors emphasize that no RTG initialization is required for ReinboT’s test-time execution. The robot can operate without any manually set reward targets, which is important because in the real world you often don’t know the numerical value of the best possible return, and you can’t rely on an environment simulator to give you running rewards. ReinboT’s policy simply looks at the state and instruction, and internally computes “how well can I do?” as a guidance signal.
To summarize this innovation: ReinboT elevates the Return-to-Go to a modality in the model’s language. Just as the model processes visual tokens and word tokens, it processes a reward token. By doing so, it bakes in an understanding of “how good things are likely to go” at every step. This architectural choice differentiates it strongly from other VLA models that might use value estimates as a separate module or not at all – ReinboT’s single Transformer handles vision, language, action, and reward in one unified sequence. It’s a holistic approach: perception, instruction following, and return prediction are all intertwined.
Expectile Regression – Leaning Toward the Optimistic Side
So far, we have a model that predicts returns and uses them to select actions. But what makes those returns something the model wants to maximize, rather than just mimic? This is where expectile regression comes into play. Expectile regression might sound esoteric, but it’s essentially a tool to skew the model’s predictions toward the higher end of the returns distribution, without introducing a complicated RL objective. Think of it as an “optimistic lens”: instead of predicting the average outcome from a state, the model is trained to predict something closer to the best outcome seen from that state in the dataset.
In technical terms, an expectile is analogous to a quantile, but defined with a squared error weighting. By choosing an expectile parameter $\tau > 0.5$, the loss function penalizes underestimation of returns more than overestimation, pushing predictions upwards towards the high-return trajectories. In ReinboT’s loss function, they include a ReturnToGo loss $L_{RTG}$ that is computed as an expectile regression error between the predicted return and the ground-truth return from that trajectory. If $\tau = 0.5$, this just becomes a mean squared error – the model would learn to predict the average return (and essentially reduce to ordinary behavior cloning with a return token). But with $\tau$ set higher (say 0.7 or 0.8), the model starts aiming above the mean. It will still predict the ground-truth return for high-return examples (since those are already near the max), but for lower-return examples, the loss nudges the prediction upward toward the higher return that could have been achieved in a similar situation.
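For reference, the standard expectile regression loss (used, for example, in IQL) on the residual $u = R_{\text{target}} - \hat{R}$ is $L_\tau(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$: with $\tau > 0.5$, undershooting the target costs more than overshooting it. Below is a small PyTorch sketch of that loss with a toy illustration of the asymmetry; ReinboT’s exact weighting and per-component handling may differ.

```python
import torch

def expectile_loss(pred, target, tau=0.7):
    """Asymmetric squared error: with tau > 0.5, under-predicting the return
    (target > pred) is penalized more heavily than over-predicting, which pulls
    the model's return estimates toward the high end of what the data supports.
    tau = 0.5 recovers ordinary mean squared error (up to a factor of 0.5)."""
    diff = target - pred
    weight = torch.abs(tau - (diff < 0).float())   # tau if undershooting, (1 - tau) if overshooting
    return (weight * diff.pow(2)).mean()

# The same absolute error of 1.0 costs more when the model undershoots the target.
pred, target = torch.tensor([1.0]), torch.tensor([2.0])
print(expectile_loss(pred, target, tau=0.8))   # tensor(0.8000): undershoot, heavily penalized
print(expectile_loss(target, pred, tau=0.8))   # tensor(0.2000): overshoot, lightly penalized
```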
What’s the effect of this? It effectively implements return maximization in a supervised way. The model is not content to just predict “what was the return in this demonstration,” but rather “what’s the maximum return that might have been achieved from this state, within the data distribution”. It’s a clever proxy for the RL concept of optimal value function. Instead of doing iterative Bellman updates or policy gradients, they crank the return predictions toward the best seen outcomes. This guides the policy to choose actions that are more likely to lead to those high outcomes. In essence, expectile regression makes ReinboT optimistic, but in a controlled, data-driven way.
Of course, there’s a balance to strike. If $\tau$ is too high (too close to 1), the model might become overly optimistic – predicting returns beyond anything actually achievable, which could lead to selecting actions that are unrealistic or ungrounded (since the model might think it can get a huge reward when in truth it can’t). The authors acknowledge this risk: blindly increasing $\tau$ causes the model to overshoot, predicting out-of-distribution returns that negatively affect the action generation. In their experiments, they likely tuned $\tau$ to find a sweet spot where the model is sufficiently encouraged to exceed the demonstrated average, but not so much that it hallucinates absurdly high returns. Figure 2 of their paper shows how different values of the expectile parameter impact performance. The end result is that with an appropriate expectile setting, ReinboT consistently predicts a slightly higher return than the ground truth trajectory would have gotten, hence choosing slightly better actions than the human did – a form of one-step improvement over the data. This is analogous to certain offline RL algorithms (like IQL) that use expectile regression to implicitly favor high-return trajectories without explicit policy optimization. ReinboT basically bakes that idea into its training objective.
To sum up, expectile regression is ReinboT’s secret sauce for integrating the “maximization” part of RL into a purely supervised learning framework. No reward signals are needed at runtime, no actor-critic loop, just a tweak in the loss function that biases the model to be optimistic about returns. This stands in contrast to classical RL approaches, which would add a separate RL loss (e.g. maximizing Q or advantage) on top of the model. The authors point out that adding such RL-specific losses to Transformers can be problematic and unstable. Instead, ReinboT’s expectile-based return loss achieves a similar effect while remaining in the supervised learning paradigm, keeping training simple and stable. It’s a neat illustration of how careful loss design can marry RL objectives with large-scale sequence modeling.
Putting It Together: ReinboT’s One-Pass, Return-Aware Policy
Bringing these pieces together, ReinboT looks like this: a GPT-style Transformer backbone that encodes multimodal inputs (language instruction via CLIP, image via pre-trained ViT with a token compressor, proprioceptive state via MLP). This backbone produces contextual embeddings for special tokens, notably [RTG] and [ACTION] tokens. The [RTG] branch (decoder) predicts a dense return vector (including the total return and possibly each reward component), and its hidden state is fed into the [ACTION] branch decoder, which outputs the next action command. The model may also decode other modalities – for example, the paper mentions a future image prediction loss with a pixel-level MSE, meaning ReinboT, like GR-1, can predict what the camera will see next (this further helps it learn physical dynamics and visual consequences). So it’s truly end-to-end: from vision and language to actions (and predicted future states), all supervised by imitation trajectories augmented with reward labels.
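As a sketch of how such a multi-task objective might be assembled, here is one plausible weighted sum. The weights, the use of MSE for actions, and the function names are placeholders; the paper’s exact terms and weighting may differ.

```python
import torch
import torch.nn.functional as F

def expectile_loss(pred, target, tau=0.7):
    # Same asymmetric return loss as sketched earlier (tau > 0.5 favors high returns).
    diff = target - pred
    return (torch.abs(tau - (diff < 0).float()) * diff.pow(2)).mean()

def multitask_loss(action_pred, action_gt, rtg_pred, rtg_gt, img_pred, img_gt,
                   w_action=1.0, w_rtg=0.1, w_img=0.1, tau=0.7):
    """Hypothetical weighted sum of the loss terms described above: behavior cloning
    on actions, expectile regression on returns-to-go, and pixel-level MSE on the
    predicted future image."""
    l_action = F.mse_loss(action_pred, action_gt)
    l_rtg = expectile_loss(rtg_pred, rtg_gt, tau)
    l_img = F.mse_loss(img_pred, img_gt)
    return w_action * l_action + w_rtg * l_rtg + w_img * l_img

# Dummy tensors just to exercise the shapes: batch of 8, 7-D actions, scalar RTG, 64x64 RGB frames.
B = 8
loss = multitask_loss(torch.randn(B, 7), torch.randn(B, 7),
                      torch.randn(B, 1), torch.randn(B, 1),
                      torch.randn(B, 3, 64, 64), torch.randn(B, 3, 64, 64))
print(loss.item())
```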
A single forward pass through this network yields both an action and a value estimate. Unlike some prior approaches, there’s no need for iterative planning or two-stage inference. The authors highlight that ReinboT obtains the action with only a single model inference, as opposed to the earlier Reinformer model which required two passes (one to predict return, another to condition on that return for the action). This gives ReinboT an efficiency edge at inference time – it’s effectively as fast as a normal policy network, since it is a normal policy network, just with an internal critic. Moreover, not needing to manually set an initial RTG or run a separate value iteration makes it much easier to deploy: you can drop ReinboT into a new task (assuming it’s within its training distribution) and let it rip, no fiddling with reward scales needed.
To be clear, ReinboT is trained offline on a fixed dataset; it doesn’t learn from new real-time rewards during deployment. But because it learned to predict and maximize returns from its training data, it behaves as if it were performing an RL policy that seeks long-term rewards, even when it’s just running feed-forward at test time. This is a fascinating blend of paradigms: technically it’s still an imitation learned policy, but one that has internalized the reward structure of the task. The outcome is a robot policy that tends to choose actions leading to higher returns (in terms of the dense reward) than those seen in mediocre parts of the data. In the next section, we’ll see how this translates into performance gains and how ReinboT compares to other contemporary models tackling the vision-language-action challenge.
ReinboT vs. the Rest: How Does It Compare?
Now that we understand ReinboT’s design, let’s put it in context. The field has seen a flurry of models bridging vision, language, and action, each with a unique approach. We’ll compare ReinboT’s architectural strategies to a few notable ones:
Decision Transformer and Reinformer: Sequence Modeling for RL
Decision Transformer (DT) was one of the first models to pose RL as a conditional sequence modeling problem. It uses a transformer to predict actions by conditioning on past states, actions, and a desired Return-to-Go token. The DT proved that a GPT-like model can solve RL tasks by simply training on trajectories with supervised learning, treating the return as just another input. However, DT itself doesn’t increase the returns of trajectories – it largely just reproduces what’s in the data unless you manually set a higher return target at inference. It’s like a fancy conditional behavioral cloning: given a high return token, it will try to mimic trajectories that had high returns in the dataset. But it requires you to guess what “high return” means, and it might still struggle if the dataset is noisy or if “high return” examples are rare.
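For contrast with what follows, here is a rough sketch of Decision Transformer-style inference: the caller has to pick a target return up front and decrement it by the observed reward at every step. The `policy` and `env` objects and their interfaces are hypothetical placeholders, not a real API.

```python
def run_decision_transformer(policy, env, target_return, max_steps=100):
    """Hypothetical DT-style rollout: the desired return is an input the user must choose."""
    state = env.reset()
    states, actions, rtgs = [state], [], [target_return]
    for _ in range(max_steps):
        # The policy is conditioned on the remaining desired return (plus the history so far).
        action = policy(states, actions, rtgs)
        state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rtgs.append(rtgs[-1] - reward)   # shrink the return "budget" as reward is collected
        if done:
            break
    return states, actions
```

The `target_return` argument in this sketch is exactly the knob that ReinboT removes.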
ReinboT’s philosophy is similar to DT in that it leverages the transformer’s sequence modeling to handle RL problems, but it makes two crucial advances: internal return prediction and return maximization via expectile. In essence, ReinboT builds on the ideas from Reinformer (Zhuang et al., 2024). Reinformer can be seen as an intermediate between DT and ReinboT. It extended DT by explicitly training the model to predict the maximized Return-to-Go from a state, rather than the actual return, thereby encouraging higher-return actions. However, Reinformer’s implementation needed a two-step process: it would first output a guess of the best achievable RTG, then feed that back in to generate an action (making inference more cumbersome). Reinformer also found that pushing returns too high could cause the model to predict out-of-distribution returns that were not grounded in reality. ReinboT addresses these issues head-on. By fusing the return prediction and action prediction into one seamless pass, and feeding the return token’s hidden state directly into the action decoder, it streamlines the architecture and avoids multiple forward passes. The expectile regression in ReinboT can be viewed as a refined way of achieving what Reinformer aimed for – to “tilt” the predictions toward maximum returns – but with a controlled optimism factor $\tau$ that can be tuned to avoid severe overshooting.
In summary, compared to DT, ReinboT doesn’t need a user-specified return input and it actually learns to increase returns above the demonstrations (thanks to expectile) instead of just reproducing them. Compared to Reinformer, ReinboT is architecturally more efficient (one-pass inference) and likely more stable, since it explicitly manages the optimism level and avoids extremely out-of-range return predictions. Both DT and Reinformer were evaluated on standard RL benchmarks, but ReinboT shows how these ideas port to robotic manipulation with rich modalities – an area DT hadn’t fully ventured into. ReinboT’s achievement is showing that even with complex inputs like images and text, the model can learn to predict returns and use them, bridging the gap between trajectory return modeling and real-world robot skills.
RT-1 and RT-2: Scaling Up vs. Widening Out
Moving to Google’s robotics Transformers: RT-1 and RT-2 took a different path to improving robotic policies. RT-1 was a landmark model that scaled up imitation learning with a large dataset of real robot experiences (over 130k demonstrations) and a transformer policy that could map images and task descriptions to actions across many tasks. It focused on breadth of data and a well-engineered model to achieve robustness, but it did not incorporate any explicit notion of reward – it learned purely by mimicking successful trials (failed trials were mostly filtered out, if used at all). Essentially, RT-1 maximized performance by sheer volume of expert data and model capacity, not by re-weighting data internally. This works great if you have tons of consistent demonstrations. But if your data is mixed-quality or you have fewer examples, RT-1 style learning might not fully exploit the good trajectories.
RT-2 (Robotics Transformer 2) took an even more intriguing approach: it sought semantic generalization by connecting to web-scale vision-language knowledge. RT-2’s key innovation was to treat robot actions as another language modality, essentially learning a policy that outputs text-like tokens corresponding to discrete actions, enabling it to leverage pre-trained vision-language models and even zero-shot understand new instructions. By conceptualizing actions as language tokens, RT-2 could transfer knowledge from vision-language pretraining (e.g. understanding the concept of “trash can” or “throw away”) to robotic actions. This gave it impressive zero-shot capabilities – for instance, executing instructions it never saw in the robot data by relying on semantic grounding learned from internet data. However, RT-2 still fundamentally relies on imitation learning for the policy. It doesn’t do any reward optimization; it just greatly expands what the policy can understand by linking it with a powerful language model. In essence, RT-2 prioritizes semantic generalization (knowing what the user means, even in new contexts) over task return optimization. It might do the right thing conceptually, but if the demonstrations were suboptimal, RT-2 has no built-in mechanism to do better than those demonstrations – it wasn’t trained to explicitly favor higher reward outcomes, only to follow instructions in the style of its data.
ReinboT differs by focusing on decision optimality rather than pure semantic reach. Where RT-2 brings in internet-scale knowledge, ReinboT brings in the RL principle of return maximization. In a sense, these approaches are complementary: one solves general understanding, the other solves choosing the best way to act. Notably, ReinboT’s training did utilize pre-trained encoders (CLIP for language and MAE-pretrained ViT for vision), so it does stand on the shoulders of foundation models for perception like many others. But the core transformer of ReinboT (the policy backbone) is trained on robot data directly, rather than inheriting from an LLM. That makes it much smaller than RT-2’s massive multi-billion parameter model, which has implications for inference speed and deployment. ReinboT’s param count isn’t explicitly stated in what we cited, but given it’s a GPT-style model with presumably on the order of a few hundred million parameters (similar to GR-1’s 195M), it’s likely lighter than RT-2’s giant VLM core. This could mean ReinboT is easier to run on real robots in real time, whereas RT-2’s size could be a bottleneck (though RT-2 has been demonstrated on robots, often with off-board processing). In terms of generalization, RT-2 would likely win at tasks that require understanding out-of-vocabulary instructions or recognizing novel objects (thanks to web training), while ReinboT likely excels in tasks where the challenge is long-horizon optimization under familiar semantics. For example, on a multi-step task requiring the robot to carry out five sub-instructions in a row flawlessly, ReinboT’s return-driven chaining might yield higher success rates, as indeed evidenced in their results on the CALVIN benchmark (successfully chaining on average 2.26 sub-tasks vs. about 1.4–1.7 for imitation-based models). In fact, ReinboT achieved state-of-the-art success on the CALVIN long-horizon manipulation benchmark, clearly outperforming RT-1-style and other baselines.
In short, RT-1/2 and ReinboT reflect two different emphases: scale and knowledge versus optimality and data utilization. ReinboT doesn’t have the broad “common sense” that RT-2’s web training provides, but it squeezes more out of whatever data it has by explicitly modeling reward. For many closed-world robotic tasks (like those in a specific lab or home setting), that focus on reward may yield more tangible success and efficiency. And notably, because ReinboT doesn’t require a reward function at runtime, it preserves the deployability of an imitation model while injecting some of RL’s magic.
GR-1 and RoboFlamingo: Pre-Trained Foundations vs. Learned Rewards
GR-1 and RoboFlamingo can be thought of as the foundation-model-powered imitation learners. GR-1 (from 2023) demonstrated that a large dose of offline pre-training on human videos (the Ego4D dataset) could drastically improve a robot policy when fine-tuned on actual robot tasks. Its architecture, like ReinboT, was a GPT-style causal transformer, but GR-1’s novelty was using video prediction as pre-training – the model learned to predict future frames ([OBS] tokens) on human activity videos, then adapted to robot data, where it additionally predicts actions ([ACT] tokens). Essentially, GR-1 learned a lot of visual priors and even some physical intuition from human videos, then was fine-tuned to copy robot demonstrations. This gave it a big jump in performance and generalization: for example, GR-1 achieved a much higher success rate on the CALVIN tasks than prior methods, even when data was limited. However, GR-1’s policy is still purely an imitation policy at heart – during fine-tuning on robot data, it is optimizing a behavior cloning loss (plus maybe the video-prediction loss). It has no notion of reward or return; it just benefits from having seen more diverse videos to not overfit or to better interpret scenes.
RoboFlamingo takes another approach: leverage an existing multi-modal model (Flamingo) that already understands images and language through billions of parameters of pretraining. The RoboFlamingo work adapts Flamingo (a vision-language model) to output robot actions by attaching a small policy head (like an LSTM) on top of it. The idea is to use the powerful visual understanding of Flamingo to guide imitation learning on robot tasks, with minimal fine-tuning (to keep most of the pre-trained knowledge intact). RoboFlamingo is notable for using multi-step observation sequences as inputs to Flamingo, so it provides the VLM with a short video clip rather than a single image, giving it a sense of motion before the LSTM head decides on an action. This approach proved that pre-trained VLMs can be very effective robot imitators, achieving strong performance on manipulation benchmarks and being relatively efficient to adapt. In fact, RoboFlamingo and its improved variants have become a competitive baseline in vision-language policy learning, showing that anyone with a large VLM can fine-tune a robot policy with modest data. One interesting claim from the RoboFlamingo team is that because only a small portion of the model is updated (just the policy head and maybe a few adapters), it’s a “cost-effective and easy-to-use” solution for robotics. And it was able to run on relatively low-end hardware (with some optimizations), which is encouraging for deployment.
When we compare ReinboT to these two, the differences are illuminating. ReinboT did not use massive unlabeled video pretraining like GR-1 did, nor did it leverage a giant pre-trained VLM like Flamingo. It stands somewhat in between: it uses pre-trained encoders (so it’s not training vision from scratch), but its core policy network is trained on the actual robot data, learning the reward-informed policy from that data. This means ReinboT likely needs less total training data to achieve good performance on a specific set of tasks, because it is optimizing for those tasks’ returns, whereas GR-1 and RoboFlamingo rely on generalization from broad pretraining. In the CALVIN mixed-quality benchmark, we see that GR-1 and RoboFlamingo, despite their powerful pretraining, were outperformed by ReinboT when the data had a lot of suboptimal trajectories. For instance, RoboFlamingo could only complete on average 0.83 chained instructions and GR-1 about 1.4, whereas ReinboT (with dense return integration) achieved about 2.26 on average. The reason, as the authors analyze, is that pure imitation (which GR-1 and RoboFlamingo are limited to) can’t fully exploit the mixed-quality distribution – they essentially treat all those trajectories equally and end up “imitating the average,” which isn’t enough to excel. ReinboT, by contrast, distills the best parts of the mixed data (in effect, filtering out the bad tries and emphasizing the good ones via its return predictions). Therefore, even though GR-1 and RoboFlamingo have seen either more data or have bigger models, ReinboT’s policy quality is higher on tasks where data quality matters more than quantity.
That said, GR-1 and RoboFlamingo likely have advantages in other respects. GR-1’s video-pretrained features probably help with visual generalization (lighting changes, new object positions) because of exposure to Ego4D – and indeed GR-1 showed robustness in novel scenes and better adaptation to smaller datasets. RoboFlamingo inherits a broad understanding of language and imagery from Flamingo, so it can handle nuanced instructions or visual inputs that ReinboT might struggle with if they fall outside its training distribution. Also, these models underscore a design trade-off: ReinboT introduced a sophisticated return-prediction mechanism, whereas GR-1 and RoboFlamingo leaned on external data or models. In terms of inference speed, ReinboT is probably comparable or faster. GR-1’s model is moderately sized and runs in one pass, but RoboFlamingo’s Flamingo-based backbone is very large (Flamingo models can be tens of billions of parameters, though open versions might be smaller) – that likely makes RoboFlamingo slower or more resource-intensive per step. ReinboT’s relative simplicity (in using a mid-sized transformer) and one-pass decoding means it remains feasible to run on hardware that a robot might carry or an offboard PC for a real-time control loop. The authors specifically note that RoboFlamingo fine-tuned on “annotated data” (with text) did worse than ReinboT in long-horizon tests, and even a variant of GR-1 using all the data didn’t match ReinboT’s performance. This highlights that architectural bias (return awareness) can sometimes beat sheer model size or pretraining when it comes to complex sequential decision tasks.
PIDM (Predictive Inverse Dynamics Model): World Modeling vs. Return Modeling
Finally, let’s consider PIDM (Predictive Inverse Dynamics Model), an approach that, like ReinboT, emerged to improve robot learning from demonstrations – but via a different angle. PIDM, introduced by Tian et al. (ICLR 2025), closes the loop between vision and action by explicitly predicting the next visual state and then using an inverse dynamics model to output the action needed to get there. In other words, PIDM tries to give the robot a sort of imagination: “if I were to take the best action, what would the next camera image look like? Now, given that imagined next image, figure out the action to achieve it.” It processes visual states and actions with Transformers and is trained end-to-end on large robot datasets (such as the DROID dataset). By doing so, PIDM implicitly learns the dynamics of the environment – it’s a kind of world-model-based policy learning. The authors of PIDM reported significant gains in generalization and efficiency: their model (called “Seer”) outperformed prior state-of-the-art by 22% on CALVIN’s multi-task benchmark (ABC-D) and even 43% on some real-world tasks. These are big improvements, underscoring that integrating a predictive model of the environment can help the policy plan better at each step.
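To make the contrast concrete, here is a loose sketch of the predictive-inverse-dynamics idea (module names and sizes are made up; this is not the Seer implementation): a forward model predicts a future visual feature, and an inverse-dynamics head maps the (current, predicted next) pair to an action.

```python
import torch
import torch.nn as nn

class PIDMSketch(nn.Module):
    """Loose sketch of a predictive inverse dynamics model: imagine the next state, then
    infer the action that bridges the current state and that imagined next state."""

    def __init__(self, d_feat=256, action_dim=7):
        super().__init__()
        self.forward_model = nn.Linear(d_feat, d_feat)     # current feature -> predicted next feature
        self.inverse_dynamics = nn.Sequential(             # (current, predicted next) -> action
            nn.Linear(2 * d_feat, 256), nn.GELU(), nn.Linear(256, action_dim),
        )

    def forward(self, feat_t):
        feat_next_pred = self.forward_model(feat_t)
        return self.inverse_dynamics(torch.cat([feat_t, feat_next_pred], dim=-1))

# Note the contrast: the auxiliary prediction here is a (high-dimensional) future state,
# whereas ReinboT's auxiliary prediction is a (low-dimensional) return.
model = PIDMSketch()
print(model(torch.randn(2, 256)).shape)  # torch.Size([2, 7])
```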
Comparing PIDM to ReinboT is interesting because they both extend imitation learning with additional prediction, but they predict different things. PIDM predicts the next state (and uses that to derive the action); ReinboT predicts the future return (and uses that to influence the action). PIDM’s strength is in physical reasoning – by forecasting what will happen, it can choose actions that make the desired next state reality. It essentially optimizes for accurate transition models, which helps in scenarios where the robot might face novel objects or slight distribution shifts (since a good world model can adapt by imagining outcomes). ReinboT’s strength is in outcome evaluation – by having a notion of return, it can choose actions that lead to success even if some steps look different than the training data, as long as the return model generalizes. A notable limitation of PIDM is that it still doesn’t explicitly understand “good vs bad”; it just tries to make its predicted next image match what an expert demonstration would show, then imitates that action. If all demonstrations are suboptimal in some way (e.g., always taking a long path to reach a goal), PIDM will perfectly predict those suboptimal trajectories and follow them – it has no incentive to find a better path unless that improves visual prediction accuracy (which it might not, since the data defines what’s correct visually). ReinboT, however, would see that those trajectories have lower return (maybe a longer path means lower task progress per step or extra energy cost) and thus try to pick a shorter route that yields a higher return, even if it deviates from the average demonstration. This is a fundamental difference: PIDM optimizes for consistency with the data (through predicted visuals), whereas ReinboT optimizes for the reward in the data.
In terms of architecture complexity, PIDM’s need to predict high-dimensional images (or visual features) at every step is heavy. It uses Transformers to handle vision and a possibly separate module for the inverse dynamics. ReinboT also has an image prediction branch (they included a future image loss), but it’s auxiliary; the primary focus is on the scalar return prediction which is a much lower-dimensional target and arguably easier to learn. That might make ReinboT’s training more stable or data-efficient when it comes to extracting what really matters (the reward), whereas PIDM must learn to generate detailed visual futures (which might or might not be directly tied to task success). It’s telling that in the CALVIN benchmark with limited annotated data, PIDM did better than plain imitation (RoboFlamingo, GR-1) – it achieved an average chain length of 1.73 vs 1.41 for GR-1 – but ReinboT (with offline RL flavor) achieved 2.26. And notably, an older pure RL baseline, RWR (Reward-Weighted Regression), also edged out PIDM in that setting (RWR with dense reward had 1.82). This suggests that when data quality is an issue, focusing on reward yields more gains than focusing on prediction. PIDM shines when visual generalization is key (and indeed, they report strong results on generalizing to new objects, lighting, etc.), whereas ReinboT shines when policy optimality and long-term success are the goals. ReinboT could possibly be combined with PIDM in the future – imagine a model that predicts both the next state and the return – for even more robust decision-making. But as of now, they represent two different philosophies: model the world (PIDM) vs. model the returns (ReinboT).
Inference Efficiency and Deployment Considerations
When comparing all these models, it’s worth noting how ReinboT manages a sweet spot in inference efficiency. We’ve touched on it earlier: ReinboT’s one-pass architecture (thanks to the integrated RTG token) means it requires no additional planning steps or optimizers at run-time. It’s just feed-forward through a transformer, like running any standard policy network. Decision Transformer required a chosen return input and a similar forward pass (so one-pass, but with manual input). Reinformer effectively needed two passes, doubling compute for each action step. RWR or other offline RL baselines might require computing weights or running an optimization to pick actions (though RWR as implemented here was likely also one-pass after training). Approaches like, say, tree search or MCTS-guided transformers are much slower (not in our scope here, but just for context). ReinboT, despite its additional head, likely has only a marginal overhead compared to a vanilla transformer policy – predicting a return scalar is trivial compared to generating an image or even an action vector. And because it doesn’t use an LSTM or recurrence, it can fully utilize parallelism in the transformer across the context window (processing the image, language, etc. together). This is similar to GR-1 and RT-1, which are also single-pass transformers, so they’re in the same ballpark. RoboFlamingo, due to model size, might be slower; PIDM, due to predicting images, might be heavier. In fact, the authors of RoboFlamingo++ (an extended version) note that updating the large VLM involves 1.8B parameters, whereas some efficient methods only adjust ~0.5% of that – which hints that running these huge models is a challenge and one might prefer smaller adapter-based policies for speed. ReinboT’s design being end-to-end does update the whole model, but the whole model is comparatively not enormous (since they intentionally used CLIP and ViT features to reduce token counts and dimensions).
Thus, in real-world deployments like a robot arm in a kitchen or a factory, ReinboT could be advantageous by requiring less tuning (no return parameter to set) and offering straightforward execution. The authors emphasize that not needing to set an initial RTG greatly alleviates the tediousness of manual adjustment and acknowledge that in actual deployment, you often cannot obtain a numeric reward signal at all. ReinboT is ready to go with just an instruction and observations, fitting naturally into existing robotics pipelines where a policy is queried at, say, 10 Hz to output motor commands.
Limitations and Critical Analysis
ReinboT’s approach is certainly promising, but it’s not without caveats. It’s important to critically assess where this method might face challenges:
- Reward Design Dependency: ReinboT’s power comes from that dense reward we discussed – but someone had to design that reward function. In their work, the reward was tailored with four components relevant to manipulation. If you move ReinboT to a very different setting, you’d need to engineer a new dense reward scheme that captures success for that domain. While the authors claim their principles (sub-goals, progress, smoothness, completion) are broadly applicable, crafting a good reward for complex tasks can be tricky. If the reward is misspecified or too coarse, the model might focus on the wrong things. For example, if “smoothness” is over-emphasized, the robot might choose very slow, overly cautious actions to avoid any jerks – technically maximizing smoothness but at the expense of task speed. It requires careful balancing (they did include a reward weight normalization so that each component stays comparable across demos). Unlike pure imitation, which doesn’t need an explicit reward at all (just demonstrations), ReinboT requires a reward model for training – albeit offline. This means deploying ReinboT to a new domain involves a bit more effort upfront (defining and computing dense rewards on your data).
- Optimism vs. Safety: By training with expectile regression, ReinboT is by design optimistic about what can be achieved. This usually improves success rates, but it could have downsides. In some scenarios, being overly optimistic might lead the robot to attempt maneuvers that were never actually successful in the data. If $\tau$ is set too high, the model might predict returns that no one ever saw, and thus select actions that correspond to those imaginary high returns – potentially leading to out-of-distribution actions. The authors note this risk and presumably mitigated it by choosing a moderate $\tau$ and observing the outcomes. Still, one must be cautious: the mechanism is akin to extrapolation. In controlled benchmarks, that worked in ReinboT’s favor. In an open-world environment, a robot that’s slightly too optimistic might, say, try a shortcut that ends up failing (because it didn’t have data for the pitfalls of that shortcut). This is somewhat analogous to the known challenge in offline RL of overestimation – RL algorithms often have to battle the agent getting over-optimistic Q-values and taking bad actions as a result. ReinboT might inherit a milder form of this issue. It’s essentially doing a single-step value overestimation (bounded by data distribution knowledge). So a critical eye is needed to ensure ReinboT doesn’t develop bad habits by chasing phantom rewards.
- No Online Correction: Unlike a true RL algorithm, ReinboT does not refine itself with new experience. If it starts executing in a slightly different environment or if it does make a mistake, there’s no mechanism to correct course via reward feedback because it’s not actively using rewards at test time (it’s only using its learned model of rewards). It’s still fundamentally a fixed policy (closed-loop in the sense that it observes a new state each step, but with no learning loop). This means if the robot encounters an out-of-distribution situation, its predicted returns might be unreliable. For example, if a new obstacle appears that the model never saw, its return predictor might still optimistically assume it can complete the task with a high return. It will choose actions accordingly and potentially get into trouble. Without online learning or at least online re-planning, it might not recover gracefully. Some classic RL robustness – like being able to learn from a failure – is not present. ReinboT must rely on the breadth of its offline data and the generality of its learned reward model to handle novel scenarios. The authors did test real-world few-shot and OOD scenarios (like new objects, new placements) and reported that ReinboT showed strong generalization in those cases. Likely the reason is that the dense reward model captures fundamental aspects of tasks (like distance to target) that generalize to new objects, so the policy can still evaluate and act reasonably. But if the physical situation changes radically (say a new kind of manipulation or a hardware change), additional fine-tuning or an updated reward model might be needed.
- Complexity and Debuggability: ReinboT’s architecture, while efficient at run-time, is quite complex internally. It’s juggling multiple modalities and losses: vision, language, proprioception, action, return, and even future image predictions. Balancing these losses (they mention weights for each modality’s loss) is non-trivial. If the Return loss dominates, the model might sacrifice accurate action imitation in favor of guessing high returns (leading to the above optimism problem). If the action loss dominates, the return prediction might become a noisy afterthought (thus weakening the RL effect). The authors used previous work’s weights for modalities and introduced a weight for the RTG loss to tune its influence. Getting these settings right likely required experimentation. Moreover, debugging a decision transformer is already hard (you can’t easily interpret what each attention head is doing), and now with an integrated return, interpreting why the robot made a certain decision is even trickier. Was it because it predicted a high return erroneously? Did it think the sub-goal was achieved? There’s an argument that adding the return token actually improves interpretability a bit – at least you can log the model’s predicted RTG at each step and see its estimation of success. The authors indeed visualize the distribution of predicted returns vs. actual returns, which can provide insight (e.g., they showed ReinboT tends to predict higher returns than the average trajectory, but still within plausible range). However, if something goes wrong, there are more moving parts to consider (maybe the reward model was off, maybe the action module misused the return info, etc.).
- Scope of Optimality: ReinboT assumes the best trajectory in the offline dataset is close to optimal. It pushes towards maximizing returns within the data distribution. If the dataset is truly mixed-quality but still contains near-optimal examples, this works great (it will identify and emulate those). But if the entire dataset is suboptimal, lacking any good examples of some aspect, ReinboT can’t magically invent a new strategy. It will do better than the average of the data (maybe), but it’s fundamentally limited by the returns it has seen. For instance, if in all demos a certain task was never fully completed (maybe human demos always fell slightly short), the maximum return in the data is still that suboptimal outcome. ReinboT will predict that maximum (maybe overshoot a bit, but not by a huge margin or else risk being unrealistic) and plan for that – meaning it might also fall short of true success. This is a limitation of any offline method: without any successful reference, it’s hard to learn success. One would need to integrate some external knowledge or exploration. ReinboT doesn’t solve that; it just utilizes the data better. It still needs at least some successful or high-quality data to point it in the right direction.
- Generalization vs. Specialization: ReinboT demonstrated excellent generalization in their tests, such as few-shot adaptation to real-world tasks and handling unseen combinations of sub-goals. However, one might wonder if its focus on rewards makes it overfit to the reward structure of a specific benchmark. For example, if the reward heavily penalizes energy use, the policy might develop an unnatural hesitancy or overly smooth motions that aren’t strictly necessary for success but are driven by that reward component. If we then test it in a scenario where maybe speed is more important than energy efficiency (a different implicit reward), it might behave less optimally. Pure imitation learners like RT-1 or RoboFlamingo, which mimic human behavior, might sometimes be more adaptable if the distribution shifts to a new “style” of behavior (assuming the instructions still make sense), because they don’t have an internal bias beyond what they saw. ReinboT’s bias is “whatever yields high dense return as previously defined.” If that definition of high return doesn’t perfectly align with what we ultimately want in a new scenario, there could be a discrepancy. In fairness, though, the reward was designed to align with pretty general good practices (complete tasks, make progress, be smooth, finish goals), which likely align with what we want in many cases.
In conclusion on limitations: ReinboT is a sophisticated solution that undoubtedly raises the performance ceiling for offline-trained robot agents on complex tasks. Its integration of RL principles is clever and pragmatic. But users of ReinboT should be aware that it bakes in certain assumptions (through the reward model and expectile tuning) that might need revisiting when moving to new domains. It doesn’t eliminate the need for human insight – instead of curating data, now one curates reward functions and expectile parameters. Fortunately, those are relatively intuitive knobs (compared to say designing a full RL algorithm). And the benefits it reaps – significant boosts in success rates, more robust chaining of sub-tasks, and improved few-shot learning – make it a compelling approach in the toolkit of robot learning methods.
Conclusion: A New Paradigm for Learning from Offline Data
As we wrap up, let’s take a step back and appreciate what ReinboT represents in the evolution of robot learning. It’s part of a broader trend of convergence between imitation learning and reinforcement learning. For years, these were parallel tracks: one said “just imitate the experts,” the other said “learn by trial and error to maximize reward.” ReinboT elegantly blends the two by imitating with an eye towards return maximization. It leverages the stability and simplicity of imitation (no on-policy exploration, no environment noise in training) while sneaking in the objective-driven nature of RL via return tokens and expectile loss. In doing so, it addresses one of the thorniest issues in scaling robot learning: mixed-quality data. Rather than throwing out “bad” demos or relying on human labeling of quality, it learns to judge quality itself and prioritize the good – a very human-like capability if you think about it.
In our comparison with contemporaries, we saw that every model has its recipe: some add more data (RT-1), some add language knowledge (RT-2), some add world models (PIDM), some add bigger backbones (RoboFlamingo), and ReinboT adds an internal reward engine. These approaches aren’t mutually exclusive; indeed, one can imagine future systems that combine them – e.g., a RoboFlamingo that also predicts returns, or a Decision Transformer that uses pre-trained vision models and expectile regression. But ReinboT’s success in its own right – state-of-the-art on a challenging benchmark and impressive real-world demos – shows the merit of rethinking the architecture of our policies. The design trade-offs ReinboT navigated (slightly more complex training for a much better policy, a small bit of reward engineering for a big gain in robustness) seem well worth it in hindsight. It managed to significantly outperform pure imitation learners, and even an offline RL baseline (RWR), indicating that its integration was more than just academic – it yielded real returns (pun intended).
For a professional audience in robotic manipulation, ReinboT is an exciting development because it hints at a future where robots can learn from large heterogeneous datasets and still strive for optimal performance. We no longer have to either collect only perfect demos or settle for average behavior. A ReinboT-like model can take the messy, varied data that inevitably comes from scaling up robot learning (with failures, human-in-the-loop corrections, varying skill levels) and turn it into a high-performing policy by effectively assigning credit and blame internally via returns. It’s a bit like giving the robot intuition about which experiences were successful, something that typically only RL could imbue through many trials. And it achieves this without the trials – just by clever use of the data we have.
There are certainly more questions to explore. How well does ReinboT scale with even larger models or more diverse tasks? Can the idea generalize beyond manipulation, say to navigation or multi-agent settings? What if we combine it with online fine-tuning – would that break or enhance it? These are open areas. But one thing seems clear: the ReinboT paper has illustrated a powerful architectural motif – the policy as its own reward predictor – which could inspire many future works.
In the grand scheme, ReinboT differs from prior VLA models in that it embraces the objective (maximize return) within the model’s core, rather than leaving it outside as a separate algorithm. This blurring of model and objective might well be a hallmark of next-generation learning systems. As we conclude this deep dive, the take-home message is that ReinboT provides a compelling example of how adding a “sense of reward” to a vision-language-action model can dramatically enhance its decision-making prowess. It’s an approach that doesn’t just imitate but amplifies – turning a rainbow of varied demonstrations into a laser-focused beacon for robotic performance. And with that, we wrap up today’s exploration. Thank you for listening, and stay tuned for more insights on the frontier of robot learning!
Sources:
- Zhang et al., “ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning,” ICML 2025 (architecture and overview)
- ReinboT paper, Sec. 4 (Methodology) (dense return, RTG token, expectile details)
- ReinboT paper, Sec. 5 (Experiments) (performance comparisons and analysis)
- Chen et al., “Decision Transformer,” 2021 (discussed via ReinboT references)
- Zhuang et al., “Reinformer: Max-Return Sequence Modeling for Offline RL,” ICML 2024 (Reinformer)
- Wu et al., “GR-1: Large-Scale Video Pre-training for Robot Manipulation,” arXiv 2023 (GR-1 model)
- Li et al., “RoboFlamingo: VLMs as Effective Robot Imitators,” 2023 (RoboFlamingo model)
- Brohan et al., “RT-1” (2022) and “RT-2” (2023), Robotics Transformer models (model descriptions)
- Tian et al., “Predictive Inverse Dynamics Models (PIDM),” ICLR 2025 (PIDM approach and performance)
- Additional references from ReinboT and related work for context, as cited above.