Episode 5: Beyond OpenVLA – The Evolving Landscape of Vision-Language-Action Systems

This is it – our final episode in the series. So far, we’ve focused on OpenVLA itself. Now we’re zooming out to the bigger picture of Vision-Language-Action models and where things are headed. Think of this as a round-table tour of the “who’s who” and “what’s next” in VLA. We’ll discuss some companion models, improvements, and future themes: from Google’s robotic transformers to new open models like Octo, to special techniques like RT-Trajectory, and even the latest forays into humanoid robots. If you want to know how OpenVLA fits into the grand scheme and what the road ahead looks like, stay tuned.

OpenVLA, as we’ve discussed, emerged as a pivotal open-source project in 2024. But it stands on the shoulders of prior work and exists alongside other innovative approaches. Let’s start with a quick historical context:

  • Google’s RT series: These were among the first large-scale VLA-type efforts. RT-1 (2022) was a multi-task model that took images and natural-language commands and output robot actions, trained on about 130k demonstrations collected at Google. It was largely tied to a specific robot (a mobile manipulator), but it proved the concept that a single network could handle many tasks. RT-2 (mid-2023) was a breakthrough: it incorporated a pretrained vision-language model (similar in spirit to OpenVLA) to give the robot broader visual and semantic understanding. RT-2 was essentially the first true VLA foundation model – but it was closed-source. Google reported that RT-2 – and RT-2-X, its 55-billion-parameter variant trained on the Open X-Embodiment data – could reason about novel objects using web-scale knowledge and performed well across many tasks. OpenVLA’s claim to fame was beating RT-2-X on its benchmark with an open model roughly one-seventh the size, as we mentioned earlier. That was a big deal, because it suggested that open collaboration and data diversity can trump sheer scale and proprietary data.

  • Octo: Around the same time as OpenVLA’s debut, another team (with many of the same institutions involved) introduced Octo, described as an “Open-Source Generalist Robot Policy”. Octo takes a different approach under the hood: it’s a transformer-based diffusion policy rather than a language model generating discrete action tokens. It too was trained on a large chunk of the Open X-Embodiment data (about 800k episodes). Octo’s architecture is much smaller (tens of millions of parameters versus billions), but it uses a diffusion head to output continuous actions, and it’s very flexible in its input and output modes. For example, Octo can condition on a goal image (you show it a picture of the desired end state) instead of, or in addition to, a language instruction. It can also handle observation histories and other sensor inputs, making it a very general framework. In evaluations, Octo, despite its smaller size, performed on par with the big RT-2-X on many tasks and outperformed previous open models like RT-1-X. However, because it doesn’t have a large language model inside, tasks that require complex language understanding may favor something like OpenVLA. On the other hand, Octo’s diffusion-based action decoder could have advantages for fine-grained or tightly coordinated continuous motions, since diffusion models are good at producing smooth trajectories. Both OpenVLA and Octo benefited hugely from the OpenX dataset – they’re like siblings that explored two different technical pathways using the same data treasure trove.

  • RT-Trajectory: Now, this is an interesting “feeder” method rather than a standalone model. RT-Trajectory was introduced by Google DeepMind researchers as a way to help any RT-style model generalize better by giving it hints on how to do a task. The idea is brilliantly simple: during training, take the video of a robot demonstration and overlay a 2D sketch line showing the path the robot’s gripper took. In other words, draw the trajectory. This sketch acts as an extra visual prompt, indicating the relevant motion for the task. Why do this? Because models like RT-2 or OpenVLA otherwise have to infer the necessary motion implicitly from the instruction and the scene; a visual trajectory prompt explicitly teaches the model the motion pattern. The RT-Trajectory system can even take a human-drawn path, or infer one from human videos, and incorporate that. Models trained with these trajectory overlays performed better on long-horizon tasks like wiping a table, since they “understand” the sweeping motion needed. Essentially, RT-Trajectory gives VLA models a form of procedural memory – a notion of the shape of the motion required, not just the goal. This concept was mirrored by an academic work named TraceVLA, where researchers fine-tuned OpenVLA with similar visual traces of past motion to improve spatial-temporal awareness. TraceVLA showed massive improvements: on real robot tasks, it was about 3.5× more successful than base OpenVLA after adding the visual trace cue and fine-tuning on 150k new trajectories. (If you want to picture what “drawing the trajectory on the image” means in practice, there’s a tiny code sketch right after this list.) The success of RT-Trajectory and TraceVLA tells us that adding memory of how the robot moved (not just where things are now) can dramatically help on tasks where the sequence of moves matters – drawing a circle versus a line, say, where the final state might look similar but the path differs.

  • SARA-RT: Another innovation from the DeepMind camp, SARA-RT stands for Self-Adaptive Robust Attention for RT models. This one is aimed at making the giant transformers more efficient. The researchers developed an attention mechanism that reduces the computation from quadratic to linear in sequence length, and they fine-tuned RT-2 with it. The result: a model that was 14% faster and 10.6% more accurate than the original when given the same short history of images – so the speedup actually came with an accuracy gain, not a loss. SARA-RT demonstrated that we can trim the fat in these big models – optimizing the transformer computations – to achieve real-time performance improvements. It’s a bit technical, but it essentially means a more scalable model: you can give it longer image sequences (more temporal context) without blowing up compute, and you can run it faster. (There’s a small sketch of the quadratic-versus-linear attention idea after this list.) For practitioners, techniques like SARA-RT mean you might not need such heavy hardware to deploy a VLA model, and you get quicker responses, which is always good for a robot operating in a dynamic environment.

  • TinyVLA and Smaller Models: While one trend is making VLAs bigger and better, another is making them smaller and more accessible. TinyVLA (from Wen et al., 2024) introduced a family of compact VLA models in the sub-1-billion-parameter range, aiming for solid performance at a fraction of the size. TinyVLA models boasted faster inference and data efficiency – in other words, they required less training data to reach good performance on certain tasks. One of their key results was that a well-designed 600M-ish-parameter VLA, trained appropriately, could outperform older, larger models on a set of manipulation tasks while being much cheaper to run. This is encouraging for practical robotics: not every robot has a GPU, and not every application can afford a server for control. If a model can be shrunk to run on an embedded processor (imagine a future where a household robot has a $100 chip running a TinyVLA-like model onboard), that could open up widespread commercial use. There are also MiniVLA and the playfully named SmolVLA, variations exploring model compression and distillation. The Stanford AI blog teased “MiniVLA” as having a smaller footprint but better performance – presumably through clever training strategies (we might guess that things like knowledge distillation from a big model into a smaller one are involved). All these efforts tell us that after the initial rush to scale up, the community is now also focused on right-sizing VLAs – maintaining the magic while cutting down size and cost.

  • New Embodiments – Humanoids: So far, most VLA work has targeted robotic arms doing manipulation tasks. But what about more complex robots, like humanoids or bi-manual systems? 2025 has seen exciting developments here. Figure AI, a startup building humanoid robots, introduced Helix, a VLA model specifically for controlling a full humanoid upper body. Helix is unique because it’s a dual-system model: it splits into a slow “System 2” (an open-weight 7B vision-language model running at roughly 8 Hz) that handles vision and language understanding, and a fast “System 1” (an 80M-parameter visuomotor policy running at 200 Hz) for fine motor control. The two systems pass information – S2 hands a latent goal vector to S1 – and they are trained together end to end. This design cleverly addresses a challenge we mentioned: big VLMs are too slow for high-frequency control, and low-level controllers are too dumb to generalize – so Helix marries them. (There’s a toy sketch of this two-speed control loop after this list.) The results? Helix achieved some firsts: controlling two humanoid robots collaboratively with the same model (Figure shows two human-sized robots working together to put away groceries), controlling fine finger movements along with whole torso and arm motions, and doing it all onboard low-power hardware in real time. They essentially demonstrated an on-hardware deployment of a VLA in a humanoid, which is futuristic! Similarly, NVIDIA Research released GR00T N1, an open foundation model for humanoids, also using a dual-system approach (a vision-language module plus a diffusion-based action module) trained on a mix of real and synthetic data. GR00T N1 can handle bimanual tasks and was shown to transfer well across simulations and onto a real humanoid (the Fourier GR-1 robot). The fact that both Helix and GR00T use two-part architectures (one part for high-level perception and reasoning, one for low-level rapid control) suggests a direction for future VLAs: hybrid models that combine the strengths of slow thinking and fast reflexes, akin to Daniel Kahneman’s “System 1 vs. System 2” idea applied to robots. This could overcome the latency issue and also handle high-DoF robots that are challenging for end-to-end tokenization approaches (imagine trying to token-encode every joint of a humanoid – Helix’s designers argue it’s easier to output continuous control directly at that scale).

  • Multimodal and Other Extensions: Researchers are also extending VLAs beyond vision alone. Some are integrating audio (so a robot could respond to spoken instructions or even use sound as part of its understanding), and some are adding force or touch sensing so the model can feel what it’s doing. Others are exploring planning hybrids – combining these learned policies with traditional planners or logical reasoning. A notable example from Google was SayCan (2022), where a language model (PaLM) would suggest possible actions and a learned value function would check their feasibility, forming a loop for high-level planning (there’s a tiny scoring sketch of this idea below). While not a VLA in architecture, SayCan showed the value of combining pure reasoning with actual action execution on a robot. One can imagine future systems where a big brain (GPT-4 or a future multimodal model) plans a high-level strategy (“you need to find the coffee in the kitchen, then bring it over”) and a VLA model like OpenVLA executes the low-level skills. In fact, Google’s AutoGPT-style robotics experiment, AutoRT, had a setup where an LLM would propose creative tasks and direct multiple robot arms to collect data or perform tasks autonomously. That hints at a future where VLAs are part of a larger autonomous system: the LLM provides flexible goal selection and reasoning, and the VLA handles the dirty work of physical execution.
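Before we move on, a quick aside for the hands-on listeners: here are a few tiny code sketches for the ideas flagged above. The trajectory-overlay trick from RT-Trajectory and TraceVLA is the easiest to picture – the snippet below is a minimal illustration using PIL, not code from either paper, and the coordinates are made up.

```python
# Minimal illustration of an RT-Trajectory / TraceVLA-style visual prompt:
# draw the gripper's 2D path onto the camera frame before feeding it to the policy.
# This is a sketch with made-up coordinates, not code from either paper.
from PIL import Image, ImageDraw

def overlay_trajectory(frame, gripper_path_xy):
    """Return a copy of `frame` with the gripper's pixel-space path drawn on it."""
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)
    draw.line(gripper_path_xy, fill=(255, 0, 0), width=4)         # the sketched trajectory
    x, y = gripper_path_xy[-1]
    draw.ellipse([x - 6, y - 6, x + 6, y + 6], fill=(0, 255, 0))  # mark the current/end point
    return annotated

frame = Image.new("RGB", (256, 256), "gray")            # stand-in for a real camera image
path = [(40, 200), (80, 160), (130, 140), (190, 120)]   # hypothetical gripper positions
prompted_frame = overlay_trajectory(frame, path)        # this annotated image goes to the policy
```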
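Next, the gist of why swapping quadratic attention for a linear variant (the kind of change SARA-RT makes) saves so much compute. To be clear, this is a generic kernelized-attention sketch in PyTorch, not DeepMind’s actual SARA-RT mechanism – the point is just the reordering of the matrix multiplications.

```python
# Generic linear-attention sketch (not SARA-RT's exact method).
# Standard attention builds a T x T score matrix, so cost grows quadratically with
# sequence length T. A positive feature map lets us compute K^T V first (a d x d matrix),
# making the cost linear in T.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5  # (T, T) -- the quadratic part
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # simple positive feature map
    kv = k.transpose(-2, -1) @ v                             # (d, d), built in O(T * d^2)
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / normalizer

T, d = 512, 64                              # e.g., a long history of image tokens
q = k = v = torch.randn(T, d)
out_quadratic = softmax_attention(q, k, v)
out_linear = linear_attention(q, k, v)      # plays a similar role, much cheaper for large T
```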
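The Helix-style “two speeds” idea can also be sketched as a toy control loop. Nothing below is Figure’s or NVIDIA’s code – every function is a placeholder, and the 8 Hz / 200 Hz numbers are just the rates quoted publicly – but it shows how a slow vision-language module and a fast policy can share one loop through a latent goal.

```python
# Toy dual-rate "System 2 / System 1" loop, loosely inspired by Helix and GR00T N1.
# All functions below are placeholders, not real robot or model APIs.
import time

SLOW_HZ, FAST_HZ = 8, 200                        # rough rates quoted for Helix's S2 and S1

def system2_vlm(image, instruction):
    """Stand-in for the big VLM: digest vision + language into a compact latent goal."""
    return [0.0] * 64

def system1_policy(latent_goal, joint_state):
    """Stand-in for the small high-rate policy: latent goal + robot state -> motor command."""
    return [0.0] * 7

def read_camera():                               # placeholder camera driver
    return None

def read_joints():                               # placeholder proprioception
    return [0.0] * 7

latent_goal = None
fast_steps_per_slow_update = FAST_HZ // SLOW_HZ  # ~25 fast control steps per slow "thought"

for step in range(2000):
    if step % fast_steps_per_slow_update == 0:   # ~8 Hz: re-read the scene, refresh the goal
        latent_goal = system2_vlm(read_camera(), "put away the groceries")
    command = system1_policy(latent_goal, read_joints())  # ~200 Hz: react to the latest state
    time.sleep(1.0 / FAST_HZ)
```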
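And the SayCan-style “LLM proposes, value function checks feasibility” loop from the last bullet boils down to a scoring rule. The function names and dummy scores here are mine, not Google’s API – it is just the idea in miniature.

```python
# The core SayCan scoring idea in miniature: combine "is this skill useful for the
# instruction?" (from the language model) with "can the robot actually do it from
# this state?" (from a learned value/affordance function). Illustrative names only.
def choose_next_skill(instruction, state, skills, lm_usefulness, affordance):
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        score = lm_usefulness(instruction, skill) * affordance(skill, state)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

# Hypothetical usage with dummy scoring functions:
skills = ["find the coffee", "pick up the coffee", "bring it to the user"]
pick = choose_next_skill(
    "bring me a coffee",
    state={"holding": None},
    skills=skills,
    lm_usefulness=lambda instr, s: 1.0 if "find" in s else 0.5,  # fake LM score
    affordance=lambda s, st: 0.9,                                # fake value function
)
print(pick)  # -> "find the coffee"
```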

As we stand here in 2025, the VLA field is vibrant. OpenVLA was a catalyst for openness and set a high bar for performance. Now we see a healthy ecosystem: open datasets (OpenX), open models (OpenVLA, Octo, GR00T, etc.), and targeted research addressing specific limitations (like RT-Trajectory for memory, SARA-RT for efficiency, TinyVLA for compactness, Helix for high-DoF control). All these pieces complement each other. It’s not hard to imagine that in a couple of years, we’ll have even more capable models that integrate these advances – e.g., an “OpenVLA 2.0” that has built-in trajectory memory, linear attention, maybe a dual-speed architecture, and can run on a modest GPU or even on a robot’s onboard computer.

One theme that’s emerging is that “generalist robotics” doesn’t mean one model doing everything alone. It can also mean an ensemble of specialized models working in concert – like Helix’s two subsystems, or an LLM-plus-VLA combo. The common thread is leveraging foundation-model strengths (broad knowledge and adaptability) for physical tasks. We’re essentially witnessing the convergence of the AI revolution with robotics: as large language and vision models matured, it was natural to apply them to robots, and that’s exactly what the VLA approach does.

For those listening who are excited by this, the fact that so much of it is open-source means you can actually tinker with these models. You can download OpenVLA or Octo from Hugging Face, grab GR00T’s code from NVIDIA, and so on. This openness accelerates progress: more eyes on the code, more experiments – maybe someone fine-tunes OpenVLA for drones or for a home-cleaning task and shares the results.
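If you want to try that yourself, here’s roughly what loading OpenVLA from Hugging Face looks like, based on the project’s published usage – treat the details (the model ID, the prompt format, the predict_action helper and its unnorm_key argument) as things to double-check against the official README rather than gospel.

```python
# Rough sketch of running OpenVLA from Hugging Face, adapted from the project's
# published example -- double-check names and arguments against the official README.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("camera_frame.png")            # the robot's current camera view
prompt = "In: What action should the robot take to pick up the mug?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# predict_action is a convenience method the OpenVLA checkpoint adds via trust_remote_code;
# unnorm_key picks which dataset's action statistics to un-normalize with.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)                                     # 7-DoF end-effector delta + gripper command
```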

It’s been a long journey from Episode 1 to now. We started with the basics of what VLA models are and built up through OpenVLA’s design, training, real-world use, and now the broader landscape and future directions. If there’s one key takeaway from this series, I’d say it’s that Vision-Language-Action models are pushing robotics into a new era of generality and flexibility. Robots are learning in ways more akin to humans – through large-scale experience and multimodal understanding – and that’s enabling them to do things we used to only dream about or see in sci-fi. It’s not science fiction anymore when an off-the-shelf robot arm can be told “make me a cup of coffee” and, with a model like those we discussed, actually attempt to do it using its eyes and training.

We’re still in the early days; plenty of challenges remain (how to ensure safety, how to handle truly novel situations beyond the training distribution, how to further reduce the need for huge datasets by using simulation or self-play, and so on). But given the rapid progress from 2022 to 2025, one can’t help but be optimistic about what’s coming by 2030: perhaps home-assistant robots guided by VLAs, factory robots that can be repurposed with a simple voice command, or disaster-response bots that understand human language instructions in the field.

OpenVLA and its peers have opened the door. Now it’s up to the robotics and AI community to walk through it, together.

And that brings us to the end of our series! We’ve navigated the ins and outs of OpenVLA and journeyed through the evolving world of vision-language-action systems. From model architecture to real robot demos to the cutting-edge innovations enhancing these models, it’s been a fascinating ride. I hope you’ve enjoyed these episodes and learned a lot about how robots are starting to learn our language and our way of understanding the world. The field is moving fast, so keep an eye out – the next big breakthrough might be just around the corner. Thank you so much for listening, and until next time, remember: the future where robots can see, talk, and act is coming, and it’s open-source! Stay curious and take care.