Episode 1: From Vision and Language to Action – An Introduction to VLAs and OpenVLA
Hello and welcome! In this first episode, we’re laying the groundwork for our journey into Vision-Language-Action systems. Today we’ll answer: What is a Vision-Language-Action model, and why is OpenVLA making waves in robotics? So grab your headphones and let’s dive into the world where seeing, speaking, and doing all come together.
Imagine telling a robot, “Pick up the red block and put it on the table,” and it just does it—no hard coding, no task-specific training required. That’s the promise of Vision-Language-Action (VLA) models. In robot learning, a VLA model is essentially a multimodal foundation model that integrates vision, language, and actions, so that given an image (or video) of the robot’s surroundings and a text instruction, it can directly output low-level actions for the robot to execute. In other words, VLAs let robots see the world, understand our words, and then act to accomplish tasks, all within one unified AI model.
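For listeners following along with the episode notes, here's a rough, purely illustrative sketch of that input-to-output contract in code. Everything below (the class name, method, action layout, and the camera/robot helpers) is an assumption made up for illustration, not any particular library's actual API.

```python
# A minimal, purely illustrative sketch of the VLA input/output contract described
# above. The class name, method, and action layout are assumptions for these notes,
# not OpenVLA's actual API.
import numpy as np

class VLAPolicy:
    """Hypothetical wrapper: one camera frame + one instruction -> one low-level action."""

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # image: RGB frame of the robot's workspace, e.g. shape (224, 224, 3)
        # instruction: natural-language command, e.g. "pick up the red block"
        # returns: a continuous action, e.g. 7 values for end-effector deltas
        #          (x, y, z, roll, pitch, yaw) plus a gripper open/close command
        raise NotImplementedError  # a real VLA would run its neural network here

# Typical control loop: re-query the model every timestep until the task is done.
# policy = VLAPolicy()
# frame = camera.read_rgb()                                  # hypothetical camera helper
# action = policy.predict(frame, "pick up the red block and put it on the table")
# robot.execute(action)                                      # hypothetical robot interface
```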
This concept is a big shift from traditional robotics. For decades, teaching a robot a new skill meant laboriously programming it or training a separate model for each specific task, each robot, even each environment. VLAs offer a more general solution: train a single large model on a broad set of language-annotated robot demonstrations so it learns how to do many things. Then, when given a new instruction (within its learned capabilities), it can figure out the appropriate actions on the fly. Early examples of this idea emerged around 2022–2023. Google’s Robotic Transformer series pioneered the approach: RT-1 showed that a transformer could learn multi-task policies from demonstration data, and RT-2 went further by building on a pretrained vision-language model, letting the robot draw on “internet-scale” visual and semantic knowledge when carrying out commands. However, those cutting-edge models were mostly proprietary, closed systems. Researchers and developers outside those companies couldn’t fully access or build upon them, which slowed broader adoption.
Enter OpenVLA. The name says it all: it’s an open-source Vision-Language-Action model, released to the public in 2024 as a collaboration between Stanford, Berkeley, Google DeepMind, and others. The team behind it wanted to address two major hurdles that were holding robotics back: (1) existing state-of-the-art VLA models were largely closed and inaccessible, and (2) there hadn’t been much work on how to efficiently fine-tune these hefty models to new tasks or robots. OpenVLA was their answer to both problems. It’s a 7-billion-parameter VLA model trained on a whopping 970,000 real-world robot demonstration episodes drawn from the communal Open X-Embodiment dataset. By open-sourcing this model and its training code, they aimed to give the research community a foundation to build on, rather than everyone reinventing the wheel behind closed doors.
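If you want to poke at the model yourself, the released checkpoint is hosted on Hugging Face. Below is a short loading-and-inference sketch based on my recollection of the project's published example; the specific identifiers (the "openvla/openvla-7b" model ID, the predict_action helper, and the unnorm_key argument) come from that release and should be double-checked against the official OpenVLA README, since the exact API may have changed.

```python
# A usage sketch, assuming the publicly released "openvla/openvla-7b" checkpoint and
# the predict_action / unnorm_key interface from the project's published example;
# verify the exact, current API against the official OpenVLA README.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,   # half precision keeps the 7B model on a single modern GPU
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("workspace_frame.png")   # hypothetical path to a camera image of the scene
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# unnorm_key selects the action de-normalization statistics for a particular robot/dataset
# (here the BridgeData WidowX setup); do_sample=False gives deterministic decoding.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)   # a 7-D end-effector action: Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper
```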
So what makes OpenVLA particularly exciting? First, its sheer scope: with nearly a million demonstrations as training data, it has seen an unprecedented variety of robots, tasks, and scenarios during training. It has essentially learned from many human teleoperators and scripted robot trials across different labs, which gives it a breadth of knowledge and “common sense” about manipulation that a smaller, task-specific model wouldn’t have. Second, it’s generalist and multi-embodiment by design. The trained policy isn’t just for one robot arm or one lab’s setup; it supports controlling multiple different robot platforms out of the box. In their experiments, the very same OpenVLA model could control a small WidowX arm in one setting and a different, larger robot platform in another, without retraining for each. That kind of cross-robot generality was virtually unheard of previously. And third, OpenVLA set a new performance benchmark for these kinds of models. It achieves state-of-the-art results on general robot manipulation tasks, even outperforming some much larger closed models. For example, the team reported that OpenVLA beat Google’s 55-billion-parameter RT-2-X by 16.5 percentage points in absolute task success rate on a broad evaluation suite, despite being only 7B parameters itself (roughly 7× smaller). That was a stunning result: an open model not only matching but actually topping the closed competition on many tasks.
We’ll unpack all those points in detail in upcoming episodes. But the takeaway is: OpenVLA represents a milestone where the robotics community has a powerful, shared foundation model to drive robots—much like how GPT-style models became foundation models for NLP. With OpenVLA available, researchers can fine-tune it to new tasks or new robots with relatively little data (instead of training from scratch), and practitioners can experiment with a ready-made system that understands vision and language together.
In this series, we’ll explore how OpenVLA works under the hood, how it was trained and how you can adapt it, how it’s been tested on real robots, and how it fits into the broader wave of VLA systems revolutionizing robotics. By the end, you should have a clear picture of why so many in the field are excited about Vision-Language-Action models—and how OpenVLA is accelerating progress toward robots that truly see, understand, and act.
That’s a wrap for our introductory episode. We defined what Vision-Language-Action models are and saw how OpenVLA burst onto the scene as a game-changer in open robotics. In our next episode, we’ll get technical and dig into OpenVLA’s architecture—how do you actually design a single model that takes in images and words and outputs robot actions? Spoiler: it involves some clever fusion of vision models and a language model. Until then, thanks for listening, and get ready to go deeper into the machine!