Episode 2: Under the Hood of OpenVLA – Architecture and Inference
Welcome back! Last time we talked about what OpenVLA is at a high level. Now it’s time to lift the hood and see how this engine runs. How can one AI model look at a camera image, read a command, and then generate robot arm motions to fulfill it? In this episode, we’ll break down OpenVLA’s architecture and discuss how it processes inputs and produces actions. If you’re into AI model design (or just curious how the sausage is made), this one’s for you.
So, let’s start with the big picture. OpenVLA’s architecture combines components for vision, language, and action in a single model. In fact, you can think of it as three parts working together in sequence:
- Visual Encoder: OpenVLA uses a fused visual encoder that merges two pretrained vision models – DINOv2 and SigLIP – into a single perception module. Both are advanced vision backbones: DINOv2 is a vision transformer trained with self-supervision (excellent at extracting image features), and SigLIP is a vision-language model similar to CLIP (trained on image-text pairs, so its features have some alignment with language semantics). By combining them, OpenVLA gets rich visual representations of the robot’s camera input: essentially a set of image patch embeddings that encode what’s in the scene in front of the robot. This fused encoder lets the model “see” both the raw visual details (from DINOv2) and more abstract, nameable concepts in the image (thanks to SigLIP’s language-aware training).
- Projection Layer: Next, the model has a projector module. Its job is to take those visual embeddings from the encoder and map them into a form the language model can understand. Remember, the core of OpenVLA is built on a large language model (more on that in a second). Language models operate on sequences of tokens (basically numbers representing words or subwords). The projector acts like an interpreter, converting visual information into a sequence of vectors in the same space as the language token embeddings. You can think of this as translating “vision-speak” into “language-speak” internally. After this step, the original image has been transformed into a series of latent tokens that represent what’s going on in the scene, in a format the language-model brain can work with.
- Language Model & Action Decoder: The final piece is the LLM backbone, which in OpenVLA is a variant of Llama 2 with 7 billion parameters. This is essentially the “brain” that takes the text instruction (e.g. “pick up the block and place it on the shelf”) plus the visual tokens from the projector, and then generates a sequence of output tokens that correspond to actions. OpenVLA’s LLM has been fine-tuned so that its “vocabulary” includes not just English words but also special tokens that represent robot actions. In practice, its output is not an English sentence – it’s a string of discrete action tokens, which are then decoded back into continuous commands that drive the robot’s motors. Concretely, OpenVLA’s output at each inference step is a set of 7 numbers representing the robot’s end-effector motion: (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) – basically a small displacement in 3D space, a change in orientation, and an open/close amount for the gripper. Each of those numbers is produced as a token – the model outputs one token per degree of freedom, quantized to a discrete range, rather than a literal floating-point number. This discrete token representation is a common approach in VLAs like RT-2 and OpenVLA, because it lets the action space be handled just like language generation. Finally, a simple decoding step maps those tokens back to actual robot commands by un-normalizing them (scaling by pre-calibrated factors for the specific robot) so the robot can execute the action. A minimal sketch of this decoding step follows right after this list.
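To make that last decoding step concrete, here is a minimal sketch of how a handful of discrete action tokens could be mapped back to a continuous 7-DoF command. The bin count, the normalized range, and the per-dimension min/max values are illustrative assumptions for a hypothetical robot, not numbers taken from the OpenVLA release:

```python
import numpy as np

# Illustrative assumptions: 256 bins over a normalized range [-1, 1],
# and made-up per-dimension min/max statistics for un-normalization.
NUM_BINS = 256
# (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) ranges for a hypothetical robot setup
ACTION_LOW  = np.array([-0.05, -0.05, -0.05, -0.3, -0.3, -0.3, 0.0])
ACTION_HIGH = np.array([ 0.05,  0.05,  0.05,  0.3,  0.3,  0.3, 1.0])

def detokenize_action(action_token_ids):
    """Map 7 discrete token ids (0..255) back to a continuous 7-DoF command."""
    ids = np.asarray(action_token_ids, dtype=np.float64)
    # Bin index -> center of that bin in the normalized [-1, 1] range
    normalized = (ids + 0.5) / NUM_BINS * 2.0 - 1.0
    # Un-normalize: scale/shift into each dimension's calibrated physical range
    return ACTION_LOW + (normalized + 1.0) / 2.0 * (ACTION_HIGH - ACTION_LOW)

# Example: a hypothetical token sequence produced by the model
print(detokenize_action([200, 130, 128, 128, 128, 40, 255]))
```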
Those are the three core elements: a vision module that sees, a language module that reasons, and an action decoder that acts. Let’s walk through what happens when you use OpenVLA in practice. Say you have a robot arm with a camera, and you give it an instruction: “Place the coke can upright on the table.” The camera image (let’s imagine it shows a soda can lying on its side in the robot’s workspace) gets fed through the visual encoder – DINOv2 and SigLIP churn out embeddings that capture the scene’s features and semantic hints. The projector then packages those into visual tokens. Meanwhile, your text instruction is tokenized by the language model’s tokenizer (words into subword tokens). The visual tokens and text tokens are concatenated into one sequence (a special prompt format is used; OpenVLA’s code wraps the instruction in a question-and-answer style template along the lines of “In: ... Out:”). The language model then generates a short sequence of action tokens, something like [token_123][token_45]... (not human-readable, but each might correspond to “move gripper right 5 cm”, “rotate wrist 20°”, etc.). Once a full action vector’s worth of tokens has been generated, OpenVLA stops. Those tokens are decoded into a continuous 7-DoF action – maybe something like “move end-effector +5 cm in X, +2 cm in Z, rotate 90°, close gripper” – which results in the robot picking up the can and turning it upright. The prompt-response cycle then repeats for the next step if the task isn’t done.
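In code, a single prediction step looks roughly like the sketch below, modeled on the project’s published Hugging Face usage. Treat the details – the `predict_action` helper, the `unnorm_key` argument, and the exact prompt template – as assumptions to verify against the official OpenVLA repository before relying on them:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the released checkpoint (trust_remote_code pulls in OpenVLA's custom model code)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

# Current camera frame plus the instruction wrapped in the prompt template
image = Image.open("camera_frame.png")
prompt = "In: What action should the robot take to place the coke can upright on the table?\nOut:"

# One forward pass -> one 7-DoF action, un-normalized with the chosen dataset's statistics
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7 values: [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]
```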
One important design choice here: OpenVLA is an autoregressive policy. It doesn’t spit out an entire plan or trajectory all at once; instead, it generates one action (one small motion) at a time, given the current image and goal. After the robot executes that action, you’d take a new image and feed it in again for the next action, and so on. This closed-loop feedback is crucial because it lets the model correct mistakes or adjust on the fly. It’s the same way humans work—you look, you move a bit, then you look again and adjust. The language model backbone, thanks to its training, can incorporate instruction understanding and even general reasoning at each step (“I need to be careful to place it upright”) while also relying on the image to not bump into things or to see when the can is upright.
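Deployment, then, is just a closed loop around that single-step call. Continuing from the snippet above, here is a sketch in which `camera`, `robot`, and `task_done` are hypothetical stand-ins for whatever hardware interface and success check you actually have:

```python
# Closed-loop control: observe -> predict one action -> execute -> repeat.
# `camera`, `robot`, and `task_done` are hypothetical placeholders for your own setup;
# `processor` and `vla` are the objects loaded in the previous snippet.
MAX_STEPS = 100
instruction = "place the coke can upright on the table"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

for step in range(MAX_STEPS):
    image = camera.capture()                          # fresh observation every step
    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.apply_delta(action[:6], gripper=action[6])  # small end-effector motion + gripper command
    if task_done(image):                              # stop once the goal is reached
        break
```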
Now, you might wonder about speed and responsiveness. A 7B-parameter model is fairly large, and inference speed matters if you want real-time robot control (real robots often need control signals at, say, 5–10 Hz or faster for smooth motion). OpenVLA’s creators took measures to keep the model efficient. For one, 7 billion parameters is actually modest compared to many language models – they deliberately stayed well below the gigantic 55-billion-parameter scale of some VLA backbones because of latency and hardware limits. They also show that the model’s weights can be quantized (stored at reduced numerical precision) without hurting task performance; serving the model in 4-bit precision works well and shrinks the memory footprint enough to fit on much smaller GPUs. In practice, using modern optimized transformer kernels (FlashAttention, etc.), OpenVLA can hit the control rates used in their experiments – for example, they ran a Franka robot at 15 Hz with this model in the loop, which implies each forward pass completes in well under 0.1 seconds on decent hardware. And if that isn’t fast enough, one emerging idea (which we’ll touch on in a later episode) is to use two models: a slower, smarter model guiding a fast, reactive one – but that’s exactly the kind of advanced topic we’ll save for the “future of VLA” discussion. The bottom line is that OpenVLA’s architecture finds a sweet spot between capacity (large enough to be very general) and efficiency (small enough to deploy). In fact, the team demonstrated that you can fine-tune it on a single high-end consumer GPU using low-rank adaptation techniques, and then run it with quantization on an everyday GPU.
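On the quantization point, here is a sketch of loading the released checkpoint in 4-bit precision through the standard bitsandbytes integration in Hugging Face Transformers (assuming bitsandbytes and a recent Transformers version are installed):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit weight quantization via bitsandbytes: weights are stored in 4-bit,
# while matrix multiplies are computed in bfloat16.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_config,  # cuts GPU memory substantially
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```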
A couple of additional neat points about the architecture: Since it’s built on a Large Language Model, it inherits a strong ability to interpret linguistic nuance. OpenVLA can understand fairly complex instructions or rephrasings because Llama 2 was trained on extensive text data. And because the visual encoder uses models pretrained on huge image datasets, it has broad visual recognition – it wasn’t trained explicitly to identify, say, “coke can” during robot training, but the SigLIP component likely knows what a soda can looks like from its pretraining, and can tie that to the word “coke” in the instruction. This marriage of pre-existing vision and language knowledge with new robot action learning is what makes VLAs powerful. OpenVLA is essentially standing on the shoulders of giants (big vision and language models) and teaching them to drive robots.
To summarize the tech: OpenVLA’s architecture consists of a dual vision encoder (DINOv2 + SigLIP) feeding, via a projector, into a Llama 2 7B transformer. The model ingests an image and a text prompt and generates action tokens that decode into a 7-DoF robot motion command. It’s a sequence-to-sequence setup, much like an AI translating English to French – except here it’s translating sights and instructions into physical actions. By designing the action output as just another “language” (a string of symbols), the creators get to reuse the mature training and inference machinery of NLP for robotics. Pretty cool, right?
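If it helps to see that wiring spelled out, here is a purely conceptual sketch of the data flow using toy placeholder modules and made-up dimensions – the real model uses pretrained DINOv2, SigLIP, and Llama 2 weights and generates its action tokens autoregressively, not the stand-ins below:

```python
import torch
import torch.nn as nn

# Toy dimensions, for illustration only
NUM_PATCHES, DINO_DIM, SIGLIP_DIM = 256, 1024, 1152
LLM_DIM, VOCAB = 4096, 32000  # Llama-2-sized embedding width and vocabulary (approximate)

dino_features   = torch.randn(1, NUM_PATCHES, DINO_DIM)    # stand-in for DINOv2 patch features
siglip_features = torch.randn(1, NUM_PATCHES, SIGLIP_DIM)  # stand-in for SigLIP patch features

# 1) Fused visual encoding: concatenate the two backbones' patch features
visual = torch.cat([dino_features, siglip_features], dim=-1)          # (1, 256, 2176)

# 2) Projector: map visual features into the LLM's token-embedding space
projector = nn.Linear(DINO_DIM + SIGLIP_DIM, LLM_DIM)
visual_tokens = projector(visual)                                      # (1, 256, 4096)

# 3) Prepend visual tokens to the embedded text instruction and run the LLM backbone
text_tokens = torch.randn(1, 12, LLM_DIM)                              # stand-in for the embedded instruction
sequence = torch.cat([visual_tokens, text_tokens], dim=1)              # (1, 268, 4096)
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(LLM_DIM, nhead=8, batch_first=True),
    num_layers=2,
)                                                                      # toy stand-in for the Llama 2 decoder
hidden = llm(sequence)

# 4) Head over the vocabulary; here we crudely read 7 "action tokens" off the last positions
#    (the real model generates them autoregressively, one token at a time)
logits = nn.Linear(LLM_DIM, VOCAB)(hidden)
action_token_ids = logits[:, -7:, :].argmax(dim=-1)                    # (1, 7) discrete action tokens
print(action_token_ids.shape)
```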
We’ve now dissected how OpenVLA sees, thinks, and acts, from its hybrid vision encoder to its action token decoder – quite the tour under the hood! Hopefully, you now understand how images and instructions flow through the model to become real robot motions. Of course, no model, however well-designed, works out of the box without the right training, so you might be curious how OpenVLA was taught to do all this. In the next episode, we’ll shift from architecture to training: the massive dataset behind OpenVLA, how (and for how long) it was trained, and how you can fine-tune or adapt it for your own robot or task. We’ll talk datasets, fine-tuning tricks, and what it takes to get a 7B-parameter brain up to speed – the training story has its own interesting twists. Until next time, thanks for listening – and remember, even for robots, you are what you train on!