Episode 25: GPUs – The Engines of Modern AI
Introduction: Why GPUs Matter for AI
Welcome to Embodied AI 101. Today, we’re diving into graphics processing units (GPUs) – the specialized chips that have become the beating heart of modern artificial intelligence. GPUs were originally designed to render video game graphics, but their massively parallel architecture turned out to be ideal for AI tasks. Unlike a CPU with a handful of powerful cores, a GPU packs thousands of smaller cores that can all work simultaneously on different pieces of data. This makes them exceptionally good at the linear algebra operations (like matrix multiplications) that underlie neural networks. In practical terms, using GPUs can speed up AI model training by orders of magnitude. For instance, when NVIDIA launched its A100 GPU in 2020, it claimed up to 20× higher AI performance than the predecessor Volta V100, thanks to architectural improvements and sheer core count[1]. That kind of leap in speed is why GPUs matter – they turned week-long AI training jobs into day-long jobs, and day-long jobs into hours, fundamentally changing what AI researchers and engineers can accomplish.
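To make that concrete, here’s a minimal sketch of the kind of parallel matrix multiply we’re talking about. It assumes PyTorch and a CUDA-capable GPU are available, and the timings are purely illustrative, not a benchmark:

```python
# Minimal sketch: time the same large matrix multiply on CPU vs GPU.
# Assumes PyTorch with CUDA; exact numbers will vary wildly by hardware.
import time
import torch

n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

t0 = time.perf_counter()
c_cpu = a @ b                      # runs on a handful of CPU cores
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()       # make sure the copies have finished
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu          # fans out across thousands of GPU cores
    torch.cuda.synchronize()       # wait for the kernel before stopping the clock
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.4f}s  speedup ~{cpu_s / gpu_s:.0f}x")
else:
    print(f"CPU only: {cpu_s:.3f}s (no CUDA device found)")
```

The point isn’t the exact numbers; it’s that the same one-line matrix multiply maps naturally onto thousands of GPU cores at once.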
But it’s not just about raw speed. GPUs also brought specialized hardware for AI: think Tensor Cores (introduced in NVIDIA’s Volta and improved in later generations), which perform mixed-precision math tailored for deep learning. These innovations mean a single state-of-the-art GPU can execute trillions of operations per second optimized for AI. The result? Breakthrough models like image recognizers, speech translators, and large language models all have one thing in common – behind the scenes, they were trained on GPU-powered compute. In short, if data is the new oil, then GPUs are the engines turning that raw fuel into AI insight, and understanding them is key to understanding the current AI landscape.
From Ampere to Blackwell: Evolution of NVIDIA’s AI GPUs
Let’s trace the recent evolution of NVIDIA’s data center GPUs – the chips that have powered the AI revolution – starting from the Ampere architecture through Hopper and into the latest Blackwell generation. Each step brought major leaps in performance and capability.
Ampere – NVIDIA A100 (2020): The NVIDIA A100 GPU was unveiled in May 2020 as the flagship of the Ampere generation[2]. This chip set a new standard for AI compute. Built on a cutting-edge 7nm process, the A100 packs a staggering 54 billion transistors[3] – at launch, it was one of the most complex processors ever manufactured. All those transistors allowed for 6,912 CUDA cores and 432 third-generation Tensor Cores working in parallel[3]. The A100 came with either 40 GB or 80 GB of high-bandwidth memory (HBM2/HBM2e), with the 80GB version debuting a world-record memory bandwidth of over 2 TB/s[4]. That high-speed memory meant the GPU could keep its many cores fed with data, avoiding bottlenecks. The Ampere architecture also introduced features like Multi-Instance GPU (MIG), which lets a single A100 be partitioned into up to seven smaller virtual GPUs to serve different jobs or users simultaneously[5] – a boon for cloud providers and researchers sharing resources. With around 400W power draw in its server (SXM4) variant[6], the A100 delivered up to 312 TFLOPS of deep-learning (FP16 Tensor Core) compute – double that with structured sparsity – and became the workhorse for AI in the early 2020s[7][8]. It’s no exaggeration to say the A100 enabled training of GPT-3–class models that would have been impractical before.
Hopper – NVIDIA H100 (2022): Two years later, NVIDIA introduced the H100 GPU, based on the Hopper architecture and named after computing pioneer Grace Hopper. If A100 was a workhorse, H100 was a thoroughbred racehorse. Built on TSMC’s 4N process, the H100 ramped up the transistor count to about 80 billion and increased the core counts (with 16,896 CUDA cores)[9]. It retained 80 GB of memory (now faster HBM3), pushing memory bandwidth to roughly 3.35 TB/s[10]. The Hopper generation’s headline innovation was the Transformer Engine – hardware support for new low-precision formats (like FP8) tailored to accelerate transformer neural networks. This made the H100 particularly adept at training and inferencing large language models. In fact, H100’s Tensor Cores can dynamically drop to 8-bit precision, giving a big speed boost for AI while the Transformer Engine’s scaling logic maintains model accuracy. In real-world terms, an H100 can significantly outperform an A100 on giant models – for example, by some accounts an H100 could train transformers 2–3× faster than an A100. The H100 also bumped the NVLink interconnect (for GPU-to-GPU communication) to the 4th generation, and raised the per-card power budget to about 700W to allow all that new muscle to flex. The H100 quickly became the gold-standard GPU for cutting-edge AI work circa 2022–2023[11][12] – powering top AI supercomputers and cloud offerings.
Hopper Mid-Life Upgrade – NVIDIA H200 (2024): NVIDIA wasn’t done with Hopper. In late 2023, it announced the H200, essentially a super-charged Hopper-based GPU aimed at the latest AI demands. The H200 didn’t overhaul the core architecture, but it did something significant: it became the first GPU to use HBM3e memory, boosting the on-board memory to 141 GB at a blistering 4.8 TB/s bandwidth[13][14]. That’s 76% more memory than the H100 and about 43% higher bandwidth[10]. Why does this matter? Because newer generative AI models and large language models crave memory – not just for the model weights but also for enormous context windows and intermediate data. The H200’s extra memory, earning it nicknames like “the memory monster,” allows handling 100+ billion parameter models or long-context inference that would spill out of an 80GB H100[15][16]. Impressively, it does this without increasing power consumption: the H200 still draws ~700W, meaning you get a lot more throughput per watt compared to H100[17][18]. In AI inference tasks, NVIDIA reported the H200 can deliver up to 2× the speed of H100 when dealing with large language models like Llama-2[19]. In short, H200 was a drop-in upgrade focused on feeding the beast – more memory and a faster firehose of data to keep the cores busy, which many AI practitioners found made a big difference for the latest models.
Blackwell – NVIDIA B100 and B200 (2024/2025): Now we arrive at NVIDIA’s newest architecture, Blackwell, named after mathematician David Blackwell. This generation represents another sweeping leap, akin to what Ampere or Hopper did, but with some new twists. The flagship B200 GPU is a monster by every metric. It introduces a multi-die design – effectively two GPU chips in one package – interconnected so tightly that they present as a single GPU to software[20][21]. Together, these two dies pack 208 billion transistors (104 billion each), more than 2.5× the H100’s transistor count[22]. The B200 comes with a whopping 180–192 GB of HBM3e memory (sources vary on the exact figure, but NVIDIA’s recent specs put it around 192 GB) and an astounding ~8 TB/s memory bandwidth[23][24] – roughly 1.4× the memory and 1.7× the bandwidth of the H200. It also doubles down on low-precision AI: Blackwell GPUs feature a second-generation Transformer Engine that introduces support for FP4 (4-bit precision) arithmetic[25]. In suitable workloads, FP4 can effectively double the compute throughput again over FP8 (though of course not all models can run at 4-bit without accuracy loss). The upshot is that for AI inference especially, Blackwell is in another league. Early benchmarks showed a single B200 roughly matching four H100 GPUs in certain LLM inference tests[26]. In fact, an NVIDIA MLPerf result had one B200 process ~10.7k tokens/second on a 70B model, about equal to 4× H100 doing ~11k combined – meaning one B200 is roughly 3.7–4× the performance of one H100[26]. Against the H200, it’s about 2.5× faster per chip in those tests[27]. And when you can run at FP4 precision, the gains are even more dramatic: a full 8-GPU server built with B200s (known as a DGX B200) can deliver up to 15× the inference throughput of the previous DGX H100 system on large model workloads[27][19]. That kind of jump is almost unheard of generationally – it speaks to Blackwell’s focus on maximally accelerating modern AI (especially huge models).
It’s worth noting that NVIDIA actually plans two flavors of Blackwell for data centers. The B100 is a slightly pared-down version: essentially a Blackwell GPU designed to be a drop-in replacement for the H100 in existing servers[28]. The B100 runs at the same ~700W power envelope as the H100 and is clocked lower, which yields lower peak performance (about 14 PFLOPS in FP4 vs B200’s ~18 PFLOPS)[29][30]. But the advantage is that data centers can upgrade their HGX H100-based systems to HGX B100 without changing power/cooling infrastructure[28]. The B200, on the other hand, is the no-holds-barred version – 1000W TDP per card[31], likely requiring advanced cooling (liquid or at least very robust airflow), and delivering the absolute maximum performance. In practice, NVIDIA expects that mainstream users who need Blackwell’s improvements but can’t handle 1000W heat per GPU might adopt B100, while the cutting-edge “AI factories” will deploy B200 nodes with beefed-up cooling. Either way, the Blackwell generation brings huge upgrades in memory, interconnect, and compute that set the stage for the next era of AI models.
Before we move on, one more perspective: in just these three generations – Ampere A100 to Hopper H100 to Blackwell B200 – the progress is staggering. In 2020, the A100 had 40/80GB of memory and ~1.5–2 TB/s bandwidth; by 2024–25, the B200 has ~192GB and 8 TB/s[10][24]. That’s roughly a 4× increase in bandwidth and a 2.4–4.8× increase in memory capacity (depending on whether you start from the 80GB or 40GB A100). Transistor counts went from 54B to 80B to 208B. Precision went from FP16 mixed-precision to FP8 to now FP4 support. And multi-GPU scaling went from NVLink connecting a few cards to literally two GPU dies in one package with Blackwell. All driven by the insatiable demand of AI for more: more data, bigger models, faster iteration. NVIDIA’s cadence of architectural leaps has been the metronome of AI capability in recent years.
Memory Architecture and Multi-GPU Clusters (NVLink & NVSwitch)
Now, let’s talk about memory and connectivity, because a single GPU chip isn’t much use in isolation if it can’t access data or collaborate with others. Modern AI GPUs are as much about memory design and multi-GPU scaling as they are about the cores themselves.
On-board Memory – HBM: Each GPU we discussed contains its own high-speed memory, and this is absolutely critical. GPUs don’t directly use the same RAM sticks your CPU does; instead, they rely on VRAM mounted right next to the GPU die. In high-end AI chips, this VRAM is typically High Bandwidth Memory (HBM), which comes in stacked modules integrated very close to the processor. The advantage of HBM is in the name – high bandwidth. By using a very wide bus (e.g. 5,120-bit wide on A100/H100[32]) and by operating at high speed, HBM provides terabytes per second of throughput. For example, the H100’s 80GB of HBM3 delivers around 3.35 TB/s of bandwidth[10], and the H200’s 141GB HBM3e pushes that to 4.8 TB/s[10]. (For comparison, a typical CPU with DDR5 RAM has on the order of 0.1–0.5 TB/s of memory bandwidth – roughly an order of magnitude less.) This matters because training AI models involves shuffling enormous amounts of data (think of all the activations and weight matrices). If your compute units are starved for data, they sit idle – a classic memory wall problem. HBM essentially feeds the GPU cores with data through a fire-hose instead of a garden hose, allowing those thousands of ALUs to stay busy. This is why each generation has boosted memory bandwidth alongside flops: H100 to H200 got +43% bandwidth[10], and B200 pushes it to ~8 TB/s[24]. It’s the unsung hero of performance, as one analysis put it – like widening the highway that your data travels on[33][34].
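To see why that firehose matters, here’s a back-of-the-envelope sketch, plain arithmetic rather than a benchmark, of how HBM bandwidth puts a floor under per-token latency when a model’s weights have to be streamed from memory for every generated token. The 70 GB weight figure is an assumed example, roughly a 70B-parameter model stored in 8-bit:

```python
# Rough lower bound on decode latency: each generated token must read the model
# weights from HBM at least once, so time >= weight_bytes / memory_bandwidth.
def min_ms_per_token(weight_gb: float, bandwidth_tb_s: float) -> float:
    seconds = weight_gb / (bandwidth_tb_s * 1000)   # GB divided by GB/s
    return seconds * 1000

# Assumed example: ~70 GB of weights (roughly a 70B-parameter model in 8-bit).
for gpu, bw in [("A100 (2.0 TB/s)", 2.0), ("H100 (3.35 TB/s)", 3.35),
                ("H200 (4.8 TB/s)", 4.8), ("B200 (~8 TB/s)", 8.0)]:
    print(f"{gpu:17s} >= {min_ms_per_token(70, bw):5.1f} ms/token")
# Faster HBM directly lowers this floor, which is why each generation's bandwidth
# bump matters as much as its FLOPS bump for large-model inference.
```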
Just as important is memory capacity. The largest AI models today have hundreds of billions of parameters, which can easily require tens to hundreds of GB of memory to train or even just to do inference with long prompts. This is why GPU memory sizes have climbed from 16 GB (NVIDIA V100 circa 2017) to 40 GB (A100) to 80 GB (H100) and now 141–192 GB (H200, B200). More memory means you can fit larger chunks of a model or dataset on one GPU, which often makes the difference between an experiment being possible or impossible. However, single GPU memory will always be limited – you can’t just slap 2 TB of RAM on a GPU (at least not with current tech or reasonable cost). This is where scaling out to multi-GPU systems comes in.
NVLink – GPU Peer-to-Peer Communication: When you use multiple GPUs together (for instance, many AI training jobs use 4, 8, or even dozens of GPUs in parallel), they need to exchange data rapidly. The standard PCIe interface in computers is not fast enough for high-end multi-GPU communication (it’s general-purpose and relatively slow, with tens of GB/s bandwidth and high latency). NVIDIA addressed this with NVLink, a proprietary high-speed interconnect between GPUs. NVLink first appeared around 2016 (with the Pascal P100 GPU) and has evolved each generation. It essentially allows GPUs to directly access each other’s memory over a fast link, somewhat like a mini-network just for GPUs. In the A100 (Ampere) we had NVLink 3.0, which provided about 600 GB/s of bidirectional bandwidth between GPUs[35][36]. Hopper’s H100 introduced NVLink 4, which raised that to 900 GB/s of total bandwidth per GPU. And Blackwell brings NVLink 5.0, which pushes GPU-to-GPU links to roughly 1.8 TB/s aggregate[37][38]. To put that in perspective, NVLink 5 is so fast that it rivals the main memory bandwidth of many previous-gen GPUs. This means multiple Blackwell GPUs in a server can effectively function like one giant GPU with 1.8 TB/s between each other, enabling nearly seamless scaling for workloads that need more than one GPU’s worth of memory or compute[37].
However, connecting several GPUs in a robust topology requires more than just point-to-point links. That’s where NVSwitch comes in. NVSwitch is essentially a crossbar switch chip that NVIDIA uses in their server designs (like DGX systems) to connect many NVLink ports from multiple GPUs. For example, in a system with 8 GPUs, an NVSwitch allows each GPU to talk to any other at full NVLink speed simultaneously, creating an all-to-all high-bandwidth network among the GPUs. NVIDIA first used NVSwitch in the DGX-2 (Volta generation) and has continued refining it. In an 8× H100 or 8× B200 server (often called an HGX baseboard), NVSwitch allows the whole set of GPUs to function as one coherent 8-GPU cluster with a large pool of combined memory (8×80GB = 640GB for H100, or 8×192GB ≈ 1.5TB for B200) accessible with low latency. This is incredibly important for training very large models that can’t fit on one GPU – pieces of the model can be spread across GPUs, and NVSwitch/NVLink shuffles the data between them much faster than if you had to go through standard networking or PCIe. In effect, technologies like NVLink and NVSwitch turn a multi-GPU server into a single gigantic GPU (from the software’s point of view)[39][40].
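As a concrete illustration of the traffic NVLink and NVSwitch carry, here’s a minimal PyTorch sketch of an all-reduce, the collective operation at the heart of multi-GPU training. It assumes a single machine with at least two CUDA GPUs and NCCL available; NCCL routes the transfer over NVLink/NVSwitch when those links exist and falls back to PCIe otherwise:

```python
# Minimal multi-GPU all-reduce sketch (assumes >=2 CUDA GPUs and NCCL).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Pretend this is a 1 GiB gradient shard living in this GPU's HBM.
    grad = torch.full((256 * 1024 * 1024,), float(rank), device="cuda")

    # Sum the tensor across all GPUs; NCCL uses NVLink/NVSwitch when available.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world_size  # average, as in data-parallel training

    torch.cuda.synchronize()
    if rank == 0:
        print(f"averaged value: {grad[0].item():.2f} across {world_size} GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    if n_gpus >= 2:
        mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```

The same code runs whether the GPUs talk over PCIe, NVLink, or an NVSwitch fabric; the interconnect just determines how quickly that all-reduce completes.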
NVLink-C2C and Grace CPU integration: One more recent innovation in the memory and interconnect space is NVIDIA’s move to link GPUs with CPUs (and other chips) more directly. Traditionally, a GPU sits on a PCIe bus and has to communicate with the CPU and main memory through that, which is a bottleneck. NVIDIA introduced NVLink-C2C (chip-to-chip) and used it in their Grace Hopper Superchip. The Grace Hopper design pairs an NVIDIA Grace CPU (a custom 72-core ARM server CPU with lots of LPDDR5X memory) with an H100 GPU on the same module, connected by NVLink-C2C. This yields a 900 GB/s coherent link between the CPU and GPU[41], allowing the CPU and GPU to share memory and exchange data extremely fast. In practical terms, the GPU can directly use the CPU’s memory as an extension of its own (with some performance overhead but far less than going over PCIe). The Grace Hopper system had a combined memory of up to 198 GB (CPU-side LPDDR5X plus the GPU’s HBM) accessible to the GPU[13]. This concept is being carried forward: in the Blackwell generation, NVIDIA is planning Grace-Blackwell superchips (the GB200). These will tie Grace CPUs with Blackwell GPUs in a similar fashion. In fact, NVIDIA’s high-end design for next-gen AI supercomputers is something like the DGX GB200: a rack-scale system that links Grace CPU chips and 72 Blackwell GPUs into a single 72-GPU cluster using NVLink and NVSwitch fabrics[42][39]. In such a design, 36 Grace CPUs and 72 GPUs are interconnected so tightly that the whole rack can be treated almost like one giant pool of 72 GPUs, all with fast links among them[42][43]. This is aimed at colossal models (trillions of parameters) where even 8 GPUs aren’t enough – you need dozens acting in concert. By using NVLink at the rack scale, NVIDIA claims they can achieve dramatic speed-ups (one figure mentioned up to 30× faster inference on trillion-parameter models when using a fully NVLinked 72-GPU “super-node” vs a more traditional cluster of smaller nodes)[44][40].
To summarize, the evolution of memory architecture (HBM on-package) and interconnects (NVLink, NVSwitch, NVLink-C2C) has been just as crucial as raw compute power for enabling modern AI. Training a model might involve shuffling petabytes of data back and forth; without these high-bandwidth pathways, the GPUs would starve. So, when we talk about GPUs as AI engines, it’s really an engine plus its fuel system – the fuel being data and the fuel lines being memory buses and NVLinks. Each generation’s successes in AI have been built on not just more FLOPs, but also getting data where it needs to be, fast.
Grace Superchips and the Blackwell Generation
We touched on this above, but let’s dig a bit more into the Grace “superchip” systems and what’s new in the Blackwell generation beyond just speed. NVIDIA’s Grace CPU is an interesting piece of the puzzle. It’s a server-class CPU they announced in 2021, focused on bandwidth and energy efficiency, with the intent of pairing tightly with their GPUs. A single Grace CPU has 72 ARM cores; the Grace CPU Superchip links two of them over NVLink-C2C for 144 cores total and up to roughly 1 TB/s of memory bandwidth using LPDDR5X memory. The Grace Hopper Superchip (GH200) combined one Grace CPU and one Hopper H100 GPU on the same board, effectively fusing CPU and GPU into a single unit with unified memory addressing. Why do this? Because it simplifies the challenge of feeding data to GPUs. Instead of data having to hop from CPU memory across a slow PCIe bus to the GPU, it flows over NVLink-C2C at 7–10x the bandwidth and with coherence (meaning the CPU and GPU see a consistent memory view). This made it much easier, for example, to have very large AI models where only part of the model is in GPU memory and another part in CPU memory – the GPU could still access what it needed at high speed. The GH200 was particularly touted for giant recommender systems and LLMs that needed more memory than a single GPU had.
Now with Blackwell, NVIDIA is extending this concept. The references to GB200 (Grace-Blackwell) and specifically the NVIDIA DGX GB200 NVL72 solution indicate a full-blown architecture where not just one CPU is paired with one GPU, but an entire network of CPUs and GPUs are tightly bound. The DGX GB200, as mentioned, connects 36 Grace CPUs with 72 Blackwell GPUs in a single coherent cluster[45]. “NVL72” in the name suggests all 72 GPUs are in one NVLink domain (likely via multiple NVSwitches), which is extraordinary – typically past systems maxed out at 8 or 16 GPUs per coherent group. Here we’re talking 72 GPUs behaving almost like one unit. The Grace CPUs likely act as feeders and memory cache – each Grace has its own fast memory and can do preprocessing or handle parts of models that are more scalar/irregular, while the 72 GPUs handle the heavy lifting. The entire contraption is liquid-cooled and built as a rack-scale system rather than a single box[46][45]. It’s essentially NVIDIA’s blueprint for an “AI supercomputer in a rack”. By doing this, NVIDIA aims to tackle use-cases like enormous language models (think multi-trillion parameter mixtures of experts) or multi-modal models that need fast integration of different data types – cases where splitting work across multiple standard servers would introduce too much latency or networking overhead. In the GB200, everything is as local as possible thanks to NVLink fabric.
Another aspect of Blackwell and its platform is a focus on efficiency and total cost of ownership at scale. NVIDIA has emphasized things like improved energy efficiency for Blackwell (performance per watt gains, despite the high absolute power). For instance, even though a B200 is 1000W, it reportedly delivers proportionally more performance such that it’s still a win in perf/W. NVIDIA also introduced features like confidential computing at the GPU level (encryption of memory contents etc.) with Hopper and Blackwell – relevant for secure deployments of AI. And there’s a mention of new hardware such as a Transformer Engine v2, decompression engines (for fast data loading), and enhanced support for things like Spark on GPUs[47], which all aim at integrating AI GPUs into broader data processing pipelines more efficiently.
One specific new feature in Blackwell architecture worth noting is the support for FP4 precision as part of the Transformer Engine. As we discussed, FP4 could potentially double throughput over FP8, which was already an innovation in Hopper. Early reports indicate this enables Blackwell GPUs to run some inference workloads (with the right algorithms) using 4-bit weights/activations, dramatically cutting memory usage and boosting speed. Combined with the sheer memory size increase, it means Blackwell can handle extremely large model inference. For example, with 192 GB per GPU, one could load a 175B-parameter model fully in one B200 (if the model weights are in 4-bit precision, 175B * 4 bits ≈ 88 GB, which fits with plenty of room to spare for activations). That’s something H100 could not do alone for such models without partitioning. It underscores how Blackwell is oriented not just to training faster, but to deploying (inference) of gargantuan models more easily as well.
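Here’s a quick sanity check of that arithmetic, plain math with no vendor data, showing how precision alone determines where a 175B-parameter model’s weights can live:

```python
# Weight-memory footprint of a 175B-parameter model at different precisions.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    # decimal GB, the same convention used when quoting HBM capacity
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"175B params @ {bits}-bit: {weight_gb(175, bits):6.1f} GB of weights")
# 16-bit -> 350 GB (several GPUs), 8-bit -> 175 GB, 4-bit -> ~88 GB, which leaves
# roughly 100 GB of a 192 GB B200 for the KV cache and activations.
```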
In summary, the Blackwell generation coupled with Grace CPUs is about scaling up (bigger unified systems, more memory, more GPUs in lock-step) and speeding up both compute (FP4, more FLOPS) and data movement (NVLink 5, NVLink-C2C, HBM3e). It represents NVIDIA doubling down on the “AI supercomputer” vision – where the distinction between CPUs, GPUs, memory, and networking blurs into one integrated platform for massive AI workloads.
The Wider Hardware Landscape: AMD, Intel, and Custom AI Chips
So far we’ve focused on NVIDIA, given their dominant role in AI acceleration. But they’re not the only game in town. Let’s zoom out and look at the competitive landscape and other AI silicon efforts – not as an exhaustive comparison, but to understand broader trends.
AMD (Advanced Micro Devices): NVIDIA’s most direct competitor in GPUs is AMD. AMD produces data center GPUs under the Instinct brand. In recent years, AMD has been trying to challenge NVIDIA’s AI stronghold with products like the Instinct MI100, MI200, and most recently the MI300 series. The MI200 (launched 2021) was notable for using a multi-chip module (MCM) design – in fact, the MI250X packed two GPU chiplets on one package (somewhat foreshadowing NVIDIA’s dual-die approach in Blackwell). AMD’s current flagship as of 2023/2024 is the Instinct MI300 family, which includes variants like MI300A and MI300X. The MI300A is actually an APU (Accelerated Processing Unit) that combines CPU and GPU dies in one package (sound familiar? It’s analogous to a Grace+Hopper superchip, but using AMD’s Zen4 CPU cores and CDNA3 GPU cores together). The MI300X is a GPU-focused version with 192 GB of HBM3 memory aimed squarely at large AI models – AMD specifically designed it for heavy inference on LLMs, touting that it can hold models like Falcon-40B entirely in memory with room to spare. AMD’s approaches often emphasize an open ecosystem: for instance, their ROCm software platform as an alternative to NVIDIA’s CUDA. Performance-wise, MI300 GPUs are reportedly in the same ballpark as H100 on some tasks, although NVIDIA still leads in software maturity and overall adoption. One advantage AMD tries to leverage is their chiplet expertise and integration with their CPUs (EPYC). For example, AMD has demonstrated cache-coherent interconnects between EPYC CPUs and Instinct GPUs (using their Infinity Fabric technology). The trend here is similar to NVIDIA’s Grace+GPU: tight CPU-GPU coupling and lots of memory. It’s a safe bet that AMD will continue down this path, with future Instinct generations (like the announced MI400 series) pushing more performance and perhaps more integration (there are rumors of upcoming AMD GPUs with 3D-stacked memory or even analog compute, but that’s speculative). In sum, AMD is providing an alternative stack – one that some cloud providers and labs are evaluating, especially to avoid vendor lock-in. Competition from AMD helps drive the whole industry forward, even if NVIDIA currently has the mindshare.
Intel: Intel’s journey in AI accelerators has been a bit more convoluted. They attempted high-end GPUs (the Intel Ponte Vecchio GPU, for HPC and AI, finally launched as part of the Aurora supercomputer). Ponte Vecchio was a massive multi-tile design with lots of HBM memory; however, it’s not widely available commercially beyond specific supercomputer deals. More relevantly, Intel acquired Habana Labs and has been pushing the Habana Gaudi accelerators for AI training in data centers. Gaudi2, launched in 2022, is a 7nm AI processor with 96 GB of HBM and built-in high-speed networking (24 × 100Gb RoCE links) to scale multiple cards. AWS offers Gaudi instances as a lower-cost alternative to NVIDIA GPUs for some workloads. Gaudi has had some success in showing competitive training performance for certain models (like ResNet, some transformers) and can be more cost-efficient, but it lacks the broad software ecosystem of CUDA. Intel’s strategy seems to be pursuing both paths: their own GPUs (the upcoming GPU architecture code-named Falcon Shores is expected around 2025 – potentially a hybrid CPU-GPU on one package with flexible ratios of compute and memory) and their purpose-built accelerators (like Gaudi). It’s an acknowledgment that AI is crucial and they need solutions beyond just CPUs. Intel’s advantage is in manufacturing and an existing data center footprint, but they are playing catch-up in the accelerator design and software.
Google and other custom ASICs: Outside the GPU sphere, Google’s Tensor Processing Units (TPUs) are a prime example of custom AI silicon. Google has deployed TPUs in its data centers since 2015 for both training and inference. As of 2023, the workhorse was TPU v4, which delivers very high compute through specialized matrix units (much like GPUs’ Tensor Cores) and is deployed in pods of thousands of chips for Google’s internal use and Google Cloud customers. TPUs excel at throughput for training large models and have been used for breakthrough projects (AlphaGo, large language models at Google, etc.). They are not for sale; they’re a cloud service, but they represent a major non-GPU platform where a lot of AI happens. Newer TPU generations (v5 and beyond) have since been announced, with even greater performance and more focus on efficiency. TPUs illustrate the trend of vertically integrated AI hardware – design a chip exactly for your workloads (in Google’s case, search and ads and now LLMs) to gain an advantage.
There are also startups and other big tech companies with notable projects: Meta has the MTIA (Meta Training and Inference Accelerator) initiative – Meta has built some internal inference chips and is reportedly working on a training chip, aiming to reduce reliance on GPUs in the long term. Amazon has its Inferentia and Trainium chips (focus on inference and training respectively for AWS cloud) – these haven’t overtaken GPUs yet but serve specific niches in AWS for cost-sensitive deployments. Tesla developed the Dojo supercomputer system with their own D1 chip, specifically optimized for video and autopilot model training; Dojo eschews HBM in favor of dense on-chip SRAM and a custom mesh interconnect, aiming for high flops with lower cost, though it’s very specialized to Tesla’s needs. Cerebras created a literally wafer-sized chip (the WSE) to maximize compute and memory on a single huge silicon piece – a very novel approach targeting ease of programming (no need for multi-GPU distributed programming, since it’s one giant chip). Graphcore has their IPU (Intelligence Processing Unit) focusing on fine-grained parallelism and massive graph compute. SambaNova, Mythic, Groq, and others have all offered different takes (from analog compute to streaming architectures). Most of these are aimed at carving out a piece of the AI hardware market by optimizing for certain workloads or efficiency targets.
The common thread in all this: AI is so important (and costly) that many players are investing in custom silicon to accelerate it, rather than relying solely on general-purpose solutions. Each new approach – GPUs, TPUs, IPUs, wafer-scale – explores different balances of memory, compute, precision, and interconnect. For now, NVIDIA GPUs remain dominant for most use cases due to their all-around capability and software ecosystem. But the presence of competitors is influencing NVIDIA too – pushing them to innovate faster (e.g., the move to chiplet GPUs and superchips might be accelerated by competitive pressure) and sometimes to adjust pricing. For AI practitioners, this competition is generally a good thing, as it promises more options and potentially better price/performance in the future. We’re entering an era where large-scale AI might run on a heterogeneous mix of hardware: some GPU clusters, some custom ASICs for specific tasks, etc., all depending on the specific needs.
The Cost of AI Power: GPU Pricing and Rentals
It’s time to talk about the money – because all this cutting-edge hardware comes at eye-popping cost. Whether you’re renting time on GPUs in the cloud or buying your own for an on-premise cluster, the numbers are significant, and they factor into decisions of what hardware to use.
Let’s start with cloud rental pricing, since many listeners might have experience spinning up a cloud GPU instance. In early 2023, renting an NVIDIA H100 (the then top-of-line GPU) on a major cloud provider could cost on the order of $8 per hour. However, as supply improved and competition among cloud offerings increased, prices have come down. By 2025, H100 rentals dropped to around $2–3.50 per hour, with some providers (and reserved deals) as low as $1.90/hour[48]. This dramatic drop (over 50%) was partly due to more H100s becoming available (NVIDIA ramped up production) and the release of newer GPUs making H100 a slightly older model. Essentially, cloud GPU time started to commoditize a bit.
The H200, being newer and more capable, typically commands a premium of about 20–25% over the H100[49]. So if an H100 is $2/hr, an H200 might be in the $2.50–3/hr range initially. That premium reflects the extra memory and performance – and if your workload benefits from that, paying a bit more can be worth it. For outright purchase, an H100 80GB card (SXM form factor) was roughly $25,000 in 2024–2025[50]. The H200, being rare and new, might be on the order of $30k or more per card (not officially list-priced, but the premium suggests so).
The B200 (Blackwell) is the cutting-edge as of 2025, and early adopters are indeed paying a hefty premium. According to industry insights, the B200 is expected to cost at least 25% more than H200 – possibly putting it in the range of $40k or more per GPU[51]. And initially, supply is limited, which could drive prices even higher on a secondary market. For cloud, if/when B200s are offered, they might be, say, $4–5/hr or higher. In short, the latest and greatest doesn’t come cheap.
But the cost of a GPU is only part of the story. If you want a multi-GPU setup, you need to consider the system around it. NVIDIA sells complete systems like the DGX line. For example, a DGX H100 (with 8× H100 GPUs, plus CPUs, memory, storage, networking) can easily run in the hundreds of thousands of dollars. One estimate put a basic multi-GPU setup (with networking, cooling etc.) north of $400,000[52]. A full DGX SuperPOD (which might have 20+ such nodes plus high-end networking) can be in the many millions. So, these are capital expenditures that only well-funded companies and labs make lightly.
If you’re a startup or a research group, this leads to a natural question: rent vs buy? Renting (cloud) offers elasticity and zero up-front cost – you pay as you go. Buying (on-prem) is a huge up-front cost but could be cheaper in the long run if you utilize the GPUs fully. Industry anecdotes and studies have shown that at scale, owning can indeed be cheaper – for instance, one study (by McKinsey) noted cloud AI infrastructure can cost 2–3× more than equivalent on-prem if you are continuously using it at high utilization[53]. That’s because cloud providers charge a premium for flexibility and to cover their overhead. On the flip side, if you only occasionally need a lot of GPUs, cloud is far more efficient – you wouldn’t want expensive hardware idle most of the day.
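To get a rough feel for where that break-even sits, here’s a toy calculation using the ballpark figures from this episode. The hosting cost and utilization are assumptions, and real quotes, depreciation, networking, and staffing would all shift the answer:

```python
# Toy rent-vs-buy break-even for a single H100-class GPU (illustrative numbers only).
purchase_price = 25_000.0      # $ per GPU, hardware only (ballpark from this episode)
hosting_per_year = 3_000.0     # assumed $ per GPU-year for power, cooling, rack space
rental_rate = 2.0              # $ per GPU-hour in the cloud (ballpark)
utilization = 0.7              # fraction of the year the GPU is actually busy

busy_hours = 8760 * utilization
cloud_per_year = rental_rate * busy_hours
breakeven_years = purchase_price / (cloud_per_year - hosting_per_year)

print(f"cloud: ${cloud_per_year:,.0f}/yr vs own: ${purchase_price:,.0f} up front + ${hosting_per_year:,.0f}/yr")
print(f"break-even after ~{breakeven_years:.1f} years at {utilization:.0%} utilization")
```

With these placeholder numbers the purchase pays for itself within a few years at high utilization, while at low utilization the cloud wins, which is exactly the trade-off described above.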
It’s worth noting the secondary market and constraints in supply too. In 2023, we saw a kind of GPU crunch – demand for NVIDIA H100s far outstripped supply due to the generative AI boom. This led to long wait times for orders and even a gray market where prices spiked. By late 2024, supply started catching up, but for the very latest like B200, availability in 2025 might be tight. Early adopters who must have Blackwell might even pay scalper-like premiums or sign big contracts with NVIDIA to get allocation.
Power and cooling costs also tie into “cost of ownership.” A 1kW GPU (B200) running flat out 24/7 will consume 24 kWh per day. In a cluster of 8, that’s ~192 kWh/day, which at data center electricity rates could be on the order of $20/day just in electricity for one node (not including cooling overhead). Scale that to a large cluster and power becomes a significant operational cost – in fact, electricity can be a limiting factor (some data centers can’t even supply enough power or cooling for too many of these nodes). Thus, part of the cost conversation is also: do you have the facility to host these power-hungry beasts? If not, you pay a colocation or cloud that does.
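Here’s the arithmetic behind that ballpark, as a tiny sketch; the electricity rate and the cooling overhead (PUE) are assumptions rather than quotes:

```python
# Back-of-the-envelope energy cost for one 8x B200-class node running flat out.
gpus_per_node = 8
gpu_kw = 1.0                   # ~1000 W per GPU
price_per_kwh = 0.10           # assumed $/kWh at data-center rates
pue = 1.3                      # assumed facility overhead (cooling, power delivery)

gpu_kwh_per_day = gpus_per_node * gpu_kw * 24        # 192 kWh/day for the GPUs alone
print(f"GPUs only:     {gpu_kwh_per_day:.0f} kWh/day -> ~${gpu_kwh_per_day * price_per_kwh:.0f}/day")
print(f"With overhead: {gpu_kwh_per_day * pue:.0f} kWh/day -> ~${gpu_kwh_per_day * pue * price_per_kwh:.0f}/day")
```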
In summary, for H100/H200 class, think ~$2–3 per GPU-hour in cloud and ~$25k–30k to buy. For B200, think maybe $4–5/hour and $40k+ to buy, at least initially, with those numbers likely to improve as it matures. And remember the “hidden” costs: networking (100Gb+ switches, InfiniBand networks for clusters are expensive), storage (feeding GPUs data fast needs big IO systems), and engineering time (setting up and maintaining clusters isn’t trivial). All of that means the leading edge of AI hardware is as much a budgetary challenge as a technical one. It also explains why many AI startups start in the cloud and only invest in on-prem gear once they’ve scaled to a point where it clearly saves them money to buy.
Where to Deploy: Cloud, On-Premise, and Hybrid Approaches
With cost and performance in mind, another strategic question arises: Where do you run your AI workloads? In the cloud, in your own data center, or a mix of both? The trend in the industry is towards a hybrid model, but let’s break down the considerations.
Cloud (Public Cloud Providers): Cloud services (like AWS, Google Cloud, Microsoft Azure, and smaller players like Oracle Cloud, Lambda Labs, CoreWeave, etc.) have been crucial in democratizing access to AI hardware. In the cloud, you can rent anything from one GPU for an hour to thousands of GPUs for a week, depending on your needs and budget. This on-demand availability lowers the barrier to entry – you don’t need to invest millions upfront just to try an idea. It’s also great for burst capacity: say you mostly train smaller models in-house, but occasionally you need to train a huge model; you can burst to cloud, use 100 GPUs for a day, and then shut them off. Cloud providers also manage the hardware for you – dealing with failures, upgrades, etc., which is convenient.
However, as we discussed, cost can be a downside if you have steady, heavy usage. Renting 24/7 adds up. Additionally, cloud GPUs can sometimes be a generation behind or less flexible in configuration. (Though to be fair, AWS, Azure, and GCP all offer H100 instances now, often in 8-GPU server configurations akin to NVIDIA’s DGX. And some specialized clouds like Lambda or CoreWeave try to offer the very latest like H200 or even test versions of B100/B200 when available.) Cloud also introduces potential data governance and security questions – some companies are wary of sending sensitive data to a third-party cloud or find it doesn’t meet certain compliance needs.
On-Premise (Own Infrastructure): Many large AI labs and enterprises invest in building their own GPU clusters. This involves buying racks of GPUs (often DGX nodes or similar OEM systems) and installing them in a data center (which might be on company premises or a co-location facility). The appeal here is control and potentially lower long-term cost. If you know you need X GPU-compute continuously, doing it in-house means you pay for it once and then just power/cooling and maintenance – no cloud premium. Companies like Meta, Google, Microsoft, OpenAI (via Azure arrangements), etc., all build giant on-prem (or dedicated colocated) clusters for their research. Even some smaller firms, once they hit a certain scale of usage, calculate that owning makes sense economically.
On-prem can also be optimized to your specific needs. For example, if you need an ultra-fast network between nodes, you can implement a custom InfiniBand fabric; if you want a specific storage system, you have free rein. You’re not constrained by cloud instance types. Some organizations also care about latency and consistency – on-prem, you’re not contending with noisy neighbors or multi-tenant jitter, and your job scheduling can be tailored to your priorities. There’s also the aspect of data egress costs in cloud – if you have petabytes of training data, it might be cheaper to keep and process them on-prem than upload/download to cloud constantly.
That said, running on-prem is not trivial: you need expertise to manage these systems (cluster management, job schedulers like Slurm or Kubernetes, debugging hardware issues, etc.). There’s risk of hardware obsolescence – if you invest millions in GPUs and a year later a much faster one comes, you can’t just click a button to upgrade; you’re stuck with them until you budget for new ones. There’s also capacity inflexibility – if your needs spike unexpectedly, your on-prem cluster might not handle it and then you scramble to supplement with cloud, leading to a hybrid anyway.
Hybrid: Because of the above trade-offs, many are choosing a hybrid approach. They keep a baseline capacity on-prem for the regular workload and burst to cloud for peaks or special projects. This can offer a sweet spot of cost efficiency and flexibility. For example, a company might run daily model training on their own servers, but if an urgent large experiment is needed, they go to cloud for that one. Or they might do training in-house but use cloud for serving/inference (or vice versa). Another hybrid angle is using colocation providers that specialize in GPU hosting – essentially your hardware, but in someone else’s data center with good connectivity and perhaps management services.
NVIDIA itself has recognized the hybrid trend and launched DGX Cloud, which is basically NVIDIA-owned high-end infrastructure hosted in various clouds (and accessible as a service). It’s like renting a ready-made supercomputer node, managed by NVIDIA, but delivered through a cloud provider’s data center. This targets enterprises who want the benefits of on-prem (dedicated performance, full NVIDIA stack with support) but with cloud-like subscription models.
Trends in deployment also vary by sector. Traditional enterprises dipping into AI often start in the cloud (for ease of pilot projects). AI-first tech companies (like those whose core product is an AI model) often invest in their own infrastructure sooner because they can justify it at scale. Government or sensitive industry players might insist on on-prem for security. Meanwhile, academia and smaller startups leverage cloud or shared academic clusters to avoid big costs.
One interesting emerging trend is the idea of community or decentralized compute – for instance, projects where individuals or companies with spare GPUs offer them on a marketplace. There are platforms where you can get cheaper GPU time from someone’s rig or a lesser-known provider. These haven’t displaced the big players, but it shows the demand for GPU compute is sparking creativity in sourcing it.
In summary, deployment of AI hardware is a strategic decision: cloud offers agility and zero maintenance, on-prem offers control and possibly lower cost at scale, and many find a mix is optimal. The key is to balance cost, scale, expertise, and data considerations. At the end of the day, what matters is that you have access to sufficient compute when you need it. The fact that we now have so many options – from major clouds to DIY clusters – is a testament to how central GPUs (and AI accelerators generally) have become in the computing landscape. Ten years ago, only a few supercomputing labs had this kind of power. Today, a small startup can tap into nearly state-of-the-art compute within minutes on the cloud. That’s played a huge role in the rapid progress of AI.
On the Horizon: Rubin, Feynman, and Future Roadmaps
What comes after Blackwell? In the fast-paced world of AI hardware, there’s always a next generation around the corner. NVIDIA has provided some tantalizing hints and the rumor mill is active. The code names “Rubin” (likely honoring astronomer Vera Rubin) and “Feynman” (after physicist Richard Feynman) have emerged as designations for future architectures.
According to recent disclosures and reports from NVIDIA’s 2025 roadmap presentations, Rubin is slated to be the next major GPU architecture after Blackwell[54][55]. It is expected around 2026. The details revealed (or inferred) include that Rubin will continue the chiplet approach: in fact, one report from GTC 2025 indicated Rubin will be a package of two GPUs in one, similar to Blackwell, but likely with further enhancements[56]. There was mention that paired with a next-gen NVIDIA CPU (called “Vera” CPU), a Rubin GPU module could achieve up to 50 PFLOPS of AI performance in inference[57]. For context, that’s more than double what current Blackwell chips (around 20 PFLOPS) deliver[58]. This suggests a significant jump, possibly from architectural improvements, faster clocks, more cores, maybe a move to a newer process node (likely 3nm class by then), and maybe even more aggressive use of low precision (who knows, maybe FP2 or some form of ternary computation? Pure speculation!).
NVIDIA also signaled that Rubin will come in an “Ultra” variant around 2027, referred to as Rubin Ultra. One description implied Rubin Ultra would combine four GPUs in a single package, delivering ~100 PFLOPS of AI performance[59]. Four GPUs in one package is mind-boggling – it likely means essentially two dual-die GPUs connected, or some 2x2 arrangement with NVLink-C2C bridging them. By that time, cooling and power will be even more challenging (imagine a single package consuming perhaps 1.5–2 kW). Rubin Ultra might be a special option for the absolute high end (much as “Ultra” in the Blackwell context refers to the B300 series, i.e. Blackwell Ultra).
Following Rubin, around 2028 if timelines hold, is the architecture codenamed Feynman[60]. NVIDIA has been tight-lipped on Feynman (understandably, since it’s years out), but some clues can be pieced together. It’s likely that by Feynman’s time, NVIDIA will integrate their CPU and GPU even more. The TechCrunch coverage of GTC 2025 noted that Feynman will also feature an NVIDIA-designed CPU (the article said “features a Vera CPU”) and will succeed Rubin in the lineup[61]. One can speculate that Feynman could be the generation where silicon photonics or new interconnect tech might come into play – there have been research and hints that NVIDIA is looking at optical links to further scale bandwidth and reduce energy per bit for communication (the Tom’s Hardware piece mentioned plans up to 14× faster systems by 2027 possibly involving such tech[62][63]). By 2028, we might also see a transition to next-gen memory like HBM4 or whatever comes after, potentially pushing bandwidth even higher (HBM3e is likely a stepping stone; HBM4 might use more channels, higher clocks, maybe hitting 10+ TB/s per chip).
NVIDIA also appears to be using a strategy of intermediate “Ultra” refreshes. As noted, there’s Blackwell Ultra (the B300 series, expected 2025), which per TechCrunch bumps memory to 288GB (versus 192GB) while keeping similar compute[64]. Then Rubin, then Rubin Ultra, then Feynman. This cadence keeps something new coming each year to some extent: architecture A, then an “Ultra” or refresh, then the next architecture, and so on. It’s a response to both competitive pressure and the slowing of Moore’s Law – they squeeze an extra boost mid-generation via packaging or memory upgrades.
Speaking of Moore’s Law and scaling limits, by the time of Rubin/Feynman we might be in an era of even slower transistor scaling (3nm, then 2nm…). NVIDIA may need to incorporate chiplets not just for GPUs but for memory (maybe integrating HBM stacks even more tightly or in 3D). There’s chatter about silicon photonics possibly arriving by 2027 – meaning using optical signals for communication either between chips or between boards, to break through electrical bandwidth and latency limitations. If NVIDIA is talking about systems 14× faster than NVL72 by 2027[62], that implies huge innovations, because that goes far beyond just doubling GPU counts; it might involve connecting multiple racks or adopting new interconnect technology.
In simpler terms, the future roadmap suggests: more integration, more memory, possibly more specialized cores. There’s also the question of whether NVIDIA will incorporate other types of processing – e.g., will future GPUs have AI-specific blocks beyond Tensor Cores, like dedicated hardware for certain ML operations (sparsity, transformers, etc.)? Already we see more of that – Hopper had a transformer engine, Blackwell enhances it, maybe Rubin goes further.
Other rumored codenames and internal project names occasionally float around (for a while “Hopper Next” served as a placeholder), but since we have the actual names Rubin and Feynman, those are what we use.
And it’s not just GPUs: NVIDIA is also planning further CPU developments (the “Vera” CPU to go with Rubin GPUs, and likely further Arm CPU advancements for Feynman era). They’re also heavily investing in their networking (InfiniBand, BlueField DPUs) – future GPU systems might integrate DPUs for networking right on the same module (there was talk of “NVLink Switch” to connect beyond one box, etc.). So the roadmap is as much about platform as chip.
To put a bow on it: If you’re mapping out your AI hardware strategy for the next 5 years, keep an eye on NVIDIA’s Rubin (circa 2026) for the next big jump, and Feynman (2028) for what could be a paradigm shift where AI supercomputers might have to adopt new tech (like photonics or extreme chiplet counts) to keep scaling. And of course, competitors will not be standing still either – AMD’s equivalent timeline might bring its own major advances (they’ve mentioned an MI500 series around 2026–27, likely 3nm, possibly chiplet, possibly with integrated networking). The race is on, and it’s fueled by the economic promise of AI – whoever can provide the fastest, most efficient hardware gains an edge, and those roadmaps are the battle plans.
The Limits of Scaling: Memory Walls, Power Draw, and Economic Challenges
We’ve celebrated the astounding progress in GPU capability, but it’s crucial to acknowledge the limits and challenges that come with scaling AI hardware. As we push towards ever more powerful chips and systems, we’re running into walls – physical, technical, and economic.
The Memory Wall: Perhaps the most fundamental issue is the so-called memory wall. This term, coined decades ago, describes the growing gap between how fast processors can compute versus how fast data can be fetched from memory. In the context of AI, it’s extremely relevant. A recent analysis of AI hardware trends showed that over ~20 years, compute (FLOPS) increased by 60,000×, while DRAM memory bandwidth only increased about 100×[65]. That’s an enormous divergence[65]. What it means is that, no matter how many ALUs or Tensor Cores we pack into a GPU, if they have to wait on data coming from memory, a lot of that potential goes unused. We see this vividly in modern AI: for many large neural network workloads, memory capacity and bandwidth are the bottlenecks, not the raw compute. Consider this: the largest GPT-3-sized models (hundreds of billions of parameters) don’t even fit in one GPU’s memory, and even when sharded across many GPUs, moving activations and weights around becomes a headache.
The memory wall exists at multiple levels: on-chip caches vs HBM, HBM vs CPU memory, node vs node over networks – at each level, data movement is slower than compute. And the more we scale model sizes, the more memory and bandwidth become the limiting factor. For example, one study noted that state-of-the-art model sizes (especially in NLP) were growing at 4× every two years, whereas single-GPU memory was growing at best 2× every two years[66]. This means model sizes have outpaced memory such that no single accelerator can hold them, forcing us to use distributed strategies that introduce communication overhead (which is another kind of memory wall, moving data across GPUs)[67].
We’ve tried to engineer around this: HBM was one response (move memory closer and make it faster). NVLink and NVSwitch are responses (make off-chip data movement faster). New memory hierarchy ideas like HBM as primary memory, or GPUs directly accessing SSDs for out-of-core training, are being explored. But these are workarounds; fundamentally, unless a new memory tech or paradigm comes (like significantly faster memory or processing-in-memory approaches where computation happens near where data is stored), the memory wall will continue to challenge AI scaling. One key symptom of the memory wall: diminishing returns with more compute units. For instance, doubling the number of GPU cores might not double performance if memory bandwidth is fixed – the cores would just contend for the same memory interface. AI model structures can sometimes mitigate this (e.g., mixing compute-heavy and memory-heavy operations), but the larger and sparser models get, the more memory-bound parts they have.
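A simple roofline-style calculation makes that diminishing-returns point concrete. The peak-FLOPS and bandwidth figures below are illustrative round numbers rather than official specs:

```python
# Roofline sketch: attainable throughput = min(peak compute, intensity * bandwidth).
def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float, flops_per_byte: float) -> float:
    # bandwidth_tb_s * 1e12 bytes/s * flops_per_byte flop/byte, expressed in TFLOPS
    return min(peak_tflops, flops_per_byte * bandwidth_tb_s)

low_intensity = 2.0     # memory-bound kernel, e.g. LLM decode (a few FLOPs per byte read)
high_intensity = 300.0  # compute-bound kernel, e.g. a large batched matmul

configs = [("A100-ish", 312, 2.0), ("H100-ish", 1000, 3.35), ("2x compute, same HBM", 2000, 3.35)]
for name, peak, bw in configs:
    print(f"{name:22s} decode-like: {attainable_tflops(peak, bw, low_intensity):7.1f} TFLOPS"
          f"  matmul-like: {attainable_tflops(peak, bw, high_intensity):7.1f} TFLOPS")
# Doubling peak FLOPS barely moves the memory-bound column; only more bandwidth does.
```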
Power and Cooling Constraints: Next big issue – power consumption. We’ve seen GPU TDPs climb from ~250W a few years ago (Pascal/Volta era) to 400W (A100) to 700W (H100) and now 1000W (B200)[31]. This can’t keep going up indefinitely, or every GPU will be a space heater that needs its own generator. We are hitting practical limits of what a single server can power and cool. Data centers often deliver power in increments like a few tens of kW per rack. When one server node (like a DGX B200) can draw ~14 kW by itself under load[68], you’re looking at maybe only 4–6 of those in a rack before you max out a 60 kW rack limit[69]. That’s for state-of-the-art; imagine future 2kW GPUs – it gets even trickier. This is forcing changes in cooling: liquid cooling is becoming standard for high-end deployments (air cooling a 1kW card is borderline infeasible in many environments). Some setups use liquid-cooled heat exchangers on each card, others dunk entire servers in coolant (immersion cooling). These add complexity and cost. Also, the more power-hungry, the more energy cost – running an AI supercomputer can incur electricity bills in the millions of dollars a year. This raises sustainability concerns too: large AI clusters draw as much power as small towns. There is pressure (and rightfully so) to improve the energy efficiency (performance-per-watt) of AI hardware, not just the absolute performance.
We should note that each new generation so far has actually improved relative efficiency: e.g., H100 does more FLOPs per watt than A100, etc. But absolute usage still goes up because we deploy more chips to get more total work done. It’s like cars becoming more fuel-efficient but you just drive them more miles. At some point, if an AI model needs 1000 GPUs running for a week, the power cost and carbon footprint of that will be scrutinized. This is partly why NVIDIA and others emphasize perf-per-watt gains and why techniques like lower precision (FP8, FP4) matter – they effectively compute more with less energy.
Economic and Supply Constraints: Another limit is simply economic feasibility and supply chain. The cutting-edge chips use cutting-edge fabrication (TSMC 5nm, 4nm, soon 3nm). These fabs are astronomically expensive to operate, and there’s limited capacity. We’ve already seen geopolitical factors (like export controls on high-end GPUs to certain countries) impact supply. The chip shortage of 2020–2022 showed how delicate supply chains can bottleneck progress. If only, say, 50,000 top GPUs can be made in a year and demand is for 100,000+, then not everyone who wants to train a giant model can get the hardware when they want it. This can slow down progress or concentrate it in the hands of those with deep pockets or priority access.
The cost to design and manufacture each generation is also huge. Fewer players can afford to stay at the bleeding edge, which might slow the pace of innovation if ROI isn’t met. It’s a kind of economic scaling issue: are we reaching a point where only trillion-dollar companies or governments can train the very largest models because the hardware (and energy) cost is so high? Some observers worry about that – that a compute divide emerges. To counter that, we see efforts in more efficient algorithms (so you don’t need as much hardware to get results) and in broadening hardware availability (like cloud credits, academic partnerships, etc.). But it’s a challenge: many academic researchers can’t dream of training GPT-4-level models because they’d need millions worth of GPU time.
Diminishing Returns and Architectural Limits: There’s also a notion of diminishing returns in architecture improvements. For example, adding more and more Tensor Cores yields less benefit if the rest of the architecture (memory, cache, interconnect) doesn’t also scale. There’s likely going to be a point where simply adding more HBM stacks isn’t possible (there’s only so much space and cost per package), or adding more chiplets yields diminishing returns due to communication overhead between chiplets. The dual-die B200 works well with NVLink 5 bridging it, but if they tried, say, an 8-die package in the future, would the overhead and complexity undermine the gains? Possibly, unless new interconnect tech like silicon bridges or optical come in.
One noted limit is Amdahl’s Law – as we make the parallel parts faster (GPUs excel at parallel work), any serial or communication part becomes the bottleneck. In giant model training, there are synchronization and reduction steps (all-reduce operations, etc.) which become significant at scale. Techniques like pipeline parallelism, async methods, etc., are ways we try to circumvent that, but it’s an ongoing battle between hardware speedups and algorithmic work distribution.
Moreover, software becomes a limit too – writing software that can fully utilize, say, 1000 GPUs efficiently is very hard. Many deep learning frameworks are still catching up to optimized usage of H100 features; by the time we have FP4 in mainstream, not all algorithms can quantize to 4-bit without accuracy loss. So there’s a gap between theoretical hardware capability and practical achievable speed. This isn’t a hardware limit per se, but it tempers the real-world impact of hardware progress.
In terms of future physics limits: as transistors get smaller, we worry about reaching atomic scales, but 3nm, 2nm still have some headroom (with new materials maybe). However, clock speeds can’t go much higher without hitting cooling walls (most data-center GPUs still run in the 1.5–2 GHz range – clocks haven’t increased much in a decade). So improvements come from parallelism and architecture, which has worked, but the pace of improvement may slow if one day we can’t fit more parallel units or memory on a chip easily.
To tackle these challenges, the industry is exploring novel approaches: for memory – perhaps stacking more layers of memory on top of logic (3D stacking), or using “near-memory compute” where small compute engines sit on the memory chip itself to preprocess data. Or memory disaggregation: having memory pools accessible by many GPUs via fast networks (like some startups are doing with memory fabrics). For power – maybe more specialized accelerators that do only what’s needed and nothing more (to save energy), or even analog computing (compute using analog signals to save power, though that has its own issues). For communication – moving to optical interconnects that can maintain high bandwidth over distance with lower power, which could let us scale beyond a single rack more efficiently.
In essence, the free lunch is ending: AI hardware had a phenomenal run of improvements thanks to both Moore’s Law and clever specialization, but each step up is getting harder and costlier. It’s akin to climbing a mountain – the higher you go, the thinner the air. We’re at a high altitude now in terms of compute, and to go higher, we have to bring our own oxygen (i.e., fundamentally new ideas or much bigger investments).
Despite these challenges, the trajectory is still upward. The limits are being actively worked on. History shows that when one path hits a wall, engineers find a detour: e.g., single-core CPUs hit frequency/power walls, so we went to multi-core; we might hit monolithic GPU scaling limits, so we go chiplet and multi-GPU scaling; if that struggles, maybe distributed computing with better networking or a new paradigm (quantum? probably not for general AI anytime soon, but who knows far future). Economic pressures also inspire innovation: if training models is too expensive, there’s huge incentive to design more efficient algorithms (we’ve seen things like model quantization, sparsity exploitation, better optimizers all come to reduce the needed compute).
In conclusion on limits: Memory bandwidth is the ball-and-chain on the ankle of AI accelerators, and alleviating it (through HBM, better caches, etc.) is paramount; power density is the ceiling we’re pressing against, demanding better cooling and efficiency; and cost and supply are the rocky ground, requiring careful navigation so that progress can be shared and sustained. Each new generation of GPU tries to address some of this (H200 tackled memory size, Blackwell tackles precision efficiency, etc.), but no single generation will magically overcome fundamental physics. It’s an ongoing battle, and one that the entire computing field is rallying around because AI’s hunger for compute seems insatiable. The hope is that through a combination of hardware advances and smarter software, we’ll continue to see leaps – perhaps not 10× leaps every two years like the early days, but still significant growth – without hitting an abrupt wall. And if we do hit a wall, it might just shift the paradigm (for example, maybe focus moves to more efficient small models, or neuromorphic computing, etc.). But right now, GPUs are still charging forward, even if they’re dragging some heavy constraints behind them.
So, that’s the landscape of GPUs in AI: from what they are and why they’re crucial, through the rapid evolution of NVIDIA’s hardware (Ampere to Hopper to Blackwell and beyond), an explanation of the memory and interconnect magic that makes multi-GPU “AI factories” possible, to the competition in the field, the dollars-and-cents of using these GPUs, where you might run them, what’s coming next, and the challenges that could slow things down.
It’s a lot to cover, but in a field moving this fast, staying informed is half the battle. We’re witnessing an era where every year brings new hardware that makes previously unthinkable AI projects feasible. If you’re a technically inclined professional (even without GPU hardware background), I hope this overview gave you a clear narrative of why GPUs are center stage in AI and how they got to where they are. The story of AI is inseparable from the story of its compute engines – and for now, GPUs remain the star of that story.
Thanks for listening to Embodied AI 101. Stay tuned for our next episode, where we’ll continue to explore the technologies and ideas that shape our embodied, intelligent future. Until then, keep learning and stay curious!