Getting infrastructure right for generative AI

Facts, it has been said, are stubborn things. For generative AI, a stubborn fact is that it consumes very large quantities of compute cycles, data storage, network bandwidth, electrical power, and air conditioning. As CIOs respond to corporate mandates to “just do something” with genAI, many are launching cloud-based or on-premises initiatives. But while the payback promised by many genAI projects is nebulous, the costs of the infrastructure to run them are concrete, and too often unacceptably high.

Infrastructure-intensive or not, generative AI is on the march. According to IDC, genAI workloads will grow from 7.8% of the overall AI server market in 2022 to 36% in 2027. The storage curve is similar, rising from 5.7% of AI storage in 2022 to 30.5% in 2027. IDC research finds that roughly half of worldwide genAI spending in 2024 will go toward digital infrastructure, and the firm projects that the worldwide infrastructure market (server and storage) for all kinds of AI will roughly double, from $28.1 billion in 2022 to $57 billion in 2027.

But the sheer quantity of infrastructure needed to process genAI’s large language models (LLMs), along with power and cooling requirements, is fast becoming unsustainable.

“You will spend on clusters with high-bandwidth networks to build almost HPC [high-performance computing]-like environments,” warns Peter Rutten, research vice president for performance-intensive computing at IDC. “Every organization should think hard about investing in a large cluster of GPU nodes,” says Rutten, asking, “What is your use case? Do you have the data center and data science skill sets?”

Shifting to small language models, hybrid infrastructure

Savvy IT leaders are aware of the risk of overspending on genAI infrastructure, whether on-premises or in the cloud. After taking a hard look at their physical operations and staff capabilities as well as the fine print of cloud contracts, some are coming up with strategies that are delivering positive return on investment.

Seeking to increase the productivity of chronically understaffed radiology teams, Mozziyar Etemadi, medical director of advanced technologies at Northwestern Medicine, undertook a genAI project designed to speed the interpretation of X-ray images. But instead of piling on compute, storage, and networking infrastructure to handle massive LLMs, Northwestern Medicine shrank the infrastructure requirements by working with small language models (SLMs).

Etemadi began by experimenting with cloud-based services but found them unwieldy and expensive. “I tried them, but we couldn’t get [generative AI] to work in a favorable cost envelope.” That led Etemadi to the realization that he would have to spearhead a dedicated engineering effort.

Heading a team of a dozen medical technologists, Etemadi built a four-node cluster of Dell PowerEdge XE9680 servers, each with eight Nvidia H100 Tensor Core GPUs, connected with Nvidia Quantum-2 InfiniBand networking. Running in a colocation facility, the cluster ingests multimodal data, including images, text, and video, which trains the SLM on how to interpret X-ray images. The resulting application, which was recently patented, generates highly accurate interpretations of the images, feeding them to a human-in-the-loop (HITL) reviewer for final judgment.
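
To make the workflow concrete, here is a minimal, hypothetical sketch of a human-in-the-loop gate of the kind described above, in which a model drafts an interpretation and a clinician has the final say. The function and class names are illustrative assumptions, not Northwestern Medicine’s actual code or APIs.

```python
# Hypothetical sketch of a human-in-the-loop (HITL) review gate.
# `draft_interpretation` stands in for whatever model endpoint produces a
# draft report; nothing here reflects Northwestern Medicine's actual system.
from dataclasses import dataclass

@dataclass
class Review:
    approved: bool
    final_text: str
    reviewer: str

def draft_interpretation(image_path: str) -> str:
    # Placeholder for the SLM inference call (assumption, not a real API).
    return f"Draft interpretation for {image_path}"

def hitl_review(image_path: str, reviewer: str, human_edits: str | None = None) -> Review:
    """The model drafts a report; the human reviewer accepts or rewrites it."""
    draft = draft_interpretation(image_path)
    final = human_edits if human_edits else draft
    return Review(approved=True, final_text=final, reviewer=reviewer)
```

The point of the pattern is that nothing is released without a human sign-off, which is also why inference speed matters so much later in the story: slow models keep those reviewers waiting.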

“It’s multimodal, but tiny. The number of parameters is approximately 300 million. That compares to ChatGPT, which is at least a trillion,” says Etemadi, who envisions building on the initial X-ray application to interpret CT scans, MRI images, and colonoscopies.
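
Those parameter counts translate directly into hardware requirements. As a rough, back-of-the-envelope illustration (not a figure from the article), assuming 16-bit weights at 2 bytes per parameter and 80 GB of memory per H100, a ~300-million-parameter model’s weights fit easily on a single GPU, while a trillion-parameter model’s weights alone span dozens of them:

```python
# Back-of-the-envelope memory math for model weights (illustrative only).
# Assumes fp16 weights (2 bytes per parameter) and 80 GB of HBM per H100;
# activations, KV caches, and training overhead are ignored for simplicity.
BYTES_PER_PARAM_FP16 = 2
H100_MEMORY_GB = 80

def weight_memory_gb(num_params: float) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9

for label, params in [("~300M-parameter SLM", 300e6), ("~1T-parameter LLM", 1e12)]:
    gb = weight_memory_gb(params)
    gpus = max(1, -(-gb // H100_MEMORY_GB))  # ceiling division
    print(f"{label}: ~{gb:,.1f} GB of weights, at least {int(gpus)} H100(s) for weights alone")
```

That gap, before even counting training overhead, is the arithmetic behind running an SLM on a four-node cluster rather than a hyperscale one.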

He estimates that using a cloud-based service for the same work would cost about twice as much as it costs to run the Dell cluster. “On the cloud, you’re paying by the hour and you’re paying a premium.” In contrast, he asserts, “Pretty much any hospital in the US can buy four computers. It’s well within the budget.”
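
His roughly two-to-one cost estimate is easy to sanity-check with a simple break-even calculation. The sketch below uses entirely hypothetical numbers (the article does not disclose Northwestern Medicine’s actual rates or purchase price); the point is only the shape of the comparison between paying by the hour and buying outright:

```python
# Illustrative break-even comparison: renting cloud GPUs by the hour versus
# buying hardware outright. All figures are hypothetical placeholders.
CLOUD_RATE_PER_GPU_HOUR = 4.00    # hypothetical $/GPU-hour
HOURS_PER_MONTH = 730
NUM_GPUS = 32                     # e.g., four nodes with eight GPUs each
ONPREM_CAPEX = 1_200_000          # hypothetical purchase price, $
ONPREM_OPEX_PER_MONTH = 15_000    # hypothetical colo, power, and support, $

cloud_monthly = CLOUD_RATE_PER_GPU_HOUR * HOURS_PER_MONTH * NUM_GPUS
breakeven_months = ONPREM_CAPEX / (cloud_monthly - ONPREM_OPEX_PER_MONTH)

print(f"Comparable cloud cluster: ~${cloud_monthly:,.0f} per month")
print(f"On-prem hardware pays for itself in roughly {breakeven_months:.1f} months")
```

Under these placeholder assumptions the hardware pays for itself in well under two years of steady use; the calculus tilts back toward the cloud when utilization is low or bursty.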

When it comes to data storage, Northwestern Medicine uses both the cloud and on-premises infrastructure for both temporary and permanent storage. “It’s about choosing the right tool for the job. With storage, there is really no one-size-fits-all,” says Etemadi, adding, “As a general rule, storage is where cloud has the highest premium fee.”

On premises, Northwestern Medicine is using a mix of Dell NAS, SAN, secure, and hyperconverged infrastructure equipment. “We looked at how much data we needed and for how long. Most of the time, the cloud is definitely not cheaper,” asserts Etemadi.

The cost calculus of GPU clusters

Faced with similar challenges, Papercup Technologies, a UK company that has developed genAI-based language translation and dubbing services, took a different approach. Papercup clients seeking to globalize the appeal of their products use the company’s service to generate convincing voice-overs in many languages for use in commercial videos. Before a job is complete, an HITL reviewer examines the output for accuracy and cultural relevance. The LLM work started in a London office building, which the infrastructure demands of generative AI soon outgrew.

“It was quite cost-effective at first to buy our own hardware, which was a four-GPU cluster,” says Doniyor Ulmasov, head of engineering at Papercup. He estimates initial savings between 60% and 70% compared with cloud-based services. “But when we added another six machines, the power and cooling requirements were such that the building could not accommodate them. We had to pay for machines we could not use because we couldn’t cool them,” he recounts.

And electricity and air conditioning weren’t the only obstacles. “Server-grade equipment requires know-how for things like networking setup and remote management. We expended a lot of human resources to maintain the systems, so the savings weren’t really there,” he adds.

At that point, Papercup decided it needed the cloud. The company now runs customer translation and dubbing workloads on Amazon Web Services, with the output reviewed by an HITL reviewer. Simpler training workloads still run on premises, on a mix of servers powered by Nvidia A100 Tensor Core, GeForce RTX 4090, and GeForce RTX 2080 Ti hardware. More resource-intensive training is handled on a cluster hosted on Google Cloud Platform. Building on its current services, Papercup is exploring language translation and dubbing for live sports events and movies, says Ulmasov.
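
A hybrid split like this usually comes down to a simple routing policy. The sketch below is a hypothetical illustration of that kind of policy, not Papercup’s actual logic; the threshold and target names are assumptions:

```python
# Hypothetical routing policy for a hybrid genAI setup: customer-facing
# inference in one cloud, light training on local GPUs, heavy training in
# another cloud. Thresholds and names are illustrative assumptions.
def route_workload(kind: str, gpu_hours_estimate: float = 0.0) -> str:
    """Pick an execution target for a genAI workload."""
    if kind == "inference":
        return "aws"        # customer translation and dubbing jobs
    if kind == "training" and gpu_hours_estimate < 500:  # hypothetical cutoff
        return "on_prem"    # A100 / RTX 4090 / RTX 2080 Ti servers
    return "gcp"            # resource-intensive training cluster

print(route_workload("training", gpu_hours_estimate=1200))  # -> gcp
```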

For Papercup, infrastructure decisions are driven as much by geography as by technology requirements. “If we had a massive warehouse outside the [London] metro area, you could make the case [for keeping work on-premises]. But we are in the city center. I would still consider on-premises if space, power, and cooling were not issues,” says Ulmasov.

Beyond GPUs

For now, GPU-based clusters are simply faster than CPU-based configurations, and that matters. Both Etemadi and Ulmasov say using CPU-based systems would cause unacceptable delays that would keep their HITL experts waiting. But the high energy demands of the current generation of GPUs will only increase, according to IDC’s Rutten.

“Nvidia’s current GPU has a 700-watt power envelope, then the next one doubles that. It’s like a space heater. I don’t see how that problem gets resolved easily,” says the analyst.
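
A quick estimate shows why that envelope worries facilities teams. Assuming eight GPUs per server and a rough 1.5 kW allowance for CPUs, memory, and fans (both assumptions, not figures from the article), each node becomes a multi-kilowatt heat source, and the projected doubling nearly doubles the cooling load as well:

```python
# Rough power-and-cooling estimate for dense GPU servers.
# Assumptions: 8 GPUs per node, ~1.5 kW of non-GPU draw (CPUs, RAM, fans).
WATTS_TO_BTU_PER_HR = 3.412   # standard conversion from watts to cooling load

def node_power_watts(gpu_watts: float, gpus_per_node: int = 8,
                     other_watts: float = 1500) -> float:
    return gpu_watts * gpus_per_node + other_watts

for gpu_w in (700, 1400):  # today's 700 W envelope vs. a doubled successor
    node_w = node_power_watts(gpu_w)
    print(f"{gpu_w} W GPUs: ~{node_w / 1000:.1f} kW per node, "
          f"~{node_w * WATTS_TO_BTU_PER_HR:,.0f} BTU/hr of heat to remove")
```

At those densities, an ordinary office floor runs out of power and cooling long before it runs out of rack space, which is the wall Papercup hit.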

The reign of GPUs in genAI and other forms of AI could be challenged by a host of emerging AI co-processors and, eventually perhaps, by quantum computing.

“The GPU was invented for graphics processing so it’s not AI-optimized. Increasingly, we’ll see AI-specialized hardware,” predicts Claus Torp Jensen, former CIO and CTO and currently a technology advisor. Although he does not anticipate the disappearance of GPUs, he says future AI algorithms will be handled by a mix of CPUs, GPUs, and AI co-processors, both on-premises and in the cloud.

Another factor working against unmitigated power consumption is sustainability. Many organizations have adopted sustainability goals, which power-hungry AI algorithms make difficult to achieve. Rutten says SLMs, ARM-based CPUs, and cloud providers that maintain zero-emissions policies or run on electricity produced by renewable sources are all worth exploring where sustainability is a priority.

For implementations that require large-scale workloads, processors built around field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) are a choice worth considering.

“They are much more efficient and can be more powerful. You have to hardware-code them up front and that takes time and work, but you could save significantly compared to GPUs,” says Rutten.

Until processors emerge that run significantly faster while using less power and generating less heat, the GPU is a stubborn fact of life for generative AI, and cost-effective genAI implementations will require ingenuity and perseverance. But as Etemadi and Ulmasov demonstrate, the challenge is not beyond the reach of strategies built on small language models and a skillful mix of on-premises and cloud-based services.

© Foundry