Dec 17 (Reuters) – Alphabet’s Google is working on a new initiative to improve how its artificial intelligence chips run PyTorch, the world’s most widely used AI software framework, in a move aimed at weakening Nvidia’s long-standing dominance of the AI computing market, according to people familiar with the matter.
The effort is part of Google’s aggressive plan to make its tensor processing units a viable alternative to market-leading Nvidia GPUs. TPU sales have become a key driver of Google’s cloud revenue growth as it tries to prove to investors that its AI investments are paying off.
But hardware alone is not enough to drive adoption. The new initiative, known internally as “TorchTPU,” aims to remove a key barrier that has slowed the adoption of TPU chips by making them fully compatible and developer-friendly for customers who have already built their technology infrastructure using PyTorch software, the sources said. Google is also considering open-sourcing parts of the software to speed customer uptake, some of the people said.
Google has given TorchTPU more organizational attention, resources and strategic importance than previous attempts to support PyTorch on TPUs, as demand grows from companies that want to adopt the chips but view the software stack as a bottleneck, the sources said.
PyTorch, an open-source project strongly supported by Meta Platforms, is one of the most widely used tools for developers building AI models. In Silicon Valley, very few developers write every line of code that chips from Nvidia, Advanced Micro Devices or Google will actually execute.
Instead, these developers rely on tools like PyTorch, a collection of pre-written code libraries and frameworks that automate many common tasks in AI software development. PyTorch, originally released in 2016, has a history closely tied to Nvidia’s development of CUDA, software that some Wall Street analysts see as the company’s strongest shield against competitors.
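To illustrate the division of labor described here, consider a minimal, hypothetical PyTorch sketch (the model and tensor shapes are invented for illustration): the developer writes high-level framework code, and PyTorch dispatches the actual chip-level work, falling back on Nvidia’s CUDA backend whenever a GPU is present.

```python
import torch
import torch.nn as nn

# The developer describes the model at a high level; PyTorch dispatches
# the underlying kernels to whichever backend is available.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# On Nvidia hardware, one line moves the work onto the GPU via CUDA --
# the tight PyTorch-CUDA integration the article describes.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(32, 784, device=device)
loss = model(x).sum()
loss.backward()  # autograd runs the backward pass as CUDA kernels on a GPU
```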
Nvidia engineers have spent years making sure software developed with PyTorch runs as fast and efficiently as possible on its chips. Google, by contrast, has long had its in-house armies of software developers use a different code framework called Jax, and its TPU chips use a tool called XLA to make the code run efficiently. Much of Google’s AI and performance optimization software stack has been built around Jax, leaving a gap between how Google uses its chips and how customers want to use them.
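For contrast, a minimal Jax sketch (again with invented shapes) shows the workflow Google’s own developers use: jax.jit hands the computation to the XLA compiler, which generates optimized code for whatever backend is attached, including TPUs, without the developer writing device-specific code.

```python
import jax
import jax.numpy as jnp

# jax.jit hands the computation to the XLA compiler, which emits optimized
# code for the attached backend (TPU, GPU or CPU) -- no device-specific
# kernels written by the developer.
@jax.jit
def predict(w, b, x):
    return jnp.tanh(x @ w + b)

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
w = jax.random.normal(k1, (784, 10))
b = jnp.zeros(10)
x = jax.random.normal(k2, (32, 784))

y = predict(w, b, x)  # first call triggers XLA compilation; later calls reuse it
print(y.shape)  # (32, 10)
```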
A Google Cloud spokesperson would not comment on the details of the project, but confirmed to Reuters that the move would give customers a choice.
“We are seeing massive, growing demand for both our TPU and GPU infrastructure,” the spokesperson said. “Our focus is on providing the flexibility and scale developers need, regardless of the hardware they choose to build on.”
TPUS FOR CUSTOMERS
Alphabet has long reserved the lion’s share of its own chips, or TPUs, for internal use only. That changed in 2022, when Google’s cloud computing unit successfully lobbied to oversee the group selling TPUs. The move drastically increased Google Cloud’s allocation of TPUs, and as customer interest in AI grew, Google looked to capitalize by increasing production and sales of TPUs to external customers.
But the mismatch between PyTorch, which most of the world’s AI developers use, and Jax, which Google’s chips are currently best tuned to run, means that most developers can’t easily adopt Google’s chips and make them perform as well as Nvidia’s without significant additional engineering work. Such work takes time and money in the fast-moving AI race.
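A bridge between PyTorch and TPUs already exists in the open-source torch_xla package, and a minimal sketch of it (API names per the torch_xla documentation; details vary by version, and this is illustrative rather than a description of TorchTPU itself) hints at the gap: the model code is unchanged, but the device handling and lazy, compile-on-step execution model differ from what CUDA-based teams are used to.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # PyTorch/XLA bridge (requires a TPU runtime)

# The model code is plain PyTorch; only the device handling changes.
device = xm.xla_device()  # resolves to the attached TPU core
model = nn.Linear(784, 10).to(device)

x = torch.randn(32, 784, device=device)
loss = model(x).sum()
loss.backward()

# Unlike CUDA's eager execution, XLA records operations lazily and compiles
# them when the step is marked -- one of the behavioral gaps teams must absorb.
xm.mark_step()
```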
If successful, Google’s “TorchTPU” initiative could significantly reduce switching costs for companies wanting alternatives to Nvidia GPUs. Nvidia’s dominance has been bolstered not only by its hardware, but also by its CUDA software ecosystem, which is deeply embedded in PyTorch and has become the default method by which companies train and run large AI models.
Enterprise customers told Google that TPUs were harder to adopt for AI workloads because they required developers to switch to Jax, a machine learning framework favored internally at Google, rather than PyTorch, which most AI developers already use, the sources said.
JOINT EFFORTS WITH META
To speed up development, Google is working closely with Meta, the creator and maintainer of PyTorch, according to the sources. The two tech giants have discussed deals for Meta to access more TPUs, a move first reported by The Information.
Initial offerings to Meta were structured as Google-managed services, in which customers like Meta installed Google chips designed to run Google software and models, with Google providing operational support. Meta has a strategic interest in software that makes TPUs easier to run, in a bid to lower inference costs, diversify its AI infrastructure away from Nvidia GPUs and gain bargaining power, the people said.
Meta declined to comment.
This year, Google began selling TPUs directly into customers’ data centers, rather than restricting access to its own cloud. Amin Vahdat, a Google veteran, was named head of AI infrastructure this month, reporting directly to CEO Sundar Pichai.
Google needs this infrastructure both to run its own AI products, including the Gemini chatbot and AI-powered search, and to supply its Google Cloud services, which sell access to TPUs to companies like Anthropic.
(Reporting by Krystal Hu, Kenrick Cai and Stephen Nellis in San Francisco; Editing by Kenneth Li and Matthew Lewis)