Where does eBay do most of its AI development? You might be surprised

For a company doing business in the cloud before the concept of cloud computing existed, eBay Inc. has taken a decidedly non-cloud-centric approach to artificial intelligence training and deployment.

Though the company bursts out to the public cloud during seasonal peaks, the bulk of its AI work happens in its own data centers. That approach lets it meet high standards for customer privacy and compliance while speeding time to market, said Parantap Lahiri, vice president of network and data center engineering at the e-commerce giant.

“Public cloud is a friend where you can rent resources to solve some of your load balancing problems, but we are going to have our core competency on-prem,” he said in an interview with SiliconANGLE.

Blessed with talent

EBay manufactures much of its server hardware and is blessed with ample engineering talent, Lahiri said. “We found that beyond a certain point of scale, it makes a lot more business and financial sense to run the bulk of our workloads on-premises,” he said. “We are lucky to have the right engineering talent to pretrain models, fine-tune models, deploy in our own infrastructure, and integrate with our applications.”

EBay was an early adopter of AI among commercial organizations, having built its first applications in 2016. Its generative Magical Listing feature lets sellers take or upload a photo and have AI fill in details about the item being listed. The feature can write titles, descriptions, product release dates and detailed category and subcategory metadata, and even suggest a listing price and shipping cost.

A Personalized Recommendations feature launched late last year generates buyer recommendations from hundreds of candidates by taking into account an individual user’s shopping experience and predicted buying behavior. The company just took top honors for “Best Overall Gen AI Solution” at Tech Breakthrough LLC’s AI Breakthrough Awards.

AI is also used in customer service to comb through previous interactions with individual customers and summarize their concerns, setting the stage for “a much more effective call,” Lahiri said.

Dedicated AI stack

EBay’s on-premises AI stack consists of a dedicated high-performance computing cluster built on Nvidia Corp. H100 Tensor Core graphics processing units and high-speed interconnects. Lahiri said standard cloud infrastructure is ill-matched to the needs of large training jobs.

“You can’t train those models on cloud infrastructure because it needs more of an HPC approach with InfiniBand, RDMA and back-end connections, because one GPU needs to access the memory of another GPU to train the model,” he said. Remote direct memory access lets one computer read and write another’s memory directly, without involving either machine’s operating system.
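
To make that GPU-to-GPU traffic concrete, below is a minimal sketch of multi-GPU data-parallel training in PyTorch. The NCCL backend that synchronizes gradients rides on InfiniBand and GPUDirect RDMA when a cluster provides them, which is the kind of back-end connectivity Lahiri describes; the model, batch size and launch command are placeholders, not eBay's setup.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each worker process.
        dist.init_process_group(backend="nccl")  # NCCL uses InfiniBand/RDMA when available
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for a real model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for _ in range(10):
            x = torch.randn(32, 1024, device=local_rank)
            loss = model(x).pow(2).mean()
            loss.backward()        # gradient all-reduce between GPUs happens here
            optimizer.step()
            optimizer.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

    # Launched with, for example: torchrun --nproc_per_node=8 train_sketch.py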

The company uses a range of machine learning models, from large language models to smaller open-source variants. Having a dedicated on-premises resource has turned out to be “very time-efficient because we don’t have to wait for the resources to be acquired from the public cloud,” Lahiri said. “On-premises is a flat and high-speed network, so data movement is much easier.” Whatever demand can’t be handled locally spills over into the public cloud.
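
That spill-over pattern can be pictured as a simple capacity check: place jobs on the on-prem cluster first and burst to the public cloud only when local capacity is exhausted. The sketch below is purely illustrative; the pool names, GPU counts and scheduling policy are assumptions, not eBay's system.

    from dataclasses import dataclass

    @dataclass
    class Pool:
        name: str
        total_gpus: int
        used_gpus: int = 0

        def try_reserve(self, gpus: int) -> bool:
            if self.used_gpus + gpus <= self.total_gpus:
                self.used_gpus += gpus
                return True
            return False

    def schedule(job_gpus: int, on_prem: Pool, cloud: Pool) -> str:
        # Prefer the flat, high-speed on-prem network; burst to cloud only on overflow.
        for pool in (on_prem, cloud):
            if pool.try_reserve(job_gpus):
                return pool.name
        return "queued"

    on_prem = Pool("on-prem", total_gpus=64)
    cloud = Pool("public-cloud", total_gpus=256)
    for job in (48, 24, 200):  # GPU counts for three hypothetical jobs
        print(job, "->", schedule(job, on_prem, cloud))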

Layered architecture

In honing its AI architecture over the years, eBay has built a layered approach that abstracts much of the complexity away from the user and the application.

At the top level, “the application makes a call, and then we created a shim (a small piece of code that acts as an intermediary between two systems or software components), so it doesn’t matter who is serving,” Lahiri said. “It could be Nvidia or AMD or Intel hardware. The application doesn’t have to worry about the differences between them.”
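
As a rough illustration of that shim layer, the sketch below hides which hardware backend serves a request behind a single call. The class names and backends are hypothetical stand-ins, not eBay's actual serving API.

    from typing import Protocol

    class InferenceBackend(Protocol):
        def generate(self, prompt: str) -> str: ...

    class NvidiaBackend:
        def generate(self, prompt: str) -> str:
            # would forward to a GPU-backed serving endpoint in practice
            return f"[nvidia] {prompt}"

    class AmdBackend:
        def generate(self, prompt: str) -> str:
            return f"[amd] {prompt}"

    class ModelShim:
        """Applications call the shim; the shim picks whichever backend
        currently serves the model, so callers never see the hardware."""
        def __init__(self, backends: dict[str, InferenceBackend], default: str):
            self._backends = backends
            self._default = default

        def generate(self, prompt: str, backend: str | None = None) -> str:
            return self._backends[backend or self._default].generate(prompt)

    # Application code stays the same no matter which pool serves the call.
    shim = ModelShim({"nvidia": NvidiaBackend(), "amd": AmdBackend()}, default="nvidia")
    print(shim.generate("Write a title for this listing"))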

Running an HPC environment on-premises isn’t without its challenges. One is simply keeping up with the rapid evolution of GPUs.

“The capabilities are growing three to six times with each generation,” Lahiri said. “The cycles are completely different than on the CPU side. You can’t mix and match [GPU generations] because if you train a model with a lower-quality and a higher-quality GPU, it defaults to the lowest quality.”

Another challenge is that running large numbers of GPUs taxes power and cooling infrastructure. While x86 server processors typically draw between 200 and 500 watts, an Nvidia H100 GPU peaks at 700 watts and Nvidia’s GB200 “Superchip” at 1,200 watts.
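
A bit of back-of-the-envelope math with the wattage figures above shows why the power and cooling load climbs so quickly; the dual-socket host and eight-GPU server layout are assumptions for illustration, not eBay's configuration.

    CPU_WATTS = 400          # midpoint of the 200-500 W x86 range quoted above
    H100_WATTS = 700         # H100 peak per the article
    GPUS_PER_SERVER = 8      # common HGX-style layout (assumption)

    cpu_server = 2 * CPU_WATTS                        # dual-socket x86 host
    gpu_server = cpu_server + GPUS_PER_SERVER * H100_WATTS

    print(f"CPU server: ~{cpu_server} W")             # ~800 W
    print(f"8x H100 server: ~{gpu_server} W")         # ~6,400 W
    print(f"Ratio: ~{gpu_server / cpu_server:.0f}x")  # roughly 8x the draw per box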

Lahiri said that when eBay’s fleet of H100 GPUs is running at full power, the cooling systems create so much noise that data center employees have to wear ear protection. Liquid cooling is an alternative, but he called it expensive and disruptive to install.

Figuring out AI infrastructure

Lahiri said he’s confident such problems will be solved with time. “Over the next two to three years, we are going to figure out the right kind of infrastructure for the inferencing, training and managing GPU infrastructure,” he said. “There will be a lot of innovation in the inferencing world as multiple chips emerge that are focused mainly on that rather than training.”

There will be plenty of new options, as more than a dozen startups are working on AI-specific chipsets, most of them focused on inferencing. Lahiri said his team keeps current on their progress, but practical considerations merit caution.

“You can fall in love with technology, but you have to look at the reality of how to deploy it in your data center,” he said. “The technology might look really interesting right now, but it has to withstand the pressure of time.”
