Minjia Zhang wins Amazon Research Award to advance efficient AI model training


CS professor Minjia Zhang has received an Amazon Research Award that will provide him with $250,000 in credits toward use of Amazon Web Services’ Trainium research cluster. He will use the cluster as a platform for advancing his work on mixture-of-experts (MoE) learning, a machine learning approach in which AI models are divided into submodels that can be selectively activated to improve computational efficiency.

Written by Jenny Applequist

Minjia Zhang (Photo Credit: University of Illinois / Holly Birch Photography)

Zhang will receive AWS cloud access worth $250k to advance mixture-of-experts (MoE) learning.

Amazon has announced a cohort of Amazon Research Award recipients for its “Build on Trainium” program, and Minjia Zhang is among them.

Zhang, who is an assistant professor in the Siebel School of Computing and Data Science, will receive $250,000 in credits toward use of Amazon’s Trainium research cluster.

He will use the award to advance his ongoing work on mixture-of-experts (MoE) learning, a machine learning (ML) approach in which an AI model is divided into smaller “experts” that can be selectively activated, making model training more efficient.
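The selective-activation idea can be illustrated with a small sketch. The toy code below is plain Python/NumPy, not Zhang’s DeepSpeed-MoE or X-MoE code, and all sizes and names are chosen purely for illustration: a gating network scores a set of experts for each token, and only the top-scoring few are actually run, which is what keeps per-token compute far below the model’s total parameter count.

    # Illustrative sketch only; not Zhang's DeepSpeed-MoE or X-MoE code.
    # A toy mixture-of-experts layer: a gating network scores all experts for
    # each token, but only the TOP_K highest-scoring experts are evaluated.
    import numpy as np

    rng = np.random.default_rng(0)
    D, H = 16, 32               # token width and expert hidden size (arbitrary toy values)
    NUM_EXPERTS, TOP_K = 8, 2   # 8 experts, but each token activates only 2

    # Each "expert" is a small two-layer MLP with its own parameters.
    experts = [(rng.standard_normal((D, H)) * 0.1,
                rng.standard_normal((H, D)) * 0.1) for _ in range(NUM_EXPERTS)]
    gate_w = rng.standard_normal((D, NUM_EXPERTS)) * 0.1   # router ("gate") weights

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def moe_forward(tokens):
        """Run each token through only its TOP_K highest-scoring experts."""
        scores = tokens @ gate_w                        # (num_tokens, NUM_EXPERTS)
        out = np.zeros_like(tokens)
        for t, token in enumerate(tokens):
            chosen = np.argsort(scores[t])[-TOP_K:]     # selected expert indices
            weights = softmax(scores[t, chosen])        # mixing weights over the chosen experts
            for w, e in zip(weights, chosen):
                w1, w2 = experts[e]
                out[t] += w * (np.maximum(token @ w1, 0.0) @ w2)   # ReLU MLP expert
        return out

    tokens = rng.standard_normal((4, D))    # a tiny batch of four tokens
    print(moe_forward(tokens).shape)        # (4, 16); only 2 of 8 experts ran per token

At production scale the experts are sharded across many accelerators, so the routing step becomes a distributed communication problem, which is where the coordination challenges described below come in.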

The Trainium research cluster was designed for high-performance ML workloads and contains tens of thousands of Amazon Web Services (AWS) Trainium AI accelerator chips connected through a petabit-scale network. The Trainium chips are made specifically for high-performance deep learning training of AI models, and the Build on Trainium program provides academic research teams with compute credits to support novel AI research that uses these chips.

Zhang explained that MoE involves a big challenge: training MoE models requires “intricate coordination” across kernels, memory hierarchy, and communication topologies, yet “the existing systems predominantly solve that challenge for NVIDIA-specific hardware.”

The Amazon award therefore presents an opportunity to develop efficient and scalable MoE training for a new hardware architecture and programming model that have been neglected by developers to date.

Zhang’s research interests are in building highly efficient systems for large-scale machine learning, and he has racked up multiple achievements in efficient MoE training in particular. He co-developed an MoE training framework called DeepSpeed-MoE that has been widely adopted; more recently, he has been working on an enhanced MoE training system called X-MoE that is designed to overcome scalability bottlenecks for emerging expert-specialized MoEs.

For the Amazon project, his research group will extend X-MoE to AWS’s Trainium research cluster and then redesign and expand it to achieve “a Trainium-native solution” for efficient MoE training. This will involve developing new sparse computation kernels via Trainium’s Neuron Kernel Interface, topology-aware expert routing optimized for Trainium’s UltraCluster interconnect, and hybrid MoE parallelism built on the Neuron SDK and Neuron Distributed (NxD) communication library.
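As a rough intuition for the topology-aware routing piece, the sketch below is plain Python, not the Neuron SDK, NxD, or any real Trainium interface; the expert placement and the local_bonus knob are hypothetical and exist only to show why nudging a token’s expert choice toward experts hosted on its own node can cut cross-node traffic.

    # Conceptual illustration only. This is plain Python, not the Neuron SDK,
    # NxD, or any real Trainium interface; the placement and "local_bonus" knob
    # below are hypothetical and exist only to show why topology matters.
    NUM_EXPERTS = 8
    DEVICES_PER_NODE = 4

    # Hypothetical expert-parallel placement: expert e lives on device e,
    # and devices are grouped into nodes of DEVICES_PER_NODE.
    expert_to_node = {e: e // DEVICES_PER_NODE for e in range(NUM_EXPERTS)}

    def cross_node_dispatches(token_node, chosen_experts):
        """Count how many chosen experts force traffic off the token's node."""
        return sum(1 for e in chosen_experts if expert_to_node[e] != token_node)

    def topology_aware_topk(scores, token_node, k=2, local_bonus=0.5):
        """Pick top-k experts, nudging the choice toward same-node experts.

        Real systems balance locality against routing quality and expert load
        in more principled ways; the additive bonus here is purely illustrative.
        """
        adjusted = [s + (local_bonus if expert_to_node[e] == token_node else 0.0)
                    for e, s in enumerate(scores)]
        return sorted(range(NUM_EXPERTS), key=lambda e: adjusted[e], reverse=True)[:k]

    # Toy comparison: the same router scores, with and without the locality nudge.
    scores = [0.90, 0.50, 0.10, 0.00, 0.60, 0.55, 0.30, 0.10]
    token_node = 0
    naive = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:2]
    aware = topology_aware_topk(scores, token_node)
    print("naive:", naive, "cross-node dispatches:", cross_node_dispatches(token_node, naive))
    print("aware:", aware, "cross-node dispatches:", cross_node_dispatches(token_node, aware))

In the planned system, the corresponding decisions would be made against Trainium’s UltraCluster interconnect and implemented with the Neuron toolchain rather than in Python.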

“Together, our innovations will deliver scalable, resource-efficient MoE training pipelines that are fully optimized for Trainium’s architecture,” he said.


Affiliations

In addition to his primary affiliation in the Siebel School of Computing and Data Science, Zhang is also an affiliate of the Department of Electrical and Computer Engineering and the Center for Artificial Intelligence Innovation (CAII) in the National Center for Supercomputing Applications.



This story was published December 1, 2025.