4/1/2025
CS professor Tianyin Xu discussed his research on improving the reliability of modern cloud infrastructures in the March 2025 newsletter of The Grainger College of Engineering IBM-Illinois Discovery Accelerator Institute.
Siebel School of Computing and Data Science professor Tianyin Xu seeks to improve the reliability of modern cloud operations. Xu’s research focuses on building reliable computer systems that empower next-generation cloud computing. IBM grants from the IBM-Illinois Discovery Accelerator Institute support his current work.
Q: What is the research problem your team is addressing?
Xu: Our overarching goal is to fundamentally improve the reliability of modern cloud infrastructures and the critical computing systems running atop (them). The IIDAI project specifically focuses on cloud system operation — how to ensure that cloud systems are always correctly and reliably managed and thus are always kept in their desired states. The recent CrowdStrike incident is one example that demonstrates the importance of correct system management — a simple configuration update could result in global outages and $5.4 billion in financial losses. In this IIDAI project, we aim to build proactive techniques to prevent such disasters.
Modern cloud systems are growing in scale and demand beyond what human operators can reliably, continuously, and efficiently manage. Therefore, cloud systems are increasingly being managed by programs that automate labor-intensive operations. So, how can we ensure that these operation programs are correctly designed and implemented? Developing reliable operation programs is very challenging; for example, they must correctly recover the managed systems from any error state, and they must tolerate unexpected faults and events (which are the norm in cloud environments). We find that real-world operation programs are far from reliable, posing significant risks to the cloud systems they manage.
Q: What are the impacts of your research?
Xu: In this IIDAI project, we developed a series of “push-button” end-to-end testing techniques to check the operation correctness of cloud management programs. Some of the initial techniques were described in a research paper. We embodied these techniques in an open-source project named Acto.
With the support of IIDAI’s entrepreneurship grant, we turned Acto from an early research prototype into a research product. To date, Acto has been used to detect hundreds of serious defects in widely used operation programs (called “operators”) in Kubernetes, a modern cloud management platform. Many of these defects have severe consequences, including data loss, system outages, and security vulnerabilities. Thanks to Acto, more than half of these defects have been patched or fixed by developers.
The project has received good attention from industry. The team was invited to give talks and showcases at several major industry conferences, and an invited article was published in USENIX’s ;login: magazine.
Last but not least, the project has engaged a dozen undergraduate students through programs such as the IIDAI Undergraduate Research Experience. Most of them go on to pursue graduate studies in top programs at Illinois, the University of Michigan, the University of Texas at Austin, the University of California, Berkeley, Harvard University, and elsewhere. Additionally, Acto has been incorporated into the curriculum of a graduate-level course at Illinois.
Q: How has AI impacted your research?
Xu: More and more cloud operation programs are empowered by AI these days. In fact, we are working closely with IBM Research through IIDAI to explore the uses of AI agents for cloud system management. You can read this IBM Blog for details.
Now, with the emergence of AI-driven technologies, safety and reliability have become an even bigger concern. If we cannot think through how to build seatbelts and airbags for AI-driven management technologies, it is hard to fully unleash the power of AI for the cloud.
Q: How has Grainger’s collaboration with IBM benefited your students?
Xu: The IBM/IIDAI collaboration is essential! Apart from the generous funding support, the technical discussion with our IBM collaborators helped us understand the research problems in a real-world context, including practical constraints and new opportunities in the solution space.
I would like to thank all my IBM collaborators, from whom I’ve learned tremendously over the years, especially Chen Wang, Hubertus Franke, Saurabh Jha, Rohan Arora, Michael V. Le, and Hani Jamjoom.
Interview from The Grainger College of Engineering IBM-Illinois Discovery Accelerator Institute March 2025 newsletter.
Grainger Engineering Affiliations
Tianyin Xu is an Illinois Grainger Engineering professor of computer science and is affiliated with electrical and computer engineering and the IBM-Illinois Discovery Accelerator Institute.