AI learns to spot problems in AI training systems before they occur
New approach could prevent disruptions, improve reliability and reduce operational costs for large-scale AI infrastructure
LOS ANGELES, March 12, 2026 (GLOBE NEWSWIRE) — Researchers have developed a new AI-based method for predicting optical transceiver failures in the computer clusters used for AI training. The new technology could allow operators to anticipate failures before they occur, helping prevent disruptions in AI training and reducing operational costs.
Jingyi Su from Shanghai Jiao Tong University in China will present this research at the 2026 Optical Fiber Communication Conference and Exhibition (OFC), the world’s largest annual gathering for optical networking and communications professionals, which will take place 15 – 19 March 2026 at the Los Angeles Convention Center.
“As generative AI becomes increasingly integrated into daily life, users demand high real-time responsiveness and stability from AI services,” said Su. “Our technology shifts the paradigm from reactive failure recovery to proactive failure prediction. Instead of merely reducing the time to repair after failures occur, we can now anticipate and replace failing components before they disrupt training — achieving truly uninterrupted AI services through ‘zero-touch’ failure mitigation.”
The research was conducted as a three-way collaboration among Shanghai Jiao Tong University and Chinese technology companies Baidu Inc. and Huawei Technologies. The proposed algorithm has been deployed in Baidu’s global AI data centers, where it continuously monitors and predicts failures across 400G optical transceivers, demonstrating its practical impact on real-world large-scale AI infrastructure.
According to the researchers, improved infrastructure efficiency could ultimately lower the cost of AI services, making advanced AI technologies more accessible to a broader population.
Using past data to predict future failures
An AI training cluster is a set of servers within a data center that is dedicated to training AI models. These clusters are usually optimized for GPU-heavy computations, high-speed interconnects and parallel processing. Optical transceivers form the critical connections between the servers and switches that support coordinated computation across tens of thousands of GPUs.
Unlike workloads in traditional data centers, AI training is highly sensitive to network instability. Because the computation is coordinated across so many GPUs, the impact of a single transceiver failure can be amplified many times over in large-scale clusters, leading to wasted computation and training interruptions.
In the new work, Su and colleagues developed a way to predict optical transceiver failures using a future-guided learning method based on a teacher-student architecture. For this approach, the teacher model learns failure signatures from data that precede failures and then transfers this information to the student model through knowledge distillation.
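The release does not give the training objective, so the following is only a minimal sketch of a generic knowledge-distillation loss, not the researchers' formulation. It assumes a two-class setup (healthy vs. failing) and uses hypothetical names (`softmax`, `distillation_loss`, `temperature`, `alpha`). The student is pushed toward both the true label and the teacher's softened output.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and KL(teacher || student).

    Hypothetical illustration: `alpha` balances the two terms and
    `temperature` softens both distributions before they are compared.
    """
    # Hard-label term: the student should predict the true class.
    hard = -math.log(softmax(student_logits)[label])

    # Soft-label term: the student should match the teacher's softened output.
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

    # The temperature**2 factor keeps the soft term on a comparable scale.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Class 0 = healthy, class 1 = failing (assumed labels for illustration).
teacher = [-2.0, 2.0]  # teacher is confident the transceiver is failing
agreeing = distillation_loss([-2.0, 2.0], teacher, label=1)
disagreeing = distillation_loss([2.0, -2.0], teacher, label=1)
print(agreeing < disagreeing)
```

A student that agrees with the teacher incurs a smaller loss than one that disagrees, which is the signal that transfers the teacher's knowledge during training.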
Accurate warnings in complex environments
The researchers validated their method on a test set of optical transceiver performance data from Baidu’s AI training clusters and then compared its performance to mainstream time-series prediction models, including long short-term memory (LSTM) networks. The future-guided learning framework achieved an F1-score of 0.964, a 9.3% improvement over the LSTM network. The F1-score combines precision (the share of warnings that were correct) and recall (the share of actual failures that were caught) into a single measure ranging from 0 (worst) to 1 (perfect).
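For readers unfamiliar with the metric, here is a short sketch of how an F1-score is computed from prediction counts. The function name and the counts in the usage lines are hypothetical and chosen only for illustration; they are not figures from the study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    predicted = true_positives + false_positives  # all warnings issued
    actual = true_positives + false_negatives     # all real failures
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct warnings, 10 false alarms, no missed failures.
precision, recall, f1 = precision_recall_f1(90, 10, 0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, so a model cannot score well simply by raising alarms on everything.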
“This improvement demonstrates that our approach effectively extracts clearer failure signatures from real-world operational data, overcoming the challenges of high noise, missing samples and irregular sampling that characterize production environments,” said Su. “These results show that our method is more robust to complex data environments.”
The researchers also showed that incorporating teacher guidance raised recall, the percentage of actual failures detected, from 95.1% to 100%, with the system achieving zero missed alarms on the test set. These results indicate that the method can reliably warn of optical transceiver failures in AI data centers, issuing warnings hours before the failures occur.
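The release reports an F1-score and a recall figure but not precision. Assuming the reported F1-score of 0.964 and the 100% recall describe the same model on the same test set (an assumption, since the release does not say so explicitly), the implied precision can be recovered by rearranging the F1 formula. The function name below is hypothetical.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P, given F1 and recall R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denominator


# Figures reported in the release; pairing them is an assumption.
precision = implied_precision(f1=0.964, recall=1.0)
print(f"implied precision: {precision:.3f}")
```

Under that assumption the implied precision is roughly 0.93, meaning a small share of warnings would have been false alarms even though no actual failure was missed.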
“This paper presents a future-guided learning framework for predicting optical transceiver failures in AI data center networks,” said OFC program chair Qiong (Jo) Zhang from Amazon Web Services. “Validated on real-world field data from Baidu’s AI training clusters, the results are compelling — an F1 score of 0.964 and 100% recall — demonstrating strong potential for minimizing costly training interruptions in large-scale AI infrastructure.”
About OFC
The Optical Fiber Communication Conference and Exhibition (OFC) is the world’s largest event for optical communications and networking professionals — a showcase for the trends and technologies that impact how the world communicates and transacts. It is the locus for scientific visionaries and the industry’s biggest brands to make connections and move business forward. For more than 50 years, participants from all corners of the globe have been drawn to OFC by its high-impact, peer-reviewed research, dynamic business programs and the world’s largest in-person exhibition for optical communications.
OFC is co-sponsored by the IEEE Communications Society (IEEE/ComSoc) and the IEEE Photonics Society and co-sponsored and managed by Optica.
OFC takes place 15 – 19 March 2026, at the Los Angeles Convention Center, Los Angeles, California, USA. Learn more at OFCConference.org or follow @OFC-Conference on LinkedIn and X (#OFC26).
Media Contact
media@ofcconference.org
