Deep neural networks, or DNNs, are machine learning algorithms loosely modeled on how biological brains learn. They power advances in artificial intelligence in fields such as self-driving vehicles, national defense technology and medical devices.

Specialized accelerator chips provide the hefty amount of computing power that large deep neural networks require, and they offer greater energy efficiency than traditional CPUs and GPUs. However, fabricating DNN chips at advanced technology nodes is challenging: defects introduced during manufacturing can significantly reduce the functionality of DNN chips.

Fabricating a chip requires hundreds of manufacturing steps, followed by extensive testing, and chips with defects are often discarded. These wasted chips contributed to the chip supply shortage that came to light in 2020, which hit the automobile industry especially hard.

Jeff Zhang, an assistant professor of electrical engineering in the Ira A. Fulton Schools of Engineering at Arizona State University, sought to improve DNN accelerators’ fault tolerance with a technique called fault-aware pruning, or FAP. The technique bypasses faulty parts of the accelerator’s circuitry so the chip can keep operating without slowing down.
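The article does not go into implementation detail, but the core idea of fault-aware pruning can be sketched in a few lines of code. The Python example below assumes a hypothetical systolic-array mapping in which each weight is assigned to a processing element by its row and column index modulo the array dimensions; the function name, mapping rule and fault map are illustrative placeholders, not Zhang’s actual implementation.

```python
import numpy as np

def fault_aware_prune(weights, pe_fault_map, array_rows, array_cols):
    """Zero out (prune) the weights that would be mapped onto faulty
    processing elements, so the faulty hardware contributes nothing."""
    pruned = weights.copy()
    out_features, in_features = weights.shape
    for i in range(out_features):
        for j in range(in_features):
            # Hypothetical mapping: weight (i, j) is handled by the
            # processing element at (i mod rows, j mod cols).
            if pe_fault_map[i % array_rows, j % array_cols]:
                pruned[i, j] = 0.0  # bypass the faulty unit
    return pruned

# Example: one faulty multiply-accumulate unit in a 4 x 4 array
rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 8)).astype(np.float32)
fault_map = np.zeros((4, 4), dtype=bool)
fault_map[1, 2] = True
pruned = fault_aware_prune(weights, fault_map, array_rows=4, array_cols=4)
print(np.count_nonzero(weights != pruned), "weights pruned")
```

In hardware, the equivalent effect is to force the faulty unit’s contribution to zero, so the array keeps running at full speed and only the pruned connections are lost.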

In addition to his work on FAP, Zhang and his colleagues proposed a method that combines FAP with retraining the neural network to restore accuracy while ensuring the chips run at the same speed as they would without faults. These efforts resulted in the selection of Zhang’s 2018 paper proposing the idea as one of the Institute of Electrical and Electronics Engineers, or IEEE, 2023 Top Picks in Test and Reliability.
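As an illustration of that second idea, a common software realization of pruning plus retraining is to fine-tune the network while re-applying the pruning mask after every update, so the surviving weights learn to compensate for the ones that were removed. The PyTorch sketch below uses a placeholder fault mask and made-up tensor shapes; it shows the general masked fine-tuning pattern rather than the specific procedure from the 2018 paper.

```python
import torch
import torch.nn as nn

# Hypothetical mask: 1 where the mapped processing element is healthy,
# 0 where fault-aware pruning has zeroed the weight.
mask = torch.ones(16, 32)
mask[3, :] = 0.0  # e.g. weights mapped onto a faulty region of the array

layer = nn.Linear(32, 16, bias=False)
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(100):
    inputs = torch.randn(64, 32)    # stand-in training batch
    targets = torch.randn(64, 16)   # stand-in labels
    optimizer.zero_grad()
    loss = loss_fn(layer(inputs), targets)
    loss.backward()
    optimizer.step()
    # Re-apply the pruning mask so the zeroed weights stay zero and the
    # remaining weights adapt to restore accuracy.
    with torch.no_grad():
        layer.weight.mul_(mask)
```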

The title recognizes research papers from the preceding six years deemed by the IEEE to have the greatest impact on the field of testing and reliability of very large-scale integration, or VLSI, systems.

“I’m humbled to receive this recognition for my work,” Zhang says. “It encourages my group to continue working in this exciting and critical research area to ensure fault tolerance and reliability of deep learning hardware.”

For his future research, Zhang plans to address the rise of large language models such as ChatGPT and their hardware needs. As deep learning chips become bigger and more complex, Zhang aims to better understand the impact of faults across the software and hardware stack to design better hardware and systems for next-generation machine learning.