Originally posted on Data Center POST
Today’s machine learning (ML) training systems are deployed on top of traditional datacenter fabrics with electrical packet switches arranged in a multi-tier topology. The performance and efficiency of this architecture face severe limitations because of localized network bandwidth bottlenecks. Tech companies are looking to simplify modern-day workflows to increase enterprise productivity.
Telescent Inc., a leading manufacturer of automated fiber patch-panels and cross-connects for networks and data centers, announces today that results of the company’s collaboration with the Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory (MIT CSAIL), aimed at accelerating training time for machine learning workflows, will be showcased in an invited presentation at the Networked Systems Design and Implementation (NSDI) conference taking place April 17-19, 2023 in Boston, MA.
The Telescent programmable patch panel can provision and deliver network connections with very high aggregate bandwidth (on the order of thousands of terabits per second) within a massive Graphics Processing Unit (GPU) cluster while consuming minimal energy. The collaboration between Telescent and MIT CSAIL focused on reducing the training time for machine learning workflows by optimizing the communication between workers in the GPU cluster through programmable network connections. The collaboration accelerated workflows by a factor of 3.4, a significant performance improvement that overcomes limitations of current GPU clusters in ML training applications.
According to Manya Ghobadi, Associate Professor at MIT CSAIL and program co-chair of NSDI, “Large-scale ML clusters require enormous computational resources and consume a significant amount of energy. TopoOpt is the first ML-centric network architecture that co-optimizes the distributed training process across three dimensions: computation, communication, and network topology, to significantly improve performance. Inspired by Telescent’s recent inventions on reconfigurable optical patch panels, we dive deep into the world of reconfigurable topology specifically for deep neural network (DNN) training. Using reconfigurable network topology brings a new dimension for optimizing large DNN training workloads.”