Zeus automatically optimizes the energy consumption and training time of recurring DNN training jobs by finding the optimal batch size and GPU power limit. Please refer to the NSDI ’23 publication for details. Zeus is part of The ML.ENERGY Initiative.
Extended Zeus from single-GPU to single-node multi-GPU training, with support for efficient data-parallel training. [Pull requests #2 and #7]
Won 2nd Best Overall Solution at Carbon Hack 22 as part of a team of three by creating Carbon-Aware Zeus, which significantly reduces the total carbon footprint of DNN training with almost no increase in training time.
Presented a one-hour knowledge-sharing talk at a SymbioticLab meeting, titled “MLOps: Machine Learning from Lab to Production”. [Slides]
Ongoing: integrating Zeus with Kubeflow, an open-source Kubernetes-native MLOps platform for machine learning workflow orchestration, to ease industry adoption of Zeus.