Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
EuroSys, 2024
Deep Learning training jobs process large amounts of training data across many GPU devices, often running for weeks or months. When hardware or software failures occur, these jobs must restart and lose the in-memory state of the Deep Neural Network (DNN) model trained so far, unless checkpointing mechanisms periodically save the training state. However, for large models, periodic checkpointing incurs significant steady-state overhead, and during recovery a large number of GPUs must redo all work since the last checkpoint. This is especially problematic for large DNN training jobs that use many GPUs, where failures are frequent.
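To make the trade-off concrete, below is a minimal sketch of conventional periodic checkpointing in a single-process PyTorch-style training loop. This is not the paper's just-in-time mechanism; the model, the interval `CKPT_EVERY`, and the path `CKPT_PATH` are illustrative assumptions. The point is that every save stalls training (the steady-state overhead), and on restart all steps since the last saved checkpoint are redone.

```python
# Minimal sketch of conventional periodic checkpointing (illustrative only,
# not the paper's Just-In-Time mechanism). CKPT_PATH, CKPT_EVERY, and the
# toy model are assumptions made for this example.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"   # assumed checkpoint file, for illustration
CKPT_EVERY = 100              # assumed checkpoint interval, in steps
TOTAL_STEPS = 1000

model = nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Recovery: resume from the last checkpoint if one exists. Any steps that
# ran after it but before the failure are lost and must be redone.
start_step = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, TOTAL_STEPS):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Steady-state overhead: training pauses while state is serialized.
    if step % CKPT_EVERY == 0:
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```

With an interval of N steps, a failure loses on the order of N/2 steps of work on every participating GPU, while shortening the interval increases the frequency of the serialization stalls, which is the tension the abstract describes.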