Sitemap
A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.
Pages
About Me
Archive Layout with Content
Posts by Category
Posts by Collection
CV
Markdown
Page not in menu
Page Archive
Portfolio
Publications
Sitemap
Software
Posts by Tags
Talk map
Talks and presentations
Teaching
Terms and Privacy Policy
Blog posts
Jupyter notebook markdown generator
Posts
Future Blog Post
Published:
This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.
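The scheduling behavior described above comes from Jekyll's future-post handling; a minimal sketch of the relevant setting, assuming the standard _config.yml at the repository root:

```yaml
# _config.yml — hide posts whose date is in the future
future: false
```

With future: false, Jekyll skips any post dated later than the build time; setting it back to true (or passing --future to jekyll build/serve) restores the default behavior this post demonstrates.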
Blog Post number 4
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 3
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 2
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 1
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
portfolio
publications
Stance Detection in Web and Social Media: A Comparative Study
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2019), 2019
Online forums and social media platforms are increasingly being used to discuss topics of varying polarities where different people take different stances. Several methodologies for automatic stance detection from text have been proposed in literature. To our knowledge, there has not been any systematic investigation towards their reproducibility, and their comparative performances. In this work, we explore the reproducibility of several existing stance detection models, including both neural models and classical classifier-based models. Through experiments on two datasets -- (i) the popular SemEval microblog dataset, and (ii) a set of health-related online news articles -- we also perform a detailed comparative analysis of various methods and explore their shortcomings.
@InProceedings{10.1007/978-3-030-28577-7_4,
author="Ghosh, Shalmoli and Singhania, Prajwal and Singh, Siddharth and Rudra, Koustav and Ghosh, Saptarshi",
title="Stance Detection in Web and Social Media: A Comparative Study",
booktitle="Experimental IR Meets Multilinguality, Multimodality, and Interaction",
year="2019",
publisher="Springer International Publishing",
address="Cham",
pages="75--87",
isbn="978-3-030-28577-7"
}
AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022
In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art.
@INPROCEEDINGS{9820664,
author={Singh, Siddharth and Bhatele, Abhinav},
booktitle={2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
title={AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning},
year={2022},
volume={},
number={},
pages={606-616},
keywords={Training;Deep learning;Schedules;Neural networks;Memory management;Graphics processing units;Clustering algorithms;parallel deep learning;asynchrony;message driven scheduling;memory optimizations},
doi={10.1109/IPDPS53621.2022.00065}}
PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems
IEEE Transactions on Computers, 2022
In the past decade, high performance compute capabilities exhibited by heterogeneous GPGPU platforms have led to the popularity of data parallel programming languages such as CUDA and OpenCL. Such languages, however, involve a steep learning curve as well as developing an extensive understanding of the underlying architecture of the compute devices in heterogeneous platforms. This has led to the emergence of several High Performance Computing frameworks which provide high-level abstractions for easing the development of data-parallel applications on heterogeneous platforms. However, the scheduling decisions undertaken by such frameworks only exploit coarse-grained concurrency in data parallel applications. In this paper, we propose PySchedCL, a framework which explores fine-grained concurrency aware scheduling decisions that harness the power of heterogeneous CPU/GPU architectures efficiently. We showcase the efficacy of such scheduling mechanisms over existing coarse-grained dynamic scheduling schemes by conducting extensive experimental evaluations for a Machine Learning based inferencing application.
@ARTICLE{9606595,
author={Ghose, Anirban and Singh, Siddharth and Kulaharia, Vivek and Dokara, Lokesh and Maity, Srijeeta and Dey, Soumyajit},
journal={IEEE Transactions on Computers},
title={PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems},
year={2022},
volume={71},
number={9},
pages={2234-2247},
doi={10.1109/TC.2021.3125792}}
Exploiting sparsity in pruned neural networks to optimize large model training
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023
Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning namely -- data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
@INPROCEEDINGS{10177389,
author={Singh, Siddharth and Bhatele, Abhinav},
booktitle={2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
title={Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training},
year={2023},
volume={},
number={},
pages={245-255},
keywords={Deep learning;Training;Artificial satellites;Computational modeling;Neural networks;Memory management;Parallel processing;lottery ticket hypothesis;sparse computations;GPUs;parallel deep learning;memory optimizations},
doi={10.1109/IPDPS54959.2023.00033}}
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
ICS '23: Proceedings of the 37th International Conference on Supercomputing, 2023
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4-8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
@inproceedings{10.1145/3577193.3593704,
author = {Singh, Siddharth and Ruwase, Olatunji and Awan, Ammar Ahmad and Rajbhandari, Samyam and He, Yuxiong and Bhatele, Abhinav},
title = {A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training},
year = {2023},
isbn = {9798400700569},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3577193.3593704},
doi = {10.1145/3577193.3593704},
abstract = {Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4--8\texttimes{} larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26\% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.},
booktitle = {Proceedings of the 37th ACM International Conference on Supercomputing},
pages = {203–214},
numpages = {12},
keywords = {expert parallelism, tensor parallelism, mixture-of-experts, parallel deep learning},
location = {Orlando, FL, USA},
series = {ICS '23}
}
Democratizing AI: Open-Source Scalable LLM Training on GPU-Based Supercomputers
SC'24: Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2024
ACM Gordon Bell Finalist
Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve matrix multiply kernel performance, overlap non-blocking collectives with computation, and performance modeling to choose performance optimal configurations. These have resulted in unprecedented scaling and peak flop/s (bf16) for training of GPT-style transformer models on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s) and Alps (1.423 Exaflop/s). While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization of training data, which can cause disclosure of sensitive or private information at inference time. We highlight this side effect of scale through experiments that explore catastrophic memorization, where models are sufficiently large to memorize training data in a single pass, and present an approach to prevent it. As part of this study, we demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
@inproceedings{10.1109/SC41406.2024.00010,
author = {Singh, Siddharth and Singhania, Prajwal and Ranjan, Aditya and Kirchenbauer, John and Geiping, Jonas and Wen, Yuxin and Jain, Neel and Hans, Abhimanyu and Shu, Manli and Tomar, Aditya and Goldstein, Tom and Bhatele, Abhinav},
title = {Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers},
year = {2024},
isbn = {9798350352917},
publisher = {IEEE Press},
url = {https://doi.org/10.1109/SC41406.2024.00010},
doi = {10.1109/SC41406.2024.00010},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis},
articleno = {4},
numpages = {14},
keywords = {GPGPUs, asynchrony, collective communication, large language models, parallel training},
location = {Atlanta, GA, USA},
series = {SC '24}
}
Be like a Goldfish, Don’t Memorize! Mitigating Memorization in Generative LLMs
Advances in Neural Information Processing Systems 37 (NeurIPS), 2024
Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, randomly sampled subsets of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
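The token-dropping idea in the abstract above can be sketched in a few lines. This is an illustrative reimplementation, not the paper's code: the drop probability, the random masking scheme, and all names here are assumptions for illustration.

```python
import numpy as np

def goldfish_loss(token_losses, drop_prob=0.25, seed=0):
    """Sketch of a goldfish-style loss: exclude a random subset of token
    positions from the next-token loss, so those tokens never contribute
    a gradient and cannot be directly memorized. drop_prob is an assumed
    rate, not a value from the paper."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(token_losses)) >= drop_prob  # True = token contributes
    kept = token_losses[keep]
    return kept.mean() if kept.size else 0.0

# Per-token cross-entropy values for one training sequence (made up).
per_token = np.array([2.1, 0.3, 1.7, 0.9, 2.5, 1.1])
loss = goldfish_loss(per_token)
```

Because the mask removes entire positions from the loss rather than perturbing them, the model can still learn the language statistics of the sequence while being unable to reproduce it verbatim end to end.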
@inproceedings{NEURIPS2024_2ad2dffb,
author = {Hans, Abhimanyu and Kirchenbauer, John and Wen, Yuxin and Jain, Neel and Kazemi, Hamid and Singhania, Prajwal and Singh, Siddharth and Somepalli, Gowthami and Geiping, Jonas and Bhatele, Abhinav and Goldstein, Tom},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {24022--24045},
publisher = {Curran Associates, Inc.},
title = {Be like a Goldfish, Don\textquotesingle t Memorize! Mitigating Memorization in Generative LLMs},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/2ad2dffba5079687651226ac8752df97-Paper-Conference.pdf},
volume = {37},
year = {2024}
}
Loki: Low-Rank Keys for Efficient Sparse Attention
Advances in Neural Information Processing Systems 37 (NeurIPS), 2024
Inference on large language models (LLMs) can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in LLM inference contributes significantly to these costs, which has sparked an interest in approximating the self-attention computation to reduce such costs. In this work, we propose to approximate self-attention by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods.
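The low-dimensional scoring idea above can be sketched as follows. This is a toy illustration in the spirit of the abstract, not the paper's implementation: the plain-PCA projection, the dimensions, and the parameter names are all assumptions.

```python
import numpy as np

def low_rank_sparse_attention(q, K, V, d_low=8, k_top=16):
    """Sketch of Loki-style sparse attention: rank tokens by attention
    scores computed in a low-dimensional key space, then run exact
    softmax attention over only the top-scoring tokens."""
    # Project keys (and the query) onto the top principal components of K.
    mu = K.mean(axis=0)
    _, _, Vt = np.linalg.svd(K - mu, full_matrices=False)
    P = Vt[:d_low].T                          # (d, d_low) projection
    q_low, K_low = (q - mu) @ P, (K - mu) @ P
    approx = q_low @ K_low.T                  # cheap scores in d_low dims
    idx = np.argsort(approx)[-k_top:]         # keep the k_top best tokens
    # Exact scaled-dot-product attention restricted to the selected tokens.
    s = q @ K[idx].T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 32))   # 64 cached key vectors of dimension 32
V = rng.normal(size=(64, 32))
q = rng.normal(size=32)
out = low_rank_sparse_attention(q, K, V)
```

The savings come from the ranking step: scoring all cached tokens costs O(n·d_low) instead of O(n·d), and the full-dimension attention then touches only k_top entries of the KV-cache.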
@inproceedings{NEURIPS2024_1e027da6,
author = {Singhania, Prajwal and Singh, Siddharth and He, Shwai and Feizi, Soheil and Bhatele, Abhinav},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {16692--16723},
publisher = {Curran Associates, Inc.},
title = {Loki: Low-rank Keys for Efficient Sparse Attention},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/1e027da6bec9ceb2ec37951ceeccae93-Paper-Conference.pdf},
volume = {37},
year = {2024}
}
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
arXiv preprint arXiv:2508.14444, 2025
@misc{nvidia2025nvidianemotronnano2,
title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model},
author={NVIDIA},
year={2025},
eprint={2508.14444},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14444}
}
HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages
ISC High Performance Conference 2025, 2025
Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas such as HPC. While previous work has explored HPC-specific models, LLMs still struggle to generate parallel code and it is not at all clear what hurdles are still holding back these LLMs and what must be done to overcome them. In this work, we conduct an in-depth study along the many axes of fine-tuning a specialized HPC LLM in order to better understand the challenges. Based on our findings we fine-tune and evaluate a specialized HPC LLM that is shown to be the best performing open-source code LLM for parallel code generation to date.
@misc{chaturvedi2024hpccoderv2studyingcodellms,
title={HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages},
author={Aman Chaturvedi and Daniel Nichols and Siddharth Singh and Abhinav Bhatele},
year={2024},
eprint={2412.15178},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2412.15178},
}
Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
SC'25: Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation -- Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.
@inproceedings{10.1145/3712285.3759890,
author = {Ranjan, Aditya K. and Singh, Siddharth and Wei, Cunyang and Bhatele, Abhinav},
title = {Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training},
year = {2025},
isbn = {9798400714665},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3712285.3759890},
doi = {10.1145/3712285.3759890},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
pages = {200--216},
numpages = {17},
keywords = {graph neural networks, training, social networks, GPGPUs, SpMM},
series = {SC '25}
}
Gemstones: A Model Suite for Multi-Faceted Scaling Laws
Advances in Neural Information Processing Systems 38 (NeurIPS), 2025
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes; including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
@inproceedings{
mcleish2025gemstones,
title={Gemstones: A Model Suite for Multi-Faceted Scaling Laws},
author={Sean Michael McLeish and John Kirchenbauer and David Yu Miller and Siddharth Singh and Abhinav Bhatele and Micah Goldblum and Ashwinee Panda and Tom Goldstein},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=iZk78dZ1Ap}
}
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Advances in Neural Information Processing Systems 38 (NeurIPS), 2025
Spotlight
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
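The iterate-a-recurrent-block idea above can be sketched minimally. This is an illustrative toy, not the paper's architecture: the single tanh layer standing in for the recurrent block, the input injection, and all names are assumptions.

```python
import numpy as np

def recurrent_depth_forward(x, W, r):
    """Sketch of recurrent-depth inference: one shared block is iterated
    r times in latent space, so test-time compute scales with r rather
    than with the number of generated tokens."""
    h = x
    for _ in range(r):               # unroll the shared block r times
        h = np.tanh(h @ W + x)       # re-inject the input each step
    return h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))  # weights shared across iterations
x = rng.normal(size=16)
shallow = recurrent_depth_forward(x, W, r=1)
deep = recurrent_depth_forward(x, W, r=32)  # same weights, more compute
```

The same parameters serve every iteration, which is what lets such a model spend more or less computation per input at test time without retraining.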
@inproceedings{
geiping2025scaling,
title={Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach},
author={Jonas Geiping and Sean Michael McLeish and Neel Jain and John Kirchenbauer and Siddharth Singh and Brian R. Bartoldson and Bhavya Kailkhura and Abhinav Bhatele and Tom Goldstein},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=S3GhJooWIC}
}
Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
arXiv preprint arXiv:2512.20848, 2025
@misc{nvidia2025nemotron3nanoopen,
title={Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
author={NVIDIA},
year={2025},
eprint={2512.20848},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.20848}
}
The Big Send-off: Scalable and Performant Collectives for Deep Learning
2026 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2026
Collective communication is becoming increasingly important in data center and supercomputer workloads with an increase in distributed AI related jobs. However, existing libraries that provide collective support such as NCCL, RCCL, and Cray-MPICH exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), specifically targeted for distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses a hierarchical design with learning-based adaptive selection of the best performing algorithms to scale efficiently to thousands of GPUs. It achieves substantial performance speedups over RCCL on 2048 GCDs of Frontier -- up to 168x for reduce-scatter, 33x for all-gather and 10x for all-reduce. More modest but still significant gains up to 5.7x over NCCL are observed on Perlmutter. These gains translate directly to performance improvement of production DL workloads: up to 4.9x speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4x speedup in DDP training.
@INPROCEEDINGS{singh2026bigsendoff,
author={Singh, Siddharth and Pradeep, Keshav and Singh, Mahua and Wei, Cunyang and Bhatele, Abhinav},
booktitle={2026 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
title={The Big Send-off: Scalable and Performant Collectives for Deep Learning},
year={2026}}
talks
teaching
Teaching experience 1
This is a description of a teaching experience. You can use markdown like any other post.
Teaching experience 2
This is a description of a teaching experience. You can use markdown like any other post.
