Research
Current Projects
I. Learning Algorithms
A. Bioinspired Machine Learning Algorithms
B. AI for Robotics
C. Video Processing Algorithms
II. AI Hardware
A. Compute-In-Memory and Compute-Near-Memory (CIM and CNM)
B. Neuro-mimetic devices
III. Reliable AI
A. Privacy
B. Robustness
C. Distributed Learning
D. Dataset Security
IV. Co-designing Algorithm and Hardware
A. System Technology Co-design (STCO)
B. Device Technology Co-design (DTCO)
C. Algorithmic Optimizations for Hardware Efficiency
V. Generative AI
A. Large Language Models
I. Learning Algorithms
A. Bioinspired Machine Learning Algorithms
Our research focuses on developing bioinspired machine learning algorithms, leveraging insights from neuroscience to create more efficient and robust AI systems. Our work spans four broad themes: spiking neural networks, foveation-based active learning, continual learning and unlearning, and local learning.
Spiking Neural Networks
Spiking neural networks (SNNs) are inspired by the neurons of the human brain, operating in the temporal domain with binary inputs and outputs, or "spikes." These models are not only more energy-efficient than traditional artificial neural networks but also excel at capturing temporal data. By mimicking the human neural system, SNNs can selectively remember and forget features learned during training, offering a more efficient approach to processing temporal information.
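To make the neuron model concrete, here is a minimal sketch of a leaky-integrate-and-fire (LIF) neuron of the kind used throughout these works; the leak factor, threshold, and hard reset below are illustrative assumptions rather than the settings of any particular paper.

```python
import numpy as np

def lif_neuron(input_current, leak=0.9, threshold=1.0):
    """Simulate a single LIF neuron over a sequence of timesteps.

    The membrane potential leakily integrates the incoming current and
    emits a binary spike whenever it crosses the threshold, after which
    it is reset to zero. Parameter values are illustrative.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current  # leaky integration
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

# A constant input yields a regular spike train.
print(lif_neuron(np.full(10, 0.4)))  # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
```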
Publications:
-
Rathi, Nitin, and Kaushik Roy. "LITE-SNN: Leveraging Inherent Dynamics to Train Energy-Efficient Spiking Neural Networks for Sequential Learning." IEEE Transactions on Cognitive and Developmental Systems (2024). CODE AVAILABLE HERE
Abstract: Spiking Neural Networks (SNNs) are gaining popularity for their promise of low-power machine intelligence on event-driven neuromorphic hardware. SNNs have achieved comparable performance as ANNs on static tasks (image classification) with lower compute energy. In this work, we explore the inherent dynamics of SNNs for sequential tasks like gesture recognition, sentiment analysis, and sequence-to-sequence learning on data from dynamic vision sensors (DVS) and natural language processing (NLP). Sequential data is generally processed with complex RNNs (LSTM/GRU) with explicit feedback connections and internal states to handle the long-term dependencies. The neuron models in SNNs - integrate-and-fire (IF) or leaky-integrate-and-fire (LIF) - have internal states (membrane potential) that can be efficiently leveraged for sequential tasks. The membrane potential in the IF/LIF neuron integrates the incoming current and outputs an event (or spike) when the potential crosses a threshold value. Since SNNs compute with highly sparse spike-based spatio-temporal data, the energy/inference is lower than LSTMs/GRUs. We also show that SNNs require fewer parameters than LSTM/GRU resulting in smaller models and faster inference. We observe the problem of vanishing gradients in vanilla SNNs for longer sequences and implement a convolutional SNN with attention layers to perform sequence-to-sequence learning tasks. The inherent recurrence in SNNs, in addition to the fully parallelized convolutional operations, provide additional mechanisms to model sequential dependencies that lead to better accuracy than convolutional neural networks (CNNs) with ReLU activations. We evaluate SNN on gesture recognition from the IBM DVS dataset, sentiment analysis from the IMDB movie reviews dataset, and German-to-English translation from the Multi30k dataset.
-
Garg, Isha, Sayeed Shafayet Chowdhury, and Kaushik Roy. "DCT-SNN: Using DCT to distribute spatial information over time for low-latency spiking neural networks." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. CODE AVAILABLE HERE
Abstract: Spiking Neural Networks (SNNs) offer a promising alternative to traditional deep learning, since they provide higher computational efficiency due to event-driven information processing. SNNs distribute the analog values of pixel intensities into binary spikes over time. However, the most widely used input coding schemes, such as Poisson based rate-coding, do not leverage the additional temporal learning capability of SNNs effectively. Moreover, these SNNs suffer from high inference latency which is a major bottleneck to their deployment. To overcome this, we propose a time-based encoding scheme that utilizes Discrete Cosine Transform (DCT) to reduce the number of timesteps required for inference (DCT-SNN). DCT decomposes an image into a weighted sum of sinusoidal basis images. At each time step, a single frequency base, taken in order and modulated by its corresponding DCT coefficient, is input to an accumulator that generates spikes upon crossing a threshold. We use the proposed scheme to train DCT-SNN, a low-latency deep SNN with leaky-integrate-and-fire neurons using surrogate gradient descent based backpropagation. We achieve top-1 accuracy of 89.94%, 68.30% and 52.43% on CIFAR-10, CIFAR-100 and TinyImageNet, respectively using VGG architectures. Notably, DCT-SNN performs inference with 2-14X reduced latency compared to other state-of-the-art SNNs, while achieving comparable accuracy to their standard deep learning counterparts. The dimension of the transform allows us to control the number of timesteps required for inference. Additionally, we can trade-off accuracy with latency in a principled manner by dropping the highest frequency components during inference.
-
Rathi, Nitin, and Kaushik Roy. "DIET-SNN: Direct input encoding with leakage and threshold optimization in deep spiking neural networks." arXiv preprint arXiv:2008.03658 (2020). CODE AVAILABLE HERE
Abstract: Bio-inspired spiking neural networks (SNNs), operating with asynchronous binary signals (or spikes) distributed over time, can potentially lead to greater computational efficiency on event-driven hardware. The state-of-the-art SNNs suffer from high inference latency, resulting from inefficient input encoding, and sub-optimal settings of the neuron parameters (firing threshold, and membrane leak). We propose DIET-SNN, a low-latency deep spiking network that is trained with gradient descent to optimize the membrane leak and the firing threshold along with other network parameters (weights). The membrane leak and threshold for each layer of the SNN are optimized with end-to-end backpropagation to achieve competitive accuracy at reduced latency. The analog pixel values of an image are directly applied to the input layer of DIET-SNN without the need to convert to spike-train. The first convolutional layer is trained to convert inputs into spikes where leaky-integrate-and-fire (LIF) neurons integrate the weighted inputs and generate an output spike when the membrane potential crosses the trained firing threshold. The trained membrane leak controls the flow of input information and attenuates irrelevant inputs to increase the activation sparsity in the convolutional and dense layers of the network. The reduced latency combined with high activation sparsity provides large improvements in computational efficiency. We evaluate DIET-SNN on image classification tasks from CIFAR and ImageNet datasets on VGG and ResNet architectures. We achieve top-1 accuracy of 69% with 5 timesteps (inference latency) on the ImageNet dataset with 12x less compute energy than an equivalent standard ANN. Additionally, DIET-SNN performs 20-500x faster inference compared to other state-of-the-art SNN models.
-
Sengupta, Abhronil, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. "Going deeper in spiking neural networks: VGG and residual architectures." Frontiers in Neuroscience 13 (2019).
Abstract: Over the past few years, Spiking Neural Networks (SNNs) have become popular as a possible pathway to enable low-power event-driven neuromorphic hardware. However, their application in machine learning has largely been limited to very shallow neural network architectures for simple problems. In this paper, we propose a novel algorithmic technique for generating an SNN with a deep architecture, and demonstrate its effectiveness on complex visual recognition problems such as CIFAR-10 and ImageNet. Our technique applies to both VGG and Residual network architectures, with significantly better accuracy than the state-of-the-art. Finally, we present an analysis of the sparse event-driven computations to demonstrate reduced hardware overhead when operating in the spiking domain.
-
Lee, Chankyu, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. "Training deep spiking convolutional neural networks with STDP-based unsupervised pre-training followed by supervised fine-tuning." Frontiers in Neuroscience 12 (2018).
Abstract: Spiking Neural Networks (SNNs) are fast becoming a promising candidate for brain-inspired neuromorphic computing because of their inherent power efficiency and impressive inference accuracy across several cognitive tasks such as image classification and speech recognition. The recent efforts in SNNs have been focused on implementing deeper networks with multiple hidden layers to incorporate exponentially more difficult functional representations. In this paper, we propose a pre-training scheme using biologically plausible unsupervised learning, namely Spike-Timing-Dependent-Plasticity (STDP), in order to better initialize the parameters in multi-layer systems prior to supervised optimization. The multi-layer SNN is comprised of alternating convolutional and pooling layers followed by fully-connected layers, which are populated with leaky integrate-and-fire spiking neurons. We train the deep SNNs in two phases wherein, first, convolutional kernels are pre-trained in a layer-wise manner with unsupervised learning followed by fine-tuning the synaptic weights with spike-based supervised gradient descent backpropagation. Our experiments on digit recognition demonstrate that the STDP-based pre-training with gradient-based optimization provides improved robustness, faster (~2.5 ×) training time and better generalization compared with purely gradient-based training without pre-training.
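As an illustration of the DCT-based input encoding described in the DCT-SNN paper above, the sketch below presents one frequency component per timestep to a per-element accumulator that spikes on threshold crossings. It is a 1-D, single-channel simplification under an assumed threshold; the paper operates on 2-D image bases.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_temporal_encoding(signal, threshold=0.5):
    """Encode a 1-D signal into spikes over len(signal) timesteps.

    At timestep t, the t-th DCT basis vector, scaled by its DCT
    coefficient, is added to an accumulator; spikes are emitted (and
    the threshold subtracted) wherever the accumulator crosses the
    threshold. The threshold value is an assumption.
    """
    n = len(signal)
    coeffs = dct(signal, norm='ortho')
    accumulator = np.zeros(n)
    spike_train = []
    for t in range(n):
        basis = idct(np.eye(n)[t], norm='ortho')  # t-th DCT basis vector
        accumulator += coeffs[t] * basis          # one frequency per step
        spikes = accumulator >= threshold
        accumulator[spikes] -= threshold          # soft reset on spiking
        spike_train.append(spikes.astype(int))
    return np.array(spike_train)                  # shape: (timesteps, n)

print(dct_temporal_encoding(np.linspace(0, 1, 8)).shape)  # (8, 8)
```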
Foveation-Based Active Learning
In traditional deep neural networks, the entire image is processed to answer both "what is observed" and "where is it located." However, this approach differs significantly from human visual processing, as revealed by neuroscience research. Humans process visual cues selectively, focusing on specific regions of interest. By developing algorithms that emulate this foveation-based perception, we can achieve robustness to noise and adversarial attacks, as well as gain a better understanding of objects and images by learning the "grammar" of objects.
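A minimal sketch of the foveation idea: sample an image at full resolution near a fixation point and progressively coarser away from it. The block-averaging blur below is an assumed stand-in for the multi-scale falloff of acuity these works model.

```python
import numpy as np

def foveate(image, fixation, fovea_radius=16):
    """Return a foveated copy of a grayscale image of shape (H, W).

    Pixels inside the fovea keep full resolution; pixels outside are
    replaced by a coarse local average, crudely mimicking the loss of
    acuity away from the fixation point. Assumes H and W are divisible
    by the block size; block averaging stands in for a proper pyramid.
    """
    h, w = image.shape
    block = 8  # coarse resolution outside the fovea (assumed)
    coarse = image.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    coarse = np.kron(coarse, np.ones((block, block)))  # upsample back to (H, W)
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(ys - fixation[0], xs - fixation[1])
    return np.where(dist <= fovea_radius, image, coarse)

glimpse = foveate(np.random.rand(64, 64), fixation=(32, 32))
```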
Publications:
-
Ibrayev, Timur, Amitangshu Mukherjee, Sai Aparna Aketi, and Kaushik Roy. "Towards Two-Stream Foveation-based Active Vision Learning." IEEE Transactions on Cognitive and Developmental Systems (2024).
Abstract: Deep neural network (DNN) based machine perception frameworks process the entire input in a one-shot manner to provide answers to both "what object is being observed" and "where it is located". In contrast, the "two-stream hypothesis" from neuroscience explains the neural processing in the human visual cortex as an active vision system that utilizes two separate regions of the brain to answer the what and the where questions. In this work, we propose a machine learning framework inspired by the "two-stream hypothesis" and explore the potential benefits that it offers. Specifically, the proposed framework models the following mechanisms: 1) ventral (what) stream focusing on the input regions perceived by the fovea part of an eye (foveation), 2) dorsal (where) stream providing visual guidance, and 3) iterative processing of the two streams to calibrate visual focus and process the sequence of focused image patches. The training of the proposed framework is accomplished by label-based DNN training for the ventral stream model and reinforcement learning for the dorsal stream model. We show that the two-stream foveation-based learning is applicable to the challenging task of weakly-supervised object localization (WSOL), where the training data is limited to the object class or its attributes. The framework is capable of both predicting the properties of an object and successfully localizing it by predicting its bounding box. We also show that, due to the independent nature of the two streams, the dorsal model can be applied on its own to unseen images to localize objects from different datasets.
-
Ibrayev, Timur, Manish Nagaraj, Amitangshu Mukherjee, and Kaushik Roy. "Exploring Foveation and Saccade for Improved Weakly-Supervised Localization." Gaze Meets Machine Learning Workshop, PMLR, 2024. CODE AVAILABLE HERE
Abstract: Deep neural networks have become the de facto choice as feature extraction engines, ubiquitously used for computer vision tasks. The current approach is to process every input with uniform resolution in a one-shot manner and make all of the predictions at once. However, human vision is an "active" process that not only actively switches from one focus point to another within the visual field, but also applies spatially varying attention centered at such focus points. To bridge the gap, we propose incorporating the bio-plausible mechanisms of foveation and saccades to build an active object localization framework. While foveation enables it to process different regions of the input with variable degrees of detail, saccades allow it to change the focus point of such foveated regions. Our experiments show that these mechanisms improve the quality of predicted bounding boxes by capturing all the essential object parts while minimizing unnecessary background clutter. Additionally, they enable the resiliency of the method by allowing it to detect multiple objects while being trained only on data containing a single object per image. Finally, we explore the alignment of our method with human perception using the interesting "duck-rabbit" optical illusion.
-
Chowdhury, Sayeed Shafayet, Soumyadeep Chandra, and Kaushik Roy. "Towards Visual Syntactical Understanding." IEEE Access (2024).
Abstract: Syntax is usually studied in the realm of linguistics and refers to the arrangement of words in a sentence. Similarly, an image can be considered as a visual "sentence", with the semantic parts of the image acting as "words". While visual syntactic understanding occurs naturally to humans, it is interesting to explore whether deep neural networks (DNNs) are equipped with such reasoning. To that end, we alter the syntax of natural images (e.g. swapping the eye and nose of a face), referred to as "incorrect" images, to investigate the sensitivity of DNNs to such syntactic anomaly. Through our experiments, we discover an intriguing property of DNNs where we observe that state-of-the-art convolutional neural networks, as well as vision transformers, fail to discriminate between syntactically correct and incorrect images, when trained on only correct ones. To counter this issue and enable visual syntactic understanding with DNNs, we propose a three-stage framework: (i) the "words" (or the sub-features) in the image are detected, (ii) the detected words are sequentially masked and reconstructed using an autoencoder, (iii) the original and reconstructed parts are compared at each location to determine syntactic correctness. The reconstruction module is trained with BERT-like masked autoencoding for images, with the motivation to leverage language model inspired training to better capture the syntax. Note, our proposed approach is unsupervised in the sense that the incorrect images are only used during testing and the correct versus incorrect labels are never used for training. We perform experiments on CelebA, and AFHQ datasets and obtain classification accuracy of 92.10%, and 90.89%, respectively. Notably, the approach generalizes well to ImageNet samples which share common classes with CelebA and AFHQ without explicitly training on them.
-
Tao, Chun, Timur Ibrayev, and Kaushik Roy. "Towards Image Semantics and Syntax Sequence Learning." arXiv preprint arXiv:2401.17515 (2024).
Abstract: Convolutional neural networks and vision transformers have achieved outstanding performance in machine perception, particularly for image classification. Although these image classifiers excel at predicting image-level class labels, they may not discriminate missing or shifted parts within an object. As a result, they may fail to detect corrupted images that involve missing or disarrayed semantic information in the object composition. On the contrary, human perception easily distinguishes such corruptions. To mitigate this gap, we introduce the concept of "image grammar", consisting of "image semantics" and "image syntax", to denote the semantics of parts or patches of an image and the order in which these parts are arranged to create a meaningful object. To learn the image grammar relative to a class of visual objects/scenes, we propose a weakly supervised two-stage approach. In the first stage, we use a deep clustering framework that relies on iterative clustering and feature refinement to produce part-semantic segmentation. In the second stage, we incorporate a recurrent bi-LSTM module to process a sequence of semantic segmentation patches to capture the image syntax. Our framework is trained to reason over patch semantics and detect faulty syntax. We benchmark the performance of several grammar learning models in detecting patch corruptions. Finally, we verify the capabilities of our framework in Celeb and SUNRGBD datasets and demonstrate that it can achieve a grammar validation accuracy of 70 to 90% in a wide variety of semantic and syntactical corruption scenarios.
-
Mukherjee, Amitangshu, Timur Ibrayev, and Kaushik Roy. "On Inherent Adversarial Robustness of Active Vision Systems." arXiv preprint arXiv:2404.00185 (2024).
Abstract: Current Deep Neural Networks are vulnerable to adversarial examples, which alter their predictions by adding carefully crafted noise. Since human eyes are robust to such inputs, it is possible that the vulnerability stems from the standard way of processing inputs in one shot by processing every pixel with the same importance. In contrast, neuroscience suggests that the human vision system can differentiate salient features by (1) switching between multiple fixation points (saccades) and (2) processing the surrounding with a non-uniform external resolution (foveation). In this work, we advocate that the integration of such active vision mechanisms into current deep learning systems can offer robustness benefits. Specifically, we empirically demonstrate the inherent robustness of two active vision methods - GFNet and FALcon - under a black box threat model. By learning and inferencing based on downsampled glimpses obtained from multiple distinct fixation points within an input, we show that these active methods achieve (2-3) times greater robustness compared to a standard passive convolutional network under state-of-the-art adversarial attacks. More importantly, we provide illustrative and interpretable visualization analysis that demonstrates how performing inference from distinct fixation points makes active vision methods less vulnerable to malicious inputs.
Continual Learning and Unlearning
Humans have the remarkable ability to learn new topics without relearning everything from scratch. While continual learning algorithms have been explored extensively in research, they often suffer from inefficiencies in memory complexity and training time. Our lab develops algorithms that address these bottlenecks by examining the mathematics of learning algorithms. Furthermore, we are pioneers in proposing unlearning algorithms, inspired by the human ability to "unlearn" or correct previously acquired biases without starting anew. In an era where data privacy and model efficiency are critical, unlearning algorithms play a pivotal role in enhancing AI systems.
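To make the unlearning idea concrete, the sketch below estimates retain and forget activation subspaces with an SVD and projects the class-discriminatory directions out of a layer's weights, loosely following the procedure of Kodge et al. listed below. The dimensions, fixed rank, and single-layer setup are assumptions for illustration; the paper derives ranks and scaling from the singular values.

```python
import numpy as np

def unlearn_layer(weights, retain_acts, forget_acts, rank=10):
    """Suppress class-discriminatory directions in one layer.

    retain_acts / forget_acts: (num_samples, dim) activations from the
    retained and to-be-forgotten classes. Each space is estimated with
    a truncated SVD; the component shared with the retain space is
    removed so that only class-discriminatory forget directions are
    projected out of the weights. Rank choice is an assumption.
    """
    Ur = np.linalg.svd(retain_acts.T, full_matrices=False)[0][:, :rank]
    Uf = np.linalg.svd(forget_acts.T, full_matrices=False)[0][:, :rank]
    Uf_only = Uf - Ur @ (Ur.T @ Uf)      # drop what the spaces share
    basis = np.linalg.svd(Uf_only, full_matrices=False)[0][:, :rank]
    projector = np.eye(weights.shape[1]) - basis @ basis.T
    return weights @ projector           # weights ignore forget directions

W = np.random.randn(32, 64)
W_unlearned = unlearn_layer(W, np.random.randn(100, 64), np.random.randn(100, 64))
```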
Publications:
-
Kodge, Sangamesh, Gobinda Saha, and Kaushik Roy. "Deep unlearning: Fast and efficient gradient-free class forgetting." Transactions on Machine Learning Research (TMLR) 2024. CODE AVAILABLE HERE
Abstract: Machine unlearning is a prominent and challenging field, driven by regulatory demands for user data deletion and heightened privacy awareness. Existing approaches involve retraining the model or multiple finetuning steps for each deletion request, often constrained by computational limits and restricted data access. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate specific classes from the learned model. Our algorithm first estimates the Retain and the Forget Spaces using Singular Value Decomposition on the layerwise activations for a small subset of samples from the retain and unlearn classes, respectively. We then compute the shared information between these spaces and remove it from the forget space to isolate class-discriminatory feature space. Finally, we obtain the unlearned model by updating the weights to suppress the class discriminatory features from the activation spaces. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only ~1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Furthermore, our algorithm exhibits competitive unlearning performance and resilience against Membership Inference Attacks (MIA). Compared to baselines, it achieves an average accuracy improvement of 1.38% on the ImageNet dataset while requiring up to 10× fewer samples for unlearning. Additionally, under stronger MIA attacks on the CIFAR-100 dataset using a ResNet18 architecture, our approach outperforms the best baseline by 1.8%.
-
Saha, Gobinda, and Kaushik Roy. "Online continual learning with saliency-guided experience replay using tiny episodic memory." Machine Vision and Applications 34.4 (2023): 65. CODE AVAILABLE HERE
Abstract: Artificial learning systems aspire to mimic human intelligence by continually learning from a stream of tasks without forgetting past knowledge. One way to enable such learning is to store past experiences in the form of input examples in episodic memory and replay them when learning new tasks. However, the performance of such a method suffers as the size of the memory becomes smaller. In this paper, we propose a new approach for experience replay, where we select the past experiences by looking at the saliency maps, which provide visual explanations for the model's decision. Guided by these saliency maps, we pack the memory with only the parts or patches of the input images important for the model's prediction. While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model about its past decisions. We evaluate our algorithm on CIFAR-100, miniImageNet and CUB datasets and report better performance than the state-of-the-art approaches. We perform a detailed study to show the effectiveness of zero-padded patch replay compared to the other candidate approaches. Moreover, with qualitative and quantitative analyses we show that our method captures richer summaries of past experiences without any memory increase and hence performs well with small episodic memory.
-
Saha, Gobinda, and Kaushik Roy. "Continual learning with scaled gradient projection." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 8. 2023. CODE AVAILABLE HERE
Abstract: In neural networks, continual learning results in gradient interference among sequential tasks, leading to catastrophic forgetting of old tasks while learning new ones. This issue is addressed in recent methods by storing the important gradient spaces for old tasks and updating the model orthogonally during new tasks. However, such restrictive orthogonal gradient updates hamper the learning capability of the new tasks resulting in sub-optimal performance. To improve new learning while minimizing forgetting, in this paper we propose a Scaled Gradient Projection (SGP) method, where we combine the orthogonal gradient projections with scaled gradient steps along the important gradient spaces for the past tasks. The degree of gradient scaling along these spaces depends on the importance of the bases spanning them. We propose an efficient method for computing and accumulating importance of these bases using the singular value decomposition of the input representations for each task. We conduct extensive experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.
-
Saha, Gobinda, Isha Garg, and Kaushik Roy. "Gradient Projection Memory for Continual Learning." International Conference on Learning Representations (2021). CODE AVAILABLE HERE
Abstract: The ability to learn continually without forgetting the past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance-based weight update or replay of old data from the memory. In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single shot manner and store them in the memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches.
-
Saha, Gobinda, Isha Garg, Aayush Ankit, and Kaushik Roy. "SPACE: Structured compression and sharing of representational space for continual learning." IEEE Access 9 (2021): 150480-150494. CODE AVAILABLE HERE
Abstract: Humans learn incrementally from sequential experiences throughout their lives, which has proven hard to emulate in artificial neural networks. Incrementally learning tasks causes neural networks to overwrite relevant information learned about older tasks, resulting in "Catastrophic Forgetting". Efforts to overcome this phenomenon often utilize resources poorly, for instance, by growing the network architecture or needing to save parametric importance scores, or violate data privacy between tasks. To tackle this, we propose SPACE, an algorithm that enables a network to learn continually and efficiently by partitioning the learnt space into a Core space, that serves as the condensed knowledge base over previously learned tasks, and a Residual space, which is akin to a scratch space for learning the current task. After learning each task, the Residual is analyzed for redundancy, both within itself and with the learnt Core space. A minimal number of extra dimensions required to explain the current task are added to the Core space and the remaining Residual is freed up for learning the next task. We evaluate our algorithm on P-MNIST, CIFAR and a sequence of 8 different datasets, and achieve comparable accuracy to the state-of-the-art methods while overcoming catastrophic forgetting. Additionally, our algorithm is well suited for practical use. The partitioning algorithm analyzes all layers in one shot, ensuring scalability to deeper networks. Moreover, the analysis of dimensions translates to filter-level sparsity, and the structured nature of the resulting architecture gives us up to 5x improvement in energy efficiency during task inference over the current state-of-the-art.
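The step shared by the GPM, SGP, and SPACE papers above is constraining new-task gradients to directions orthogonal to subspaces deemed important for past tasks. A minimal sketch, assuming a single weight matrix and an assumed variance threshold for choosing the number of stored bases:

```python
import numpy as np

def important_bases(activations, energy=0.95):
    """SVD of past-task representations; keep enough bases to explain
    an `energy` fraction of the variance (threshold is assumed)."""
    U, S, _ = np.linalg.svd(activations.T, full_matrices=False)
    k = np.searchsorted(np.cumsum(S**2) / np.sum(S**2), energy) + 1
    return U[:, :k]  # (dim, k) gradient projection memory

def project_gradient(grad, memory):
    """Remove the gradient's component inside the stored subspace,
    so the update is orthogonal to directions important to old tasks."""
    return grad - (grad @ memory) @ memory.T

acts = np.random.randn(200, 64)   # activations collected from an old task
M = important_bases(acts)
g = np.random.randn(32, 64)       # gradient of a (32, 64) weight matrix
g_orth = project_gradient(g, M)   # step that interferes minimally with the old task
```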
Local Learning
Local learning methodologies draw inspiration from the synchronization of neural activity in the human brain, reflecting the temporal coordination of brain signals. Our focus is on creating training methodologies efficient for edge device-based training. These bioinspired techniques offer significant improvements in time and memory complexities, as well as computational energy required for training. In today's machine learning landscape, where training and inference resources are limited, local learning provides an essential framework for developing more resource-efficient AI systems.
Our lab's commitment to integrating principles from neuroscience into machine learning aims to push the boundaries of AI, making it more robust, efficient, and aligned with human cognitive processes.
Publications:
-
Apolinario, Marco Paul E., Arani Roy, and Kaushik Roy. "LLS: Local Learning Rule for Deep Neural Networks Inspired by Neural Activity Synchronization." arXiv preprint arXiv:2405.15868 (2024).
Abstract: Training deep neural networks (DNNs) using traditional backpropagation (BP) presents challenges in terms of computational complexity and energy consumption, particularly for on-device learning where computational resources are limited. Various alternatives to BP, including random feedback alignment, forward-forward, and local classifiers, have been explored to address these challenges. These methods have their advantages, but they can encounter difficulties when dealing with intricate visual tasks or demand considerable computational resources. In this paper, we propose a novel Local Learning rule inspired by neural activity Synchronization phenomena (LLS) observed in the brain. LLS utilizes fixed periodic basis vectors to synchronize neuron activity within each layer, enabling efficient training without the need for additional trainable parameters. We demonstrate the effectiveness of LLS and its variations, LLS-M and LLS-MxM, on multiple image classification datasets, achieving accuracy comparable to BP with reduced computational complexity and minimal additional parameters. Furthermore, the performance of LLS on the Visual Wake Word (VWW) dataset highlights its suitability for on-device learning tasks, making it a promising candidate for edge hardware implementations.
-
Apolinario, Marco Paul E., and Kaushik Roy. "S-TLLR: STDP-inspired temporal local learning rule for spiking neural networks." arXiv preprint arXiv:2306.15220 (2023).
Abstract: Spiking Neural Networks (SNNs) are biologically plausible models that have been identified as potentially apt for deploying energy-efficient intelligence at the edge, particularly for sequential learning tasks. However, training of SNNs poses significant challenges due to the necessity for precise temporal and spatial credit assignment. The back-propagation through time (BPTT) algorithm, whilst the most widely used method for addressing these issues, incurs a high computational cost due to its temporal dependency. In this work, we propose S-TLLR, a novel three-factor temporal local learning rule inspired by the Spike-Timing Dependent Plasticity (STDP) mechanism, aimed at training deep SNNs on event-based learning tasks. Furthermore, S-TLLR is designed to have low memory and time complexities, which are independent of the number of time steps, rendering it suitable for online learning on low-power edge devices. To demonstrate the scalability of our proposed method, we have conducted extensive evaluations on event-based datasets spanning a wide range of applications, such as image and gesture recognition, audio classification, and optical flow estimation. In all the experiments, S-TLLR achieved high accuracy, comparable to BPTT, with a reduction in memory between 5-50× and multiply-accumulate (MAC) operations between 1.3-6.6×.
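As a rough illustration of the layer-local training that LLS and S-TLLR above explore, the sketch below updates a layer's weights from a purely local error against fixed (non-trainable) class vectors, with no gradient flowing back from later layers. The random basis and mean-squared-error objective are assumed stand-ins for the periodic basis vectors and exact rules in the papers.

```python
import numpy as np

def local_layer_update(x, W, label, num_classes, lr=0.01):
    """One layer-local weight update: no signal from later layers.

    The layer's ReLU activity is nudged toward a fixed vector assigned
    to the true class; the error, and hence the update, is computed
    entirely within the layer. The fixed random basis and MSE loss are
    assumptions standing in for LLS's periodic basis vectors.
    """
    rng = np.random.default_rng(0)
    basis = rng.standard_normal((num_classes, W.shape[0]))  # fixed, not trained
    h = np.maximum(W @ x, 0.0)                              # layer activity
    err = h - basis[label]                                  # local error only
    grad_W = np.outer(err * (h > 0.0), x)                   # dL/dW for MSE through ReLU
    return W - lr * grad_W, h                               # h feeds the next layer

W1 = 0.01 * np.random.randn(128, 784)
W1, h1 = local_layer_update(np.random.rand(784), W1, label=3, num_classes=10)
```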
B. AI for Robotics
Modern AI has become essential for many everyday applications, particularly those requiring real-time processing for complex tasks under resource constraints at the edge. Cloud-based AI algorithms often fall short of meeting energy and latency requirements, underscoring the need for efficient edge AI solutions. Toward this goal, our research focuses on developing efficient vision-based robot navigation systems that pair suitable sensors with algorithms light enough for the edge. We explore algorithms for robot perception as well as for control and planning.
Perception
Robot perception involves the tasks that enable robots to sense, interpret, and understand their environment, including optical flow and depth estimation, object detection and tracking, semantic segmentation, and gesture and emotion recognition. Our research focuses on leveraging low-power, high-temporal-resolution sensors, such as event cameras, combined with bio-inspired algorithms like spiking neural networks (SNNs) to develop efficient, low-latency algorithms suitable for edge computing.
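Event cameras emit asynchronous (x, y, timestamp, polarity) tuples rather than frames, so a common preprocessing step for the networks below is binning events into a spatio-temporal grid. A minimal sketch of one such representation; the sensor resolution (a common DAVIS-style 346x260) and bin count are assumptions, not tied to a specific dataset.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins=5, height=260, width=346):
    """Bin an event stream into a (num_bins, H, W) spatio-temporal grid.

    events: array of shape (N, 4) with columns (x, y, t, polarity),
    polarity in {-1, +1}. Each event adds its polarity to the voxel for
    its pixel and time bin. Bin count and resolution are assumed.
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)   # scale times to [0, 1]
    bins = np.minimum((t_norm * num_bins).astype(int), num_bins - 1)
    grid = np.zeros((num_bins, height, width))
    np.add.at(grid, (bins, y, x), p)                        # scatter-add events
    return grid

events = np.column_stack([np.random.randint(0, 346, 1000),
                          np.random.randint(0, 260, 1000),
                          np.sort(np.random.rand(1000)),
                          np.random.choice([-1, 1], 1000)])
grid = events_to_voxel_grid(events)
```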
Publications:
-
Biswas, Shristi Das, Adarsh Kosta, Chamika Liyanagedera, Marco Apolinario, and Kaushik Roy. "HALSIE: Hybrid approach to learning segmentation by simultaneously exploiting image and event modalities." In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5952-5962, IEEE, 2024.
Abstract: Event cameras detect changes in per-pixel intensity to generate asynchronous "event streams". They offer great potential for accurate semantic map retrieval in real-time autonomous systems owing to their much higher temporal resolution and high dynamic range (HDR) compared to conventional cameras. However, existing implementations for event-based segmentation suffer from sub-optimal performance since these temporally dense events only measure the varying component of a visual signal, limiting their ability to encode dense spatial context compared to frames. To address this issue, we propose a hybrid end-to-end learning framework HALSIE, utilizing three key concepts to reduce inference cost by up to 20× versus prior art while retaining similar performance: First, a simple and efficient cross-domain learning scheme to extract complementary spatio-temporal embeddings from both frames and events. Second, a specially designed dual-encoder scheme with Spiking Neural Network (SNN) and Artificial Neural Network (ANN) branches to minimize latency while retaining cross-domain feature aggregation. Third, a multi-scale cue mixer to model rich representations of the fused embeddings. These qualities of HALSIE allow for a very lightweight architecture achieving state-of-the-art segmentation performance on DDD-17, MVSEC, and DSEC-Semantic datasets with up to 33× higher parameter efficiency and favorable inference cost (17.9mJ per cycle). Our ablation study also brings new insights into effective design choices that can prove beneficial for research across other vision tasks.
-
Negi, Shubham, Deepika Sharma, Adarsh Kumar Kosta, and Kaushik Roy. "Best of Both Worlds: Hybrid SNN-ANN Architecture for Event-based Optical Flow Estimation." arXiv preprint arXiv:2306.02960 (2023).
Abstract: In the field of robotics, event-based cameras are emerging as a promising low-power alternative to traditional frame-based cameras for capturing high-speed motion and high dynamic range scenes. This is due to their sparse and asynchronous event outputs. Spiking Neural Networks (SNNs) with their asynchronous event-driven compute, show great potential for extracting the spatio-temporal features from these event streams. In contrast, the standard Analog Neural Networks (ANNs) fail to process event data effectively. However, training SNNs is difficult due to additional trainable parameters (thresholds and leaks), vanishing spikes at deeper layers, and a non-differentiable binary activation function. Furthermore, an additional data structure, membrane potential, responsible for keeping track of temporal information, must be fetched and updated at every timestep in SNNs. To overcome these challenges, we propose a novel SNN-ANN hybrid architecture that combines the strengths of both. Specifically, we leverage the asynchronous compute capabilities of SNN layers to effectively extract the input temporal information. Concurrently, the ANN layers facilitate training and efficient hardware deployment on traditional machine learning hardware such as GPUs. We provide extensive experimental analysis for assigning each layer to be spiking or analog, leading to a network configuration optimized for performance and ease of training. We evaluate our hybrid architecture for optical flow estimation on DSEC-flow and Multi-Vehicle Stereo Event-Camera (MVSEC) datasets. On the DSEC-flow dataset, the hybrid SNN-ANN architecture achieves a 40% reduction in average endpoint error (AEE) with 22% lower energy consumption compared to Full-SNN, and 48% lower AEE compared to Full-ANN, while maintaining comparable energy usage.
-
Nagaraj, Manish, Chamika Mihiranga Liyanagedera, and Kaushik Roy. "DOTIE - Detecting objects through temporal isolation of events using a spiking architecture." In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 4858-4864, IEEE, 2023.
Abstract: Vision-based autonomous navigation systems rely on fast and accurate object detection algorithms to avoid obstacles. Algorithms and sensors designed for such systems need to be computationally efficient, due to the limited energy of the hardware used for deployment. Biologically inspired event cameras are a good candidate as a vision sensor for such systems due to their speed, energy efficiency, and robustness to varying lighting conditions. However, traditional computer vision algorithms fail to work on event-based outputs, as they lack photometric features such as light intensity and texture. In this work, we propose a novel technique that utilizes the temporal information inherently present in the events to efficiently detect moving objects. Our technique consists of a lightweight spiking neural architecture that is able to separate events based on the speed of the corresponding objects. These separated events are then further grouped spatially to determine object boundaries. This method of object detection is both asynchronous and robust to camera noise. In addition, it shows good performance in scenarios with events generated by static objects in the background, where existing event-based algorithms fail. We show that by utilizing our architecture, autonomous navigation systems can have minimal latency and energy overheads for performing object detection.
-
Kosta, Adarsh Kumar, and Kaushik Roy. "Adaptive-SpikeNet: Event-based optical flow estimation using spiking neural networks with learnable neuronal dynamics." In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 6021-6027, IEEE, 2023.
Abstract: Event-based cameras have recently shown great potential for high-speed motion estimation owing to their ability to capture temporally rich information asynchronously. Spiking Neural Networks (SNNs), with their neuro-inspired event-driven processing can efficiently handle such asynchronous data, while neuron models such as the leaky-integrate and fire (LIF) can keep track of the quintessential timing information contained in the inputs. SNNs achieve this by maintaining a dynamic state in the neuron memory, retaining important information while forgetting redundant data over time. Thus, we posit that SNNs would allow for better performance on sequential regression tasks compared to similarly sized Analog Neural Networks (ANNs). However, deep SNNs are difficult to train due to vanishing spikes at later layers. To that effect, we propose an adaptive fully-spiking framework with learnable neuronal dynamics to alleviate the spike vanishing problem. We utilize surrogate gradient-based backpropagation through time (BPTT) to train our deep SNNs from scratch. We validate our approach for the task of optical flow estimation on the Multi-Vehicle Stereo Event-Camera (MVSEC) dataset and the DSEC-Flow dataset. Our experiments on these datasets show an average reduction of ~13% in average endpoint error (AEE) compared to state-of-the-art ANNs. We also explore several down-scaled models and observe that our SNN models consistently outperform similarly sized ANNs offering ~10%-16% lower AEE. These results demonstrate the importance of SNNs for smaller models and their suitability at the edge. In terms of efficiency, our SNNs offer substantial savings in network parameters (~48.3×) and computational energy (~10.2×) while attaining ~10% lower EPE compared to the state-of-the-art ANN implementations.
-
Joshi, Amogh, Adarsh Kosta, Wachirawit Ponghiran, Manish Nagaraj, and Kaushik Roy. "FEDORA: Flying Event Dataset fOr Reactive behAvior." arXiv preprint arXiv:2305.14392 (2023).
Abstract: The ability of resource-constrained biological systems such as fruitflies to perform complex and high-speed maneuvers in cluttered environments has been one of the prime sources of inspiration for developing vision-based autonomous systems. To emulate this capability, the perception pipeline of such systems must integrate information cues from tasks including optical flow and depth estimation, object detection and tracking, and segmentation, among others. However, the conventional approach of employing slow, synchronous inputs from standard frame-based cameras constrains these perception capabilities, particularly during high-speed maneuvers. Recently, event-based sensors have emerged as low latency and low energy alternatives to standard frame-based cameras for capturing high-speed motion, effectively speeding up perception and hence navigation. For coherence, all the perception tasks must be trained on the same input data. However, present-day datasets are curated mainly for a single or a handful of tasks and are limited in the rate of the provided ground truths. To address these limitations, we present Flying Event Dataset fOr Reactive behAviour (FEDORA) - a fully synthetic dataset for perception tasks, with raw data from frame-based cameras, event-based cameras, and Inertial Measurement Units (IMU), along with ground truths for depth, pose, and optical flow at a rate much higher than existing datasets.
-
Ponghiran, Wachirawit, Chamika Mihiranga Liyanagedera, and Kaushik Roy. "Event-based temporally dense optical flow estimation with sequential learning." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9827-9836, 2023.
Abstract: Event cameras provide an advantage over traditional frame-based cameras when capturing fast-moving objects without a motion blur. They achieve this by recording changes in light intensity (known as events), thus allowing them to operate at a much higher frequency and making them suitable for capturing motions in a highly dynamic scene. Many recent studies have proposed methods to train neural networks (NNs) for predicting optical flow from events. However, they often rely on a spatio-temporal representation constructed from events over a fixed interval, such as 10Hz used in training on the DSEC dataset. This limitation restricts the flow prediction to the same interval (10Hz) whereas the fast speed of event cameras, which can operate up to 3kHz, has not been effectively utilized. In this work, we show that a temporally dense flow estimation at 100Hz can be achieved by treating the flow estimation as a sequential problem using two different variants of recurrent networks - Long-short term memory (LSTM) and spiking neural network (SNN). First, we utilize the NN model constructed similar to the popular EV-FlowNet but with LSTM layers to demonstrate the efficiency of our training method. The model not only produces 10x more frequent optical flow than the existing ones, but the estimated flows also have 13% lower errors than predictions from the baseline EV-FlowNet. Second, we construct an EV-FlowNet SNN but with leaky integrate and fire neurons to efficiently capture the temporal dynamics. We found that simple inherent recurrent dynamics of SNN lead to significant parameter reduction compared to the LSTM model. In addition, because of its event-driven computation, the spiking model is estimated to consume only 1.5% energy of the LSTM model, highlighting the efficiency of SNN in processing events and the potential for achieving temporally dense flow.
-
Ponghiran, Wachirawit, and Kaushik Roy. "Spiking neural networks with improved inherent recurrence dynamics for sequential learning." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 7, pp. 8001-8008, 2022.
Abstract: Spiking neural networks (SNNs) with leaky integrate and fire (LIF) neurons, can be operated in an event-driven manner and have internal states to retain information over time, providing opportunities for energy-efficient neuromorphic computing, especially on edge devices. Note, however, many representative works on SNNs do not fully demonstrate the usefulness of their inherent recurrence (membrane potential retaining information about the past) for sequential learning. Most of the works train SNNs to recognize static images by artificially expanded input representation in time through rate coding. We show that SNNs can be trained for practical sequential tasks by proposing modifications to a network of LIF neurons that enable internal states to learn long sequences and make their inherent recurrence resilient to the vanishing gradient problem. We then develop a training scheme to train the proposed SNNs with improved inherent recurrence dynamics. Our training scheme allows spiking neurons to produce multi-bit outputs (as opposed to binary spikes) which help mitigate the mismatch between a derivative of spiking neurons' activation function and a surrogate derivative used to overcome spiking neurons' non-differentiability. Our experimental results indicate that the proposed SNN architecture on TIMIT and LibriSpeech 100h speech recognition dataset yields accuracy comparable to that of LSTMs (within 1.10% and 0.36%, respectively), but with 2x fewer parameters than LSTMs. The sparse SNN outputs also lead to 10.13x and 11.14x savings in multiplication operations compared to GRUs, which are generally considered as a lightweight alternative to LSTMs, on TIMIT and LibriSpeech 100h datasets, respectively.
-
Lee, Chankyu, Adarsh Kumar Kosta, and Kaushik Roy. "Fusion-FlowNet: Energy-efficient optical flow estimation using sensor fusion and deep fused spiking-analog network architectures." In 2022 International Conference on Robotics and Automation (ICRA), pp. 6504-6510, IEEE, 2022.
Abstract: Standard frame-based cameras that sample light intensity frames are heavily impacted by motion blur for high-speed motion and fail to perceive scene accurately in high-dynamic range environments. Event-based cameras, on the other hand, overcome these limitations by asynchronously detecting the variation in individual pixel intensities. However, event cameras only capture pixels in motion, leading to sparse information. Hence, estimating the overall dense behavior of pixels is difficult. To address aforementioned issues associated with both sensors, we present Fusion-FlowNet, a sensor fusion framework for energy-efficient optical flow estimation. Fusion-FlowNet utilizes both frame- and event-based sensors, leveraging their complementary characteristics. Our proposed network architecture is also a fusion of Spiking Neural Networks (SNNs) and Analog Neural Networks (ANNs) where each network is designed to simultaneously process asynchronous event streams and regular frame-based images, respectively. We perform end-to-end training using unsupervised learning to avoid expensive video annotations. Our method generalizes well across distinct environments (rapid motion and challenging lighting conditions) and demonstrates state-of-the-art optical flow prediction on the Multi-Vehicle Stereo Event Camera (MVSEC) dataset. Furthermore, the usage of SNNs in our architecture offers substantial savings in terms of the number of network parameters and computational energy cost.
-
Lee, Chankyu, Adarsh Kumar Kosta, Alex Zihao Zhu, Kenneth Chaney, Kostas Daniilidis, and Kaushik Roy. "Spike-FlowNet: Event-based optical flow estimation with energy-efficient hybrid neural networks." In European Conference on Computer Vision, pp. 366-382. Cham: Springer International Publishing, 2020.
Abstract: Event-based cameras display great potential for a variety of tasks such as high-speed motion detection and navigation in low-light environments where conventional frame-based cameras suffer critically. This is attributed to their high temporal resolution, high dynamic range, and low-power consumption. However, conventional computer vision methods as well as deep Analog Neural Networks (ANNs) are not suited to work well with the asynchronous and discrete nature of event camera outputs. Spiking Neural Networks (SNNs) serve as ideal paradigms to handle event camera outputs, but deep SNNs suffer in terms of performance due to the spike vanishing phenomenon. To overcome these issues, we present Spike-FlowNet, a deep hybrid neural network architecture integrating SNNs and ANNs for efficiently estimating optical flow from sparse event camera outputs without sacrificing the performance. The network is end-to-end trained with self-supervised learning on Multi-Vehicle Stereo Event Camera (MVSEC) dataset. Spike-FlowNet outperforms its corresponding ANN-based method in terms of the optical flow prediction capability while providing significant computational efficiency.
Control and Planning
Our research in this area likewise focuses on developing efficient AI solutions for robot control and planning. We draw on physics-informed neural networks (PINNs) and neuro-symbolic AI to predict control actions or optimal navigation paths from the perception information described above.
Publications:
-
Joshi, Amogh, Sourav Sanyal, and Kaushik Roy. "Real-Time Neuromorphic Navigation: Integrating Event-Based Vision and Physics-Driven Planning on a Parrot Bebop2 Quadrotor." arXiv preprint arXiv:2407.00931 (2024).
Abstract: In autonomous aerial navigation, real-time and energy-efficient obstacle avoidance remains a significant challenge, especially in dynamic and complex indoor environments. This work presents a novel integration of neuromorphic event cameras with physics-driven planning algorithms implemented on a Parrot Bebop2 quadrotor. Neuromorphic event cameras, characterized by their high dynamic range and low latency, offer significant advantages over traditional frame-based systems, particularly in poor lighting conditions or during high-speed maneuvers. We use a DVS camera with a shallow Spiking Neural Network (SNN) for event-based object detection of a moving ring in real-time in an indoor lab. Further, we enhance drone control with physics-guided empirical knowledge inside a neural network training mechanism, to predict energy-efficient flight paths to fly through the moving ring. This integration results in a real-time, low-latency navigation system capable of dynamically responding to environmental changes while minimizing energy consumption. We detail our hardware setup, control loop, and modifications necessary for real-world applications, including the challenges of sensor integration without burdening the flight capabilities. Experimental results demonstrate the effectiveness of our approach in achieving robust, collision-free, and energy-efficient flight paths, showcasing the potential of neuromorphic vision and physics-driven planning in enhancing autonomous navigation systems.
-
Sanyal, Sourav, Rohan Kumar Manna, and Kaushik Roy. "EV-Planner: Energy-Efficient Robot Navigation via Event-Based Physics-Guided Neuromorphic Planner." IEEE Robotics and Automation Letters (2024).
Abstract: Vision-based object tracking is an essential precursor to performing autonomous aerial navigation in order to avoid obstacles. Biologically inspired neuromorphic event cameras are emerging as a powerful alternative to frame-based cameras, due to their ability to asynchronously detect varying intensities (even in poor lighting conditions), high dynamic range, and robustness to motion blur. Spiking neural networks (SNNs) have gained traction for processing events asynchronously in an energy-efficient manner. On the other hand, physics-based artificial intelligence (AI) has gained prominence recently, as they enable embedding system knowledge via physical modeling inside traditional analog neural networks (ANNs). In this letter, we present an event-based physics-guided neuromorphic planner (EV-Planner) to perform obstacle avoidance using neuromorphic event cameras and physics-based AI. We consider the task of autonomous drone navigation where the mission is to detect moving gates and fly through them while avoiding a collision. We use event cameras to perform object detection using a shallow spiking neural network in an unsupervised fashion. Utilizing the physical equations of the brushless DC motors present in the drone rotors, we train a lightweight energy-aware physics-guided neural network (PgNN) with depth inputs. This predicts the optimal flight time responsible for generating near-minimum energy paths. We spawn the drone in the Gazebo simulator and implement a sensor-fused vision-to-planning neuro-symbolic framework using Robot Operating System (ROS). Simulation results for safe collision-free flight trajectories are presented with performance analysis, ablation study and potential future research directions.
-
Sanyal, Sourav, and Kaushik Roy. "RAMP-Net: A robust adaptive MPC for quadrotors via physics-informed neural network." In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 1019-1025, IEEE, 2023.
Abstract: Model Predictive Control (MPC) is a state-of-the-art (SOTA) control technique which requires solving hard constrained optimization problems iteratively. For uncertain dynamics, analytical model based robust MPC imposes additional constraints, increasing the hardness of the problem. The problem exacerbates in performance-critical applications, when more compute is required in lesser time. Data-driven regression methods such as Neural Networks have been proposed in the past to approximate system dynamics. However, such models rely on high volumes of labeled data, in the absence of symbolic analytical priors. This incurs non-trivial training overheads. Physics-informed Neural Networks (PINNs) have gained traction for approximating non-linear system of ordinary differential equations (ODEs), with reasonable accuracy. In this work, we propose a Robust Adaptive MPC framework via PINNs (RAMP-Net), which uses a neural network trained partly from simple ODEs and partly from data. A physics loss is used to learn simple ODEs representing ideal dynamics. Having access to analytical functions inside the loss function acts as a regularizer, enforcing robust behavior for parametric uncertainties. On the other hand, a regular data loss is used for adapting to residual disturbances (non-parametric uncertainties), unaccounted during mathematical modelling. Experiments are performed in a simulated environment for trajectory tracking of a quadrotor. We report 7.8% to 43.2% and 8.04% to 61.5% reduction in tracking errors for speeds ranging from 0.5 to 1.75m/s compared to two SOTA regression based MPC methods.
-
Sanyal, Sourav, and Kaushik Roy. "Neuro-Ising: Accelerating large-scale traveling salesman problems via graph neural network guided localized Ising solvers." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, no. 12 (2022): 5408-5420.
Abstract: One of the most extensively studied combinatorial optimization problems is the Travelling Salesman Problem (TSP). Considerable research efforts in the past have resulted in exact solvers. However, the runtime of such hand-crafted solutions increases exponentially with problem size. Ising model based solvers have also gained prominence due to their abilities to find fast and approximate solutions for combinatorial optimization problems. However, such Ising based heuristics also suffer from scalability as the solution quality becomes increasingly sub-optimal with increase in problem size. In this work, we propose Neuro-Ising, a machine learning framework which uses Ising models to find clusters of near-optimal partial solutions of large scale TSPs and combines those solutions by employing a supervised data driven mechanism, which we model as a Graph Neural Network (GNN). The GNN is trained from solution instances obtained through exact solvers and hence, the proposed approach generalizes to unseen problems while avoiding the run-time complexity otherwise required, if the solution is built from scratch. Using standard computing resources, our proposed framework rapidly converges to near-optimal solutions for 15 TSPs (up to ~5k cities) from the TSPLib benchmark suite. We report ~10.66× speedup over Tabu Search for 8 problems. Furthermore, compared to two state-of-the-art clustering-based TSP solvers, Neuro-Ising achieves ~38× faster convergence along with ~8.9% better quality of solution, on average.
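To illustrate the physics-plus-data objective that RAMP-Net (above) builds on, here is a minimal PINN loss for a toy first-order system. The network, the dynamics dx/dt = -kx, and the loss weights are illustrative placeholders, not the quadrotor model from the paper.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def pinn_loss(t_phys, t_data, x_data, k=1.0, w_phys=1.0, w_data=1.0):
    """Composite PINN objective: physics residual + data fit.

    The physics term enforces the toy ODE dx/dt = -k * x at collocation
    times t_phys; the data term fits measured states x_data at times
    t_data. The ODE, weights, and network are assumptions.
    """
    t_phys = t_phys.requires_grad_(True)
    x = net(t_phys)
    dxdt = torch.autograd.grad(x.sum(), t_phys, create_graph=True)[0]
    loss_phys = ((dxdt + k * x) ** 2).mean()        # residual of the ODE
    loss_data = ((net(t_data) - x_data) ** 2).mean()  # fit to measurements
    return w_phys * loss_phys + w_data * loss_data

loss = pinn_loss(torch.linspace(0, 1, 50).unsqueeze(1),
                 torch.rand(10, 1), torch.rand(10, 1))
loss.backward()
```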
C. Video Processing Algorithms
Publications:
-
Soufleri, Efstathia, Deepak Ravikumar, and Kaushik Roy. "Advancing Compressed Video Action Recognition through Progressive Knowledge Distillation." arXiv preprint arXiv:2407.02713 (2024).
Abstract: Compressed video action recognition classifies video samples by leveraging the different modalities in compressed videos, namely motion vectors, residuals, and intra-frames. For this purpose, three neural networks are deployed, each dedicated to processing one modality. Our observations indicate that the network processing intra-frames tends to converge to a flatter minimum than the network processing residuals, which in turn converges to a flatter minimum than the motion vector network. This hierarchy in convergence motivates our strategy for knowledge transfer among modalities to achieve flatter minima, which are generally associated with better generalization. With this insight, we propose Progressive Knowledge Distillation (PKD), a technique that incrementally transfers knowledge across the modalities. This method involves attaching early exits (Internal Classifiers - ICs) to the three networks. PKD distills knowledge starting from the motion vector network, followed by the residual, and finally, the intra-frame network, sequentially improving IC accuracy. Further, we propose the Weighted Inference with Scaled Ensemble (WISE), which combines outputs from the ICs using learned weights, boosting accuracy during inference. Our experiments demonstrate the effectiveness of training the ICs with PKD compared to standard cross-entropy-based training, showing IC accuracy improvements of up to 5.87% and 11.42% on the UCF-101 and HMDB-51 datasets, respectively. Additionally, WISE improves accuracy by up to 4.28% and 9.30% on UCF-101 and HMDB-51, respectively.
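The per-stage transfer in PKD builds on the standard distillation loss; a hedged PyTorch sketch of one teacher-to-student step (our simplification, with temperature T and mixing weight alpha as assumed hyperparameters) is:

    import torch.nn.functional as F

    def distill_step_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft targets: KL between temperature-scaled softmax outputs.
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        # Hard targets: usual cross-entropy on ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

In PKD, transfer of this kind is applied sequentially, from the motion vector network to the residual network to the intra-frame network, improving the internal classifiers at each stage.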
-
Chandra, Soumyadeep, Sayeed Shafayet Chowdhury, Courtney Yong, Chandru P. Sundaram, and Kaushik Roy. "ViTALS: Vision Transformer for Action Localization in Surgical Nephrectomy." arXiv preprint arXiv:2405.02571 (2024).
Abstract: Surgical action localization is a challenging computer vision problem. While it has promising applications including automated training of surgery procedures, surgical workflow optimization, etc., appropriate model design is pivotal to accomplishing this task. Moreover, the lack of suitable medical datasets adds an additional layer of complexity. To that effect, we introduce a new complex dataset of nephrectomy surgeries called UroSlice. To perform action localization from these videos, we propose a novel model termed 'ViTALS' (Vision Transformer for Action Localization in Surgical Nephrectomy). Our model incorporates hierarchical dilated temporal convolution layers and inter-layer residual connections to capture the temporal correlations at finer as well as coarser granularities. The proposed approach achieves state-of-the-art performance on the Cholec80 and UroSlice datasets (89.8% and 66.1% accuracy, respectively), validating its effectiveness.
II. AI Hardware
Our research focuses on developing specialized systems that efficiently support the growing demands of artificial intelligence workloads. As AI algorithms become more complex and data-intensive, we recognize the challenges that traditional hardware architectures face in terms of performance, power consumption, and scalability. To address these limitations, we explore innovative hardware approaches and technologies that make AI processing faster, more efficient, and secure. Our work spans Compute-In-Memory and Compute-Near-Memory architectures, neuro-mimetic devices, architectural simulators, and AI-driven secure hardware, each contributing to the next generation of AI systems.
A. Compute-In-Memory and Compute-Near-Memory (CIM and CNM)
The increasing disparity between computation and data movement energy in modern technology nodes, often referred to as the memory wall problem, has made data movement a significant performance bottleneck. To address this challenge, we focus on integrating computation directly within or near memory arrays, through approaches like Compute-In-Memory (CIM) and Compute-Near-Memory (CNM). These methods reduce the latency and energy costs associated with transferring data between separate memory and processing units, enabling more efficient and faster AI processing.
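As a toy illustration of the idea (not any specific design below), an analog crossbar computes a matrix-vector product as column current sums, which an ADC then digitizes; the numpy sketch shows where ADC precision enters the datapath. All names here are ours.

    import numpy as np

    def crossbar_mvm(x, W, adc_bits=4):
        # Analog accumulation: each output column sums input-weighted currents.
        col_sums = x @ W
        # ADC stage: uniform quantization of the analog column sums.
        lo, hi = col_sums.min(), col_sums.max()
        levels = 2 ** adc_bits - 1
        q = np.round((col_sums - lo) / (hi - lo + 1e-12) * levels)
        return q / levels * (hi - lo) + lo

    x = np.random.rand(64)             # word-line input activations
    W = 0.1 * np.random.randn(64, 16)  # conductance matrix, one output per column
    y = crossbar_mvm(x, W)

Several of the papers below revolve around exactly this ADC stage, since it dominates the power and area of such macros.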
Publications:
-
Ghosh, Arkapravo, Hemkar Reddy Sadana, Mukut Debnath, Panthadip Maji, Shubham Negi, Sumeet Gupta, Mrigank Sharad, and Kaushik Roy. "Approximate ADCs for In-Memory Computing." arXiv preprint arXiv:2408.06390.
Abstract: In-memory computing (IMC) architectures for deep learning (DL) accelerators leverage energy-efficient and highly parallel matrix vector multiplication (MVM) operations, implemented directly in memory arrays. Such IMC designs have been explored based on CMOS as well as emerging non-volatile memory (NVM) technologies like RRAM. IMC architectures generally involve a large number of cores consisting of memory arrays, storing the trained weights of the DL model. Peripheral units like DACs and ADCs are also used for applying inputs and reading out the output values. Recently reported designs reveal that the ADCs required for reading out the MVM results consume more than 85% of the total compute power and also dominate the area, thereby eroding the benefits of the IMC scheme. Mitigation of imperfections in the ADCs, namely non-linearity and variations, incurs significant design overheads due to dedicated calibration units. In this work we present a peripheral-aware design of IMC cores to mitigate such overheads. It involves incorporating the non-idealities of ADCs in the training of the DL models, along with those of the memory units. The proposed approach applies equally well to both current-mode and charge-mode MVM operations demonstrated in recent years, and can significantly simplify the design of mixed-signal IMC units.
-
Negi, Shubham, Utkarsh Saxena, Deepika Sharma, and Kaushik Roy. "HCiM: ADC-Less Hybrid Analog-Digital Compute in Memory Accelerator for Deep Learning Workloads." arXiv preprint arXiv:2403.13577.
Abstract: Analog Compute-in-Memory (CiM) accelerators are increasingly recognized for their efficiency in accelerating Deep Neural Networks (DNN). However, their dependence on Analog-to-Digital Converters (ADCs) for accumulating partial sums from crossbars leads to substantial power and area overhead. Moreover, the high area overhead of ADCs constrains the throughput due to the limited number of ADCs that can be integrated per crossbar. An approach to mitigate this issue involves the adoption of extreme low-precision quantization (binary or ternary) for partial sums. Training based on such an approach eliminates the need for ADCs. While this strategy effectively reduces ADC costs, it introduces the challenge of managing numerous floating-point scale factors, which are trainable parameters like DNN weights. These scale factors must be multiplied with the binary or ternary outputs at the columns of the crossbar to ensure system accuracy. To that effect, we propose an algorithm-hardware co-design approach, where DNNs are first trained with quantization-aware training. Subsequently, we introduce HCiM, an ADC-Less Hybrid Analog-Digital CiM accelerator. HCiM uses analog CiM crossbars for performing Matrix-Vector Multiplication operations coupled with a digital CiM array dedicated to processing scale factors. This digital CiM array can execute both addition and subtraction operations within the memory array, thus enhancing processing speed. Additionally, it exploits the inherent sparsity in ternary quantization to achieve further energy savings. Compared to an analog CiM baseline architecture using 7 and 4-bit ADC, HCiM achieves energy reductions up to 28% and 12%, respectively.
-
Ali, Mustafa, Indranil Chakraborty, Sayeed Choudhary, Muya Chang, Dong Eun Kim, Arijit Raychowdhury, and Kaushik Roy. "A 1.4-6.7 TOPS/W adaptive-SNR sparsity-aware CIM core with load balancing support for DL workloads." In 2023 IEEE Custom Integrated Circuits Conference (CICC) (pp. 1-2), IEEE.
Abstract: The growing trend of developing domain-specific accelerators for Deep Learning (DL) applications has led to exploration of compute-in-memory (CIM) primitives based on SRAM [1]-[5]. Multiple research chips have demonstrated macro and core-level designs supporting multi-bit Matrix-Vector Multiplication (MVM) and sparsity to increase energy-efficiency and performance. However, CIM designs suffer from the following challenges, as shown in Fig. 1: (1) Difficulty in leveraging both input and weight unstructured sparsity in existing DL accelerators. Note, unstructured sparsity is more amenable during DL model training than structured sparsity. Fig. 1 (top) shows input and weight bit-level sparsity of ResNet20 running a CIFAR10 task and mapped on a 64×64 CIM macro. We observe that activations and weights of each layer experience different bit-level sparsity; also, sparsity levels vary significantly across layers. (2) Mixed-signal CIM macros suffer from noise and variation-based computation errors and signal-to-noise ratio (SNR) degradation. Moreover, the macro errors get accumulated in scaled-up CIM architectures, leading to significant model accuracy drop. (3) Sparsity-aware CIM compute units encounter different sparsity; hence they might finish their corresponding MVMs at different times, leading to load imbalance. To overcome the aforementioned challenges, this work proposes a sparsity-aware, adaptive-SNR CIM core based on sparsity-aware CIM macros with load balancing support. The proposed core achieves 1.4-6.7 TOPS/W 8b energy efficiency and is fabricated in 65 nm technology. The core contributions are: 1) Input and weight unstructured bit-level sparsity exploitation by dynamically reconfiguring the CIM macros' ADC precision. 2) Adaptive HW SNR using reconfigurable Word Line (WL) parallelism to adapt to workload SNR requirements and achieve optimal energy efficiency, which provides 2x and 1.78x performance and energy benefits, respectively, compared to an iso-accuracy baseline where only 8 RWLs are enabled to maximize CIM SNR. 3) Flexible MVM kernel mapping and compiler-level load balancing and its corresponding HW support to balance weight sparsity among Sparse Compute Units (SCUs).
-
Kim, Dong Eun, Aayush Ankit, Cheng Wang, and Kaushik Roy. "SAMBA: sparsity aware in-memory computing based machine learning accelerator." IEEE Transactions on Computers, 72(9), pp.2615-2627.
Abstract: Machine Learning (ML) inference is typically dominated by highly data-intensive Matrix Vector Multiplication (MVM) computations that may be constrained by memory bottleneck due to massive data movement between processor and memory. Although analog in-memory computing (IMC) ML accelerators have been proposed to execute MVM with high efficiency, the latency and energy of such computing systems can be dominated by the large latency and energy costs from analog-to-digital converters (ADCs). Leveraging sparsity in ML workloads, reconfigurable ADCs can save MVM energy and latency by reducing the required ADC bit precision. However, such improvement in latency can be hindered by non-uniform sparsity of the weight matrices mapped into hardware. Moreover, data movement between MVM processing cores may become another factor that delays the overall system-level performance. To address these issues, we propose SAMBA, Sparsity Aware IMC Based Machine Learning Accelerator. First, we propose load balancing during mapping of weight matrices into physical crossbars to eliminate non-uniformity in the sparsity of mapped matrices. Second, we propose optimizations in arranging and scheduling the tiled MVM hardware to minimize the overhead of data movement across multiple processing cores. Our evaluations show that the proposed load balancing technique can achieve performance improvement. The proposed optimizations can further improve both performance and energy-efficiency regardless of sparsity condition. With the combination of load balancing and data movement optimization in conjunction with reconfigurable ADCs, our proposed approach provides up to 2.38x speed-up and 1.54x energy-efficiency over state-of-the-art analog IMC based ML accelerators for ImageNet datasets on Resnet-50 architecture.
-
Ali, Mustafa, Indranil Chakraborty, Utkarsh Saxena, Amogh Agrawal, Aayush Ankit, and Kaushik Roy. "A 35.5-127.2 TOPS/W dynamic sparsity-aware reconfigurable-precision compute-in-memory SRAM macro for machine learning." IEEE Solid-State Circuits Letters, 4, pp.129-132.
Abstract: This letter presents an energy-efficient sparsity-aware reconfigurable-precision compute-in-memory (CIM) 8T-SRAM macro for machine learning (ML) applications. The proposed macro dynamically leverages workload sparsity by reconfiguring the output precision in the peripheral circuitry without degrading application accuracy. Specifically, we propose a new energy-efficient reconfigurable-precision SAR ADC design with the ability to form (n+m)-bit precision using n-bit and m-bit ADCs. Additionally, the transimpedance amplifier (TIA), required to convert the summed current into voltage before conversion, is reconfigured based on sparsity to improve sense margin at lower output precision. The proposed macro, fabricated in 65-nm technology, provides 35.5-127.2 TOPS/W as the ADC precision varies from 6 to 2 bits, respectively.
-
Agrawal, Amogh, Mustafa Ali, Minsuk Koo, Nitin Rathi, Akhilesh Jaiswal, and Kaushik Roy. "IMPULSE: A 65-nm digital compute-in-memory macro with fused weights and membrane potential for spike-based sequential learning tasks." IEEE Solid-State Circuits Letters, 4, pp.137-140.
Abstract: The inherent dynamics of the neuron membrane potential in spiking neural networks (SNNs) allows the processing of sequential learning tasks, avoiding the complexity of recurrent neural networks. The highly sparse spike-based computations in such spatiotemporal data can be leveraged for energy efficiency. However, the membrane potential incurs additional memory access bottlenecks in current SNN hardware. To that effect, we propose a 10T-SRAM compute-in-memory (CIM) macro, specifically designed for state-of-the-art SNN inference. It consists of a fused weight (WMEM) and membrane potential (VMEM) memory and inherently exploits sparsity in input spikes, leading to ~97.4% reduction in energy-delay product (EDP) at 85% sparsity (typical of SNNs considered in this work) compared to the case of no sparsity. We propose staggered data mapping and reconfigurable peripherals for handling the different bit-precision requirements of WMEM and VMEM, while supporting multiple neuron functionalities. The proposed macro was fabricated in 65-nm CMOS technology, achieving an energy efficiency of 0.99 TOPS/W at 0.85-V supply and 200-MHz frequency for signed 11-bit operations. We evaluate the SNN for sentiment classification from the IMDB dataset of movie reviews and achieve within ~1% accuracy difference and ~5× higher energy efficiency compared to a corresponding long short-term memory network.
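The membrane-potential state at the heart of these designs is simple to express; a hedged numpy sketch of one leaky integrate-and-fire timestep (generic LIF dynamics, not the IMPULSE circuit itself) is:

    import numpy as np

    def lif_step(v_mem, spikes_in, weights, leak=0.9, v_th=1.0):
        # Integrate weighted input spikes into the leaky membrane potential.
        v_mem = leak * v_mem + spikes_in @ weights
        # Fire wherever the potential crosses threshold, then soft-reset.
        spikes_out = (v_mem >= v_th).astype(float)
        v_mem = v_mem - spikes_out * v_th
        return v_mem, spikes_out

Because v_mem must be read and updated every timestep, keeping it inside the memory array, as IMPULSE does with its fused WMEM/VMEM organization, removes a recurring access bottleneck.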
B. Neuro-mimetic devices
In this area, our research is centered on developing devices that mimic the functionality of biological neurons and synapses. By emulating synaptic plasticity and neural dynamics, these devices enable us to build ultra-low-power, adaptive computing systems that can learn and process information in a way that's similar to the human brain.
Publications:
-
Yu, Eunseon, Gaurav Kumar K., Utkarsh Saxena, and Kaushik Roy. "Ferroelectric capacitors and field-effect transistors as in-memory computing elements for machine learning workloads." Scientific Reports, 14(1), p.9426.
Abstract: This study discusses the feasibility of Ferroelectric Capacitors (FeCaps) and Ferroelectric Field-Effect Transistors (FeFETs) as In-Memory Computing (IMC) elements to accelerate machine learning (ML) workloads. We conducted an exploration of device fabrication and proposed system-algorithm co-design to boost performance. A novel FeCap device, incorporating an interfacial layer (IL) and hafnium zirconium oxide (HZO), ensures a reduction in operating voltage and enhances HZO scaling while being compatible with CMOS circuits. The IL also enriches ferroelectricity and retention properties. When integrated into crossbar arrays, FeCaps and FeFETs demonstrate their effectiveness as IMC components, eliminating sneak paths and enabling selector-less operation, leading to notable improvements in energy efficiency and area utilization. However, it is worth noting that limited capacitance ratios in FeCaps introduced errors in multiply-and-accumulate (MAC) computations. The proposed co-design approach helps in mitigating these errors and achieves high accuracy in classifying the CIFAR-10 dataset, elevating it from a baseline of 10% to 81.7%. FeFETs in crossbars, with a higher on-off ratio, outperform FeCaps, and our proposed charge-based sensing scheme achieved at least an order of magnitude reduction in power consumption compared to prevalent current-based methods.
-
Wang, Cheng, Chankyu Lee, and Kaushik Roy. "Noise resilient leaky integrate-and-fire neurons based on multi-domain spintronic devices." Scientific Reports, 12(1), p.8361.
Abstract: The capability of emulating neural functionalities efficiently in hardware is crucial for building neuromorphic computing systems. While various types of neuro-mimetic devices have been investigated, it remains challenging to provide a compact device that can emulate spiking neurons. In this work, we propose a non-volatile spin-based device for efficiently emulating a leaky integrate-and-fire neuron. By incorporating an exchange-coupled composite free layer in spin-orbit torque magnetic tunnel junctions, multi-domain magnetization switching dynamics is exploited to realize gradual accumulation of membrane potential for a leaky integrate-and-fire neuron with compact footprints. The proposed device offers significantly improved scalability compared with previously proposed spin-based neuro-mimetic implementations while exhibiting high energy efficiency and good controllability. Moreover, the proposed neuron device exhibits a varying leak constant and a varying membrane resistance that are both dependent on the magnitude of the membrane potential. Interestingly, we demonstrate that such device-inspired dynamic behaviors can be incorporated to construct more robust spiking neural network models, and find improved resiliency against various types of noise injection scenarios. The proposed spintronic neuro-mimetic devices may potentially open up exciting opportunities for the development of efficient and robust neuro-inspired computational hardware.
III. Reliable AI
A. Privacy
Publications:
-
Ravikumar, Deepak, Efstathia Soufleri, and Kaushik Roy. "Curvature Clues: Decoding Deep Learning Privacy with Input Loss Curvature." arXiv preprint arXiv:2407.02747 (2024).
Abstract: This study examines the loss curvature concerning input data in deep neural networks (DNNs). The research focuses on how input loss curvature differentiates between training and testing sets and its implications for distinguishing them. A theoretical framework is developed to determine an upper bound on train-test distinguishability based on privacy and training set size. The study introduces a novel black-box membership inference attack using input loss curvature and validates the findings through experiments on computer vision tasks, demonstrating the effectiveness of this method over existing techniques. The analysis reveals the variability in membership inference attack performance with training set size, showing superior results on large datasets.
-
Ravikumar, Deepak, Efstathia Soufleri, Abolfazl Hashemi, and Kaushik Roy. "Unveiling Privacy, Memorization, and Input Curvature Links." In Forty-first International Conference on Machine Learning.
Abstract: This paper explores the relationship between deep neural networks' memorization tendencies and input loss curvature, building on Feldman's formal memorization score. The study derives a theoretical upper bound on memorization, characterized by differential privacy and input loss curvature, and demonstrates that input loss curvature is limited by the differential privacy parameter. Empirical validation on CIFAR and ImageNet datasets shows a strong correlation between theoretical predictions and practical results, providing insights into the memorization and privacy implications in DNNs.
-
Garg, Isha, Deepak Ravikumar, and Kaushik Roy. "Memorization Through the Lens of Curvature of Loss Function Around Samples." In Forty-first International Conference on Machine Learning.
Abstract: This paper proposes using the curvature of the loss function around each training sample as a measure of memorization in deep neural networks. The curvature metric effectively captures memorization statistics in popular image datasets and shows that mislabeled or conflicting samples have higher curvature. The method outperforms existing approaches in detecting mislabeled data and reveals novel failure modes in CIFAR100 and ImageNet datasets.
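Input loss curvature, central to all three papers above, can be proxied numerically; one plausible estimator (our sketch, using Rademacher probes and a finite difference of input gradients to approximate v^T H v) is:

    import torch

    def input_curvature(model, loss_fn, x, y, n_probes=4, h=1e-3):
        def grad_at(inp):
            inp = inp.detach().requires_grad_(True)
            return torch.autograd.grad(loss_fn(model(inp), y), inp)[0]
        x = x.detach()
        est = 0.0
        for _ in range(n_probes):
            v = torch.randint_like(x, 0, 2) * 2 - 1          # +/-1 probe direction
            hvp = (grad_at(x + h * v) - grad_at(x - h * v)) / (2 * h)
            est += (v * hvp).sum().item()                    # approximates v^T H v
        return est / n_probes                                # ~ trace of input Hessian

High scores flag memorized (or training-set) samples; the exact scoring and attack procedures in the papers differ from this toy version.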
B. Robustness
Publications:
-
Ravikumar, Deepak, Sangamesh Kodge, Isha Garg, and Kaushik Roy. "Intra-class mixup for out-of-distribution detection." IEEE Access 11 (2023): 25968-25981.
Abstract: This paper addresses out-of-distribution (OoD) detection in deep neural networks by proposing intra-class mixup to improve angular separability between in-distribution and OoD data. The method reduces variance during training, enhancing OoD detection performance. The proposed technique improves AUROC performance by 4.21% and 6.21% over empirical risk minimization and inter-class mixup, respectively.
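A minimal PyTorch sketch of the intra-class variant (our reading: each sample is mixed only with another sample of the same class, so labels stay hard) might look like:

    import torch

    def intra_class_mixup(x, y, alpha=0.4):
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.arange(len(y))
        for c in y.unique():
            idx = (y == c).nonzero(as_tuple=True)[0]
            perm[idx] = idx[torch.randperm(len(idx))]   # shuffle within each class
        return lam * x + (1 - lam) * x[perm], y         # hard labels preserved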
-
Ravikumar, Deepak, and Kaushik Roy. "Norm-scaling for out-of-distribution detection." arXiv preprint arXiv:2205.03493 (2022).
Abstract: This study proposes norm-scaling to normalize logits for each class separately, improving out-of-distribution (OoD) detection in deep neural networks. The method ensures consistent uncertainty representation across classes, achieving significant improvements in AUROC, AUPR, and FPR95 metrics compared to previous state-of-the-art methods.
-
Ravikumar, Deepak, Sangamesh Kodge, Isha Garg, and Kaushik Roy. "TREND: Transferability-Based Robust ENsemble Design." IEEE Transactions on Artificial Intelligence 4, no. 3 (2022): 534-548.
Abstract: TREND studies the transferability of adversarial examples across different neural network architectures and proposes a methodology for designing robust ensembles. The research reveals the impact of network architecture, optimizer, and quantization on transferability and introduces a new state-of-the-art ensemble attack, demonstrating better adversarial robustness with carefully chosen diverse networks.
-
Mukherjee, Amitangshu, Timur Ibrayev, and Kaushik Roy. "On Inherent Adversarial Robustness of Active Vision Systems." arXiv preprint arXiv:2404.00185 (2024).
Abstract: This paper advocates integrating active vision mechanisms, such as saccades and foveation, into deep learning systems to enhance adversarial robustness. Empirical results show that active vision methods, GFNet and FALcon, achieve 2-3 times greater robustness under black-box threat models compared to standard convolutional networks.
C. Distributed Learning
Publications:
-
Choudhary, Sakshi, Sai Aparna Aketi, and Kaushik Roy. "SADDLe: Sharpness-Aware Decentralized Deep Learning with Heterogeneous Data." arXiv preprint arXiv:2405.13961 (2024).
Abstract: SADDLe proposes sharpness-aware decentralized deep learning algorithms to address data heterogeneity and communication costs in decentralized training. The approach leverages Sharpness-Aware Minimization (SAM) for better generalization and robustness, demonstrating 1-20% improvement in test accuracy and resilience to communication compression.
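The sharpness-aware ingredient is the usual SAM perturb-then-update step; a hedged single-node PyTorch sketch (decentralized gossip and compression omitted, names ours) is:

    import torch

    def sam_backward(model, loss_fn, x, y, rho=0.05):
        # Step 1: ascend to a nearby high-loss point in weight space.
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        eps = [rho * g / (norm + 1e-12) for g in grads]
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.add_(e)
        # Step 2: the gradient at the perturbed point drives the real update.
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)   # restore weights; caller then runs optimizer.step()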
-
Ravikumar, Deepak, Gobinda Saha, Sai Aparna Aketi, and Kaushik Roy. "Homogenizing non-iid datasets via in-distribution knowledge distillation for decentralized learning." arXiv preprint arXiv:2304.04326 (2023).
Abstract: This paper introduces In-Distribution Knowledge Distillation (IDKD) to homogenize data distribution across decentralized nodes without sacrificing privacy. The method uses a public dataset for knowledge distillation, achieving superior generalization performance on heterogeneously distributed data with minimal communication overhead.
-
Choudhary, Sakshi, Sai Aparna Aketi, Gobinda Saha, and Kaushik Roy. "CoDeC: Communication-Efficient Decentralized Continual Learning." Transactions on Machine Learning Research.
Abstract: CoDeC addresses the challenges of continual learning and high communication costs in decentralized training. The algorithm combines orthogonal gradient projection with gossip averaging and introduces a novel lossless communication compression scheme, achieving up to 4.8x reduction in communication costs with minimal performance loss.
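Two of CoDeC's moving parts are easy to sketch in isolation (a hedged illustration, with an orthonormal basis of past-task directions assumed given):

    import torch

    def project_out(grad, basis):
        # Remove the gradient component lying in the subspace spanned by
        # earlier tasks (columns of `basis` assumed orthonormal).
        return grad - basis @ (basis.T @ grad)

    def gossip_average(w_self, w_neighbors):
        # One gossip step: move halfway toward the mean of neighbor models.
        return 0.5 * w_self + 0.5 * torch.stack(w_neighbors).mean(dim=0)

How the basis is constructed and how the compression stays lossless are the paper's contributions and are not captured here.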
-
Aketi, Sai Aparna, Abolfazl Hashemi, and Kaushik Roy. "AdaGossip: Adaptive Consensus Step-size for Decentralized Deep Learning with Communication Compression." arXiv preprint arXiv:2404.05919 (2024).
Abstract: AdaGossip proposes an adaptive technique for adjusting the consensus step-size based on compressed model differences between agents in decentralized learning. The method improves test accuracy by 0-2% compared to current state-of-the-art techniques and demonstrates effectiveness across various datasets and network topologies.
-
Aketi, Sai Aparna, Abolfazl Hashemi, and Kaushik Roy. "Global update tracking: A decentralized learning algorithm for heterogeneous data." Advances in neural information processing systems 36 (2024).
Abstract: Global Update Tracking (GUT) is a decentralized learning algorithm designed to mitigate the impact of heterogeneous data across devices without additional communication overhead. Experiments show that GUT achieves state-of-the-art performance with a 1-6% improvement in test accuracy compared to existing techniques.
-
Aketi, Sai Aparna, and Kaushik Roy. "Cross-feature Contrastive Loss for Decentralized Deep Learning on Heterogeneous Data." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 12-21, 2024.
Abstract: This paper introduces a novel approach for decentralized learning on heterogeneous data using cross-feature contrastive loss. The technique improves performance by 0.2-4% compared to existing methods, demonstrating effectiveness across various datasets, model architectures, and network topologies.
-
Aketi, Sai Aparna, Sangamesh Kodge, and Kaushik Roy. "Neighborhood gradient mean: An efficient decentralized learning method for non-iid data." Transactions on Machine Learning Research (2023).
Abstract: Neighborhood Gradient Mean (NGM) proposes a novel decentralized learning algorithm for non-IID data, utilizing self- and cross-gradient information to improve performance. The method achieves competitive or superior results with significantly less compute and memory requirements, improving performance on non-IID data by 3-20% without additional communication costs.
D. Dataset Security
Publications:
-
Garg, Isha, and Kaushik Roy. "Samples with low loss curvature improve data efficiency." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20290-20300, 2023.
Abstract: This paper studies the second-order properties of loss functions in deep neural networks, identifying samples with low curvature as data-efficient. The proposed SLo-Curves algorithm selects these samples for training, achieving up to 9% improvement in coreset selection methods on CIFAR-10 and CIFAR-100 datasets, generalizing across architectures for downstream tasks.
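Given per-sample curvature scores (for instance from an estimator like the one sketched in the Privacy subsection above), the selection rule reduces to keeping the flattest samples; a hypothetical one-liner:

    import numpy as np

    def select_coreset(curvatures, frac=0.5):
        k = int(len(curvatures) * frac)
        return np.argsort(curvatures)[:k]   # indices of the lowest-curvature samples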
IV. Co-designing Algorithm and Hardware
A. System Technology Co-design (STCO)
Publications:
-
Sharma, Tanvi, Mustafa Ali, Indranil Chakraborty, and Kaushik Roy. "WWW: What, When, Where to Compute-in-Memory." arXiv preprint arXiv:2312.15896.
Abstract: Compute-in-memory (CiM) has emerged as a compelling solution to alleviate high data movement costs in von Neumann machines. CiM can perform massively parallel general matrix multiplication (GEMM) operations in memory, the dominant computation in Machine Learning (ML) inference. However, re-purposing memory for compute poses key questions on 1) What type of CiM to use: Given a multitude of analog and digital CiMs, determining their suitability from a systems perspective is needed. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial than standard processing cores. 3) Where to integrate CiM: Each memory level has different bandwidth and capacity, that affects the data movement and locality benefits of CiM integration. In this paper, we explore answers to these questions regarding CiM integration for ML inference acceleration. We use Timeloop-Accelergy for early system-level evaluation of CiM prototypes, including both analog and digital primitives. We integrate CiM into different cache memory levels in an Nvidia A100-like baseline architecture and tailor the dataflow for various ML workloads. Our experiments show CiM architectures improve energy efficiency, achieving up to 0.12x lower energy than the established baseline with INT-8 precision, and up to 4x performance gains with weight interleaving and duplication. The proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for GEMM acceleration.
-
Roy, Arani, and Kaushik Roy. "HADES: Hardware/Algorithm Co-design in DNN accelerators using Energy-efficient Approximate Alphabet Set Multipliers." arXiv preprint arXiv:2302.01990.
Abstract: Edge computing must be capable of executing computationally intensive algorithms, such as Deep Neural Networks (DNNs), while operating within a constrained computational resource budget. Such computations involve Matrix Vector Multiplications (MVMs), which are the dominant contributor to the memory and energy budget of DNNs. To alleviate the computational intensity and storage demand of MVMs, we propose circuit-algorithm co-design techniques with low-complexity approximate Multiply-Accumulate (MAC) units derived from the principles of Alphabet Set Multipliers (ASMs). Selection of a few proper alphabets from ASMs leads to a multiplier-less DNN implementation and enables encoding of low-precision weights and input activations into fewer bits. To maintain accuracy under alphabet set approximations, we developed a novel ASM-alphabet aware training. The proposed low-complexity multiplication-aware algorithm was implemented In-Memory and Near-Memory with efficient shift operations to further improve the data-movement cost between memory and processing unit. We benchmark our design on CIFAR10 and ImageNet datasets for ResNet and MobileNet models and attain <1-2% accuracy degradation against full precision with energy benefits of >50% compared to the standard Von-Neumann counterpart.
-
He, Kang, Indranil Chakraborty, Cheng Wang, and Kaushik Roy. "Design space and memory technology co-exploration for in-memory computing based machine learning accelerators." In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design (pp. 1-9).
Abstract: In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency compared to conventional digital architectures, little research has been done from a system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on costly analog-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs could depend on the underlying memory technologies. To that effect, we co-explore the IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase change memory, resistive RAM, ferroelectrics and spintronics. First, an extended simulation framework employing spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operations with distinctive weight mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the various memory technologies being evaluated will need to adopt different IMC macro-level designs to deliver the optimal energy-delay-product (EDP) at the system level. We demonstrate that under small area budgets, the choice of the best memory technology is determined by its cell area and writing energy, while under larger area budgets, cell area becomes the dominant factor for technology selection.
-
Sharma, Deepika, Aayush Ankit, and Kaushik Roy. "Identifying efficient dataflows for spiking neural networks." In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (pp. 1-6).
Abstract: Deep feed-forward Spiking Neural Networks (SNNs) trained using appropriate learning algorithms have been shown to match the performance of state-of-the-art Artificial Neural Networks (ANNs). The inputs to an SNN layer are 1-bit spikes distributed over several timesteps. In addition, along with the standard artificial neural network (ANN) data structures, SNNs require one additional data structure, the membrane potential (Vmem) for each neuron, which is updated every timestep. Hence, the dataflow requirements for energy-efficient hardware implementation of SNNs can be different from the standard ANNs. In this paper, we propose optimal dataflows for deep spiking neural network layers. To evaluate the energy and latency of different dataflows, we considered three hardware architectures with varying on-chip resources to represent a class of spatial accelerators. We developed a set of rules leading to optimum dataflow for SNNs that achieve more than 90% improvement in Energy-Delay Product (EDP) compared to the baseline for some workloads and architectures.
-
Negi, Shubham, Indranil Chakraborty, Aayush Ankit, and Kaushik Roy. "NAX: neural architecture and memristive xbar based accelerator co-design." In Proceedings of the 59th ACM/IEEE Design Automation Conference (pp. 451-456).
Abstract: Neural Architecture Search (NAS) has provided the ability to design efficient deep neural networks (DNNs) catered towards different hardware like GPUs, CPUs etc. However, integrating NAS with Memristive Crossbar Array (MCA) based In-Memory Computing (IMC) accelerators remains an open problem. The hardware efficiency (energy, latency and area) as well as application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters such as kernel size, depth etc. and hardware architecture parameters such as crossbar size and the precision of analog-to-digital converters. Co-optimization of both network and hardware parameters presents a challenging search space comprising different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX - an efficient neural architecture search engine that co-designs the neural network and IMC based hardware architecture. NAX explores the aforementioned search space to determine kernel and corresponding crossbar sizes for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. For CIFAR-10 and Tiny ImageNet, our models achieve 0.9% and 18.57% higher accuracy at 30% and -10.47% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.
-
Kosta, Adarsh, Efstathia Soufleri, Indranil Chakraborty, Amogh Agrawal, Aayush Ankit, and Kaushik Roy. "HyperX: A hybrid RRAM-SRAM partitioned system for error recovery in memristive Xbars." In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 88-91), IEEE.
Abstract: Memristive crossbars based on Non-volatile Memory (NVM) technologies such as RRAM have recently shown great promise for accelerating Deep Neural Networks (DNNs). They achieve this by performing efficient Matrix-Vector-Multiplications (MVMs) while offering dense on-chip storage and minimal off-chip data movement. However, their analog nature of computing introduces functional errors due to non-ideal RRAM devices, significantly degrading the application accuracy. Further, RRAMs suffer from low endurance and high write costs, hindering on-chip trainability. To alleviate these limitations, we propose HyperX, a hybrid RRAM-SRAM system that leverages the complementary benefits of NVM and CMOS technologies. Our proposed system consists of a fixed RRAM block offering area- and energy-efficient MVMs and an SRAM block enabling on-chip training to recover the accuracy drop due to the RRAM non-idealities. The improvements are reported in terms of energy and the product of latency and area (ms×mm2), termed area-normalized latency. Our experiments on CIFAR datasets using ResNet-20 show up to 2.88× and 10.1× improvements in inference energy and area-normalized latency, respectively. In addition, for a transfer learning task from ImageNet to CIFAR datasets using ResNet-18, we observe up to 1.58× and 4.48× improvements in energy and area-normalized latency, respectively. These improvements are with respect to an all-SRAM baseline.
B. Device Technology Co-design (DTCO)
Publications:
-
Sharma, Tanvi, Cheng Wang, Amogh Agrawal, and Kaushik Roy. "Enabling robust SOT-MTJ crossbars for machine learning using sparsity-aware device-circuit co-design." In 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (pp. 1-6), IEEE.
Abstract: Embedded non-volatile memory (eNVM) based crossbars have emerged as energy-efficient building blocks for machine learning accelerators. However, the analog computations in crossbars introduce errors due to several non-idealities. Moreover, since communications between crossbars are usually done in the digital domain, the energy and area costs are dominated by the Analog-to-Digital Converters (ADCs). Among the eNVM technologies, Resistive Random-Access-Memory (RRAM) and Phase-Change Memory (PCM) devices suffer from poor endurance, write variability and conductance drift, whereas magneto-resistive technologies provide superior endurance, write stability and reliability. To that effect, we propose sparsity-aware device/circuit co-design of robust crossbars using Spin-Orbit-Torque Magnetic Tunnel Junctions (SOT-MTJs). Note, standard MTJs have low ROFF/RON and low RON, making them unsuitable for crossbars. In this work, we first demonstrate SOT-MTJs as crossbar elements with high RON and high ROFF/RON by allowing the read path to have a thicker tunneling barrier, leaving the write path undisturbed. Second, through extensive simulations, we quantitatively assess the impact of various device-circuit parameters such as RON, the ROFF/RON ratio, and crossbar size, along with input and weight sparsity, on both circuit- and application-level accuracy and energy consumption. We evaluate system accuracy for ResNet-20 inference on the CIFAR-10 dataset and show that leveraging sparsity allows reduced ADC precision without degrading accuracy. Our results show that an SOT-MTJ (RON = 200 kΩ and ROFF/RON = 7) crossbar array of size 32×32 could achieve near-software accuracy. The 64×64 and 128×128 crossbars show an accuracy degradation of 2% and 9.8%, respectively, from the software accuracy and an energy improvement of up to 3.8× and 6.3× compared to a 32×32 array with a 4-bit ADC.
C. Algorithmic Optimizations for Hardware Efficiency
Publications:
-
Saxena, Utkarsh, Gobinda Saha, Sayeed Choudhary, and Kaushik Roy. "Eigen Attention: Attention in Low-Rank Space for KV Cache Compression." arXiv preprint arXiv:2408.05646.
Abstract: Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance.
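One plausible reading of the mechanics (our sketch; the paper's calibration and per-layer details differ) is to cache keys and values in a principal subspace of calibration features. The names U, Q, K, V, and calib_feats are ours.

    import torch

    def eigen_basis(calib_feats, rank):
        # Top principal directions of calibration keys/values: shape (d, rank).
        _, _, Vh = torch.linalg.svd(calib_feats, full_matrices=False)
        return Vh[:rank].T

    def attention_low_rank(Q, K, V, U):
        # Cache K @ U and V @ U, shrinking per-token state from d to rank dims.
        Qr, Kr, Vr = Q @ U, K @ U, V @ U
        scores = (Qr @ Kr.T) / (K.shape[-1] ** 0.5)
        return (torch.softmax(scores, dim=-1) @ Vr) @ U.T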
-
Ibrayev, Timur, Isha Garg, Indranil Chakraborty, and Kaushik Roy. "Pruning for Improved ADC Efficiency in Crossbar-based Analog In-memory Accelerators." arXiv preprint arXiv:2403.13082.
Abstract: Deep learning has proved successful in many applications but suffers from high computational demands and requires custom accelerators for deployment. Crossbar-based analog in-memory architectures are attractive for acceleration of deep neural networks (DNN), due to their high data reuse and high efficiency enabled by combining storage and computation in memory. However, they require analog-to-digital converters (ADCs) to communicate crossbar outputs. ADCs consume a significant portion of energy and area of every crossbar processing unit, thus diminishing the potential efficiency benefits. Pruning is a well-studied technique to improve the efficiency of DNNs but requires modifications to be effective for crossbars. In this paper, we motivate crossbar-attuned pruning to target ADC-specific inefficiencies. This is achieved by identifying three key properties (dubbed D.U.B.) that induce sparsity that can be utilized to reduce ADC energy without sacrificing accuracy. The first property ensures that sparsity translates effectively to hardware efficiency by restricting sparsity levels to Discrete powers of 2. The other 2 properties encourage columns in the same crossbar to achieve both Unstructured and Balanced sparsity in order to amortize the accuracy drop. The desired D.U.B. sparsity is then achieved by regularizing the variance of L0 norms of neighboring columns within the same crossbar. Our proposed implementation allows it to be directly used in end-to-end gradient-based training. We apply the proposed algorithm to convolutional layers of VGG11 and ResNet18 models, trained on CIFAR-10 and ImageNet datasets, and achieve up to 7.13x and 1.27x improvement, respectively, in ADC energy with less than 1% drop in accuracy.
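The balancing idea can be approximated with a differentiable surrogate; the sketch below is one plausible relaxation of the variance-of-L0 regularizer, not the paper's exact formulation, and all names and constants are our assumptions.

    import torch

    def column_balance_penalty(W, xbar=64, tau=0.05):
        pen = W.new_zeros(())
        for s in range(0, W.shape[1], xbar):
            cols = W[:, s:s + xbar]
            # Soft count of weights above a small magnitude threshold.
            soft_l0 = torch.sigmoid((cols.abs() - tau) / (0.1 * tau)).sum(dim=0)
            pen = pen + soft_l0.var()   # variance across columns of one tile
        return pen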
-
Negi, Shubham, Deepika Sharma, Adarsh Kosta, and Kaushik Roy. "Best of Both Worlds: Hybrid SNN-ANN Architecture for Event-based Optical Flow Estimation." arXiv preprint arXiv:2306.02960.
Abstract: In the field of robotics, event-based cameras are emerging as a promising low-power alternative to traditional frame-based cameras for capturing high-speed motion and high dynamic range scenes. This is due to their sparse and asynchronous event outputs. Spiking Neural Networks (SNNs) with their asynchronous event-driven compute, show great potential for extracting the spatio-temporal features from these event streams. In contrast, the standard Analog Neural Networks (ANNs) fail to process event data effectively. However, training SNNs is difficult due to additional trainable parameters (thresholds and leaks), vanishing spikes at deeper layers, and a non-differentiable binary activation function. Furthermore, an additional data structure, membrane potential, responsible for keeping track of temporal information, must be fetched and updated at every timestep in SNNs. To overcome these challenges, we propose a novel SNN-ANN hybrid architecture that combines the strengths of both. Specifically, we leverage the asynchronous compute capabilities of SNN layers to effectively extract the input temporal information. Concurrently, the ANN layers facilitate training and efficient hardware deployment on traditional machine learning hardware such as GPUs. We provide extensive experimental analysis for assigning each layer to be spiking or analog, leading to a network configuration optimized for performance and ease of training. We evaluate our hybrid architecture for optical flow estimation on DSEC-flow and Multi-Vehicle Stereo Event-Camera (MVSEC) datasets. On the DSEC-flow dataset, the hybrid SNN-ANN architecture achieves a 40% reduction in average endpoint error (AEE) with 22% lower energy consumption compared to Full-SNN, and 48% lower AEE compared to Full-ANN, while maintaining comparable energy usage.
-
Saxena, Utkarsh and Kaushik Roy. "McQueen: Mixed Precision Quantization of Early Exit Networks." In BMVC (pp. 511-513).
Abstract: Mixed precision quantization offers a promising way of obtaining the optimal tradeoff between model complexity and accuracy. However, most quantization techniques do not support input adaptive execution of neural networks resulting in a fixed computational cost for all the instances in a dataset. On the other hand, early exit networks augment traditional architectures with multiple exit classifiers and spend varied computational effort depending on dataset instance complexity, reducing the computational cost. In this work, we propose McQueen, a mixed precision quantization technique for early exit networks. Specifically, we develop a Parametric Differentiable Quantizer (PDQ) which learns the quantizer precision, threshold, and scaling factor during training. Further, we propose a gradient masking technique that facilitates the joint optimization of exit and final classifiers to learn PDQ and network parameters. Extensive experiments on a variety of datasets demonstrate that our method can achieve significant reduction in BitOperations (BOPs) while maintaining the top-1 accuracy of the original floating-point model. Specifically, McQueen is able to reduce BOPs by 109x compared to floating point baseline without accuracy degradation on ResNet-18 trained on ImageNet.
-
Saxena, Utkarsh, Indranil Chakraborty, and Kaushik Roy. "Towards ADC-less compute-in-memory accelerators for energy efficient deep learning." In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 624-627), IEEE.
Abstract: Compute-in-Memory (CiM) hardware has shown great potential in accelerating Deep Neural Networks (DNNs). However, most CiM accelerators for matrix vector multiplication rely on costly analog to digital converters (ADCs) which becomes a bottleneck in achieving high energy efficiency. In this work, we propose a hardware-software co-design approach to reduce the aforementioned ADC costs through partial-sum quantization. Specifically, we replace ADCs with 1-bit sense amplifiers and develop a quantization aware training methodology to compensate for the loss in representation ability. We show that the proposed ADC-less DNN model achieves 1.1x-9.6x reduction in energy consumption while maintaining accuracy within 1% of the DNN model without partial-sum quantization.
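The core datapath change is easy to picture: each crossbar slice reports only the sign of its analog column sum. A toy numpy sketch (ignoring bit-serial input encoding and the quantization-aware retraining) is:

    import numpy as np

    def adc_less_mvm(x, W, xbar_rows=64):
        # Split the dot product across crossbar-sized row slices; each slice's
        # partial sum is read out by a 1-bit sense amplifier (its sign only),
        # and training is expected to absorb the resulting error.
        out = np.zeros(W.shape[1])
        for s in range(0, W.shape[0], xbar_rows):
            part = x[s:s + xbar_rows] @ W[s:s + xbar_rows]
            out += np.sign(part)
        return out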
-
Soufleri, Efstathia and Kaushik Roy. "Network compression via mixed precision quantization using a multi-layer perceptron for the bit-width allocation." IEEE Access, 9, pp.135059-135068.
Abstract: Deep Neural Networks (DNNs) are a powerful tool for solving complex tasks in many application domains. The high performance of DNNs demands significant computational resources, which might not always be available. Network quantization with mixed precision across the layers can alleviate this high demand. However, determining layer-wise optimal bit-widths is non-trivial, as the search space is exponential. This article proposes a novel technique for allocating layer-wise bit-widths for a DNN using a multi-layer perceptron (MLP). The Kullback-Leibler (KL) divergence of the softmax outputs between the quantized and full-precision network is used as the metric to quantify the quantization quality. We explore the relationship between the KL-divergence and the network size, and from our experiments observe that more aggressive quantization leads to higher divergence, and vice versa. The MLP is trained with layer-wise bit-widths as labels and their corresponding KL-divergence as the input. The MLP training set, i.e., the pairs of layer-wise bit-widths and their corresponding KL-divergence, is collected using Monte Carlo sampling of the exponential search space. We introduce a penalty term in the loss to ensure that the MLP learns to predict bit-widths resulting in the smallest network size. We show that the layer-wise bit-width predictions from the trained MLP result in reduced network size without degrading accuracy, achieving results better than or comparable to SOTA work with less computational overhead. Our method achieves up to 6x, 4x, and 4x compression on VGG16, ResNet50, and GoogLeNet, respectively, with no accuracy drop compared to the original full-precision pretrained model on the ImageNet dataset.
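The quality metric driving the MLP is a standard divergence; a hedged PyTorch sketch of how one (bit-width vector, KL) training pair could be scored is:

    import torch.nn.functional as F

    def quantization_kl(fp_logits, quant_logits):
        # KL divergence between the softmax outputs of the full-precision and
        # quantized networks; lower means the quantized model tracks the
        # original more closely.
        return F.kl_div(F.log_softmax(quant_logits, dim=1),
                        F.softmax(fp_logits, dim=1),
                        reduction="batchmean")

Monte Carlo samples of layer-wise bit-widths, each scored this way, would form the MLP's training set, matching the procedure the abstract describes.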
-
Chowdhury, Sayeed, Isha Garg, and Kaushik Roy. "Spatio-temporal pruning and quantization for low-latency spiking neural networks." In 2021 International Joint Conference on Neural Networks (IJCNN) (pp. 1-9), IEEE.
Abstract: Spiking Neural Networks (SNNs) are a promising alternative to traditional deep learning methods since they perform event-driven information processing. However, a major drawback of SNNs is high inference latency. The efficiency of SNNs could be enhanced using compression methods such as pruning and quantization. Notably, SNNs, unlike their non-spiking counterparts, consist of a temporal dimension, the compression of which can lead to latency reduction. In this paper, we propose spatial and temporal pruning of SNNs. First, structured spatial pruning is performed by determining the layer-wise significant dimensions using principal component analysis of the average accumulated membrane potential of the neurons. This step leads to 10-14X model compression. Additionally, it enables inference with lower latency and decreases the spike count per inference. To further reduce latency, temporal pruning is performed by gradually reducing the timesteps while training. The networks are trained using surrogate gradient descent based backpropagation and we validate the results on CIFAR10 and CIFAR100, using VGG architectures. The spatiotemporally pruned SNNs achieve 89.04% and 66.4% accuracy on CIFAR10 and CIFAR100, respectively, while performing inference with 3-30X reduced latency compared to state-of-the-art SNNs. Moreover, they require 8-14X less compute energy compared to their unpruned standard deep learning counterparts. The energy numbers are obtained by multiplying the number of operations with the energy per operation. These SNNs also provide 1-4% higher robustness against Gaussian noise corrupted inputs. Furthermore, we perform weight quantization and find that performance remains reasonably stable up to 5-bit quantization.
V. Generative AI
A. Large Language Models
Publications:
-
He, Kang, Yinghan Long, and Kaushik Roy. "Prompt-Based Bias Calibration for Better Zero/Few-Shot Learning of Language Models." arXiv preprint arXiv:2402.10353 (2024).
Abstract: Prompt learning is susceptible to intrinsic bias present in pre-trained language models (LMs), resulting in sub-optimal performance of prompt-based zero/few-shot learning. In this work, we propose a null-input prompting method to calibrate intrinsic bias encoded in pre-trained LMs. Different from prior efforts that address intrinsic bias primarily for social fairness and often involve excessive computational cost, our objective is to explore enhancing LMs' performance in downstream zero/few-shot learning while emphasizing the efficiency of intrinsic bias calibration. Specifically, we leverage a diverse set of auto-selected null-meaning inputs generated from GPT-4 to prompt pre-trained LMs for intrinsic bias probing. Utilizing the bias-reflected probability distribution, we formulate a distribution disparity loss for bias calibration, where we exclusively update bias parameters (0.1% of total parameters) of LMs towards an equal probability distribution. Experimental results show that the calibration promotes an equitable starting point for LMs while preserving language modeling abilities. Across a wide range of datasets, including sentiment analysis and topic classification, our method significantly improves the zero/few-shot learning performance of LMs for both in-context learning and prompt-based fine-tuning (on average 9% and 2%, respectively).