Institute of Chips and AI
Friday, October 17, 2025
(Registration Required)
Purdue Material Sciences and Electrical Engineering Building
501 Northwestern Avenue
West Lafayette, IN 47907
Agenda
Click this link to view the agenda.
Poster List
1: 3D-CIMlet: A Chiplet Co-Design Framework for Heterogeneous In-Memory Acceleration of Edge LLM Inference and Continual Learning
S. Du, L. Zheng, A. Parvathy, F. Xie, T. Wei, A. Raghunathan, H. Li
The design space for edge AI hardware supporting large language model (LLM) inference and continual learning is underexplored. We present 3D-CIMlet, a thermal-aware modeling and co-design framework for 2.5D/3D edge-LLM engines exploiting heterogeneous computing-in-memory (CIM) chiplets, adaptable to both inference and continual learning. We develop memory-reliability-aware chiplet mapping strategies for a case study of an edge LLM system integrating RRAM, capacitor-less eDRAM, and hybrid chiplets in mixed technology nodes. Compared to 2D baselines, 2.5D/3D designs improve energy efficiency by up to 9.3x and 12x, with up to 90.2% and 92.5% energy-delay product (EDP) reduction respectively, on edge LLM continual learning.
2: ASLink: Modeling Multi-GPU Execution in Accel-Sim
C. Bose, C. Avalos, J. Pan, Y. Liu, M. Khairy, C. Hughes, T. Rogers
Graphics processing units (GPUs) are widely used in numerous modern application domains, including modeling and simulation, machine learning, and data analytics. Many applications, such as recommendation models and graph neural networks, benefit from the use of multiple GPUs to scale up the size of the workload and increase throughput. While current open-source GPU architectural simulators can model multi-GPU workloads, doing so remains inefficient and challenging, limiting their broader applicability across application domains. Industrial simulation tools, on the other hand, are often closed-source, hindering efforts to democratize architectural research. This paper proposes an open-source simulator design, ASLink, that extends Accel-Sim to support multi-GPU configurations. We highlight the limitations of popular state-of-the-art GPU architecture simulators and propose mechanisms to improve user experience and modeling fidelity in multi-GPU systems. Finally, we validate our proposed infrastructure against kernels representative of real-world workloads.
3: ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation via Scene-Aware Control Barrier Functions
S. Sanyal, K. Roy (Presented by A. Joshi)
In the rapidly evolving field of vision-language navigation (VLN), ensuring safety for physical agents remains an open challenge. For a human-in-the-loop, language-operated drone to navigate safely, it must understand natural language commands, perceive the environment, and simultaneously avoid hazards in real time. Control Barrier Functions (CBFs) are formal methods that enforce safe operating conditions. Model Predictive Control (MPC) is an optimization framework that plans a sequence of future actions over a prediction horizon, ensuring smooth trajectory tracking while obeying constraints. In this work, we consider a VLN-operated drone platform and enhance its safety by formulating a novel scene-aware CBF that leverages ego-centric observations from an RGB-D camera, which provides both color (Red-Green-Blue) and Depth channels. A CBF-less baseline system uses a Vision-Language Encoder with cross-modal attention to convert commands into an ordered sequence of landmarks. An object detection model identifies and verifies these landmarks in the captured images to generate a planned path. To further enhance safety, an Adaptive Safety Margin Algorithm (ASMA) is proposed. ASMA tracks moving objects and performs scene-aware CBF evaluation on the fly, which serves as an additional constraint within the MPC framework. By continuously identifying potentially risky observations, the system predicts unsafe conditions in real time and proactively adjusts its control actions to maintain safe navigation throughout the trajectory. Deployed on a Parrot Bebop2 quadrotor in the Gazebo environment using the Robot Operating System (ROS), ASMA achieves a 64%-67% increase in success rate with only a slight increase (1.4%-5.8%) in trajectory length compared to the CBF-less VLN baseline.
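For context, a scene-aware CBF of the kind described above is typically enforced as an extra constraint inside the MPC optimization. The sketch below is the generic discrete-time CBF-in-MPC formulation, shown only as background; the symbols (stage cost, barrier h, decay rate gamma, horizon N) are ours and are not claimed to match ASMA's exact formulation.

\begin{align*}
\min_{u_0,\dots,u_{N-1}} \;& \sum_{k=0}^{N-1} \ell(x_k, u_k) \\
\text{s.t.}\;& x_{k+1} = f(x_k, u_k), \\
& h(x_{k+1}) \ge (1-\gamma)\, h(x_k), \qquad 0 < \gamma \le 1,
\end{align*}

where h(x) >= 0 encodes the safe set (here, a scene-aware clearance margin around obstacles tracked in the RGB-D stream). The constraint allows h to shrink at most geometrically along the planned horizon, so the optimizer trades tracking performance for safety whenever an obstacle closes in.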
4: ECO: Low Power Context-Aware Multimodal AI on NPUs
A. Das, Y. Agarwal, S. Ghosh, A. Raha, V. Raghunathan (Purdue University, Intel Corporation)
ECO is the first system to enable efficient multimodal AI deployment on commercial Neural Processing Units (NPUs) through real-time, context-aware optimization. It introduces runtime-tunable, NPU-architecture-aware knobs (approximate interpolation, quantization, and model scaling) that adapt dynamically to system conditions such as energy availability and sensor reliability. Deployed on an Intel Core Ultra Series 2 NPU with RGB and LiDAR inputs for semantic segmentation, ECO achieves up to 4.9× performance and 11.3× energy-efficiency improvements over CPU execution, while preserving higher segmentation quality under constrained or degraded conditions. By combining sensor- and compute-level adaptivity in a lightweight control layer, ECO demonstrates robust, energy-aware multimodal inference on power-limited edge platforms, advancing the practical deployment of context-sensitive AI at the edge.
5: Efficient SoC Power Estimation with Machine Learning
S. Pandit, S. Dey, A. Raghunathan
We propose ML-Power, the first machine learning (ML) based framework that (i) accelerates end-to-end RTL power estimation by addressing both its key bottlenecks—RTL simulation and power-model evaluation—and (ii) extends ML-based power models to System-on-Chips (SoCs) with configurable IP blocks. ML-Power builds models that predict power vs. time traces of each SoC block using a very small subset of internal signals called power proxies. The framework is composed of three components: PACE, SCOPE, and RECAL. PACE trains a sequence-to-sequence ML model to translate a transaction-level execution trace into a cycle-level trace of the power proxy signals, allowing much faster simulation models to be used in place of RTL simulation. SCOPE selects power proxies and trains ML-based power models for each block within the SoC. RECAL enables ML-Power to handle SoCs with configurable IP blocks by using active learning to select a small subset of representative configurations, which are then used to train a unified power model that generalizes across the entire design space. We evaluate ML-Power on an ARM SSE-300 SoC and two RISC-V-based SoCs. Compared to prior state-of-the-art ML-based power estimation frameworks, SCOPE trains power models ~65× faster, picks 40% fewer proxies, and achieves 2% lower estimation error. For the two RISC-V SoCs with configurable IP blocks, ML-Power achieves less than 10% error in per-cycle power across the entire design space using only 10-14 training configurations. ML-Power also achieves ~300× speedup over a commercial RTL power estimation tool with less than 7% error in per-cycle power estimates.
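As a rough, hedged illustration of the proxy-based modeling idea only (this is not the SCOPE implementation; the signal counts, the lasso-based selector, and all variable names below are our own stand-ins), sparse regression over per-cycle toggle activity can simultaneously pick a small set of proxy signals and fit a per-cycle power model:

import numpy as np
from sklearn.linear_model import Lasso

# toggles: (cycles, signals) 0/1 per-cycle toggle activity from simulation
# power:   (cycles,) per-cycle power labels from a reference power tool
rng = np.random.default_rng(0)
toggles = rng.integers(0, 2, size=(5000, 300)).astype(float)
power = toggles[:, :12] @ rng.uniform(0.5, 2.0, 12) + 0.1 * rng.normal(size=5000)

# Sparse regression: nonzero coefficients identify the "power proxy" signals
model = Lasso(alpha=0.05).fit(toggles, power)
proxies = np.flatnonzero(model.coef_)
print(len(proxies), "proxy signals selected")

# At estimation time, only the proxy signals need to be traced or predicted
est = toggles[:, proxies] @ model.coef_[proxies] + model.intercept_
print("relative error:", np.mean(np.abs(est - power)) / np.mean(power))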
6: HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
S. Negi, K. Roy
The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, making accelerator design particularly challenging. Prior works have primarily optimized for high-batch inference to maximize throughput, leaving the low-batch regime—critical for real-world, interactive applications—largely unexplored.
In this work, we propose HALO, a heterogeneous, memory-centric accelerator designed to address the unique challenges of low-batch LLM inference. We first characterize the performance trade-offs of two extremes: a fully compute-in-DRAM (CiD) accelerator and a fully on-chip compute-in-memory (CiM) accelerator. Based on these insights, we introduce a heterogeneous CiD/CiM architecture that flexibly leverages the strengths of both paradigms. To further improve hardware efficiency, we develop a mapping strategy that adapts to the distinct demands of the prefill and decode phases. Experimental results demonstrate that LLMs mapped to HALO achieve up to 18× speedup over AttAcc and 2.5× speedup over CENT, highlighting the effectiveness of heterogeneous CiD/CiM co-design in enabling efficient low-batch LLM inference.
7: Implications of Local Learning Rules for On-Device Learning Hardware
M. Apolinario, K. Roy
Modern deep neural networks rely on global gradient-based training such as backpropagation, which demands extensive memory and energy resources, rendering on-device adaptation impractical for edge hardware. Local learning rules offer a biologically inspired alternative by enabling each layer or synapse to update weights using locally available signals, eliminating global gradient propagation. This poster explores the algorithmic and hardware implications of such rules, including LLS (Local Learning via Synchronization), S-TLLR (STDP-inspired Temporal Local Learning Rule), and TESS (Temporally and Spatially Local Learning Rule for SNNs). These approaches reduce computation and memory by orders of magnitude compared to backpropagation, while achieving competitive accuracy on vision and event-based benchmarks. Through hardware-software co-design, we demonstrate that locally learnable models can leverage simple, on-chip update primitives, paving the way for energy-efficient, real-time, and adaptive AI in future neuromorphic and embedded systems.
8: Large Language Model-guided Multi-modal Motion Planning via Mixed Integer Program
X. Sun, K. Cheng, Z. Pan, A. Bera
Multi-Modal Motion Planning is a challenging form of motion planning in which the planner searches through the continuous space of motions as well as the discrete space of modes. For instance, a biped robot may need to walk to a target location and then use its arms to grasp an object; the planner must capture both the mode transitions and the continuous dynamics to find feasible paths that neither purely discrete nor purely continuous planners can handle. Traditional global search methods are computationally expensive, while Mixed-Integer Programming (MIP) offers a more efficient alternative through branch-and-bound pruning. However, MIP typically assumes disjoint convex domains, limiting its use for general non-convex motion planning. To overcome this, we propose using Large Language Models (LLMs) to automatically translate non-convex optimization problems into approximate MIP formulations.
9: Operator Learning Using Weak Supervision from Walk-on-Spheres
H. Viswanath, A. Bera
Training neural PDE solvers is often bottlenecked by expensive data generation or by unstable physics-informed neural network (PINN) training, whose optimization landscapes are made challenging by higher-order derivatives. To tackle this issue, we propose an alternative approach that uses weak supervision from stochastic processes to produce training data of varying quality. For Poisson PDEs, the Walk-on-Spheres (WoS) Monte Carlo algorithm can generate pointwise solution estimates using efficient random walks. We introduce a learning scheme called the Walk-on-Spheres Neural Operator (WoS-NO) that uses weak supervision from WoS to train any given neural operator. The central principle of our method is to amortize the cost of Monte Carlo walks across the distribution of PDE instances. Our method leverages stochastic representations via the WoS algorithm to generate cheap, noisy, yet unbiased estimates of the PDE solution during training. This is formulated into a data-free, physics-informed objective in which a neural operator is trained to regress against these weak supervision targets. Leveraging the unbiased nature of these estimates, the operator learns a generalized solution map for an entire family of PDEs. This strategy results in a mesh-free framework that operates without expensive pre-computed datasets, avoids memory-intensive and unstable higher-order derivatives in the loss function, and demonstrates zero-shot generalization to novel PDE parameters and domains. Experiments show that for the same number of training steps, our method exhibits up to 8.75× improvement in L2 error compared to standard physics-informed training schemes, up to 6.31× improvement in training speed, and reductions of up to 2.97× in GPU memory consumption.
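For readers unfamiliar with Walk-on-Spheres, the sketch below is the classical estimator for the Laplace case (zero source term); the Poisson case adds a Green's-function-weighted source sample at each step. It is meant only to illustrate the kind of cheap, unbiased supervision signal described above, not the WoS-NO training code, and the helper names are ours.

import numpy as np

def walk_on_spheres(x0, dist_to_boundary, boundary_value, eps=1e-4, rng=None):
    # One unbiased estimate of the harmonic function u with u = g on the boundary.
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    while True:
        d = dist_to_boundary(x)            # radius of the largest empty sphere around x
        if d < eps:                        # close enough to the boundary: read off g
            return boundary_value(x)
        w = rng.normal(size=x.shape)       # uniform direction on the sphere
        x = x + d * w / np.linalg.norm(w)  # jump to a uniform point on that sphere

# Example: unit disk, g(x, y) = x^2 - y^2 is harmonic, so u equals g inside.
dist = lambda p: 1.0 - np.linalg.norm(p)
g = lambda p: p[0]**2 - p[1]**2
estimates = [walk_on_spheres([0.3, 0.2], dist, g) for _ in range(2000)]
print(np.mean(estimates))   # noisy, unbiased estimate of u(0.3, 0.2) = 0.05

Averaging many such walks recovers the solution pointwise; WoS-NO instead regresses the neural operator against individual noisy walk estimates, amortizing the cost of the walks across the whole PDE family.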
10: ResQ: Mixed Precision Quantization of Large Language Models with Low Rank Residuals
U. Saxena, K. Roy
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3× speedup over a 16-bit baseline.
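A minimal NumPy sketch of the core idea (keep the top-variance PCA subspace of the activations in higher precision and quantize the rest more aggressively); this is our own illustration under simplified assumptions, it omits ResQ's in-subspace random rotations, and it is not the released implementation:

import numpy as np

def quantize_sym(x, bits):
    # simple symmetric uniform quantization to the given bit-width
    qmax = 2**(bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_project(acts, keep_frac=1/8, hi_bits=8, lo_bits=4):
    # acts: (tokens, hidden) calibration activations
    cov = acts.T @ acts / acts.shape[0]
    _, vecs = np.linalg.eigh(cov)          # eigenvectors in ascending eigenvalue order
    k = int(acts.shape[1] * keep_frac)
    U_hi = vecs[:, -k:]                    # high-variance directions -> 8-bit
    U_lo = vecs[:, :-k]                    # remaining directions     -> 4-bit
    hi = quantize_sym(acts @ U_hi, hi_bits)
    lo = quantize_sym(acts @ U_lo, lo_bits)
    return hi @ U_hi.T + lo @ U_lo.T       # dequantized reconstruction

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 256))
acts[:, :4] *= 50.0                        # emulate a few outlier channels
err = np.mean((mixed_precision_project(acts) - acts)**2)
print("reconstruction MSE:", err)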
11: RTG: Reverse Trajectory Generation for Reinforcement Learning Under Sparse Reward
K. Cheng, J. Shen, X. Sun, X. Gao, K. Wu, Z. Pan, A. Bera
Deep Reinforcement Learning (DRL) under sparse reward conditions remains a long-standing challenge in robotic learning. In such settings, extensive exploration is often required before meaningful reward signals can guide the propagation of state-value functions. Prior approaches typically rely on offline demonstration data or carefully crafted curriculum learning strategies to improve exploration efficiency. In contrast, we propose a novel method tailored to rigid body manipulation tasks that addresses sparse reward without the need for guidance data or curriculum design. Leveraging recent advances in differentiable rigid body dynamics and trajectory optimization, we introduce the Reverse Rigid-Body Simulator (RRBS), a system capable of generating simulation trajectories that terminate at a user-specified goal configuration. This reverse simulation is formulated as a trajectory optimization problem constrained by differentiable physical dynamics. RRBS enables the generation of physically plausible trajectories with known goal states, providing informative guidance for conventional RL in sparse reward environments. Leveraging this, we present Reverse Trajectory Generation (RTG), a method that integrates RRBS with a beam search algorithm to produce reverse trajectories, which augment the replay buffer of off-policy RL algorithms like DDQN to solve the sparse reward problem. We evaluate RTG across various rigid body manipulation tasks, including sorting, gathering, and articulated object manipulation. Experiments show that RTG significantly outperforms vanilla DRL and improved sampling strategies like Hindsight Experience Replay (HER) and Reverse Curriculum Generation (RCG).
12: Scalable Multi-Robot Informative Path Planning for Target Mapping via Deep Reinforcement Learning
A. Vashisth, M. Kulshrestha, D. Conover, A. Bera
Autonomous robots are widely utilized for mapping and exploration tasks due to their cost-effectiveness. Multi-robot systems offer scalability and efficiency, since additional robots can be deployed to cover larger and more complex environments. These tasks belong to the class of Multi-Robot Informative Path Planning (MRIPP) problems. In this paper, we propose a deep reinforcement learning approach for the MRIPP problem. We aim to maximize the number of discovered stationary targets in an unknown 3D environment while operating under resource constraints (such as path length). Each robot aims to maximize the number of discovered targets, avoid unknown static obstacles, and prevent inter-robot collisions while operating under communication and resource constraints. We utilize the centralized-training, decentralized-execution paradigm to train a single policy neural network. A key aspect of our approach is a coordination graph that prioritizes visiting regions not yet explored by other robots. Our learned policy can be copied onto any number of robots for deployment in more complex environments not seen during training. Our approach outperforms state-of-the-art approaches by at least 26.2% in terms of the number of discovered targets while requiring a planning time of less than 2 seconds per step. We present results for more complex environments with up to 64 robots and compare success rates against baseline planners.
13: ARLIN-VLM: Action Resampling for Lifelong Interactive Navigation via Vision Language Models
A. Vashisth, M. Kulshrestha, P. Bakshi, D. Conover, G. Sartoretti, A. Bera
Visual navigation has achieved remarkable progress in recent years. A common assumption in this field is the existence of at least one obstacle-free path from the start position to the goal position, which must be discovered and planned by the robot. However, this assumption may not hold in many real-world scenarios, such as home environments, where clutter may block all potential paths to the goal. Targeting such cases, this work considers the problem of Lifelong Interactive Navigation, where a mobile robot with manipulation abilities can move clutter to forge its own path to a series of goals. We consider the challenge of choosing which obstructing object to move, and where to relocate it so that it does not block the optimal paths to future goals. We introduce ARLIN-VLM, a novel zero-shot framework that utilizes the common-sense ability of large-scale, pre-trained Vision Language Models (VLMs) to optimize long-term objectives via action (re-)sampling.
14: Scalable Stochastic Computing for Efficient Machine Learning
H. Cho, T. Chen, Y. Kim
Counter-Based Stochastic Computing (CBSC) achieves significant area reduction for Stochastic Number Generators (SNGs) compared to conventional stochastic computing. However, a key limitation of CBSC multiplication is its input-dependent nature, which causes different multiplications to finish at varying times, making it difficult to scale data parallelism and leading to synchronization overhead in large-scale computations such as deep neural networks. Although several methods have been explored to enhance parallelism, this dependency still introduces a performance bottleneck. This work proposes and investigates a weight scheduling technique that minimizes this bottleneck and reduces overall computation latency, enabling scalable CBSC architectures for machine learning applications.
15: TADA
T. Sharma, K. Roy
TADA introduces a technology-aware design space exploration framework for deep neural network accelerators, enabling rapid prediction of system-level performance across diverse hardware technologies. Built on the LightGBM gradient boosting framework, TADA models energy and latency for digital compute-in-memory (CIM) accelerators that employ different buffer technologies. To ensure broad applicability, TADA is trained on a synthetically generated dataset that embeds a technology signature - including read energy, write energy, and capacity - directly into the model inputs. This approach allows TADA to generalize effectively across unseen configurations. When evaluated, TADA achieves exceptionally high correlation scores of 0.999 on synthetic datasets and 0.956 on unseen real datasets, demonstrating its robustness and predictive precision for a given transformer-buffer-hardware configuration.
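As a hedged sketch of the modeling setup described above (the feature list, dataset shapes, and target below are illustrative placeholders of our own, not TADA's actual schema), a LightGBM regressor can be trained on design points whose inputs embed a buffer technology signature:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 2000
# Illustrative design points: architecture knobs plus a buffer technology
# signature (read energy, write energy, capacity); synthetic latency target.
X = np.column_stack([
    rng.integers(8, 128, n),        # PE array rows
    rng.integers(8, 128, n),        # PE array cols
    rng.uniform(0.1, 2.0, n),       # buffer read energy (pJ/bit)  - technology signature
    rng.uniform(0.2, 4.0, n),       # buffer write energy (pJ/bit) - technology signature
    rng.uniform(32, 1024, n),       # buffer capacity (KB)         - technology signature
])
y = 1e3 / (X[:, 0] * X[:, 1]) + 0.5 * X[:, 2] + 0.8 * X[:, 3] + 50.0 / X[:, 4]

model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X[:1600], y[:1600])
pred = model.predict(X[1600:])
print("held-out correlation:", np.corrcoef(pred, y[1600:])[0, 1])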
16: Techniques for Efficient Deployment of Deep Neural Networks
S. Selvam, A. Raghunathan
Deep Neural Networks (DNNs) are widely adopted across a range of real-world applications, from natural language processing to autonomous systems, posing challenges across the whole spectrum of computing systems. Considerable attention has been paid to the rapid increase in computational requirements for training DNNs. However, their increasing computational and memory demands also pose significant challenges for deployment, especially in resource-constrained and real-time environments. This dissertation proposes techniques that enable efficient DNN deployment across different scenarios.
First, we address the computational bottleneck of vision-language models (VLMs) in scene understanding applications through SimCache, a novel similarity-based caching mechanism. By exploiting temporal and semantic locality in videos, SimCache identifies similar regions across frames and caches the embeddings of these regions to avoid redundant computation in future frames.
Second, we tackle the challenge of batched inference in Conditional Neural Networks, where the computations performed by the DNN vary across inputs, causing computational irregularity and inefficiency when inputs are batched. We propose BatchCond, an optimized batching framework composed of two complementary solutions: SimBatch, which predictively groups inputs with similar computational patterns, and ABR, which dynamically reorganizes batches in a hardware-aware manner to address any residual irregularity.
Finally, we consider the problem of deploying DNNs on heterogeneous SoCs, where the application developer must select from several network architectures and map the computations within the chosen architecture to the various processing units to optimize performance and power. To support this design space exploration, we develop CoCO-ML, a tool that combines ML models with graph algorithms to enable fast and accurate prediction of execution characteristics. In summary, the proposed inference optimization techniques enable efficient DNN deployment across diverse hardware platforms and applications.
17: ThermAI: TCAD Model Informed Thermal Analysis of Circuits using GenAI
S. Chandra, K. Roy
Thermal analysis is increasingly critical in modern integrated circuits, where non-uniform power dissipation and high transistor densities lead to rapid temperature spikes and reliability concerns. Traditional methods such as FEM-based simulations are accurate but computationally prohibitive for early-stage design, often requiring multiple iterative redesign cycles to resolve late-stage thermal failures. To address these challenges, we propose ThermAI, a physics-informed generative AI framework that identifies heat sources and estimates full-chip transient and steady-state thermal distributions from input activity profiles. ThermAI leverages a hybrid U-Net architecture enhanced with positional encoding and a Boltzmann regularizer to maintain physical fidelity. Our model is trained on an extensive dataset of heat dissipation maps for more than 200 circuit configurations, ranging from simple logic gates (e.g., inverters, NAND, XOR) to complex designs, generated via COMSOL and Cadence EDA flows. The dataset captures diverse activity patterns, and we note that material-dependent thermal properties may require targeted fine-tuning to ensure accuracy across different fabrication contexts. Experimental results demonstrate that ThermAI delivers precise temperature maps for large circuits, with a root mean squared error (RMSE) of only 0.71°C, and runs up to ~200 times faster than conventional FEM tools. We analyze performance across diverse layouts and workloads, and discuss applicability to large-scale EDA workflows. Limitations such as 2D-only modeling and the lack of real-world validation are discussed along with concrete future directions, including 3D extension, generalization across technology nodes, and transfer learning strategies.
18: Variational Shape Inference for Grasp Diffusion on SE(3)
T. Bukhari, K. Agrawal, Z. Kingston, A. Bera
Grasp synthesis is a fundamental task in robotic manipulation that usually admits multiple feasible solutions. Multimodal grasp synthesis seeks to generate diverse sets of stable grasps conditioned on object geometry, making the robust learning of geometric features crucial for success. To address this challenge, we propose a framework for learning multimodal grasp distributions that leverages variational shape inference to enhance robustness against shape noise and measurement sparsity. Our approach first trains a variational autoencoder for shape inference using implicit neural representations, and then uses the learned geometric features to guide a diffusion model for grasp synthesis on the SE(3) manifold. Additionally, we introduce a test-time grasp optimization technique that can be integrated as a plugin to further enhance grasping performance. Experimental results demonstrate that our shape-inference-based grasp synthesis formulation outperforms state-of-the-art multimodal grasp synthesis methods on the ACRONYM dataset by 6.3%, while showing greater robustness to degraded point cloud density than other approaches. Furthermore, our trained model achieves zero-shot transfer to real-world manipulation of household objects, generating 34% more successful grasps than baselines despite measurement noise and point cloud calibration errors.
19: TBD
A. Nallathambi, A. Raghunathan
Abstract: TBD.
20: Balancing Machine Learning Software Pipelines via Issue-Level Warp Prioritization
F. Shen, T. Rogers
Contemporary GPU kernels for deep learning workloads, particularly attention mechanisms in transformers, employ sophisticated parallel programming techniques to maximize hardware utilization. These techniques create software pipelines that coordinate multiple specialized hardware units (Tensor Cores, Tensor Memory Accelerators, and SIMT cores) through warp-level synchronization. However, current warp scheduling mechanisms are not designed for these new programming paradigms, leading to suboptimal performance due to imbalance and inefficient instruction issue patterns.
We propose Progress-Aware Warp Scheduling (PAWS), a novel scheduling mechanism that adapts to modern attention kernels by encoding priority information in the instruction stream. Our approach enables the warp scheduler to identify and prioritize warps executing slower program phases, thereby improving overall pipeline throughput. We demonstrate that compilers can automatically implement this mechanism, providing significant performance improvements for attention kernels on modern GPU architectures. We perform a holistic parameter sweep of 1800 attention implementations representing contemporary and future attention kernels. PAWS consistently performs better than state-of-the-art hardware warp schedulers, delivering 28% speedup when phases are highly skewed and 15% improvement on attention configurations found in QWen3, Llama4, and Grok 1.0.
21: CENTAUR: A 38.5-TFLOPS/W 600MHz Floating-Point Digital Compute-In-Memory Engine with 40nm Fusion RRAM-eDRAM Macros Featuring 3D-MAC Operation
L. Zheng, H. Li
We present CENTAUR, a 40 nm floating-point compute-in-memory (FP-CIM) engine achieving 600 MHz frequency and 38.5 TFLOPS/W energy efficiency, leveraging fusion RRAM-eDRAM macros for energy-efficient, high-speed AI inference. This work presents the first NVM-eDRAM CIM chip demonstration and addresses key limitations of existing NVM-based FP-CIM designs by introducing a novel FP 3D-MAC dataflow that eliminates pre-alignment overhead, reduces area, and preserves high inference accuracy. CENTAUR's architecture decouples static mantissa operations (RRAM) from dynamic exponent processing (eDRAM), with shift-vector-based 3-operand MACs mapped efficiently to a 3T1C gain-cell structure. Validated on Tiny-ViT with only 1.75% degradation from software baseline accuracy, CENTAUR outperforms prior SRAM-CIM by 4.84× and RRAM-CIM by 8.48× in a comprehensive figure of merit (FoM).
22: CHEETA: CMOS+MRAM Hardware for Energy-EfficienT AI
M. Mukherjee, K. Roy
We present a CMOS+MRAM compute-in-memory (CiM) macro that targets energy-efficient edge AI by collapsing data movement across the von Neumann boundary. Built around a 1T-1MTJ bit-cell array with robust mixed-signal peripherals (sense-line stabilization, dummy-current cancellation, low-overhead current-to-voltage conversion, and compact flash-ADC quantization), the macro performs in-situ multiply-accumulate to cut memory-to-compute traffic while preserving inference accuracy. Partial-wordline activation and carefully co-designed analog/digital control limit switching and improve sense margins under PVT and device variability, enabling reliable operation at low supply voltages. The resulting architecture delivers high parallelism, non-volatility, and fine-grained quantization tailored for always-on, latency-sensitive workloads at the edge, where battery budgets and thermal headroom are tight. Overall, this work demonstrates a tape-out-ready MRAM CiM building block that pairs manufacturable 1T-1MTJ devices with pragmatic peripheral design to achieve substantial energy savings per inference without sacrificing accuracy or robustness, offering a scalable path toward practical, system-level reductions in AI cost and power.
23: CREST-CiM: Cross-Coupling-Enhanced Differential STT-MRAM for Robust Computing-in-Memory in Binary Neural Networks
I. Ahmed, A. Malhotra, S. Gupta
We propose CREST-CiM, an STT-MRAM-based Computing-in-Memory (CiM) technique targeted at binary neural networks. To circumvent the low-distinguishability issue in standard MRAM-based CiM, CREST-CiM utilizes two magnetic tunnel junctions (MTJs) to store +1 and -1 weights in a bitcell and cross-couples the MTJs, achieving a high-to-low current ratio of up to 8100 per bit-cell. Our analysis for 64x64 arrays shows up to 3.4x higher CiM sense margin, 27.6% higher read-disturb margin, and resilience to process variations and other hardware non-idealities, at the cost of just a 7.9% overall area overhead and <1% energy and latency overhead compared to a 2T-2MTJ CiM design. Our system-level analysis for ResNet-18 trained on CIFAR-10 shows near-software inference accuracy with CREST-CiM, a 10.7% improvement over the 2T-2MTJ baseline.
24: Edge AI for Wearable and Wireless Sensing
H. Vu, Y. Kim
The deployment of Edge AI in real-world monitoring systems is fundamentally constrained by two major challenges: securing a sufficient energy budget for continuous operation and obtaining rich, multimodal data for model development. This work presents two complementary solutions to address these limitations. First, to overcome the energy requirement, we present eTag, an energy-neutral platform that enables perpetual operation of mobile devices through opportunistic charging. eTag provides a robust hardware foundation capable of supporting the energy-intensive demands of on-device inference. Second, to address the data scarcity problem, we developed MmCows, a large-scale multimodal dataset featuring nine distinct modalities that capture real-world complexity. MmCows serves as a critical benchmark for training and validating the complex sensor fusion models required for developing robust and generalizable AI. Together, this validated energy-neutral hardware and comprehensive benchmark dataset provide a practical and scalable framework for advancing Edge AI in long-term monitoring systems.
25: Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators
S. Roy, C. Wang, A. Raghunathan
Progress in machine learning has been driven by the ability to train larger deep neural networks (DNNs). Training DNNs is a memory-intensive process, requiring different data structures during different stages. Limited on-chip memory leads to many expensive DRAM operations. Dense non-volatile memories are being explored to provide more on-chip memory capacity and mitigate the high leakage power of SRAM. Among the emerging non-volatile memories, Spin-Transfer-Torque MRAM (STT-MRAM) offers high endurance and reasonable access times, making it suitable for training DNNs. Compared to SRAM, STT-MRAM provides 3-4 times higher density and significantly reduced leakage power. However, STT-MRAM requires high write energy and latency. In this study, we perform a comprehensive device-to-system evaluation and co-optimization of STT-MRAM for efficient ML training accelerator design. We devised a cross-layer simulation framework to evaluate the effectiveness of STT-MRAM as a scratchpad in a systolic-array-based DNN accelerator relative to SRAM. We address the inefficiency of writes in STT-MRAM by utilizing reduced write voltage, which comes at the cost of bit errors. To evaluate the accuracy-efficiency tradeoff, we studied the error tolerance of the different data structures. We propose heterogeneous memory configurations that enable training convergence with good accuracy. Our results indicate that replacing SRAM with STT-MRAM can provide up to 15x and 22x improvement in system-level energy for iso-capacity and iso-area scenarios, respectively, across a suite of DNN benchmarks. Further optimizing STT-MRAM write operations can provide over 2x improvement in write energy with minimal degradation in application-level training accuracy.
26: Experimental Investigation of Variations in Polycrystalline Hf0.5Zr0.5O2 (HZO)-based MFIM
T. Kim, R. Koduru, Z. Lin, P. Ye, S. Gupta
Ferroelectric (FE) devices based on Hf0.5Zr0.5O2 (HZO) are promising memory technologies for high-density and energy-efficient AI hardware owing to their nonvolatile multi-state polarization and CMOS compatibility. However, device-to-device variation in the HZO layer, particularly in remanent polarization (PR), degrades the precision and reliability of FE-based compute-in-memory systems. We experimentally investigate the origin of PR variation in HZO-based metal-ferroelectric-insulator-metal (MFIM) capacitors across different set voltages (VSET) and ferroelectric thicknesses (TFE). The standard deviation of PR exhibits a non-monotonic dependence on VSET, peaking near the coercive voltage (VC). In the low- and high-VSET regions, PR variation is mainly governed by saturation polarization (PS) fluctuations arising from charge trapping and polycrystallinity, whereas in the mid-VSET region, VC variation from random domain nucleation dominates. Our work further reveals that thinner FE layers suppress nucleation, reducing the peak and non-monotonicity of PR variation. Our findings offer insight into selecting appropriate VSET and TFE conditions for reliable FE memory design in AI hardware.
27: Fault Tolerant Computing-in-memory for Ternary LLMs
A. Malhotra, D. Kim, S. Gupta
Ternary large language models (LLMs), which utilize ternary-precision weights and 8-bit activations, have demonstrated competitive performance while significantly reducing the high computational and memory requirements of full-precision LLMs. These costs can be further reduced by deploying them on Ternary Computing-in-Memory (TCiM) accelerators, which alleviate the large data-movement costs of conventional von Neumann systems. However, despite these advantages, TCiM accelerators are prone to memory stuck-at faults (SAFs), which corrupt a portion of the memory bits, degrading model accuracy. In this work, we propose a training-free solution to this issue that exploits the natural redundancy present in TCiM bitcells, together with weight transformations, to enhance the fault tolerance of TCiM accelerators. Our experiments on BitNet b1.58 700M and 3B ternary LLMs show that our technique furnishes sizeable accuracy improvements, notably up to a ~40% reduction in perplexity, while incurring a mild hardware overhead.
28: HyFPCiM: A 65nm 417 µW 18.2 TFLOPS/W/mm2 Hybrid FP8 CiM Macro for Sub-mW Edge AI
G. Kumar K, K. Roy (Presented by: S. Bose, R. Maity)
Edge AI applications such as wearables, health monitors, and IoT sensors require sub-mW inference, but existing floating-point computing-in-memory (FP-CiM) solutions consume 10-460 mW, far exceeding device budgets. Integer quantization (INT8/INT4) reduces bit-width but often demands high-precision scale-factor multiplications and retraining, which erode efficiency.
We present HyFPCiM, a hybrid FP8 (1S-4E-3M) CiM architecture that exploits the asymmetric error sensitivity of floating-point components. Exponents, which are 8-15× more error-sensitive than mantissas, are processed digitally (DCiM) for exact, low-cost computation, while mantissas are handled by a switched-capacitor MAC-based analog CiM (ACiM), where minor non-idealities are tolerable and enable massive parallelism. A duty-cycled, power-gated TIA/ADC further reduces ADC energy by 8× and ADC+TIA power by 75%.
Fabricated in 65 nm TSMC LP, HyFPCiM achieves 417 μW at 50 MHz (1 V), enabling sub-mW FP8 inference. This is >24× lower power than prior FP-CiM designs, with 18.2 TFLOPS/W/mm2 area efficiency and 40 GFLOPS/W system efficiency. After iso-28 nm normalization, HyFPCiM outperforms the best prior work by >2.75×. FP8 inference on ResNet-18 with CIFAR-10/100 and Tiny-ImageNet shows <0.55% accuracy loss versus software FP8, confirming its suitability for real-world edge AI.
29: PRISM: Toward Energy-Efficient AIoT with Processing-In-Sensor+Memory in Battery-Free Devices
A. Das, Y. Agarwal, C. Wang, A. Raha, S. Thirumala, S. Gupta, V. Raghunathan (Purdue University, Intel Corporation, Micron Technology)
The proliferation of battery-free devices for the Artificial Intelligence of Things (AIoT) demands energy-efficient computation under intermittent power. PRISM introduces the first architecture that tightly integrates Processing-in-Sensor (PiS) and Processing-in-Memory (PiM) techniques to enable Binary Neural Network (BNN) workloads on energy-harvesting platforms. PRISM executes early neural layers directly at the sensor to minimize data movement and checkpointing, while leveraging PiM for deeper layers to optimize energy efficiency and latency. Evaluations using our full-stack simulation framework, EPCEN-P, demonstrate up to 145× improvement in execution cycles and 110× reduction in checkpointing frequency compared to conventional microcontroller-only designs. This hybrid PiS+PiM approach represents a significant step toward practical, energy-autonomous AIoT deployments.
30: Processing-in-DRAM to Alleviate the Memory Bottleneck
E. Berscheid, S. Roy, A. Raghunathan
Modern AI workloads demand increasingly large compute and memory resources. However, DRAM capacity and bandwidth have scaled more slowly than FLOPS and model size, leading to a memory bottleneck. Performing computation near and inside memory structures to reduce data movement has been a successful strategy for alleviating this bottleneck. Our work focuses on processing-in-DRAM, where computation takes place either in-array using charge-sharing techniques, or near-array, near-bank, and near-channel using digital processing elements. The first work targets DNN/CNN applications and proposes a mapping technique and dataflow that fully enable DNN computation within the DRAM, achieving 19.5x better performance than a GPU. The second work targets recommendation systems, which rely heavily on embedding-table "gather-reduce" operations that have very low arithmetic intensity. We propose a table reindexing method that uses historical frequent-set statistics of vector accesses to increase the probability of vector co-locality at runtime. Vectors are intelligently placed in memory to optimize the performance of the given processing-in-DRAM architecture, achieving a 34.9% speedup over a baseline processing-in-DRAM design.
31: SpiDR: A 65nm 5 TOPS/W Digital CIM Accelerator with Reconfigurable Precision and Temporal Pipelining for Spiking Neural Networks
D. Sharma, K. Roy
Spiking Neural Networks (SNNs) provide a framework for energy-efficient processing of event-based and temporal data by exploiting sparse and asynchronous computations. However, existing SNN accelerators are often limited by fixed network architectures, limited bit-precision support, and inefficient handling of neuron membrane potential (Vmem) dynamics, restricting their adaptability across diverse workloads. This work presents SpiDR, a digital compute-in-memory (CIM) based SNN accelerator that addresses these limitations by introducing architectural and dataflow-level reconfigurability. SpiDR features: 1) CIM macros with fused weight and Vmem memory to reduce data movement. 2) Reconfigurable peripherals coupled with staggered data mapping to achieve high utilization and support flexible weight/Vmem bit precisions. 3) Zero-skipping mechanism for leveraging spike sparsity to reduce energy consumption, without introducing high overhead for low sparsity. 4) An asynchronous handshaking mechanism to ensure high computational efficiency for event-driven SNNs. 5) Support for various neuron models and two chip operating modes to enable compatibility with a broad range of SNNs. Fabricated in 65 nm Taiwan Semiconductor Manufacturing Company (TSMC) low power (LP) technology, SpiDR achieves up to 5 TOPS/W energy efficiency under typical event-based sparsity conditions (95%) and provides support for diverse SNN workloads. We evaluated SpiDR on two representative event-based tasks: gesture recognition on the IBM DVS dataset and optical flow estimation on the DSEC dataset, demonstrating energy-accuracy trade-offs with change in network architecture and data precision. SpiDR achieves 2x to 1000x higher area efficiency and up to 7.6x higher energy efficiency compared to prior designs.
32: TAXI: Traveling Salesman Problem Accelerator with X-bar-based Ising Macros Powered by SOT-MRAMs and Hierarchical Clustering
S. Yoo, A. Holla, S. Sanyal, D. Kim, F. Iacopi, D. Biswas, J. Myers, K. Roy
Ising solvers with hierarchical clustering have shown promise for large-scale Traveling Salesman Problems (TSPs) in terms of latency and energy. However, most of these methods still face unacceptable quality degradation as the problem size grows beyond a certain point. Additionally, their hardware-agnostic implementations limit their ability to fully exploit available hardware resources. In this work, we introduce TAXI -- an in-memory-computing-based TSP accelerator with crossbar (Xbar)-based Ising macros. Each macro independently solves a TSP sub-problem, obtained by hierarchical clustering, without any off-macro data movement, leading to massive parallelism. Within the macro, Spin-Orbit-Torque (SOT) devices serve as compact, energy-efficient random number generators enabling rapid "natural annealing". By leveraging hardware-algorithm co-design, TAXI offers improvements in solution quality, speed, and energy efficiency on TSPs with up to 85,900 cities (the largest TSPLIB instance). TAXI produces solutions that are only 22% and 20% longer than the Concorde solver's exact solutions on the 33,810- and 85,900-city TSPs, respectively. TAXI outperforms a current state-of-the-art clustering-based Ising solver, being 8x faster on average across 20 benchmark problems from TSPLIB.
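For background, crossbar Ising macros for TSP typically target the standard one-hot (city x position) encoding shown below; this is generic context, not TAXI's exact Hamiltonian, and the penalty weights A and B are our notation. With x_{v,j} = 1 meaning city v is visited at tour position j (positions taken modulo N):

H = A \sum_{v}\Big(1 - \sum_{j} x_{v,j}\Big)^{2}
  + A \sum_{j}\Big(1 - \sum_{v} x_{v,j}\Big)^{2}
  + B \sum_{u \neq v} W_{uv} \sum_{j} x_{u,j}\, x_{v,j+1},

with A > B \max_{u,v} W_{uv} > 0, so that violating the one-hot constraints always costs more than any tour-length saving. A full N-city instance needs on the order of N^2 such spins; hierarchical clustering lets each Xbar macro anneal a much smaller sub-problem independently, which is what enables the massive parallelism described above.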
33: WAGONN: Weight Bit Agglomeration in Crossbar Arrays for Reduced Impact of Interconnect Resistance on DNN Inference Accuracy
J. Louis, S. Gupta
Deep neural network (DNN) accelerators employing crossbar arrays capable of in-memory computing (IMC) are highly promising for neural computing platforms. However, in deeply scaled technologies, interconnect resistance severely impairs IMC robustness, leading to a drop in system accuracy. To address this problem, we propose WAGONN, a technique based on agglomerating weight bits in crossbar arrays that alleviates the detrimental effect of wire resistance. For 8T-SRAM-based 128×128 crossbar arrays in 7nm technology, WAGONN enhances accuracy from 47.78% to 83.5% for ResNet-20/CIFAR-10. We also show that WAGONN can be used synergistically with Partial-Word-Line-Activation, further boosting accuracy. Further, we evaluate the implications of WAGONN for compact ferroelectric transistor-based crossbar arrays and show accuracy enhancement. WAGONN incurs minimal hardware overhead, with less than a 1% increase in energy consumption. Additionally, the latency and area overheads of WAGONN are ~1% and ~16%, respectively, when one ADC is utilized per crossbar array.