Research

Past Research

Research directions I've explored in the past.

International Conference on Learning Representations (ICLR)

Deep networks that run low-precision operations at inference time offer power and space advantages over high-precision alternatives, but must maintain high accuracy as precision decreases. Learned Step Size Quantization (LSQ), introduced in this work, is now a widely adopted method for training such networks. It scales well across a wide variety of architectures and applications, demonstrating state-of-the-art performance with weights and activations quantized to 2, 3, or 4 bits of precision and often reaching full-precision baseline accuracy. LSQ improves how the quantizer itself is configured by introducing a novel means to estimate and scale the task loss gradient at the quantizer step size of each weight and activation layer, so that the step size can be learned along with the other network parameters.
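A minimal sketch of the step-size idea, not the reference implementation. The function names (`lsq_quantize`, `lsq_step_grad`) are hypothetical, and the details are assumptions about the published formulation: a uniform quantizer with clipping levels set by the bit width, a straight-through estimate through the rounding operation, and a gradient scale of 1/sqrt(N * Q_P) for a layer with N elements.

```python
import math


def _levels(bits, signed):
    """Return (Q_N, Q_P): the number of negative and positive grid levels."""
    if signed:
        return 2 ** (bits - 1), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def lsq_quantize(values, step, bits=2, signed=True):
    """Scale by the step size, clip, round to an integer level, rescale."""
    q_n, q_p = _levels(bits, signed)
    out = []
    for v in values:
        clipped = min(max(v / step, -q_n), q_p)
        # Python's round() sends exact .5 ties to the nearest even integer.
        out.append(round(clipped) * step)
    return out


def lsq_step_grad(values, upstream_grads, step, bits=2, signed=True):
    """Scaled gradient of the loss with respect to the step size.

    upstream_grads holds d(loss)/d(quantized value) for each element.
    """
    q_n, q_p = _levels(bits, signed)
    # Assumed gradient scale: keeps step-size updates comparable in
    # magnitude to weight updates regardless of layer size.
    scale = 1.0 / math.sqrt(len(values) * q_p)
    total = 0.0
    for v, g in zip(values, upstream_grads):
        r = v / step
        if r <= -q_n:
            local = -q_n          # clipped at the negative end
        elif r >= q_p:
            local = q_p           # clipped at the positive end
        else:
            local = round(r) - r  # straight-through estimate inside the range
        total += g * local
    return scale * total
```

In a training loop, the value returned by `lsq_step_grad` would be used to update `step` with the same optimizer that updates the weights. Because the inner term depends on how far each value sits from its nearest grid point, the step size receives a useful signal even though rounding itself has zero gradient almost everywhere.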

Quantization Efficient inference ICLR

Neural Inference at the Frontier of Energy, Space and Time

Science, 2023

NorthPole is a brain-inspired neural inference accelerator that eliminates off-chip memory and intertwines compute with memory on-chip, appearing externally as an active memory chip. By tightly co-optimizing low-precision compute, dense interconnects, and a high-utilization programming model in 12 nm silicon, it delivers highly parallel, energy-efficient neural network inference. On standard benchmarks, it achieves dramatic gains over comparable architectures, including 25× better FPS/W, 5× better FPS/transistor, and 22× lower latency, surpassing even many architectures built on more advanced process nodes.

Efficient inference Science

SiLQ: Simple Large Language Model Quantization-Aware Training

ACL Findings, 2025

Large language models can be quantized to reduce latency, size, and energy consumption, thereby delivering a better user experience at a lower cost. The challenge is to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. SiLQ is a simple, end-to-end quantization-aware training approach that, using less than 0.1% of the total model training budget, outperforms leading published quantization methods by large margins on several modern benchmarks, for both base and instruct model variants. The approach generalizes easily across different model architectures and can be applied to activations, cache, and weights.

LLM quantization ACL

Entropy Approximation Guided Layer Selection (EAGL) for Mixed-Precision Neural Network Quantization

ASPLOS '24 Workshop on Energy Efficient Machine Learning and Cognitive Computing

EAGL is a principled approach for selecting which layers to quantize and at what precision, guided by an entropy-based sensitivity approximation that reduces the search cost for mixed-precision configurations. The key insight behind this metric is that the entropy of the empirical distribution of a layer's parameters measures the complexity required to achieve the desired performance. Building on this insight, an entropy-based metric is introduced for quantifying the advantage of keeping a layer at a higher precision. EAGL is (i) easy to approximate and (ii) does not need access to the training dataset to compute, making it faster and more generally applicable to other problem domains than the other metrics introduced in the mixed-precision quantization literature.
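As a rough illustration of the entropy insight, the sketch below estimates the entropy of a layer's empirical weight distribution from a histogram and ranks layers by it. It is a sketch under stated assumptions rather than the published metric: the names `layer_entropy` and `rank_layers`, the equal-width binning, and the default bin count are all hypothetical choices made for the example. Note that it needs only the weights, not any training data.

```python
import math
from collections import Counter


def layer_entropy(weights, num_bins=16):
    """Shannon entropy (in bits) of a histogram of one layer's weights."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0  # a constant layer carries no information
    width = (hi - lo) / num_bins
    # Assign each weight to an equal-width bin; the maximum goes in the last bin.
    counts = Counter(min(int((w - lo) / width), num_bins - 1) for w in weights)
    n = len(weights)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_layers(layers, num_bins=16):
    """Return layer names sorted from highest to lowest entropy.

    layers maps a layer name to a flat list of its weights. Under the
    assumption made here, higher-entropy layers are the stronger
    candidates for keeping at higher precision.
    """
    scores = {name: layer_entropy(w, num_bins) for name, w in layers.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_layers({"conv1": [0.0, 0.0, 1.0, 1.0], "fc": [0.5, 0.5, 0.5, 0.5]}, num_bins=2)` places `conv1` first, since its weights are spread evenly over two bins while those of `fc` are constant.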

Mixed precision Quantization ASPLOS

Improving Transfer Using Augmented Feedback in Progressive Neural Networks

NeurIPS Workshop on Cognitively Informed Artificial Intelligence

This work investigates how augmented feedback signals improve lateral knowledge transfer in progressive neural networks, drawing on cognitive science models of learning. Inspired by reciprocal feedback connections in the visual cortex, we augment the lateral connections in the progressive neural network architecture and show that the modified architecture improves transfer over the progressive neural network baseline.

Transfer learning NeurIPS Cognitively inspired

Incorporating Attention in World Models for Improved Dynamics Modeling

NeurIPS Workshop on Modeling the Physical World

Written before attention took over the machine learning community, this workshop paper extends world model architectures with attention mechanisms, improving the accuracy and generalization of learned dynamics models for physical prediction tasks.

World models Attention NeurIPS