I am a PhD student at MIT advised by the wonderful Armando Solar-Lezama.
I also did my bachelor's and master's degrees at MIT, advised by Armando and Jacob Andreas. I have done internships at Meta AI Research, Jane Street, and pony.ai. My research is generously funded by the NSF GRFP.
My research dreams lie in these three (somewhat different) areas, though I'm currently focusing on the first two.
I'm very open to both collaborating with and advising students (through UROP if you're an undergrad at MIT;
otherwise, we can discuss), so don't hesitate to reach out if you're excited about these topics and would like to work together!
1. Neurosymbolic programming, specifically through the lens of
smashing symbolic techniques together with LLMs to create more reliable and powerful ways to perform program synthesis.
2. Developing methodologies to interpret the behavior and understand the capabilities of code LMs,
understand their reasoning abilities, and distinguish them from natural language LMs.
3. Understanding theoretical foundations and training dynamics behind large language models (for code),
such as sensitivity to initialization/learning rates, sharpness-aware training, or pruning.
We apply a method from multilingual NLP to code, showing that vector embeddings of code can be decomposed into
syntax-like and semantic-like components. We show that removing the syntax-like component makes language identification
difficult, as expected, while improving Text2Code and Code2Code retrieval performance.
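As a rough illustration of what such a decomposition can look like, here is a hypothetical sketch based on per-language mean subtraction, a standard trick in multilingual NLP; the paper's exact procedure may differ.

import numpy as np

# Hypothetical sketch (not necessarily the paper's exact method): treat the
# per-language mean embedding as the "syntax-like" component and the residual
# as the "semantic-like" component.
def decompose(embeddings_by_lang):
    syntax_like, semantic_like = {}, {}
    for lang, vecs in embeddings_by_lang.items():     # vecs: (n, d) array
        lang_mean = vecs.mean(axis=0, keepdims=True)  # language-specific component
        syntax_like[lang] = lang_mean
        semantic_like[lang] = vecs - lang_mean        # residual component
    return syntax_like, semantic_like

# Toy usage with random vectors standing in for real code embeddings.
rng = np.random.default_rng(0)
emb = {"python": rng.normal(size=(4, 8)), "java": rng.normal(size=(4, 8))}
syntax, semantic = decompose(emb)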
We propose LINC, a neurosymbolic approach to logical reasoning over natural language in which an LLM acts as an
autoformalizer into first-order logic and a symbolic theorem prover performs the deduction. We also qualitatively compare LINC to chain-of-thought prompting,
showing that they make different kinds of mistakes and are therefore complementary.
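A minimal sketch of the symbolic half of such a pipeline, using NLTK's Prover9 interface (the external Prover9 binary must be installed); in LINC the first-order logic formulas come from an LLM autoformalizer, whereas here they are hard-coded for illustration.

from nltk.sem import Expression
from nltk.inference.prover9 import Prover9

read_expr = Expression.fromstring

# In the full pipeline an LLM would autoformalize natural-language premises
# into first-order logic; here the formulas are hard-coded for illustration.
premises = [
    read_expr("all x.(man(x) -> mortal(x))"),
    read_expr("man(socrates)"),
]
conclusion = read_expr("mortal(socrates)")

# The symbolic theorem prover then decides whether the conclusion follows.
prover = Prover9()
print(prover.prove(conclusion, premises))  # True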
We release LeanDojo: an open-source playground consisting of toolkits, benchmarks,
and models for LLMs to prove formal theorems in the Lean proof assistant. LeanDojo contains 1) tools for data extraction
and interaction with Lean, 2) fine-grained annotations of where lemmas are used and defined, 3) a new benchmark of 97K
human-written theorems from mathlib, and 4) a retrieval-augmented theorem prover that selects relevant premises.
StarCoder and StarCoderBase are 15.5B parameter models with 8K context length, infilling capabilities,
and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion
tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with
inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens to create StarCoder.
Pruning CodeBERT for Improved Code-to-Text Efficiency [Coming Soon!]
Alex Gu, Ria Sonecha, Saaketh Vedantam, Bharat Runwal, Diganta Misra
Workshop on Sparsity in Neural Networks, ICLR 2023
We present some very preliminary experiments on pruning the encoder of CodeBERT. A preprint is coming soon,
but if you're interested in this topic, please reach out!
We extend standard single-objective bilevel optimization to a min-max multi-objective framework that aims to minimize
the worst-case loss of all tasks. Our main result is theoretical: we introduce a new algorithm (MORBiT) for our framework
and show a convergence result. We also highlight applications in representation learning and hyperparameter optimization.
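Schematically, and with notation assumed here purely for illustration, the worst-case bilevel objective has the form

\min_{x} \; \max_{i \in \{1, \dots, n\}} \; f_i\big(x, y_i^*(x)\big)
\quad \text{where} \quad
y_i^*(x) \in \operatorname*{arg\,min}_{y_i} g_i(x, y_i),

so the outer problem minimizes the loss of the worst-off task i, each of which depends on its own inner solution y_i^*(x).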
🎅SantaCoder: Don't Reach for the Stars!🌟
Loubna Ben Allal*, Raymond Li*, Denis Kocetkov*, ..., Alex Gu, ..., Leandro von Werra*
Deep Learning for Code Workshop, ICLR 2023 [Best Paper]
The BigCode project is an open-scientific collaboration working on the responsible open-source development of large language models
for code. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack, and our best model outperforms
previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling.
ObSynth leverages domain knowledge embedded in GPT-3 to help users design object models from high-level natural language prompts.
We synthesize object names, field names, field types, method names, and relationships between objects.
We also conduct a user study highlighting how users interact with ObSynth to design a restaurant management application.
We introduce the disagreement problem in explainable machine learning, showing that commonly used
algorithms including LIME, SHAP, and SmoothGrad often disagree in practice. We also conduct a user study
highlighting that practitioners do not have good ways of resolving these disagreements in their day-to-day work.
Standard three-operator splitting minimizes the sum of three convex functions f(x) + g(x) + h(x), where f is smooth and the proximal operators of
g and h are computable. We focus on three settings: (i) f is nonsmooth, (ii) we only have access to noisy gradients and subgradients of f, and (iii) an adaptive setting where the smoothness properties of f are unknown.
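For reference, the standard iteration for this problem is Davis-Yin three-operator splitting, which with step size \gamma (notation chosen here for illustration) reads

x_g^{k} = \operatorname{prox}_{\gamma g}(z^{k}), \qquad
x_h^{k} = \operatorname{prox}_{\gamma h}\!\big(2 x_g^{k} - z^{k} - \gamma \nabla f(x_g^{k})\big), \qquad
z^{k+1} = z^{k} + x_h^{k} - x_g^{k},

and the settings above relax the assumptions (smoothness of f, exact gradients, known smoothness constants) that this iteration requires.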
Our algorithm, CORGI, computes certified bounds on an interpretability algorithm known as CAM (Class Activation Mapping), showing that within the certified radius,
the top-K pixels of the CAM map do not change.
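Informally, and with notation assumed here for illustration (the precise norm and threat model are as in the paper), the certificate guarantees

\|\delta\| \le r \;\Longrightarrow\; \operatorname{top-}K\big(\mathrm{CAM}(x + \delta)\big) = \operatorname{top-}K\big(\mathrm{CAM}(x)\big).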
Teaching Experience
Spring 2022: TA, Nonlinear Optimization (6.252)
Fall 2021: TA, Machine Learning (6.867)
Spring 2020: TA, Signals, Systems, and Inference (6.011)
The goal of this project is to explore alternative methodologies available to people struggling with quarantine and isolation, giving them an opportunity to restore positive thinking and behavior. The study found that direct online social engagement can be one such alternative under strenuous circumstances, such as during COVID.
CUTHBERT is a music generator that uses Markov chain models to gain an intuition for music. It is designed to remove knowledge barriers for new musicians and to provide inspiration to experienced musicians who just need an idea to start writing.
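As an illustrative sketch only (not CUTHBERT's actual code), a first-order Markov chain over notes can be trained and sampled like this:

import random
from collections import defaultdict

# Learn note-to-note transitions from a toy melody (first-order Markov chain).
def train(melody):
    transitions = defaultdict(list)
    for prev, nxt in zip(melody, melody[1:]):
        transitions[prev].append(nxt)
    return transitions

# Sample a new melody by repeatedly choosing a random successor note.
def generate(transitions, start, length=16):
    note, out = start, [start]
    for _ in range(length - 1):
        choices = transitions.get(note) or [start]  # restart on dead ends
        note = random.choice(choices)
        out.append(note)
    return out

melody = ["C4", "D4", "E4", "C4", "E4", "G4", "E4", "D4", "C4"]
print(generate(train(melody), "C4"))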