I am a PhD student at MIT advised by the wonderful Armando Solar-Lezama.
I also did my bachelor's and Master's degrees at MIT, advised by Armando and Jacob Andreas. I have done internships at Meta AI Research, Jane Street, and pony.ai. My research is generously funded by the NSF GRFP.
My research interests are to improve the capabilities of AI systems on tasks like programming and mathematics through methods such as
neurosymbolic programming.
The best part of doing a PhD is that I get to learn from all kinds of people, so feel free to reach out if you want to chat or collaborate!
Bandwidth permitting, I am also happy to mentor undergraduate students who are interested in my research and provide advice for people who would benefit from my experiences.
Outside research, I love music. I studied piano for over 15 years under inspiring pianists Yukiko Sekino and Chun-Chi An and
received a music minor in my undergraduate. As a highlight, I performed the first movement of
Rachmaninoff's 2nd Piano Concerto with the MIT Symphony Orchestra.
I occasionally perform piano recitals and rarely make EDM-like tracks and remixes.
[May 2024]The Counterfeit Conundrum was accepted to ACL Findings, 2024. Look forward to enjoying Bangkok with everyone!
[May 2024]CRUXEval was accepted to ICML 2024! Try out our benchmark to see if your code models can reason about code execution, and see you in Vienna!
[May 2024] I will be doing an internship at NVIDIA with the LLM model optimization team led by Di Wu in Santa Clara! Let me know if you are in the Bay Area!
LiveCodeBench is a holistic and contamination-free
benchmark for code LMs. It consists of 4 tasks: code generation, self-repair, code execution,
and test output prediction. We update the benchmark periodically with new high-quality problems
from platforms like LeetCode, AtCoder, and Codeforces.
In The Counterfeit Conundrum, we analyze the ability of
open code language models to understand their counterfeit samples. These are samples that
1) have a high enough log-probability to be generated at a moderate temperature, 2) are incorrect,
but 3) pass weak correctness checks. We find that open code LMs 1) think counterfeits are correct,
2) execute them as if they were correct, and 3) can't repair them without feedback.
We train new models with 3B, 7B, and 15B on Software Heritage source code including 619 programming
languages and high-quality data sources like GitHub pull requests, Kaggle notebooks, and code documentation.
Our models outperform most similarly sized models on a variety of benchmarks.
CRUXEval is a benchmark of 800 Python functions
and input-output pairs designed to test the ability of code LMs on code reasoning, understanding,
and execution. We find that despite being trained on 100G of Python code and 1T of code data,
models like Code Llama fail over half the time at simple execution prediction and code reasoning!
We apply a method used in multilingual NLP on code, showing that vector embeddings for code can be decomposed into
syntax-like and semantic-like components. We show that when removing the syntax-like component, language identification
becomes expectedly difficult, and Text2Code and Code2Code retrieval performance improves.
We propose, LINC, a neurosymbolic approach to logical reasoning from natural language where the LLM acts as an
autoformalizer to first-order logic and a logic theorem prover makes a deduction. We also qualitatively compare LINC to chain of thought,
showing they make different mistakes and thus are complimentary.
We release LeanDojo: an open-source playground consisting of toolkits, benchmarks,
and models for LLMs to prove formal theorems in the Lean proof assistant. LeanDojo contains 1) tools for data extraction
and interaction with Lean, 2) fine-grained annotations of where lemmas are used and defined, 3) a new benchmark of 97K
human-written theorems from mathlib, and 4) a retrieval-augmented theorem prover using retrieval for relevant premise selection.
StarCoder and StarCoderBase are 15.5B parameter models with 8K context length, infilling capabilities,
and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion
tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with
inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens,
resulting in the creation of StarCoder.
Pruning CodeBERT for Improved Code-to-Text Efficiency [Coming Soon!]
Alex Gu, Ria Sonecha, Saaketh Vedantam, Bharat Runwal, Diganta Misra
Workshop on Sparsity in Neural Networks, ICLR 2023
We do some very preliminary experiments on pruning the encoder piece of CodeBERT. Preprint coming soon,
but if you're interested in this topic, please reach out!
We extend standard single-objective bilevel optimization to a min-max multi-objective framework that aims to minimize
the worst-case loss of all tasks. Our main result is theoretical: we introduce a new algorithm (MORBiT) for our framework
and show a convergence result. We also highlight applications in representation learning and hyperparameter optimization.
🎅SantaCoder: Don't Reach for the Stars!🌟
Loubna Ben Allal*, Raymond Li*, Denis Kocetkov*, ..., Alex Gu, ..., Leandro von Werra*
Deep Learning for Code Workshop, ICLR 2023 [Best Paper]
The BigCode project is an open-scientific collaboration working on the responsible open-source development of large language models
for code. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack, and our best model outperforms
previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling.
ObSynth leverages domain knowledge embedded in GPT-3 to help users design object models from high level natural language prompts.
We synthesize object names, field names, field types, method names, and relationships between objects.
Also, we conduct a user study to highlight how users may interact with ObSynth to design a restaurant management application.
We introduce the disagreement problem in explainable machine learning, showing that commonly used
algorithms including LIME, SHAP, and SmoothGrad often disagree in practice. We also conduct a user study
highlighting that practitioners do not have good ways of resolving these disagreements in their day to day.
Standard three operator splitting minimizes the sum of three convex functions f(x) + g(x) + h(x), where f is smooth and the prox operators of
g and h are computable. We focus on three settings: (i) f is nonsmooth, (ii) we only have noisy gradients and subgradients of f, (iii) an adaptive setting where smoothness properties of f are unknown.
Our algorithm, CORGI, computes certified bounds on an interpretability algorithm known as CAM (Class Activation Mapping), showing that within the certified radius,
the top-K pixels of the CAM map do not change.
Invited Talks
If you find my work interesting, I am happy to give a talk, in-person along the Northeast Corridor (e.g. NYC, Boston) and virtually otherwise!
March 2024: Beyond Code Generation: Do code language models understand programs the same way humans do? at Lei Lab (CMU, virtual), Cornell Tech, and Columbia PL/SE seminar
February 2024: CRUXEval: Code Reasoning, Understanding, and Execution Evaluation at AGI Leap Summit (virtual)
October 2023: LeanDojo: Theorem Proving with Retrieval-Augmented Language Models at CMSA New Technologies in Mathematics seminar (Harvard)
May 2023: What we need before even attempting to replace programmers with AI at TEDxBoston
Teaching Experience
Spring 2022: TA, Nonlinear Optimization (6.252)
Fall 2021: TA, Machine Learning (6.867)
Spring 2020: TA, Signals, Systems, and Inference (6.011)
The intention of this research project is to explore alternative methodologies available to people who are struggling with quarantine and isolation, providing them with an opportunity to resuscitate positive thinking and behavior. The result of the study proved that direct online social engagement can be an alternative way to resuscitate positive thinking and behavior for everyone under strenuous circumstances, such as during COVID.
CUTHBERT, a music generator that uses Markov chain models to gain an intuition for music, is a software tool designed to remove knowledge barriers for new musicians and provide inspiration to experienced musicians that just need an idea to start writing.