TL;DR: Frontier coding model served to millions of users in Cursor.
Niklas Muennighoff
I'm building Composer at Cursor, a frontier AI coding model. I'm also a PhD student & Knight-Hennessy Scholar at Stanford, advised by Yejin Choi & Andrew Ng.
My work includes MTEB, a widely used AI evaluation framework with 18M+ downloads; s1, which helped define test-time scaling; and Scaling Data-Constrained LMs, which helped establish multi-epoch pretraining, now standard at frontier labs.
Awards include NeurIPS Outstanding Paper Runner-Up, CVPR Best Paper Honorable Mention, ACL Best Paper + Best Theme Paper + Best Resource Paper, and 2nd/3300+ in Meta's Hateful Memes Challenge.
I did my bachelor's at Peking University & am comfortable working in Chinese.
Select AI Research
Watch
TL;DR: Training LLMs to reason with just 1K training samples & a simple technique to control reasoning duration called "budget forcing".
Watch
TL;DR: How to scale LLMs when data is scarce & predict performance ("scaling laws"). First to train LLMs across 1000s of AMD GPUs.
Watch
TL;DR: The standard for evaluating image/audio/text embeddings. Used by OpenAI, Google & Meta, with 18M+ downloads.
Watch
TL;DR: State-of-the-art fully open sparse language models. Many training ablations & analysis on routing behavior.
Watch
TL;DR: Explores how LLMs generalize across languages & releases open LLMs that were state-of-the-art at the time.
Watch
TL;DR: The first LLM that yields state-of-the-art performance on both generative & embedding tasks. Can speed up RAG by >60%.
Watch
TL;DR: Strong code LLMs trained on the largest dataset of Git commits (OctoCoder & CommitPack). Also built HumanEvalPack for evaluation.
Blog
TL;DR: Built Vision-Language Models that finished 2nd/3300+ in Meta's $100K Hateful Memes Competition.
Paper
TL;DR: The first work to use LLMs for embedding, concurrently with OpenAI. I wrote the paper poorly, but all top embedders now use LLMs.
Contact
Questions on papers I’ve co-authored: GitHub issues on the relevant code repository are usually the best place :)
Starting AI Research: If you want to get started in research, I recommend contributing to MTEB. We’re a community building the go-to place for everything embeddings, with 400K monthly users on our leaderboard & regular publications you can co-author! Example papers from our community: MMTEB, MIEB, MAEB, HUME, SEB.
My email is n.muennighoff@gmail.com :)
Other
Health: I'm pretty into health optimization; my fav sports are swimming, beach volleyball & tennis :)
Languages: I've worked in Chinese, Japanese, English, German & French. I also took extensive AI coursework in Chinese at Peking University & passed their Chinese placement test with 100/100.
Arts: As a kid I worked as a voice-over artist for 8 years dubbing German voices for Peter Pan (Disney), Pokemon, Game of Thrones (HBO), Dracula (NBC) & others (sample: Gortimer here/here & Victor here) 🎬