Mukur Gupta
Be yourself, everyone else is already taken.
Denoising & Diffusing Intelligence @ Granica
I am a researcher at Granica AI, where I work on foundation models for structured data, focusing on generative modeling for discrete and mixed-type tabular domains. My recent work studies the interplay between continuous and discrete diffusion models on challenging combinatorial generation problems, and explores how diffusion-based modeling can support broad tabular foundation models for a range of generative and predictive tasks. I am especially interested in this area because tabular data remains central to enterprise use cases, yet generative modeling for it is still far less explored than for language, vision, or speech.
I completed my M.S. in Computer Science at Columbia University through the Advanced Master's in Research (AMR) program, where I was advised by Prof. Kathleen McKeown and funded through my work as a Graduate Research Assistant in the NLP Lab. Before that, I earned my B.Tech. from IIT Kharagpur, graduating first in my department by GPA.
More broadly, my work spans diffusion models, multimodal video understanding, code generation, and recommendation systems. At Columbia, I contributed to a DARPA-funded project on complex video understanding for cross-cultural dialogue assistance in low-resource multilingual settings. I spent a summer at Apple, where I worked on multimodal diagram reasoning and diagram question answering. In LLM-based code generation, my research on the interpretability of code models was selected for an oral presentation at NAACL, and my work on adversarial attacks against code models led to a security fix in a major coding assistant. Before graduate school, I was an Associate Data Scientist at Gartner, where I built recommendation and ranking systems for product recommendation, user-interest modeling, retrieval, and research question answering, and was recognized through multiple internal excellence award nominations.
Education
M.S. in Computer Science (2023 - May 2025)
Columbia University, New York, NY
Advanced Master's in Research (AMR) Track advised by Prof. Kathleen McKeown
Fully funded under MS-GRA (awarded to fewer than 1% of students in the Engineering School)
B.Tech. (Hons.) in Mechanical Engineering (2017 - 2021)
Indian Institute of Technology (IIT) Kharagpur
Graduated as Department Rank 1 (highest GPA in the stream)
Bachelor thesis advised by Dr. Pawan Goyal
Research
Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems
Under Review, 2025
Studied generative models for uniformly sampling solutions of random satisfiability formulas and compared them with theoretically optimal methods. Showed that continuous diffusions with score matching outperform masked discrete diffusions, that learned diffusions can achieve theoretically optimal accuracy, and that noise-reweighting improves denoising over random noising.
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
Under Review, 2025
Proposed SemTrace, a novel technique for measuring semantic code recall in LLMs, revealing substantial drops in code reasoning accuracy as relevant snippets move toward the middle of the input context. Identified a disconnect between lexical and semantic recall, suggesting that existing code reasoning benchmarks may underestimate the difficulty of leveraging in-context information.
Links: paper
XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
Under Review, 2025
Introduced Cross-Origin Context Poisoning (XOXO), a novel attack that uses semantically equivalent adversarial code modifications to compromise AI coding assistants; this work led to a security fix in a major AI coding assistant. Developed the GCGS algorithm, achieving an 83.09% attack success rate across 11 models, including GPT-4o and Claude 3.5 Sonnet, and demonstrated the ineffectiveness of existing defenses.
Links: paper
AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization
NewSumm @ EMNLP, 2025
Introduced AdvSumm, a domain-agnostic adversarial training framework for mitigating bias in text summarization through gradient-guided perturbations at the embedding level. Demonstrated reductions in name-nationality and political framing biases without sacrificing summarization quality, outperforming standard transformer baselines and data augmentation methods.
Links: paper
CodeSCM: Causal Analysis for Multi-Modal Code Generation
NAACL, 2025 (Oral); CALM @ NeurIPS, 2024 (Poster)
Proposed CodeSCM, a structural causal model for analyzing multimodal code generation in LLMs through causal interventions and mediation analysis. Showed that input-output examples significantly influence generation alongside natural language instructions, and quantified spurious model behavior through latent mediator variables.
Links: paper
MQAdapt: Layerwise Adaptive Multi-Query Attention for Efficient LLM Inference
2025
Proposed MQAdapt, a training-free inference-time optimization method for efficient LLM deployment using Multi-Query Attention. Showed that selectively applying MQA across transformer layers, especially in alternate-layer patterns, yields faster inference with smaller accuracy degradation than naive strategies.
TrajSyn: Privacy-Preserving Dataset Distillation from Federated Model Trajectories for Server-Side Adversarial Training
arXiv, 2026
Introduced TrajSyn, a privacy-preserving framework for synthesizing proxy datasets from federated client model trajectories to enable server-side adversarial training without accessing raw client data. Demonstrated improved adversarial robustness on image classification benchmarks with no additional compute burden on client devices.
Links: paper
Intent Detection and Entity Extraction from Biomedical Literature
CL4Health @ LREC-COLING, 2024 (Oral)
Conducted a comprehensive empirical study showing that supervised fine-tuned approaches remain more effective than general-purpose LLMs for biomedical NLP tasks. Demonstrated that PubMedBERT can outperform ChatGPT on NER with as few as five supervised examples, highlighting the value of domain-specific models in specialized settings.
Avenues in IoT with Advances in Artificial Intelligence
SPT-IoT @ IEEE PerCom, 2024 (Oral)
Surveyed current challenges in the Internet of Things and examined the transformative role of AI, NLP, and machine learning in enabling domain-specific IoT solutions. Discussed future directions for deeper integration of intelligent systems into IoT applications and digital interactions.
Links: paper
Curriculum Generation using Autoencoder-Based Continuous Optimization
arXiv Preprint, 2021
Presented Training Sequence Optimization (TSO), a curriculum learning approach that uses autoencoders and continuous optimization to learn effective training data orderings. Achieved a 2 AP improvement over random ordering on CIFAR-10 and CIFAR-100, outperforming prior curriculum learning methods.
Links: paper
Experience
Research Scientist — Granica AI, CA
June 2025 - Present
Developing tabular foundation model on terabyte-scale datasets using diffusion modeling to learn a unified generative model for tabular tasks. Derived novel insights into diffusion generative modeling, showing that continuous diffusion with score matching outperforms masked discrete diffusion on a computationally hard sampling problem; and introduced a noise-reweighting denoising strategy that outperforms random noising. Applied Granica's ICLR award-winning data selection research (https://www.granica.ai/research) to recommendation systems of a social media giant with 400 million monthly active users, driving significant gains in Click-Through Rate. -
Graduate Research Assistant — NLP Lab, Columbia University, NY
December 2023 - May 2025
Developed multimodal video understanding systems at Columbia NLP Lab, with a focus on future activity prediction, long-video reasoning, and synthetic data generation for instruction tuning. Pretrained LLaVA-OneVision on 2M synthetic video instruction-tuning pairs for a novel future activity prediction objective, then instruction-finetuned it on a subset of the LLaVA-NeXT corpus, improving abductive and defeasible video reasoning by up to 3% over the LLaVA video baseline. Built a 6,000-hour synthetic video instruction dataset using Molmo-7B and optimized large-scale training data generation on shared NVIDIA A100 40GB GPUs through efficient frame sampling, memory-aware data loading, and slow-fast sampling for long-video processing. Contributed this work within a DARPA-funded project on cross-cultural dialogue assistance in low-resource multilingual settings, building multimodal methods for changepoint detection from video, speech, transcripts, and social-norm signals. -
Applied Scientist Intern — Apple, CA
June 2024 - August 2024
Built multimodal reasoning systems at Apple for diagram understanding by fine-tuning a LLaVA architecture on a synthetic diagram-to-LaTeX generation task. Developed an RCA system for planning the annual mega iPhone launches by framing the problem as natural-language-to-SQL generation, with systems adopted in real planning use cases. -
Associate Data Scientist — Gartner, India
August 2021 - June 2023
Worked on Gartner's Personalized Insights and Recommendations team, building recommendation engines and ranking systems for product discovery and user personalization across Gartner services. Designed and improved key components across the recommendation stack, including candidate retrieval, peer-based recommendation, user-interest modeling, semantic matching, and list-wise learning-to-rank; co-invented a patent-pending product recommendation and value-tracking system; and delivered measurable gains in recall, NDCG, and MAP through DeepFM, XGBoost, VAE-based topic modeling, and interaction-driven ranking models.
Machine Learning Intern — Hike, India
May 2020 - August 2020
Designed a novel autoencoder-based curriculum learning algorithm with LSTMs, leading to faster model convergence.
Data Science Intern — Manthan (Algonomy), IN
May 2019 - July 2019
Built an intent prediction model using machine learning on clickstream data from a global pizza chain's mobile app, predicting the next best time a customer is likely to open the app to order a pizza. Communications (push notifications, SMS, emails) were then aligned to that time to boost open rates and reduce the risk of spamming users.