CV

Research Statement

  • My research spans the full foundation model development lifecycle — pretraining, mid-training, post-training, and reinforcement learning — with a focus on Data-Centric AI, Reasoning, and Mixture-of-Experts architectures.

    Specifically, I focus on problems such as Pretraining and Post-Training of LLMs, Mixture-of-Experts (MoE) models, Reinforcement Learning & Alignment, Data-Centric AI & Data Mixture Optimization, Synthetic Data Generation, Scaling Laws & Model Evaluation, LLM Reliability & Hallucination, Efficient LLM Decoding, and Agentic Systems & Tool Use.

    My long-term research vision is to make foundation models learn more effectively from data, feedback, and interaction, rather than relying solely on increasing model size or compute. I pursue this through scalable training methodologies, teacher-guided RL, and taxonomy-driven data curation, validated across large-scale foundation models.

Publication Venues

  • ACL
    2022 (4 Conference + 1 Workshop)
    2023 (2 Conference + 1 Workshop)
    2024 (3 Conference)
  • EMNLP
    2022 (2 Conference)
    2023 (1 Conference)
    2024 (1 Conference)
  • EACL
    2023 (1 Conference)
  • NAACL
    2022 (1 Workshop)
    2024 (1 Conference)
  • AAAI
    2022 (1 Workshop)
    2023 (1 Workshop)
  • AAMAS
    2023 (1 Conference)

Research Interests

  • Pretraining, Mid-Training & Post-Training of LLMs
    Reinforcement Learning & Alignment
    Data-Centric AI & Data Mixture Optimization
    Synthetic Data Generation
    Mixture-of-Experts (MoE)
    Scaling Laws & Model Evaluation
    LLM Reliability, Hallucination & Abstention
    Agentic Systems & Tool Use
    Efficient LLM Decoding & Inference
    Reasoning
    Retrieval-Augmented Inference
    LLM Defense Strategies
    Selective Prediction

Technical Skills

  • Languages
    Python, C++, SQL, Bash
  • ML / DL
    PyTorch, PyTorch Lightning, Hugging Face Transformers, TRL
  • Distributed Training
    Slime, Megatron-LM
  • Inference / Serving
    vLLM, SGLang
  • Data Processing
    PySpark, Hugging Face Datasets, Pandas
  • Experiment Tracking
    Weights & Biases, MLflow

Education

Work Experience

  • 2024 -
    Present
    Senior Applied Scientist
    Amazon — Palo Alto, California
    • Core contributor to three generations of large-scale Mixture-of-Experts foundation models trained on 40T+ tokens, ranging up to 800B+ total parameters.
    • Owned the data curation and quality assessment stack end-to-end (quality filtering, deduplication, topic modeling, taxonomy development, and data-mixture optimization), enabling models trained on the curated data to outperform those trained on top-tier public datasets such as Nemotron-CC, FineWeb, and DCLM on knowledge, reasoning, and coding benchmarks.
    • Designed a multi-dimensional taxonomy curation framework spanning 14 orthogonal document-quality dimensions; resulting filters recover high-value content from deprioritized web data tiers and surpass top-tier data on reasoning and coding benchmarks.
    • Led improvements across code, math, and multilingual data tracks — including code-specific quality classifiers and synthetic data generation for reasoning.
    • Developing reinforcement learning approaches for improving reasoning and learnability of foundation models, leveraging privileged-information and teacher-guided training signals.
    • Received organization-wide internal performance award for high-impact contributions to foundation-model training data quality and downstream model performance.
  • Summer
    2023
    NLP Research Intern
    Tencent AI
    • Detecting and Mitigating Hallucinations of Large Language Models
  • Summer
    2022
    Applied Scientist Intern
    Amazon Science
    • Web Question Answering Leveraging Information Retrieval for Alexa AI
  • 2018-19
    Software Engineer
    Microsoft
    • Contributed to the development of a machine-learning-driven chat recommendation system aimed at increasing user engagement with Microsoft Teams.
    • Collaborated with MSR researchers on 'Intelligent Feeds', a feature that finds relevant messages for users based on their prior activity and message text features.
  • Summer
    2017
    Research Intern
    Samsung R&D Institute
    • Orchestrated a 'context prediction' application incorporating features based on device events (e.g., app usage, location) and sensor data (e.g., proximity sensor).

Honors and Awards

  • Industry
  • 2025
    Amazon Internal Performance Award — for high-impact contributions to foundation-model training data quality and downstream model performance.
  • Academic
  • 2024
    Outstanding CS PhD Graduating Student Award for 2023-2024, Arizona State University
  • 2023
    Outstanding Reviewer for EACL’23 (Question Answering track)
  • 2023
    Outstanding Research Award, GPSA ASU, 2023
  • 2023
    SCAI Doctoral Fellowship, ASU, 2023
  • 2023, 2024
    ASU Jumpstart Research Grant, 2023 and 2024

Books

  • 2024
    Advances in Multimodal Information Retrieval and Generation
    • Springer, Synthesis Lectures on Computer Vision (SLCV).
    • Authors: Man Luo, Tejas Gokhale, Neeraj Varshney, Yezhou Yang, Chitta Baral.
    • A comprehensive treatment of Transformer-based multimodal retrieval, generation, and retrieval-augmented generation across vision and language.

Service

  • Area Chair
    ACL Rolling Review
  • Reviewer
    ACL, EMNLP, EACL (Outstanding Reviewer), Computational Linguistics Journal, COLM, AMLC, CVPR Workshop
  • Outreach
    Author of 20+ ML/NLP articles on Medium with 100K+ cumulative views; mentored several PhD interns and supported multiple co-authored publications.

Collaborators