I am dedicated to cutting-edge research on scalable reinforcement learning (RL) and agentic alignment from AI to AGI, aiming to bridge the gap between specialized artificial intelligence and general-purpose autonomous systems through ethical, scalable, and adaptive frameworks.

I am advancing research on scalable reinforcement learning methods and agentic alignment techniques for large language model (LLM) foundation models, aiming to enhance their complex reasoning capabilities. My work focuses on developing algorithms and frameworks that leverage high-quality data, including R1/O1-related scalable RL alignment algorithms, post-training methods such as reinforcement learning with diverse feedback (RLXF) and supervised fine-tuning (SFT), and broader AI alignment strategies. I am also actively involved in research on multimodal interaction and have a keen interest in controllable AI-generated content (AIGC).

In my prior work, I have made valuable contributions to reinforcement learning and multi-agent systems, particularly through the development of reward tuning, off-policy and on-policy RL algorithms, and evaluation frameworks, as well as algorithms for cooperative and competitive multi-agent learning. Furthermore, my research, which integrates preference learning, has been widely applied to practical domains such as ranking, pricing, marketing, and recommendation systems.

My current research areas, grouped by primary focus, collaborative engagement, and curiosity-driven interest, are as follows:

Scalable RL Reasoning

  1. R1/O1-related scalable RL reasoning algorithms and frameworks

LLM Post-Training, such as RLHF, SFT

  1. RLHF, RLAIF, RLXF
  2. Reward Modeling
  • Scaling Laws of Reward Modeling
  • Reward Overoptimization / Reward Hacking (such as length hacking)

LLM Pretraining

  1. GPT Pretraining
  2. MoE Pretraining (collaborative engagement)

RL, Multi-Agent Learning Algorithm and Framework

  1. Reward Modeling
  • Reward shaping or tuning: Behavior Cloning/Inverse RL/Meta Learning/Imitation Learning
  • Reward distribution: delayed rewards, sparse rewards, noisy or biased rewards, misalignment, distribution shift
  2. Off-policy and on-policy RL algorithms and frameworks
  3. Multi-Task & Meta-Learning
  4. Cooperative and competitive multi-agent learning algorithms and frameworks

Reinforcement Preference Learning

  1. Ranking
  2. Pricing
  3. Marketing
  4. Recommendation algorithm and system

Other Areas

Areas of curiosity-driven interest and collaborative engagement

  1. AI alignment / Foundation model decision
  • Multimodal alignment through RLXF
  2. Agent Foundation Model/Scalable Agentic Alignment
  3. Multimodal RL
  • Multimodal Interaction

Areas of curiosity-driven interest

  1. Controllable AIGC
  • Diffusion Models
  2. Aero/Embodied Agents/Robots

My main past research projects are as follows:

Research Experience

Ant Group and Peking University Frontier Computing Center | Research Leader

Co-Advisor: Prof. Xiaotie Deng, August 2023 - Present

Topic 1: Scaling-law optimization for large models trained with reinforcement learning from feedback

  • Led research on scaling-law optimization for large models trained with reinforcement learning from feedback

Topic 2: Alignment of reinforcement learning large models with multiple constraints

  • Developed alignment methodologies for reinforcement learning large models under multiple constraint conditions
  • Published paper: ‘Hummer: Towards Limited Competitive Preference Dataset’ in COLM’24

Topic 3: Research on innovative algorithms for alignment of reinforcement learning feedback models and cross-modal content generation based on intelligent generation algorithm RLAIF

  • Conducted research on alignment algorithms for cross-modal content generation based on the RLAIF (Reinforcement Learning from AI Feedback) framework

R & D Skills: Bailing (Ant Group’s Pretraining and Post-training Alignment Framework: ATorch, Ling, AReaL, etc.)/DeepSpeed/Python/Java/AIStudio (Ant Group’s AI Platform)

Ant Group and Damo Academy, Alibaba Group | Research Leader

Supervisor: Dr. James Zhang, May 2019 - December 2022

Topic 1: Digital Human Interactive Recommendation Decision-Making Based on Reinforcement Learning

  • Proposed a novel and practical digital human recommendation agent framework based on reinforcement learning to improve the efficiency of decision-making by leveraging both the digital human features and the superior flexibility of reinforcement learning
  • Evaluated the proposed framework in the context of live-streaming with real-world business data and showed that it provides better personalized customer engagement and a better customer experience
  • Published paper: ‘Digital Human Interactive Recommendation Decision-Making Based on Reinforcement Learning’ in NeurIPS’22 workshop on Human in the Loop Learning

Topic 2: Sample Efficiency and Off-policy Evaluation of Model-based Reinforcement Learning

  • Proposed a model-embedding model-based reinforcement learning algorithm in the framework of probabilistic reinforcement learning
  • Evaluated the algorithm on several benchmarks and achieved state-of-the-art performance
  • Applied the algorithm in various personalized, large-scale dynamic contexts and achieved substantial improvements over classic reinforcement learning baseline models
  • Published paper: ‘Model-based Off-policy Deep Reinforcement Learning with Model-embedding’ on arXiv

R & D Skills: Ray/TensorFlow/PyTorch/Python/Java/AIStudio (Ant Group’s AI Platform)/Dataphin (Ant Group’s Cloud Service Platform)

Ant Group and Hong Kong University of Science and Technology | Research Leader

Advisor: Associate Prof. Yangqiu Song, June 2020 - May 2021

Topic: Representation Learning of Data with Hierarchical Structures Through Hyperbolic Embedding

  • Formulated this problem in the complex hyperbolic space to address the limitation of hyperbolic embeddings
  • Proposed a learning algorithm to learn the embeddings of hierarchically structured data in the unit ball model of the complex hyperbolic space
  • Evaluated the algorithm on synthetic and real-world data and showed that the approach significantly improves over hyperbolic embedding models; also explored the competence of complex hyperbolic geometry on the multitree structure and 1-N structure
  • Published paper: ‘Unit Ball Model for Embedding Hierarchical Structures in the Complex Hyperbolic Space’ on arXiv

R & D Skills: TensorFlow/Python/Dataphin (Ant Group’s Cloud Service Platform)

Ant Group and University of California, Berkeley | Research Leader

Supervisor and Co-Advisors: Prof. Le Song, Yuan Qi, and Michael I. Jordan, November 2018 - April 2020

Topic 1: Cooperative Policy Learning Through Multi-Agent Collaboration

  • Formulated this problem to improve the efficiency of the company's resource management, such as asset-liability management and cloud resource scheduling, and proposed two policy learning algorithms:
    1. Value Propagation, a fully distributed collaborative policy learning algorithm based on multi-agent reinforcement learning and graph propagation, in which agents learn to coordinate to achieve joint success; proved that the algorithm converges at rate O(1/T) with non-linear deep neural network function approximation
    2. Variational policy propagation (VPP), a collaborative multi-agent reinforcement learning algorithm to learn a joint policy through actions over agents; evaluated the proposed algorithm on several large-scale challenging tasks and demonstrated that it outperforms the previous state-of-the-art
  • Published papers:
    1. ‘Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning’ in NeurIPS’19
    2. ‘Intention propagation for multi-agent reinforcement learning’ on arXiv

Topic 2: Optimal Policy Learning for Personalized Marketing Models with Constrained Budgets

  • Formulated this problem with the method of domain adaptation and took into account the additional reward structure and budget constraints
  • Developed a novel two-step method for solving this constrained counterfactual policy optimization problem and established the theoretical error bounds for the estimation procedure
  • The proposed approach led to significant improvements on both synthetic and real business datasets and outperformed state-of-the-art methods
  • Published paper: ‘Cost-effective incentive allocation via structured counterfactual inference’ in AAAI’20

R & D Skills: TensorFlow/Python/Dataphin (Ant Group’s Cloud Service Platform)

Ant Group and Peking University | Research Leader

Co-Advisors: Prof. Xiaotie Deng and Yuan Qi, June 2017 - July 2019

Topic 1: Automated Mechanism Design for Internet Mobile Payment Market Share Competition

  • Formalized this problem as a stochastic game with imperfect and incomplete information and implemented it under the reinforcement learning framework
  • Proposed a novel LDA model to explore latent variables representing the preferences of the company’s customers and the strategies of competitors
  • The proposed algorithm and framework significantly improved the search for optimal decision-making strategies and thus greatly increased the company's user stickiness and market share
  • Published paper: ‘Latent Dirichlet Allocation for Internet Price War’ in AAAI’19

Topic 2: Uplift Modeling for Marketing Campaigns of Internet Consumer Financial Products

  • Formulated the uplift modeling problem as a Markov Decision Process and learned it through repeated interactions between customers and the recommendation agent
  • The proposed algorithm achieved state-of-the-art results under various metrics and significantly improved the accuracy of marketing campaigns
  • Published paper: ‘Reinforcement learning for uplift modeling’ on arXiv

R & D Skills: TensorFlow/Python/Dataphin (Ant Group’s Cloud Service Platform)

School of Computer Science and Engineering, Beihang University | Master

Advisor: Associate Prof. Huan Li, March 2009 - August 2010

Topic 1: Sensor Network Mathematical Modeling and Data Aggregation Techniques

Sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars (ROCS), State Education Ministry (SEM)

  • Conducted extended research on the ‘energy hole’ problem in the large-scale deployment of wireless sensor networks (more than 10,000 nodes)
  • Proposed a novel optimal unequal clustering protocol algorithm COCA (Constructing Optimal Clustering Architecture) and evaluated its performance
  • Published paper: ‘COCA: Constructing optimal clustering architecture to maximize sensor network lifetime’ in Computer Communications’13
  • Thesis: Research of Energy Efficiency Optimization Strategy with Data Aggregation for Wireless Sensor Networks, lecture, demo, paper

Topic 2: Structuralized Clustering Approaches for Wireless Sensor Networks

  • Conducted research on the problem of uneven energy consumption of nodes in large-scale multi-hop structured wireless sensor networks
  • Derived a mathematical model for the optimal clustering algorithm and implemented some of the routing algorithms of the proposed protocol
  • Published paper: ‘Constructing optimal clustering architecture for maximizing lifetime in large scale wireless sensor networks’ in ICPADS’09

Topic 3: Modelling Threshold Secret Sharing Schemes in Ad Hoc Networks

Cooperated with Associate Professor Weifeng Chen, California University of Pennsylvania

  • Implemented a simulation of ad hoc network secret sharing schemes using the RandomWalk2dMobility movement model, the AODV routing algorithm, and queuing theory, and then evaluated the performance

Topic 4: Cyber Physical System

Sponsored by the Graduate Student Innovation Fund of the University

  • Carried out extensive research on the development of the cyber-physical system

R & D Skills: MATLAB/Maple/Mathematica; Cygwin/NS-2/NS-3/WAF/TCL; C++/C/Python/GCC/G++/GDB/Shell/Vim

Future Network Center of City University of Hong Kong | Research Assistant

Advisor: Associate Prof. Huan Li, April 2008 - September 2008

Topic: WiFi IPTV & VoIP

Focused on video input decoding and encoding, conference video synthesis, etc.

  • Carried out surveys and implemented video codec algorithms, such as video input decoding, H.264-like scalable video coding (SVC) encoding, conference video synthesis, etc.
  • Implemented:
    1. Half-pixel difference and sub-sampling difference algorithms to realize image data transmission
    2. Diamond search and cross-pattern fast search algorithms to realize image motion estimation
    3. Binary floating-point to fixed-point conversion algorithm to realize macroblock transformation rate control

R & D Skills: C/XVID/Windows

College of Information and Intelligence, Hunan Agricultural University | Undergraduate

Advisor: Prof. Tiejun Zhou, January 2007 - May 2007

Topic: Computation and Performance Evaluation of Bidirectional Associative Memory Neural Networks Based on Time-Delay Differential Equations

  • Numerically solved the time-delay differential equations of the bidirectional associative memory neural network model using methods for ordinary differential equations (such as the Euler method, improved Euler method, Runge-Kutta method, semi-discrete method, etc.)
  • Evaluated the performances, the equilibrium points, and the existence and stability of periodic solutions of these time-delay differential equations through visualization
  • Thesis: Solving time-delay differential equation in several numerical ways with computer, lecture, paper

R & D Skills: MATLAB