I am committed to pioneering research in large reasoning models (LRMs) and agentic alignment, exploring the progression from AI through AGI toward ASI. My work aims to bridge specialized intelligence with general-purpose, self-evolving agentic systems by developing mathematically grounded ethical frameworks and designing efficient, adaptive superalignment methodologies.

I am advancing research on LRMs and agentic alignment techniques through scalable reinforcement learning methods, aiming to enhance their complex reasoning capabilities. My work focuses on developing advanced algorithms and frameworks that leverage high-quality long chain-of-thought (CoT) data, including R1/O1-related scalable RL alignment algorithms, post-training methods such as reinforcement learning with diverse feedback (RLXF), and broader AI alignment strategies. Additionally, I am actively involved in research on multimodal interaction and have a keen interest in controllable AI-generated content (AIGC).

In my prior work, I have made valuable contributions to reinforcement learning and multi-agent systems, particularly through the development of reward tuning, off-policy and on-policy RL algorithms, and evaluation frameworks, as well as algorithms for cooperative and competitive multi-agent learning. Furthermore, my research, which integrates preference learning, has been widely applied to practical domains such as ranking, pricing, marketing, and recommendation systems.

My current research areas, grouped by primary focus, collaborative engagement, and curiosity-driven interest, are as follows:

Large Reasoning Models

Scalable RL algorithms, frameworks, and systems for R1/O1-related foundation models
Scalable and efficient reasoning mechanisms
- Efficient Reward Modeling: ORM, PRM, GRM, RM-free approaches, etc.
- Effective and efficient chain-of-thought (CoT)
- Long (context, CoT)-to-short/essential: minimize overthinking and intermediate hacking; keep reasoning free from ambiguities, biases, and errors
- Scalable oversight
Enhancing Large Reasoning Models: from specialized STEM proficiency to general reasoning capabilities, including logic puzzles, strategic games, and other advanced paradigms
General Verifiers
- General verifiers: ensuring accuracy, consistency, and alignment with intended objectives
- Effective and efficient step-aware verifiers

Agentic Alignment

LLM Agent Foundation Model
- Agentic Reward Modeling
- Agent reasoning and acting
- Agentic simulation
Scalable Agentic Alignment
- Multi-Agent, Multi-User, Multi-Task, Multi-Session/Turn (collaborative engagement)

LLM Post-Training, such as RLHF, SFT

RLHF, RLAIF, RLXF
General Reward Modeling
- Scaling laws of Reward Modeling
- Reward Overoptimization / Reward Hacking (such as length hacking)

LLM Pretraining

GPT Pretraining
MoE Pretraining (collaborative engagement)

RL, Multi-Agent Learning Algorithm and Framework

Reward Modeling
- Reward shaping or tuning: Behavior Cloning/Inverse RL/Meta Learning/Imitation Learning
- Reward distribution: delayed rewards, sparse rewards, noisy or biased rewards, misalignment, distribution shift
Off-policy and on-policy RL algorithms and frameworks
Multi-Task & Meta-Learning
Cooperative and competitive multi-agent learning algorithms and frameworks

Reinforcement Preference Learning

Ranking
Pricing
Marketing
Recommendation algorithms and systems

Other Areas

Areas of curiosity-driven interest and collaborative engagement

AI alignment / Foundation model decision
- Multimodal alignment through RLXF
Multimodal RL
- Multimodal Interaction

Areas of curiosity-driven interest

Controllable AIGC
- Diffusion Models
Aero/Embodied Agents/Robots
Continual RL

My main past research projects are as follows:

Research Experience

Ant Group and Peking University Frontier Computing Center | Research Leader

Co-Advisors: Prof. Xiaotie Deng, August 2023 - Present

Topic 1: Scaling-law optimization for large models trained with reinforcement learning from feedback

Led research on scaling-law optimization for large models trained with reinforcement learning from feedback

Topic 2: Alignment of large models via reinforcement learning under multiple constraints

Developed alignment methodologies for large models trained with reinforcement learning under multiple constraints
Published paper: ‘Hummer: Towards Limited Competitive Preference Dataset’ in COLM’24

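Topic 3: Research on innovative algorithms for alignment of reinforcement learning feedback models and cross-modal content generation based on the intelligent generation algorithm RLAIF

Conducted innovative research on alignment algorithms for cross-modal content generation based on intelligent RLAIF (Reinforcement Learning from AI Feedback) frameworks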

R & D Skills: Bailing (Ant Group’s pretraining and post-training alignment framework: ATorch, Ling, AReaL, etc.)/DeepSpeed/Python/Java/AIStudio (Ant Group’s AI Platform)

Ant Group and Damo Academy, Alibaba Group | Research Leader

Supervisor: Dr. James Zhang, May 2019 - December 2022

Topic 1: Digital Human Interactive Recommendation Decision-Making Based on Reinforcement Learning

Proposed a novel and practical digital human recommendation agent framework based on reinforcement learning to improve the efficiency of decision-making by leveraging both the digital human features and the superior flexibility of reinforcement learning
Evaluated the proposed framework in a live-streaming setting with real-world business data and showed that it provides better personalized customer engagement and a better customer experience
Published paper: ‘Digital Human Interactive Recommendation Decision-Making Based on Reinforcement Learning’ in NeurIPS’22 workshop on Human in the Loop Learning

Topic 2: Sample Efficiency and Off-policy Evaluation of Model-based Reinforcement Learning

Proposed a model-embedding model-based reinforcement learning algorithm in the framework of probabilistic reinforcement learning
Evaluated the algorithm on several benchmarks and achieved state-of-the-art performance
Applied the algorithm in various personalized, large-scale, dynamic contexts and improved substantially over classic reinforcement learning baseline models
Published paper: ‘Model-based Off-policy Deep Reinforcement Learning with Model-embedding’ on arXiv

R & D Skills: Ray/TensorFlow/PyTorch/Python/Java/AIStudio (Ant Group’s AI Platform)/Dataphin (Ant Group’s Cloud Service Platform)

Ant Group and Hong Kong University of Science and Technology | Research Leader

Advisor: Associate Prof. Yangqiu Song, June 2020 - May 2021

Topic: Representation Learning of Data with Hierarchical Structures Through Hyperbolic Embedding

Formulated this problem in the complex hyperbolic space to address the limitations of hyperbolic embeddings
Proposed a learning algorithm to learn the embeddings of hierarchically structured data in the unit ball model of the complex hyperbolic space
Evaluated the algorithm on synthetic and real-world data and showed that the approach improves significantly over hyperbolic embedding models; also explored the capacity of complex hyperbolic geometry on multitree and 1-N structures
Published paper: ‘Unit Ball Model for Embedding Hierarchical Structures in the Complex Hyperbolic Space’ on arXiv

R & D Skills: TensorFlow/Python/Dataphin (Ant Group’s Cloud Service Platform)

Ant Group and University of California, Berkeley | Research Leader

Supervisor and Co-Advisors: Prof. Le Song, Yuan Qi, and Michael I. Jordan, November 2018 - April 2020

Topic 1: Cooperative Policy Learning Through Multi-Agent Collaboration

Formulated this problem to improve the efficiency of the company’s resource management, such as asset-liability management and cloud resource scheduling, and proposed two policy learning algorithms:
- Value Propagation: a fully distributed collaborative policy based on multi-agent reinforcement learning and graph propagation, where agents learn to coordinate to achieve joint success; proved that the algorithm converges at rate O(1/T) with non-linear deep neural network function approximation
- Variational Policy Propagation (VPP): a collaborative multi-agent reinforcement learning algorithm that learns a joint policy over the agents’ actions; evaluated the algorithm on several challenging large-scale tasks and demonstrated that it outperforms the previous state of the art
Published papers:
- ‘Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning’ in NeurIPS’19
- ‘Intention propagation for multi-agent reinforcement learning’ on arXiv

Topic 2: Optimal Policy Learning for Personalized Marketing Models with Constrained Budgets

Formulated this problem using domain adaptation, taking into account the additional reward structure and budget constraints
Developed a novel two-step method for solving this constrained counterfactual policy optimization problem and established theoretical error bounds for the estimation procedure
The proposed approach led to significant improvements on both synthetic and real business datasets and outperformed state-of-the-art methods
Published paper: ‘Cost-effective incentive allocation via structured counterfactual inference’ in AAAI’20

R & D Skills: TensorFlow/Python/Dataphin (Ant Group’s Cloud Service Platform)

Ant Group and Peking University | Research Leader

Co-Advisors: Prof. Xiaotie Deng and Yuan Qi, June 2017 - July 2019

Topic 1: Automated Mechanism Design for Internet Mobile Payment Market Share Competition

Formalized this problem as a stochastic game with imperfect and incomplete information and implemented it under the reinforcement learning framework
Proposed a novel LDA model to explore latent variables representing the preferences of the company’s customers and the strategies of competitors
The proposed algorithm and framework significantly improved the search for optimal decision-making strategies, greatly increasing user stickiness and the company’s market share
Published paper: ‘Latent Dirichlet Allocation for Internet Price War’ in AAAI’19

Topic 2: Uplift Modeling for Marketing Campaigns of Internet Consumer Financial Products

Formulated the uplift modeling problem as a Markov Decision Process and learned it through repeated interactions between the customers and the recommendation agent
The proposed algorithm achieved state-of-the-art results under various metrics and significantly improved the accuracy of marketing campaigns
Published paper: ‘Reinforcement learning for uplift modeling’ on arXiv

R & D Skills: TensorFlow/Python/Dataphin (Ant Group’s Cloud Service Platform)

School of Computer Science and Engineering, Beihang University | Master

Advisor: Associate Prof. Huan Li, March 2009 - August 2010

Topic 1: Sensor Network Mathematical Modeling and Data Aggregation Techniques

Sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars (ROCS), State Education Ministry (SEM)

Conducted extended research on the ‘energy hole’ problem in the large-scale deployment of wireless sensor networks (more than 10,000 nodes)
Proposed a novel optimal unequal clustering protocol, COCA (Constructing Optimal Clustering Architecture), and evaluated its performance
Published paper: ‘COCA: Constructing optimal clustering architecture to maximize sensor network lifetime’ in Computer Communications’13
Thesis: ‘Research of Energy Efficiency Optimization Strategy with Data Aggregation for Wireless Sensor Networks’ (lecture, demo, paper)

Topic 2: Structuralized Clustering Approaches for Wireless Sensor Networks

Conducted research on the problem of uneven energy consumption of nodes in large-scale multi-hop structured wireless sensor networks
Derived a mathematical model for the optimal clustering algorithm and implemented some of the routing algorithms of the proposed protocol
Published paper: ‘Constructing optimal clustering architecture for maximizing lifetime in large scale wireless sensor networks’ in ICPADS’09

Topic 3: Modelling Threshold Secret Sharing Schemes in Ad Hoc Networks

In cooperation with Associate Professor Weifeng Chen, California University of Pennsylvania

Implemented a simulation of secret sharing schemes for ad hoc networks using the RandomWalk2dMobility movement model, the AODV routing algorithm, and queuing theory, and then evaluated the performance

Topic 4: Cyber Physical System

Sponsored by the Graduate Student Innovation Fund of the University

Carried out extensive research on the development of cyber-physical systems

R & D Skills: Matlab/Maple/Mathematica; Cygwin/NS-2/NS-3/WAF/TCL; C++/C/Python/GCC/G++/GDB/Shell/Vim

Future Network Center of Hong Kong City University | Research Assistant

Advisor: Associate Prof. Huan Li, April 2008 - September 2008

Topic: WiFi IPTV & VoIP

Focused on video input decoding and encoding, conference video synthesis, etc.

Carried out surveys and implemented video codec algorithms, such as video input decoding, H.264-like scalable video coding (SVC) encoding, conference video synthesis, etc.
Implemented:
- Half-pixel difference and sub-sampling difference algorithms for image data transmission
- Diamond search and cross-pattern fast search algorithms for image motion estimation
- A binary floating-point to fixed-point conversion algorithm for macroblock transformation rate control

R & D Skills: C/XVID/Windows

College of Information and Intelligence, Hunan Agricultural University | Undergraduate

Advisor: Prof. Tiejun Zhou, January 2007 - May 2007

Topic: Computation and Performance Evaluation of Bidirectional Associative Memory Neural Networks Based on Time-Delay Differential Equations

Solved the time-delay differential equations of the bidirectional associative memory neural network model using numerical methods for ordinary differential equations (such as the Euler method, improved Euler method, Runge-Kutta method, and semi-discrete method)
Evaluated the performance, the equilibrium points, and the existence and stability of periodic solutions of these time-delay differential equations through visualization
Thesis: ‘Solving time-delay differential equation in several numerical ways with computer’ (lecture, paper)

R & D Skills: Matlab

Xiong Jun Wu
