Research

Mechanistic Analysis of Alignment Algorithms in Language Models

Researchers conducted a systematic analysis of six preference-optimization methods (PPO, DPO, SimPO, ORPO, GRPO, and KTO) to understand how they reshape language models' internal computations. The study found that different alignment objectives induce qualitatively distinct representational changes: some methods enhance feature separability while others degrade it. This indicates that achieving aligned behavior does not guarantee a uniform restructuring of the model's internals.

Read full story at cs.LG updates on arXiv.org

Research

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Researchers introduce Pythagoras-Prover, a compute-efficient family of Lean theorem provers that achieves strong perform...

Research

Can AI Agents Synthesize Scientific Conclusions?

Researchers created SciConBench, a benchmark testing AI agents' ability to synthesize scientific conclusions from multip...

Research

How memory tools can make AI models worse

New research suggests that AI memory systems intended to improve model performance can actually degrade capabilities and...