Absolute Zero: Reinforced Self-play Reasoning with Zero Data

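Andrew Zhao1, Yiran Wu3, Yang Yue1, Tong Wu2, Quentin Xu1, Yang Yue1, Matthieu Lin1, Shenzhi Wang1, Qingyun Wu3, Zilong Zheng2,✉️ and Gao Huang1,✉️

1 Tsinghua University 2 Beijing Institute for General Artificial Intelligence 3 Pennsylvania State University

✉️ Corresponding authors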

Andrew Zhao: zqc21@mails.tsinghua.edu.cn, Yiran Wu: yiran.wu@psu.edu, Zilong Zheng: zlzheng@bigai.ai, Gao Huang: gaohuang@tsinghua.edu.cn

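Andrew Zhao is currently looking for a full-time industry research scientist position starting in June 2026 (preferably in North America). Feel free to email andrewzhao112@gmail.com or zqc21@mails.tsinghua.edu.cn. His research interests mainly include RL for LLMs, reasoning, and RL agents. Yiran Wu (yiran.wu@psu.edu) is also looking for a full-time research scientist position with the same start date.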

Abstract & Motivation

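Reasoning models trained with reinforcement learning with verifiable rewards (RLVR) currently tend to rely on manually curated datasets, which raises scalability concerns and may limit AI's future growth beyond human-defined tasks. To address this, we propose the Absolute Zero paradigm, in which a single model autonomously proposes tasks optimized for its own learning and improves by solving them through self-play, entirely eliminating the need for external data. The approach uses feedback from a verifiable environment, which keeps learning grounded and prevents reward hacking.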
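The paragraph above does not say how a task is judged to be "optimized for learning". As a minimal sketch, assume the proposer is rewarded most for tasks the solver gets right only some of the time, and gets nothing for tasks that are always or never solved. The function names and the exact formula below are illustrative assumptions, not the authors' definitions.

```python
from statistics import mean


def solver_reward(is_correct: bool) -> float:
    """Binary verifiable reward: 1.0 if the environment accepts the answer."""
    return 1.0 if is_correct else 0.0


def proposer_reward(solver_outcomes: list[bool]) -> float:
    """Hypothetical learnability-style reward for a proposed task.

    Assumption: tasks that are trivially easy (always solved) or
    impossible (never solved) carry no learning signal, so they earn 0;
    otherwise harder-but-solvable tasks earn more.
    """
    if not solver_outcomes:
        return 0.0
    success_rate = mean(solver_reward(ok) for ok in solver_outcomes)
    if success_rate == 0.0 or success_rate == 1.0:
        return 0.0
    return 1.0 - success_rate
```

Under this assumed scheme, a task solved in one of four attempts scores higher than one solved in three of four, which pushes the proposer toward the edge of the solver's ability.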

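Our implementation, the Absolute Zero Reasoner (AZR), applies this paradigm to code-based reasoning. AZR uses a code executor to validate self-generated tasks (covering deduction, abduction, and induction) and to verify solutions, providing a reliable reward for training. Despite using zero human-curated data, AZR achieves state-of-the-art results on a range of coding and mathematical reasoning benchmarks, even outperforming models trained on large in-domain datasets. This demonstrates that sophisticated reasoning skills can potentially emerge purely through self-play, without domain-specific supervision.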
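To make the role of the code executor concrete, here is a minimal sketch of how the three task types could be checked by running code. It assumes a task is built around a program, an input, and an output: deduction predicts the output, abduction proposes an input that yields a given output, and induction writes a program that matches input/output examples. The helper names and the entry-point name `f` are hypothetical, and a real system would run untrusted code in a sandbox rather than with a bare `exec`.

```python
from typing import Any


def run_program(source: str, arg: Any, entry: str = "f") -> Any:
    """Execute `source` and call its entry function on `arg`.

    Illustration only: plain exec() is NOT safe for untrusted code.
    """
    namespace: dict[str, Any] = {}
    exec(source, namespace)
    return namespace[entry](arg)


def _passes(source: str, arg: Any, expected: Any) -> bool:
    """True if the program runs without error and returns `expected`."""
    try:
        return run_program(source, arg) == expected
    except Exception:
        return False  # crashing programs earn no reward


def check_deduction(source: str, arg: Any, predicted_output: Any) -> bool:
    """Deduction: given program and input, the solver predicts the output."""
    return _passes(source, arg, predicted_output)


def check_abduction(source: str, proposed_input: Any, target_output: Any) -> bool:
    """Abduction: given program and output, the solver proposes an input."""
    return _passes(source, proposed_input, target_output)


def check_induction(candidate_source: str, examples: list[tuple[Any, Any]]) -> bool:
    """Induction: the solver writes a program matching all I/O examples."""
    return all(_passes(candidate_source, i, o) for i, o in examples)


# A toy proposed program: sort a list in descending order.
PROGRAM = "def f(x):\n    return sorted(x)[::-1]\n"
```

For abduction, any input that reproduces the target output is accepted, so `[2, 3, 1]` and `[1, 2, 3]` both pass against the output `[3, 2, 1]`; the check compares outputs, not inputs.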

Out-of-Distribution General Reasoning Performance

Absolute Zero Reasoner Performance

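The Absolute Zero Reasoner achieves state-of-the-art performance with zero data. Without relying on any gold labels or human-defined queries, the Absolute Zero Reasoner trained with our proposed self-play approach demonstrates strong out-of-distribution reasoning capabilities, even outperforming models trained on tens of thousands of expert-labeled in-domain examples.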

Programs Proposed by AZR

All code samples proposed by AZR are first embedded with jina-embeddings-v2-base-code and then projected to 2D using UMAP.
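The embed-then-project pipeline described above might look roughly like the sketch below. It assumes the `transformers` and `umap-learn` packages are installed and that the embedding model is published on Hugging Face as `jinaai/jina-embeddings-v2-base-code` with an `encode` method loaded via `trust_remote_code=True`; the function names and the `seed` parameter are illustrative and not taken from the project.

```python
from typing import Any, Sequence

# Assumed Hugging Face model id for the embedding model named in the text.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-code"


def embed_programs(programs: Sequence[str]) -> Any:
    """Return one embedding vector per program (needs `transformers`)."""
    from transformers import AutoModel  # lazy import: heavy optional dependency

    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
    return model.encode(list(programs))


def project_to_2d(embeddings: Any, seed: int = 0) -> Any:
    """Reduce embeddings to two dimensions with UMAP (needs `umap-learn`)."""
    import umap  # lazy import: optional dependency

    reducer = umap.UMAP(n_components=2, random_state=seed)
    return reducer.fit_transform(embeddings)


def attach_points(programs: Sequence[str], coords: Sequence[Any]) -> list[dict]:
    """Pair each program with its (x, y) point, e.g. for a scatter plot."""
    if len(programs) != len(coords):
        raise ValueError("need exactly one 2D point per program")
    return [
        {"code": code, "x": float(x), "y": float(y)}
        for code, (x, y) in zip(programs, coords)
    ]
```

A typical call chain would be `attach_points(programs, project_to_2d(embed_programs(programs)))`. Fixing the UMAP seed keeps the layout reproducible between runs, at the cost of some parallelism.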

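The interactive view on the project page is available for four models: AZR Coder 7B, AZR Base 7B, AZR Coder 14B, and AZR Base 14B. For each point it shows the Python code, input, output, task information, and training step.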

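Spinning Hexagon Vibe Check

Prompt: Write a script that shows 10 balls bouncing inside a spinning hexagon. The balls should be affected by gravity and friction, and must bounce off the rotating walls realistically.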

AZR-Coder-14b (Ours)

Code


import pygame
import math
import random
# Initialize Pygame
pygame.init()
# Set up the display
width, height = 800, 600
screen = pygame.display.set_mode((width, height))
pygame.display.set_caption("Bouncing Balls in a Spinning Hexagon")
# Define colors
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
RED = (255, 0, 0)
# Define constants
FPS = 60
GRAVITY = 0.1
FRICTION = 0.99
BALL_RADIUS = 10
NUM_BALLS = 10
# Define the hexagon
hexagon_center = (width // 2, height // 2)
hexagon_radius = 200
hexagon_vertices = [
  (hexagon_center[0] + hexagon_radius * math.cos(2 * math.pi * i / 6),
    hexagon_center[1] + hexagon_radius * math.sin(2 * math.pi * i / 6))
  for i in range(6)
]
# Define the balls
balls = []
for _ in range(NUM_BALLS):
  x = random.randint(BALL_RADIUS, width - BALL_RADIUS)
  y = random.randint(BALL_RADIUS, height - BALL_RADIUS)
  vx = random.uniform(-5, 5)
  vy = random.uniform(-5, 5)
  balls.append((x, y, vx, vy, RED))
# Game loop
running = True
clock = pygame.time.Clock()
angle = 0
while running:
  # Handle events
  for event in pygame.event.get():
    if event.type == pygame.QUIT:
      running = False
  # Clear the screen
  screen.fill(WHITE)
  # Rotate the hexagon
  angle += 0.01
  rotated_vertices = [
    (hexagon_center[0] + (v[0] - hexagon_center[0]) * math.cos(angle) - (v[1] - hexagon_center[1]) * math.sin(angle),
      hexagon_center[1] + (v[0] - hexagon_center[0]) * math.sin(angle) + (v[1] - hexagon_center[1]) * math.cos(angle))
    for v in hexagon_vertices
  ]
  # Draw the hexagon
  pygame.draw.polygon(screen, BLACK, rotated_vertices)
  # Update the positions and velocities of the balls
  for i, (x, y, vx, vy, color) in enumerate(balls):
    # Apply gravity
    vy += GRAVITY
    # Apply friction
    vx *= FRICTION
    vy *= FRICTION
    # Update position
    x += vx
    y += vy
    # Check for collisions with the hexagon walls
    for j in range(6):
      v1 = rotated_vertices[j]
      v2 = rotated_vertices[(j + 1) % 6]
      dx = v2[0] - v1[0]
      dy = v2[1] - v1[1]
      dist = math.sqrt(dx**2 + dy**2)
      nx = dy / dist
      ny = -dx / dist
      px = x - v1[0]
      py = y - v1[1]
      proj = px * nx + py * ny
      if 0 <= proj <= dist:
        if px * ny - py * nx <= BALL_RADIUS:
          overlap = BALL_RADIUS - (px * nx + py * ny)
          x -= overlap * nx
          y -= overlap * ny
          vx = -vx * 0.9
          vy = -vy * 0.9
    # Check for collisions with the screen edges
    if x - BALL_RADIUS < 0 or x + BALL_RADIUS > width:
      vx = -vx
    if y - BALL_RADIUS < 0 or y + BALL_RADIUS > height:
      vy = -vy
    # Update the ball's position and velocity
    balls[i] = (x, y, vx, vy, color)
    # Draw the ball
    pygame.draw.circle(screen, color, (int(x), int(y)), BALL_RADIUS)
  # Update the display
  pygame.display.flip()
  clock.tick(FPS)
# Quit Pygame
pygame.quit()
              

Configuration


{
  "parameters": {
    "temperature": 0.6,
    "top_p": 0.8
  },
  "generation_config": {
    "do_sample": true
  }
}
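For readers unfamiliar with these settings, the sketch below illustrates what `temperature` and `top_p` typically do when `do_sample` is enabled: logits are divided by the temperature before the softmax, and sampling is then restricted to the smallest set of tokens whose cumulative probability reaches `top_p`. This is a generic, pure-Python illustration of temperature and nucleus sampling over a toy vocabulary, not the inference code used for AZR; the function names are hypothetical.

```python
import math
import random
from typing import Optional


def nucleus(logits: dict[str, float], temperature: float, top_p: float) -> list[tuple[str, float]]:
    """Return the (token, probability) pairs kept by top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


def sample_token(
    logits: dict[str, float],
    temperature: float = 0.6,
    top_p: float = 0.8,
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one token from the temperature-scaled, top-p-filtered distribution."""
    rng = rng or random.Random()
    tokens, probs = zip(*nucleus(logits, temperature, top_p))
    return rng.choices(tokens, weights=probs, k=1)[0]
```

With toy logits `{"a": 2.0, "b": 1.0, "c": -1.0}`, a temperature of 0.6 sharpens the distribution enough that only `"a"` survives the 0.8 cutoff, whereas a temperature of 1.0 keeps both `"a"` and `"b"`.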
              

GPT-4o-mini

Code


    
import pygame
import random
import math
# Initialize Pygame
pygame.init()
# Screen settings
WIDTH, HEIGHT = 800, 600
screen = pygame.display.set_mode((WIDTH, HEIGHT))
pygame.display.set_caption('Bouncing Balls in Spinning Hexagon')
# Colors
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
BALL_COLOR = (255, 0, 0)
HEX_COLOR = (0, 255, 0)
# Constants
GRAVITY = 0.1 # Gravity affecting the balls
FRICTION = 0.99 # Friction affecting the ball's speed when not bouncing
BALL_RADIUS = 10 # Radius of the balls
NUM_BALLS = 10 # Number of balls
# Hexagon settings
HEX_SIZE = 250 # Hexagon radius
HEX_CENTER = (WIDTH // 2, HEIGHT // 2)
# Ball class
class Ball:
  def __init__(self, x, y, radius):
    self.x = x
    self.y = y
    self.radius = radius
    self.vx = random.uniform(-2, 2) # Initial x velocity
    self.vy = random.uniform(-2, 2) # Initial y velocity
  def move(self):
    self.x += self.vx
    self.y += self.vy
    self.vy += GRAVITY # Apply gravity
    # Apply friction
    self.vx *= FRICTION
    self.vy *= FRICTION
  def bounce(self, normal):
    # Reflect the ball's velocity based on the surface normal (for realistic bouncing)
    dot = self.vx * normal[0] + self.vy * normal[1]
    self.vx -= 2 * dot * normal[0]
    self.vy -= 2 * dot * normal[1]
# Function to draw the hexagon
def draw_hexagon(angle):
  points = []
  for i in range(6):
    x = HEX_CENTER[0] + HEX_SIZE * math.cos(math.radians(i * 60 + angle))
    y = HEX_CENTER[1] + HEX_SIZE * math.sin(math.radians(i * 60 + angle))
    points.append((x, y))
  pygame.draw.polygon(screen, HEX_COLOR, points, 5)
# Function to check if the ball is inside the hexagon
def check_collision_with_hexagon(ball):
  angle = pygame.time.get_ticks() / 10 # Time-dependent rotation angle
  # Hexagon vertices (rotated)
  hex_points = []
  for i in range(6):
    x = HEX_CENTER[0] + HEX_SIZE * math.cos(math.radians(i * 60 + angle))
    y = HEX_CENTER[1] + HEX_SIZE * math.sin(math.radians(i * 60 + angle))
    hex_points.append((x, y))
  for i in range(6):
    p1 = hex_points[i]
    p2 = hex_points[(i + 1) % 6]
    
    # Calculate the normal vector of the edge
    edge = (p2[0] - p1[0], p2[1] - p1[1])
    normal = (-edge[1], edge[0]) # Perpendicular to the edge
    # Normalize the normal
    length = math.sqrt(normal[0]**2 + normal[1]**2)
    normal = (normal[0] / length, normal[1] / length)
    # Check if the ball is outside the hexagon (distance from the center to the edge)
    ball_to_edge = (ball.x - p1[0], ball.y - p1[1])
    dist = ball_to_edge[0] * normal[0] + ball_to_edge[1] * normal[1]
    
    if dist < ball.radius:
      ball.bounce(normal)
      break
# Create a list of balls
balls = [Ball(random.randint(50, WIDTH-50), random.randint(50, HEIGHT-50), BALL_RADIUS) for _ in range(NUM_BALLS)]
# Main game loop
running = True
clock = pygame.time.Clock()
while running:
  screen.fill(BLACK) # Clear screen
  
  # Handle events
  for event in pygame.event.get():
    if event.type == pygame.QUIT:
      running = False
  # Draw the spinning hexagon
  angle = pygame.time.get_ticks() / 10 # Get a time-based rotation angle
  draw_hexagon(angle)
  # Update and draw the balls
  for ball in balls:
    ball.move()
    check_collision_with_hexagon(ball)
    
    # Draw ball
    pygame.draw.circle(screen, BALL_COLOR, (int(ball.x), int(ball.y)), ball.radius)
  # Refresh the display
  pygame.display.flip()
  # Cap the frame rate
  clock.tick(60)
# Quit Pygame
pygame.quit()
              

Configuration


{
  "query_date": "2025-05-03",
  "api_location": "https://chatgpt.com/"
}
              

Qwen2.5-72B-Instruct

Code


import pygame
import math
import random
# Initialize Pygame
pygame.init()
# Screen dimensions
WIDTH, HEIGHT = 800, 600
screen = pygame.display.set_mode((WIDTH, HEIGHT))
pygame.display.set_caption("Balls Bouncing in a Spinning Hexagon")
# Colors
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
RED = (255, 0, 0)
# Hexagon properties
hexagon_center = (WIDTH // 2, HEIGHT // 2)
hexagon_radius = 200
hexagon_sides = 6
hexagon_angle = 0 # Initial angle of the hexagon
hexagon_rotation_speed = 0.01 # Rotation speed in radians per frame
# Ball properties
num_balls = 10
balls = []
ball_radius = 10
gravity = 0.5
friction = 0.99
# Function to create a ball
def create_ball():
  x = random.randint(ball_radius, WIDTH - ball_radius)
  y = random.randint(ball_radius, HEIGHT - ball_radius)
  vx = random.uniform(-3, 3)
  vy = random.uniform(-3, 3)
  return [x, y, vx, vy]
# Create balls
for _ in range(num_balls):
  balls.append(create_ball())
# Function to draw the hexagon
def draw_hexagon(angle):
  points = []
  for i in range(hexagon_sides):
    angle_i = angle + i * (2 * math.pi / hexagon_sides)
    x = hexagon_center[0] + hexagon_radius * math.cos(angle_i)
    y = hexagon_center[1] + hexagon_radius * math.sin(angle_i)
    points.append((x, y))
  pygame.draw.polygon(screen, RED, points, 3)
# Function to check collision with the hexagon
def check_collision(ball, angle):
  x, y, vx, vy = ball
  for i in range(hexagon_sides):
    angle_i = angle + i * (2 * math.pi / hexagon_sides)
    angle_next = angle + (i + 1) * (2 * math.pi / hexagon_sides)
    x1 = hexagon_center[0] + hexagon_radius * math.cos(angle_i)
    y1 = hexagon_center[1] + hexagon_radius * math.sin(angle_i)
    x2 = hexagon_center[0] + hexagon_radius * math.cos(angle_next)
    y2 = hexagon_center[1] + hexagon_radius * math.sin(angle_next)
    
    # Check collision with the line segment (x1, y1) to (x2, y2)
    dx = x2 - x1
    dy = y2 - y1
    dot = dx * (x - x1) + dy * (y - y1)
    len_sq = dx * dx + dy * dy
    param = -1
    if len_sq != 0: # in case of a zero length line
      param = dot / len_sq
    
    if param < 0:
      nx = x1
      ny = y1
    elif param > 1:
      nx = x2
      ny = y2
    else:
      nx = x1 + param * dx
      ny = y1 + param * dy
    
    dx = x - nx
    dy = y - ny
    dist = math.sqrt(dx * dx + dy * dy)
    
    if dist < ball_radius:
      # Calculate the normal and tangent vectors
      nx = (x - nx) / dist
      ny = (y - ny) / dist
      tx = -ny
      ty = nx
      
      # Reflect the velocity
      dot = vx * nx + vy * ny
      vx = vx - 2 * dot * nx
      vy = vy - 2 * dot * ny
      
      # Move the ball away from the wall
      x += nx * (ball_radius - dist)
      y += ny * (ball_radius - dist)
      
      ball[0] = x
      ball[1] = y
      ball[2] = vx
      ball[3] = vy
# Main loop
running = True
clock = pygame.time.Clock()
while running:
  for event in pygame.event.get():
    if event.type == pygame.QUIT:
      running = False
  # Update ball positions and velocities
  for ball in balls:
    x, y, vx, vy = ball
    vy += gravity
    x += vx
    y += vy
    vx *= friction
    vy *= friction
    ball[0] = x
    ball[1] = y
    ball[2] = vx
    ball[3] = vy
    # Check collision with the hexagon
    check_collision(ball, hexagon_angle)
  # Update hexagon angle
  hexagon_angle += hexagon_rotation_speed
  # Draw everything
  screen.fill(BLACK)
  draw_hexagon(hexagon_angle)
  for ball in balls:
    pygame.draw.circle(screen, WHITE, (int(ball[0]), int(ball[1])), ball_radius)
  pygame.display.flip()
  clock.tick(60)
pygame.quit()
              

Configuration


{
  "api_location": "https://huggingface.co/Qwen/Qwen2.5-Coder-72B-Instruct"
}
              

Qwen2.5-32B-Instruct

Code


import pygame
import math
import random
# Initialize Pygame
pygame.init()
# Set up the display
screen_width, screen_height = 800, 600
win = pygame.display.set_mode((screen_width, screen_height))
pygame.display.set_caption("Bouncing Balls Inside Spinning Hexagon")
# Colors
WHITE = (255, 255, 255)
RED = (255, 0, 0)
# Constants for the simulation
FPS = 60
GRAVITY = 0.1
BALL_RADIUS = 10
FRIC_COEFFICIENT = 0.98 # To reduce velocity upon collision.
MAX_ANGLE_DEGREES = 360
DEG_TO_RAD = math.pi / 180 # Conversion factor between degrees and radians.
class Ball:
  def __init__(self):
    self.pos = pygame.Vector2(random.randint(BALL_RADIUS, screen_width-BALL_RADIUS),
                  random.randint(BALL_RADIUS, screen_height-BALL_RADIUS))
    self.vel = pygame.Vector2(random.uniform(-3, 3), random.uniform(-3, 3))
  def update(self):
    self.vel.y += GRAVITY
    self.pos += self.vel
    # Check for collisions with borders
    if self.pos.x - BALL_RADIUS <= 0 or self.pos.x + BALL_RADIUS >= screen_width:
      self.vel.x *= -FRIC_COEFFICIENT
      if self.pos.x - BALL_RADIUS <= 0:
        self.pos.x = BALL_RADIUS
      elif self.pos.x + BALL_RADIUS >= screen_width:
        self.pos.x = screen_width - BALL_RADIUS
    if self.pos.y - BALL_RADIUS <= 0 or self.pos.y + BALL_RADIUS >= screen_height:
      self.vel.y *= -FRIC_COEFFICIENT
      if self.pos.y - BALL_RADIUS <= 0:
        self.pos.y = BALL_RADIUS
      elif self.pos.y + BALL_RADIUS >= screen_height:
        self.pos.y = screen_height - BALL_RADIUS
  def draw(self, win):
    pygame.draw.circle(win, RED, (int(self.pos.x), int(self.pos.y)), BALL_RADIUS)
def get_hexagon_points(center, radius, angle_degrees):
  """Generate points of a hexagon."""
  points = []
  for i in range(6):
    theta = (math.pi * 2) / 6 * i + angle_degrees * DEG_TO_RAD
    x = center.x + radius * math.cos(theta)
    y = center.y + radius * math.sin(theta)
    points.append(pygame.Vector2(x, y))
  return points
def check_bounce(ball, poly_points):
  """Check if the ball has hit any of the edges and bounce off appropriately."""
  p1, p2, p3, p4, p5, p6 = poly_points
  for start_point, end_point in [(p1, p2), (p2, p3), (p3, p4), (p4, p5), (p5, p6), (p6, p1)]:
    line_vec = end_point - start_point
    normal_vec = pygame.Vector2(-line_vec.y, line_vec.x).normalize() # Normal vector to edge
    
    # Line equation parameters: Ax + By + C = 0
    A = normal_vec.y
    B = -normal_vec.x
    C = -(A * start_point.x + B * start_point.y)
    
    # Point-line-distance formula
    distance = A * ball.pos.x + B * ball.pos.y + C
    distance /= math.sqrt(A**2 + B**2)
    
    if abs(distance) < BALL_RADIUS:
      # Ball is hitting this edge.
      dist_vec = ball.pos - (start_point + end_point) / 2
      dot_product = dist_vec.dot(normal_vec)
      reflection_vec = ball.vel.reflect(normal_vec)
      
      if dot_product < 0 and reflection_vec.x * A + reflection_vec.y * B > 0:
        # Reflect and apply some slowdown using friction coefficient.
        ball.vel = reflection_vec * FRIC_COEFFICIENT
def redraw_window(win, balls, poly_points):
  win.fill(WHITE)
  pygame.draw.polygon(win, (0, 0, 255), [(p.x, p.y) for p in poly_points])
  for ball in balls:
    ball.draw(win)
  
  pygame.display.update()
def main():
  run = True
  clock = pygame.time.Clock()
  balls = [Ball() for _ in range(10)]
  hexagon_center = pygame.Vector2(screen_width // 2, screen_height // 2)
  hexagon_radius = screen_height // 4
  angle_degrees = 0
  while run:
    dt = clock.tick(FPS) / 1000 # time elapsed since last frame in seconds
    for event in pygame.event.get():
      if event.type == pygame.QUIT:
        run = False
    
    # Spin the hexagon by increasing the angle
    angle_degrees = (angle_degrees + 1 * dt * MAX_ANGLE_DEGREES) % MAX_ANGLE_DEGREES
    poly_points = get_hexagon_points(hexagon_center, hexagon_radius, angle_degrees)
    # Update each ball and check for collision with hexagon walls.
    for ball in balls:
      ball.update()
      check_bounce(ball, poly_points)
    redraw_window(win, balls, poly_points)
  pygame.quit()
if __name__ == "__main__":
  main()
              

Configuration


{
  "api_location": "https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct"
}
              

Qwen2.5-14B-Instruct

Code


import pygame
import sys
import math
import random
# Initialize Pygame
pygame.init()
# Screen dimensions
SCREEN_WIDTH = 800
SCREEN_HEIGHT = 600
screen = pygame.display.set_mode((SCREEN_WIDTH, SCREEN_HEIGHT))
pygame.display.set_caption("Bouncing Balls in Spinning Hexagon")
# Colors
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
RED = (255, 0, 0)
# Ball properties
BALL_RADIUS = 10
BALL_COUNT = 10
balls = []
# Hexagon properties
HEX_SIDES = 6
HEX_RADIUS = 200
HEX_CENTER = (SCREEN_WIDTH // 2, SCREEN_HEIGHT // 2)
HEX_SPIN_SPEED = 0.01
# Gravity and friction
GRAVITY = 0.5
FRICTION = 0.99
def create_balls():
  for _ in range(BALL_COUNT):
    x = SCREEN_WIDTH // 2 + (random.randint(-100, 100))
    y = SCREEN_HEIGHT // 2 + (random.randint(-100, 100))
    vx = random.uniform(-3, 3)
    vy = random.uniform(-3, 3)
    balls.append([x, y, vx, vy])
def draw_hexagon(angle):
  points = []
  for i in range(HEX_SIDES):
    angle_rad = math.radians(i * 60 - angle)
    x = HEX_CENTER[0] + HEX_RADIUS * math.cos(angle_rad)
    y = HEX_CENTER[1] + HEX_RADIUS * math.sin(angle_rad)
    points.append((int(x), int(y)))
  return points
def check_collision(ball, points):
  px, py = ball[:2]
  for i in range(len(points)):
    j = (i + 1) % len(points)
    if is_point_in_triangle(px, py, points[i], points[j], HEX_CENTER):
      return True
  return False
def is_point_in_triangle(px, py, p1, p2, p3):
  def sign(p1, p2, p3):
    return (p1[0] - p3[0]) * (p2[1] - p3[1]) - (p2[0] - p3[0]) * (p1[1] - p3[1])
  
  b1 = sign((px, py), p1, p2) < 0
  b2 = sign((px, py), p2, p3) < 0
  b3 = sign((px, py), p3, p1) < 0
  
  retu