OpenAI o1

reward engineering

Why Test/Inference scaling law

First, knowledge distillation

Q1: Generally, a model’s performance is determined by its training phase, so if you say that pre-training or post-training can improve a model’s reasoning ability, I understand that. But how can the inference phase improve a model’s reasoning ability? What do you mean by using computational power during the inference phase?

Q2: Are post-training and inference-time scaling two separate methods for improving a model’s reasoning capabilities? Can they be used together?
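
To make Q1’s notion of “using computational power during the inference phase” concrete, here is a minimal sketch of one common form of test-time compute: sampling several answers and taking a majority vote (self-consistency-style voting). This is only an illustration, not o1’s actual mechanism; the `generate` function is a hypothetical placeholder for a stochastic model call.

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic sample from a language model."""
    # A real implementation would call a model API with temperature > 0.
    return random.choice(["42", "42", "42", "41", "40"])

def majority_vote(prompt: str, n_samples: int = 16) -> str:
    """Spend extra inference-time compute: sample n_samples answers
    and return the most common one."""
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    print(majority_vote("What is 6 * 7?"))
```

The point is that nothing about the trained weights changes; the model simply gets more chances at inference time, and the extra samples are aggregated into a better final answer. Post-training and inference-time methods are therefore complementary rather than mutually exclusive.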


A good movie analogy for Beam Search might be “Inception” (2010). Here’s why:

- Exploration of multiple possibilities: Just as Beam Search keeps a fixed number of the best candidates at each step, the characters in Inception navigate through multiple layers of dreams, focusing only on the most relevant threads to achieve their mission.
- Limited scope at each step: Beam Search limits the number of possibilities it considers (the beam width), and in Inception, the team must prioritize certain actions and ideas to succeed while discarding others due to time and resource constraints.
- Optimization: Both are about making the best choice at every level (or dream layer) without exhaustively considering every possibility.
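
To pin down the mechanics the analogy points at, here is a minimal, generic beam search sketch: at every step each surviving candidate is expanded, and only the `beam_width` highest-scoring partial sequences are kept. The `toy_expand` function is a made-up stand-in for a model’s next-token log-probabilities, not any particular system’s decoder.

```python
import math
from typing import Callable

def beam_search(
    start: str,
    expand: Callable[[str], list[tuple[str, float]]],
    beam_width: int = 3,
    max_steps: int = 5,
) -> str:
    """Keep only the beam_width highest-scoring partial sequences at each step."""
    # Each beam entry is (sequence, cumulative log-probability).
    beams = [(start, 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            for token, logprob in expand(seq):
                candidates.append((seq + token, score + logprob))
        if not candidates:
            break
        # Prune: sort by score and keep only the top beam_width candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda c: c[1])[0]

# Toy expansion function standing in for a next-token distribution.
def toy_expand(seq: str) -> list[tuple[str, float]]:
    return [("a", math.log(0.5)), ("b", math.log(0.3)), ("c", math.log(0.2))]

if __name__ == "__main__":
    print(beam_search("", toy_expand, beam_width=2, max_steps=3))
```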


For the trial, error, and adaptation aspect, Edge of Tomorrow (2014) works well as an analogy, especially for audiences familiar with the movie.


The Hunger Games

In the arena, only the best, the one who survives, comes out on top. That mirrors the best-of-N approach, where you sample N candidate responses, evaluate each one, and pick the one with the highest reward.
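
A minimal best-of-N sketch, assuming a hypothetical `generate` sampler and a hypothetical `reward` scorer (standing in for a reward model of the kind reward engineering produces): all N candidates enter the arena, and only the highest-scoring one survives.

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling one candidate response from a model."""
    return f"candidate-{random.randint(0, 999)}"

def reward(prompt: str, response: str) -> float:
    """Hypothetical reward model; here just a random score for illustration."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidate responses and keep only the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

if __name__ == "__main__":
    print(best_of_n("Write a proof sketch."))
```

Best-of-N is the simplest way to trade inference-time compute for quality: the model does not change at all, and the only new ingredient is a scorer good enough to recognize the winner.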

Written on December 23, 2024