OpenAI o1
reward engineering
Why test/inference-time scaling laws?
First, knowledge distillation.
Q1: Generally, a model’s performance is determined by its training phase, so if you say that pre-training or post-training can improve a model’s reasoning ability, I understand that. But how can the inference phase improve a model’s reasoning ability? What do you mean by using computational power during the inference phase?
Q2: Are post-training and inference-time compute two separate methods for improving a model’s reasoning capabilities? Can they be used together?
A good movie analogy for Beam Search might be “Inception” (2010). Here’s why:
- Exploration of multiple possibilities: just as Beam Search keeps a fixed number of the best candidates at each step, the characters in Inception navigate multiple layers of dreams, focusing only on the most relevant threads to achieve their mission.
- Limited scope at each step: Beam Search limits the number of possibilities it considers (the beam width); in Inception, the team must prioritize certain actions and ideas while discarding others due to time and resource constraints.
- Optimization: both are about making the best choice at every level (or dream layer) without exhaustively considering every possibility.
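To make the analogy concrete, here is a minimal beam search sketch in Python. `expand` and `score` are hypothetical stand-ins for a model's next-token proposals and a scoring function such as cumulative log-probability; the only point is that a fixed beam width caps how many partial candidates survive each step.

```python
import heapq

def beam_search(start, expand, score, beam_width=3, max_steps=10):
    """Keep only the `beam_width` best partial sequences at every step.

    expand(seq) -> iterable of possible next tokens (hypothetical model call)
    score(seq)  -> how good a partial sequence is, e.g. its log-probability (hypothetical)
    """
    beam = [start]  # start is a tuple of tokens
    for _ in range(max_steps):
        candidates = [seq + (tok,) for seq in beam for tok in expand(seq)]
        if not candidates:
            break
        # prune: only the top `beam_width` candidates survive to the next step
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)

# toy demo: "tokens" are digits and the score simply prefers larger sums
print(beam_search((0,), expand=lambda seq: [1, 2, 3], score=sum, beam_width=2, max_steps=4))
```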
For the trial, error, and adaptation aspect, Edge of Tomorrow works well, especially for audiences familiar with the movie.
The Hunger Games: only the best, the one who survives, comes out on top. That is the best-of-N approach, where you sample N candidate answers and pick the one with the highest reward.
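A hedged sketch of that best-of-N idea, with `sample_answer` and `reward_model` as hypothetical placeholders for a stochastic model call and a learned scorer:

```python
import random

def sample_answer(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic LLM completion."""
    return f"candidate {random.random():.3f} for: {prompt}"

def reward_model(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a learned reward / verifier score."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [sample_answer(prompt) for _ in range(n)]
    # only the candidate that "survives" the reward model comes out on top
    return max(candidates, key=lambda ans: reward_model(prompt, ans))

print(best_of_n("What is 17 * 24?"))
```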
One of the selling points of o1 is that it natively incorporates CoT.
It thinks before it speaks. This is test-time compute, and it has proven highly effective at improving performance.
How? Let the model talk to itself before giving out the answer.
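One hedged way to picture "talking to itself": generate a hidden scratchpad first, then condition the visible answer on it. The `llm` function below is a hypothetical single-call completion function, not o1's actual (undisclosed) mechanism.

```python
def llm(prompt: str) -> str:
    """Hypothetical completion function; plug in any LLM API here."""
    raise NotImplementedError

def think_then_speak(question: str) -> str:
    # Stage 1: private reasoning tokens ("talking to itself"), never shown to the user.
    scratchpad = llm(
        f"Question: {question}\n"
        "Think step by step and write out your reasoning. Do not answer yet."
    )
    # Stage 2: the visible answer, conditioned on the hidden chain of thought.
    return llm(
        f"Question: {question}\n"
        f"Reasoning (hidden from the user): {scratchpad}\n"
        "Now give only the final answer."
    )
```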
RLHF turns a knowledgeable AI model into a chatbot, but how do you know whether the model is actually giving better answers, rather than just getting better at pretending to be a good model?
o1 didn’t improve much on English literature, but it got a big bump in math and logic, i.e., logical reasoning.
Is CoT just a facade? Maybe the model simply needs more filler tokens to effectively generate the result (slacking) – Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models.
Chain-of-Thought Reasoning Without Prompting: the basic idea is that the model already has a good reasoning process and can generate good candidate answers; the current token-decoding strategy just isn’t picking them.
OK, but how do we get the answer started? How do we ensure that the first few tokens are on the right path?
Pick the candidates that look good, i.e., the ones that come with reasoning attached.
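Here is a hedged sketch of the decoding idea from that paper: instead of committing to the single greedy first token, branch on the top-k first tokens, finish each branch greedily, and keep the completion whose answer the model is most confident about. All three helpers are hypothetical placeholders for model calls.

```python
def topk_first_tokens(prompt: str, k: int) -> list[str]:
    """Hypothetical: the k most likely first tokens for `prompt`."""
    raise NotImplementedError

def greedy_continue(prompt: str, first_token: str) -> str:
    """Hypothetical: greedy-decode a full completion that starts with `first_token`."""
    raise NotImplementedError

def answer_confidence(completion: str) -> float:
    """Hypothetical: average top-1 vs top-2 probability margin over the answer tokens."""
    raise NotImplementedError

def cot_decode(prompt: str, k: int = 10) -> str:
    # The greedy path often skips reasoning the model "already knows";
    # branching on the first token surfaces those alternative paths.
    branches = [greedy_continue(prompt, tok) for tok in topk_first_tokens(prompt, k)]
    # Keep the branch whose final answer the model decodes with the highest confidence.
    return max(branches, key=answer_confidence)
```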
This brings us to the idea of test-time compute.
It is the next big buzzword, billed as the next frontier of LLM scaling. CoT is actually a weaker form of test-time compute; the idea goes way beyond CoT.
- Reward models
- Self-verification and verifiers
- Search methods
- Monte Carlo tree search
- The STaR algorithm (sketched below)
- Best-of-N sampling
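Since STaR is the least self-explanatory item on this list, here is a hedged sketch of its outer loop: sample rationales, keep only the ones whose final answer is correct, and fine-tune on the survivors. `generate_rationale`, `extract_answer`, and `finetune` are hypothetical placeholders, and the rationalization-with-hint step of the original paper is omitted.

```python
def star_iteration(model, problems, generate_rationale, extract_answer, finetune):
    """One simplified STaR (Self-Taught Reasoner) iteration.

    problems: list of (question, gold_answer) pairs.
    """
    training_set = []
    for question, gold_answer in problems:
        rationale = generate_rationale(model, question)   # sample a chain of thought + answer
        if extract_answer(rationale) == gold_answer:      # keep it only if it actually worked
            training_set.append((question, rationale))
    # train the model on its own successful reasoning traces, then repeat
    return finetune(model, training_set)
```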
Test-time compute extracts existing knowledge more effectively [Scaling LLM Test-Time Compute, Google DeepMind].
Key question: how do you split the budget between model training and inference, and what gives the best ROI? A 10B model + TTC can beat a ~14x larger model (smaller models are under-exploited). [pretraining………][post-training][inference] vs. [pretraining][post-training][inference………]
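A back-of-the-envelope sketch of that trade-off, using the standard rules of thumb of roughly 6·N·D FLOPs for training and 2·N FLOPs per generated token; the token counts and sample counts below are illustrative assumptions, not numbers from the DeepMind paper.

```python
# Rough FLOP accounting: training ~ 6*N*D, generation ~ 2*N per output token.
def train_flops(params: float, train_tokens: float) -> float:
    return 6 * params * train_tokens

def generate_flops(params: float, tokens_out: int, n_samples: int = 1) -> float:
    return 2 * params * tokens_out * n_samples

small, large = 10e9, 140e9   # 10B model vs a ~14x larger model
train_tokens = 1e12          # assumed pretraining tokens, same data for both
answer_len = 1_000           # assumed tokens generated per query

print(f"pretraining 10B : {train_flops(small, train_tokens):.2e} FLOPs")
print(f"pretraining 140B: {train_flops(large, train_tokens):.2e} FLOPs")
# Per-query inference: small model with 16 samples vs large model with 1 sample.
print(f"10B x 16 samples: {generate_flops(small, answer_len, 16):.2e} FLOPs")
print(f"140B x 1 sample : {generate_flops(large, answer_len, 1):.2e} FLOPs")
# The per-query costs are comparable (3.2e14 vs 2.8e14), but the large model cost ~14x
# more to pretrain, which is exactly the budget-planning question above.
```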
Two methods:
- Search against verifiers: generate N answers and filter, i.e., treat it as a search problem guided by a ranking or reward model.
  a. Best-of-N
  b. Monte Carlo tree search
  c. … (break it down: assistant/coach)
- Modify the proposal distribution, mostly via RL, so that the model refines its own output (see the sketch below).
  a. Overthinking can backfire.
  b. There is a trade-off between task difficulty and search budget.
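To contrast the two methods, here is a hedged sketch of the second one as a simple sequential-revision loop; this is one way to shift the proposal distribution at test time, whereas the paper mostly means training the model with RL so it does this natively. `llm` is a hypothetical completion function, and the small `budget` cap reflects the overthinking and budget trade-offs noted above.

```python
def llm(prompt: str) -> str:
    """Hypothetical completion function; plug in any LLM API here."""
    raise NotImplementedError

def sequential_revision(question: str, budget: int = 3) -> str:
    """Let the model iteratively refine its own answer instead of sampling in parallel."""
    answer = llm(f"Question: {question}\nAnswer:")
    for _ in range(budget):  # keep the budget small: too many revisions can backfire
        answer = llm(
            f"Question: {question}\n"
            f"Previous attempt: {answer}\n"
            "Point out any mistake in the previous attempt, then write an improved answer."
        )
    return answer
```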