
Adaptive Rectification Sampling for Test-Time Compute Scaling

Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Yancheng Pan, Shaoxun Wang
School of Computer Science and Technology, Xi’an Jiaotong University
772316639@stu.xjtu.edu.cn, xjzhang@xjtu.edu.cn

Abstract

The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially in complex tasks such as logical reasoning. Common test-time scaling methods involve generating more chains of thought (CoTs) or longer CoTs with self-correction. However, while self-correction can improve performance, it may lead to significant token waste and reduce the readability of the CoT if the reasoning steps are already correct. To demonstrate that large language models (LLMs) can rectify errors at a more fine-grained level, we propose Adaptive Rectification Sampling (AR-Sampling), which can guide the LLMs to self-correct at the appropriate step. AR-Sampling leverages a process-supervised reward model (PRM) as a verifier and constructed trigger sentences to guide the model in adaptive step-level rethinking. Experiments on GSM8K and MATH500 indicate that our approach enables models to rethink at a more fine-grained level, improving the accuracy of solutions while generating a reasonable number of additional tokens. Our code is available at: https://github.com/TanZhendong/AR-Sampling.

1 Introduction

The newly released OpenAI-o1 and DeepSeek-R1 (Jaech et al., 2024; Guo et al., 2025) models have demonstrated remarkable capabilities in complex tasks such as logical reasoning (Shao et al., 2024; Yang et al., 2024b) and code generation (Roziere et al., 2023). With post-training techniques, represented by reinforcement learning, these models are capable of deep thinking, generating longer chains of thought (CoT) (Wei et al., 2022a) and improving the quality of their outputs. On the other hand, increasing the scale of pre-trained models requires more computational resources and data, which is difficult to afford. Therefore, many researchers are focusing on post-training and test-time scaling to enhance model performance.

Common test-time scaling methods involve generating more CoTs, such as best-of-N, beam search, and other tree-of-thought approaches (Snell et al., 2024; Yao et al., 2023; Wan et al., 2024). We refer to this as increasing the width of the CoT, i.e., increasing the number of samples N or beams. Correspondingly, o1-like models scale test-time inference by increasing the length of the CoT. By analyzing the deep-thinking outputs of DeepSeek-R1, we can observe that during this phase the model often produces phrases such as “Let me check again” or “Alternatively,” leading to new solutions. This phenomenon is referred to as the “aha moment” (Guo et al., 2025), which signifies allocating more thinking time to a problem by reevaluating its initial approach. However, this phenomenon typically arises spontaneously and uncontrollably. The model may still generate lengthy responses when problems are quite simple, a phenomenon known as “overthinking” (Chen et al., 2024). Although the “aha moment” can enhance model performance, when the current reasoning steps are correct, checking and rethinking a new solution can result in a significant waste of tokens and reduce the readability of the CoT.

Theoretically, if LLMs only rethink and rectify at the step where an error occurs, the number of generated tokens could be effectively reduced. However, during test time, it is challenging to identify at which step the model made a mistake and to guide the model to regenerate from the incorrect step. As a result, the key research problem is: during test time, how can we guide LLMs to rethink at the appropriate moments?

In this paper, we find that using a process-supervised reward model (PRM) as a verifier to check the reasoning steps can help identify potential errors. Consequently, we propose Adaptive Rectification Sampling (AR-Sampling), which leverages the verifier and constructed trigger sentences to guide the model in adaptive step-level rethinking. On the one hand, AR-Sampling can enhance LLM reasoning without generating redundant tokens; on the other hand, the sampled data can be used in other self-critique methods. At the same time, we demonstrate that LLMs have the ability to rethink at a more fine-grained level, which is beneficial for addressing overthinking in the future.

The main contributions of this work are:

2 Related Work

Test-Time Compute Scaling. Snell et al. (2024) provide a detailed demonstration of how LLMs can utilize additional computation at test time to improve accuracy. As illustrated in Section 1, the forms of test-time compute scaling can be categorized into increasing the width or the length of the CoT. Increasing the width of the CoT typically requires a verifier to aggregate or select the best answer from the proposer (Cobbe et al., 2021). If combined with majority voting (Wang et al., 2022), the accuracy and stability of best-of-N sampling can be further improved. According to Lightman et al. (2023), process-based verifiers generally perform better than outcome-based verifiers (Uesato et al., 2022). Due to the branching nature of per-step predictions, we can search within a tree-like solution space. Methods such as beam search and Monte Carlo tree search, which explore the tree of thoughts, can be more efficient and enable the model to perform better (Wan et al., 2024; Yao et al., 2023; Xie et al., 2023).

Increasing the length of the CoT typically relies on the model’s self-reflection capabilities, meaning the model can refine its own outputs, regardless of whether they are correct or not, to enhance its responses. Madaan et al. (2023) demonstrate that LLMs can provide feedback and utilize it to self-refine. Building on this foundation, many applications leverage this self-reflection mechanism to improve the outputs of LLMs (Gou et al., 2023; Chen et al., 2023). In addition, manually inserting prompts can also trigger self-reflection (Chen et al., 2025). Beyond this, we can leverage reinforcement learning and direct preference optimization (Rafailov et al., 2023) on sampled data for fine-tuning, enabling the model to achieve self-improvement (Qu et al., 2024).

LLM Reasoning. LLM reasoning has long been an important research area. Its primary goal is to enhance the logical reasoning capabilities of LLMs, particularly in solving mathematical problems. CoT has become an essential process for LLMs to answer mathematical questions, as solving problems step by step can significantly improve accuracy and increase the readability of the solving process, which is widely applied in recent works (Cobbe et al., 2021; Kojima et al., 2022; Wei et al., 2022b; Uesato et al., 2022). Moreover, many test-time compute scaling methods also employ mathematical reasoning for validation (Chen et al., 2025; Snell et al., 2024; Beeching et al.). As a result, we also validate AR-Sampling on mathematical reasoning.

Efficient Thinking. OpenAI-o1 and DeepSeek-R1 have already demonstrated the amazing potential that comes with deep thinking. However, they tend to generate a very large number of tokens in response, even when the questions are quite simple. The core objective of efficient thinking is to explore methods for scaling test-time compute efficiently and intelligently. A commonly adopted approach is to adaptively set the width of the CoT. Aggarwal et al. (2023) and Li et al. (2024) have explored the possibility of early stopping within self-consistency from different perspectives, aiming to prevent the model from excessive generation. Expanding upon this, Wang et al. (2024) incorporate prior knowledge about question difficulty to adaptively allocate inference resources. More recently, works have focused on reducing the length of the CoT, including token-budget-aware inference (Han et al., 2024; Yu et al., 2025) and CoT compression (Luo et al., 2025; Ma et al., 2025). The aforementioned methods primarily focus on saving inference budget at the solution level. Instead, we concentrate on more fine-grained step-level rethinking, aiming to further explore self-correction mechanisms to improve generation efficiency.

Figure 1: The framework of AR-Sampling. AR-Sampling uses a PRM as a verifier to check each step. If the score is lower than the threshold, we consider this step unfavorable for the reasoning and use a trigger to force the model to rethink from the current step. By adjusting the threshold score and the maximum number of rethinks, we can dynamically control the generation budget.

3 Adaptive Rectification Sampling

3.1 Preliminaries

Best-of-N sampling is one of the most commonly used methods for test-time compute scaling. To better understand our method, we first introduce it in detail.

Intuitively, when adopting best-of-N sampling, the model needs to generate N different samples through stochastic decoding methods such as top-k, top-p, and temperature sampling, and then select the one with the highest score from these candidates. When using an outcome-supervised reward model (ORM), a single score is assigned to a solution path. However, when using a PRM, since the model provides a score at each step, we need to aggregate these scores. Typically, we can use a reduction operation $f(\cdot)$ to obtain the aggregated score, such as taking the minimum of the step scores, their product, or simply the score of the final step. After obtaining the score for each solution path, we can select the final answer $a$ with the highest score from the answer set $\mathcal{A}$:

$$a_{\text{BoN}} = \arg\max_{a \in \mathcal{A}} f\left(r_a^{(1)}, r_a^{(2)}, \ldots, r_a^{(k)}\right) \tag{1}$$

$r_a^{(i)}$ represents the score of the $i$-th step in the solution path whose final answer is $a$; we assume that a solution consists of $k$ steps.

Moreover, according to Li et al. (2022), we can further improve the stability and accuracy of best-of-N sampling by utilizing majority voting, using weighted scores for selection:

$$a_{\text{BoN}} = \arg\max_{a \in \mathcal{A}} \sum_{i=1}^{N} \mathbb{1}(a_i = a)\, f\left(r_{a_i}^{(1)}, \ldots\right) \tag{2}$$

where $a_i$ denotes the final answer of the $i$-th sampled solution.
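
To make the aggregation concrete, here is a minimal Python sketch of plain and voting-weighted best-of-N selection over per-step PRM scores; the sample data and the aggregate helper (covering the min, product, and last-step reductions mentioned above) are illustrative only.

    from collections import defaultdict
    import math

    def aggregate(step_scores, how="min"):
        # f(.) over per-step PRM scores: minimum, product, or last-step score
        if how == "min":
            return min(step_scores)
        if how == "prod":
            return math.prod(step_scores)
        return step_scores[-1]

    def best_of_n(paths, how="min", weighted=False):
        # paths: list of (final_answer, [r^(1), ..., r^(k)]) pairs
        if not weighted:
            # Eq. (1): pick the answer of the single highest-scoring solution path
            return max(paths, key=lambda s: aggregate(s[1], how))[0]
        # Eq. (2): sum aggregated scores over paths that share the same final answer
        totals = defaultdict(float)
        for answer, scores in paths:
            totals[answer] += aggregate(scores, how)
        return max(totals, key=totals.get)

    paths = [("20", [0.98, 0.99, 0.95]), ("24", [0.90, 0.16, 0.40]), ("20", [0.80, 0.70, 0.90])]
    print(best_of_n(paths), best_of_n(paths, weighted=True))  # 20 20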

3.2 AR-Sampling

Given a question, AR-Sampling requires the model to generate step by step. It uses a verifier to identify incorrect steps and then constructs trigger sentences to guide the model to rethink from them. We describe these two components in more detail in Sections 3.2.1 and 3.2.2.

3.2.1 Adaptive Step Detection

We use a PRM as the verifier to check each step. Typically, a PRM is an LLM fine-tuned on datasets that are either manually annotated (Lightman et al., 2023) or automatically annotated (Wang et al., 2023). It treats verification as a classification problem, where each step can be categorized into two or three classes: good, bad, and neutral. Neutral indicates that a step is correct but irrelevant to the reasoning goal; in the two-class case, neutral steps can be treated as incorrect. According to Wang et al. (2023), there is little difference between the binary and the three-class models. For the model architecture, while keeping the base model unchanged, we can replace the causal language model head with a token or sequence classification output head. For the sequence classification model, each step is treated as a token sequence. For the token classification model, we can use the class of the last token as the class of the step (von Werra et al., 2020). In addition, we can also keep the original causal language model head, employing certain special tokens (such as ’+’ and ’-’) as markers for the classes (Dong et al., 2024).

Figure 2: Samples of PRM scores. The x-axis is the normalized step index. We classify solutions into correct (a) and wrong (b) according to the final answers. For the wrong cases, there is greater fluctuation, and the scores of some steps are very low. We believe that these steps are more likely to lead to wrong answers.

Given a question $q$, the reward $r_a^{(i)}$ of step $s_i$ can be considered as the probability of the good class:

$$r_a^{(i)} = P\left(+ \mid q, s_1, s_2, \ldots, s_i\right) \tag{3}$$
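
As an illustration of Eq. (3), the sketch below scores a partial solution with a causal-LM-style PRM that uses ’+’ and ’-’ as class markers. The checkpoint id and the plain-text prompt layout are assumptions made for illustration; the released RLHFlow PRM expects its own chat-style step format.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    PRM_ID = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"  # illustrative checkpoint id
    tok = AutoTokenizer.from_pretrained(PRM_ID)
    prm = AutoModelForCausalLM.from_pretrained(PRM_ID, torch_dtype=torch.bfloat16)
    plus_id = tok.convert_tokens_to_ids("+")
    minus_id = tok.convert_tokens_to_ids("-")

    @torch.no_grad()
    def step_reward(question: str, steps: list[str]) -> float:
        # r^(i) = P(+ | q, s_1, ..., s_i): renormalize the '+'/'-' logits predicted
        # right after the latest step (plain-text layout assumed for illustration).
        text = question + "\n\n" + "\n\n".join(steps)
        ids = tok(text, return_tensors="pt").input_ids
        logits = prm(input_ids=ids).logits[0, -1]          # next-token logits
        pair = torch.stack([logits[plus_id], logits[minus_id]])
        return torch.softmax(pair, dim=0)[0].item()        # probability of '+'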

We believe that the score of each step can effectively reflect its correctness. Meanwhile, due to the close dependence among steps, a wrong step is more likely to lead to a wrong answer. To display the distribution of the scores intuitively, we sample several PRM score trajectories during reasoning and plot the trends in Figure 2.

We divide the scores into two groups based on whether the final answer is correct. For most of the correct cases, the scores of every step are relatively high and close to 1. For the wrong cases, there is greater fluctuation, and the scores of some steps are very low. We believe that these steps are more likely to lead to wrong answers. Therefore, we need to introduce triggers after these steps to guide the LLMs to rethink from them. Specifically, we introduce a threshold p, which ranges between 0 and 1. If the score of the current step is less than p, a trigger is introduced; otherwise, the reasoning continues. Generally, the larger the value of p, the more likely the model is to trigger rethinking, and the more tokens are generated.

3.2.2 Step-level Rectification

When LLMs generate the solution step by step, they often use the word “Step” or a similar tag as a marker at the beginning of each step. Additionally, separators such as "\n" or "\n\n" can be used to indicate the end of a step. By setting the separator as a stop word, we can ensure that the model stops after each step of generation. After we identify a step to rethink, we construct a trigger sentence to guide the LLM to conduct step-level rethinking.

Figure 1 provides a specific example to illustrate this process. In the system prompt, we provide information about the separator and the step marker. Then, we can parse the step index (in Figure 1, the step index is 3) and construct a specific trigger. To ensure that the model follows the instruction of the rethink trigger, we also add the step marker to the trigger sentence.

After the model generates a new solution step, we continue to use the verifier to check it. It should be noted that sometimes the score of the rethought step is still lower than the threshold p we set. This may be caused by reward hacking, problem difficulty, or the capabilities of the model. To prevent the model from repeatedly rethinking the same step, we set a maximum number m of rethink attempts for a single step. If the number of rethinking attempts exceeds m, we no longer add the trigger for this step. We provide a detailed description of a single AR-Sampling generation in Algorithm 1.


Algorithm 1: AR-Sampling for one generation

Input: llm, prm, question q, threshold p, max attempts m, max length l, separator sep
Output: model generation s

index = 1; s = q; count = 0
while index <= l:
    if index < l:
        step = llm.generate(s, stop=sep)    // set the separator as the stop word
        score = prm.score(s, step, sep)
        if score < p and count < m:
            step = AddTrigger(step)
            count += 1
        else:
            count = 0
    else:
        step = llm.generate(s)
    s += step
    index += 1
return s
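
For readers who prefer running code, a minimal Python sketch of Algorithm 1 follows. The llm.generate and prm.score calls are assumed stand-in interfaces for the proposer and verifier (not a specific library API), the default p = 0.6 mirrors the GSM8K setting for the 1B model, and the boxed-answer check is a practical stopping condition that the pseudocode leaves implicit.

    import re

    SEP = "\n\n"  # step separator, also used as the stop word

    def rethink_trigger(step: str) -> str:
        # Parse the index of the low-scoring step and append the step marker so the
        # model regenerates from that step; the wording follows the Appendix B case study.
        m = re.search(r"Step\s+(\d+)", step)
        idx = m.group(1) if m else "1"
        return (f"Wait! Maybe I made some mistakes in Step {idx}. "
                f"I need to rethink from it.{SEP}## Step {idx}:")

    def ar_sampling_once(llm, prm, question, p=0.6, m=1, max_steps=30):
        # Assumed interfaces: llm.generate(prefix, stop=None) -> str (one step when
        # stop=SEP is honored) and prm.score(prefix, step) -> float in [0, 1].
        s, count = question + SEP, 0
        for index in range(1, max_steps + 1):
            if index < max_steps:
                step = llm.generate(s, stop=SEP)
                score = prm.score(s, step)
                if score < p and count < m:
                    s += step + SEP + rethink_trigger(step)  # keep the step, add a trigger
                    count += 1
                    continue
                count = 0
            else:
                step = llm.generate(s)            # final iteration: finish without a stop word
            s += step + SEP
            if "\\boxed" in step:                 # practical stop once an answer appears
                break
        return s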

3.3 Relationship with Other Sampling Methods

In Section 3.2, we introduce AR-Sampling and provide a detailed description of a single generation instance. Our approach emphasizes step detection and prompt triggering, which is orthogonal to several generation methods. When employing AR-Sampling at test time, it can be combined with best-of-N sampling to easily achieve scaling. To better understand AR-Sampling, we compare it with the most commonly used methods, best-of-N and beam search, as follows:

to retain all steps generated by the model, regardless of their correctness.

4 Experiments

4.1 Experiment Setup

Models and Datasets. In our experiments, we evaluate AR-Sampling on mathematical reasoning. For the proposer model, we choose the Llama3.2-1B-Instruct, Llama3.2-3B-Instruct (Grattafiori et al., 2024), and Qwen2.5-7B-Instruct (Yang et al., 2024a) models. To ensure the models have sufficient instruction-following capabilities, we use the instruction-tuned versions rather than the base models. For the verifier model, we choose a PRM trained from Llama3.1-8B-Instruct on RLHFlow/Deepseek-PRM-Data (Xiong et al., 2024). The verifier uses ’+’ and ’-’ tokens to label the class of a step, as illustrated in Section 3.2.1. A larger model size ensures that the verifier can effectively identify potential wrong steps for AR-Sampling and efficiently select the best answer. For datasets, we select two representative datasets: GSM8K (Cobbe et al., 2021) and MATH500 (Lightman et al., 2023).
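
For reference, the two evaluation sets can be loaded from their commonly used Hugging Face copies as sketched below; the dataset ids are assumptions rather than something specified in the paper.

    from datasets import load_dataset

    # GSM8K test split (1.32K problems) and the 500-problem MATH subset.
    gsm8k = load_dataset("openai/gsm8k", "main", split="test")
    math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")
    print(len(gsm8k), len(math500))  # 1319, 500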

Baselines. We combine best-of-N with AR-Sampling to achieve test-time scaling, because with best-of-N sampling all solution steps are retained, which facilitates our analysis of the solution path and allows us to explore the step-level rectification capabilities of the model. Consequently, we compare our approach against best-of-N sampling and its variants:

Metrics. The primary metric is the accuracy of the final answer. When scaling at test time, we use pass@N to denote the accuracy when the model generates N samples. A key issue when evaluating answers is that there are many equivalent expressions in mathematics, such as 1/2 and 0.5.


Model (-Instruct)   Method      Pass@2   Pass@4   Pass@8   Pass@16   Pass@32
Llama3.2-1B         SC          44.9     54.0     59.2     63.8      66.8
                    BoN         55.3     62.5     67.1     70.3      71.8
                    AR+BoN      58.4     65.4     71.5     74.4      74.5
                    BoN+SC      55.3     61.0     64.7     68.0      70.3
                    AR+BoN+SC   58.4     65.0     70.4     73.7      73.7
Llama3.2-3B         SC          78.8     85.2     87.7     88.8      88.6
                    BoN         85.0     86.8     88.0     88.9      89.8
                    AR+BoN      86.1     88.4     89.5     90.1      90.4
                    BoN+SC      85.0     87.1     88.6     89.6      89.5
                    AR+BoN+SC   86.1     88.3     89.3     89.8      90.3
Qwen2.5-7B          SC          87.6     92.0     92.3     92.9      93.5
                    BoN         91.1     92.3     93.2     93.3      93.7
                    AR+BoN      90.1     92.5     93.1     93.6      94.2
                    BoN+SC      91.1     92.3     93.2     93.3      93.7
                    AR+BoN+SC   90.1     92.3     93.3     93.5      93.7

Table 1: AR-Sampling can improve the accuracy (%) on GSM8K.

The standard approach (Lewkowycz et al., 2022) to address this is to let the model generate answers in LaTeX format and use symbolic computation to verify whether they are equivalent.
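
A minimal sketch of such a check with SymPy, assuming the boxed answers have already been extracted (parse_latex additionally requires the optional antlr4 runtime):

    from sympy import simplify
    from sympy.parsing.latex import parse_latex

    def is_equivalent(pred: str, gold: str) -> bool:
        # Compare two LaTeX answers symbolically; fall back to string match on parse errors.
        try:
            return simplify(parse_latex(pred) - parse_latex(gold)) == 0
        except Exception:
            return pred.strip() == gold.strip()

    print(is_equivalent(r"\frac{1}{2}", "0.5"))  # True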

Implementation. We run the experiments on a single NVIDIA A100 (80GB) GPU. For the proposer model, we use the vLLM inference engine (Kwon et al., 2023), and for the verifier, we use Hugging Face Transformers (Wolf et al., 2020). For the parameters, we set m = 1 and adjust p to control the rethinking. Generally speaking, the stronger the model's capability or the easier the dataset, the larger p becomes. Consequently, on GSM8K, we set p to 0.6, 0.7, and 0.8 for the 1B, 3B, and 7B models, respectively. On MATH500, we set them to 0.3, 0.5, and 0.5. Additionally, we set the maximum value of N to 32, which is sufficient to reflect the effect of test-time scaling. We set the stop word to "\n\n", and the system prompt is provided in Appendix A.
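
As a sketch of the proposer-side setup under these settings (the sampling temperature, top-p value, token limit, and prompt assembly are illustrative assumptions; only the engine and the "\n\n" stop word follow the description above):

    from vllm import LLM, SamplingParams

    SYSTEM_PROMPT = "Solve the following math problem efficiently and clearly:"  # see Appendix A
    question = "Natalia sold clips to 48 of her friends in April ..."            # placeholder problem

    llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")         # proposer served by vLLM
    params = SamplingParams(temperature=0.8, top_p=0.95,
                            max_tokens=512, stop=["\n\n"])      # "\n\n" ends each step
    prompt = SYSTEM_PROMPT + "\n\n" + question
    step = llm.generate([prompt], params)[0].outputs[0].text    # first reasoning step
    print(step)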

4.2 Results

We evaluate AR-Sampling from three aspects: the accuracy of test-time scaling, step-level rethinking efficiency, and an ablation study on parameters p and m. The details are as follows.

4.2.1 Accuracy

GSM8K. GSM8K (Grade School Math 8K) is a dataset comprising high-quality, linguistically diverse grade school math word problems. We use the test set containing 1.32K problems. While the questions are straightforward for humans, they effectively evaluate the multi-step reasoning capability of the model. The results are presented in Table 1. As N increases, the accuracy of all approaches improves significantly. The marginal improvement from pass@16 to pass@32 suggests that the performance has almost converged. Additionally, it can be confirmed that the verifier model effectively improves the accuracy compared with SC.

We observe that AR-Sampling improves the accuracy on GSM8K in almost all cases. This indicates that the model can use the self-correction mechanism to improve its performance. What's more, we note that in some cases, when combined with majority voting, the accuracy slightly decreases, which may be caused by bias or hallucination. Finally, for Qwen2.5-7B-Instruct, since its performance on GSM8K is saturated and it is out-of-distribution with respect to the verifier, the results of SC are comparable to those of BoN. However, using AR-Sampling still brings improvement, indicating that our hypothesis generalizes to some degree.

MATH500. To avoid over-fitting, Lightman et al. (2023) expanded the PRM training set to include part of the MATH test problems (Hendrycks et al., 2021). Therefore, they selected 500 held-out test problems for evaluation, referred to as MATH500. This dataset includes knowledge areas such as precalculus and algebra, which are challenging for LLMs. We believe it can effectively reflect the reasoning ability of LLMs.


Model (-Instruct)   Method      Pass@2   Pass@4   Pass@8   Pass@16   Pass@32
Llama3.2-1B         SC          28.2     34.4     37.6     40.8      43.8
                    BoN         31.8     36.6     39.6     42.2      41.8
                    AR+BoN      32.4     37.6     41.8     43.4      44.0
                    BoN+SC      31.8     38.6     41.4     44.4      45.0
                    AR+BoN+SC   32.4     36.8     42.2     46.0      47.8
Llama3.2-3B         SC          46.8     53.4     56.4     59.0      61.0
                    BoN         49.6     53.2     54.8     57.2      55.8
                    AR+BoN      47.2     51.6     52.6     57.0      56.6
                    BoN+SC      49.6     55.4     57.0     60.0      62.0
                    AR+BoN+SC   47.2     54.4     56.6     62.0      63.2
Qwen2.5-7B          SC          66.8     73.4     77.2     79.2      80.0
                    BoN         70.2     72.6     71.0     70.6      71.2
                    AR+BoN      70.4     70.2     72.6     73.4      73.4
                    BoN+SC      70.2     74.8     77.8     79.0      79.8
                    AR+BoN+SC   70.4     73.6     76.8     79.6      81.0

Table 2: AR-Sampling can improve the accuracy (%) on MATH500.


Model          Method   GSM8K   MATH500
R1-Distilled   -        377.9   1113.3
Llama3.2-1B    BoN      214.9   566.1
               AR       391.8   986.6
Llama3.2-3B    BoN      208.0   478.7
               AR       254.1   858.7
Qwen2.5-7B     BoN      195.3   395.6
               AR       210.2   461.6

Table 3: The average number of tokens per solution.

The results are shown in Table 2. We observe that AR-Sampling improves the accuracy across most scenarios. However, for Llama3.2-3B-Instruct and Qwen2.5-7B-Instruct, it is interesting that the accuracy of AR-Sampling decreases when N = 4, 8. We believe this is due to out-of-distribution effects, where the PRM cannot effectively guide the LLMs. In the case of Qwen, the performance of SC even surpasses that of using the PRM. Another reason is that, because the questions are hard, the self-correction mechanism cannot consistently improve performance; in fact, the proposer model may even change a correct step to an incorrect one after rethinking.

4.2.2 Rethinking Efficiency

The Number of Tokens. To better understand the rethinking overhead of AR-Sampling, we measure the average number of tokens generated per solution. The results are shown in Table 3.

Figure 3: The distribution of the number of rethinks for the Llama3.2-1B-Instruct model: (a) on GSM8K; (b) on MATH500.

We observe that as models scale up and their capabilities improve, the average number of tokens per solution gradually decreases. Larger models can solve problems with fewer steps and fewer tokens per step. Additionally, the rethinking overhead diminishes with increased model size, allowing models to perform self-correction more efficiently. We also measure the average number of tokens generated by DeepSeek-R1-Distill-Qwen-7B, a model fine-tuned on DeepSeek-R1 data (Guo et al., 2025). Since this model thinks before generating solutions, the average number of tokens increases significantly, especially when the model repeatedly rethinks. In comparison, step-level rethinking incurs lower overhead.

Adaptive Rethink. We investigate the ratio of trigger sentences in the total generated solutions. Figure 3 shows the distribution of the number of rethinking steps for the Llama3.2-1B-Instruct model on GSM8K and MATH500.

Figure 4: An ablation study on the Llama3.2-1B-Instruct model, exploring its performance on the MATH500 dataset under different parameters: (a) parameter p; (b) parameter m. The x-axis is the number of samples N, and the accuracy is calculated using majority voting.

By controlling the threshold p, we can adjust the proportion of rethinking. For approximately 40% of the solutions, trigger sentences are introduced; for the remaining 60%, the PRM scores are relatively high, so it is unnecessary for the model to rethink. From the difficulty perspective, the model needs less rethinking for simpler questions. Additionally, about half of the rethinking solutions introduce the trigger sentence only once. This indicates that the model can effectively influence the correctness of subsequent steps through one critical self-correction.

4.2.3 Ablation Study

To explore the impact of parameters p and m in AR-Sampling, we conduct an ablation study on the Llama3.2-1B-Instruct model as an example. We vary p from 0.1 to 0.5 and m from 1 to 5. The results are presented in Figure 4.

For the parameter p, when p is too small, the trigger sentence is rarely introduced, so the improvement is not significant. Generally speaking, as the value of p increases, the number of rethinking iterations increases, leading to better performance. Additionally, when N is less than 16, the impact on accuracy is limited. This indicates that, without fine-tuning, the model's self-correction is not precise, and a small number of generations cannot effectively improve performance.

For the parameter m, the trend is opposite to that of parameter p. When N is relatively large, different values of m show little difference. Although increasing m can also raise the number of rethinking iterations, setting m too large may lead to overthinking at a certain step or even result in cyclic generation. Consequently, we believe that setting m = 1 is sufficient to guide the model in self-correction.

From the perspective of generation length, as p and m increase, the length of the CoT grows rapidly. Therefore, we need to set appropriate values to ensure efficiency. Finally, different models and datasets require different parameter configurations. For easier datasets such as GSM8K, since the accuracy is already high, the PRM scores are typically high as well. Therefore, it is necessary to appropriately increase p or m to ensure the model can effectively use the self-correction mechanism.

5 Conclusions

In this work, we propose Adaptive Rectification Sampling (AR-Sampling), which leverages a verifier and constructed trigger sentences to guide the model in adaptive step-level rethinking. With more fine-grained rethinking, AR-Sampling can improve the accuracy of solutions at test time while generating a reasonable number of additional tokens. Through our research, we demonstrate that LLMs have the ability to rethink at a more fine-grained level, which is beneficial for addressing overthinking in the future.

Limitations

Our proposed method has some limitations, which we believe can be addressed by future work. First, as the DeepSeek-R1 report notes, the “aha moment” can emerge spontaneously when scaling reinforcement learning. If we aim to mitigate overthinking, merely relying on trigger sentences at test time to guide LLMs' self-correction is insufficient; instead, we need to integrate fine-grained rectification into reinforcement learning to better control the “aha moment” phenomenon. Additionally, we scale AR-Sampling at test time through best-of-N. More promising tree-of-thought methods, such as beam search, may achieve better results even without leveraging the self-correction mechanism. We hope to further explore the potential applications of self-correction in the future to improve model performance.

References

Pranjal Aggarwal, Aman Madaan, Yiming Yang, and 1 others. 2023. Let's sample step by step: Adaptive-consistency for efficient reasoning and coding with llms. arXiv preprint arXiv:2305.11860.

Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling test-time compute with open models.

Weizhe Chen, Sven Koenig, and Bistra Dilkina. 2025. Iterative deepening sampling for large language models. arXiv preprint arXiv:2502.05449 .

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, and 1 others. 2024. Do not think that much for 2+3=? On the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 .

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 .

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. 2024. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863 .

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad AlDahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 .

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 .

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. 2024. Token-budget-aware llm reasoning. arXiv preprint arXiv:2412.18547 .

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 .

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, and 1 others. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720 .

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199– 22213.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles .

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, and 1 others. 2022. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2022. Making large language models better reasoners with step-aware verifier. arXiv preprint arXiv:2206.02336.

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. 2024. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. arXiv preprint arXiv:2401.10480.

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations.

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. 2025. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570.

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. 2025. Cot-valve: Length-compressible chain-of-thought tuning. arXiv preprint arXiv:2502.09601.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594.

Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. 2024. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249– 55285.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, and 1 others. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 .

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 .

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 .

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process-and outcomebased feedback. arXiv preprint arXiv:2211.14275 .

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl.

Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2024. Alphazero-like tree-search can guide large language model decoding and training. In Forty-first International Conference on Machine Learning.

Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2023. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935.

Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. 2024. Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning. arXiv preprint arXiv:2408.13457.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 .

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022a. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824– 24837.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824– 24837.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. 2023. Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633.

Wei Xiong, Hanning Zhang, Nan Jiang, and Tong Zhang. 2024. An implementation of generative PRM. https://github.com/RLHFlow/RLHF-Reward-Modeling.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024a. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.


An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, and 1 others. 2024b. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822.

Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, and 1 others. 2025. Think smarter not harder: Adaptive reasoning with inference aware optimization. arXiv preprint arXiv:2501.17974 .

A System Prompt

The system prompt of our experiments is provided as follows:

Solve the following math problem efficiently and clearly:

Step 1: [Concise description]

[Brief explanation and calculations]

Step 2: [Concise description]

[Brief explanation and calculations]

Regardless of the approach, always conclude with:

Therefore, the final answer is: $\boxed{answer}$. I hope it is correct.

Where [answer] is just the final number or expression that solves the problem.
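
A small helper of the kind used to pull the boxed answer out of a generation might look as follows (a sketch that assumes no nested braces inside \boxed{...}):

    import re

    def extract_boxed(text: str) -> str | None:
        # Return the content of the last \boxed{...} in a solution.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
        return matches[-1] if matches else None

    print(extract_boxed(r"Therefore, the final answer is: $\boxed{20}$. I hope it is correct."))  # 20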

B Case Study

We provide the case data for a specific problem in GSM8K. For the AR-Sampling data, the model uses step-level rethinking to rectify the wrong step. In contrast, DeepSeek-R1 performs solution-level verification and reflection, which may sometimes consume additional tokens unnecessarily on easy questions.


Question: Every day, Wendi feeds each of her chick- ens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi’s flock is 20 chickens? DeepSeek-R1 Answer: Okay, let me try to figure out this problem. … So, Wendi needs to give 20 cups in the final meal. Wait, let me make sure I didn’t make a mistake. Let me check again. Each chicken has 3 cups per day. 20 chickens would need 3*20=60. Morning is 15, afternoon is 25. 15+25=40. 60-40=20. Yeah, that seems right. Hmm. Alternatively, maybe there’s another way to think about it. Let’s see. Maybe per meal, how much does each chicken get? Wait, but the problem says she gives them three separate meals, but doesn’t specify if each meal is equal. So, in the morning, she gives 15 cups to the flock, afternoon 25 cups, and the final meal is the remaining. So, since total per day is 60, then 15 +25 + x =60. Solve for x: x=60-15-25=20. So, that’s the same answer. So, 20 cups for the final meal. Alternatively, maybe check per chicken. Each chicken gets 3 cups a day. If the total feed given in the morning is 15 cups, then per chicken, morning meal is 15 cups divided by 20 chickens. Let me compute that: 15/20 = 0.75 cups per chicken in the morning. Then afternoon is 25 cups, so 25/20=1.25 cups per chicken in the afternoon. So, so far, each chicken has received 0.75 +1.25 =2 cups. Since they need 3 cups total, they need 1 more cup per chicken in the final meal. So, 1 cup per chicken times 20 chickens is 20 cups. So, that’s the same answer. Therefore, both methods give the same result. So, Wendi needs to give 20 cups in the final meal. I think that’s the correct answer. Let me just check if there’s any other way to interpret the problem. … So, the total has to be 60, so the remaining is 20. So, that seems correct. Therefore, I think 20 cups is the answer. Final Answer Wendi needs to give her chickens \boxed{20} cups of feed in the final meal.
Answer: If each chicken eats 3 cups of feed per day, then for 20 chickens they would need 3*20=«3*20=60»60 cups of feed per day. If she feeds the flock 15 cups of feed in the morning, and 25 cups in the afternoon, then the final meal would require 60-15-25=«60-15-25=20»20 cups of chicken feed. #### 20
AR-Sampling Answer: ## Step 1: Calculate the total cups of feed given in the morning and afternoon meals. Total cups given in morning and afternoon = 15 cups + 25 cups = 40 cups. ## Step 2: Calculate the remaining cups of feed needed for the final meal. Remaining cups needed for the final meal = Total cups given - Total cups given in morning and afternoon = 40 cups - 40 cups = 0 cups. Wait! Maybe I made some mistakes in Step 2. I need to rethink from it. ## Step 2: Calculate the total cups of feed for the day. Total cups of feed for the day = 3 meals * 20 chickens = 60 cups. ## Step 3: Calculate the remaining cups of feed needed for the final meal. Remaining cups needed for the final meal = Total cups of feed for the day - Total cups given in morning and afternoon = 60 cups - 40 cups = 20 cups. The final answer is: $\boxed{20}$