Thank you for your great work on the L1 paper. I have one question I'd like to clarify. The paper describes LCPO as allowing control over the reasoning chain length. However, when I look at the code, the length being controlled appears to be the total output length, that is, the thinking length plus the solution length.
Could you please clarify whether LCPO is controlling just the reasoning chain length, or the entire output length (reasoning + solution)?
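
To make the distinction concrete, here is a minimal sketch of the two quantities I mean. It is not taken from the L1 code: the `<think>`/`</think>` delimiters, the whitespace tokenization, and the function names are my own assumptions for illustration only.

```python
def count_tokens(text: str) -> int:
    # Whitespace split stands in for the real tokenizer (an assumption).
    return len(text.split())


def length_breakdown(
    output: str,
    open_tag: str = "<think>",    # hypothetical delimiter
    close_tag: str = "</think>",  # hypothetical delimiter
) -> dict[str, int]:
    """Split a model output into reasoning and solution token counts."""
    head, sep, tail = output.partition(close_tag)
    # If the closing tag is missing, treat the whole output as reasoning.
    reasoning, solution = (head, tail) if sep else (head, "")
    reasoning = reasoning.replace(open_tag, "", 1)
    n_reasoning = count_tokens(reasoning)
    n_solution = count_tokens(solution)
    return {
        "reasoning": n_reasoning,            # reasoning chain only
        "solution": n_solution,              # final answer only
        "total": n_reasoning + n_solution,   # entire output
    }


example = "<think> 2 + 2 is 4 </think> The answer is 4."
print(length_breakdown(example))
```

My question is whether the target length in LCPO is compared against the `reasoning` count or the `total` count in this sketch.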