Hi,
Thank you for your contributions.
I ran some experiments using the code you provided, switching from GPT-4o to Claude-3.7-Sonnet. I would expect Claude to perform similarly to GPT-4o, but the current run achieves only 0.25 ROUGE-1 on both insight-level and summary-level evaluation, which is far below the reported results (~0.32).
Our parameters are:
{ "benchmark_type": "full", "branch_depth": 4, "max_questions": 3, "model_name": "claude" }
These settings should match the experiments reported in your paper. Could you please suggest what might explain the gap or what we should check in our setup?
Best,
Ethan