Discrepancy between Leaderboard and my test with Provided Code #6

YvetteLaw · 2025-02-13T05:50:56Z

Hi, I'm trying to reproduce the leaderboard results. I believe I have followed your settings exactly, but my results are quite different, especially on the complong set. Could you suggest some factors that might affect the results?

yilunzhao · 2025-02-13T06:00:38Z

Hi @YvetteLaw, thanks for your interest in our work. We’ve provided all the model outputs in Google Drive (as outlined in the README), which should match the leaderboard results. Could you please check and compare our output with your reproduced results?

For the long-document setting, a common issue is incorrect processing of the supporting evidence (see the closed issues for reference).

Let me know if you still have questions!

YvetteLaw · 2025-02-13T13:52:28Z

Hi @yilunzhao Thanks for your reply. I will check my output with yours.

My other question: on the simplong set, some results are quite similar, so I believe I have followed your settings. But on the complong set, almost all cases fail. Do these two long sets use different settings?

yilunzhao · 2025-02-13T16:25:08Z

Both long subsets follow the same retrieval-then-generate pipeline. Have you implemented this approach?

YvetteLaw · 2025-02-19T06:44:37Z

I checked my output; one example is below. Does the header "<|start_header_id|>assistant<|end_header_id|>\n\n" affect the result? I don't see any other difference, except that the reasoning itself is actually incorrect.

"output": [
"<|start_header_id|>assistant<|end_header_id|>\n\nTo find the answer to the question, we need to identify the relevant information from the provided notes related to derivative contracts and other in 2018.\n\nNote 5. Derivatives and Risk Management, \n| Effective Portion Recognized in Other Comprehensive Income (OCI) |\n| | 2018 | 2017 |\n| Foreign currency exchange contracts | $ | (4) | $ | (25) |\n\nNote 9. Accumulated Other Comprehensive Loss \n| (in millions) |\n| Balance at January 1, 2018 (net of tax effect of $2) |\n| Changes in value (net of tax effect of $2) |\n| Changes in value (net of tax effect of $5) |\n\n| Derivative Contracts and Other |\n| Jan 1, 2018 | (7) | |\n| Jan 1, 2018 (net of tax effect) | 85 | |\n| Changes 2018 (net of tax effect) | (7) | |\n| Balance at Mar 31, 2018 (net of tax effect) | 80 | |\n\n\nNow we need to find the changes in value for "Derivative Contracts and Other" in 2018 (net of tax effect). \n\nChanges in value = $ 7 million \n\n| Effective Portion Recognized in Other Comprehensive Income (OCI) |\n| | 2018 | 2017 |\n| Foreign currency exchange contracts | $ | (4) | $ | (25) |\n\nNow we need to find the Effective Portion Recognized in Other Comprehensive Income for all years.\nFor 2018 Effective Portion Recognized in other Comprehensive Income is -$4 million and for 2017 is - $25 million.\n| | 2018 | 2017 |\n| Effective portion reclassified from AOCI to Earnings | $ | 1 | $ | (4) |\n| Effective Portion Recognized in Other Comprehensive Income | $ | (4) | $ | (25) |\n\nChanges in value for Derivative Contracts and Other in 2018 will be added to Effective Portion Recognized in other Comprehensive Income.\nChanges in Value = $ 7 million\n \n\nTotal Effective Portion recognized in other Comprehensive Income for 2017 and 2018 = -$4 million + $ 7 million - $25 million = $-22 million"
],
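Whether the header changes the evaluator's behavior is not established in this thread. One way to rule it out is to strip it before evaluation. The snippet below is a minimal sketch: the function name `strip_assistant_header` is hypothetical and not part of the repository, and it assumes the header appears verbatim at the start of the raw output, as in the example above.

```python
# Header string copied from the example output above.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def strip_assistant_header(output: str) -> str:
    """Remove the leading assistant header if present; otherwise return the text unchanged.

    Hypothetical helper for illustration only.
    """
    if output.startswith(ASSISTANT_HEADER):
        return output[len(ASSISTANT_HEADER):]
    return output
```

Applying this to every entry of "output" before running the evaluator would show whether the header has any effect on the scores.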

yilunzhao · 2025-02-19T14:07:16Z

It looks like the output is getting truncated before completion. The expected model response should conclude with something like "Therefore, the answer is …"
Based on the screenshot from the first thread, the execution rate on the complong set is quite low, indicating that the LLM evaluator is struggling to extract the final answer from the raw model outputs.

Could you please check these two issues? For example, does simply increasing the max_token limit from 512 to 1024 or 2024 resolve the issue? If the problem persists, it would be helpful if you could share your codebase so I can test it on my end.
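A quick way to check for truncation across a whole output file is to count the responses that never reach a concluding answer sentence. The sketch below assumes that complete responses contain the phrase "Therefore, the answer is", as described above. The helper names `is_truncated` and `truncation_rate` are hypothetical and not part of the repository.

```python
import re

# Assumption: a complete response contains a concluding sentence of the form
# "Therefore, the answer is ...". Matching is case-insensitive.
FINAL_ANSWER_RE = re.compile(r"therefore,\s+the\s+answer\s+is", re.IGNORECASE)


def is_truncated(output: str) -> bool:
    """Return True if the output has no concluding answer sentence."""
    return FINAL_ANSWER_RE.search(output) is None


def truncation_rate(outputs: list[str]) -> float:
    """Fraction of outputs that appear truncated (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_truncated(o) for o in outputs) / len(outputs)
```

Comparing the rate at different max_token settings should indicate whether the token limit is the cause. A rate that stays high after raising the limit would point to something else in the generation setup.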

YvetteLaw · 2025-02-25T04:58:13Z

Hi, you are right: the output is truncated, but I don't know why. I checked the output file for compshort, and the same thing happens there. I can't reproduce your output file. My outputs look like the following: they start with the header but end early, even though I changed max_token to 1024.

<|start_header_id|>assistant<|end_header_id|>

To find the difference between the payments due by Year 1 between Interest obligations and Operating lease obligations if the Interest obligations were $10,000 thousand instead of $28,200 thousand, we need to follow these steps:

I put the relevant code and output files on Google Drive. Could you take a look? https://drive.google.com/drive/folders/1X0Ar68zdXLRGTGi8zewmJxIBaDOqRrUT?usp=drive_link
