Discrepancy between Leaderboard and my test with Provided Code #6

Open
YvetteLaw opened this issue Feb 13, 2025 · 6 comments

Comments

@YvetteLaw

Hi, I'm trying to reproduce the leaderboard results. I believe I have followed your settings exactly, but the results are quite different, especially on the complong set. Could you suggest what might be causing the discrepancy?

Image
@yilunzhao
Collaborator

Hi @YvetteLaw, thanks for your interest in our work. We’ve provided all the model outputs in Google Drive (as outlined in the README), which should match the leaderboard results. Could you please check and compare our output with your reproduced results?

For the long-document setting, a common issue is incorrect processing of the supporting evidence (see the closed issues for reference).

Let me know if you still have questions!

@YvetteLaw
Author

Hi @yilunzhao Thanks for your reply. I will check my output with yours.

My other question: on the simplong set, some results are quite similar to yours, so I believe I have followed your settings. But on the complong set, almost all cases fail. Do these two long sets use different settings?

@yilunzhao
Collaborator

Both long subsets follow the same retrieval-then-generate pipeline. Have you implemented this approach?
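For readers unfamiliar with the term, here is a minimal sketch of what a retrieval-then-generate step can look like. It is only an illustration, not the repository's actual pipeline: the token-overlap scoring stands in for whatever retriever the project uses, and the names `tokenize`, `retrieve`, `build_prompt`, and the prompt wording are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by token overlap with the question (a stand-in retriever)."""
    question_tokens = tokenize(question)
    # sorted() is stable, so chunks with equal scores keep document order.
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_tokens & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str], top_k: int = 3) -> str:
    """Put only the retrieved evidence, not the whole document, in the prompt."""
    evidence = "\n\n".join(retrieve(question, chunks, top_k))
    # Hypothetical prompt wording; the real template is defined by the project.
    return (
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Reason step by step, then finish with 'Therefore, the answer is ...'."
    )
```

The point of the sketch is that the generator only sees the top-ranked chunks, so any mistake in how the supporting evidence is split or selected will show up as wrong answers on the long subsets.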

@YvetteLaw
Author

I checked my output; one example is shown below. Does the header "<|start_header_id|>assistant<|end_header_id|>\n\n" affect the result? I don't see any other difference, except that the reasoning process itself is not actually correct.

"output": [
"<|start_header_id|>assistant<|end_header_id|>\n\nTo find the answer to the question, we need to identify the relevant information from the provided notes related to derivative contracts and other in 2018.\n\nNote 5. Derivatives and Risk Management, \n| Effective Portion Recognized in Other Comprehensive Income (OCI) |\n| | 2018 | 2017 |\n| Foreign currency exchange contracts | $ | (4) | $ | (25) |\n\nNote 9. Accumulated Other Comprehensive Loss \n| (in millions) |\n| Balance at January 1, 2018 (net of tax effect of $2) |\n| Changes in value (net of tax effect of $2) |\n| Changes in value (net of tax effect of $5) |\n\n| Derivative Contracts and Other |\n| Jan 1, 2018 | (7) | |\n| Jan 1, 2018 (net of tax effect) | 85 | |\n| Changes 2018 (net of tax effect) | (7) | |\n| Balance at Mar 31, 2018 (net of tax effect) | 80 | |\n\n\nNow we need to find the changes in value for "Derivative Contracts and Other" in 2018 (net of tax effect). \n\nChanges in value = $ 7 million \n\n| Effective Portion Recognized in Other Comprehensive Income (OCI) |\n| | 2018 | 2017 |\n| Foreign currency exchange contracts | $ | (4) | $ | (25) |\n\nNow we need to find the Effective Portion Recognized in Other Comprehensive Income for all years.\nFor 2018 Effective Portion Recognized in other Comprehensive Income is -$4 million and for 2017 is - $25 million.\n| | 2018 | 2017 |\n| Effective portion reclassified from AOCI to Earnings | $ | 1 | $ | (4) |\n| Effective Portion Recognized in Other Comprehensive Income | $ | (4) | $ | (25) |\n\nChanges in value for Derivative Contracts and Other in 2018 will be added to Effective Portion Recognized in other Comprehensive Income.\nChanges in Value = $ 7 million\n \n\nTotal Effective Portion recognized in other Comprehensive Income for 2017 and 2018 = -$4 million + $ 7 million - $25 million = $-22 million"
],

@yilunzhao
Collaborator

  • It looks like the output is being truncated before completion. The expected model response should conclude with something like "Therefore, the answer is …".
  • Based on the screenshot in your first comment, the execution rate on the complong set is quite low, which indicates that the LLM evaluator is struggling to extract the final answer from the raw model outputs.

Could you please check these two issues? For example, does simply increasing the max_token limit from 512 to 1024 or 2024 resolve the problem? If it persists, it would be helpful if you could share your codebase so I can test it on my end.
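As a quick way to check both points, a small script along these lines can count how many raw outputs never reach the closing sentence. This is a rough sketch under stated assumptions: the header string is the one quoted earlier in this thread, the "Therefore, the answer is" pattern is a loose match for the expected ending, and the helper names (`strip_header`, `looks_truncated`, `truncation_rate`) are hypothetical rather than part of the repository.

```python
import re

# Assumption: the chat-template header seen in the output quoted in this thread.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
# Assumption: a loose, case-insensitive match for the expected closing sentence.
FINAL_ANSWER_RE = re.compile(r"therefore,\s+the\s+answer\s+is", re.IGNORECASE)


def strip_header(text: str) -> str:
    """Remove a leading assistant header and surrounding whitespace, if present."""
    text = text.lstrip()
    if text.startswith(ASSISTANT_HEADER):
        text = text[len(ASSISTANT_HEADER):]
    return text.strip()


def looks_truncated(text: str) -> bool:
    """Return True when the closing 'Therefore, the answer is' sentence is missing."""
    return FINAL_ANSWER_RE.search(text) is None


def truncation_rate(outputs: list[str]) -> float:
    """Fraction of outputs that never reach a final-answer sentence."""
    if not outputs:
        return 0.0
    flagged = sum(looks_truncated(strip_header(o)) for o in outputs)
    return flagged / len(outputs)
```

If the rate stays high after raising the token limit, the cause is more likely an early stop condition or the prompt template than the limit itself.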

@YvetteLaw
Author

YvetteLaw commented Feb 25, 2025

Hi, you are right, the output is truncated, but I don't know why. I checked the output file for compshort, and the same problem occurs there. I can't reproduce your output file: my outputs look like the example below, starting with the header but ending early, even after I changed max_token to 1024.

:<|start_header_id|>assistant<|end_header_id|>

To find the difference between the payments due by Year 1 between Interest obligations and Operating lease obligations if the Interest obligations were $10,000 thousand instead of $28,200 thousand, we need to follow these steps:

I have put the relevant code and output files on Google Drive. Could you take a look? https://drive.google.com/drive/folders/1X0Ar68zdXLRGTGi8zewmJxIBaDOqRrUT?usp=drive_link
