Discrepancy between Leaderboard and my test with Provided Code #6
Comments
Hi @YvetteLaw, thanks for your interest in our work. We’ve provided all the model outputs in Google Drive (as outlined in the README), which should match the leaderboard results. Could you please compare our outputs with your reproduced results? For the long-document setting, a common issue is incorrect processing of the supporting evidence (see the closed issues for reference). Let me know if you still have questions! |
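One way to run the suggested comparison is to load both sets of outputs into dictionaries keyed by example id and list the ids that differ. This is only a minimal sketch: the dictionary layout and the name `diff_outputs` are assumptions for illustration, not the format of the files in the Google Drive folder.

```python
from typing import Dict, List


def diff_outputs(reference: Dict[str, str], reproduced: Dict[str, str]) -> List[str]:
    """Return the example ids whose outputs differ or are missing in the reproduced run.

    Both arguments map a (hypothetical) example id to the raw model output.
    """
    mismatched = []
    for example_id, ref_text in reference.items():
        rep_text = reproduced.get(example_id)
        # Trim whitespace so a trailing newline does not count as a difference.
        if rep_text is None or rep_text.strip() != ref_text.strip():
            mismatched.append(example_id)
    return mismatched
```

Inspecting a few of the returned ids by hand is usually enough to tell whether the gap comes from prompting, evidence processing, or generation length.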
Hi @yilunzhao, thanks for your reply. I will compare my output with yours. My other question: on the simplong set, some of my results are quite similar to yours, so I believe I have followed your settings. But on the complong set, almost all cases fail. Do these two long subsets use different settings? |
Both long subsets follow the same retrieval-then-generate pipeline. Have you implemented this approach? |
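For readers unfamiliar with the term, the general shape of a retrieval-then-generate pipeline can be sketched as below. The thread does not describe the project's actual retriever, prompt format, or `top_k`, so everything here (the word-overlap scoring, the function names, the prompt layout) is an assumption chosen only to illustrate the two stages.

```python
from typing import Callable, List


def retrieve(question: str, paragraphs: List[str], top_k: int = 3) -> List[str]:
    """Toy retriever (an assumption): rank paragraphs by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        paragraphs,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def retrieve_then_generate(
    question: str,
    paragraphs: List[str],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Stage 1: select evidence. Stage 2: pass only that evidence to the model."""
    evidence = retrieve(question, paragraphs, top_k)
    # Hypothetical prompt layout: evidence first, then the question.
    prompt = "\n\n".join(evidence) + "\n\nQuestion: " + question
    return generate(prompt)
```

Here `generate` stands in for whatever model call is used; if the retrieval stage is skipped or returns the wrong paragraphs, the generation stage has little chance of matching the reference outputs.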
I checked my output; one example is as follows. Does the header "<|start_header_id|>assistant<|end_header_id|>\n\n" affect the result? I cannot find any other difference, except that the reasoning process is actually incorrect.
|
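Whether the header changes the score depends on how the evaluation script parses the output, which the thread does not state. One way to rule it out is to strip the header before evaluation and check whether the results move. A minimal sketch, with `strip_assistant_header` as a hypothetical helper name:

```python
# The header string exactly as quoted in the comment above.
HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def strip_assistant_header(text: str) -> str:
    """Remove a leading assistant header if present; leave all other text untouched."""
    if text.startswith(HEADER):
        return text[len(HEADER):]
    return text
```

If scores are unchanged after stripping, the header is not the cause and the difference lies in the generated content itself.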
Could you please check these two issues? For example, does simply increasing the max_token limit from 512 to 1024 or 2024 resolve the issue? If the problem persists, it would be helpful if you could share your codebase so I can test it on my end. |
Hi, you are right, the output is truncated, but I don't know why. I checked the output file for compshort, and the same problem occurs there. I can't get the same result as your output file: my output starts with the header but ends prematurely, even after I changed max_token to 1024.
I put the relevant code and output files on Google Drive. Could you take a look? https://drive.google.com/drive/folders/1X0Ar68zdXLRGTGi8zewmJxIBaDOqRrUT?usp=drive_link |
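To see how widespread the truncation is across an output file, a rough text-only check can flag suspicious outputs. This is a heuristic sketch under stated assumptions, not part of the project's code: `looks_truncated` is a hypothetical name, and outputs that legitimately end without punctuation (for example, bare code) will be flagged as false positives.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def looks_truncated(text: str) -> bool:
    """Heuristically flag an output that appears to stop mid-generation."""
    stripped = text.rstrip()
    if not stripped:
        return True
    # An odd number of fences means a code block was opened but never closed.
    if stripped.count(FENCE) % 2 == 1:
        return True
    # Otherwise, treat a missing sentence-final or closing character as suspicious.
    return stripped[-1] not in ".!?)]}\"'`"
```

Counting flagged outputs before and after raising the generation limit shows whether the limit is actually being applied; if the count does not change, the new setting is probably not reaching the generation call.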
Hi, I'm trying to reproduce the leaderboard results. I believe I have followed your settings exactly, but my results are quite different, especially on the complong set. Could you suggest some factors that might affect the results?