Fine-tune on TVQA dataset #2
Comments
Some details about the data processing are not very clear to me; I would be very grateful for your help.
1. The paper mentions that for TVQA, 6 frames are sampled evenly from each video. What is the text content associated with each frame? If it is dialogue text, how do you select the corresponding dialogue for each frame? If not, what does the text contain?
2. Do the question and the answer candidates form five hypotheses, with the CLS_TOKEN of each hypothesis taken and concatenated with the image CLS_TOKEN, and then passed through an MLP? Or is it done some other way?
I would really appreciate confirmation of these details. Thank you very much.
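For reference, here is a rough sketch of how I currently imagine the frame sampling and hypothesis construction. The function names, the hypothesis format, and the concatenation step are my own assumptions, not taken from your code or the paper:

```python
import numpy as np

def sample_frames_uniform(num_frames_total, num_samples=6):
    """Pick num_samples frame indices spread evenly across the clip
    (my reading of the paper's 'sample 6 frames evenly' for TVQA)."""
    idx = np.linspace(0, num_frames_total - 1, num=num_samples)
    return idx.round().astype(int).tolist()

def build_hypotheses(question, answer_candidates):
    """Pair the question with each of the 5 answer candidates to form
    5 hypothesis strings (assumed format; the real code may use a separator token)."""
    return [f"{question} {ans}" for ans in answer_candidates]

# Hypothetical usage on a 90-frame clip with a made-up TVQA-style question:
frame_ids = sample_frames_uniform(num_frames_total=90)   # -> [0, 18, 36, 53, 71, 89]
hypotheses = build_hypotheses(
    "What is Sheldon holding when he enters the room?",
    ["A laptop", "A mug", "A comic book", "A whiteboard", "A flag"],
)
# My assumption for question 2: each hypothesis is encoded separately, its text
# CLS_TOKEN embedding is concatenated with the image CLS_TOKEN embedding of the
# 6 sampled frames, and an MLP scores the five concatenated vectors.
```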
Let us know if you have further questions!
This is awesome work! Do you plan to release the pretrained model on TVQA+ and TVQA?
I also have some questions about TVQA fine-tuning, as I am trying to reproduce your results.
It would be very helpful if you could give more detail on what the input to the model looks like for TVQA. Thanks!
Thank you very much for your work. May I ask if you can release the code for fine-tuning on the TVQA dataset?