
Fine-tune on TVQA dataset #2

Open
Curry-AI opened this issue Jun 19, 2021 · 5 comments

Comments

@Curry-AI

Thank you very much for your work. May I ask if you could release the code for fine-tuning on the TVQA dataset?

@Curry-AI
Author

There are some details about the data processing that are not very clear to me. If you could help, I would be very grateful.

1. The paper mentions that for TVQA, 6 frames are sampled evenly from each video. What is the text content paired with each frame? If it is dialogue text, how do you select the corresponding dialogue text for each frame? If not, what is the content of the text?

2. Do the question and the answers form five hypotheses that are passed through an MLP, where you take the CLS token of each hypothesis and concatenate it with the image CLS token? Or is it done some other way?

I really hope to get confirmation of these details. Thank you very much.

@GloriaXimingLu
Collaborator

GloriaXimingLu commented Jul 14, 2021

  1. The text part is the dialogue text (subtitles).

  2. For each [images, context_i, question_i, answer_i] tuple, we feed it into the model and an MLP head, then take the max over the N logits. Basically, we copy the images part N times so it can be concatenated with each of the N candidates separately (see the sketch below).

Let us know if you have further questions!
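
For readers following along, here is a minimal sketch of that N-candidate scoring loop, assuming a generic PyTorch-style encoder that returns one logit per [images, context, question, answer_i] hypothesis. The function name, tensor shapes, and the dummy encoder are illustrative only, not the repository's actual API.

```python
# Illustrative sketch of multiple-choice scoring: the same image features are
# reused for every candidate answer, and prediction takes the max over N logits.
import torch

def score_candidates(model, image_feats, context_ids, question_ids, answer_ids_list):
    """image_feats: [num_frames, dim]; answer_ids_list: N candidate answers."""
    logits = []
    for answer_ids in answer_ids_list:
        # The images part is "copied" for each candidate, so every hypothesis
        # pairs the same frames and context/question with a different answer.
        logits.append(model(image_feats, context_ids, question_ids, answer_ids))
    logits = torch.stack(logits)        # shape [N]
    prediction = torch.argmax(logits)   # max over the N logits picks the answer
    return logits, prediction

# Toy stand-in encoder just to show the call pattern; the real model is the
# repository's multimodal transformer, which is not reproduced here.
dummy_model = lambda img, ctx, q, a: torch.randn(())
logits, pred = score_candidates(
    dummy_model, torch.zeros(6, 512), [0], [1], [[2], [3], [4], [5], [6]]
)
print(logits.shape, pred)  # torch.Size([5]) and the index of the best candidate
```

Copying the image features per candidate keeps the encoder single-stream: only the answer text differs across the N hypotheses, and training can apply cross-entropy over the stacked logits.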

@Lee-Ft

Lee-Ft commented Jul 21, 2021

This is awesome work! Do you have plans to release the pre-trained models on TVQA+ and TVQA?

@simon-ging

I also have some questions about TVQA fine-tuning, as I am trying to reproduce your results.

  1. Do you use the ground-truth timestamps of the question, provided in the TVQA dataset, to select frames from the video?

  2. How exactly do you select the subtitles? Subtitles are quite long (about 260 tokens on average), so I can't fit them all into the input sequence.

It would be very helpful if you could give more detail on what the input to the model looks like for TVQA. Thanks!

@GloriaXimingLu
Collaborator

  1. Yes, we extract the frames corresponding to the ground-truth timestamps.

  2. We use all subtitles and truncate them if they are longer than 732 tokens (see the sketch below).
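
To make these two points concrete, here is a rough sketch of how the preprocessing could look, under my own assumptions about the data format (a [ts_start, ts_end] ground-truth span per question, a fixed frame rate, and subtitles already tokenized into ids). The 6-frame count comes from the earlier question in this thread and the 732-token limit from this reply; everything else, including the function names and the fps value, is illustrative.

```python
# Illustrative preprocessing sketch: pick frames inside the ground-truth
# timestamp span and keep all subtitle tokens up to a 732-token cap.
import numpy as np

def frames_in_gt_span(ts_start: float, ts_end: float, fps: float = 3.0, num_frames: int = 6):
    """Pick `num_frames` evenly spaced frame indices within the ground-truth timestamps."""
    first = int(ts_start * fps)
    last = max(first, int(ts_end * fps) - 1)
    return [int(round(i)) for i in np.linspace(first, last, num_frames)]

def truncate_subtitles(subtitle_token_ids, max_len: int = 732):
    """Use all subtitle tokens, cutting the sequence if it exceeds `max_len`."""
    return subtitle_token_ids[:max_len]

print(frames_in_gt_span(12.4, 21.9))  # e.g. [37, 42, 48, 53, 59, 64]
```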
