Add REC tasks for testing a model's ability to locally ground objects, given a description. This adds REC for all RefCOCO datasets. #52
Conversation
Thank you for your contribution! Looks cool to me. Could you upload an image of the final run results?
For sure, happy to provide these results and details of my setup. I ran this with torch 2.1.2 + CUDA 12.4 on an Azure VM (Standard NC96ads A100 v4: 96 vCPUs, 880 GiB memory, 4x A100 80GB). Just as a short reference, these are the results for ACC@IoU=0.5:
Additionally, I'll attach the result summaries (as .txt files) for the different RefCOCO datasets. Let me know if you want me to directly screenshot the results, or if this is sufficient information! I have some more REC tasks I'd like to contribute, as well as some evaluations on screen-based benchmarks (ScreenSpot, RICO tasks, etc.). I'm also running some of these on the v1.5-13b model to see the shift, so I'll circle back with those results when I have them.
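For context on the metric quoted above: ACC@IoU=0.5 counts a prediction as correct when the intersection-over-union between the predicted and ground-truth boxes is at least 0.5. Below is a minimal sketch of that check, assuming both boxes are in [x1, y1, x2, y2] format; this helper is illustrative only and is not the implementation used in the PR.

def box_iou(pred, gt):
    # Intersection rectangle between the two [x1, y1, x2, y2] boxes
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two box areas minus the intersection
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

# ACC@IoU=0.5 is then the fraction of samples with box_iou(pred, gt) >= 0.5.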
Thanks for your PR! We are reviewing it and will discuss the changes to the main pipeline (the files inside the apis folder). Once you have the 13B results, you can post them here as well, and we will try to finish this PR asap.
Sounds good. I should hopefully have benchmarked the 13B model in the next day or two. Regarding the core file changes, if there's another way to achieve similar functionality that's more in line with how you all manage this package, I'm happy to change approaches. The main issue was that I needed the width and height of images in order to normalize bounding boxes. I also wanted to avoid re-loading the datasets multiple times, especially for splits that aren't needed.
from datasets import Dataset


def refcoco_bbox_rec_preprocess_dataset(dataset: Dataset):
    # PIL image stored in dataset['image'];
    # add `image_width` and `image_height` columns to the dataset
    dataset = dataset.map(lambda x: {"image_width": x["image"].width, "image_height": x["image"].height})

    # Original bbox format: (top-left x, top-left y, width, height).
    # Convert to (top-left x, top-left y, bottom-right x, bottom-right y)
    # and normalize the coordinates to [0, 1] using the image width and height.
    dataset = dataset.map(
        lambda x: {"bbox": [x["bbox"][0] / x["image_width"],
                            x["bbox"][1] / x["image_height"],
                            (x["bbox"][0] + x["bbox"][2]) / x["image_width"],
                            (x["bbox"][1] + x["bbox"][3]) / x["image_height"]]}
    )

    # Currently, the dataset stores `answer` as a list of strings.
    # Each answer should be its own row, so explode the dataset to one row
    # per answer, duplicating the other columns.
    def explode_answers(example):
        answers = example.pop('answer')
        return [{'answer': answer, **example} for answer in answers]

    # Apply the function to each element, collecting the results
    exploded_rows = []
    for example in dataset:
        exploded_rows.extend(explode_answers(example))

    # Create a new dataset from the exploded rows
    new_dataset = Dataset.from_list(exploded_rows)
    print(f"Exploded dataset from {len(dataset)} to {len(new_dataset)} rows")

    return new_dataset
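A minimal usage sketch for the function above; the dataset path and split name here are illustrative assumptions rather than the exact source the PR's task configs point at:

from datasets import load_dataset

# Hypothetical dataset path/split; the real source and split come from the task .yaml configs.
raw_split = load_dataset("lmms-lab/RefCOCO", split="val")
rec_split = refcoco_bbox_rec_preprocess_dataset(raw_split)
print(rec_split[0]["answer"], rec_split[0]["bbox"])  # one referring expression and its normalized box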
@hunterheiden Do you think this can be moved into the process_result part? In process_result it can be done row by row for the dataset, without any dataset modification.
@kcz358 I don't think this can be applied as a one-to-one operation on rows because each row has multiple answers. One row in the original dataset becomes (on average) 3 rows in the actual testing dataset, and I want to test the general case of a user supplying one caption and the model returning a grounding result.
Would it be possible to do a dataset explode operation like this through process_result, or is my understanding correct?
I was thinking about this some more this evening, and I'm wondering whether I should do this differently by simply creating a new dataset in the right format (i.e., exploding the rows out and normalizing the bounding boxes as expected), uploading that to Hugging Face, and then simplifying the implementation by cutting this function altogether.
I think that might make my modifications to the core API redundant, and so then I might be able to remove those edits as well.
What do you think? Worth me revising this implementation with a new dataset?
Yeah, you are right. The current implementation seems good to me; otherwise you would have to upload a new dataset to Hugging Face. If you feel creating a new dataset and uploading it to Hugging Face is a better way, we can also invite you to our organization so you can upload it there, since you have already written the script to process the dataset. But there are already lots of datasets on the hub, so I think processing the dataset before each evaluation is still fine for me, as long as the speed is acceptable. I will review again and approve this PR. @Luodian, this LGTM.
Add REC tasks for testing a model's ability to locally ground objects, given a description. This adds REC for all RefCOCO datasets.
This PR adds REC evaluations, assuming that bounding boxes will be output as raw square brackets and coordinates. It adds REC eval for all RefCOCO sets (RefCOCO, RefCOCO+, RefCOCOg). Specifically:
- .yaml files are added, both for the defaults and for the specific splits
- util_rec.py for each set (identical across RefCOCO sets)
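As an illustration of the "raw square brackets and coordinates" output format the evaluation assumes, a model response such as "[0.12, 0.30, 0.58, 0.77]" can be pulled out with a small regex. This is only a sketch; the actual parsing in util_rec.py may differ and be more permissive.

import re

def parse_bbox(response: str):
    # Match the first bracketed group of four numbers, e.g. "[0.12, 0.30, 0.58, 0.77]".
    match = re.search(r"\[\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*\]", response)
    if match is None:
        return None  # unparseable responses count as incorrect predictions
    return [float(g) for g in match.groups()]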