Add REC tasks for testing model ability to locally ground objects, given a description. This adds REC for all RefCOCO datasets. #52

Merged
4 commits merged into EvolvingLMMs-Lab:main on Apr 17, 2024

Conversation

@hunterheiden (Contributor)

This PR adds REC evaluations, assuming that bounding boxes are output as raw square-bracketed coordinates (a rough sketch of the assumed parsing and scoring follows the list below). It adds REC eval for all RefCOCO sets (RefCOCO, RefCOCO+, RefCOCOg). Specifically:

  • .yaml files are added, both the defaults and the split-specific configs
  • util_rec.py for each set (identical across RefCOCO sets)
  • minor modifications to Task to change how no_image_dataset is created; I've tried to do this more efficiently so that the dataset can be exploded into one answer-bounding-box pair per row instead of a set of answers for a single bounding box
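
As a rough sketch of the assumed parsing and scoring (illustrative helper names only, not the actual util_rec.py code): a square-bracketed box is pulled out of the raw model output, and the prediction counts as correct when its IoU with the normalized ground-truth box is at least 0.5.

import re


def parse_bbox(text):
    # Pull the first "[x1, y1, x2, y2]" group of four numbers out of raw model output.
    match = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", text)
    return [float(g) for g in match.groups()] if match else None


def iou(box_a, box_b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# ACC@IoU=0.5: the prediction counts as correct when IoU >= 0.5.
pred = parse_bbox("The object is at [0.12, 0.30, 0.55, 0.78].")
correct = pred is not None and iou(pred, [0.10, 0.28, 0.57, 0.80]) >= 0.5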

@kcz358 (Collaborator) left a comment

Thank you for your contribution! Looks cool to me. Could you upload an image of the final run result?

@hunterheiden (Contributor, Author)

For sure, happy to provide these results and details of my setup. I've run this with torch 2.1.2 and CUDA 12.4 on an Azure VM (Standard NC96ads A100 v4: 96 vCPUs, 880 GiB memory, 4×A100 80 GB).

Just as a short reference, these are the results for ACC@IoU=0.5:

Dataset    Split   liuhaotian/llava-v1.5-7b
RefCOCO    val     56.2
RefCOCO    test    58.1
RefCOCO    testA   64.4
RefCOCO    testB   47.5
RefCOCO+   val     50.0
RefCOCO+   testA   59.2
RefCOCO+   testB   39.0
RefCOCOg   val     48.8
RefCOCOg   test    48.4

Additionally, I'll attach the result summaries (as .txt files) for the different RefCOCO variants.

Let me know if you want me to directly screenshot / image the results, or if this is sufficient information! I have some more REC tasks I'd like to contribute, as well as some evaluations on screen-based benchmarks (ScreenSpot, RICO tasks, etc.).

I'm also running some of these on the v1.5-13b model to see the shift, so I'll circle back with those results when I have them.

@Luodian (Contributor) commented Apr 13, 2024

Thanks for your PR! We are checking it and will have some discussion about the changes to the main pipeline (the files inside the apis folder). If you are ready with the 13B results, you can also put them here and we will try to finish this PR ASAP.

@hunterheiden (Contributor, Author)

Sounds good. I should hopefully have benchmarked the 13B model in the next day or two.

Regarding the core file changes, if there's another way to achieve similar functionality that's more in line with how you all manage this package, I'm happy to change approaches. The main issue was that I needed the width and height of each image in order to normalize bounding boxes (e.g., a box (x=50, y=20, w=100, h=60) in a 500×400 image becomes (0.10, 0.05, 0.30, 0.20) after conversion to corner format and normalization). I also wanted to avoid re-loading the datasets multiple times, especially for splits that aren't needed.

@Luodian (Contributor) commented Apr 15, 2024

@kcz358 @jzhang38 Please help check it, thanks!

Comment on lines +10 to +43
def refcoco_bbox_rec_preprocess_dataset(dataset: Dataset):
    # PIL image stored in dataset['image']
    # add `image_width` and `image_height` to the dataset
    dataset = dataset.map(lambda x: {"image_width": x["image"].width, "image_height": x["image"].height})

    # Original bbox format (top x, top y, width, height)
    # Convert to (top-left x, top-left y, bottom-right x, bottom-right y)
    # Normalize the bounding box coordinates to be between 0 and 1
    # using the image width and height
    dataset = dataset.map(
        lambda x: {"bbox": [x["bbox"][0] / x["image_width"],
                            x["bbox"][1] / x["image_height"],
                            (x["bbox"][0] + x["bbox"][2]) / x["image_width"],
                            (x["bbox"][1] + x["bbox"][3]) / x["image_height"]]}
    )

    # currently, the dataset has `answer` as a list of strings
    # each answer should be its own row
    # we will explode the dataset to have one row per answer
    # duplicate the other columns
    def explode_answers(example):
        answers = example.pop('answer')
        return [{'answer': answer, **example} for answer in answers]

    # Apply the function to each element, collecting the results
    exploded_rows = []
    for example in dataset:
        exploded_rows.extend(explode_answers(example))

    # Create a new dataset from the exploded rows
    new_dataset = Dataset.from_list(exploded_rows)
    print(f"Exploded dataset from {len(dataset)} to {len(new_dataset)} rows")

    return new_dataset
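
For context, a minimal usage sketch of the snippet above; the dataset repo id and split here are hypothetical, and the real loading path inside lmms-eval may differ:

from datasets import load_dataset

# Hypothetical repo id and split; substitute whatever the task YAML actually points at.
raw_split = load_dataset("your-org/RefCOCO", split="val")
processed = refcoco_bbox_rec_preprocess_dataset(raw_split)  # function reviewed above
print(processed[0]["answer"], processed[0]["bbox"])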
@kcz358 (Collaborator) commented Apr 16, 2024
@hunterheiden Do you think this can be moved into the process_result part? In process_result it could be done row by row for the dataset, with no need for any dataset modification.

@hunterheiden (Contributor, Author)

@kcz358 I don't think this can be applied as a one-to-one operation on rows because each row has multiple answers. One row in the original dataset becomes (on average) 3 rows in the actual testing dataset, and I want to test the general case of a user supplying one caption and the model returning a grounding result.

Would it be possible to do a dataset explode operation like this through process_result, or is my understanding correct that it can't be done there?

@hunterheiden (Contributor, Author)

I was thinking about this some more this evening, and I'm wondering whether I should do this differently: simply create a new dataset in the right format (i.e., explode the rows out and normalize the bounding boxes as expected), upload that to Hugging Face, and then simplify the implementation by cutting this function altogether.

I think that might make my modifications to the core API redundant, and so then I might be able to remove those edits as well.

What do you think? Worth me revising this implementation with a new dataset?

@kcz358 (Collaborator) commented Apr 17, 2024

Yeah, you are right. The current implementation seems good to me; otherwise you would have to upload a new dataset to Hugging Face. If you feel creating a new dataset and uploading it is a better way, we can also invite you to our organization so you can upload the dataset there, since you have already written the script to process it. But there are already lots of datasets on the Hub, so processing the dataset before each evaluation still works for me as long as the speed is acceptable. I will review again and approve this PR. @Luodian, this LGTM.
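
For reference, if the pre-exploded route were ever taken, publishing the processed split could look roughly like this sketch; both repo ids are hypothetical:

from datasets import load_dataset

raw_split = load_dataset("your-org/RefCOCO", split="val")   # hypothetical source repo
processed = refcoco_bbox_rec_preprocess_dataset(raw_split)  # existing script from this PR
processed.push_to_hub("your-org/refcoco_rec_exploded")      # hypothetical destination repo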

Luodian pushed a commit that referenced this pull request Apr 16, 2024
* Revise GPT4V to allow interleaved image and text

* Use the first char as the answer for seedbench

* Save gpt eval's answer
@Luodian merged commit 1141765 into EvolvingLMMs-Lab:main on Apr 17, 2024 (1 check passed)
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
…ab#52)

* Revise GPT4V to allow interleaved image and text

* Use the first char as the answer for seedbench

* Save gpt eval's answer
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
Add REC tasks for testing model ability to locally ground objects, given a description. This adds REC for all RefCOCO datasets.