
Difficulties during reproduction. [Errors in ./script/get_emb_test.sh] #2

Open
liboqiao1234 opened this issue Mar 10, 2025 · 11 comments


@liboqiao1234

I tried to reproduce the experiment following the instructions in README.md. However, several problems occurred, and I doubt whether anyone has actually checked or run the whole process.

Below are a few of the problems I have met so far:

  1. requirements.txt references local packages, which makes pip install -r impossible.
  2. I tried to download the dataset "whalezzz/M2RAG" from Hugging Face (following the instruction in the README: 'Second, you can either directly download and use M2RAG, or follow the instructions in 'data/data_preprocess' to build it step by step.'). But when I run 'get_embed_test.sh', Task 2 fails with: FileNotFoundError: [Errno 2] No such file or directory: '../data/raw_data/webqa/all_imgs.json'. Incidentally, downloading the dataset was painful because the files are not compressed.
  3. Also in get_embed_test.sh: during Task 3 (fact_verify), the program tries to open 'M2RAG/data/m2rag/val_images/3385.jpg' and fails. I think the path should be '../data/m2rag/fact_verify/val_images/3385.jpg'.
  4. There are other problems as well (including wrong paths, missing comment symbols, ...).

Could someone fix these problems, or tell me how to work around them?
Thanks.

@whale-z
Collaborator

whale-z commented Mar 10, 2025

Thank you very much for raising these issues. We will further optimize and fix the scripts based on your feedback.

Regarding your first issue, the requirements.txt file provides all the necessary packages for our runtime environment. However, for certain specific packages, you may need to install them via GitHub manually.
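
For example, a rough workaround (a sketch only, not the exact commands for this repo; the repository URL below is a placeholder) is to filter the local-path entries out of requirements.txt, install the rest with pip, and then install the GitHub-hosted packages by hand:

# Hypothetical workaround: skip local-path entries, install everything else,
# then install GitHub-hosted packages manually (placeholder URL).
grep -vE '^(-e )?(\.|/|file://)' requirements.txt > requirements_remote.txt
pip install -r requirements_remote.txt
pip install "git+https://github.com/ORG/REPO.git@main"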

For the second and third issues, we sincerely apologize for the inconvenience caused by the incorrect paths. Specifically:
• In Task 2, the path '../data/raw_data/webqa/all_imgs.json' should be replaced with '../data/m2rag/mmqa/all_imgs.json'.
• In Task 3, as you correctly pointed out, the path should be corrected to '../data/m2rag/fact_verify/val_images/3385.jpg'.
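
For the Task 2 path, a quick local patch until the fix lands (a sketch only; it assumes the old string appears literally in script/get_embed_test.sh) would be:

# Hypothetical one-liner: point Task 2 at the m2rag copy of all_imgs.json.
sed -i 's|../data/raw_data/webqa/all_imgs.json|../data/m2rag/mmqa/all_imgs.json|g' script/get_embed_test.sh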

We apologize once again for the inconvenience. We plan to fix all these issues within this week and look forward to receiving more feedback from you.

@liboqiao1234
Author

liboqiao1234 commented Mar 11, 2025

Thanks for your prompt response.

The same problem also occurs in Task 4 of get_embed_test.sh (same as the second issue).
For the third issue, I realized that the path is wrong, but I don't know how to correct it. Could you tell me how to fix it?

Thank you in advance for your help.

@whale-z
Collaborator

whale-z commented Mar 11, 2025

Thank you very much for bringing up the issues.

Regarding the issues raised in Task 3 and Task 4, we have corrected the relevant parts in the get_embed_test.sh file.

For the image path issue in Task 3, please re-download the file and replace it accordingly.

If there are any other issues, feel free to contact us. We will also recheck and update the code soon.

@liboqiao1234
Author

Thanks! That works fine.

And I think there is still a problem in the Task 3 cands (candidate images) part.

FileNotFoundError: [Errno 2] No such file or directory: 'M2RAG/data/m2rag/fact_verify/cand_images/image_0.jpg'
There are only cand_images_[i] subfolders in the fact_verify data folder.

@whale-z
Collaborator

whale-z commented Mar 11, 2025

Due to the file-count limitation on HuggingFace, we split the cand_images folder into several subfolders. You can process the data using the following script and then rerun get_embed_test.sh.
bash data/m2rag/merge_data.sh
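
In essence, the split subfolders just need to be moved back into a single cand_images directory. If you prefer to do it by hand, a rough, unofficial sketch (assuming the subfolders are named cand_images_0, cand_images_1, ... and live under data/m2rag/fact_verify) would be:

# Hypothetical manual merge: collect cand_images_* subfolders into one cand_images folder.
cd data/m2rag/fact_verify
mkdir -p cand_images
for d in cand_images_*/ ; do mv "$d"* cand_images/ ; done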

@liboqiao1234
Author

It seems that I have finished get_embed_test.sh, and there is supposed to be a .pkl file named webqa_mmqa_query_text_test_embedding.pkl. But there isn't, which makes retrieval_test.sh fail.

@whale-z
Collaborator

whale-z commented Mar 13, 2025

Hi, we’ve fixed the bug. You can rerun it now. If the issue persists, please provide clear runtime logs or error messages to help us resolve it.

@liboqiao1234
Author

When building raw_data in minicpmv_datasets_modules_for_ppl.py, it constructs a raw_data entry like "raw_data: {'id': 'test_780', 'image': '../output/retrieval/image_rerank/webqa_image_rerank_test_retrieval_images_5/image_30323269.png', 'conversations': [{'role': 'user', 'content': ''}, {'role': 'assistant', 'content': 'Building the Buddha - panoramio'}]}", but there is no such file. This id comes from the .trec file.

Should I change img_id to id?

for line in fin:
    json_line = json.loads(line)
    id = str(json_line['id'])
    caption = json_line['caption']
    for img_id in qid_2_candids[id]:
        if 'webqa' in image_path:
            image = os.path.join(image_path, f'image_{img_id}.png') # here
        one = {
            'id': id,
            'image': image,
            'conversations': [
                {
                    'role': 'user',
                    'content': '<image>'
                },
                {
                    'role': 'assistant',
                    'content': caption
                }
            ]
        }
        data.append(one)

@namingsohard
Collaborator

In the minicpmv_datasets_modules_for_ppl.py file, we use the .trec format to link test queries with their corresponding retrieved image indices. Therefore, there's no need to change img_id to id because id represents the query identifier.
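
For reference, each line of a standard TREC run file ties a query id to one retrieved document (image) id, together with a rank and a score; the exact columns in this repo's .trec files may differ slightly, and the score and run tag below are made up (the ids are the ones from your example):

<query_id> Q0 <image_id> <rank> <score> <run_tag>
test_780 Q0 30323269 1 0.8732 visualbge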

However, in the branch used for the image rerank task (src/get_retrieval_multi_data.py), an accumulating index named image_index overwrote the original document identifier (did in src/get_retrieval_multi_data.py, corresponding to img_id in minicpmv_datasets_modules_for_ppl.py). This caused a mismatch that prevented the .trec files from properly associating queries with their retrieved images. We have fixed this bug by modifying the relevant branch in src/get_retrieval_multi_data.py.

You can now rerun it with the updated code.

@liboqiao1234
Author

I noticed that there is a TODO in the new 'construct_finetune_data.sh' file, which says:

# TODO: not trec file, should be jsonl file

Indeed, an error occurred when using the .trec file as input.

@namingsohard
Collaborator

namingsohard commented Mar 16, 2025

Sorry for our mistake. We have just updated the script construct_finetune_data.sh: the path should be changed from the .trec file to the .jsonl file, where the latter is the output of the get_retrieval_multi_data.py process.

# old (incorrect):
# --mmqa_retrieve_data output/retrieval/mmqa/webqa_mmqa_query_train_visualbge_5_multi.trec
# new (correct):
--mmqa_retrieve_data output/retrieval/mmqa/webqa_mmqa_train_retrieval_multi_5.jsonl
