GitHub - PGSmall/clip-pgs: Official code for CVPR2025 "Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection"

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

The text input is processed by the text encoder, while the image undergoes our patch generation-to-selection strategy before entering the image encoder. The loss subsequently aligns the visual and textual embeddings, strengthening cross-modal representation alignment.
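The alignment loss mentioned above can be illustrated with the symmetric contrastive objective commonly used for CLIP-style training. The following is a minimal NumPy sketch, not the repository's training code: the function name `clip_contrastive_loss` and the fixed `temperature` value are assumptions made for illustration (CLIP-style models typically learn the temperature).

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a pair.
    temperature: fixed here for simplicity (an assumption of this sketch).
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Correctly paired embeddings should give a lower loss than mismatched ones, which is the signal that pulls the visual and textual representations together during pre-training.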

Abstract

The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP’s training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection (CLIP-PGS) to enhance CLIP’s training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and language compositionality benchmarks.
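Two building blocks named in the abstract, Sobel edge detection and optimal transport normalization of a similarity matrix, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`sobel_edge_magnitude`, `patch_edge_scores`, `sinkhorn_normalize`), the per-patch mean aggregation, and the use of Sinkhorn-Knopp iterations as the normalization are all choices made here for illustration. How CLIP-PGS combines these pieces to pick the final mask is defined by the code in this repository.

```python
import numpy as np

def sobel_edge_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale image using 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(float), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 filter response by shifting the padded image.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def patch_edge_scores(gray, patch):
    """Mean edge magnitude per non-overlapping patch, as a (H//patch, W//patch) grid."""
    edges = sobel_edge_magnitude(gray)
    gh, gw = edges.shape[0] // patch, edges.shape[1] // patch
    cropped = edges[:gh * patch, :gw * patch]
    return cropped.reshape(gh, patch, gw, patch).mean(axis=(1, 3))

def sinkhorn_normalize(sim, n_iters=50, eps=1e-8):
    """Balance a similarity matrix by alternating row and column normalization."""
    m = np.exp(sim - sim.max())  # make all entries strictly positive
    for _ in range(n_iters):
        m = m / (m.sum(axis=1, keepdims=True) + eps)  # rows sum to 1
        m = m / (m.sum(axis=0, keepdims=True) + eps)  # columns sum to 1
    return m
```

Patches with a high edge score are more likely to cover object boundaries, so a masking strategy that wants to keep the primary object would avoid dropping them. The Sinkhorn step keeps any single patch from dominating the similarity matrix, in line with the abstract's stated goal of a balanced matrix.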

Visual comparison of masking strategies: random masking, cluster-based masking, and our proposed CLIP-PGS.

Performance comparison of vision-language pre-training models, such as CLIP, FLIP, A-CLIP, E-CLIP, and the proposed CLIP-PGS, evaluated across three critical dimensions: (a) generalizability, (b) robustness, and (c) compositionality.

1. Pre-training

Refer to OpenCLIP for instructions on installation and data downloads.

# CLIP-PGS-0.5 and CLIP-PGS-0.3
bash clip-train.sh

2. Downstream Evaluation Tasks

Kindly ensure that clip-benchmark is installed before running the script below.

# Step 1: install clip-benchmark
pip install clip-benchmark

Download the models, including CLIP-PGS-0.5 and CLIP-PGS-0.3.

# Step 2: Zero-Shot Classification, Zero-Shot Retrieval, Linear Probing Classification, Robustness Assessment, and Language Compositionality
bash clip-test.sh

3. Quantitative Results

Zero-Shot Classification

Zero-Shot Retrieval

Linear Probing Classification

Robustness Assessment

Language Compositionality

4. Qualitative Results

Zero-Shot Retrieval (Text-to-Image)

Zero-Shot Retrieval (Image-to-Text)

Acknowledgement

This repository is based on CLIP and OpenCLIP.

