Add difficulty #35

Merged · 7 commits · Jun 10, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ old_data/*
src/image2structure/compilation/webpage/test_data/valid_repo/_site/feed.xml
tmp/*
credentials/*
experimental/*

# Byte-compiled / optimized / DLL files
__pycache__/
79 changes: 75 additions & 4 deletions README.md
@@ -1,10 +1,19 @@
# Image2Structure - Data collection
# Image2Struct
[Paper](TODO) | [Website](https://crfm.stanford.edu/helm/image2structure/latest/) | Datasets ([Webpages](https://huggingface.co/datasets/stanford-crfm/i2s-webpage), [Latex](https://huggingface.co/datasets/stanford-crfm/i2s-latex), [Music sheets](https://huggingface.co/datasets/stanford-crfm/i2s-musicsheet)) | [Leaderboard](https://crfm.stanford.edu/helm/image2structure/latest/#/leaderboard) | [HELM repo](https://github.com/stanford-crfm/helm)

This repository contains the data collection for the Image2Structure project.
Welcome! The `image2struct` Python package contains the code used in the **Image2Struct: A Benchmark for Evaluating Vision-Language Models in Extracting Structured Information from Images** paper. This repo includes the following features:
* Data collection: scrapers, filters, compilers, and uploaders for the different data types (Latex, Webpages, MusicSheets) from public sources (arXiv, GitHub, IMSLP, ...)
* Dataset upload: upload the datasets to the Hugging Face Datasets Hub
* Wild data collection: collection of screenshots from webpages specified by a predetermined list of URLs, and formatting of equation screenshots of your choice.

This repo **does not** contain:
* The evaluation code, which is available in the [HELM repo](https://github.com/stanford-crfm/helm).

## Installation
To install the package, you can use pip:
To install the package, you can use `pip` and `conda`:

conda create -n image2struct python=3.9.18 -y
conda activate image2struct
pip install -e ".[all]"

Some formats require additional dependencies. To install all dependencies, use:
@@ -14,7 +23,69 @@
Finally, create a `.env` file by copying the `.env.example` file and filling in the required values.


# Contributing
## Usage

### Data collection

You can run `image2structure-collect` to collect data from different sources. For example, to collect data from GitHub Pages:

image2structure-collect --num_instances 300 --num_instances_at_once 50 --max_instances_per_date 40 --date_from 2024-01-01 --date_to 2024-02-20 --timeout 30 --destination_path data webpage --language css --port 4000 --max_size_kb 100

The general arguments are:
* `--num_instances`: the number of instances to collect
* `--num_instances_at_once`: the number of instances to collect at once. Each call to the scraper asks the underlying API (here the GitHub Developer API) for at most `num_instances_at_once` instances, which helps avoid hitting rate limits.
* `--max_instances_per_date`: the maximum number of instances to collect for a single date, so that no single date is over-represented.
* `--date_from`: the starting date to collect instances from.
* `--date_to`: the ending date to collect instances from.
* `--timeout`: the timeout in seconds for each instance collection.
* `--destination_path`: the path to save the collected data to.

Then you can add specific arguments for the data type you want to collect. To do so, simply append the data type, here `webpage`, followed by the data-specific arguments. The data-specific arguments are listed in the `src/image2struct/run_specs.py` file.

The script saves the collected data to the specified destination path using the following layout:

output_path
├── subcategory1
│   ├── assets
│   ├── images
│   │   ├── uuid1.png
│   │   ├── uuid2.png
│   │   └── ...
│   ├── metadata
│   │   ├── uuid1.json
│   │   ├── uuid2.json
│   │   └── ...
│   ├── structures  # Depends on the data type
│   │   ├── uuid1.{tex,tar.gz,...}
│   │   ├── uuid2.{tex,tar.gz,...}
│   │   └── ...
│   └── (text)  # Depends on the data type
│       ├── uuid1.txt
│       ├── uuid2.txt
│       └── ...
├── subcategory2
└── ...
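
For reference, here is a minimal sketch (not part of the package) of how this layout can be walked downstream to pair each screenshot with its metadata file; the `data` root simply mirrors the `--destination_path` used in the example above, and no particular metadata keys are assumed:

```python
import json
from pathlib import Path

output_path = Path("data")  # the --destination_path used during collection

for subcategory in sorted(p for p in output_path.iterdir() if p.is_dir()):
    images_dir = subcategory / "images"
    metadata_dir = subcategory / "metadata"
    if not images_dir.is_dir() or not metadata_dir.is_dir():
        continue
    for image_file in sorted(images_dir.glob("*.png")):
        metadata_file = metadata_dir / f"{image_file.stem}.json"
        if not metadata_file.exists():
            continue  # skip instances whose metadata was not written
        metadata = json.loads(metadata_file.read_text())
        print(subcategory.name, image_file.name, sorted(metadata.keys()))
```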


### Upload datasets

Once you have collected some datasets, you can upload them to the Hugging Face Datasets Hub. For example, to upload the latex dataset:

image2structure-upload --data-path data/latex --dataset-name stanford-crfm/i2s-latex --max-instances 50

This will upload the dataset to the Hugging Face Datasets Hub under the `stanford-crfm/i2s-latex` dataset name. The `--max-instances` argument specifies the maximum number of instances to upload, and the `--data-path` argument specifies the path to the dataset files. These files should follow the format output by the collection scripts.
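
Once pushed, a dataset can be read back with the `datasets` library. Below is a small sketch, assuming the `difficulty` column added in this PR ends up in the uploaded features; the upload script pushes one configuration per collected subcategory (`config_name=category`) and a single `validation` split, and `wild` is used here as the example category:

```python
from datasets import load_dataset

# Load one subcategory ("wild") of the uploaded dataset.
dataset = load_dataset("stanford-crfm/i2s-latex", name="wild", split="validation")
print(dataset[0]["difficulty"])  # "easy", "medium", or "hard" (always "hard" for wild data)
```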


### Wild data collection

There are two scripts to build the wild datasets: `src/image2struct/wild/wild_latex.py` and `src/image2struct/wild/wild_webpage.py`. Run them to format the data (for the `wild_latex` script you need to collect equation screenshots manually, while `wild_webpage` takes screenshots of websites by itself):

python src/image2struct/wild/wild_webpage.py
python src/image2struct/wild/wild_latex.py

You can then upload the datasets to the Hugging Face Datasets Hub as explained above.

## Contributing
To contribute to this project, first install the development dependencies:

pip install -e ".[dev]"
2 changes: 1 addition & 1 deletion src/image2structure/compilation/music_compiler.py
@@ -84,7 +84,7 @@ def generate_sheet_image(
print(
f"Success: Extracted page {page_number} from {pdf_path} as an image."
)
except (RuntimeError, PDFPageCountError) as e:
except (RuntimeError, PDFPageCountError, Image.DecompressionBombError) as e:
if self._verbose:
print(f"Skipping: Error generating image from {pdf_path}: {e}")
return False, image
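# Note: Pillow raises Image.DecompressionBombError when a decoded page exceeds
# twice Image.MAX_IMAGE_PIXELS, which very large scanned scores can do; catching
# it alongside the other errors lets the compiler skip such pages instead of crashing.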
111 changes: 65 additions & 46 deletions src/image2structure/fetch/imslp_fetcher.py
@@ -4,7 +4,12 @@
from mwclient.page import Page
from mwclient.image import Image

from image2structure.fetch.fetcher import Fetcher, ScrapeResult, DownloadError
from image2structure.fetch.fetcher import (
Fetcher,
ScrapeResult,
DownloadError,
ScrapeError,
)


import requests
@@ -49,47 +54,57 @@ def fetch_images_metadata(page: mwclient.page.Page) -> list:

images = []

for f in page.images():

f_title = f.base_title
f_esc_title = urllib.parse.quote(f_title.replace(" ", "_"))

# Hacky way of finding the relevant metadata
t1 = s.find(attrs={"href": "/wiki/File:{}".format(f_esc_title)})
t2 = s.find(attrs={"title": "File:{}".format(f_title)})

if t1 is None and t2 is None:
continue

t = t1 or t2
if t.text.strip() == "":
continue

page_count = None
m = IMSLP_REGEXP_PAGE_COUNT.search(t.parent.text)
if m is not None:
try:
page_count = int(m.group(1))
except ValueError:
pass

file_id = int(t.text.replace("#", ""))

# Fix image URL
if f.imageinfo["url"][0] == "/":
# URL is //imslp.org/stuff...
f.imageinfo["url"] = "http:" + f.imageinfo["url"]

images.append(
{
"id": file_id,
"title": f_title,
"url": f.imageinfo["url"],
"page_count": page_count,
"size": f.imageinfo.get("size"),
"obj": f,
}
)
try:
for f in page.images():

f_title = f.base_title
f_esc_title = urllib.parse.quote(f_title.replace(" ", "_"))

# Hacky way of finding the relevant metadata
t1 = s.find(attrs={"href": "/wiki/File:{}".format(f_esc_title)})
t2 = s.find(attrs={"title": "File:{}".format(f_title)})

if t1 is None and t2 is None:
continue

t = t1 or t2
if t.text.strip() == "":
continue

page_count = None
m = IMSLP_REGEXP_PAGE_COUNT.search(t.parent.text)
if m is not None:
try:
page_count = int(m.group(1))
except ValueError:
pass

file_id = int(t.text.replace("#", ""))

# Fix image URL
if f.imageinfo["url"][0] == "/":
# URL is //imslp.org/stuff...
f.imageinfo["url"] = "http:" + f.imageinfo["url"]

images.append(
{
"id": file_id,
"title": f_title,
"url": f.imageinfo["url"],
"page_count": page_count,
"size": f.imageinfo.get("size"),
"obj": f,
}
)
except requests.exceptions.ReadTimeout as e:
print(f"Read timeout: {e}")
raise ScrapeError(f"Read timeout: {e}")
except requests.exceptions.ConnectionError as e:
print(f"Connection error: {e}")
raise ScrapeError(f"Connection error: {e}")
except requests.exceptions.RequestException as e:
print(f"Request exception: {e}")
raise ScrapeError(f"Request exception: {e}")

return images

@@ -234,7 +249,11 @@ def download(self, download_path: str, scrape_result: ScrapeResult) -> None:
):
raise DownloadError("No metadata or invalid metadata in the scrape result.")

image: Image = scrape_result.additional_info["metadata"]["obj"]
file_path: str = os.path.join(download_path, scrape_result.instance_name)
with open(file_path, "wb") as file:
image.download(file)
try:
image: Image = scrape_result.additional_info["metadata"]["obj"]
file_path: str = os.path.join(download_path, scrape_result.instance_name)
with open(file_path, "wb") as file:
image.download(file)
except Exception as e:
print(f"Error downloading {scrape_result.instance_name}: {e}")
raise DownloadError(f"Error downloading {scrape_result.instance_name}: {e}")
92 changes: 78 additions & 14 deletions src/image2structure/upload.py
@@ -1,7 +1,6 @@
from typing import Any, Dict, List
from tqdm import tqdm
from datasets import Dataset
from sklearn.model_selection import train_test_split
from datasets import DatasetDict, Features, Value, Image as HFImage, Sequence

import argparse
@@ -10,6 +9,7 @@
from PIL import Image
import base64
import pandas as pd
import numpy as np
import json
import imagehash

@@ -56,6 +56,65 @@ def transform(row: dict) -> dict:
return row


def classify_difficulty(dataset, data_type: str, wild_data: bool = False):
"""
Classify the difficulty of the instances in the dataset.
- 1/3 of the instances are easy
- 1/3 of the instances are medium
- 1/3 of the instances are hard

Args:
dataset: The dataset to classify, expected to be an iterable of dictionaries.
data_type: The type of data to classify (e.g., webpage, latex).
wild_data: If True, skip the length-based thresholds and label every instance as hard.

Returns:
The dataset with the difficulty classified.
"""
if not wild_data:
if data_type == "latex":
lengths = [len(item["text"]) for item in dataset]
elif data_type == "musicsheet":
lengths = []
for item in tqdm(dataset, desc="Computing difficulty"):
with Image.open(io.BytesIO(item["image"]["bytes"])) as img:
img_array = np.array(img)
# Assuming the image is grayscale; update this if it's not
black_pixels = np.sum(img_array < np.max(img_array) / 4.0)
lengths.append(black_pixels)
elif data_type == "webpage":
lengths = [
int(json.loads(item["file_filters"])["RepoFilter"]["num_lines"]["code"])
+ int(
json.loads(item["file_filters"])["RepoFilter"]["num_lines"]["style"]
)
for item in dataset
]
else:
raise ValueError(f"Unknown data type: {data_type}")

# Sort lengths and find thresholds
lengths_sorted = sorted(lengths)
easy_threshold = lengths_sorted[len(lengths) // 3]
medium_threshold = lengths_sorted[(len(lengths) // 3) * 2]

# Assign difficulty based on thresholds
# Add "difficulty" to the columns of the dataset
df = pd.DataFrame(dataset)
if wild_data:
df["difficulty"] = "hard"
else:
df["length"] = lengths
df["difficulty"] = "easy"
df.loc[
(df["difficulty"] == "easy") & (df["length"] > easy_threshold), "difficulty"
] = "medium"
df.loc[
(df["difficulty"] == "medium") & (df["length"] > medium_threshold),
"difficulty",
] = "hard"
return Dataset.from_pandas(df)
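# Worked example of the thresholding above (illustrative numbers only): with proxy
# lengths [5, 10, 25, 40, 60, 90], len(lengths) // 3 == 2, so easy_threshold ==
# lengths_sorted[2] == 25 and medium_threshold == lengths_sorted[4] == 60; instances
# with length <= 25 stay "easy", 25 < length <= 60 become "medium", and
# length > 60 become "hard".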


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Upload collected data to huggingface")
parser.add_argument(
@@ -82,6 +141,8 @@ def parse_args() -> argparse.Namespace:
def main():
args = parse_args()

data_type: str = os.path.basename(args.data_path)
print(f"\nUploading {data_type} dataset...")
for category in os.listdir(args.data_path):
print(f"\nUploading {category} dataset...")
data_path: str = os.path.join(args.data_path, category)
@@ -123,12 +184,12 @@ def main():
# Load the structure
df: pd.DataFrame = pd.DataFrame()
structure_set = set()
file_names: List[str] = os.listdir(structure_path)
file_names: List[str] = os.listdir(image_path)
image_set = set()
for i in tqdm(range(num_data_points), desc="Loading data"):
try:
values = {}
file_name: str = file_names[i].replace(extension, "")
file_name: str = file_names[i].replace(".png", "")

if has_structure:
structure_file = os.path.join(
@@ -167,8 +228,11 @@ def main():
continue

# Remove duplicates
# Only check the structure
df = df.drop_duplicates(subset=["structure"])
# Only check the structure if present, otherwise check the image (path)
if has_structure:
df = df.drop_duplicates(subset=["structure"])
else:
df = df.drop_duplicates(subset=["image"])

# Limit the number of instances
if args.max_instances > 0:
@@ -178,18 +242,19 @@ def main():
df = df.sample(frac=1)
df = df.head(args.max_instances)

# Split the dataset
valid_df, test_df = train_test_split(df, test_size=0.2)
valid_dataset = Dataset.from_pandas(valid_df).map(transform).shuffle()
test_dataset = Dataset.from_pandas(test_df).map(transform).shuffle()
valid_dataset = Dataset.from_pandas(df).map(transform).shuffle()

# Classify the difficulty of the instances
valid_dataset = classify_difficulty(
valid_dataset, data_type, category == "wild"
)
# valid_dataset = Dataset.from_pandas(df)
# Print first 5 instances

# Remove the '__index_level_0__' column from the datasets
if "__index_level_0__" in valid_dataset.column_names:
print("Removing __index_level_0__")
valid_dataset = valid_dataset.remove_columns("__index_level_0__")
if "__index_level_0__" in test_dataset.column_names:
print("Removing __index_level_0__")
test_dataset = test_dataset.remove_columns("__index_level_0__")

# Define the features of the dataset
features_dict = {
@@ -199,10 +264,9 @@ def main():
features_dict["assets"] = Sequence(Value("string"))
features = Features(features_dict)
valid_dataset = valid_dataset.cast(features)
test_dataset = test_dataset.cast(features)

# Push the dataset to the hub
dataset_dict = DatasetDict({"validation": valid_dataset, "test": test_dataset})
dataset_dict = DatasetDict({"validation": valid_dataset})
dataset_dict.push_to_hub(args.dataset_name, config_name=category)


Empty file.