[datasets] read labels in utf-8 and log input string on vocab error (m…
eikaramba authored Feb 23, 2024
1 parent dd1fbbe commit d547ef9
Showing 3 changed files with 6 additions and 3 deletions.
2 changes: 1 addition & 1 deletion doctr/datasets/recognition.py
@@ -37,7 +37,7 @@ def __init__(
         super().__init__(img_folder, **kwargs)
 
         self.data: List[Tuple[str, str]] = []
-        with open(labels_path) as f:
+        with open(labels_path, encoding="utf-8") as f:
             labels = json.load(f)
 
         for img_name, label in labels.items():
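
Reviewer note: without an explicit `encoding`, Python's `open()` uses the platform's locale encoding (cp1252 on many Windows setups, for example), so non-ASCII labels can be decoded incorrectly. A minimal sketch of the effect, not taken from docTR; the file name and label values are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical labels; the non-ASCII characters are what make the encoding matter.
labels = {"img_1.jpg": "café", "img_2.jpg": "Straße"}

with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = Path(tmp_dir) / "labels.json"
    # ensure_ascii=False writes the raw characters instead of \uXXXX escapes.
    labels_path.write_text(json.dumps(labels, ensure_ascii=False), encoding="utf-8")

    # Explicit encoding: the result does not depend on the platform default.
    with open(labels_path, encoding="utf-8") as f:
        loaded = json.load(f)

# What decoding the same UTF-8 bytes as cp1252 would have produced.
mojibake = "café".encode("utf-8").decode("cp1252")
```

Here `loaded` matches `labels` exactly, while `mojibake` shows the kind of corrupted label that would later fail the vocabulary lookup.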
5 changes: 4 additions & 1 deletion doctr/datasets/utils.py
@@ -78,7 +78,10 @@ def encode_string(
     try:
         return list(map(vocab.index, input_string))
     except ValueError:
-        raise ValueError("some characters cannot be found in 'vocab'")
+        raise ValueError(
+            f"some characters cannot be found in 'vocab'. \
+            Please check the input string {input_string} and the vocabulary {vocab}"
+        )
 
 
 def decode_sequence(
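
Reviewer note: a backslash continuation inside a string literal keeps the next line's leading whitespace, so the raised message will contain a run of spaces after `'vocab'.`. A hedged sketch of one alternative, using implicit string concatenation and reporting only the offending characters. This is not the committed docTR code; the `missing` variable and the `!r` formatting are illustrative additions:

```python
from typing import List


def encode_string(input_string: str, vocab: str) -> List[int]:
    """Map each character of `input_string` to its index in `vocab`."""
    try:
        return list(map(vocab.index, input_string))
    except ValueError:
        # Collect the characters that are absent from the vocabulary (illustrative addition).
        missing = sorted(set(input_string) - set(vocab))
        raise ValueError(
            f"some characters cannot be found in 'vocab': {missing}. "
            f"Please check the input string {input_string!r} and the vocabulary {vocab!r}"
        ) from None
```

Using `!r` makes whitespace in the input visible in the message, which helps with labels that contain spaces.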
2 changes: 1 addition & 1 deletion references/recognition/README.md
@@ -75,7 +75,7 @@ The order of entries in the json does not matter.
 }
 ```
 
-When typing your labels, be aware that the VOCAB doesn't handle spaces.
+When typing your labels, be aware that the VOCAB doesn't handle spaces. Also make sure your `labels.json` file is using UTF-8 encoding.
 
 ## Advanced options
 
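
Reviewer note: both README caveats (spaces and UTF-8) can be checked before training starts. A minimal sketch under stated assumptions: `check_labels` is a hypothetical helper, not part of docTR, and the vocabulary is whatever string the caller passes in.

```python
import json
from typing import Dict, List


def check_labels(labels_path: str, vocab: str) -> Dict[str, List[str]]:
    """Return, per image name, the label characters missing from `vocab`."""
    # A file that is not valid UTF-8 raises UnicodeDecodeError here.
    with open(labels_path, encoding="utf-8") as f:
        labels = json.load(f)

    problems = {}
    for img_name, label in labels.items():
        missing = sorted(set(label) - set(vocab))
        if missing:
            problems[img_name] = missing
    return problems
```

An empty result means every label can be encoded with the given vocabulary; a label containing a space shows up with `" "` in its list when the vocabulary has no space character.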
