Add Staccato tokenizer and retokenize method #3642

Open · wants to merge 3 commits into master
Conversation

alanakbik (Collaborator)

This PR adds a new tokenizer following the ideas outlined in #3631.

The tokenizer works like any other, but is intended to give a reasonable, if somewhat trigger-happy, tokenization of any language. Here is an illustration of how this works for different languages (a sketch of the underlying splitting idea follows the examples):

# English sentence tokenized with the Staccato tokenizer
from flair.data import Sentence
from flair.tokenization import StaccatoTokenizer

print("Testing English with numbers and punctuation:")
sentence = Sentence("It's 03-16-2025", use_tokenizer=StaccatoTokenizer())
desired_tokenization = ["It", "'", "s", "03", "-", "16", "-", "2025"]

for token, desired_token in zip(sentence, desired_tokenization):
    print(f"Got: '{token.text}', Expected: '{desired_token}'")
    assert token.text == desired_token
print("✓ English test passed")


# Test for Cyrillic (Russian)
print("\nTesting Cyrillic (Russian):")
russian_sentence = Sentence("Привет, мир! Это тест 123.", use_tokenizer=StaccatoTokenizer())
russian_desired = ["Привет", ",", "мир", "!", "Это", "тест", "123", "."]

for token, desired_token in zip(russian_sentence, russian_desired):
    print(f"Got: '{token.text}', Expected: '{desired_token}'")
    assert token.text == desired_token
print("✓ Russian test passed")


# Test for Chinese with Kanji/Hanzi
print("\nTesting Chinese with Kanji/Hanzi:")
chinese_sentence = Sentence("你好,世界!123", use_tokenizer=StaccatoTokenizer())
chinese_desired = ["你", "好", ",", "世", "界", "!", "123"]

for token, desired_token in zip(chinese_sentence, chinese_desired):
    print(f"Got: '{token.text}', Expected: '{desired_token}'")
    assert token.text == desired_token
print("✓ Chinese test passed")


# Test for Japanese (mix of Kanji, Hiragana, Katakana)
print("\nTesting Japanese (mixed scripts):")
japanese_sentence = Sentence("こんにちは世界!テスト123", use_tokenizer=StaccatoTokenizer())
japanese_desired = ["こんにちは", "世", "界", "!", "テスト", "123"]

for token, desired_token in zip(japanese_sentence, japanese_desired):
    print(f"Got: '{token.text}', Expected: '{desired_token}'")
    assert token.text == desired_token
print("✓ Japanese test passed")


# Test for Arabic
print("\nTesting Arabic:")
arabic_sentence = Sentence("مرحبا بالعالم! 123", use_tokenizer=StaccatoTokenizer())
arabic_desired = ["مرحبا", "بالعالم", "!", "123"]

for token, desired_token in zip(arabic_sentence, arabic_desired):
    print(f"Got: '{token.text}', Expected: '{desired_token}'")
    assert token.text == desired_token
print("✓ Arabic test passed")

To make this tokenizer easy to test, this PR also adds a retokenize method to the Sentence object. Use it like this:

from flair.data import Sentence
from flair.tokenization import StaccatoTokenizer

# Create a sentence with default tokenization
sentence = Sentence("01-03-2025 New York")

# Add span labels
sentence.get_span(1, 3).add_label('ner', "LOC")
sentence.get_span(0, 1).add_label('ner', "DATE")

# Retokenize with a different tokenizer while preserving labels
sentence.retokenize(StaccatoTokenizer())
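
After the call, the sentence is tokenized by the new tokenizer while the span labels carry over. A quick check of the expected behavior (get_spans is flair's existing span accessor):

# The spans should still carry their labels after retokenization;
# only the underlying token boundaries change.
for span in sentence.get_spans('ner'):
    print(span.text, span.get_label('ner').value)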

Further, it adds a typename property to Label for convenience, allowing direct access to the label's type name.
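
Expected usage, assuming the property simply exposes the label's type string:

# With the new property, the label type is available directly on each label:
for label in sentence.get_labels():
    print(label.typename, label.value)  # e.g. "ner DATE", "ner LOC"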

@alanakbik alanakbik changed the title Gh 3631 staccato tokenizer Add Staccato tokenizer and retokenize method Mar 17, 2025