Lazy Tokenization in Flair #3641

Merged: 8 commits merged into master from GH-3635-lazy-tokenization on Mar 17, 2025

Conversation

@alanakbik (Collaborator) commented Mar 15, 2025

Closes #3635 ([Feature]: Lazy tokenization to speed up sentence-level inference tasks)

Flair previously tokenized every sentence immediately upon creation. However, for some tasks, such as text classification with transformers, this tokenization is unnecessary and causes memory overhead and speed bottlenecks.

This PR implements lazy tokenization for Flair sentences to improve efficiency, particularly for long texts. Sentences are now only tokenized when necessary, rather than immediately upon creation.

Key Changes:

  • Added _is_tokenized() method to check tokenization state
  • Modified transformer embeddings to preserve lazy tokenization state
  • Added tests to verify tokenization behavior
  • Improved code clarity around tokenization states
  • Added a new .truncate() method to Sentence that shortens an existing sentence to a maximum number of tokens (see the sketch after this list)
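
A minimal sketch of the new .truncate() method in use (only the maximum-token-count behavior is taken from the description above; the exact signature is otherwise an assumption):

from flair.data import Sentence

sentence = Sentence("The quick brown fox jumps over the lazy dog")
sentence.truncate(5)  # keep at most the first 5 tokens (signature assumed)
print(len(sentence))  # Output: 5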

Example to illustrate how lazy tokenization works:

from flair.data import Sentence

# Create a sentence - no tokenization happens yet
sentence = Sentence("This is a long text that won't be tokenized immediately")
print(sentence) # Printing a sentence does not require tokenization
print(sentence._is_tokenized())  # Output: False

# Tokenization happens only when needed
len(sentence)  # This triggers tokenization since length is computed using the number of tokens
print(sentence._is_tokenized())  # Output: True
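
For intuition, the lazy behavior can be sketched with a simplified, self-contained class (whitespace splitting stands in for Flair's real tokenizer; this is an illustration of the pattern, not Flair's actual implementation):

class LazySentence:
    def __init__(self, text: str):
        self.text = text
        self._tokens = None  # nothing is tokenized at construction time

    def _is_tokenized(self) -> bool:
        return self._tokens is not None

    def _tokenize(self):
        if self._tokens is None:
            self._tokens = self.text.split()  # stand-in for a real tokenizer

    def __len__(self):
        self._tokenize()  # computing the token count forces tokenization
        return len(self._tokens)

    def __str__(self):
        return self.text  # printing never touches the tokens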

Lazy tokenization gives speed improvements when, for example, predicting sentiment over many long texts. The script used to test this:

from flair.models import TextClassifier
from flair.data import Sentence
import time

# Load classifier once
classifier = TextClassifier.load("sentiment")

# Test data
text = "a " * 500
n_sentences = 5000

# Test 1: Without forced tokenization
start_time = time.time()
sentences_1 = [Sentence(text) for _ in range(n_sentences)]
assert all(not s._is_tokenized() for s in sentences_1), "Sentences should not be tokenized initially"
for sentence in sentences_1:
    classifier.predict(sentence)
time_1 = time.time() - start_time
print(f"Time without forced tokenization: {time_1:.2f} seconds")
assert all(not s._is_tokenized() for s in sentences_1), "Sentences should NOT be tokenized after prediction"


# Test 2: With forced tokenization
start_time = time.time()
sentences_2 = [Sentence(text) for _ in range(n_sentences)]
for sentence in sentences_2:
    sentence._tokenize()  # Force tokenization
assert all(s._is_tokenized() for s in sentences_2), "Sentences should be tokenized after len()"
for sentence in sentences_2:
    classifier.predict(sentence)
time_2 = time.time() - start_time
print(f"Time with forced tokenization: {time_2:.2f} seconds")
assert all(s._is_tokenized() for s in sentences_2), "Sentences should still be tokenized after prediction"


print(f"Difference: {abs(time_1 - time_2):.2f} seconds")
print(f"Relative speed difference: {((time_1 - time_2) / time_1 * 100):.1f}%")

Which prints:

Time without forced tokenization: 26.40 seconds
Time with forced tokenization: 40.02 seconds
Difference: 13.62 seconds
Relative speed difference: -51.6%
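
In other words, forcing tokenization up front made this benchmark about 52% slower (13.62 s on top of 26.40 s) than leaving the sentences in their lazy state.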

@alanakbik merged commit 39ec21e into master on Mar 17, 2025 (2 checks passed)
@alanakbik deleted the GH-3635-lazy-tokenization branch on March 17, 2025 at 01:35