Keywords: Data augmentation, Neuro-Symbolic AI, NLP, LLM, UMLS
A Toolkit for Biomedical Text Augmentation
This Python package consists of a Neuro-Symbolic pipeline, blending knowledge-driven and data-driven approaches.
Knowledge-Based perturbation (knowledge-driven):
- Med-synonym replacement: Replaces medical terms with one of their formalized synonyms from structured domain knowledge (UMLS Metathesaurus).
- General synonym replacement: Replaces terms with one of their general-purpose synonyms from WordNet.
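The idea behind synonym replacement can be illustrated with a minimal sketch. This is not the package's implementation: the `TOY_SYNONYMS` dictionary and the `replace_synonyms` function are hypothetical names, and the dictionary merely stands in for a real UMLS Metathesaurus or WordNet lookup.

```python
import random
import re

# Hypothetical toy lexicon standing in for UMLS Metathesaurus or WordNet lookups.
TOY_SYNONYMS = {
    "lesions": ["abnormalities"],
    "vertebral": ["spinal"],
    "scans": ["images"],
}


def replace_synonyms(text, synonyms, rng=None):
    """Replace every word found in `synonyms` with one randomly chosen synonym."""
    rng = rng or random.Random()

    def substitute(match):
        word = match.group(0)
        candidates = synonyms.get(word.lower())
        # Words without an entry in the lexicon are returned unchanged.
        return rng.choice(candidates) if candidates else word

    # \w+ matches runs of word characters, so punctuation and spacing are preserved.
    return re.sub(r"\w+", substitute, text)


augmented = replace_synonyms(
    "No lytic lesions are observed at the vertebral levels.",
    TOY_SYNONYMS,
    random.Random(0),  # seeded generator for reproducible choices
)
print(augmented)
```

A real knowledge-based component would also need to detect multi-word medical terms and choose among several candidate synonyms, which this sketch does not attempt.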
Transformer-Based perturbation (data-driven):
- Back-translation: Translates text into an intermediate language and then back into the original language using multilingual MT models.
- Contextual word prediction: Fills in masked single-token words within the input text based on the in-context predictions from BERT-based language models.
- Rephrasing: Rewrites text using the capabilities of LLMs.
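To make the masking step of contextual word prediction concrete, here is a minimal sketch of how masked variants of a sentence could be prepared before a masked language model fills them in. The names `MASK_TOKEN` and `masked_variants` are hypothetical and not part of the package. The sketch uses word length as a crude filter and does not check the single-token condition, which depends on the model's tokenizer.

```python
import re

MASK_TOKEN = "[MASK]"  # BERT-style mask token; other models may use a different string


def masked_variants(text, min_len=4):
    """Yield (original_word, masked_text) pairs, masking one word at a time."""
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        if len(word) < min_len:
            continue  # skip short function words such as "No" or "of"
        yield word, text[:match.start()] + MASK_TOKEN + text[match.end():]


variants = list(masked_variants("No signs of listhesis."))
for word, masked in variants:
    print(f"{word!r} -> {masked}")
```

Each masked text could then be passed to a masked language model, whose top predictions for the mask position replace the original word.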
- Unified Medical Language System® (UMLS®) License:
  - Mandatory for using the Med-synonym replacement component.
  - Optional for the General synonym replacement component.
- LLM Functional Block:
  - A functional block with any preferred (open source or proprietary) LLM must be configured to use the Rephrasing component.
  - Alternatively, you can use the default gpt-4o-mini model by providing your personal API key.
- Make sure you have the latest version of pip installed:
  pip install --upgrade pip
- Install BiomedicalAugmentation-for-Text through pip:
  pip install --index-url https://test.pypi.org/simple/ --no-deps BiomedicalAugmentation-for-Text
The BiomedicalAugmentation-for-Text (BAT) package can be invoked in two ways; a minimal example of each is given here.
- Through the AugmentedSample class: A compact and streamlined interface that integrates all components into a cohesive workflow.
from bioTextAugPackage.init import *
import bioTextAugPackage.augmented_sample as aug_sample
config = Config()
input_text = "No lytic lesions are observed at the vertebral levels included in the scans. No signs of listhesis."
augmented_sample = aug_sample.AugmentedSample(config_params=config, technique_tag="TB-back_translation",
                                              src_data=input_text, src_lang="english", n_synth_data=5)
ans = augmented_sample.run()
- By invoking individual functions: Provides more control and flexibility to apply specific components independently.
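from bioTextAugPackage.init import *
# Explicit import of the Hugging Face classes used below (they may already be provided by the wildcard import above)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import bioTextAugPackage.transformer_based_functions as tb
import bioTextAugPackage.metrics as metrics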
config = Config()
input_text = "No lytic lesions are observed at the vertebral levels included in the scans. No signs of listhesis."
src_lang = "en"
trg_lang = "fr"
mt_model_name1 = f"Helsinki-NLP/opus-mt-{src_lang}-{trg_lang}"
mt_model1 = AutoModelForSeq2SeqLM.from_pretrained(mt_model_name1)
mt_tokenizer1 = AutoTokenizer.from_pretrained(mt_model_name1)
mt_model_name2 = f"Helsinki-NLP/opus-mt-{trg_lang}-{src_lang}"
mt_model2 = AutoModelForSeq2SeqLM.from_pretrained(mt_model_name2)
mt_tokenizer2 = AutoTokenizer.from_pretrained(mt_model_name2, clean_up_tokenization_spaces=True)
ans = tb.back_translation(src_text=input_text,
                          model1=mt_model1, tokenizer1=mt_tokenizer1,
                          model2=mt_model2, tokenizer2=mt_tokenizer2)
print(ans)
overlap_score = metrics.compute_overlap(synthetic_data=ans[0], src_data=input_text, tokenizer=config.base_tokenizer)
similarity_score = metrics.compute_similarity(synthetic_data=ans[0], src_data=input_text, se_model_name=config.se_model_name)
print(f"overlap_score: {overlap_score} - similarity_score: {similarity_score}")
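The exact definitions behind `compute_overlap` and `compute_similarity` are not given here. As an illustration only, one common way to quantify lexical overlap between a synthetic text and its source is the Jaccard index over word sets. The function `jaccard_overlap` below is a hypothetical stand-in and may differ from what the package computes.

```python
import re


def jaccard_overlap(synthetic, source):
    """Jaccard index between the lowercase word sets of two texts (0.0 to 1.0)."""
    a = set(re.findall(r"\w+", synthetic.lower()))
    b = set(re.findall(r"\w+", source.lower()))
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


score = jaccard_overlap("No signs of listhesis.", "No evidence of listhesis.")
print(score)
```

For augmentation, a low lexical overlap combined with a high semantic similarity generally indicates a sample that differs in wording while preserving meaning.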
A more extensive example, including advanced usage, can be found in this notebook.
- Project Link: https://github.com/bmi-labmedinfo/BAT
- Package Link: https://test.pypi.org/project/BiomedicalAugmentation-for-Text/
Distributed under the MIT License. See LICENSE.