Package for Biomedical Textual data Augmentation

bmi-labmedinfo/BAT

BAT - Biomedical Augmentation for Text

Keywords: Data augmentation, Neuro-Symbolic AI, NLP, LLM, UMLS


A Toolkit for Biomedical Text Augmentation

Package Overview

This Python package implements a Neuro-Symbolic pipeline for augmenting biomedical text, combining knowledge-driven and data-driven perturbation techniques.

Pipeline Components

Knowledge-Based perturbation (knowledge-driven):

  • Med-synonym replacement: Replaces medical terms with one of their formalized synonyms from structured domain knowledge (UMLS Metathesaurus).
  • General synonym replacement: Replaces terms with one of their general-purpose synonyms from WordNet.
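
The two knowledge-based components share the same core idea: look a term up in a synonym resource and substitute one of its synonyms. The following is a minimal sketch of that idea only, not BAT's implementation. The `TOY_SYNONYMS` dictionary and the `replace_synonyms` function are hypothetical names, and the hand-written dictionary stands in for a real UMLS Metathesaurus or WordNet lookup.

```python
import random
import re

# Hypothetical, hand-written stand-in for a UMLS/WordNet lookup.
# Keys are lowercase terms; values are candidate synonyms.
TOY_SYNONYMS = {
    "lytic lesions": ["osteolytic lesions"],
    "scans": ["images"],
}


def replace_synonyms(text, synonyms, rng=None):
    """Replace each whole-word occurrence of a known term with a random synonym."""
    if not synonyms:
        return text
    rng = rng or random.Random()
    # Try longer terms first so multi-word terms win over their substrings.
    terms = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def pick(match):
        # Look up the matched term case-insensitively and choose one synonym.
        return rng.choice(synonyms[match.group(0).lower()])

    return pattern.sub(pick, text)


if __name__ == "__main__":
    sentence = "No lytic lesions are observed at the vertebral levels included in the scans."
    print(replace_synonyms(sentence, TOY_SYNONYMS, random.Random(0)))
```

Passing a seeded `random.Random` makes the choice reproducible, which helps when several synthetic variants of the same sentence need to be regenerated.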

Transformer-Based perturbation (data-driven):

  • Back-translation: Translates text into an intermediate language and then back into the original language using multilingual MT models.
  • Contextual word prediction: Fills in masked single-token words within the input text based on the in-context predictions from BERT-based language models.
  • Rephrasing: Rewrites text using the capabilities of LLMs.
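
To make the contextual word prediction step more concrete, the sketch below shows only the input-preparation side: producing one masked copy of a sentence per word. It is an illustration under stated assumptions, not BAT's code. The function name `masked_variants` is hypothetical, `[MASK]` is assumed as the mask token (the one used by standard BERT tokenizers), and the sketch does not check whether a word maps to a single token of the model's vocabulary.

```python
import re

# A "word" here is any run of letters, digits, or underscores (an assumption of this sketch).
_WORD = re.compile(r"\w+")


def masked_variants(text, mask_token="[MASK]"):
    """Return one copy of `text` per word, with that word replaced by `mask_token`."""
    variants = []
    for match in _WORD.finditer(text):
        # Keep everything before and after the word, including punctuation.
        variants.append(text[:match.start()] + mask_token + text[match.end():])
    return variants


if __name__ == "__main__":
    for variant in masked_variants("No signs of listhesis."):
        print(variant)
```

Each masked sentence could then be passed to a masked language model (for example through the Hugging Face `fill-mask` pipeline) to obtain in-context replacements for the masked position.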

Requirements

  1. Unified Medical Language System® (UMLS®) License:
    • Mandatory for using the Med-synonym replacement component.
    • Optional for the General synonym replacement component.
  2. LLM Functional Block:
    • A functional block with any preferred (open source or proprietary) LLM must be configured to use the Rephrasing component.
    • Alternatively, you can use the default gpt-4o-mini model by providing your personal API key.

Installation

  1. Make sure you have the latest version of pip installed
    pip install --upgrade pip
  2. Install BiomedicalAugmentation-for-Text through pip
    pip install --index-url https://test.pypi.org/simple/ --no-deps BiomedicalAugmentation-for-Text
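
Because the command above uses `--no-deps`, pip installs only the package itself and skips its dependencies, so anything the package relies on presumably has to be installed separately. As a quick sanity check, the standard-library sketch below reports whether the distribution is visible to the current Python environment. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string if the install succeeded, otherwise None.
    print(installed_version("BiomedicalAugmentation-for-Text"))
```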

Usage

Once BiomedicalAugmentation-for-Text is installed, BAT can be used in two ways. A minimal example of each is shown below.

  1. Through the AugmentedSample class: A compact and streamlined interface that integrates all components into a cohesive workflow.
from bioTextAugPackage.init import *
import bioTextAugPackage.augmented_sample as aug_sample

config = Config()
input_text = "No lytic lesions are observed at the vertebral levels included in the scans. No signs of listhesis."
augmented_sample = aug_sample.AugmentedSample(config_params=config, technique_tag="TB-back_translation",
                                              src_data=input_text, src_lang="english", n_synth_data=5)
ans = augmented_sample.run()
  2. By invoking individual functions: Provides more control and flexibility to apply specific components independently.
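from bioTextAugPackage.init import *
# Hugging Face classes used below, imported explicitly for clarity
# (the wildcard import above may already provide them).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import bioTextAugPackage.transformer_based_functions as tb
import bioTextAugPackage.metrics as metrics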

config = Config()
input_text = "No lytic lesions are observed at the vertebral levels included in the scans. No signs of listhesis."

src_lang = "en"
trg_lang = "fr"

mt_model_name1 = f"Helsinki-NLP/opus-mt-{src_lang}-{trg_lang}"
mt_model1 = AutoModelForSeq2SeqLM.from_pretrained(mt_model_name1)
mt_tokenizer1 = AutoTokenizer.from_pretrained(mt_model_name1)

mt_model_name2 = f"Helsinki-NLP/opus-mt-{trg_lang}-{src_lang}"
mt_model2 = AutoModelForSeq2SeqLM.from_pretrained(mt_model_name2)
mt_tokenizer2 = AutoTokenizer.from_pretrained(mt_model_name2, clean_up_tokenization_spaces=True)

ans = tb.back_translation(src_text=input_text,
                          model1=mt_model1, tokenizer1=mt_tokenizer1,
                          model2=mt_model2, tokenizer2=mt_tokenizer2)

print(ans)
overlap_score = metrics.compute_overlap(synthetic_data=ans[0], src_data=input_text, tokenizer=config.base_tokenizer)
similarity_score = metrics.compute_similarity(synthetic_data=ans[0], src_data=input_text, se_model_name=config.se_model_name)
print(f"overlap_score: {overlap_score} - similarity_score: {similarity_score}")

A more extensive example, including advanced usage, can be found in this notebook.
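
The second example scores each synthetic sentence against its source with an overlap score and a similarity score. To give an intuition for what a lexical overlap score captures, here is a minimal sketch based on Jaccard similarity over lowercase whitespace tokens. This is an assumption made for illustration: it is not necessarily how `metrics.compute_overlap` is implemented, and the function name `jaccard_overlap` is hypothetical.

```python
def jaccard_overlap(synthetic, source):
    """Jaccard similarity between the lowercase whitespace-token sets of two texts."""
    a = set(synthetic.lower().split())
    b = set(source.lower().split())
    if not a and not b:
        # Two empty texts are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    src = "No signs of listhesis."
    syn = "There are no signs of listhesis."
    print(f"overlap: {jaccard_overlap(syn, src):.2f}")
```

In an augmentation setting, a very high lexical overlap suggests the synthetic text barely differs from the source, while a semantic similarity score is what indicates whether the meaning was preserved.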

Contacts and Useful Links

License

Distributed under the MIT License. See LICENSE.
