HICRIC: A Dataset of Law, Policy, and Regulatory Guidance for Health Insurance Coverage Understanding

Health Insurance Coverage Rules Interpretation Corpus (HICRIC) is a curated collection of reputable legal and medical text designed to support applications that require understanding of U.S. health insurance coverage rules.

The corpus comprises documents from six categories: law, regulatory guidance, coverage rules, policy opinion, case descriptions, and clinical guidelines. It is primarily intended for use in pretraining language models and as a knowledge base for retrieval applications.

The corpus was designed with a specific eye toward supporting patients in pursuing appeals of inappropriate health insurance denials (see the intro of this article for a primer on appeals). For example, we use the corpus to train our appeal letter generators. Toward the same end, we introduced an appeal outcome adjudication task and constructed a benchmark dataset to support appeal outcome forecasting. This benchmark and all related bootstrapping and training code are also being released with this data.

Dataset Breakdown

Corpus

Each document in our corpus comes equipped with a set of plain-text tags. In constructing the data we formulated a particular privileged set of partitioning tags: these are a set of tags with the property that each document in the dataset is associated with exactly one tag in the set, and none of the tags are unused.

The tags are the following:

  • legal

  • regulatory-guidance

  • contract-coverage-rule-medical-policy

  • opinion-policy-summary

  • case-description

  • clinical-guidelines
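
The partition property described above can be checked mechanically. The following is a minimal sketch, assuming each document's metadata is available as a dict with a `tags` list; the helper names `partition_tag` and `check_partition` are hypothetical and are not part of this repository.

```python
from collections import Counter

# The six partitioning tags listed above.
PARTITION_TAGS = frozenset({
    "legal",
    "regulatory-guidance",
    "contract-coverage-rule-medical-policy",
    "opinion-policy-summary",
    "case-description",
    "clinical-guidelines",
})


def partition_tag(tags):
    """Return the single partitioning tag in `tags`, or raise ValueError."""
    matches = PARTITION_TAGS.intersection(tags)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one partitioning tag, found {sorted(matches)}"
        )
    return next(iter(matches))


def check_partition(records):
    """Check the partition property over an iterable of metadata dicts.

    Returns a Counter of documents per partitioning tag. Raises ValueError if
    a record has zero or several partitioning tags, or if some tag is unused.
    """
    counts = Counter(partition_tag(record["tags"]) for record in records)
    unused = PARTITION_TAGS - set(counts)
    if unused:
        raise ValueError(f"unused partitioning tags: {sorted(unused)}")
    return counts
```

Other tags, such as source-specific ones, are ignored by the check, so a document may carry any number of them alongside its one partitioning tag.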

In addition to this set of partitioning tags, we introduce another privileged tag:

  • kb: This tag indicates that a document is suitable for use in a knowledge base.

    This is a subjective determination, but the intent is to label text that comes from a reputable, definitive source. For example, a summary of Medicaid rules as stated by an employee of HHS during congressional testimony would not be labeled with the kb tag, because such testimony is not the definitive source for the ground truth of such rules. On the other hand, federal law describing those same rules would be labeled with the kb tag.

A high-level characterization of the distribution of text in our corpus in terms of these privileged tags is shown in the table below.

| Category | Num Documents | Words | Chars | Size (GB) |
| --- | --- | --- | --- | --- |
| All Partition Parts | 8,310 | 417,617,646 | 2,699,256,987 | 2.81 |
| kb | 1,434 | 170,717,368 | 1,120,961,295 | 1.13 |
| legal | 335 | 92,357,802 | 596,044,008 | 0.60 |
| regulatory-guidance | 1,110 | 5,536,585 | 38,607,587 | 0.04 |
| contract-coverage-rule-medical-policy | 7 | 196,156,813 | 1,228,184,524 | 1.31 |
| opinion-policy-summary | 2,094 | 19,462,399 | 133,049,956 | 0.14 |
| case-description | 2,629 | 214,267,074 | 1,351,074,791 | 1.45 |
| clinical-guidelines | 2,150 | 81,955,020 | 553,041,990 | 0.56 |

Adjudication Benchmark

In addition to our unlabeled corpus, we are releasing a v0 benchmark for an external appeal outcome prediction task.

External Appeal Adjudication Task Formulation

Given a description of a denial, predict whether an external appeal would result in overturn, uphold, or whether the description is insufficient.

The benchmark consists of (Background Context, External Appeal Outcome, Sufficiency Label) triples. The background context is brief, non-leaking context extracted from real adjudication summaries. The outcomes are actual binding case determinations made by independent medical reviewers. The sufficiency label is a binary pseudo-label indicating the extent to which the background context is sufficient to make an informed prediction about the expected case outcome.

Here is an example from the benchmark:

Please see our forthcoming paper for more details about how and why this benchmark was constructed.

Using the Data

Access

Corpus

The corpus can be found on Huggingface:

https://huggingface.co/datasets/Persius/hicric

To download the corpus for the purpose of using it with this code, use our script:

python download_corpus_hf.py

Case Adjudications

The dataset can be found on Huggingface:

https://huggingface.co/datasets/Persius/imr-appeals

To download the dataset for use with this code, use our script:

python download_adjudications_hf.py

Redistribution

Please consult the licenses for all source data for yourself if you plan to redistribute any of it. To the best of our knowledge, our redistributions abide by all such licenses.

Risks

We believe there are numerous risks associated with our released data, which we've done our best to mitigate. Our main concerns involve:

  • Potential for Propagation of Bias
  • Potential for Misuse

Please see our forthcoming paper for a thorough discussion of these perceived risks.

Limitations

There are many limitations associated with our released data, and we advise weighing them carefully to inform responsible and effective use. The main categories of limitation are:

  • Task Shortcomings
  • Simplicity of the Benchmark
  • Corpus Deficiencies

Please see our forthcoming paper for a thorough discussion of these limitations.

Using the Code

To use our downloaders, processors, or text generation scripts, or to reproduce the unlabeled dataset in full, proceed as follows:

Setup Environment

python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt

Reproduce Unlabeled Corpus

Download Raw Sources + Produce Source Metadata

python download_raw.py

Process Sources + Produce Processed Metadata

python process_local.py

Train Outcome Predictor

To train outcome predictors, you need to either:

  • Follow the steps above to download the dataset from Huggingface, using the custom scripts.
  • Reproduce the entire unlabeled corpus using our scrapers, as described above.

Train pseudo-labeling models

# Train background span selector
export WANDB_API_KEY=your-api-key # only necessary if using wandb, as specified in default config
python -m src.modeling.train_background_token_classification --config_path="src/modeling/config/background_token_classification/default.yaml"
# Train sufficiency classifier
export WANDB_API_KEY=your-api-key # only necessary if using wandb, as specified in default config
python -m src.modeling.train_sufficiency_classifier --config_path="src/modeling/config/sufficiency_classification/default.yaml"

Note: the models above will use the manual annotations in ./data/annotated/case-backgrounds.jsonl

# Use models above to extract background spans and label with 3-class pseudo-label
python -m src.modeling.background_extraction

Train Outcome Prediction Models

By Only Finetuning a Pretrained Model on the Benchmark

# Use (background, outcome) pairs to train outcome predictor (from HF pretrained model)
export WANDB_API_KEY=your-api-key # only necessary if using wandb, as specified in default config
python -m src.modeling.train_outcome_predictor --config_path="src/modeling/config/outcome_prediction/distilbert.yaml"

By Pretraining on HICRIC, then Finetuning on the Benchmark

# Pretrain a BERT-type model on HICRIC via MLM
export WANDB_API_KEY=your-api-key # only necessary if using wandb, as specified in default config
python -m src.modeling.pretrain --config_path="src/modeling/config/pretrain/distilbert.yaml"

# Then use (background, outcome) pairs to train outcome predictor (from HICRIC pretrained variant)
export WANDB_API_KEY=your-api-key # only necessary if using wandb, as specified in default config
python -m src.modeling.train_outcome_predictor --config_path="src/modeling/config/outcome_prediction/distilbert_hicric_pretrained.yaml"

Generate Alignment Data for Supervised Fine Tuning

python generate_sft_alignment_data.py

Repository Organization

We separate the source metadata and download utilities from the processed metadata and processing utilities to keep these unrelated concerns modular. This organization supports redownloading all of the raw data or, independently, re-processing only a subset of it, for example the data scraped from PDFs using an alternate PDF processing pipeline.

Downloaders

Each data source in the dataset has an associated downloader, housed in src/downloaders. Each downloader is a function with the signature:

def download(output_dir: str, source_meta_path: str) -> None:
    pass

The role of the function is to download or scrape raw data from a source, write it to disk in the specified output_dir, and write a piece of metadata that points to the downloaded artifact in a jsonl file (the file located at source_meta_path). The nature of the metadata is described in the following section.
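
As an illustration of this contract, here is a minimal sketch of a downloader. It writes a small in-memory payload instead of fetching anything, so it runs offline; the helper `record_source`, the file name `example.txt`, and the placeholder URL are hypothetical and do not correspond to any downloader in src/downloaders. The metadata keys follow the source metadata records described in the next section.

```python
import hashlib
import json
import os
from datetime import date


def record_source(source_meta_path, url, local_path, tags, preprocessor=None):
    """Append one source metadata record for a file already saved at local_path."""
    with open(local_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    record = {
        "url": url,
        "date_accessed": date.today().isoformat(),  # YYYY-MM-DD
        "local_path": local_path,
        "tags": list(tags),
        "preprocessor": preprocessor,  # None is serialized as JSON null
        "md5": md5,
    }
    # Append, so that several downloaders can share one metadata file.
    with open(source_meta_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def download(output_dir: str, source_meta_path: str) -> None:
    """Hypothetical downloader: stores a fixed payload instead of scraping."""
    os.makedirs(output_dir, exist_ok=True)
    local_path = os.path.join(output_dir, "example.txt")
    payload = b"Example raw text standing in for a real download.\n"
    with open(local_path, "wb") as f:
        f.write(payload)
    record_source(
        source_meta_path,
        url="https://example.org/example.txt",  # placeholder, not a real source
        local_path=local_path,
        tags=["legal"],
    )
```

A real downloader would replace the fixed payload with an HTTP request or scrape, while keeping the same two side effects: a file under output_dir and one appended metadata line.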

Source Metadata

The file sources.jsonl documents metadata pertaining to the raw sources ultimately used in this dataset. This means either file downloads, or scraped data (which requires a poetic license to deem "raw").

For example, the first line of sources.jsonl is:

{
    "url": "https://downloads.cms.gov/medicare-coverage-database/downloads/exports/ncd.zip",
    "date_accessed": "2024-01-17",
    "local_path": "./data/raw/medicare/ncd/ncd_csv.zip",
    "tags": ["medicare", "kb", "contract-coverage-rule-medical-policy"],
    "preprocessor": "medicare_cds",
    "md5": "39bb06a088e67aad89ee2ddcb26e03ba"
}

This is metadata that refers to a particular subset of the Medicare Coverage Database that was downloaded from a link on the page: https://www.cms.gov/medicare-coverage-database/downloads/downloads.aspx. Subsequently, that raw download was parsed and processed to produce text, but such further steps are beyond the purview of sources.jsonl.

Each source metadata record includes a few pieces of information, including the direct download url or url from which the data was acquired (as applicable), the date a download or scrape occurred, a relative local path to the downloaded data, plain-text tags associated with the data, and a plain-text key for a preprocessor with which the downloaded data can be converted to our standard processed format.

Metadata Description

| Name | Description | Definition | Required | Example Value |
| --- | --- | --- | --- | --- |
| url | Source Url | A source url from which the data was obtained. | Yes | https://downloads.cms.gov/medicare-coverage-database/downloads/exports/ncd.zip |
| date_accessed | Date of Access | The date at which the data was downloaded or scraped (YYYY-MM-DD). | Yes | 2024-01-17 |
| local_path | Local Path to Data | Relative path to local raw data download or scrape. | Yes | ./data/raw/medicare/ncd/ncd_csv.zip |
| tags | Source Tags | An array of plain-text tags that pertain to the raw data. | Yes (possibly empty) | ["medicare", "kb", "contract-coverage-rule-medical-policy"] |
| preprocessor | Processing Function Key | A key for a processor function that was used to transform the file at local_path to the expected standard format. | Yes | medicare_cds |
| md5 | MD5 Hash | MD5 hash of the file contents stored at local_path. | Yes | 39bb06a088e67aad89ee2ddcb26e03ba |

Note: Text for which no further processing is desired (e.g. because it was parsed and processed into the standard format at scrape time) has a preprocessor value of null in sources.jsonl.
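
Because every record carries a local path and an MD5 hash, a local copy of the raw data can be checked against the metadata. The sketch below is one way to do that under the field definitions in the table above; the function names are hypothetical and the repository may perform this check differently, or not at all.

```python
import hashlib
import json

# Keys marked as required in the metadata table above.
REQUIRED_KEYS = ("url", "date_accessed", "local_path", "tags", "preprocessor", "md5")


def load_source_records(source_meta_path):
    """Yield one dict per non-empty line of a sources.jsonl-style file."""
    with open(source_meta_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def check_source_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    try:
        with open(record["local_path"], "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
    except OSError:
        return [f"cannot read {record['local_path']}"]
    if digest != record["md5"]:
        problems.append(f"md5 mismatch for {record['local_path']}")
    return problems
```

A mismatch does not necessarily indicate corruption: a source may simply have been updated upstream since the recorded date_accessed.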

Processors

Each raw source item in sources.jsonl is labeled with a preprocessor key for an associated processor. Processors are housed in src/processors.

Each processor is a function with the signature:

def process(source_lineitem: dict, output_dirname: str) -> dict:
    pass

The role of the function is to accept a lineitem from the source metadata, process the raw file to which that metadata points to produce text data in our standard format, and then return an updated lineitem with metadata about the newly processed variant. The processor will write the processed copy of the data to disk in the directory specified by output_dirname. The detailed nature of the updated metadata returned by these functions is described in the following section.

Processed Metadata

The file processed_sources.jsonl documents metadata pertaining to the standardized constituents of our dataset, and how they were acquired from the raw records enumerated in sources.jsonl. For the most part, this updated metadata is the same as sources.jsonl. The main difference is that there are now file pointers pointing to the local, standardized, processed variants of the lineitems.

For example, the first line in processed_sources.jsonl corresponding to the source example above is:

{
    "url": "https://downloads.cms.gov/medicare-coverage-database/downloads/exports/ncd.zip",
    "date_accessed": "2024-01-17",
    "local_path": "./data/raw/medicare/ncd/ncd_csv.zip",
    "tags": ["medicare", "kb", "contract-coverage-rule-medical-policy"],
    "preprocessor": "medicare_cds",
    "md5": "39bb06a088e67aad89ee2ddcb26e03ba",
    "local_processed_path": "./data/processed/medicare/ncd/ncd.jsonl",
    "stats": {"size": 600852, "words": 84013, "chars": 583462}
}

Note here that the metadata is exactly the same, with the exception of two new fields: local_processed_path and stats.

| Name | Description | Definition | Required | Example Value |
| --- | --- | --- | --- | --- |
| local_processed_path | Local Path to Processed Data | Relative path to local processed data. | Yes | ./data/processed/medicare/ncd/ncd.jsonl |
| stats | Some basic stats about the text field of the processed file. | A dict with the total size (bytes), number of words, and number of chars in the text components of the processed jsonl file. | No | {"size": 600852, "words": 84013, "chars": 583462} |
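
To make the processor contract and the two added fields concrete, here is a minimal sketch of a processor for a hypothetical source consisting of a single UTF-8 text file. It is not one of the processors in src/processors. How the stats are counted is an assumption: size is taken as the UTF-8 byte length of the text fields, and words as whitespace-separated tokens.

```python
import json
import os


def text_stats(texts):
    """Compute stats over text fields (counting conventions are assumptions)."""
    return {
        "size": sum(len(t.encode("utf-8")) for t in texts),  # bytes
        "words": sum(len(t.split()) for t in texts),  # whitespace tokens
        "chars": sum(len(t) for t in texts),
    }


def process(source_lineitem: dict, output_dirname: str) -> dict:
    """Hypothetical processor: one raw UTF-8 text file becomes one jsonl line."""
    with open(source_lineitem["local_path"], encoding="utf-8") as f:
        texts = [f.read()]

    os.makedirs(output_dirname, exist_ok=True)
    stem = os.path.splitext(os.path.basename(source_lineitem["local_path"]))[0]
    processed_path = os.path.join(output_dirname, stem + ".jsonl")
    with open(processed_path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")

    # Copy the source metadata unchanged and add the two processed-only fields.
    updated = dict(source_lineitem)
    updated["local_processed_path"] = processed_path
    updated["stats"] = text_stats(texts)
    return updated
```

Returning a copy rather than mutating source_lineitem keeps the original source record intact, which matches the separation between sources.jsonl and processed_sources.jsonl.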

Standardized Processed Format

The actual standardized constituents of our dataset (rather than the metadata just described) are also jsonl files. Each jsonl file must satisfy one property: each json lineitem has a text key with the raw text data.

| Name | Description | Definition | Required | Example Value |
| --- | --- | --- | --- | --- |
| text | The text of the processed file. | The text of the processed file. | Yes | Summary Reviewer \n\n\nA 51-year-old female enrollee has requested reimbursement for Avastin provided on 12/16/14, 1/6/15, 1/27/15, 2/17/15, and 3/10/15. |
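
Since the only hard requirement is a text key on every line item, reading a processed file can double as validation. A minimal sketch follows; the function name `iter_texts` is hypothetical.

```python
import json


def iter_texts(processed_path):
    """Yield the text field of each line item, failing loudly on malformed lines."""
    with open(processed_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            item = json.loads(line)
            if not isinstance(item, dict) or not isinstance(item.get("text"), str):
                raise ValueError(
                    f"{processed_path}:{line_number}: missing string 'text' key"
                )
            yield item["text"]
```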

In addition to this minimal standardization requirement, those processed files corresponding to case summary data (as indicated by the case-description tag in their processed metadata) may optionally include the following additional fields in each of their json lineitems:

| Name | Description | Definition | Required | Example Value |
| --- | --- | --- | --- | --- |
| appeal_type | The Type of Appeal | String indicating the grounds on which the denial being appealed was made. Specificity level not yet standardized. | No | Medical Necessity |
| coverage_type | The Type of Coverage | String indicating the coverage type. Specificity level not yet standardized. | No | Commercial |
| diagnosis | The diagnosis or medical event in question. | String indicating the diagnosis or medical event. Specificity level not yet standardized. | No | Metastatic Cancer |
| treatment | The treatment or service in question. | String indicating the treatment or service. Specificity level not yet standardized. | No | Chemotherapy/ Cancer Medications |
| decision | The appeal outcome. | String indicating the appeal outcome. Not yet standardized. | No | Insurer Denial Overturned |
| appeal_expedited | Appeal expedited status. | Boolean indicating whether the appeal was expedited. | No | False |
| patient_race | The reported race of the patient. | The reported race of the patient. | No | Asian |
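
Because these fields are optional and their values are not yet standardized, it can help to tally the values that actually occur before relying on any of them. A small sketch, assuming the line items have already been parsed into dicts; `tally_field` is a hypothetical helper.

```python
from collections import Counter


def tally_field(lineitems, field="decision"):
    """Count the values of an optional field; absent fields are counted under None."""
    return Counter(item.get(field) for item in lineitems)
```

For example, `tally_field(items, "coverage_type")` shows how many line items omit the coverage type altogether, alongside the spellings that do appear.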

License

Curated Data

Annotations, Documentation, and Original Data

All original data, documentation, and media presented in this repository are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

See LICENSE.CC-BY-SA-4.0 for a full text copy of this license.

Code

All original source code in this repository, including that used to scrape and parse data, and to train models, is licensed under the Apache 2.0 License.

See LICENSE for a full text copy of this license.

Please start a discussion thread for any questions or concerns related to licensing.

Attribution

If you find this data useful in your work, please consider citing it.

In adhering to the attribution clause of the license governing our original data, documentation, and other media, you can attribute this work as "HICRIC Data", and share the url: https://github.com/TPAFS/hicric.

For the code, you can use the following citation:

@software{hicric,
  author = {Mike Gartner},
  title = {HICRIC: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding},
  year = {2024},
  url = {https://github.com/TPAFS/hicric}
}

Contact

For questions or comments, please reach out to info@persius.org