---
output:
  md_document:
    variant: markdown_github
---
# Some convenient extensions to MeSH
> This repository provides two sets of MeSH extensions. The first is the MeSH ontology in simple data frame formats: `Descriptor Terms`, `Supplementary Concept Terms`, and `MeSH Tree Structures`. In addition, `Pharmacological Actions` have been extracted from both the descriptor and supplementary concept files and aggregated into a single data frame. The R code for the XML extraction and restructuring is available [here](https://github.com/jaytimm/mesh-builds/blob/main/descriptor-records-trees.Rmd). The second set consists of descriptor-level word embeddings made available by [Noh & Kavuluru (2021)](https://www.sciencedirect.com/science/article/pii/S1532046421001969).

> These files are used by the R package `pubmedtk`.
## MeSH ontology
> Based on MeSH files `desc2024`, `mtrees2024` & `supp2024`.
### `descriptor-terms`
[Raw data](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/)
```{r}
readRDS('data/data_mesh_thesaurus.rds') |>
head() |> knitr::kable()
```
### `descriptor-tree-numbers`
[Raw data](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/meshtrees/)
```{r}
readRDS('data/data_mesh_trees.rds') |>
head() |> knitr::kable()
```
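
MeSH tree numbers are dot-delimited, so every leading prefix of a tree number identifies a broader node in the hierarchy. The sketch below shows one way to expand a tree number into its ancestor prefixes using base R only. The helper `tree_ancestors()` is hypothetical and not part of this repository, and the tree number is just an illustrative string rather than a value read from `data_mesh_trees.rds`.

```{r}
# Hypothetical helper (not part of this repository): return a tree number
# together with all of its ancestor prefixes, broadest first.
tree_ancestors <- function(tree_number) {
  # Split on literal dots, e.g. "C04.557.337" -> "C04", "557", "337"
  parts <- strsplit(tree_number, ".", fixed = TRUE)[[1]]
  # Re-join the first i segments for i = 1, 2, ..., length(parts)
  vapply(
    seq_along(parts),
    function(i) paste(parts[seq_len(i)], collapse = "."),
    character(1)
  )
}

tree_ancestors("C04.557.337")
```

Joining these prefixes back onto a tree-number column is one way to find the parent terms of a descriptor, assuming the data frame stores tree numbers in this dotted form.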
### `supplemental-terms`
[Raw data](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/)
```{r}
readRDS('data/data_scr_thesaurus.rds') |>
head() |> knitr::kable()
```
### Pharmacological Actions
> For drugs included in both MeSH-proper and Supplementary Concept Records.
```{r}
readRDS('data/data_pharm_action.rds') |>
head() |> knitr::kable()
```
### Notes & useful links:
* [MeSH XML data elements](https://www.nlm.nih.gov/mesh/xml_data_elements.html)
* XML elements as [RDF](https://id.nlm.nih.gov/mesh/D000001.html)
* [Utility functions](https://github.com/scienceai/mesh-tree)
* Supplementary Concept Records

> 'SCR records are created for some chemicals, drugs, and other concepts
such as rare diseases. They are labeled as "MeSH Supplementary Concept
Data" and the unique ID begins with the letter "C."'

> 'Supplementary Concept Records - these are not full MeSH Headings
and do not fall under the MeSH tree hierarchy. Many times they are used
to identify substances that are not included in the MeSH terms.'

> 'These do not belong to the controlled vocabulary as such and are not used for indexing MEDLINE articles; instead they enlarge the thesaurus and contain links to the closest fitting descriptor to be used in a MEDLINE search. Many of these records describe chemical substances.'
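
As the first quote notes, SCR unique IDs begin with "C", while descriptor IDs such as `D000001` begin with "D". A minimal sketch of telling the two apart by prefix follows. The function `record_type()` is hypothetical, and every ID other than `D000001` is made up for illustration.

```{r}
# Hypothetical helper: label a MeSH unique ID by its leading letter.
record_type <- function(ui) {
  prefix <- substr(ui, 1, 1)
  ifelse(
    prefix == "D", "descriptor",
    ifelse(prefix == "C", "supplementary concept", NA_character_)
  )
}

# "C000001" and "X123" are invented IDs, used only to exercise each branch.
record_type(c("D000001", "C000001", "X123"))
```

Anything that matches neither prefix is returned as `NA` rather than guessed at.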
## Descriptor embeddings
[BERT-CRel: Improved Biomedical Word Embeddings in the Transformer Era](https://zenodo.org/record/4383195#.Y1wDBb7MJhE) (Zenodo, December 21, 2020; open access)
> "BERT-CRel is a transformer model for fine-tuning biomedical word embeddings that are jointly learned along with concept embeddings using a pre-training phase with fastText and a fine-tuning phase with a transformer setup. The goal is to provide high quality pre-trained biomedical embeddings that can be used in any `downstream task` by the research community. The corpus used for BERT-CRel contains biomedical citations from PubMed and the concepts are from the Medical Subject Headings (MeSH codes) terminology used to index citations."
```{r eval=FALSE}
readRDS('data/data_scr_embeddings.rds')
readRDS('data/data_mesh_embeddings.rds')
```
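
A common use of embeddings like these is comparing concepts by cosine similarity. The sketch below assumes the embeddings can be arranged as a numeric matrix with one row per concept. The actual structure of the `.rds` files is not shown here, so it uses a small invented matrix. `cosine_sim()` is a hypothetical helper, not a function from `pubmedtk`.

```{r}
# Toy stand-in for an embedding matrix: one row per concept, one column per
# dimension. Values and row names are invented for illustration only.
emb <- rbind(
  concept_a = c(1, 0, 1),
  concept_b = c(1, 0, 1),
  concept_c = c(0, 1, 0)
)

# Hypothetical helper: cosine similarity between every pair of rows.
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  tcrossprod(unit)                # dot products between the unit rows
}

sims <- cosine_sim(emb)
round(sims["concept_a", ], 2)
```

With real embeddings, sorting one row of `sims` in decreasing order gives the nearest neighbours of that concept. For a large vocabulary it may be cheaper to compare one query vector against the matrix than to build the full pairwise table.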