Improved prediction of drug-induced liver injury literature using natural language processing and machine learning methods
Challenge: The Critical Assessment of Massive Data Analysis (CAMDA) 2022, in collaboration with the Intelligent Systems for Molecular Biology (ISMB) conference, hosted the Literature AI for Drug Induced Liver Injury (DILI) challenge.
A pipeline of data analysis using natural language processing in conjunction with machine learning methods
The t-SNE visualization of the TF-IDF vectors obtained using (A) the title and abstract and (B) only the title of each publication
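
The figure above compares TF-IDF vectors built from the title plus abstract with vectors built from the title alone. As a rough illustration of what such vectors contain, here is a minimal standard-library sketch. It is not the notebook's code: the tokenizer, the smoothed IDF variant, the `tfidf_vectors` helper, and the toy records are all assumptions made for this example, and the t-SNE projection step is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (one common variant): ln((1 + N) / (1 + df)) + 1.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy records invented for illustration only (not challenge data).
records = [
    {"title": "Hepatotoxicity after drug exposure",
     "abstract": "Liver injury was observed in patients."},
    {"title": "Protein folding dynamics",
     "abstract": "Simulations of folding pathways are reported."},
]

# Variant (A): title and abstract; variant (B): title only.
title_and_abstract = tfidf_vectors(
    [r["title"] + " " + r["abstract"] for r in records]
)
title_only = tfidf_vectors([r["title"] for r in records])
```

Terms that occur only in the abstract (for example `liver` in the first toy record) appear in variant (A) but not in variant (B), which is the difference the two figure panels visualize.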
- Data for modeling
  - DILIPositive.tsv: DILI-related literature (title + abstract)
  - DILINegative.tsv: DILI-unrelated literature (title + abstract)
- External validation data
  - Requests to access data should be directed to the CAMDA Challenge: http://camda.info/.
- Code
  - CAMDA_word_frequency.ipynb: Generates the word frequency and t-SNE figures
  - CAMDA_word2vec+TFIDF.ipynb: Modeling and testing using DILIPositive.tsv and DILINegative.tsv
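
The two modeling files listed above can be combined into a single labeled dataset before any vectorization or training. The sketch below shows one way to do that with the standard library. The column names `Title` and `Abstract`, the `read_labeled` helper, and the in-memory sample rows are assumptions for illustration; check the real file headers before reusing it.

```python
import csv
import io


def read_labeled(handle, label, title_col="Title", abstract_col="Abstract"):
    """Yield (text, label) pairs from a tab-separated file handle.

    The column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Join title and abstract into one text field per publication.
        text = f"{row[title_col]} {row[abstract_col]}".strip()
        yield text, label


# In-memory stand-ins for the two files, invented for illustration only.
positive_tsv = io.StringIO(
    "Title\tAbstract\n"
    "Drug-induced hepatitis case\tA patient developed liver injury.\n"
)
negative_tsv = io.StringIO(
    "Title\tAbstract\n"
    "Soil microbiome survey\tSamples were sequenced.\n"
)

# Label 1 marks DILI-related literature, label 0 marks unrelated literature.
dataset = list(read_labeled(positive_tsv, 1)) + list(read_labeled(negative_tsv, 0))
texts = [text for text, _ in dataset]
labels = [label for _, label in dataset]
```

With the real data, the `io.StringIO` objects would be replaced by file handles such as `open("DILIPositive.tsv", newline="", encoding="utf-8")`, assuming the files are UTF-8 encoded.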