CONDEC-912: Implement data balancing for the text classifier (#446)
* Balance ground truth data for the binary and fine-grained classifiers using random undersampling, split lists of knowledge elements for k-fold cross-validation in such a way that the knowledge type is equally distributed
* Add data and evaluation results of NLP4RE'21 workshop
* Set window.onbeforeunload null to prevent warnings when leaving a page
* Add more parts of text to default training data
* Update version to 2.3.2
Showing 21 changed files with 2,049 additions and 119 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Automatic Text Classification/Rationale Identification | ||
|
||
The ConDec Jira plug-in offers a feature that automatically classifies text either as relevant decision knowledge elements or as irrelevant. | ||
The text classifier consists of a binary and a fine-grained classifier. | ||
|
||
## Ground Truth Data | ||
Ground truth data is needed to train and evaluate the text classifier. | ||
ConDec installs two default training files: [one rather small one](https://github.com/cures-hub/cures-condec-jira/tree/master/src/main/resources/classifier/defaultTrainingData.csv) and one with the data used for the NLP4RE'21 workshop. | ||
|
||
To reproduce the results from the **NLP4RE'21 workshop**, follow these steps: | ||
- Install [version 2.3.2](https://github.com/cures-hub/cures-condec-jira/releases/tag/v2.3.2) of the ConDec Jira plug-in and activate the plug-in for a Jira project. | ||
- Navigate to the text classification settings page (see section below). | ||
- Choose the training file [CONDEC-NLP4RE2021.csv](https://github.com/cures-hub/cures-condec-jira/tree/master/src/main/resources/classifier/CONDEC-NLP4RE2021.csv). | ||
- Set the machine-learning algorithm to Logistic Regression for both the binary and fine-grained classifiers. | ||
- Run 10-fold cross-validation (you need to set k to 10). | ||
- ConDec writes the evaluation results to a text file. The output file should be similar to [evaluation-results-CONDEC-NLP4RE2021-LR-10fold](https://github.com/cures-hub/cures-condec-jira/raw/master/doc/features/evaluation-results-CONDEC-NLP4RE2021-LR-10fold.txt). The results may differ slightly because the training data is balanced by random undersampling. | ||
|
||
Basic descriptive statistics on ground truth files can be calculated using the R file [training-data-analysis.r](https://github.com/cures-hub/cures-condec-jira/raw/master/doc/features/training-data-analysis.r). | ||
|
||
## Activation and Configuration | ||
The text classifier can be trained and evaluated directly in Jira. | ||
|
||
![Configuration view for the automatic text classifier](https://github.com/cures-hub/cures-condec-jira/raw/master/doc/screenshots/config_automatic_text_classification.png) | ||
*Configuration view for the automatic text classifier* |
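
The commit message and the note on random undersampling describe two steps: balancing the ground truth so that every knowledge type is equally frequent, and splitting it into k folds that each have the same type distribution. The following is a minimal Python sketch of that idea, not ConDec's actual implementation (the class names in the evaluation output suggest the plug-in relies on the Java library Smile). The function names `undersample` and `stratified_folds`, the toy data, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def group_by_label(items, label_of):
    """Group items into lists keyed by their label."""
    groups = defaultdict(list)
    for item in items:
        groups[label_of(item)].append(item)
    return groups


def undersample(items, label_of, rng):
    """Randomly drop items until every label is as frequent as the rarest one."""
    groups = group_by_label(items, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    return balanced


def stratified_folds(items, label_of, k, rng):
    """Shuffle each label group and deal it round-robin into k folds."""
    groups = group_by_label(items, label_of)
    folds = [[] for _ in range(k)]
    for label in sorted(groups):
        group = list(groups[label])
        rng.shuffle(group)
        for index, item in enumerate(group):
            folds[index % k].append(item)
    return folds


# Hypothetical, imbalanced toy data: (text, knowledge type) pairs.
data = (
    [(f"issue {i}", "Issue") for i in range(30)]
    + [(f"decision {i}", "Decision") for i in range(20)]
    + [(f"pro {i}", "Pro") for i in range(10)]
)


def label_of(item):
    return item[1]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
balanced = undersample(data, label_of, rng)
folds = stratified_folds(balanced, label_of, 5, rng)

for number, fold in enumerate(folds, start=1):
    print(number, dict(Counter(label_of(item) for item in fold)))
```

Because the items that are dropped are chosen at random, two runs without a fixed seed train on different subsets, which is consistent with the remark that reproduced results may differ slightly.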
doc/features/evaluation-results-CONDEC-NLP4RE2021-LR-10fold.txt (75 changes: 75 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
Ground truth file name: CONDEC-NLP4RE2021.csv | ||
Trained and evaluated using 10-fold cross-validation | ||
{Binary smile.classification.LogisticRegression$Binomial={ | ||
fit time: 565,977 ms, | ||
score time: 7,748 ms, | ||
validation data size: 41, | ||
error: 11, | ||
accuracy: 72,20%, | ||
sensitivity: 96,47%, | ||
specificity: 47,74%, | ||
precision: 64,91%, | ||
F1 score: 77,46%, | ||
MCC: 50,82% | ||
}, Fine-grained Overall smile.classification.LogisticRegression$Multinomial={ | ||
fit time: 37275,166 ms, | ||
score time: 80,785 ms, | ||
validation data size: 1080, | ||
error: 721, | ||
accuracy: 33,24% | ||
}, Fine-grained Alternative={ | ||
fit time: 37275,166 ms, | ||
score time: 80,785 ms, | ||
validation data size: 1080, | ||
error: 230, | ||
accuracy: 78,70%, | ||
sensitivity: 17,59%, | ||
specificity: 93,98%, | ||
precision: 42,22%, | ||
F1 score: 24,84%, | ||
MCC: 16,75% | ||
}, Fine-grained Pro={ | ||
fit time: 37275,166 ms, | ||
score time: 80,785 ms, | ||
validation data size: 1080, | ||
error: 210, | ||
accuracy: 80,56%, | ||
sensitivity: 11,11%, | ||
specificity: 97,92%, | ||
precision: 57,14%, | ||
F1 score: 18,60%, | ||
MCC: 18,68% | ||
}, Fine-grained Con={ | ||
fit time: 37275,166 ms, | ||
score time: 80,785 ms, | ||
validation data size: 1080, | ||
error: 215, | ||
accuracy: 80,09%, | ||
sensitivity: 12,50%, | ||
specificity: 96,99%, | ||
precision: 50,94%, | ||
F1 score: 20,07%, | ||
MCC: 17,57% | ||
}, Fine-grained Decision={ | ||
fit time: 37275,166 ms, | ||
score time: 80,785 ms, | ||
validation data size: 1080, | ||
error: 574, | ||
accuracy: 46,85%, | ||
sensitivity: 84,72%, | ||
specificity: 37,38%, | ||
precision: 25,28%, | ||
F1 score: 38,94%, | ||
MCC: 18,81% | ||
}, Fine-grained Issue={ | ||
fit time: 37275,166 ms, | ||
score time: 80,785 ms, | ||
validation data size: 1080, | ||
error: 213, | ||
accuracy: 80,28%, | ||
sensitivity: 40,28%, | ||
specificity: 90,28%, | ||
precision: 50,88%, | ||
F1 score: 44,96%, | ||
MCC: 33,48% | ||
}} |
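
For readers who want to relate the reported figures to each other, the sketch below computes accuracy, sensitivity, specificity, precision, F1 score, and MCC from a binary confusion matrix using their standard definitions. The confusion matrix itself is not part of the results file: the counts used here (87 true positives, 84 false positives, 780 true negatives, 129 false negatives) are an assumption, reconstructed so that they are consistent with the "Fine-grained Issue" entry under the further assumption that 216 of the 1080 validation items (one fifth) are issues. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics for one class treated as positive vs. all others.

    No guard against empty rows or columns: a zero denominator raises
    ZeroDivisionError.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "F1 score": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Assumed counts, chosen to be consistent with the "Fine-grained Issue" entry.
tp, fp, tn, fn = 87, 84, 780, 129
errors = fp + fn
issue = binary_metrics(tp, fp, tn, fn)

print(f"validation data size: {tp + fp + tn + fn}")
print(f"error: {errors}")
for name, value in issue.items():
    print(f"{name}: {value:.2%}")
```

The per-type entries treat one knowledge type as positive and the other four as negative, so high accuracy and specificity can coexist with low sensitivity, as in the Alternative, Pro, and Con entries.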
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
setwd("~/gits/paper/2021-nlp4re/evaluation/") | ||
|
||
trainingData <- read.csv("CONDEC-NLP4RE2021.csv") | ||
summary(trainingData) | ||
trainingData[880,] | ||
|
||
numIssues <- table(trainingData$isIssue)[2] # 392 | ||
numDecisions <- table(trainingData$isDecision)[2] # 332 | ||
numAlternatives <- table(trainingData$isAlternative)[2] # 218 | ||
numPros <- table(trainingData$isPro)[2] # 288 | ||
numCons <- table(trainingData$isCon)[2] # 238 | ||
|
||
numRelevant <- numIssues + numDecisions + numAlternatives + numPros + numCons # 1468 | ||
numIrrelevant <- nrow(trainingData) - numRelevant # 220 | ||
|
||
# get parts of text per type | ||
rowsWithIrrelevantText <- | ||
which( | ||
trainingData$isIssue == 0 & | ||
trainingData$isDecision == 0 & | ||
trainingData$isCon == 0 & | ||
trainingData$isPro == 0 & trainingData$isAlternative == 0 | ||
) | ||
trainingData[rowsWithIrrelevantText,] | ||
|
||
rowsWithIssues <- which(trainingData$isIssue == 1) | ||
trainingData[rowsWithIssues,] | ||
|
||
rowsWithDecisions <- which(trainingData$isDecision == 1) | ||
trainingData[rowsWithDecisions,] | ||
|
||
rowsWithAlternatives <- which(trainingData$isAlternative == 1) | ||
trainingData[rowsWithAlternatives,] | ||
|
||
rowsWithCons <- which(trainingData$isCon == 1) | ||
trainingData[rowsWithCons,] | ||
|
||
rowsWithPros <- which(trainingData$isPro == 1) | ||
trainingData[rowsWithPros,] |
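
The R script derives `numIrrelevant` by subtracting the sum of the per-type counts from the number of rows, which is only correct if no row carries more than one label. The Python sketch below counts all-zero rows directly and reports multi-label rows separately. The label column names are taken from the R script; the `sentence` column, the column order, and the sample rows are assumptions, since the layout of the real CSV file is not shown here.

```python
import csv
import io

# Label columns as referenced in the R script.
LABELS = ["isIssue", "isDecision", "isAlternative", "isPro", "isCon"]


def label_statistics(rows):
    """Count rows per label, rows without any label, and rows with several labels."""
    counts = {label: 0 for label in LABELS}
    irrelevant = 0
    multi_label = 0
    for row in rows:
        active = [label for label in LABELS if row[label] == "1"]
        for label in active:
            counts[label] += 1
        if not active:
            irrelevant += 1
        elif len(active) > 1:
            multi_label += 1
    return counts, irrelevant, multi_label


# Hypothetical sample in place of CONDEC-NLP4RE2021.csv.
SAMPLE = """isAlternative,isPro,isCon,isDecision,isIssue,sentence
0,0,0,0,1,How should the settings be stored?
0,0,0,1,0,We store the settings in the database.
1,0,0,0,0,We could store the settings in a file.
0,1,0,0,0,The database is already backed up.
0,0,1,0,0,A file would be harder to migrate.
0,0,0,0,0,Thanks for the quick review!
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts, irrelevant, multi_label = label_statistics(rows)

print(counts)
print("irrelevant:", irrelevant)
print("rows with more than one label:", multi_label)
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open("CONDEC-NLP4RE2021.csv", newline="")`, assuming the file uses the same 0/1 encoding.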