CONDEC-912: Implement data balancing for the text classifier (#446)
* Balance the ground truth data for the binary and fine-grained classifiers using random undersampling; split the lists of knowledge elements for k-fold cross-validation so that the knowledge types are equally distributed across folds

* Add data and evaluation results of NLP4RE'21 workshop

* Set window.onbeforeunload to null to prevent warnings when leaving a page

* Add more parts of text to default training data

* Update version to 2.3.2
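The balancing technique named in the first bullet can be sketched as follows. This is a minimal illustration of random undersampling, not the plug-in's actual `GroundTruthData` implementation; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UndersamplingSketch {

	// Random undersampling: cut a class down to targetSize elements. With
	// isRandom = true the list is shuffled first, so a different random
	// subset survives on every run (which is why repeated evaluations can
	// yield slightly different scores).
	public static <T> List<T> undersample(List<T> elements, int targetSize, boolean isRandom) {
		List<T> copy = new ArrayList<>(elements);
		if (isRandom) {
			Collections.shuffle(copy);
		}
		return copy.subList(0, targetSize);
	}

	public static void main(String[] args) {
		List<String> relevant = List.of("issue A", "decision B", "pro C", "con D");
		List<String> irrelevant = List.of("greeting", "signature");
		// Balance both classes to the size of the minority class.
		int min = Math.min(relevant.size(), irrelevant.size());
		System.out.println(undersample(relevant, min, true).size()); // prints 2
	}
}
```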
kleebaum authored Mar 22, 2021
1 parent e633bf4 commit 9fb03a4
Showing 21 changed files with 2,049 additions and 119 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -9,7 +9,6 @@ bin/
*target/
.DS_Store
.vscode/
src/main/resources/classifier/

# Files
.classpath
@@ -21,4 +20,5 @@ config.properties
preview_standalone.*
pngLaTeX.*
tikz-uml.sty
*.model
*.model
src/main/resources/classifier/TEST*
2 changes: 1 addition & 1 deletion README.md
@@ -72,7 +72,7 @@ The [project setting page](https://github.com/cures-hub/cures-condec-jira/raw/ma

### Features
ConDec offers the following features:
- [Automatic text classification to identify decision knowledge in natural language text](https://github.com/cures-hub/cures-condec-jira/tree/master/doc/features/automatic_text_classification.md)
- [Automatic text classification to identify decision knowledge in natural language text](https://github.com/cures-hub/cures-condec-jira/tree/master/doc/features/automatic-text-classification.md)

## Implementation Details

24 changes: 24 additions & 0 deletions doc/features/automatic-text-classification.md
@@ -0,0 +1,24 @@
# Automatic Text Classification/Rationale Identification

The ConDec Jira plug-in offers a feature that automatically classifies text as either relevant decision knowledge or irrelevant text.
The text classifier consists of a binary and a fine-grained classifier.

## Ground Truth Data
Ground truth data is needed to train and evaluate the text classifier.
ConDec installs two default training files: [one rather small one](https://github.com/cures-hub/cures-condec-jira/tree/master/src/main/resources/classifier/defaultTrainingData.csv) and one with the data used for the NLP4RE'21 workshop.

To reproduce the results from the **NLP4RE'21 workshop**, perform the following steps:
- Install the [version 2.3.2](https://github.com/cures-hub/cures-condec-jira/releases/tag/v2.3.2) of the ConDec Jira plug-in and activate the plug-in for a Jira project.
- Navigate to the text classification settings page (see section below).
- Choose the training file [CONDEC-NLP4RE2021.csv](https://github.com/cures-hub/cures-condec-jira/tree/master/src/main/resources/classifier/CONDEC-NLP4RE2021.csv).
- Set the machine-learning algorithm to Logistic Regression for both the binary and fine-grained classifiers.
- Run 10-fold cross-validation (set k to 10).
- ConDec writes the evaluation results to a text file. The output file should be similar to [evaluation-results-CONDEC-NLP4RE2021-LR-10fold](https://github.com/cures-hub/cures-condec-jira/raw/master/doc/features/evaluation-results-CONDEC-NLP4RE2021-LR-10fold.txt). The results may differ slightly because of the random undersampling used to balance the training data.
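The splitting strategy behind the cross-validation step can be sketched roughly as follows: because the balanced input list interleaves the knowledge types, cutting it into contiguous chunks keeps the type distribution even across folds. The class and method names below are illustrative, not ConDec's API:

```java
import java.util.ArrayList;
import java.util.List;

public class FoldSketch {

	// Split a class-balanced, interleaved list into k contiguous folds.
	// Because the knowledge types alternate in the input, each fold ends up
	// with roughly the same type distribution.
	public static <T> List<List<T>> split(List<T> balanced, int k) {
		int chunkSize = (int) Math.ceil(balanced.size() / (double) k);
		List<List<T>> folds = new ArrayList<>();
		for (int i = 0; i < balanced.size(); i += chunkSize) {
			folds.add(balanced.subList(i, Math.min(i + chunkSize, balanced.size())));
		}
		return folds;
	}

	public static void main(String[] args) {
		List<String> balanced = List.of("issue", "decision", "issue", "decision",
				"issue", "decision", "issue", "decision");
		System.out.println(split(balanced, 4).size()); // prints 4
	}
}
```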

Basic descriptive statistics on ground truth files can be calculated using the R file [training-data-analysis.r](https://github.com/cures-hub/cures-condec-jira/raw/master/doc/features/training-data-analysis.r).

## Activation and Configuration
The text classifier can be trained and evaluated directly in Jira.

![Configuration view for the automatic text classifier](https://github.com/cures-hub/cures-condec-jira/raw/master/doc/screenshots/config_automatic_text_classification.png)
*Configuration view for the automatic text classifier*
10 changes: 0 additions & 10 deletions doc/features/automatic_text_classification.md

This file was deleted.

75 changes: 75 additions & 0 deletions doc/features/evaluation-results-CONDEC-NLP4RE2021-LR-10fold.txt
@@ -0,0 +1,75 @@
Ground truth file name: CONDEC-NLP4RE2021.csv
Trained and evaluated using 10-fold cross-validation
{Binary smile.classification.LogisticRegression$Binomial={
fit time: 565,977 ms,
score time: 7,748 ms,
validation data size: 41,
error: 11,
accuracy: 72,20%,
sensitivity: 96,47%,
specificity: 47,74%,
precision: 64,91%,
F1 score: 77,46%,
MCC: 50,82%
}, Fine-grained Overall smile.classification.LogisticRegression$Multinomial={
fit time: 37275,166 ms,
score time: 80,785 ms,
validation data size: 1080,
error: 721,
accuracy: 33,24%
}, Fine-grained Alternative={
fit time: 37275,166 ms,
score time: 80,785 ms,
validation data size: 1080,
error: 230,
accuracy: 78,70%,
sensitivity: 17,59%,
specificity: 93,98%,
precision: 42,22%,
F1 score: 24,84%,
MCC: 16,75%
}, Fine-grained Pro={
fit time: 37275,166 ms,
score time: 80,785 ms,
validation data size: 1080,
error: 210,
accuracy: 80,56%,
sensitivity: 11,11%,
specificity: 97,92%,
precision: 57,14%,
F1 score: 18,60%,
MCC: 18,68%
}, Fine-grained Con={
fit time: 37275,166 ms,
score time: 80,785 ms,
validation data size: 1080,
error: 215,
accuracy: 80,09%,
sensitivity: 12,50%,
specificity: 96,99%,
precision: 50,94%,
F1 score: 20,07%,
MCC: 17,57%
}, Fine-grained Decision={
fit time: 37275,166 ms,
score time: 80,785 ms,
validation data size: 1080,
error: 574,
accuracy: 46,85%,
sensitivity: 84,72%,
specificity: 37,38%,
precision: 25,28%,
F1 score: 38,94%,
MCC: 18,81%
}, Fine-grained Issue={
fit time: 37275,166 ms,
score time: 80,785 ms,
validation data size: 1080,
error: 213,
accuracy: 80,28%,
sensitivity: 40,28%,
specificity: 90,28%,
precision: 50,88%,
F1 score: 44,96%,
MCC: 33,48%
}}
39 changes: 39 additions & 0 deletions doc/features/training-data-analysis.r
@@ -0,0 +1,39 @@
setwd("~/gits/paper/2021-nlp4re/evaluation/")

trainingData <- read.csv("CONDEC-NLP4RE2021.csv")
summary(trainingData)
trainingData[880,]

numIssues <- table(trainingData$isIssue)[2] # 392
numDecisions <- table(trainingData$isDecision)[2] # 332
numAlternatives <- table(trainingData$isAlternative)[2] # 218
numPros <- table(trainingData$isPro)[2] # 288
numCons <- table(trainingData$isCon)[2] # 238

numRelevant <- numIssues + numDecisions + numAlternatives + numPros + numCons # 1468
numIrrelevant <- nrow(trainingData) - numRelevant # 220

# get parts of text per type
rowsWithIrrelevantText <-
which(
trainingData$isIssue == 0 &
trainingData$isDecision == 0 &
trainingData$isCon == 0 &
trainingData$isPro == 0 & trainingData$isAlternative == 0
)
trainingData[rowsWithIrrelevantText,]

rowsWithIssues <- which(trainingData$isIssue == 1)
trainingData[rowsWithIssues,]

rowsWithDecisions <- which(trainingData$isDecision == 1)
trainingData[rowsWithDecisions,]

rowsWithAlternatives <- which(trainingData$isAlternative == 1)
trainingData[rowsWithAlternatives,]

rowsWithCons <- which(trainingData$isCon == 1)
trainingData[rowsWithCons,]

rowsWithPros <- which(trainingData$isPro == 1)
trainingData[rowsWithPros,]
2 changes: 1 addition & 1 deletion pom.xml
@@ -6,7 +6,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>de.uhd.ifi.se.decision</groupId>
<artifactId>management.jira</artifactId>
<version>2.3.1</version>
<version>2.3.2</version>
<organization>
<name>Software Engineering Research Group, Heidelberg University</name>
<url>https://github.com/cures-hub</url>
@@ -66,8 +66,7 @@ public Classifier<double[]> train(double[][] trainingSamples, int[] trainingLabe
@Override
public Map<String, ClassificationMetrics> evaluateUsingKFoldCrossValidation(int k, GroundTruthData groundTruthData,
ClassifierType classifierType) {
Map<GroundTruthData, GroundTruthData> splitData = GroundTruthData.splitForKFoldCrossValidation(k,
groundTruthData.getKnowledgeElements());
Map<GroundTruthData, GroundTruthData> splitData = groundTruthData.splitForBinaryKFoldCrossValidation(k);
Classifier<double[]> entireModel = model;

List<ClassificationValidation<Classifier<double[]>>> validations = new ArrayList<>();
@@ -84,6 +84,7 @@ private static List<File> getFilesMatchingRegex(String regex) {
* @return updated file with default training content.
*/
static File copyDefaultTrainingDataToClassifierDirectory() {
copyDataToFile("CONDEC-NLP4RE2021.csv");
return copyDataToFile("defaultTrainingData.csv");
}

@@ -74,8 +74,7 @@ public Classifier<double[]> train(double[][] trainingSamples, int[] trainingLabe
@Override
public Map<String, ClassificationMetrics> evaluateUsingKFoldCrossValidation(int k, GroundTruthData groundTruthData,
ClassifierType classifierType) {
Map<GroundTruthData, GroundTruthData> splitData = GroundTruthData.splitForKFoldCrossValidation(k,
groundTruthData.getDecisionKnowledgeElements());
Map<GroundTruthData, GroundTruthData> splitData = groundTruthData.splitForFineGrainedKFoldCrossValidation(k);
Classifier<double[]> entireModel = model;

int[] truth = new int[0];
@@ -4,12 +4,14 @@
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.commons.csv.CSVFormat;
import org.slf4j.Logger;
@@ -287,23 +289,6 @@ public static DataFrame readDataFrameFromCSVFile(File trainingDataFile) {
return trainingData;
}

/**
* @return list of knowledge elements created from training data. Only the
* summary and the type is set!
*/
public List<KnowledgeElement> getKnowledgeElements() {
List<KnowledgeElement> elements = new ArrayList<>();
for (Map.Entry<String, Integer> entry : allSentenceRelevanceMap.entrySet()) {
if (entry.getValue().equals(0)) {
KnowledgeElement element = new KnowledgeElement();
element.setSummary(entry.getKey());
elements.add(element);
}
}
elements.addAll(getDecisionKnowledgeElements());
return elements;
}

/**
* @return list of decision knowledge (rationale) elements created from training
* data. Only the summary and the type is set!
@@ -330,10 +315,9 @@ public String toString() {
* @param k
* @return
*/
public static Map<GroundTruthData, GroundTruthData> splitForKFoldCrossValidation(int k,
private static Map<GroundTruthData, GroundTruthData> splitForKFoldCrossValidation(int k,
List<KnowledgeElement> elements) {
Map<GroundTruthData, GroundTruthData> splitData = new HashMap<>();
Collections.shuffle(elements);
int chunkSize = (int) Math.ceil(elements.size() / k);
List<List<KnowledgeElement>> parts = Lists.partition(elements, chunkSize);
for (int i = 0; i < k; i++) {
@@ -352,8 +336,107 @@ public static Map<GroundTruthData, GroundTruthData> splitForKFoldCrossValidation
return splitData;
}

public Map<GroundTruthData, GroundTruthData> splitForBinaryKFoldCrossValidation(int k) {
return splitForKFoldCrossValidation(k, getBalancedKnowledgeElementsWrtRelevance(true));
}

public Map<GroundTruthData, GroundTruthData> splitForFineGrainedKFoldCrossValidation(int k) {
return splitForKFoldCrossValidation(k, getBalancedDecisionKnowledgeElements(true));
}

public String getFileName() {
return fileName != null ? fileName : "";
}

/**
* @return list of knowledge elements created from training data. Only the
* summary and the type is set!
*/
public List<KnowledgeElement> getKnowledgeElements() {
List<KnowledgeElement> elements = getIrrelevantPartsOfText();
elements.addAll(getDecisionKnowledgeElements());
return elements;
}

private List<KnowledgeElement> getIrrelevantPartsOfText() {
List<KnowledgeElement> irrelevantPartsOfText = new ArrayList<>();
for (Map.Entry<String, Integer> entry : allSentenceRelevanceMap.entrySet()) {
if (entry.getValue().equals(0)) {
KnowledgeElement element = new KnowledgeElement();
element.setSummary(entry.getKey());
irrelevantPartsOfText.add(element);
}
}
return irrelevantPartsOfText;
}

/**
* @param isRandom
* true if random undersampling, false if first elements in list are
* taken for undersampling.
* @return list of balanced knowledge elements regarding their relevance. Uses
* undersampling.
*/
public List<KnowledgeElement> getBalancedKnowledgeElementsWrtRelevance(boolean isRandom) {
int numberOfAllParts = allSentenceRelevanceMap.size();
int numberOfRelevantPartsOfText = relevantSentenceKnowledgeTypeLabelMap.size();
int numberOfIrrelevantPartsOfText = numberOfAllParts - numberOfRelevantPartsOfText;
int min = Math.min(numberOfIrrelevantPartsOfText, numberOfRelevantPartsOfText);
List<KnowledgeElement> irrelevantParts = getSubList(getIrrelevantPartsOfText(), min, isRandom);
List<KnowledgeElement> relevantParts = getSubList(getDecisionKnowledgeElements(), min, isRandom);
List<KnowledgeElement> balancedElements = new ArrayList<>();
for (int i = 0; i < min; i++) {
balancedElements.add(irrelevantParts.get(i));
balancedElements.add(relevantParts.get(i));
}
return balancedElements;
}

public static <T> List<T> getSubList(List<T> list, int newSize, boolean isRandom) {
if (isRandom) {
Collections.shuffle(list);
}
return list.subList(0, newSize);
}

/**
* @param isRandom
* true if random undersampling, false if first elements in list are
* taken for undersampling.
* @return list of balanced knowledge elements regarding their type. Uses random
* undersampling.
*/
public List<KnowledgeElement> getBalancedDecisionKnowledgeElements(boolean isRandom) {
List<KnowledgeElement> elements = getDecisionKnowledgeElements();
List<KnowledgeElement> issues = getElementsOfType(elements, KnowledgeType.ISSUE);
List<KnowledgeElement> decisions = getElementsOfType(elements, KnowledgeType.DECISION);
List<KnowledgeElement> alternatives = getElementsOfType(elements, KnowledgeType.ALTERNATIVE);
List<KnowledgeElement> proArguments = getElementsOfType(elements, KnowledgeType.PRO);
List<KnowledgeElement> conArguments = getElementsOfType(elements, KnowledgeType.CON);

List<Integer> sampleSizes = Arrays.asList(issues.size(), decisions.size(), alternatives.size(),
proArguments.size(), conArguments.size());
int min = Collections.min(sampleSizes);

List<KnowledgeElement> balancedIssues = getSubList(issues, min, isRandom);
List<KnowledgeElement> balancedDecisions = getSubList(decisions, min, isRandom);
List<KnowledgeElement> balancedAlternatives = getSubList(alternatives, min, isRandom);
List<KnowledgeElement> balancedPros = getSubList(proArguments, min, isRandom);
List<KnowledgeElement> balancedCons = getSubList(conArguments, min, isRandom);

List<KnowledgeElement> balancedElements = new ArrayList<>();
for (int i = 0; i < min; i++) {
balancedElements.add(balancedIssues.get(i));
balancedElements.add(balancedDecisions.get(i));
balancedElements.add(balancedAlternatives.get(i));
balancedElements.add(balancedPros.get(i));
balancedElements.add(balancedCons.get(i));
}
return balancedElements;
}

private static List<KnowledgeElement> getElementsOfType(List<KnowledgeElement> allElements, KnowledgeType type) {
return allElements.stream().filter(e -> e.getType() == type).collect(Collectors.toList());
}

}
2 changes: 2 additions & 0 deletions src/main/resources/atlassian-plugin.xml
@@ -200,6 +200,8 @@
<context>jira.view.issue</context>
<resource type="download" name="defaultTrainingData.csv"
location="/classifier/defaultTrainingData.csv" />
<resource type="download" name="CONDEC-NLP4RE2021.csv"
location="/classifier/CONDEC-NLP4RE2021.csv" />

<!-- Language models -->
<resource type="download" name="glove.6b.50d.csv" location="/classifier/glove.6b.50d.csv" />