CSVRecordFactory: Add spreadsheet support
Add XLSX and ODS file formats for reading spreadsheets as CSV files.
berombau authored and DylanVanAssche committed Jul 27, 2021
1 parent 6c19568 commit 23c02be
Showing 331 changed files with 4,704 additions and 34 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
### Added
- FunctionLoader: throw error on missing function parameters (see [issue 125](https://gitlab.ilabt.imec.be/rml/proc/rmlmapper-java/-/issues/125))
- HTMLRecordFactory: add CSS3 selector support (see [issue 52](https://gitlab.ilabt.imec.be/rml/proc/rmlmapper-java/-/issues/52))
- CSVRecordFactory: add spreadsheet support (see [issue 42](https://gitlab.ilabt.imec.be/rml/proc/rmlmapper-java/-/issues/42))

## [4.11.0] - 2021-07-05

8 changes: 8 additions & 0 deletions README.md
@@ -34,6 +34,8 @@ The RMLMapper loads all data in memory, so be aware when working with big datase

### Supported
- local data sources:
  - Excel (.xlsx)
  - LibreOffice (.ods)
  - CSV files (including CSVW)
  - JSON files (JSONPath)
  - XML files (XPath)
@@ -252,6 +254,9 @@ and up to which level metadata should be stored (dataset, triple, or term level

Run the tests via `test.sh`.

#### Derived tests
Some tests (Excel, ODS) are derived from the CSV tests using a script (`./generate-spreadsheet-test-cases.sh`).

### RDBs
Make sure you have [Docker](https://www.docker.com) running.

@@ -317,6 +322,9 @@ We also offer consulting for all-things-RML.

## Remarks

### Typed spreadsheet files
Spreadsheet files are currently treated as plain CSV files: cell type information such as Currency or Date is not used.
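For illustration only (this snippet is not part of the RMLMapper; the file `data.xlsx`, its contents, and the class name are placeholders), reading a workbook through Apache POI's `DataFormatter` yields each cell's displayed text, so a Date or Currency cell arrives as a plain string:

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;

public class SpreadsheetCellValues {
    public static void main(String[] args) throws IOException {
        DataFormatter formatter = new DataFormatter();
        try (Workbook workbook = new XSSFWorkbook(new FileInputStream("data.xlsx"))) {
            for (Sheet sheet : workbook) {
                for (Row row : sheet) {
                    for (Cell cell : row) {
                        // Only the displayed text is read; a Date or Currency
                        // cell arrives as a plain string, without its type.
                        System.out.print(formatter.formatCellValue(cell) + "\t");
                    }
                    System.out.println();
                }
            }
        }
    }
}
```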

### XML file parsing performance

The RMLMapper's XML parsing implementation (`javax.xml.parsers`) has been chosen to support full XPath.
64 changes: 64 additions & 0 deletions generate-spreadsheet-test-cases.sh
@@ -0,0 +1,64 @@
#!/usr/bin/env bash

# REQUIRES libreoffice!!

TEST_LOCATION="src/test"
TEST_FILE_LOCATION="java/be/ugent/rml"
TEST_RESOURCES_LOCATION="resources/test-cases"
NAME_CSV_TEST="Mapper_CSV_Test.java"

# Check for libreoffice
if [[ ! `libreoffice --help` ]]
then
echo "Install libreoffice to convert CSV."
exit 1
fi

cd ${TEST_LOCATION}
TEST_DIR=$(pwd)

for i in "EXCEL xlsx" "ODS ods"
do
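# split the pair: $1 = format name (used in test names), $2 = file extension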
set -- ${i}
echo "Generating ${1} tests from CSV tests"

## Test files
cd "${TEST_DIR}/${TEST_FILE_LOCATION}"
NAME_NEW_TEST="Mapper_${1}_Test.java"
cp ${NAME_CSV_TEST} ${NAME_NEW_TEST}
sed -i "s/CSV/${1}/g" ${NAME_NEW_TEST}

## Test resources
cd "${TEST_DIR}/${TEST_RESOURCES_LOCATION}"
for csv_dir in *CSV*
do
# Copy CSV test directory
NEW_DIR_NAME=$(echo ${csv_dir} | sed "s/CSV/${1}/")
if [[ -d ${NEW_DIR_NAME} ]]
then
rm -Rf ${NEW_DIR_NAME}
fi
cp -r ${csv_dir} ${NEW_DIR_NAME}
cd ${NEW_DIR_NAME}

# Change files within directory

echo "Test case: ${NEW_DIR_NAME}"
# csv source file
for csv_source in *.csv
do
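# if the glob did not match, ${csv_source} is the literal "*.csv"; nothing to convert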
if [[ ! -f ${csv_source} ]]; then break; fi
# UTF-8 encoding issue
# https://bugs.documentfoundation.org/show_bug.cgi?id=36313
libreoffice --headless --convert-to ${2} --infilter=CSV:44,34,UTF8 ${csv_source}
rm ${csv_source}
done
# mapping file
sed -i "s/.csv/.${2}/g" "mapping.ttl"

cd ..
done
done

echo "Success!"

35 changes: 30 additions & 5 deletions pom.xml
@@ -132,11 +132,6 @@
<artifactId>commons-lang</artifactId>
<version>2.6</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>1.8</version>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
@@ -219,6 +214,11 @@
<type>pom</type>
<version>3.8.0</version>
</dependency>
<dependency>
<groupId>com.hp.hpl.jena</groupId>
<artifactId>arq</artifactId>
<version>2.8.8</version>
</dependency>
<!-- Keep this Fuseki library on this version bc of compatibility Jetty -->
<dependency>
<groupId>org.apache.jena</groupId>
@@ -250,6 +250,31 @@
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
<!-- START spreadsheet dependencies -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>1.8</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>4.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.odftoolkit/simple-odf -->
<dependency>
<!--
This should be below Apache Jena dependencies
in the pom declaration order for the correct dependency mediation
Otherwise: java.lang.NoClassDefFoundError: org/apache/jena/shared/JenaException
and no 0.8-incubating version yet
https://issues.apache.org/jira/browse/ODFTOOLKIT-415?jql=project%20%3D%20ODFTOOLKIT%20AND%20fixVersion%20%3D%200.7-incubating
-->
<groupId>org.apache.odftoolkit</groupId>
<artifactId>simple-odf</artifactId>
<version>0.8.2-incubating</version>
</dependency>
<!-- END spreadsheet dependencies-->
</dependencies>

<build>
2 changes: 1 addition & 1 deletion src/main/java/be/ugent/rml/Executor.java
@@ -423,7 +423,7 @@ private List<ProvenancedTerm> getAllIRIs(Term triplesMap) throws Exception {
return iris;
}

private List<Record> getRecords(Term triplesMap) throws IOException, SQLException, ClassNotFoundException {
private List<Record> getRecords(Term triplesMap) throws Exception {
if (!this.recordsHolders.containsKey(triplesMap)) {
this.recordsHolders.put(triplesMap, this.recordsFactory.createRecords(triplesMap, this.rmlStore));
}
112 changes: 90 additions & 22 deletions src/main/java/be/ugent/rml/records/CSVRecordFactory.java
@@ -9,8 +9,15 @@
import be.ugent.rml.term.Term;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.commons.io.FilenameUtils;
import org.odftoolkit.simple.Document;
import org.odftoolkit.simple.SpreadsheetDocument;

import java.io.IOException;
import java.io.InputStream;
@@ -36,39 +43,84 @@ public class CSVRecordFactory implements ReferenceFormulationRecordFactory {
* @throws IOException
*/
@Override
public List<Record> getRecords(Access access, Term logicalSource, QuadStore rmlStore) throws IOException, SQLException, ClassNotFoundException {
public List<Record> getRecords(Access access, Term logicalSource, QuadStore rmlStore) throws Exception {
List<Term> sources = Utils.getObjectsFromQuads(rmlStore.getQuads(logicalSource, new NamedNode(NAMESPACES.RML + "source"), null));
Term source = sources.get(0);
CSVParser parser;

if (source instanceof Literal) {
// We are not dealing with something like CSVW.
parser = getParserForNormalCSV(access);
// Check for different spreadsheet formats
String filePath = source.getValue();
String extension = FilenameUtils.getExtension(filePath);
switch (extension) {
case "xlsx":
return getRecordsForExcel(access);
case "ods":
return getRecordsForODS(access);
default:
return getRecordsForCSV(access, null);
}

} else {
List<Term> sourceType = Utils.getObjectsFromQuads(rmlStore.getQuads(source, new NamedNode(NAMESPACES.RDF + "type"), null));

// Check if we are dealing with CSVW.
if (sourceType.get(0).getValue().equals(NAMESPACES.CSVW + "Table")) {
CSVW csvw = new CSVW(access.getInputStream(), rmlStore, logicalSource);
parser = csvw.getCSVParser();
return getRecordsForCSV(access, csvw);
} else {
// RDBs fall under this.
parser = getParserForNormalCSV(access);
return getRecordsForCSV(access, null);
}
}
}

if (parser != null) {
List<org.apache.commons.csv.CSVRecord> myEntries = parser.getRecords();
/**
* Get Records for Excel file format.
* @param access the access object for the .xlsx input
* @return a list of Records, one per data row over all sheets
* @throws IOException
*/
private List<Record> getRecordsForExcel(Access access) throws IOException, SQLException, ClassNotFoundException {
List<Record> output = new ArrayList<>();
Workbook workbook = new XSSFWorkbook(access.getInputStream());
for (Sheet datatypeSheet : workbook) {
Row header = datatypeSheet.getRow(0);
boolean first = true;
for (Row currentRow : datatypeSheet) {
// skip the header row
if (first) {
first = false;
} else {
output.add(new ExcelRecord(header, currentRow));
}
}
}
return output;
}

return myEntries.stream()
.map(record -> new CSVRecord(record, access.getDataTypes()))
.collect(Collectors.toList());
} else {
// We still return an empty list of records when a parser is not found.
// This is to support certain use cases with RDBs where queries might not be valid,
// but you don't want the RMLMapper to crash.
return new ArrayList<>();
/**
* Get Records for the ODS file format.
* @param access the access object for the .ods input
* @return a list of Records, one per data row over all tables
* @throws IOException
*/
private List<Record> getRecordsForODS(Access access) throws Exception {
List<Record> output = new ArrayList<>();
InputStream is = access.getInputStream();
Document document = SpreadsheetDocument.loadDocument(is);
for (org.odftoolkit.simple.table.Table table : document.getTableList()) {
org.odftoolkit.simple.table.Row header = table.getRowByIndex(0);
boolean first = true;
for (org.odftoolkit.simple.table.Row currentRow : table.getRowList()) {
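// skip the header row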
if (first) {
first = false;
} else {
output.add(new ODSRecord(header, currentRow));
}
}
}
return output;
}

/**
Expand All @@ -78,19 +130,35 @@ public List<Record> getRecords(Access access, Term logicalSource, QuadStore rmlS
* @return a list of Records; an empty list when no CSV parser could be created.
* @throws IOException
*/
private CSVParser getParserForNormalCSV(Access access) throws IOException, SQLException, ClassNotFoundException {
CSVFormat csvFormat = CSVFormat.DEFAULT.withHeader().withSkipHeaderRecord(false).withNullString("@@@@NULL@@@@");
InputStream inputStream = access.getInputStream();
private List<Record> getRecordsForCSV(Access access, CSVW csvw) throws IOException, SQLException, ClassNotFoundException {
CSVParser parser;
// Check if we are dealing with CSVW.
if (csvw != null) {
parser = csvw.getCSVParser();
} else {
// RDBs fall under this.
CSVFormat csvFormat = CSVFormat.DEFAULT.withHeader().withSkipHeaderRecord(false).withNullString("@@@@NULL@@@@");
InputStream inputStream = access.getInputStream();

if (inputStream != null) {
try {
return CSVParser.parse(inputStream, StandardCharsets.UTF_8, csvFormat);
parser = CSVParser.parse(inputStream, StandardCharsets.UTF_8, csvFormat);
} catch (IllegalArgumentException e) {
logger.debug("Could not parse CSV inputstream", e);
return null;
parser = null;
}
}

if (parser != null) {
List<org.apache.commons.csv.CSVRecord> myEntries = parser.getRecords();

return myEntries.stream()
.map(record -> new CSVRecord(record, access.getDataTypes()))
.collect(Collectors.toList());
} else {
return null;
// We still return an empty list of records when a parser is not found.
// This is to support certain use cases with RDBs where queries might not be valid,
// but you don't want the RMLMapper to crash.
return new ArrayList<>();
}
}
}
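The `ExcelRecord` and `ODSRecord` classes used above are added by this commit but are not shown in this excerpt. Purely as a hypothetical sketch (the class and method below are invented here, not the commit's code), a header-keyed record could resolve a column reference against the header row with Apache POI like this:

```java
// Hypothetical sketch only -- not the ExcelRecord class from this commit.
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;

import java.util.ArrayList;
import java.util.List;

class HeaderKeyedRow {
    private final Row header;
    private final Row row;
    private final DataFormatter formatter = new DataFormatter();

    HeaderKeyedRow(Row header, Row row) {
        this.header = header;
        this.row = row;
    }

    // Return the values of all cells whose column name equals the reference,
    // formatted as plain strings (no spreadsheet type information).
    List<String> get(String reference) {
        List<String> values = new ArrayList<>();
        for (Cell headerCell : header) {
            if (reference.equals(formatter.formatCellValue(headerCell))) {
                Cell cell = row.getCell(headerCell.getColumnIndex());
                if (cell != null) {
                    values.add(formatter.formatCellValue(cell));
                }
            }
        }
        return values;
    }
}
```

A record for ODS rows could follow the same pattern on top of the ODF Toolkit Simple API.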