Skip to content

Commit

Permalink
DRILL-8417: Allow Excel Reader to Ignore Formula Errors (#2783)
Browse files Browse the repository at this point in the history
  • Loading branch information
cgivre authored and jnturton committed Apr 18, 2023
1 parent 9a2e109 commit 20f9f2c
Show file tree
Hide file tree
Showing 5 changed files with 83 additions and 16 deletions.
28 changes: 15 additions & 13 deletions contrib/format-excel/README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
# Excel Format Plugin
This plugin enables Drill to read Microsoft Excel files. This format is best used with Excel files that do not have extensive formatting, however it will work with formatted files, by allowing you to define a region within the file where the data is.
This plugin enables Drill to read Microsoft Excel files. This format is best used with Excel files that do not have extensive formatting, however it will work with formatted files, by allowing you to define a region within the file where the data is.

The plugin will automatically evaluate cells which contain formulae.
The plugin will automatically evaluate cells which contain formulae.

## Plugin Configuration
This plugin has several configuration variables which must be set in order to read Excel files effectively. Since Excel files often contain other elements besides data, you can use the configuration variables to define a region within your spreadsheet in which Drill should extract data. This is potentially useful if your spreadsheet contains a lot of formatting or other complications.
## Plugin Configuration
This plugin has several configuration variables which must be set in order to read Excel files effectively. Since Excel files often contain other elements besides data, you can use the configuration variables to define a region within your spreadsheet in which Drill should extract data. This is potentially useful if your spreadsheet contains a lot of formatting or other complications.

### Configuration Options:
The most basic configuration is simply to add the following to a file based storage plugin:
Expand All @@ -24,8 +24,10 @@ The plugin has many other configuration options listed below:
* `firstColumn`: In order to define a region within a spreadsheet, this is the left-most column index. This is indexed from one. If set to `0` Drill will start at the left most
column.
* `lastColumn`: To define a region within a spreadsheet, this is the right-most column index. This is indexed from one. If set to `0` Drill will read all available columns. This
is not inclusive, so if you ask for columns 2-5 you will get columns 2,3 and 4.
is not inclusive, so if you ask for columns 2-5 you will get columns 2,3 and 4.
* `allTextMode`: When set to `true`, Drill will not attempt to infer column data types and will read everything as `VARCHAR`. Defaults to `false`;
* `ignoreErrors`: Defaults to `true`. When set to `true` Drill will return `null` for any
formulas or any values that are unparseable.
* `maxArraySize`: Overrides the default [POI config](https://poi.apache.org/components/configuration.html) for `setByteArrayMaxOverride(int maxOverride)`. May be needed with large Excel files.
* `thresholdBytesForTempFiles`: Overrides the default [POI config](https://poi.apache.org/components/configuration.html) for `ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(int thresholdBytes)`. May be needed with large Excel files.
* `useTempFilePackageParts`: Overrides the default [POI config](https://poi.apache.org/components/configuration.html) for `ZipPackage.setUseTempFilePackageParts(boolean tempFilePackageParts)`. May be needed with large Excel files.
Expand All @@ -35,24 +37,24 @@ You can specify the configuration at runtime via the `table()` function or in th
execute the query as follows:

```
SELECT <fields>
SELECT <fields>
FROM dfs.`somefile.xlsx`
```
To query a different sheet other than the default, use the `table()` function as shown below:
```
SELECT <fields>
SELECT <fields>
FROM table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 'secondSheet'))
```
To join data together from different sheets in one file, use the query below:
```
SELECT <fields>
SELECT <fields>
FROM table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 'secondSheet')) AS t1
INNER JOIN table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 'thirdSheet')) AS t2
INNER JOIN table( dfs.`test_data.xlsx` (type => 'excel', sheetName => 'thirdSheet')) AS t2
ON t1.id = t2.id
```

### Implicit Columns
Drill includes several columns of file metadata in the query results. These fields are **not** included in star queries and thus must be explicitly listed in a query.
Drill includes several columns of file metadata in the query results. These fields are **not** included in star queries and thus must be explicitly listed in a query.

The fields are:

Expand All @@ -75,10 +77,10 @@ The fields are:

### Known Limitations:
At present, Drill requires that all columns be of the same data type. If they are not, Drill will throw an exception upon trying to read a column of mixed data type. If you are
trying to query data with heterogenoeus columns, it will be necessary to set `allTextMode` to `true`.
trying to query data with heterogenoeus columns, it will be necessary to set `allTextMode` to `true`.

An additional limitation is that Drill infers the column data type from the first row of data. If a column is `null` in the first row, Drill will default to a datatype of
`VARCHAR`. However if in fact the column is `NUMERIC` this will cause errors.
`VARCHAR`. However if in fact the column is `NUMERIC` this will cause errors.

Drill ignores blank rows.

Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,7 @@ static class ExcelReaderConfig {
final int firstColumn;
final int lastColumn;
final boolean allTextMode;
final boolean ignoreErrors;
final String sheetName;
final int maxArraySize;
final int thresholdBytesForTempFiles;
Expand All @@ -187,6 +188,7 @@ static class ExcelReaderConfig {
firstColumn = plugin.getConfig().getFirstColumn();
lastColumn = plugin.getConfig().getLastColumn();
allTextMode = plugin.getConfig().getAllTextMode();
ignoreErrors = plugin.getConfig().getIgnoreErrors();
sheetName = plugin.getConfig().getSheetName();
maxArraySize = plugin.getConfig().getMaxArraySize();
thresholdBytesForTempFiles = plugin.getConfig().getThresholdBytesForTempFiles();
Expand Down Expand Up @@ -420,7 +422,7 @@ private String deconflictColumnNames(String columnName) {

/**
* Helper function to get the selected sheet from the configuration
* @return Sheet The selected sheet
* @return {@link Sheet} The selected sheet
*/
private Sheet getSheet() {
int sheetIndex = 0;
Expand Down Expand Up @@ -520,7 +522,20 @@ private boolean nextLine(RowSetLoader rowWriter) {

populateColumnArray(cell, colPosition);
if (colWriterIndex < cellWriterArray.size()) {
cellWriterArray.get(colWriterIndex).load(cell);
CellWriter writer = cellWriterArray.get(colWriterIndex);
try {
writer.load(cell);
} catch (NumberFormatException e) {
if (readerConfig.ignoreErrors) {
logger.warn("Error writing cell: {} {}", cell.getStringCellValue(), e.getMessage());
writer.writeNull();
} else {
throw UserException.dataReadError()
.message("Error writing data: " + e.getMessage())
.addContext(errorContext)
.build(logger);
}
}
}

colPosition++;
Expand Down Expand Up @@ -785,6 +800,10 @@ public static class CellWriter {
}

public void load(Cell cell) {}

public void writeNull() {
columnWriter.setNull();
}
}

public class StringCellWriter extends ExcelBatchReader.CellWriter {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@ public class ExcelFormatConfig implements FormatPluginConfig {
private final int firstColumn;
private final int lastColumn;
private final boolean allTextMode;
private final boolean ignoreErrors;
private final String sheetName;
private final int maxArraySize;
private final int thresholdBytesForTempFiles;
Expand All @@ -65,6 +66,7 @@ public ExcelFormatConfig(
@JsonProperty("firstColumn") Integer firstColumn,
@JsonProperty("lastColumn") Integer lastColumn,
@JsonProperty("allTextMode") Boolean allTextMode,
@JsonProperty("ignoreErrors") Boolean ignoreErrors,
@JsonProperty("sheetName") String sheetName,
@JsonProperty("maxArraySize") Integer maxArraySize,
@JsonProperty("thresholdBytesForTempFiles") Integer thresholdBytesForTempFiles,
Expand All @@ -79,6 +81,7 @@ public ExcelFormatConfig(
this.firstColumn = firstColumn == null ? 0 : firstColumn;
this.lastColumn = lastColumn == null ? 0 : lastColumn;
this.allTextMode = allTextMode == null ? false : allTextMode;
this.ignoreErrors = ignoreErrors == null ? true : ignoreErrors;
this.sheetName = sheetName == null ? "" : sheetName;
this.maxArraySize = maxArraySize == null ? -1 : maxArraySize;
this.thresholdBytesForTempFiles = thresholdBytesForTempFiles == null ? -1 : thresholdBytesForTempFiles;
Expand Down Expand Up @@ -110,6 +113,9 @@ public int getLastColumn() {
public boolean getAllTextMode() {
return allTextMode;
}
public boolean getIgnoreErrors() {
return ignoreErrors;
}

public String getSheetName() {
return sheetName;
Expand Down Expand Up @@ -151,7 +157,7 @@ public ExcelReaderConfig getReaderConfig(ExcelFormatPlugin plugin) {
@Override
public int hashCode() {
return Objects.hash(extensions, headerRow, lastRow,
firstColumn, lastColumn, allTextMode, sheetName);
firstColumn, lastColumn, allTextMode, ignoreErrors, sheetName);
}

@Override
Expand All @@ -169,6 +175,7 @@ public boolean equals(Object obj) {
&& Objects.equals(firstColumn, other.firstColumn)
&& Objects.equals(lastColumn, other.lastColumn)
&& Objects.equals(allTextMode, other.allTextMode)
&& Objects.equals(ignoreErrors, other.ignoreErrors)
&& Objects.equals(sheetName, other.sheetName)
&& Objects.equals(maxArraySize, other.maxArraySize)
&& Objects.equals(thresholdBytesForTempFiles, other.thresholdBytesForTempFiles)
Expand All @@ -185,6 +192,7 @@ public String toString() {
.field("firstColumn", firstColumn)
.field("lastColumn", lastColumn)
.field("allTextMode", allTextMode)
.field("ignoreErrors", ignoreErrors)
.field("maxArraySize", maxArraySize)
.field("thresholdBytesForTempFiles", thresholdBytesForTempFiles)
.field("useTempFilePackageParts", useTempFilePackageParts)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@

import org.apache.drill.categories.RowSetTest;
import org.apache.drill.common.exceptions.DrillRuntimeException;
import org.apache.drill.common.exceptions.UserException;
import org.apache.drill.common.types.TypeProtos;
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.physical.rowSet.RowSet;
Expand Down Expand Up @@ -329,6 +330,43 @@ public void testNonDefaultSheetQuery() throws RpcException {
new RowSetComparison(expected).verifyAndClearAll(results);
}

@Test
public void testErrorOnFormulaQuery() throws RpcException {
String sql = "SELECT * FROM table(cp.`excel/text-formula.xlsx` (type => 'excel', sheetName " +
"=> 'Sheet with Errors', ignoreErrors => True))";

RowSet results = client.queryBuilder().sql(sql).rowSet();
TupleMetadata expectedSchema = new SchemaBuilder()
.addNullable("field1", MinorType.FLOAT8)
.addNullable("field2", MinorType.FLOAT8)
.addNullable("result", MinorType.FLOAT8)
.buildSchema();

RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
.addRow(1,2,0.5)
.addRow(2,3,0.6666667)
.addRow(3,4,0.75)
.addRow(4,0, null)
.addRow(5,6,0.8333333)
.build();

new RowSetComparison(expected).verifyAndClearAll(results);
}

@Test
public void testErrorOnFormulaQueryWithoutIgnoreErrors() throws RpcException {
String sql = "SELECT * FROM table(cp.`excel/text-formula.xlsx` (type => 'excel', " +
"ignoreErrors => False, sheetName => 'Sheet with Errors'))";
try {
client.queryBuilder().sql(sql).rowSet();
fail();
} catch (UserException e) {
assertTrue(e.getMessage().contains("DATA_READ ERROR: Error writing data: For input string: " +
"\"#DIV/0!\""));
}
}


@Test
public void testExplicitNonDefaultSheetQuery() throws RpcException {
String sql = "SELECT event_date, ip_address, user_agent FROM table(cp.`excel/test_data.xlsx` (type => 'excel', sheetName => 'secondSheet'))";
Expand Down
Binary file modified contrib/format-excel/src/test/resources/excel/text-formula.xlsx
Binary file not shown.

0 comments on commit 20f9f2c

Please sign in to comment.