c/driver/sqlite: Column type is always int64 with empty table #581
Comments
It's kind of a consequence of it inferring/guessing column types instead of trying to leverage metadata (since SQLite doesn't enforce types). We could have it use specified types as a starting point/as a restriction (erroring if data differs).
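(Editorial aside: a minimal standalone C example, not from the thread, of the flexible typing being referred to here. SQLite accepts a TEXT value in a column declared INTEGER, which is why the driver infers Arrow types from the data instead of trusting the declared types.)

#include <stdio.h>
#include <sqlite3.h>

int main(void) {
  sqlite3* db = NULL;
  sqlite3_stmt* stmt = NULL;
  sqlite3_open(":memory:", &db);
  sqlite3_exec(db, "CREATE TABLE t (x INTEGER)", NULL, NULL, NULL);
  /* SQLite does not reject the second value despite the INTEGER declaration. */
  sqlite3_exec(db, "INSERT INTO t VALUES (1), ('not a number')", NULL, NULL, NULL);
  sqlite3_prepare_v2(db, "SELECT x FROM t", -1, &stmt, NULL);
  while (sqlite3_step(stmt) == SQLITE_ROW) {
    /* Prints 1 (SQLITE_INTEGER) for the first row, 3 (SQLITE_TEXT) for the second. */
    printf("storage class: %d\n", sqlite3_column_type(stmt, 0));
  }
  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}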
Ah, I thought that `sqlite3_column_type()` can work with an empty table. So I thought that we can improve the current implementation by:
diff --git a/c/driver/sqlite/statement_reader.c b/c/driver/sqlite/statement_reader.c
index 694b8cc..21fb20f 100644
--- a/c/driver/sqlite/statement_reader.c
+++ b/c/driver/sqlite/statement_reader.c
@@ -495,7 +495,7 @@ void StatementReaderRelease(struct ArrowArrayStream* self) {
/// Initialize buffers for the first (type-inferred) batch of data.
/// Use raw buffers since the types may change.
-AdbcStatusCode StatementReaderInitializeInfer(int num_columns, size_t infer_rows,
+AdbcStatusCode StatementReaderInitializeInfer(sqlite3_stmt *stmt, int num_columns, size_t infer_rows,
struct ArrowBitmap* validity,
struct ArrowBuffer* data,
struct ArrowBuffer* binary,
@@ -507,7 +507,21 @@ AdbcStatusCode StatementReaderInitializeInfer(int num_columns, size_t infer_rows
ArrowBufferInit(&data[i]);
CHECK_NA(INTERNAL, ArrowBufferReserve(&data[i], infer_rows * sizeof(int64_t)), error);
memset(&binary[i], 0, sizeof(struct ArrowBuffer));
- current_type[i] = NANOARROW_TYPE_INT64;
+ int sqlite_type = sqlite3_column_type(stmt, i);
+ switch (sqlite_type) {
+ case SQLITE_FLOAT:
+ current_type[i] = NANOARROW_TYPE_DOUBLE;
+ break;
+ case SQLITE_TEXT:
+ current_type[i] = NANOARROW_TYPE_STRING;
+ break;
+ case SQLITE_BLOB:
+ current_type[i] = NANOARROW_TYPE_BINARY;
+ break;
+ default:
+ current_type[i] = NANOARROW_TYPE_INT64;
+ break;
+ }
}
return ADBC_STATUS_OK;
} // NOLINT(whitespace/indent)
@@ -835,7 +849,7 @@ AdbcStatusCode AdbcSqliteExportReader(sqlite3* db, sqlite3_stmt* stmt,
enum ArrowType* current_type = malloc(num_columns * sizeof(enum ArrowType));
AdbcStatusCode status = StatementReaderInitializeInfer(
- num_columns, batch_size, validity, data, binary, current_type, error);
+ stmt, num_columns, batch_size, validity, data, binary, current_type, error);
if (status == ADBC_STATUS_OK) {
int64_t num_rows = 0;
while (num_rows < batch_size) {
But `sqlite3_column_type()` always returns `SQLITE_NULL` in that case. :<
We may want to use `sqlite3_column_decltype()`...
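(Editorial aside: a quick standalone check, not from the thread, of why the patch above doesn't help with an empty table. sqlite3_column_type() describes the current row; with no rows there is never a row to describe, and the SQLite documentation leaves the result undefined before a successful sqlite3_step(), though in practice it comes back as SQLITE_NULL, as noted above.)

#include <stdio.h>
#include <sqlite3.h>

int main(void) {
  sqlite3* db = NULL;
  sqlite3_stmt* stmt = NULL;
  sqlite3_open(":memory:", &db);
  sqlite3_exec(db, "CREATE TABLE data (number INTEGER, string TEXT)", NULL, NULL, NULL);
  sqlite3_prepare_v2(db, "SELECT * FROM data", -1, &stmt, NULL);
  /* No row has been fetched yet, so there is no value to report a type for. */
  printf("before step: %d %d (SQLITE_NULL=%d)\n",
         sqlite3_column_type(stmt, 0), sqlite3_column_type(stmt, 1), SQLITE_NULL);
  /* The table is empty, so the very first step already returns SQLITE_DONE. */
  printf("step: %d (SQLITE_DONE=%d)\n", sqlite3_step(stmt), SQLITE_DONE);
  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}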
Yes, decltype would work. I think the hard part there will be that SQLite doesn't really check the type name so we would have to replicate 3.1.1 here: https://www.sqlite.org/datatype3.html
Also it looks that decltype will fail if it's something like `x + 1` so in that case the user might still get a type that is a little unexpected.
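(Editorial aside: to make the section 3.1.1 concern concrete, here is a rough sketch of what a decltype-based mapping could look like. This is not driver code; GuessTypeFromDecltype and ContainsNoCase are hypothetical helpers, and mapping NUMERIC affinity to double is only one possible choice.)

#include <ctype.h>
#include <string.h>
#include <sqlite3.h>
#include <nanoarrow/nanoarrow.h>

/* Case-insensitive substring search, since the affinity rules match substrings
 * of the declared type regardless of case. */
static int ContainsNoCase(const char* haystack, const char* needle) {
  size_t n = strlen(needle);
  for (; *haystack != '\0'; haystack++) {
    size_t i = 0;
    while (i < n && toupper((unsigned char)haystack[i]) == toupper((unsigned char)needle[i])) i++;
    if (i == n) return 1;
  }
  return 0;
}

/* Replicates the affinity rules from https://www.sqlite.org/datatype3.html
 * section 3.1: substring checks applied in order, which is why
 * "FLOATING POINT" gets INTEGER affinity (it contains "INT") and "STRING" gets
 * NUMERIC affinity (it matches none of the rules). */
static enum ArrowType GuessTypeFromDecltype(sqlite3_stmt* stmt, int col) {
  const char* decl = sqlite3_column_decltype(stmt, col);
  /* decltype is NULL for expressions like `x + 1`; keep the driver's current
   * int64 default rather than the BLOB affinity the spec assigns to a missing type. */
  if (decl == NULL || decl[0] == '\0') return NANOARROW_TYPE_INT64;
  if (ContainsNoCase(decl, "INT")) return NANOARROW_TYPE_INT64;
  if (ContainsNoCase(decl, "CHAR") || ContainsNoCase(decl, "CLOB") ||
      ContainsNoCase(decl, "TEXT")) {
    return NANOARROW_TYPE_STRING;
  }
  if (ContainsNoCase(decl, "BLOB")) return NANOARROW_TYPE_BINARY;
  if (ContainsNoCase(decl, "REAL") || ContainsNoCase(decl, "FLOA") ||
      ContainsNoCase(decl, "DOUB")) {
    return NANOARROW_TYPE_DOUBLE;
  }
  /* NUMERIC affinity has no direct Arrow equivalent; double is a lossy stand-in. */
  return NANOARROW_TYPE_DOUBLE;
}

Even with such a mapping, the declared types are only hints: nothing stops a row from holding a value of a different storage class, so data-driven inference would still be needed as a fallback.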
Is there a mechanism by which a user can bind an output schema? (That would let a user get a consistent output schema if that was important for an application...short of replicating an entire SQL parser, it might be the only way to ensure that in this case?)
Not really. There's also not a great way to pass in a schema currently. We don't need a SQL parser. It's just that SQLite is rather fuzzy about what type names it accepts. The rules are simple, but surprising.
Or really: SQLite is untyped, despite whatever it lets you say. We can sort of patch over cases like this that look like they should work, but fundamentally it's a fiction that we're trying to construct.
Let me add a related problem that I run into very often: when the first batch of rows is all null, the driver will infer that the type is INT64, but then fail when the first non-null row is read, e.g.:
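(Editorial aside: the original example from this comment was not preserved in this capture. A standalone setup along the following lines, with arbitrary names and row counts, reproduces the described situation: a long run of NULLs followed by a text value.)

#include <stdio.h>
#include <sqlite3.h>

int main(void) {
  sqlite3* db = NULL;
  sqlite3_open("sparse.db", &db);
  sqlite3_exec(db, "CREATE TABLE events (id INTEGER, note TEXT)", NULL, NULL, NULL);
  sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
  char sql[128];
  for (int i = 0; i < 5000; i++) {
    /* `note` is NULL for the first 5000 rows, so any inference window that
     * only sees these rows concludes the column is int64. */
    snprintf(sql, sizeof(sql), "INSERT INTO events VALUES (%d, NULL)", i);
    sqlite3_exec(db, sql, NULL, NULL, NULL);
  }
  /* The first non-NULL value only appears afterwards; reading it as int64 fails. */
  sqlite3_exec(db, "INSERT INTO events VALUES (5000, 'first real note')", NULL, NULL, NULL);
  sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
  sqlite3_close(db);
  return 0;
}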
We should add a recipe to demonstrate it, but there's already an option that controls how many rows are used for type inference.
I think ultimately, what we want is #1514.
Thanks for clarifying this in the docs (#2523 is very informative!), but I was already aware of that. Unfortunately, in my use case almost all values of a single column are null (and no, I do not have control over the database, which is written by another app), so there is no easy heuristic for determining the correct number of rows to use for inference. So yes, #1514 would be a possible solution, but maybe a form of (opt-in) adaptive type inference could also be useful.
Hmm, an adaptive mode could be reasonable (or even having it be the normal mode).
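(Editorial aside: for illustration only, an adaptive mode might promote the inferred type along a small lattice whenever a later value contradicts it. PromoteType below is a hypothetical helper, not driver code, and it leaves out the genuinely hard part: batches already handed to the consumer with the narrower type would have to be rewritten or the read restarted.)

#include <sqlite3.h>
#include <nanoarrow/nanoarrow.h>

/* Returns a type wide enough to hold both the column's current inferred type
 * and a newly observed SQLite storage class. */
static enum ArrowType PromoteType(enum ArrowType current, int sqlite_type) {
  if (sqlite_type == SQLITE_NULL) return current;  /* NULL fits any type. */
  if (current == NANOARROW_TYPE_STRING || current == NANOARROW_TYPE_BINARY) {
    return current;  /* Already the widest types tracked here. */
  }
  switch (sqlite_type) {
    case SQLITE_INTEGER:
      return current;                /* Both int64 and double can hold it. */
    case SQLITE_FLOAT:
      return NANOARROW_TYPE_DOUBLE;  /* int64 -> double. */
    case SQLITE_BLOB:
      return NANOARROW_TYPE_BINARY;
    case SQLITE_TEXT:
    default:
      return NANOARROW_TYPE_STRING;  /* Fall back to text for anything else. */
  }
}

As the earlier comment notes, inference only looks at the first batch; the adaptive part is letting promotion continue after the schema has already been reported, which is what makes it awkward for consumers.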
If we returned …

Other than that, I think requesting a schema would be helpful here (SQLite is not the only loosely typed data source that we might want a driver for!)
Returning …
With the following SQL, the SQLite driver returns a `number:int64, string:int64` schema, not a `number:int64, string:string` schema:
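(Editorial aside: the SQL from the original report is not preserved in this capture. The following hedged reproduction uses the ADBC C API directly against the SQLite driver, assuming a table created as CREATE TABLE data (number INTEGER, string TEXT), that the driver falls back to an in-memory database when no "uri" option is set, and an include path of arrow-adbc/adbc.h; error handling is omitted.)

#include <stdio.h>
#include <arrow-adbc/adbc.h>  /* may be <adbc.h> in older installs */

int main(void) {
  struct AdbcError error = {0};
  struct AdbcDatabase database = {0};
  struct AdbcConnection connection = {0};
  struct AdbcStatement statement = {0};

  AdbcDatabaseNew(&database, &error);
  AdbcDatabaseInit(&database, &error);
  AdbcConnectionNew(&connection, &error);
  AdbcConnectionInit(&connection, &database, &error);
  AdbcStatementNew(&connection, &statement, &error);

  /* Assumed DDL for an empty table matching the column names reported above. */
  AdbcStatementSetSqlQuery(&statement, "CREATE TABLE data (number INTEGER, string TEXT)", &error);
  AdbcStatementExecuteQuery(&statement, NULL, NULL, &error);

  AdbcStatementSetSqlQuery(&statement, "SELECT * FROM data", &error);
  struct ArrowArrayStream stream;
  int64_t rows_affected = -1;
  AdbcStatementExecuteQuery(&statement, &stream, &rows_affected, &error);

  struct ArrowSchema schema;
  stream.get_schema(&stream, &schema);
  for (int64_t i = 0; i < schema.n_children; i++) {
    /* With no rows to infer from, both columns report int64 (format "l"). */
    printf("%s: %s\n", schema.children[i]->name, schema.children[i]->format);
  }

  schema.release(&schema);
  stream.release(&stream);
  AdbcStatementRelease(&statement, &error);
  AdbcConnectionRelease(&connection, &error);
  AdbcDatabaseRelease(&database, &error);
  return 0;
}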