
c/driver/sqlite: Column type is always int64 with empty table #581

Open · kou opened this issue Apr 9, 2023 · 15 comments

@kou (Member) commented Apr 9, 2023

With the following SQL, the SQLite driver returns a `number: int64, string: int64` schema instead of the expected `number: int64, string: string` schema:

```sql
CREATE TABLE data (number int, string text);
SELECT * FROM data;
```
@lidavidm (Member)

It's a consequence of the driver inferring/guessing column types instead of trying to leverage metadata (since SQLite doesn't enforce declared types). We could have it use the declared types as a starting point/as a restriction (erroring if the data differs).
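
For illustration, a minimal standalone sketch (error handling omitted) of why the declared type can't simply be trusted:

```c
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
  sqlite3* db = NULL;
  sqlite3_open(":memory:", &db);
  sqlite3_exec(db, "CREATE TABLE data (number int)", NULL, NULL, NULL);
  /* No error: SQLite's flexible typing accepts a text value in an `int` column. */
  sqlite3_exec(db, "INSERT INTO data VALUES ('not a number')", NULL, NULL, NULL);

  sqlite3_stmt* stmt = NULL;
  sqlite3_prepare_v2(db, "SELECT number FROM data", -1, &stmt, NULL);
  sqlite3_step(stmt);
  /* Prints 3 (SQLITE_TEXT), despite the column being declared `int`. */
  printf("%d\n", sqlite3_column_type(stmt, 0));
  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}
```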

@kou (Member, Author) commented Apr 10, 2023

Ah, I thought that sqlite3_column_type() would work with an empty table, so I thought we could improve the current implementation with:

```diff
diff --git a/c/driver/sqlite/statement_reader.c b/c/driver/sqlite/statement_reader.c
index 694b8cc..21fb20f 100644
--- a/c/driver/sqlite/statement_reader.c
+++ b/c/driver/sqlite/statement_reader.c
@@ -495,7 +495,7 @@ void StatementReaderRelease(struct ArrowArrayStream* self) {
 
 /// Initialize buffers for the first (type-inferred) batch of data.
 /// Use raw buffers since the types may change.
-AdbcStatusCode StatementReaderInitializeInfer(int num_columns, size_t infer_rows,
+AdbcStatusCode StatementReaderInitializeInfer(sqlite3_stmt *stmt, int num_columns, size_t infer_rows,
                                               struct ArrowBitmap* validity,
                                               struct ArrowBuffer* data,
                                               struct ArrowBuffer* binary,
@@ -507,7 +507,21 @@ AdbcStatusCode StatementReaderInitializeInfer(int num_columns, size_t infer_rows
     ArrowBufferInit(&data[i]);
     CHECK_NA(INTERNAL, ArrowBufferReserve(&data[i], infer_rows * sizeof(int64_t)), error);
     memset(&binary[i], 0, sizeof(struct ArrowBuffer));
-    current_type[i] = NANOARROW_TYPE_INT64;
+    int sqlite_type = sqlite3_column_type(stmt, i);
+    switch (sqlite_type) {
+    case SQLITE_FLOAT:
+      current_type[i] = NANOARROW_TYPE_DOUBLE;
+      break;
+    case SQLITE_TEXT:
+      current_type[i] = NANOARROW_TYPE_STRING;
+      break;
+    case SQLITE_BLOB:
+      current_type[i] = NANOARROW_TYPE_BINARY;
+      break;
+    default:
+      current_type[i] = NANOARROW_TYPE_INT64;
+      break;
+    }
   }
   return ADBC_STATUS_OK;
 }  // NOLINT(whitespace/indent)
@@ -835,7 +849,7 @@ AdbcStatusCode AdbcSqliteExportReader(sqlite3* db, sqlite3_stmt* stmt,
   enum ArrowType* current_type = malloc(num_columns * sizeof(enum ArrowType));
 
   AdbcStatusCode status = StatementReaderInitializeInfer(
-      num_columns, batch_size, validity, data, binary, current_type, error);
+      stmt, num_columns, batch_size, validity, data, binary, current_type, error);
   if (status == ADBC_STATUS_OK) {
     int64_t num_rows = 0;
    while (num_rows < batch_size) {
```

But sqlite3_column_type() always returns SQLITE_NULL in this case. :<

We may want to use sqlite3_column_decltype()...
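
Something like the following, as a very rough sketch (the helper name is made up; sqlite3_column_decltype() returns the declared type string, or NULL for expression and subquery columns, so the int64 fallback is still needed, and a real mapping would have to follow SQLite's affinity rules rather than a few exact names):

```c
/* Rough sketch (not driver code): derive a starting Arrow type from the
 * declared column type. sqlite3_column_decltype() returns NULL for
 * expressions and subqueries, so NANOARROW_TYPE_INT64 stays the fallback. */
static enum ArrowType ArrowTypeFromDecltype(sqlite3_stmt* stmt, int col) {
  const char* decl = sqlite3_column_decltype(stmt, col);
  if (decl == NULL) return NANOARROW_TYPE_INT64;
  if (sqlite3_stricmp(decl, "real") == 0 || sqlite3_stricmp(decl, "double") == 0 ||
      sqlite3_stricmp(decl, "float") == 0) {
    return NANOARROW_TYPE_DOUBLE;
  }
  if (sqlite3_stricmp(decl, "text") == 0) return NANOARROW_TYPE_STRING;
  if (sqlite3_stricmp(decl, "blob") == 0) return NANOARROW_TYPE_BINARY;
  return NANOARROW_TYPE_INT64;
}
```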

@lidavidm (Member) commented Apr 10, 2023 via email

@paleolimbot (Member)

Is there a mechanism by which a user can bind an output schema? (That would let a user get a consistent output schema if that were important for an application... short of replicating an entire SQL parser, it might be the only way to ensure one in this case.)

@lidavidm (Member)

Not really. There's also not a great way to pass in a schema currently.

We don't need a SQL parser. It's just that SQLite is rather fuzzy about what type names it accepts. The rules are simple, but surprising.
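
For reference, the affinity rules from https://www.sqlite.org/datatype3.html boil down to ordered substring matching on the declared type name; a sketch (assuming `decl` has already been upper-cased):

```c
#include <string.h>

/* SQLite's column affinity rules, paraphrased: the first match in this
 * order wins. Hence "FLOATING POINT" gets INTEGER affinity (the "INT" in
 * "POINT"), and "STRING" gets NUMERIC affinity (no substring matches). */
static const char* AffinityOf(const char* decl) {
  if (decl == NULL) return "BLOB";  /* no declared type */
  if (strstr(decl, "INT")) return "INTEGER";
  if (strstr(decl, "CHAR") || strstr(decl, "CLOB") || strstr(decl, "TEXT"))
    return "TEXT";
  if (strstr(decl, "BLOB")) return "BLOB";
  if (strstr(decl, "REAL") || strstr(decl, "FLOA") || strstr(decl, "DOUB"))
    return "REAL";
  return "NUMERIC";
}
```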

@lidavidm (Member)

Or really: SQLite is untyped, despite whatever it lets you declare. We can patch over cases like this that look like they should work, but fundamentally the typed schema is a fiction we're trying to construct.

@miccoli commented Feb 13, 2025

Let me add a related problem, which I run into very often: when the first batch of rows is all null, the driver infers that the type is INT64, but then fails when the first non-null row is read, e.g.

```
[SQLite] Type mismatch in column 6: expected INT64 but got STRING/BINARY
```
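
For the record, a sketch that reproduces this (with `db` an open `sqlite3*` handle, and assuming the inference window is smaller than the run of leading NULLs):

```c
/* Build a table whose first 2000 rows have a NULL string column; inference
 * over the first batch then picks INT64, and the later text row mismatches. */
sqlite3_exec(db,
             "CREATE TABLE t (id int, s text);"
             "WITH RECURSIVE seq(i) AS"
             "  (SELECT 1 UNION ALL SELECT i + 1 FROM seq WHERE i < 2000)"
             " INSERT INTO t SELECT i, NULL FROM seq;"
             "INSERT INTO t VALUES (2001, 'hello');",
             NULL, NULL, NULL);
```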

@lidavidm (Member)

We should add a recipe to demonstrate it, but there's an `adbc.sqlite.query.batch_rows` parameter that can be set on the statement to tell it to read more rows (and hence base inference on more data).
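
For example, via the C API (a sketch; `statement` is assumed to be an already-initialized AdbcStatement for the SQLite driver, and the same option key can be set through the language bindings' statement options):

```c
struct AdbcError error = {0};
/* Widen the type-inference window to the first 4096 rows. */
AdbcStatementSetOption(&statement, "adbc.sqlite.query.batch_rows", "4096", &error);
AdbcStatementSetSqlQuery(&statement, "SELECT * FROM t", &error);

struct ArrowArrayStream stream;
int64_t rows_affected = -1;
AdbcStatementExecuteQuery(&statement, &stream, &rows_affected, &error);
```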

@lidavidm (Member)

I think ultimately what we want is #1514.

@lidavidm (Member)

@miccoli see an example in the last commit of #2523

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Feb 14, 2025
@miccoli commented Feb 14, 2025

> @miccoli see an example in the last commit of #2523

Thanks for clarifying this in the docs (#2523 is very informative!), but I was aware that batch_rows can be increased.

Unfortunately, in my use case almost all values of a single column are null (and no, I do not have control over the database, which is written by another app), so there is no easy heuristic for determining the correct value of batch_rows. Currently I simply read all the other columns with arrow-adbc and then have some custom code to create and append the missing column.

So yes, #1514 would be a possible solution, but a form of (opt-in) adaptive type inference could also be useful: in the presence of all-null columns, type inference should be considered incomplete and retried with a bigger batch of rows.
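
Roughly something like this inside the driver (a sketch of the proposed behavior only; `InferTypesFromFirstRows`, `initial_batch_rows`, and `max_batch_rows` are hypothetical names):

```c
/* Sketch of the proposed opt-in adaptive inference: keep doubling the
 * inference window while some column has seen only NULLs, up to a cap.
 * InferTypesFromFirstRows (hypothetical) would return true once every
 * column has seen at least one non-null value. */
int64_t batch_rows = initial_batch_rows;
bool inference_complete = false;
while (!inference_complete && batch_rows <= max_batch_rows) {
  inference_complete = InferTypesFromFirstRows(stmt, batch_rows, current_type);
  if (!inference_complete) batch_rows *= 2;
}
/* Columns that never saw a value could fall back to null (or int64). */
```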

@lidavidm (Member)

Hmm, an adaptive mode could be reasonable (or it could even be the normal mode).

@paleolimbot (Member)

If we returned null instead of int64 for an empty table, then the custom code would be much easier (pa.concat_xxx() will, I believe, cast a null column to anything else, but won't cast an int64).

Other than that, I think requesting a schema would be helpful here (SQLite is not the only loosely typed data source that we might want a driver for!)

@lidavidm (Member)

Returning null is also an option. I hesitate to change the behavior, but admittedly null is perhaps more accurate than defaulting to int64.

@lidavidm (Member)

I filed #2531 and #2532.
