[Go][Parquet] A uint16 number written to parquet file not parseable by DuckDB #209

Closed
venkat-oss opened this issue Dec 6, 2024 · 1 comment · Fixed by #210
Labels
Type: bug Something isn't working

Comments

venkat-oss commented Dec 6, 2024

Hi @zeroshade, I came across the closed issue #38616 and can still reproduce it when writing Arrow data to a Parquet file using pqarrow.

Here is the code that writes the Parquet file. It is based on one of your examples:

arrChan := make(chan arrow.Record, 10)

go func(ch <-chan arrow.Record) {
	// The first record supplies the schema for the file writer.
	firstRec := <-ch
	f, err := os.OpenFile("./test.parquet", os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// ...
	// we'll use the default writer properties, but you could easily pass
	// properties to customize the writer
	props := parquet.NewWriterProperties()
	writer, err := pqarrow.NewFileWriter(firstRec.Schema(), f, props,
		pqarrow.DefaultWriterProps())
	if err != nil {
		panic(err)
	}
	defer writer.Close()
	fmt.Println("here")

	if err := writer.Write(firstRec); err != nil {
		fmt.Println(err)
		panic(err)
	}
	// firstRec.Release()

	for rec := range ch {
		if err := writer.Write(rec); err != nil {
			panic(err)
		}
		// rec.Release()
	}
}(arrChan)

The Arrow records are released outside this function.

This code writes out a test.parquet file, and when I read it with DuckDB I get this error:

Error: Invalid Input Error: Failed to cast value: Type UINT32 with value 4294967295 can't be cast because the value is out of range for the destination type UINT16
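To make the error concrete: 4294967295 is the largest unsigned 32-bit value (bit pattern 0xFFFFFFFF), while a UINT16 can hold at most 65535. The sketch below mimics that kind of narrowing check in plain Go. `castToUint16` is a hypothetical helper written for illustration; it is not DuckDB's code, and where exactly DuckDB performs the check is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// castToUint16 is a hypothetical helper (not DuckDB code) that narrows a
// 32-bit unsigned value to uint16 and rejects anything above 65535.
func castToUint16(v uint32) (uint16, error) {
	if v > math.MaxUint16 {
		return 0, fmt.Errorf("value %d is out of range for uint16 (max %d)", v, math.MaxUint16)
	}
	return uint16(v), nil
}

func main() {
	// 4294967295 is math.MaxUint32, the value reported in the error above.
	for _, v := range []uint32{0, 65535, 4294967295} {
		if n, err := castToUint16(v); err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Println("ok:", n)
		}
	}
}
```

Only the last input fails, which matches the shape of the DuckDB message.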

Here is the output from the parquet-cli tool, similar to what's in #38616:

$ parquet pages test.parquet

Column: id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       3.00 B     3 B                 0       "0" / "0"


Column: resource.id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "0"


Column: resource.schema_url
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       43.00 B    43 B      
  0-1    data  _ R  1       9.00 B     9 B                 0       "https://opentelemetry.io/..." / "https://opentelemetry.io/..."


Column: scope.id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "0"


Column: metric_type
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B       
  0-1    data  _ R  1       3.00 B     3 B                 0       "1" / "1"


Column: name
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       7.00 B     7 B       
  0-1    data  _ R  1       3.00 B     3 B                         "gen" / "gen"

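The resource.id and scope.id columns have incorrect min values: the reported min of 4294967295 is larger than the reported max of 0, and it does not fit in the 16-bit unsigned logical type.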
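One pattern that produces exactly "4294967295" / "0" is a min/max accumulator over unsigned 32-bit values that starts from sentinel bounds and is then published without ever being updated. Whether that is what happens in the writer is not established here; the sketch below is only an illustration of the arithmetic, and `minMax32` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// minMax32 is a hypothetical min/max accumulator over uint32 values.
type minMax32 struct {
	min, max uint32
	seen     bool // true once at least one value has been observed
}

// newMinMax32 starts from sentinel bounds: min at the largest uint32 and
// max at the smallest, so the first observed value replaces both.
func newMinMax32() *minMax32 {
	return &minMax32{min: math.MaxUint32, max: 0}
}

// update folds one value into the running bounds.
func (m *minMax32) update(v uint32) {
	if v < m.min {
		m.min = v
	}
	if v > m.max {
		m.max = v
	}
	m.seen = true
}

// valid reports whether the bounds are safe to publish as statistics.
func (m *minMax32) valid() bool {
	return m.seen && m.min <= m.max
}

func main() {
	untouched := newMinMax32()
	fmt.Println(untouched.min, untouched.max, untouched.valid()) // 4294967295 0 false

	updated := newMinMax32()
	updated.update(0)
	fmt.Println(updated.min, updated.max, updated.valid()) // 0 0 true
}
```

A guard such as `valid()` is one way to avoid emitting sentinel bounds as real statistics.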

$ parquet meta test.parquet

File path:  test.parquet
Created by: parquet-go version 18.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  required int32 id (INTEGER(16,false));
  required group resource {
    optional int32 id (INTEGER(16,false));
    optional binary schema_url (STRING);
  }
  required group scope {
    optional int32 id (INTEGER(16,false));
  }
  required int32 metric_type (INTEGER(8,false));
  required binary name (STRING);
}


Row group 0:  count: 1  464.00 B records  start: 4  total(compressed): 464 B total(uncompressed):464 B 
--------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
id                   INT32     _ _ R     1         56.00 B    0       "0" / "0"
resource.id          INT32     _ _ R     1         62.00 B    0       "4294967295" / "0"
resource.schema_url  BINARY    _ _ R     1         171.00 B   0       "https://opentelemetry.io/..." / "https://opentelemetry.io/..."
scope.id             INT32     _ _ R     1         62.00 B    0       "4294967295" / "0"
metric_type          INT32     _ _ R     1         56.00 B    0       "1" / "1"
name                 BINARY    _ _ R     1         57.00 B            "gen" / "gen"
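As a cross-check of the metadata above, the reported statistics can be validated against each column's INTEGER(bitWidth, signed) annotation. The sketch below is a stand-alone illustration with the values copied by hand from the `parquet meta` output; `intRange`, `colStats`, and `check` are hypothetical helpers and are not part of any Parquet library.

```go
package main

import "fmt"

// intRange returns the inclusive value range of an INTEGER(bitWidth, signed)
// annotation. It is intended for bit widths of 8, 16, and 32.
func intRange(bitWidth uint, signed bool) (lo, hi int64) {
	if signed {
		return -(int64(1) << (bitWidth - 1)), int64(1)<<(bitWidth-1) - 1
	}
	return 0, int64(1)<<bitWidth - 1
}

// colStats holds one column's annotation and reported min/max.
type colStats struct {
	name     string
	bitWidth uint
	signed   bool
	min, max int64
}

// check returns a description of every inconsistency found in c.
func check(c colStats) []string {
	var problems []string
	lo, hi := intRange(c.bitWidth, c.signed)
	if c.min < lo || c.min > hi {
		problems = append(problems, fmt.Sprintf("min %d outside [%d, %d]", c.min, lo, hi))
	}
	if c.max < lo || c.max > hi {
		problems = append(problems, fmt.Sprintf("max %d outside [%d, %d]", c.max, lo, hi))
	}
	if c.min > c.max {
		problems = append(problems, fmt.Sprintf("min %d greater than max %d", c.min, c.max))
	}
	return problems
}

func main() {
	// Values copied from the schema and row group listing above.
	cols := []colStats{
		{"id", 16, false, 0, 0},
		{"resource.id", 16, false, 4294967295, 0},
		{"scope.id", 16, false, 4294967295, 0},
		{"metric_type", 8, false, 1, 1},
	}
	for _, c := range cols {
		fmt.Println(c.name, check(c))
	}
}
```

With these inputs, only resource.id and scope.id are flagged, each for an out-of-range min and for min exceeding max.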

I hope these reproduction details are sufficient. If any details are missing, please let me know and I will provide them as soon as possible. Thank you!

GOARCH='amd64'
GOOS='linux'
GOVERSION='go1.23.4'

Component(s)

Parquet

venkat-oss added the Type: bug (Something isn't working) label Dec 6, 2024
zeroshade (Member) commented

Well that's annoying :( I thought we got this one.

Thanks for the great reproducer, @venkat-oss! I'll take a look at this over the weekend and see if I can figure out what the issue is.
