Hey team, we’re building a funnel observability solution using Pinot, where we have a large table with more than 200B rows and 300+ columns. Of these, only 20–30 columns are commonly used, while the rest are almost always NULL (sparse). Based on my understanding of Pinot’s architecture, each column’s forward index will store some default value for every row, even if the column is NULL most of the time.
What is your recommendation for dealing with a table that has hundreds of mostly NULL columns, especially in terms of segment sizing, ingestion performance, and ongoing query efficiency?
Are there any roadmap items or upcoming features aimed at optimizing storage or performance for extremely sparse columns (e.g., skipping or compressing columns with mostly NULL values)?
Are there any existing best practices or configurations we could apply to make this more efficient (e.g., using defaultNullValue, dictionary vs. raw encoding, or segment-level pruning)?
Pinot always stores column values in a forward index. The forward index never stores null values, but it has to store a value for every row. Therefore, independent of the null handling configuration, Pinot always stores a default value for null rows in the forward index. The default value used for a column can be specified in the schema by setting defaultNullValue in the field spec; the appropriate defaultNullValue depends on the column's data type.
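As a rough illustration (not a recommendation for your exact schema), here is a minimal sketch of setting defaultNullValue per field spec; the schema name and column names below are hypothetical, and the defaults would need to match your own data types:

```json
{
  "schemaName": "funnel_events",
  "dimensionFieldSpecs": [
    {
      "name": "user_id",
      "dataType": "STRING"
    },
    {
      "name": "sparse_attribute_1",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "sparse_attribute_2",
      "dataType": "LONG",
      "defaultNullValue": -1
    }
  ]
}
```

For mostly-NULL columns, a compact default plus dictionary encoding generally keeps the on-disk cost small: the dictionary holds only the handful of distinct values, and the fixed-bit forward index needs very few bits per row when nearly every row maps to the same dictionary id.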