Expand dataframe.utils.py functions #42

Closed
Evan-Kim2028 opened this issue Sep 15, 2023 · 1 comment
Labels
duplicate This issue or pull request already exists enhancement New feature or request

Comments

@Evan-Kim2028

Feature Request & Rationale

Subgrounds should offer dataframe helper functions for GraphQL data in multiple libraries, not just pandas. Currently the only dataframe utility functions are for pandas, found here

Subgrounds is moving towards a multi-client world. One alternative to the base client would be a client that uses polars dataframes instead of pandas dataframes. However, dataframe_utils.py currently only offers pandas helper functions, which makes using polars with Subgrounds needlessly hard.

To use Subgrounds with polars today, helper functions such as fmt_dict_cols and fmt_arr_cols have to be redefined in every project.

  • fmt_dict_cols - required to convert nested GraphQL JSON objects (polars structs) into separate polars dataframe columns
  • fmt_arr_cols - required to split GraphQL JSON fields that contain arrays into individual polars dataframe columns.

Example code:

```python
import polars as pl


def fmt_dict_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats dictionary cols, which are 'structs' in a polars df, into separate columns and renames accordingly.
    """
    for column in df.columns:
        # struct values come back as Python dicts when indexed
        if isinstance(df[column][0], dict):
            col_names = df[column][0].keys()
            # rename struct fields to `<column>_<field>`
            struct_df = df.select(
                pl.col(column).struct.rename_fields([f"{column}_{c}" for c in col_names])
            )
            # unnest struct fields into their own columns
            struct_df = struct_df.unnest(column)
            # add struct_df columns to df
            df = df.with_columns(struct_df)
            # drop the original struct column
            df = df.drop(column)

    return df


def fmt_arr_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats lists, which are arrays in a polars df, into separate columns and renames accordingly.
    Since there isn't a direct way to convert array -> new columns, we convert the array to a struct and then
    unnest the struct into new columns.
    """
    for column in df.columns:
        # list values come back as pl.Series when indexed
        if isinstance(df[column][0], pl.Series):
            # convert array to struct (`.list` in recent polars releases; older releases used `.arr`)
            struct_df = df.select(pl.col(column).list.to_struct())
            # rename struct fields to `<column>_<i>`, one name per struct field
            n_fields = len(struct_df[column].struct.fields)
            struct_df = struct_df.select(
                pl.col(column).struct.rename_fields([f"{column}_{i}" for i in range(n_fields)])
            )
            # unnest struct fields into their own columns
            struct_df = struct_df.unnest(column)
            # add struct_df columns to df
            df = df.with_columns(struct_df)
            # drop the original list column
            df = df.drop(column)

    return df
```
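
To make the intended result concrete without depending on any dataframe library, here is a minimal plain-Python sketch of the column naming these two helpers aim for. The sample rows and the `flatten_row` helper are hypothetical and only illustrate the `<column>_<field>` and `<column>_<index>` convention; they are not part of Subgrounds or polars.

```python
# Hypothetical GraphQL-style rows; the field names are illustrative only.
rows = [
    {"id": "0x1", "token": {"symbol": "WETH", "decimals": 18}, "amounts": [10, 20]},
    {"id": "0x2", "token": {"symbol": "USDC", "decimals": 6}, "amounts": [30, 40]},
]


def flatten_row(row: dict) -> dict:
    """Flatten one level of nesting in a single row.

    Dict values become `<col>_<key>` entries (what fmt_dict_cols targets),
    list values become `<col>_<i>` entries (what fmt_arr_cols targets).
    """
    flat = {}
    for col, value in row.items():
        if isinstance(value, dict):
            for key, inner in value.items():
                flat[f"{col}_{key}"] = inner
        elif isinstance(value, list):
            for i, inner in enumerate(value):
                flat[f"{col}_{i}"] = inner
        else:
            flat[col] = value
    return flat


flat_rows = [flatten_row(r) for r in rows]
```

Each flattened row then carries the keys `id`, `token_symbol`, `token_decimals`, `amounts_0` and `amounts_1`, which is the column layout the polars helpers above are meant to produce (column order aside).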
@Evan-Kim2028 Evan-Kim2028 added the enhancement New feature or request label Sep 15, 2023
@0xMochan 0xMochan added the duplicate This issue or pull request already exists label Sep 15, 2023
@0xMochan
Collaborator

Duplicate of #29

@0xMochan 0xMochan marked this as a duplicate of #29 Sep 15, 2023