Pydantic compatibility issue #1677

Closed
2 tasks done
riziles opened this issue Jun 10, 2024 · 30 comments
Labels
bug Something isn't working

Comments

@riziles

riziles commented Jun 10, 2024

I believe that the latest versions of Pydantic and Pandera are not fully compatible.

This relates to #1395, which was closed but which I think should still be open.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.

This code throws an error:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

print(PydanticModel.model_json_schema())

error message:

Exception has occurred: PydanticInvalidForJsonSchema
Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': functools.partial(<bound method DataFrame.pydantic_validate of <class 'pandera.typing.pandas.DataFrame'>>, schema_model=SimpleSchema)})

For further information visit https://errors.pydantic.dev/2.7/u/invalid-for-json-schema
  File "C:\LocalTemp\Repos\RA\RiskCalcs\scratch.py", line 18, in <module>
    print(PydanticModel.model_json_schema())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic.errors.PydanticInvalidForJsonSchema: Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': functools.partial(<bound method DataFrame.pydantic_validate of <class 'pandera.typing.pandas.DataFrame'>>, schema_model=SimpleSchema)})

For further information visit https://errors.pydantic.dev/2.7/u/invalid-for-json-schema

I have tried various config options to get around this error to no avail.

  • OS: Windows
  • Pydantic version: 2.7.3
  • Pandera version: 0.19.3
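
The same failure can be reproduced without pandera. This is a minimal sketch, assuming Pydantic v2. `Opaque` is a hypothetical stand-in for any type whose core schema is a plain validator function, which is what the traceback shows for `DataFrame.pydantic_validate`. Validation still works; only JSON schema generation has nothing to describe.

```python
from typing import Any

from pydantic import BaseModel, GetCoreSchemaHandler
from pydantic.errors import PydanticInvalidForJsonSchema
from pydantic_core import CoreSchema, core_schema


class Opaque:
    """Hypothetical type validated by a plain function (not a pandera class)."""

    def __init__(self, value: Any) -> None:
        self.value = value

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        # A plain validator says how to validate, not what the JSON input looks like.
        return core_schema.no_info_plain_validator_function(cls._validate)

    @classmethod
    def _validate(cls, value: Any) -> "Opaque":
        return value if isinstance(value, cls) else cls(value)


class Model(BaseModel):
    x: int
    item: Opaque


# Validation itself is unaffected.
model = Model(x=1, item={"a": 1})

# Only JSON schema generation fails.
try:
    Model.model_json_schema()
    schema_error = None
except PydanticInvalidForJsonSchema as exc:
    schema_error = exc
```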
@riziles riziles added the bug Something isn't working label Jun 10, 2024
@riziles
Author

riziles commented Jun 10, 2024

Here is my really hacky workaround (no idea if it is right):

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as _DataFrame, Series

from pydantic_core import core_schema, CoreSchema
from pydantic import GetCoreSchemaHandler, BaseModel
from typing import TypeVar, Generic, Any

T = TypeVar("T")  

class DataFrame(_DataFrame, Generic[T]):

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:

        schema = source_type().__orig_class__.__args__[0].to_schema()

        type_map = {
            "str": core_schema.str_schema(),
            "int64": core_schema.int_schema(),
            "float64": core_schema.float_schema(),
            "bool": core_schema.bool_schema(),
            "datetime64[ns]": core_schema.datetime_schema()
        }

        return core_schema.list_schema(
            core_schema.typed_dict_schema(
                {
                    i:core_schema.typed_dict_field(type_map[str(j.dtype)]) for i,j in schema.columns.items()
                },
            )
        )


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

class PydanticModel(BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

@eharkins

eharkins commented Jan 23, 2025

@riziles @cosmicBboy any update on this pydantic compatibility issue with json schema and a possible fix in pandera? I am running into this same error in pandera 0.22.1. Looks like the fix PR did not get merged.

@cosmicBboy
Collaborator

Looks like #1704 addresses this, but it still has CI test errors

@ragrawal

ragrawal commented Jan 26, 2025

Any update on this? This issue blocks generating the docs page for FastAPI.

@riziles
Author

riziles commented Jan 27, 2025

@ragrawal , you're welcome to take a swing at figuring out why some tests are failing. I don't have the bandwidth to work on this right now.

@ragrawal

@riziles -- I looked into the PR but was not able to get it working. I am having trouble setting up the development environment. Also, I don't think the PR is generic enough; it tries to handle a very special case. I don't have an in-depth understanding of Pydantic or Pandera. I would appreciate it if someone could suggest another hack to get past the above issue.

@riziles
Author

riziles commented Jan 28, 2025

@ragrawal ,

This works:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series

from pydantic import BaseModel, WithJsonSchema
from typing import Annotated

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]


class PydanticModel3(BaseModel):
    y: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ]



@app.post("/input_api")
def input_this(pm3:PydanticModel3) -> list[str]:

    return pm3.y["str_col"].to_list()
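
A sketch of why the `WithJsonSchema` annotation helps, assuming Pydantic v2 and leaving pandera out. `Opaque` is again a hypothetical type backed by a plain validator, and `MANUAL_SCHEMA` is a hand-written placeholder standing in for whatever `SimpleSchema.to_json_schema()` returns. The annotation supplies the JSON schema directly, so Pydantic never has to derive one from the validator function.

```python
from typing import Annotated, Any

from pydantic import BaseModel, GetCoreSchemaHandler, TypeAdapter, WithJsonSchema
from pydantic_core import CoreSchema, core_schema


class Opaque:
    """Hypothetical type validated by a plain function (not a pandera class)."""

    def __init__(self, value: Any) -> None:
        self.value = value

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        return core_schema.no_info_plain_validator_function(cls)


# Placeholder schema written by hand for illustration only.
MANUAL_SCHEMA = {
    "type": "object",
    "properties": {"str_col": {"type": "array", "items": {"type": "string"}}},
}

AnnotatedOpaque = Annotated[Opaque, WithJsonSchema(MANUAL_SCHEMA)]


class Model(BaseModel):
    y: AnnotatedOpaque


# Both calls now succeed because the schema comes from the annotation.
model_schema = Model.model_json_schema()
adapter_schema = TypeAdapter(AnnotatedOpaque).json_schema()
```

The cost is that the declared schema is not checked against the validator, so keeping the two in sync is up to the caller.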

@riziles
Author

riziles commented Jan 28, 2025

...if you specify a to_format in your Pandera config, then you can output a dataframe, too:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series

from pydantic import BaseModel, WithJsonSchema
from typing import Annotated

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"


class PydanticModel3(BaseModel):
    y: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ]



@app.post("/input_api")
def input_this(pm3:PydanticModel3) -> PydanticModel3:

    return pm3

@riziles
Author

riziles commented Jan 28, 2025

... also, you can just use Annotated directly with FastAPI. You don't need to nest it in a Pydantic object:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"


@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3

@ragrawal

Thanks @riziles, this works great. Do you know how I can provide input data in "records" format? I tried adding

from_format = "dict"
from_format_kwargs = {"orient": "records"}

However, I got this error message: "Value error, Expected 'index', 'columns' or 'tight' for orient parameter. Got 'records' instead".

Below is my full code:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    col1: Series[str]
    col2: Series[int]

    class Config:
        to_format = "dict"
        from_format = "dict"
        from_format_kwargs = {"orient": 'records'}



@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3
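
The error message lists the orientations accepted on this path: 'index', 'columns' or 'tight'. One possible way around it, sketched with the standard library only, is to pivot row records into a column-oriented mapping before handing the data over. `records_to_columns` is a hypothetical helper, not part of pandera or pandas.

```python
from typing import Any


def records_to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot [{"col": value, ...}, ...] into {"col": [value, ...], ...}."""
    if not records:
        return {}
    columns = list(records[0])
    for position, row in enumerate(records):
        # Every row must carry exactly the same keys as the first one.
        if set(row) != set(columns):
            raise ValueError(
                f"row {position} has keys {sorted(row)}, expected {sorted(columns)}"
            )
    return {name: [row[name] for row in records] for name in columns}


records = [{"col1": "a", "col2": 1}, {"col1": "b", "col2": 2}]
columns = records_to_columns(records)
```

The result has the shape that the 'columns' orientation describes. Whether it passes through a given `from_format` configuration unchanged has not been verified here.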

@riziles
Author

riziles commented Jan 28, 2025

@ragrawal , I'd recommend creating your own custom Pydantic class to read in whatever format you want if you don't want to use Pandera's default config. For example, something like this:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema, BaseModel

from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]
    str_col2: Series[str]

    class Config:
        to_format = "dict"

class InputModel(BaseModel):
    str_col: str
    str_col2: str

@app.post("/input_api")
def input_this(
    pm3: list[InputModel],
) -> list[str]:
    
    df = DataFrame[SimpleSchema](pd.DataFrame([i.model_dump() for i in pm3]))

    print(pm3)
    print(type(df))
    return df["str_col2"].to_list()

@riziles
Author

riziles commented Jan 30, 2025

@ragrawal , can we close this issue?

@ragrawal

Sure, I appreciate your help on this.

@eharkins

eharkins commented Jan 30, 2025

@riziles I think the issue is still relevant despite the above workaround, since ideally pandera would work without special annotation when generating schemas in pydantic and fastapi.

@riziles
Author

riziles commented Jan 30, 2025

Wait a second. Just realizing that I opened this issue. I'm closing it as resolved because this project is awesome and @cosmicBboy probably has better things to work on.

@riziles riziles closed this as completed Jan 30, 2025
@alejandro-yousef

alejandro-yousef commented Jan 31, 2025

Agree with @eharkins. FastAPI is already very popular and is likely to become the most popular Python web framework in the future. I believe that full compatibility with documentation generation would benefit pandera usage in production environments.

@cosmicBboy
Collaborator

@riziles let's open it back up! There's a WIP PR that addresses it (#1704), but there are still some unit test issues on it.

@imseananriley not sure if you still have capacity to work on this. If not, perhaps someone on the thread can look into making the tests pass.

@cosmicBboy cosmicBboy reopened this Jan 31, 2025
@riziles
Author

riziles commented Jan 31, 2025

imseanriley is preoccupied at the moment. I might be able to throw some resources at it this summer, but I'd much rather focus on killing the Pandas dependency. We're very intent on migrating to Polars/Lance/DuckDB. Right now there is a competing project that has better Polars support: https://github.com/JakobGM/patito . I'd prefer to leave our Pandera models intact, but not if I have to keep Pandas in our containers.

@cosmicBboy
Collaborator

cosmicBboy commented Jan 31, 2025

Thanks @riziles, let me digest this feedback. It might be time to do pandera 1.0 and require users to install pandas themselves so that it's not a core dependency.

> Right now there is a competing project that has better Polars support

What are some of the deltas you see in patito that are missing in the pandera-polars integration?

@cosmicBboy
Collaborator

cosmicBboy commented Jan 31, 2025

In the meantime I'll look into fixing up #1704 to unblock this issue.

@riziles
Author

riziles commented Feb 1, 2025

> What are some of the deltas you see in patito that are missing in the pandera-polars integration?

It's just the removal of the Pandas dependency. Pandas is a heavy package that takes up a lot of space when spinning up environments and slows down start times if it needs to be imported.
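
To put a number on the start-up cost, a rough sketch using only the standard library is shown below. `import_seconds` is a hypothetical helper. It times an import inside the current process, so the figure is only meaningful the first time a module is loaded; running the interpreter with `python -X importtime` gives a more detailed breakdown.

```python
import importlib
import sys
import time


def import_seconds(module_name: str) -> tuple[float, bool]:
    """Return (elapsed seconds, was_cached) for importing module_name."""
    # A module already in sys.modules is returned from the cache almost instantly.
    was_cached = module_name in sys.modules
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start, was_cached


# Example call with a standard-library module; substitute "pandas" to measure it.
seconds, was_cached = import_seconds("json")
```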

@riziles
Author

riziles commented Feb 2, 2025

@ragrawal , I just discovered @cosmicBboy 's PydanticModel adapter here:
https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pydantic-models-in-pandera-schemas

Easier way to do what you are looking for:

import pandera as pa
from pandera.typing import DataFrame as DataFrame
from pydantic import BaseModel, TypeAdapter
from pandera.engines.pandas_engine import PydanticModel

from fastapi import FastAPI

app = FastAPI()

class InputModel(BaseModel):
    str_col: str
    str_col2: str


class SimpleSchema(pa.DataFrameModel):
    class Config:  # type: ignore
        dtype = PydanticModel(InputModel)
        coerce = True


@app.post("/test")
def input_this(pm3: list[InputModel]) -> list[str]:
    df = DataFrame[SimpleSchema](TypeAdapter(list[InputModel]).dump_python(pm3))

    return df["str_col2"].to_list()
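
The `TypeAdapter(list[InputModel]).dump_python(pm3)` call above turns validated models back into plain row dictionaries. A minimal sketch of just that step, assuming Pydantic v2 and leaving pandera and FastAPI out:

```python
from pydantic import BaseModel, TypeAdapter, ValidationError


class InputModel(BaseModel):
    str_col: str
    str_col2: str


adapter = TypeAdapter(list[InputModel])

# Validate raw rows into model instances, then dump them back to dicts.
rows = adapter.validate_python(
    [{"str_col": "a", "str_col2": "b"}, {"str_col": "c", "str_col2": "d"}]
)
records = adapter.dump_python(rows)

# A row with a missing field is rejected before any dataframe is built.
try:
    adapter.validate_python([{"str_col": "a"}])
    missing_error = None
except ValidationError as exc:
    missing_error = exc
```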

@ragrawal

ragrawal commented Feb 3, 2025

Hi @riziles -- Thanks for the suggestion. I have used PydanticModel before and had two concerns:

  1. I am not sure which of PydanticModel and DataFrameModel is more natively supported within Pandera. I couldn't find good documentation on the differences or similarities between the two. I also read somewhere that PydanticModel tends to be slower because it evaluates one row at a time.
  2. I feel PydanticModel has a lot of overhead. For instance, instead of a single schema, I have to define two different schemas: InputModel and SimpleSchema. As the number of schemas grows, this becomes a problem. Using DataFrameModel, I only have to define a single schema, and it looks cleaner.

@riziles
Author

riziles commented Feb 3, 2025

@ragrawal , if you want to input row wise data, there's always going to be more overhead. The whole reason Pandas, Polars, Arrow, Lance and DuckDB are so fast is that the data is stored in column vectors.

@cosmicBboy
Collaborator

fixed by #1904

@ragrawal

Hi @cosmicBboy -- now that #1904 is merged, how can the solution below be simplified?

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"


@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3

@cosmicBboy
Collaborator

@ragrawal I can test it out and see if we can simplify. Can you share full repro code on starting the server and making a call to the /input_api endpoint?

@vilmar-hillow

If I understand correctly, the above code is a workaround to enable proper JSON schema generation for the OpenAPI docs. E.g., running the above code stored in app.py with fastapi dev app.py and checking http://127.0.0.1:8000/docs results in a working docs page. With pydantic v1, a simpler definition worked:

from typing import Annotated
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series

from fastapi import FastAPI, Body

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]


@app.post("/input_api")
def input_this(
    pm3: Annotated[DataFrame[SimpleSchema], Body()],
) -> DataFrame[SimpleSchema]:
    return pm3
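
For the other half of a repro, calling the endpoint once the dev server is running, a standard-library client could look roughly like this. The address and path come from the comments above. The payload shape (one list of values per column) is an assumption and may need adjusting to whatever format the schema's config expects. `build_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address mentioned above for `fastapi dev app.py`


def build_request(payload: dict, path: str = "/input_api") -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> object:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed payload shape: column name mapped to a list of values.
request = build_request({"str_col": ["a", "b"]})
# result = send(request)  # uncomment once the server is up
```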

@cosmicBboy
Collaborator

> It's just the removal of the Pandas dependency. Pandas is a heavy package that takes up a lot of space when spinning up environments and slows down start times if it needs to be imported.

Hey @riziles, just to follow up here: I made a PR that removes the pandas dependency from the polars integration and makes it the user's responsibility to install pandas explicitly (or use the pandera[pandas] extra):

#1926

@riziles
Author

riziles commented Mar 7, 2025

Thank you @cosmicBboy !

6 participants