post python · 2025-01-18 · 5 min read
Pydantic v2 for production data contracts
The line “we use Pydantic” gets thrown around like it’s just a faster dataclasses. It is not. Pydantic v2 is a contract layer for any place where data crosses from outside your code (HTTP request, JSON file, LLM tool response, queue message) to inside it. Used well, it eliminates an entire class of “we got bad data and the bug surfaced 4 layers downstream” problem.
This post is the patterns I actually reach for in production, with code, plus the cases where I deliberately choose dataclasses or attrs instead.
The basic shape
```python
from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    email: str
    age: int = Field(ge=0, le=150)
    is_active: bool = True
```

Validation happens at construction:

```python
User(id=1, email="x@y.com", age=30)      # valid
User(id="abc", email="x@y.com", age=30)  # ValidationError: id must be int
User(id=1, email="x@y.com", age=200)     # ValidationError: age must be ≤ 150
```

Field(ge=...) adds runtime constraints. The model becomes a self-documenting contract: read the class and you know exactly what’s allowed.
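When validation fails, the ValidationError carries structured details, not just a message. A minimal sketch of inspecting them (the User model is repeated so the snippet stands alone):

```python
from pydantic import BaseModel, Field, ValidationError

class User(BaseModel):
    id: int
    email: str
    age: int = Field(ge=0, le=150)
    is_active: bool = True

try:
    User(id=1, email="x@y.com", age=200)
except ValidationError as exc:
    # each error dict has a location tuple, a machine-readable type, and a message
    errors = exc.errors()
    print(errors[0]["loc"], errors[0]["msg"])
```

That structured `loc`/`msg` shape is what makes it practical to map errors back to specific fields in API responses or logs.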
Validators where types are not enough
For constraints types alone can’t express, use validators.
Single-field validator — runs after type coercion:
```python
from pydantic import BaseModel, field_validator

class User(BaseModel):
    email: str

    @field_validator("email")
    @classmethod
    def email_must_have_at(cls, v: str) -> str:
        if "@" not in v:
            raise ValueError("email must contain '@'")
        return v.lower()  # normalise on the way in
```

Note the @classmethod decorator and the cls first argument. Pydantic v2 enforces this.
Whole-model validator — runs after every field is parsed, sees the whole instance:
```python
from datetime import date

from pydantic import BaseModel, model_validator

class DateRange(BaseModel):
    start: date
    end: date

    @model_validator(mode="after")
    def end_must_follow_start(self) -> "DateRange":
        if self.end < self.start:
            raise ValueError("end must be on or after start")
        return self
```

Use model_validator for cross-field invariants (“end must follow start”, “if A is null then B must not be”). Use field_validator for per-field rules.
mode="before" for pre-coercion — runs on the raw input before type conversion:
```python
from pydantic import BaseModel, field_validator

class User(BaseModel):
    age: int

    @field_validator("age", mode="before")
    @classmethod
    def parse_age(cls, v) -> int:
        if isinstance(v, str) and v.endswith(" years"):
            return int(v.removesuffix(" years"))
        return v
```

Use mode="before" when the incoming data has a quirky encoding you need to normalise. Use the default (mode="after") when you just want to validate the parsed value.
Computed fields for derived state
```python
from pydantic import BaseModel, computed_field

class Rectangle(BaseModel):
    width: float
    height: float

    @computed_field
    @property
    def area(self) -> float:
        return self.width * self.height
```

Rectangle(width=3, height=4).area → 12.0, and area is included in model_dump() output. Useful when you want a derived value to ship with the serialised data without storing it as a field.
Discriminated unions for polymorphic JSON
This is where Pydantic v2 actually shines. You have JSON like:
```json
[
  { "kind": "click", "x": 10, "y": 20 },
  { "kind": "scroll", "delta": 5 },
  { "kind": "key", "code": "Enter" }
]
```

Three different shapes, distinguished by the kind field. Without Pydantic this is awful. With it:
```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter

class Click(BaseModel):
    kind: Literal["click"]
    x: int
    y: int

class Scroll(BaseModel):
    kind: Literal["scroll"]
    delta: int

class KeyEvent(BaseModel):
    kind: Literal["key"]
    code: str

Event = Annotated[
    Union[Click, Scroll, KeyEvent],
    Field(discriminator="kind"),
]

# Single event
event = TypeAdapter(Event).validate_python({"kind": "click", "x": 10, "y": 20})
# event is now correctly typed as Click

# List of events
events = TypeAdapter(list[Event]).validate_python(raw_json_list)
```

The discriminator="kind" tells Pydantic to look at the kind field first and pick the matching schema. Validation is fast (no trial-and-error) and errors point at the right place.
This pattern is the cleanest way I know to parse LLM tool-call payloads, agent message buses, or any polymorphic JSON.
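To see the routing and the failure mode end to end, here is a self-contained sketch (two event types only, for brevity, plus an unknown kind):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError

class Click(BaseModel):
    kind: Literal["click"]
    x: int
    y: int

class Scroll(BaseModel):
    kind: Literal["scroll"]
    delta: int

Event = Annotated[Union[Click, Scroll], Field(discriminator="kind")]
adapter = TypeAdapter(list[Event])

# Valid payload: each dict is routed straight to its schema by "kind"
events = adapter.validate_python([
    {"kind": "click", "x": 10, "y": 20},
    {"kind": "scroll", "delta": 5},
])

# Unknown tag: rejected immediately, no trial-and-error against every member
try:
    adapter.validate_python([{"kind": "zoom", "level": 3}])
    tag_rejected = False
except ValidationError:
    tag_rejected = True
```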
TypeAdapter for non-model validation
You don’t need a BaseModel for everything. For a one-off shape:
```python
from pydantic import TypeAdapter

UserList = TypeAdapter(list[User])
parsed = UserList.validate_python(raw_data)
```

Or for a primitive with constraints:
```python
from typing import Annotated

from pydantic import Field, TypeAdapter

PostalCode = Annotated[str, Field(pattern=r"^\d{5}$")]
NL_PostalCode = Annotated[str, Field(pattern=r"^\d{4}\s?[A-Z]{2}$")]

TypeAdapter(NL_PostalCode).validate_python("1316XW")   # passes
TypeAdapter(NL_PostalCode).validate_python("13 16XW")  # raises ValidationError
```

Useful for validating function arguments, config values, or anywhere a full BaseModel is overkill.
model_config for cross-cutting behaviour
Tweak how a model behaves with model_config:
```python
from pydantic import BaseModel, ConfigDict

class StrictUser(BaseModel):
    model_config = ConfigDict(
        frozen=True,                # immutable after construction
        populate_by_name=True,      # accept either field name or alias
        extra="forbid",             # reject unknown keys
        str_strip_whitespace=True,  # auto-strip strings
    )

    id: int
    email: str
```

The four flags above are the ones I set most often:

- frozen=True — turns instances into pseudo-records. Hashable, so they can go in sets and be used as dict keys.
- populate_by_name=True — needed when you have field aliases (camelCase JSON ↔ snake_case Python) and want to construct from either name.
- extra="forbid" — strict mode for ingest contracts. Catch typos like enabel: true instead of silently ignoring them.
- str_strip_whitespace=True — common normalisation; saves a .strip() call everywhere.
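A quick sketch of what those flags buy you at runtime. Note that in v2, assigning to a frozen model raises ValidationError (not the TypeError a frozen dataclass would give you):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class StrictUser(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid", str_strip_whitespace=True)
    id: int
    email: str

u = StrictUser(id=1, email="  x@y.com  ")
# str_strip_whitespace: normalised on the way in
stripped_email = u.email

try:
    u.id = 2  # frozen: mutation is rejected
    mutated = True
except ValidationError:
    mutated = False

try:
    StrictUser(id=1, email="x@y.com", enabel=True)  # typo'd key
    typo_accepted = True
except ValidationError:
    typo_accepted = False
```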
Serialisation: model_dump() is what you want
```python
user = User(id=1, email="x@y.com", age=30)

user.model_dump()                   # dict
user.model_dump_json()              # JSON string
user.model_dump(exclude={"email"})  # drop fields
user.model_dump(by_alias=True)      # use aliases (camelCase JSON)
```

Forget .dict() (the v1 API, now deprecated). Use model_dump() everywhere.
When to NOT use Pydantic
A few cases where dataclasses or attrs win:
- Internal data with no external boundary. If the data only flows between trusted internal functions and you have full type-checker coverage, the runtime validation overhead isn’t worth it. Use @dataclass(slots=True).
- Hot paths. Pydantic is fast (Rust core in v2), but it’s not free. Validating millions of objects per second is still measurably slower than constructing dataclasses. Profile before assuming.
- Existing attrs codebases. Don’t introduce a third schema library. Pick one and stick with it.
A real-world example: LLM tool argument schema
This is where Pydantic earns its keep in 2025:
```python
from pydantic import BaseModel, Field

class SearchArgs(BaseModel):
    query: str = Field(description="free-text search term, eg 'vegan restaurants'")
    center_lat: float = Field(description="latitude of search centre, decimal degrees")
    center_lon: float = Field(description="longitude of search centre, decimal degrees")
    radius_meters: int = Field(default=1000, ge=50, le=50_000)
    limit: int = Field(default=10, ge=1, le=50)
    min_rating: float | None = Field(default=None, ge=0, le=5)

# Hand the schema to the LLM
schema_json = SearchArgs.model_json_schema()

# Parse the LLM's tool call
def search(raw_args: dict) -> SearchResult:
    args = SearchArgs.model_validate(raw_args)  # ValidationError if bad
    # args is now fully typed, with bounds enforced
    ...
```

The Field(description="...") strings end up in the JSON schema the LLM sees. The bounds (ge, le) get enforced at parse time. Bad arguments fail fast with a structured error you can turn into a “retry with corrected args” envelope (covered in the LLM tool-design post).
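A sketch of one such envelope. The dict shape here is my own invention, not from any library; the only Pydantic API involved is exc.errors(). A smaller model stands in for SearchArgs so the snippet runs on its own:

```python
from pydantic import BaseModel, Field, ValidationError

class SearchArgs(BaseModel):
    query: str
    limit: int = Field(default=10, ge=1, le=50)

def to_retry_envelope(exc: ValidationError) -> dict:
    """Turn a ValidationError into a structured 'retry with corrected args' message."""
    return {
        "status": "invalid_arguments",
        "errors": [
            {"field": ".".join(map(str, e["loc"])), "problem": e["msg"]}
            for e in exc.errors()
        ],
        "hint": "fix the listed fields and call the tool again",
    }

try:
    SearchArgs.model_validate({"query": "cafes", "limit": 999})
except ValidationError as exc:
    envelope = to_retry_envelope(exc)
```

Returned as the tool result, this gives the model something concrete to correct rather than an opaque stack trace.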
Closing
Pydantic isn’t “dataclasses with validation”. It’s a contract layer for everywhere data is untrusted: HTTP, file, LLM, queue. Use it at boundaries, lean on validators for invariants types can’t express, lean on discriminated unions for polymorphism, and lean on TypeAdapter for one-off shapes. Inside the trusted core of your code, dataclasses are fine.
The mental shift: stop thinking of it as “schema for my API” and start thinking of it as “the gate at every boundary.” Once you do, the code that survives makes more sense.