Dataglasses: easy creation of dataclasses from JSON, and JSON schemas from dataclasses
Posted by Udzu@reddit | Python | 37 comments
What My Project Does
A small package with just two functions: from_dict to create dataclasses from JSON, and to_json_schema to create JSON schemas for validating that JSON. The first can be thought of as the inverse of dataclasses.asdict.
The package uses the dataclass's type annotations and supports nested structures, collection types, Optional and Union types, enums and Literal types, Annotated types (for property descriptions), forward references, and data transformations (which can be used to handle other types). For more details and examples, including of the generated schemas, see the README.
Here is a simple motivating example:
from dataclasses import dataclass
from dataglasses import from_dict, to_json_schema
from typing import Literal, Sequence

@dataclass
class Catalog:
    items: "Sequence[InventoryItem]"
    code: int | Literal["N/A"]

@dataclass
class InventoryItem:
    name: str
    unit_price: float
    quantity_on_hand: int = 0

value = {"items": [{"name": "widget", "unit_price": 3.0}], "code": 99}

# convert value to dataclass using from_dict (raises if value is invalid)
assert from_dict(Catalog, value) == Catalog(
    items=[InventoryItem(name='widget', unit_price=3.0, quantity_on_hand=0)], code=99
)

# generate JSON schema to validate against using to_json_schema
schema = to_json_schema(Catalog)
from jsonschema import validate
validate(value, schema)
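And a quick check of the failure path (a sketch; it assumes the generated schema marks fields without defaults as required, and doesn't pin down the exact exception from_dict raises):

import jsonschema
bad = {"items": [{"name": "widget"}], "code": 99}  # missing the required unit_price
try:
    from_dict(Catalog, bad)
except Exception as e:
    print(f"from_dict rejected it: {e}")
try:
    validate(bad, schema)
except jsonschema.ValidationError as e:
    print(f"schema validation rejected it: {e.message}")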
Target Audience
The package's current state (small and simple, but also limited and unoptimized) makes it best suited for rapid prototyping and scripting. Indeed, I originally wrote it to save myself time while developing a simple script.
That said, it's fully tested (with 100% coverage enforced) and once it has been used in anger (and following any change suggestions) it might be suitable for production code too. The fact that it is so small (two functions in one file with no dependencies) means that it could also be incorporated into a project directly.
Comparison
pydantic is more complex to use and doesn't work on built-in dataclasses. But it's also vastly more suitable for complex validation or high performance.
dacite doesn't generate JSON schemas. There are also some smaller design differences: dataglasses transformations can be applied to specific dataclass fields, enums are handled by default, non-standard generic collection types are not handled by default, and Optional type fields with no defaults are not considered optional in inputs.
Tooling
As an aside, one of the reasons I bothered to package this up from what was otherwise a throwaway project was the chance to try out uv and ruff. And I have to report that so far it's been a very pleasant experience!
raydeo@reddit
This is also similar to desert, which works on top of marshmallow. Great library, but not enough maintenance :-(
Adrewmc@reddit
How is this better, then?
Udzu@reddit (OP)
from_dict is the opposite of asdict. Typically
from_dict(type(data), asdict(data)) == data.
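For instance, with the InventoryItem class above:

from dataclasses import asdict
item = InventoryItem(name="widget", unit_price=3.0)
assert from_dict(InventoryItem, asdict(item)) == item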
Adrewmc@reddit
…
ihavebeesinmyknees@reddit
That will not work if your dict is straight output from json.loads and the dataclass uses any custom types.
Adrewmc@reddit
Yeah… you… program that part…
The point of a dataclass isn't stability, but quick loads and deletions.
This is basically just pickle. Or something.
It's there anyway.
The goal here is to dynamically load classes, and I do not recommend that mindset. You should be loading into the classes in that program.
ihavebeesinmyknees@reddit
The point isn't to dynamically create classes; no clue why you're even mentioning
make_dataclass
. The point here is to load JSON-like data into an existing arbitrary dataclass. That functionality doesn't exist in dataclasses.
Adrewmc@reddit
But it does do **kwargs… and if you need more than that… my suggestion is to, you know, make a normal class.
Dataclasses are used to store specific information efficiently, to be loaded and discarded quickly. They are really good at that.
But if you're coming from a persistent place like a JSON file, then you will always need to load that file up somewhere.
See, because you are the one saving and storing the information, you can just run it better yourself.
Freschu@reddit
It's doing the exact inverse of your example, in addition to providing json-schema generation.
mistabuda@reddit
JSON schema generation is already covered by a decent number of packages on PyPI.
jegerarthur@reddit
I don't get the "more complex to use" for pydantic, when you can use exactly what you are doing with
validate_model
etc.
Udzu@reddit (OP)
Correct me if I'm wrong, but I thought you can't just apply pydantic to existing vanilla dataclasses? But yes, applying a TypeAdapter to a pydantic.dataclasses.dataclass isn't much more complex.
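Something like this, I believe (untested sketch, assuming pydantic v2):

from pydantic import TypeAdapter
from pydantic.dataclasses import dataclass

@dataclass
class InventoryItem:
    name: str
    unit_price: float
    quantity_on_hand: int = 0

item = TypeAdapter(InventoryItem).validate_python({"name": "widget", "unit_price": 3.0})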
jegerarthur@reddit
You actually can parse any objects into pydantic models, whether dataclasses or custom ones, in just one line.
Udzu@reddit (OP)
I meant parsing JSON to an existing dataclass without defining a pydantic model?
jegerarthur@reddit
I think you can, with
create_model
. The only question is: should you use this kind of uncontrolled behaviour? Creating arbitrary models / dataclasses? For me, the main point of dataclasses is to structure data objects.
fv__@reddit
There is also dataconf, which supports several formats including JSON.
ubitub@reddit
you forgot to sneak "py" somewhere in the name, literally unusable
simon-brunning@reddit
You might also want to add a comparison with https://pypi.org/project/dataclasses-json/ which is the library I've been using for this.
Udzu@reddit (OP)
Will check it out thanks!
mistabuda@reddit
from_dict
seems like it replicates the built-in functionality of **kwargs that you can use with dataclasses (and any function that takes keyword args) in the stdlib. Dataclasses also already have a built-in way to be converted to dicts with the asdict function. Generating a JSON schema from a dict at runtime is cool tho.
Udzu@reddit (OP)
But using kwargs isn't the opposite of asdict, as it doesn't handle nested structures: e.g. Catalog(**value) in the example above will not convert the items into InventoryItems. Also, the constructor doesn't type check: Catalog(None, None) will execute fine.
mistabuda@reddit
I didn't say it was the opposite. I said it looks like the functionality replicates the builtin.
Which is what it should do since the language is not statically typed.
Recursively creating the objects is neat, and I could see it being really helpful when dealing with JSON APIs, but unfortunately pydantic has already cornered that use case.
Freschu@reddit
Maybe that's exactly the point of this library. I personally would rather use this library than use that awful behemoth Pydantic.
At my workplace, schemas with the jsonschema package cover 100% of our validation and documentation cases. Pydantic models have zero value to us, because they're just replicating common schemas already shared across components in multiple languages, and the code we write doesn't need or rely on fully modeled data. We also make extensive use of additionalProperties, which is annoying with Pydantic.
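The kind of thing we lean on, roughly (a minimal sketch):

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {"_id": {"type": "string"}},
    "required": ["_id"],
    "additionalProperties": True,  # extra fields are allowed and pass through untouched
}
validate({"_id": "abc123", "anything": "else"}, schema)  # passes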
Also pydantic imposes many stupid pointless limitations on property names that not only clash with internally defined data, but with common existing services like MongoDB using
_id
as primary id... which is considered hidden/private by Pydantic. This only makes sense in Python-land, but is stupid when interfacing with other systems that follow other conventions.
mistabuda@reddit
I hear what you're saying.
I think in the vast majority of cases tho people want the exact behavior pydantic provides, and that's why it has garnered so much community support.
The battlefield this product serves has largely been decided, and this product is similar enough that most people won't see a point in switching from pydantic, which already has the majority of the market share.
I have seen a few projects that do stuff similar to this, and they all lost to pydantic because pydantic is what the community has indicated it wants.
Like I said earlier, I hope OP keeps creating things. The niche for this particular invention seems a tad too small.
Udzu@reddit (OP)
Yes, but my point was that it doesn’t replicate the builtin. Ditto re typing (which as you say the constructor shouldn’t care about).
I did mention pydantic in the comparison. And you may well be right, but having a minimal implementation that supports normal dataclasses proved useful for me, and might do for others too.
mistabuda@reddit
My comments are not to dissuade you from creating things.
In fact it's the opposite. I hope you keep making things! The use-case for this particular invention may be a lil too niche.
Freschu@reddit
The default kwargs constructor of dataclasses isn't applied recursively. Coupled with the non-evaluation of the type hints, you'll end up with broken data. This library's
from_dict
however does apply recursively.
Example of "broken" usage (a sketch, reusing the Catalog example from the post):
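value = {"items": [{"name": "widget", "unit_price": 3.0}], "code": 99}
catalog = Catalog(**value)                 # constructs without complaint, but...
assert isinstance(catalog.items[0], dict)  # the nested item is still a plain dict
# from_dict, by contrast, builds the nested dataclasses:
assert from_dict(Catalog, value).items[0].unit_price == 3.0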
Freschu@reddit
Good idea and focus for a library, terrible name. I can see why you wanted to choose that name; however, naming your library something that could very well be a typo (and homonym) of a stdlib package looks borderline suspicious and will be annoying to talk about. Especially for a small and focused library, I feel like a more distinguishable name would be desirable.
You're not specifying the version/draft of JSON-Schema your generated schemas will follow. Each draft of JSON-Schema has significant changes, making older drafts incompatible with newer ones, and this is important information for business code.
I appreciate the addition of transforms, but I don't see how transformation of specific attributes like `(Person, "name")` is useful. Either the transformation is already contained in the transforms via type, or the dataclass defines a `__post_init__`, or the dataclass used `Field`, and either of those already handles value transformation. It's less about not wanting a third way of achieving the same thing, more a question of why include and maintain code for something that is already solved elsewhere.
Some minor nitpicks from the README:
> type-annotated dataclasses
**All** dataclasses are type-annotated; it's in the design of dataclasses. It's not incorrect, just superfluous.
> dictionary data extracted from JSON
The bit about extracted from JSON is irrelevant. Again not incorrect, just superfluous. At worst mildly misleading, because it implies dictionaries **not** extracted from JSON wouldn't be valid input.
> The library contains just one file and two functions, so can even be directly copied into a project.
Not sure I'd put that in a README. Although your package has a really tight focus, once you gain users and deal with tickets, it could possibly evolve further. At which point you can still remove that sentence, but why even include it in the first place?
Udzu@reddit (OP)
Thanks again for the feedback.
Re the name: I hadn't thought of it being suspicious or confusing, but you're probably right. I'll try to think of an improvement.
Re the schema: it currently uses 2020-12 (specified in the output). Do you think the library readme should mention that explicitly? Will there ever be a reason to support multiple drafts as opposed to just the most recent widely supported one?
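Specifically, something like this (assuming the standard $schema key):

schema = to_json_schema(Catalog)
assert schema["$schema"] == "https://json-schema.org/draft/2020-12/schema"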
Re transforming individual fields: good point re post_init, though I wonder whether there’s still a use for this given the focus on scripting, eg when using someone else’s dataclass? It’s very little code, though I agree that it should be removed if it’s superfluous.
Freschu@reddit
You're right, that's a case I hadn't considered and where transforms on properties would indeed be useful!
Yes, please do mention that version in the README. People might only have time to skim the package description; don't assume everyone can/will read and test your code ;)
It's less about support for multiple drafts, more a compatibility concern. At my workplace we began using JSON-Schema when draft4 was current, and as these things often go, we haven't been able to upgrade. Technically, probably, draft4 schemas in the way we use them should be compatible for an upgrade to 2020-12, but we'd still have to allocate resources to verify that across multiple components and transition all of them at once. So until there's resources available to do this, the first check on any potential new library is "what version/draft of JSON-Schema?"
I don't think there's value in supporting multiple versions of JSON-Schema output, just mentioning the version/draft in the README is plenty.
Udzu@reddit (OP)
Thanks for the useful feedback!
richgio@reddit
Can it handle deserialization of a list of dataclasses?
Udzu@reddit (OP)
Yes, as long as you pass in the correct type, e.g. from_dict(list[InventoryItem], value).
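For instance, building on the example above:

import json
raw = json.loads('[{"name": "widget", "unit_price": 3.0}, {"name": "gadget", "unit_price": 5.5}]')
items = from_dict(list[InventoryItem], raw)
assert items == [InventoryItem("widget", 3.0), InventoryItem("gadget", 5.5)]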
ekbravo@reddit
Data Glasses!
Udzu@reddit (OP)
(I was originally going to call it dataklasses but that was already taken, so I came up with this! I mainly wanted it to sound like missing functionality from the dataclasses module, and appear next to it in the imports.)
rainyy_day@reddit
Dataglasses sounds more pythonic
ekbravo@reddit
I love it!