How can we diff two very large, nested JSON documents while ignoring some attributes?
Posted by repel_humans@reddit | learnprogramming | 12 comments
So I am working on a task where we have very nested and complex JSON describing an entity's configuration details (hundreds of attributes, with a combination of dicts and lists).
Now I need to compare two similar JSON files (downloaded from servers that are not in sync).
We need to perform a diff and only update if there are differences.
The problem is that certain attributes, at different depths of the JSON, need to be ignored while performing the diff.
That is, if only the ignored attributes have changed, we should not update the configuration.
I am using the DeepDiff library to get a true/false answer on whether there is a diff.
Can someone help with this?
python
EliSka93@reddit
That seems like a straightforward "loop over all properties and skip the ones you don't want to check" kinda situation. What's the difficulty?
repel_humans@reddit (OP)
Loop over it? Looping would not solve it, as there are nested attributes: a dict has a list that further down has another list/dict, etc. Or is there something else you are talking about?
Afraid-Locksmith6566@reddit
looping with recursion, same thing
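To make the "looping with recursion" suggestion concrete, here is a minimal sketch of a recursive comparison that skips ignored keys at any depth. The function name `equal_ignoring` and the sample key names are made up for illustration, and lists are assumed to be compared position by position.

```python
from typing import Any, Set


def equal_ignoring(a: Any, b: Any, ignored: Set[str]) -> bool:
    """Return True if a and b match once keys in `ignored` are disregarded."""
    if isinstance(a, dict) and isinstance(b, dict):
        keys_a = a.keys() - ignored
        keys_b = b.keys() - ignored
        if keys_a != keys_b:
            return False
        return all(equal_ignoring(a[k], b[k], ignored) for k in keys_a)
    if isinstance(a, list) and isinstance(b, list):
        # Assumption: list order matters, so items are compared by position.
        return len(a) == len(b) and all(
            equal_ignoring(x, y, ignored) for x, y in zip(a, b)
        )
    return a == b


# Hypothetical sample data; 'meta' and 'id' stand in for the ignored attributes.
old = {"name": "svc", "meta": {"ts": 1}, "rules": [{"id": 1, "port": 80}]}
new = {"name": "svc", "meta": {"ts": 2}, "rules": [{"id": 9, "port": 80}]}
changed = {"name": "svc", "meta": {"ts": 2}, "rules": [{"id": 9, "port": 443}]}

print(equal_ignoring(old, new, {"meta", "id"}))      # True
print(equal_ignoring(old, changed, {"meta", "id"}))  # False
```

Nothing is removed from either input here, so there is nothing to add back afterwards.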
repel_humans@reddit (OP)
Will it become slow when the expected JSON files are several MB in size? Also, I need to add those removed attributes back before the final update. So the same way again?
gyroda@reddit
This might be an X Y problem.
Take a step back, why do you want to do this strange comparison? Have you got an example of two inputs and an expected output? It doesn't have to be actual data, just enough to demonstrate the problem.
johnpeters42@reddit
Also, how often do you expect to do this? Speed matters more for "100k times per day" than "100 times per day".
gyroda@reddit
I wouldn't worry about performance too much at this step. Computers are very fast when it comes to crunching data like this.
Figure out what you're actually trying to do, then get it done as simply as possible, then measure it and only then do you start worrying about optimising. Especially if you're new enough to be posting a question here.
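For the "measure it" step, a rough sketch like the one below can show how long parsing plus one full recursive walk takes on a payload of a few megabytes. The data is synthetic and the helper `count_nodes` is made up for illustration; real timings depend on your machine and your actual documents.

```python
import json
import time


def count_nodes(node):
    """Recursively count every dict, list, and scalar in a parsed JSON value."""
    if isinstance(node, dict):
        return 1 + sum(count_nodes(v) for v in node.values())
    if isinstance(node, list):
        return 1 + sum(count_nodes(v) for v in node)
    return 1


# Synthetic stand-in for a real config: 50,000 small entities.
entity = {"id": 0, "meta": {"ts": 0}, "rules": [{"port": 80, "tags": ["a", "b"]}]}
payload = json.dumps({"entities": [entity] * 50_000})

start = time.perf_counter()
data = json.loads(payload)
nodes = count_nodes(data)
elapsed = time.perf_counter() - start

print(f"{len(payload) / 1e6:.1f} MB, {nodes} nodes, {elapsed:.3f} s")
```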
iOSCaleb@reddit
Use `jq` to extract the parts that you care about from each and then use `diff` or similar to compare them.
Or, feed them both into your favorite AI tool and tell it what you want.
repel_humans@reddit (OP)
The AI gave such complex slop that it seemed like "solution tape applied to code until it worked".
DehabAsmara@reddit
DeepDiff is definitely the best tool for this, but the trick for 'different depths' is to leverage `exclude_regex_paths` rather than standard paths. We hit this exact issue in a multi-agent project (manga generation) when comparing complex JSON outputs from different LLM agents. Since fields like 'timestamp' or 'token_usage' appear at various depths, regex is the only sane way to manage it.
You can handle this in three steps:
```python
from deepdiff import DeepDiff

# Target any attribute named 'id' or 'meta' anywhere in the tree.
# DeepDiff paths look like root['a']['id'], so the brackets are escaped once.
ignore_list = [r".*\['id'\]", r".*\['meta'\]"]

# old_data and new_data are the two parsed JSON documents.
result = DeepDiff(
    old_data,
    new_data,
    exclude_regex_paths=ignore_list,
    ignore_order=True,
)

if result:  # Only update if meaningful diffs exist
    do_update(new_data)
```
One major caveat: performance degrades quickly as your JSON grows into the multi-megabyte range because regex matching happens at every single node. If it gets too slow, it is actually more efficient to 'sanitize' the JSON objects by deleting the ignored keys before you even start the diff.
Are your servers returning strictly structured data, or does the structure vary between responses? If it's the latter, the regex approach is basically mandatory.
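The 'sanitize' alternative mentioned above could look roughly like this. It is only a sketch: `strip_keys` and the key names in `IGNORED` are made up, and it builds stripped copies instead of deleting keys in place, so the original documents stay intact for the later update.

```python
from typing import Any, Set


def strip_keys(node: Any, ignored: Set[str]) -> Any:
    """Return a copy of `node` with every key in `ignored` removed at any depth."""
    if isinstance(node, dict):
        return {k: strip_keys(v, ignored) for k, v in node.items() if k not in ignored}
    if isinstance(node, list):
        return [strip_keys(item, ignored) for item in node]
    return node  # scalars are immutable, so sharing them is safe


IGNORED = {"id", "meta"}  # hypothetical names of the attributes to ignore

old_data = {"name": "svc", "meta": {"ts": 1}, "rules": [{"id": 1, "port": 80}]}
new_data = {"name": "svc", "meta": {"ts": 2}, "rules": [{"id": 9, "port": 80}]}

# A plain comparison is enough for a true/false answer.
needs_update = strip_keys(old_data, IGNORED) != strip_keys(new_data, IGNORED)
print(needs_update)  # False: only ignored keys differ
```

Note that `!=` treats list order as significant, unlike `ignore_order=True`. If you want a detailed report rather than a boolean, the stripped copies can be passed to DeepDiff without any regex paths.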
repel_humans@reddit (OP)
I adopted the approach of deleting the ignored parameters by recursively finding the attributes and removing them (as a preprocessing step).
But I expect that the JSON files could be large sometimes.
Another important aspect concerns the fields that I remove while comparing the data from different datacenters. Let's say we compare dataDC1 and dataDC2 and find out that we need to sync DC2 to DC1 due to some changes. The constraint is that we need to add those removed attributes back to the DC2 payload, in exactly the same place, before firing the PUT APIs.
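One way to meet that constraint is to never delete anything from the originals and instead build the PUT payload by walking both documents together: take DC1's values everywhere, except for ignored keys, which keep DC2's values at the same path. This is a sketch under assumptions, not a drop-in solution: `merge_keeping_ignored` and the sample data are made up, list items are assumed to line up by index, and where DC2 has no counterpart the DC1 subtree is used unchanged (including its own ignored keys).

```python
from typing import Any, Set


def merge_keeping_ignored(source: Any, target: Any, ignored: Set[str]) -> Any:
    """Build a payload from `source`, keeping `target`'s values for ignored keys."""
    if isinstance(source, dict) and isinstance(target, dict):
        merged = {
            k: merge_keeping_ignored(v, target.get(k), ignored)
            for k, v in source.items()
            if k not in ignored
        }
        # Ignored keys are taken from the target at the same path in the tree.
        merged.update({k: v for k, v in target.items() if k in ignored})
        return merged
    if isinstance(source, list) and isinstance(target, list):
        # Assumption: items correspond by index; extra source items are used as-is.
        return [
            merge_keeping_ignored(item, target[i] if i < len(target) else None, ignored)
            for i, item in enumerate(source)
        ]
    return source


data_dc1 = {"name": "svc-v2", "meta": {"ts": 100}, "rules": [{"id": "a1", "port": 443}]}
data_dc2 = {"name": "svc", "meta": {"ts": 200}, "rules": [{"id": "b7", "port": 80}]}

payload = merge_keeping_ignored(data_dc1, data_dc2, {"id", "meta"})
print(payload)
```

In the result, `name` and `port` come from DC1 while `meta` and `id` keep DC2's values. If list items can be reordered between datacenters, matching by index is not enough and you would need to pair items by some stable key first.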
Flimsy_Actuator_6947@reddit
You could try preprocessing the JSONs to remove those ignored attributes before running DeepDiff. Something like a recursive function that walks through and pops out the keys you don't want to compare.
Another option is using DeepDiff's `exclude_paths` parameter: you can specify exact paths to ignore during comparison.