Has anyone tried this TOON? Is it worth the trouble?

[-]

programming-ModTeam@reddit

This post was removed for violating the "/r/programming is not a support forum" rule. Please see the side-bar for details.

[-]

andarmanik@reddit

Unless you are passing it a list of objects, you might as well just use YAML.

The main (only??) advantage of TOON is when the data are arrays of objects: you can factor out the keys so you only need to write them once.

So if your datapoints are like

{user: , id: , location: ,}

Your array of these would need to say user, id, and location for each datapoint, but TOON factors that out so that it’s like,

[count] {user, id, location}

Followed by the CSV of data.

In which case it’s like, we have that already, it’s called CSV.
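
A minimal Python sketch of that factoring, to make the comparison concrete. The `encode_tabular` helper and the header layout are assumptions for illustration: they follow the `[count] {keys}` shape described above, not the official TOON syntax or any TOON library.

```python
import csv
import io


def encode_tabular(rows):
    """Write the keys once, then one CSV line per object.

    Hypothetical helper: the header mirrors the "[count] {keys}" shape
    described in the comment, not the official TOON grammar.
    """
    if not rows:
        return "[0] {}"
    keys = list(rows[0])
    # Factoring out the keys only works when every object has the same keys.
    if any(set(row) != set(keys) for row in rows):
        raise ValueError("rows do not share one set of keys")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[k] for k in keys])
    header = "[" + str(len(rows)) + "] {" + ", ".join(keys) + "}"
    return header + "\n" + out.getvalue().rstrip("\n")


users = [
    {"user": "ada", "id": 1, "location": "London"},
    {"user": "bob", "id": 2, "location": "Paris"},
]
print(encode_tabular(users))
```

Apart from the first line, the result is ordinary CSV, which is the point being made here.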

[-]

norman-complete@reddit (OP)

It seems like yet another notation. I would prefer to use YAML.

[-]

wgrata@reddit

Honestly, I'm sick of human-readable formats like this. It always means I end up writing it directly instead of using a DSL in a real language.

[-]

norman-complete@reddit (OP)

+1 on that, friend. We all know the YAML hell.

[-]

SeaworthinessFar7265@reddit

What if I feed it CSV? Is there any difference?

[-]

Cacoda1mon@reddit

TOON can handle nested data, like YAML, but it loses all its advantages if the schema is inconsistent.
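
A rough way to see what "inconsistent schema" means in practice, as a sketch only. `common_header` is a hypothetical helper, not part of any TOON tooling: it returns the shared key list when every object has the same keys, and `None` when no single header can be factored out.

```python
def common_header(rows):
    """Return the shared key list if every object has the same keys, else None."""
    if not rows:
        return None
    first = list(rows[0])
    if all(set(row) == set(first) for row in rows):
        return first
    return None


uniform = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
mixed = [{"id": 1, "name": "a"}, {"id": 2, "email": "b@example.com"}]

print(common_header(uniform))  # ['id', 'name']
print(common_header(mixed))    # None
```

In the `mixed` case the keys have to be repeated per object again, so the saving from writing them once is gone.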

[-]

Hot-Employ-3399@reddit

(Benchmark) Generate ~150-160 questions across 4 datasets

This feels like a small sample size for a contender to overthrow JSON. It would be interesting to see how it scales, especially once we start including tabs, quotes, etc.

(Though personally I don't use agents and love minimal preprocessing, to the point of Ctrl-A, Ctrl-C from LibreOffice and Ctrl-V into a text editor, which kinda gives TSV and no quotes if a cell has line breaks.)

Also, I'm not sure a row count is needed rather than an end suffix: models generally prefer suffixes (prompts use im_start and im_end with no count of anything).

Not sure about the results, but a counter at the start has one disadvantage: cache destruction. If I append a row, it has to reprocess everything.
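
A small sketch of the cache concern, assuming the cache reuses work only for the leading part of the prompt that is unchanged. Both layouts below are hypothetical stand-ins (a count line first versus an `[end]` marker last), and the comparison is per character, whereas a real cache would compare tokens.

```python
import os


def with_count_prefix(rows):
    # Hypothetical layout: row count first, then the rows.
    return "[" + str(len(rows)) + "]\n" + "\n".join(rows)


def with_end_suffix(rows):
    # Hypothetical layout: rows first, then an end marker.
    return "\n".join(rows) + "\n[end]"


def shared_prefix_len(a, b):
    # Number of leading characters the two prompts have in common.
    return len(os.path.commonprefix([a, b]))


rows = ["ada,1,London", "bob,2,Paris"]
more = rows + ["cy,3,Rome"]

# Count first: the texts diverge at the count itself.
print(shared_prefix_len(with_count_prefix(rows), with_count_prefix(more)))  # 1
# Marker last: both existing rows are still a shared prefix.
print(shared_prefix_len(with_end_suffix(rows), with_end_suffix(more)))      # 25
```

With the count up front, appending one row changes the text almost from the first character; with a trailing marker, everything before the new row stays identical.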

[-]

slvrsmth@reddit

I'm no LLM surgeon, but don't you gain tons of precision by using a format well represented in training sets?

This seems to gain little over CSV for injecting tons of data, and there should be tons of CSV samples in training sets. Better for representing deeply nested objects tho.