Has anyone tried this TOON? Is it worth the trouble?
Posted by norman-complete@reddit | programming | View on Reddit | 10 comments
programming-ModTeam@reddit
This post was removed for violating the "/r/programming is not a support forum" rule. Please see the side-bar for details.
andarmanik@reddit
Unless you are passing it a list of objects, you might as well just use YAML.
The main (only??) advantage of TOON is when the data are arrays of objects, so that you can factor out the keys and write them only once.
So if your datapoints are like
{user: , id: , location: }
your array of these would need to say user, id, and location for each datapoint, but TOON factors that out so that it’s like
[count] {user, id, location}
followed by the CSV of data.
In which case it’s like, we have that already, it’s called CSV.
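A minimal sketch of that key factoring, assuming every datapoint has the same keys. The `to_tabular` helper, the sample rows, and the exact header layout are made up for illustration and follow the shape described above; this is not the official TOON encoder.

```python
import csv
import io
import json


def to_tabular(rows):
    """Write the keys once in a header line, then one CSV row per object."""
    if not rows:
        return "[0]{}:"
    keys = list(rows[0].keys())
    # The layout only makes sense when every row shares the same keys.
    for row in rows:
        if list(row.keys()) != keys:
            raise ValueError("rows do not share one schema")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[k] for k in keys])
    header = f"[{len(rows)}]{{{','.join(keys)}}}:"
    return header + "\n" + buf.getvalue().rstrip("\n")


# Hypothetical sample data.
rows = [
    {"user": "ann", "id": 1, "location": "Oslo"},
    {"user": "bob", "id": 2, "location": "Lima"},
]
tabular = to_tabular(rows)
as_json = json.dumps(rows)
print(tabular)
print(len(tabular), "chars vs", len(as_json), "chars as JSON")
```

Apart from the header line, the body is ordinary CSV, which is the point being made here.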
norman-complete@reddit (OP)
It seems like yet another notation. I would prefer to use YAML.
wgrata@reddit
Honestly, I'm sick of human-readable formats like this. It always means I end up writing it directly instead of using a DSL in a real language.
norman-complete@reddit (OP)
+1 on that, friend. We all know the YAML hell.
SeaworthinessFar7265@reddit
What if I feed it CSV? Is there any difference?
Cacoda1mon@reddit
TOON can handle nested data, like YAML, but it loses all its advantages if the schema is inconsistent.
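A rough sketch of what "consistent schema" could mean in practice: every item is a flat object with the same keys. The `shares_one_schema` helper and both sample lists are hypothetical, and the real rules for when a TOON encoder picks the compact tabular form may differ.

```python
def shares_one_schema(items):
    """True if every item is a dict with the same keys and only scalar values."""
    if not items or not all(isinstance(x, dict) for x in items):
        return False
    first_keys = set(items[0])
    scalars = (str, int, float, bool, type(None))
    return all(
        set(x) == first_keys and all(isinstance(v, scalars) for v in x.values())
        for x in items
    )


# Made-up data: the first list is uniform, the second is not.
uniform = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
mixed = [{"title": "A", "year": 1999}, {"title": "B", "authors": ["x", "y"]}]

print(shares_one_schema(uniform))
print(shares_one_schema(mixed))
```

When the check fails, the keys have to be repeated per item, so the output ends up about as long as YAML.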
Cacoda1mon@reddit
When the schema is less consistent, like the books in my example, TOON loses all its advantages over e.g. YAML.
yaml
toon
Hot-Employ-3399@reddit
This feels like a small sample size for a contender to overthrow JSON. It would be interesting to see how it scales, especially once we start including tabs, quotes, etc.
(Though personally I don't use agents, and I love minimal preprocessing, to the point of Ctrl-A, Ctrl-C from LibreOffice and Ctrl-V into a text editor, which kinda gives TSV, and no quotes if a cell has line breaks.)
Also, I'm not sure the number of rows is needed instead of a suffix: models generally prefer suffixes (prompts use im_end and im_start with no count of anything).
Not sure about the results, but a counter at the start has one disadvantage: cache destruction. If I append a row, it has to reprocess everything.
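A toy illustration of the cache point, assuming a prefix cache that can only reuse text that is identical from the very first character. It compares characters, not tokens, and the two serializers and the sample rows are made up, so treat it as a sketch of the argument and not a measurement.

```python
import os


def common_prefix_len(a, b):
    """Number of leading characters two strings have in common."""
    return len(os.path.commonprefix([a, b]))


def with_count(rows):
    # Header carries the row count, so it changes whenever a row is appended.
    return f"[{len(rows)}]{{id,name}}:\n" + "\n".join(rows)


def plain_csv(rows):
    # Header has no count, so appending a row only adds text at the end.
    return "id,name\n" + "\n".join(rows)


old = ["1,ann", "2,bob"]
new = old + ["3,cy"]

counted_old, counted_new = with_count(old), with_count(new)
csv_old, csv_new = plain_csv(old), plain_csv(new)

# With a count up front, the shared prefix ends at the count itself.
print(common_prefix_len(counted_old, counted_new), "of", len(counted_old))
# Without it, the whole old document is still a prefix of the new one.
print(common_prefix_len(csv_old, csv_new), "of", len(csv_old))
```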
slvrsmth@reddit
I'm no LLM surgeon, but don't you gain tons of precision by using a format well represented in training sets?
This seems to gain little over CSV for injecting tons of data, and there should be tons of CSV samples in training sets. Better for representing deeply nested objects tho.