CSV is the dominant format for data transfer. It isn’t great. Let’s do better! But to defeat CSV, first we need to understand it.
Invented in the early 1970s, CSV is still going strong today. Why? Because almost every data app can speak it. That means you can use CSV to move data between otherwise incompatible apps. And there are a LOT of apps in the world now, and CSV is often the only format they have in common.
But why is support for CSV so widespread? The reason it wins starts with its simplicity. Your data, with commas. Easy.
Your data, with commas.
But just because we all use CSV doesn’t mean we all love it. Just the opposite. There’s no one who has touched CSV who hasn’t been burned by it and lost hours to some silly muddle. Commas appear in data, so they have to be quoted somehow. But then what about the quote marks you used? They can appear in data too, right? So you have to be ready to quote them as well. And let’s not even talk about the line endings. CSV starts simple, but when your data gets fancy, CSV needs to get fancy too. And if the fanciness added on export doesn’t match the fanciness expected on import, you get your data, with corruption.
Your data, with corruption.
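To make the muddle concrete, here is a minimal sketch using Python's standard `csv` module. The sample row is made up for illustration. It compares a naive join-and-split on commas with the quoting the `csv` module applies by default.

```python
import csv
import io

# A made-up row whose fields contain a comma and quote marks.
row = ["Smith, Jane", 'said "hi"', "plain"]

# Naive approach: join on commas, split on commas.
# The comma inside the first field is mistaken for a separator.
naive = ",".join(row).split(",")

# Careful approach: let the csv module quote fields as needed.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Reading back with a matching reader recovers the original fields.
parsed = next(csv.reader(io.StringIO(encoded)))

print(len(naive))     # one field too many
print(repr(encoded))  # the comma field is wrapped in quotes, inner quotes are doubled
print(parsed == row)
```

The corruption described above happens when the writer and the reader disagree about these rules, for example when one side quotes and the other simply splits on commas.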
Can we do better with a new format? That sounds hard. It has been tried a few times, with fancy formats with many technical benefits. But CSV is still here. It is hard to beat its (initial) simplicity.
How about the same format, but with the comma replaced with something somehow better? That sounds easier to do, and would be just as simple a format. This has also been tried more than a few times, with tabs, semicolons, pipe symbols, special-purpose ASCII symbols, and the like. But CSV is still here. The replacements filled niches but never took off.
CSV is still here.
But maybe, just maybe, that’s because no one has ever been brave enough to pick the right replacement?
That seems unlikely.
Don’t listen to the pull quote, listen to me.
Here’s an idea: what if we use a separator that you’ve never heard of or seen in any file you’ve looked at (before today)? Then it is very unlikely to conflict with anything in your data, right?
Go on.
Okay. Consider the Angzarr: “⍼”. Here are some facts about this symbol from Wikipedia:
- “Angzarr is the name of a character in Unicode that has an unknown origin.”
- “… found in a 1972 Monotype typeset catalog … listed as matrix serial number S16139 … It is unknown why Monotype added the character, or what purpose it was intended to serve.”
- “The symbol appeared in the ISO publication Proposal … although the symbol has no specific purpose.”
Perfect, right? Just use this as a separator instead of a comma and we’re done?
That was easy.
Well no, obviously not, because now that we’ve talked about this symbol, you’ll tell all your friends, they’ll write about it, and suddenly it is everywhere and two weeks later tiktokers are all dancing the Angzarr.
⍼ TWIST YOU BODY / STICK YOU LEG OUT SHAWTY ⍼
It will definitely start showing up in your data. So we are back to needing to carefully quote our data to not conflict with the separator. Boo.
But … are we actually back entirely where we started? Not quite. One important thing did change. As soon as we pick a symbol like ⍼ as a separator, any app without decent Unicode handling will choke immediately. That’s actually great! That way you know you have a problem straight away, and not weeks later when a customer with an accent in their name harangues you. Another format called (very straightforwardly) “Unicode Separated Values” uses Unicode symbols specifically designed to be separators and has basically the same advantage.
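Here is a rough sketch of that "fail early" effect. The `export_row` function is a hypothetical stand-in for a legacy exporter that assumes ASCII; it is not taken from any real app.

```python
ANGZARR = "\u237c"  # the Angzarr symbol

def export_row(fields, sep, encoding="ascii"):
    """Hypothetical legacy exporter that assumes everything is ASCII."""
    return sep.join(fields).encode(encoding)

def fails(fields, sep):
    """Report whether the hypothetical exporter chokes on this row."""
    try:
        export_row(fields, sep)
    except UnicodeEncodeError:
        return True
    return False

# With a comma, plain rows sail through and the bug stays hidden...
print(fails(["Jane", "Smith"], ","))
# ...until a customer with an accent in their name shows up.
print(fails(["José", "García"], ","))
# With the Angzarr as separator, even the plainest row fails on day one.
print(fails(["Jane", "Smith"], ANGZARR))
```

The separator itself acts as a canary: it appears in every row, so the encoding problem cannot hide behind well-behaved test data.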
But we’re not done yet. Grab your passport: we’re going to visit the astral planes.
But I’m not done dancing the angzarr.
Before 2000, all Unicode symbols lived on what is now called the Basic Multilingual Plane, filled with Latin script, Korean, math symbols, and so on. That didn’t cut it in the complex business environment of the 21st century once high-speed trading of memes and reactions took off, so “supplementary” planes were bolted on for all the symbols needed.
“Needed.”
These are informally known as astral planes, perhaps due to their distance from the mundane beginnings of Unicode. If you need more than 4 hexadecimal digits to represent a Unicode symbol, it’s on an astral plane. An example of such a symbol is U+1F4A9 PILE OF POO, the humble “💩”.
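The "more than 4 hexadecimal digits" rule is easy to check in code. This is a small sketch; the helper names `plane` and `is_astral` are invented here for illustration.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane number of a single code point."""
    return ord(ch) >> 16

def is_astral(ch: str) -> bool:
    """True if the code point lies beyond the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF

ANGZARR = "\u237c"    # U+237C, four hex digits
POO = "\U0001F4A9"    # U+1F4A9, five hex digits

for symbol in (ANGZARR, POO):
    print(f"U+{ord(symbol):04X}", "plane", plane(symbol), "astral:", is_astral(symbol))
```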
Turns out there’s an advantage to using 💩 as a separator. If you want to catch problems early, 💩 will do it for you better than ⍼ or other mundane symbols drawn from the Basic Multilingual Plane. Programming languages have been playing catch-up as the space of symbols exploded, and an astral plane symbol is a good test that an app hasn’t fallen behind. Mathias Bynens covers some important problems in “JavaScript has a Unicode problem”, and specifically recommends using the 💩 symbol as a quick test for breakage.
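One reason astral symbols trip up software is that UTF-16, which JavaScript strings are built on, cannot fit them in a single 16-bit unit and splits them into a surrogate pair. The sketch below shows this from Python; `utf16_units` is a helper name made up for this example.

```python
import struct

def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

ANGZARR = "\u237c"
POO = "\U0001F4A9"

# A Basic Multilingual Plane symbol fits in one code unit.
print([hex(u) for u in utf16_units(ANGZARR)])
# An astral symbol needs two (a surrogate pair), so code that counts,
# slices, or reverses by code unit can cut the symbol in half.
print([hex(u) for u in utf16_units(POO)])
```

Any app that treats one code unit as one character will get the length of a 💩-separated row wrong, which is exactly the kind of breakage the test is meant to surface.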
Okay, so we switch the separator to 💩 and get free quality control of data import and export, without any increase in the implementation difficulty beyond that needed to faithfully convey modern text (which is exactly what we want to check). Is that it? Did we just win?
You do realize you are talking 💩.
To win, we need to get this new format everywhere. That starts today.
First, a word on naming. For credit in the historical record, the format has previously been proposed as PSV, with the earliest known pitch being in 2016 when this specification was posted:
We propose calling this new 💩 based format “DSV” for DOO Separated Values. This is mostly for deniability in import/export dialog menus in important business apps. DOO Separated Values looks more plausibly professional than Poop Separated Values. DOO is probably Digital Object Operations or some such, right? It is also to avoid a collision with Pipe Separated Values (this is a format that uses the ASCII “|” symbol instead of a comma because it looks pretty on a terminal if things line up just right).
Whatever its name, the practical advantages of the new format were first spelled out in a 2017 post called (somewhat prematurely) “The year of poop on the desktop” by an engineer who later joined Grist Labs, attracted by the idea of finally being able to implement his dreams in a modern spreadsheet.
“The year of poop on the desktop?”
Today the dream of a spreadsheet with native 💩 support becomes a reality. We have added DSV support to Grist, meaning you’ll be able to import and export the format anywhere Grist lives. DSV support has reached our open-source repository and our SaaS, and will be spreading out from there to grist-electron (a desktop app) and to grist-static (a pure front-end web component). So ultimately DSV will be supported everywhere Grist can go, which is a lot of places.
Another way to work with DSV today is to use Frictionless Data. DSV is by design just a dialect of CSV with a different separator. This is easy to express as a Frictionless Data CSV Dialect. Here is an example dsv.json file describing the dialect:
{
  "csvddfVersion": 1.2,
  "delimiter": "💩",
  "header": true
}
Then using frictionless-py we can immediately work with DSV files, converting them safely back and forth to and from correctly quoted CSV:
# convert shopping.csv to shopping.dsv
frictionless convert shopping.csv \
    --to-path shopping.dsv \
    --to-format csv --to-dialect dsv.json

# check that shopping.dsv is a valid DSV file
frictionless validate --format csv --dialect dsv.json shopping.dsv

# convert shopping.dsv to shopping.csv
frictionless convert shopping.dsv \
    --format csv --dialect dsv.json \
    --to-path shopping.csv
Frictionless data, flawless victory
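Since DSV is just CSV with a different separator, a language's ordinary CSV tooling may be enough on its own. Here is a minimal sketch using Python 3's standard `csv` module with the delimiter swapped. It is not part of Grist or Frictionless, and the helper names and the shopping rows are invented for illustration.

```python
import csv
import io

DOO = "\U0001F4A9"  # U+1F4A9 PILE OF POO, the DSV separator

def write_dsv(rows):
    """Serialize rows to DSV text, quoting fields only where needed."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=DOO).writerows(rows)
    return buf.getvalue()

def read_dsv(text):
    """Parse DSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=DOO))

# Made-up sample data: one field has a comma, one contains the separator.
rows = [
    ["item", "note"],
    ["eggs", "a dozen, large"],
    ["sticker", f"says {DOO} on it"],
]

text = write_dsv(rows)
print(text)
print(read_dsv(text) == rows)
```

Commas no longer need quoting, while a field that contains the separator itself still does, which the `csv` module handles with its usual rules. When writing to a real file, it is worth opening it with `encoding="utf-8"` and `newline=""` so the separator and line endings survive intact.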
Until DSV is everywhere, it will need a little adapter like frictionless here and there. But when you see DSV support in an app, you’ll be able to rest easy knowing that it will treat your data decently and not spring nasty surprises on you. If it isn’t there yet, well, you’ll just have to hold your nose, use stinky old CSV, and dream of a lighter future.