While working on some data visualization ideas, I came across the idea of “Sparklines”, which in the context of my job would be particularly valuable for coordinating information exchange (e.g. no daily data emails, no misinterpretation, etc.). My next thought was that they would need a push-capable data distribution channel, and the first one that came to mind was … Twitter. That gives us our initial set of conditions: how much data can one cram into 140 characters? There are 95 printable ASCII characters, so that equals 95^140 possible character combinations. That’s a lot of information. Doing a little more reading, we find that Twitter actually uses UTF-8 instead of ASCII (this allows the usage of Twitter in places other than the US and Europe). That puts our character space at 107,154 different graphic characters, which raises the potential for data to 107,154^140 possible character combinations.
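To put those combination counts into more familiar units, here is a minimal sketch that converts them to bits and bytes. The function name `capacity_bits` is just an illustrative choice, and it assumes every one of the 140 positions can hold any character in the alphabet.

```python
import math

def capacity_bits(alphabet_size: int, length: int = 140) -> float:
    """Bits needed to distinguish alphabet_size**length possible messages."""
    return length * math.log2(alphabet_size)

ascii_bits = capacity_bits(95)          # 95 printable ASCII characters
unicode_bits = capacity_bits(107_154)   # graphic Unicode characters cited above

print(f"ASCII:   {ascii_bits:.0f} bits (~{ascii_bits / 8:.0f} bytes)")
print(f"Unicode: {unicode_bits:.0f} bits (~{unicode_bits / 8:.0f} bytes)")
```

In other words, a tweet holds about 920 bits (roughly 115 bytes) with ASCII and about 2,339 bits (roughly 292 bytes) with the larger character set.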
Initially I thought about just sending some sort of line graph, but given what I now understand about Unicode, that seems a little limited.
So instead of literal line graphs, I realized that these characters simply represent data, and with the proper encode and decode functions you could send any file back and forth, given the proper handler. Rather than throwing files through a URL shortening service, could you actually throw the file itself?
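As a rough sketch of what such encode and decode functions might look like, the example below packs arbitrary bytes into characters, 14 bits at a time. The choice of the CJK Unified Ideographs block as the alphabet, and the names `BASE`, `BITS`, `encode` and `decode`, are assumptions made for illustration, not an established scheme.

```python
BASE = 0x4E00   # assumed alphabet: start of the CJK Unified Ideographs block
BITS = 14       # 2**14 = 16,384 code points, which stays inside that block

def encode(data: bytes) -> str:
    """Pack bytes into characters carrying 14 bits each."""
    # A leading 0x01 sentinel byte preserves any leading zero bytes.
    n = int.from_bytes(b"\x01" + data, "big")
    chars = []
    while n:
        n, digit = divmod(n, 1 << BITS)
        chars.append(chr(BASE + digit))
    return "".join(reversed(chars))

def decode(text: str) -> bytes:
    """Inverse of encode(): rebuild the integer, then drop the sentinel."""
    n = 0
    for ch in text:
        n = (n << BITS) | (ord(ch) - BASE)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]

message = encode(b"hello, sparklines")
assert decode(message) == b"hello, sparklines"
```

With this particular alphabet, a 140-character message carries at most 244 bytes of payload.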
Each Unicode character can take up to 4 bytes to encode, so as an optimistic upper bound, let’s treat each character as 4 bytes’ worth of data and start with a basic example. The default google.com page (viewed 12 May 2011, 10:04 AM via Google Chrome v12.0.742.30 dev-m) is 107,001 bytes. So by a 1-1 conversion, this could be represented by 26,751 Unicode characters (26,750.25 to be exact). Simple (Windows) compression brings it down to 34,551 bytes, or 8,638 Unicode characters (8,637.75), or 62 tweets. Now admittedly this is more than simple text, and not all potential Unicode combinations are displayable: the 107,154 graphic characters are a significantly smaller subset of the 1,112,064 valid code points, so a displayable character really carries closer to 2 bytes than 4. This suggests that we have to be mindful of edge cases.
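The arithmetic above can be reproduced with a few lines. This is only a sketch: `tweets_needed` is a made-up helper, and the 32 bits per character default mirrors the 4-byte assumption rather than what a displayable character can actually carry.

```python
import math

def tweets_needed(n_bytes: int, bits_per_char: float = 32.0,
                  chars_per_tweet: int = 140) -> tuple[int, int]:
    """Return (characters, tweets) needed to carry n_bytes of data."""
    chars = math.ceil(n_bytes * 8 / bits_per_char)
    tweets = math.ceil(chars / chars_per_tweet)
    return chars, tweets

raw_page = tweets_needed(107_001)    # uncompressed google.com page
zipped_page = tweets_needed(34_551)  # after simple compression

# Same compressed page, but limited to the 107,154 graphic characters.
graphic_only = tweets_needed(34_551, bits_per_char=math.log2(107_154))

print(raw_page, zipped_page, graphic_only)
```

Under the 4-byte assumption the compressed page comes out at 8,638 characters and 62 tweets, matching the figures above. Restricting the alphabet to graphic characters roughly doubles the tweet count.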
So formless data transmission is out, but here is where constraints actually improve the efficiency of a “text squirt”: context helps by allowing assumptions to be made about the content. Let’s say we’re doing a sparkline. The first thing that should appear before the data is an “escape character”: a reserved character signalling that everything from that point until a terminating character (likely a space) should be interpreted as a dataset. The maximum number that could be stored in a 4-byte Unicode character would be slightly less than 4.3 billion (256^4 = 4,294,967,296), but once again the real limit is the graphical space (107,154).
Resolution: the default font that Twitter uses is 15px for the standard window and 24px for the pop-out version. A set of 32,768 characters could encode every one of the 2^15 on/off pixel patterns in a 15px column at each of 139 positions, since allowing multiple pixel groups per column grows the space exponentially. A different escape character could instead indicate a sparkline, which lights a single pixel per column, so only 15 characters would be necessary. This bounds the resolution of the graph. The upper and lower data bounds, coupled with the anecdotally desirable slope of the change line (approximately 1 or -1), begin to define the space for our graph. I’ll put in some more work on this in the future, hopefully generating some example code.
Addendum: it was brought to my attention that bar graphs inspired by sparklines are making their way onto Twitter, even appearing in the WSJ Twitter feed, and there’s even some discussion of ANSI character line graphs in the comment thread.
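For the curious, a bar-graph style sparkline of that kind can be sketched in a few lines using the eight Unicode block characters U+2581 through U+2588. Note that this gives 8 height levels rather than the 15 pixel rows discussed earlier, and the `sparkline` function is my own illustration, not the code used in any of those feeds.

```python
BARS = "▁▂▃▄▅▆▇█"   # U+2581..U+2588, lowest to highest

def sparkline(values: list[float]) -> str:
    """Scale values between their min and max onto the eight bar heights."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return BARS[0] * len(values)   # flat data: draw the lowest bar
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * top)] for v in values)

print(sparkline([1, 5, 22, 13, 53, 40, 8, 2]))
```

Because the scaling uses the dataset's own upper and lower bounds, the tallest bar always corresponds to the maximum value, which is one simple way to settle the resolution question.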