Decoding and Encoding RSV Files with J

Another year has come and gone, and my failures continue to accumulate. I’ll stop failing when I’m dead. For the nonce, I will stew in my abyss of hubris, biting off more than I can ever hope to chew. But there will be plenty of time for self-abasement later (at least if my health holds). Today, I am commenting on an interesting little YouTube video that the algorithm shoved in my face.

You can peruse the video by clicking on RSV Rows of String Values. The author (Stenway) introduces another standard data format. He knows that advocates of new standards often run afoul of the XKCD curse.

Usually, I would ignore such things, at least until I can no longer ignore them. But what caught my attention was his solution to the dreaded delimiter collision problem. Data formats like CSV have long suffered from delimiter collisions. I use the TSV (TAB-separated values) format to reduce delimiter collisions. For humdrum data, TAB collisions are less frequent than COMMA collisions. Sadly, less frequent means collisions still occur, and when they do, your little parser must descend into the nth layer of escape-character Hell. Wouldn’t it be nice if delimiter collisions could be completely side-stepped?
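To make the collision concrete, here is a minimal Python sketch. The record is invented for illustration, and the naive join and split stand in for a TSV writer and reader that do no escaping.

```python
# A made-up record whose second field contains a TAB character.
fields = ["Widget", "Size:\t10 cm", "4.99"]

# Naive TSV: join the fields with TAB, then split on TAB to read them back.
line = "\t".join(fields)
parsed = line.split("\t")

# The embedded TAB is indistinguishable from a delimiter,
# so three fields go in and four come out.
print(len(fields), len(parsed))
print(parsed)
```

Fixing this inside TSV means quoting or escaping, which is exactly the rabbit hole described above.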

The RSV format exploits a feature of UTF8 encodings to eliminate delimiter collisions for UTF8 encoded string values. Until I watched RSV Rows of String Values, I was unaware that some byte values can never appear in valid UTF8 output. Stenway notes that a proper UTF8 encoder never emits the following bytes.

Binary    Hex  Decimal
11111111  FF   255
11111110  FE   254
11111101  FD   253
11111100  FC   252
11111011  FB   251
11111010  FA   250
11111001  F9   249
11111000  F8   248
Invalid UTF8 bytes
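The claim is easy to check by brute force. The following Python sketch (not part of the original J code; the variable names are mine) encodes every Unicode scalar value and records which byte values show up in the output.

```python
# Encode every Unicode scalar value (surrogates excluded, because they
# cannot be encoded as UTF-8) and collect the byte values that appear.
text = "".join(
    chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
)
seen = set(text.encode("utf-8"))

# Byte values that never occur anywhere in the encoded output.
never_emitted = sorted(set(range(256)) - seen)
print([f"{b:02X}" for b in never_emitted])

# The eight bytes listed above, F8 through FF.
reserved = set(range(0xF8, 0x100))
print(reserved.isdisjoint(seen))
```

The run also reports a few further bytes that a conforming encoder never produces (C0, C1, and F5 through F7); the eight bytes listed above are a subset of that set.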

If you are content to pass UTF8 strings around, you can delimit them with any of these bytes without collisions. This is precisely what the RSV format does. RSV uses three of them.

Name   Hex  Decimal
REOV   FF   255
RNULL  FE   254
REOR   FD   253
RSV delimiters
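To show how the three bytes lay out on the wire, here is a hedged Python sketch of an encoder. It assumes REOV terminates every value, RNULL followed by REOV stands for a null, and REOR terminates every row; the function name rsv_encode and the use of None for nulls are my own choices, not part of the post's J code.

```python
# The three RSV delimiter bytes from the table above.
REOV, RNULL, REOR = b"\xff", b"\xfe", b"\xfd"

def rsv_encode(rows):
    """Encode a list of rows (lists of str or None) as RSV bytes."""
    out = bytearray()
    for row in rows:
        for value in row:
            if value is None:
                out += RNULL                   # null marker byte
            else:
                out += value.encode("utf-8")   # never contains F8..FF
            out += REOV                        # terminate the value
        out += REOR                            # terminate the row
    return bytes(out)

# Two values, then a row with a null and an empty string, then an empty row.
encoded = rsv_encode([["Hello", "world"], ["a", None, ""], []])
print(encoded.hex(" "))
```

Because every value and every row ends with a terminator, an empty string, a null, and an empty row all get distinct byte sequences.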

Even better, the bytes are value terminators, which makes RSV easy to parse in J with the cut conjunction <;._2. The following J verbs encode and decode RSV: two lines of J code suffice.

NB. decode RSV bytes - NULLMARK marks nulls - example 'null'.
rsvdec=:  {{]`(NULLMARK"_)@.((,RNULL)&-:)L:0 <;._2&.> <;._2 y}}

NB. encode RSV bytes
rsvenc=: {{(0=#y)}. ;,&REOR&.> ;&.> REOV -.&.>~ ,&REOV L: 0 (]`(RNULL"_))@.(NULLMARK&-:) L: 0 y}}
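For readers who do not speak J, the two-level cut in rsvdec can be approximated in Python. This is only a sketch under my own assumptions: the name rsv_decode is hypothetical, nulls become None rather than the NULLMARK string the J verb uses, and malformed input raises an error.

```python
REOV, RNULL, REOR = b"\xff", b"\xfe", b"\xfd"

def rsv_decode(data):
    """Decode RSV bytes into a list of rows (lists of str or None)."""
    if data and not data.endswith(REOR):
        raise ValueError("RSV data must end with the row terminator")
    rows = []
    # Splitting on a terminator leaves one trailing empty piece; drop it.
    for raw_row in data.split(REOR)[:-1]:
        if raw_row and not raw_row.endswith(REOV):
            raise ValueError("row must end with the value terminator")
        raw_values = raw_row.split(REOV)[:-1]
        rows.append(
            [None if v == RNULL else v.decode("utf-8") for v in raw_values]
        )
    return rows

sample = b"Hello\xffworld\xff\xfda\xff\xfe\xff\xff\xfd\xfd"
print(rsv_decode(sample))
```

The outer split plays the role of the first <;._2 (cut on REOR) and the inner split the second (cut on REOV), which is why terminators rather than separators make the format so pleasant to parse.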

I have added rsv.ijs to my collection of JACKSHACKS J addons. You can install the addons on your own J system with:

NB. J package manager
load 'pacman'

NB. files from https://github.com/bakerjd99/jackshacks
install 'github:bakerjd99/jackshacks'
dir '~addons/jacks'

NB. load script
load '~addons/jacks/rsv.ijs'

Enjoy the hack!