Seems like they are just using a custom 5 bit charset for strings with a limited alphabet.
That's not really a rethink of string encoding, just using a different fixed encoding.
Given the use case, I'm not sure I understand why they don't just use static Huffman coding.
Everything old is new again. This reminds me of ZSCII, which was how Infocom games encoded string data to fit ~100kb of text data into a Commodore 64, packing 3 characters into every 2 bytes.
    --first byte------   --second byte---
    7    6 5 4 3 2 1 0   7 6 5  4 3 2 1 0
    bit  --first-- --second---  --third--

The 5-bit character encoding was one of three character sets:

     Z-char 6789abcdef0123456789abcdef
    current --------------------------
         A0 abcdefghijklmnopqrstuvwxyz
         A1 ABCDEFGHIJKLMNOPQRSTUVWXYZ
         A2  ^0123456789.,!?_#'"/\-:()
            --------------------------
Z-char 0 was a space in all character sets, and Z-char 1 was a newline. Z-chars 2 and 3 shifted to one of the other character sets (which one depended on the set you were already in) for the next character only, while Z-chars 4 and 5 switched permanently, the "shift-lock". https://inform-fiction.org/zmachine/standards/z1point0/sect0...
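The packing is simple bit manipulation; a short Python sketch (the end-of-string bit and the pad value follow the Z-machine spec, the function name and the rest are illustrative):

```python
def pack_zchars(zchars):
    """Pack Z-machine 5-bit characters, three per 16-bit word.

    Bit 15 of each word is the end-of-string marker; the remaining
    15 bits hold three 5-bit Z-chars, first one in the high bits.
    """
    out = bytearray()
    padded = list(zchars) + [5] * (-len(zchars) % 3)  # pad with Z-char 5
    for i in range(0, len(padded), 3):
        a, b, c = padded[i:i + 3]
        word = (a << 10) | (b << 5) | c
        if i + 3 >= len(padded):
            word |= 0x8000  # mark the final word of the string
        out += word.to_bytes(2, "big")
    return bytes(out)
```

Three characters per two bytes is how ~10.7 bits of text fit into every 16-bit word.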
It's a compression via domain-specific frequency encoding.
Good, but that's a pretty common technique for cases where every bit counts.
In RPC/serialization systems, we often need to send namespace/path/filename/fieldName/packageName/moduleName/className/enumValue strings between processes.
These strings are mostly ASCII. To transfer them between processes, we encode them as UTF-8, which takes one byte per character and is not actually space-efficient.
If we take a deeper look, we find that most characters are lowercase letters plus '.', '$' and '_', which fit in the much smaller range 0~31. But one byte can represent the range 0~255, so the significant bits are wasted, and this cost is not negligible. In a dynamic serialization framework, such metadata takes considerable space compared to the actual data.
So we proposed a new string encoding in Fury, which we call meta string encoding. It encodes most characters using 5 bits instead of the 8 bits UTF-8 uses, which brings 37.5% space savings compared to UTF-8.
For strings that can't be represented in 5 bits, we also proposed a 6-bit encoding, which brings 25% space savings.
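A rough sketch of the 5-bit packing idea (the alphabet below, a-z plus '.', '_', '$' and '|', is an illustrative guess; the actual Fury spec defines its own character tables):

```python
# Hypothetical 30-character alphabet; indices fit in 5 bits (0~31).
ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|"

def encode_5bit(s: str) -> bytes:
    """Pack each character into 5 bits, MSB-first, zero-padding the tail."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in s:
        bits = (bits << 5) | ALPHABET.index(ch)
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # pad final byte with zeros
    return bytes(out)
```

A 10-character identifier like "org.apache" shrinks from 10 UTF-8 bytes to ceil(50/8) = 7 bytes.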
In addition to the existing dictionary compression algorithms, it is very important to know the characteristics of the string contents. The proposed encoding has a very crude mechanism to describe them, but that's all (and that encoding flag is included with every encoded string, even though the schema could have carried that information instead). Namespaces would start with only a small number of fixed prefixes like `java.`, `com.` and `org.`. Namespaces, paths and file names will have a lot of overlapping substrings as well. Class and field names are more chaotic, but some heuristics can easily be extracted, as most of them obey common naming conventions.
By the way, the specification is, to say the least, highly confusing to interpret. For example, is the "ref(erence) meta" a separate section in the serialization format or not? The corresponding section never mentions which bytes are to be written, and reference flags apparently describe the state of each object, so they can't be in that section anyway. Providing even a single worked example would have significantly reduced the confusion.
(Due to the inability to fully comprehend the specification, I can't give general criticism and/or advice for now. But it does seem to miss several lessons from existing serialization formats. For instance, a varint encoding based on a continuation bit is actually the worst encoding for the given distribution, even though it is all too common! Consider an alternative like `1^n 0 bit^(7n+7)`, which can determine the continuation length from the first byte instead.)
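A sketch of that alternative layout, as I read the `1^n 0 bit^(7n+7)` description (the function names and big-endian byte order are my own choices, not from any spec):

```python
def encode_prefix_varint(x: int) -> bytes:
    """n leading 1-bits, a 0 separator, then 7n+7 payload bits: n+1 bytes total."""
    n = 0
    while x >= (1 << (7 * n + 7)):
        n += 1
    total_bits = 8 * (n + 1)
    prefix = ((1 << n) - 1) << (total_bits - n)  # n ones at the top
    return (prefix | x).to_bytes(n + 1, "big")   # separator 0 bit is implicit

def decode_prefix_varint(b: bytes) -> int:
    """The first byte alone determines how many bytes follow."""
    n = 0
    while b[0] & (0x80 >> n):
        n += 1
    total_bits = 8 * (n + 1)
    mask = (1 << (total_bits - n - 1)) - 1  # drop the n ones and the 0 bit
    return int.from_bytes(b[:n + 1], "big") & mask
```

Unlike a classic continuation-bit varint, the decoder never branches per byte: one scan of the first byte yields the full length.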
I can't help but feel like something has gone fundamentally wrong when there are so many arbitrary strings in the system that this optimisation makes a tangible difference.
OOOh, very very nice. I might have to copy this into the H2 database engine. We already use variable-size encoding for numbers, and this would make a nice addition to reducing the size of row data containing text.
If you're interested in this topic, there are more string compression algorithms that save transfer size or memory in different ways. Look into Thrift Compact Protocol, MessagePack, Protocol Buffers and Avro.
How does this compare to the official Unicode compression schemes? [1] [2]
[1] https://en.wikipedia.org/wiki/Standard_Compression_Scheme_fo...
[2] https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for...
For implementation details, https://github.com/apache/incubator-fury/blob/main/java/fury... can be taken as an example.
What about using gzip on messages? I'd expect it to yield similar savings while keeping the message format simpler.
For spec details, please see fury meta string spec: https://fury.apache.org/docs/specification/fury_xlang_serial...
Apple's NSString apparently uses an in-memory encoding (not a wire encoding) that can go as low as 5 bits per character, as of OS X 10.10 in 2014:
https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...
If the length is between 0 and 7, store the string as raw eight-bit characters.
If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX".*
If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013"
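The six-bit variant is easy to sketch. The alphabet below is the one quoted above, but the packing order and explicit length argument are illustrative guesses, not Apple's actual tagged-pointer layout:

```python
# Alphabet from the linked article, ordered by measured frequency.
SIXBIT = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX"

def pack_sixbit(s: str) -> int:
    """Pack up to 9 six-bit characters into one integer payload."""
    assert len(s) <= 9  # 9 chars x 6 bits = 54 bits, fits a tagged pointer
    value = 0
    for ch in s:
        value = (value << 6) | SIXBIT.index(ch)
    return value

def unpack_sixbit(value: int, length: int) -> str:
    """Recover the string, given its length (stored elsewhere in practice)."""
    chars = []
    for _ in range(length):
        chars.append(SIXBIT[value & 0x3F])
        value >>= 6
    return "".join(reversed(chars))
```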
---
How about 5 bits? This isn't totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix.
Related thread - https://lobste.rs/s/5417dx/storing_data_pointers#c_l4zfrv
It's interesting how network formats and in-memory formats kinda converge in design, because retrieving data from memory to CPU is so expensive now.
When doing ad-hoc space optimized encodings, it’s usually a good idea to compare the compress+decompress speed in addition to the resultant size against just compressing the data. This may win on one-off encodings of strings within an RPC, but the compression mechanism probably wins on the overall RPC message unless the message is very small & you can’t build a more optimized baseline dictionary for small messages.
It's really hard to actually beat something like zstd with these ad-hoc compression schemes: the compressed size will be bigger than what zstd produces with a dictionary (or on larger messages without one), and a round trip through zstd is likely faster as well.
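This is easy to sanity-check with the standard library, using zlib as a rough stand-in for zstd (the identifier strings are made up):

```python
import zlib

# General-purpose compression loses on a single tiny identifier
# (header/checksum overhead dominates) but wins easily once the
# message is larger and repetitive.
small = b"org.apache.fury.SomeClassName"  # hypothetical identifier
large = b",".join(b"org.apache.fury.SomeClassName%d" % i for i in range(1000))

assert len(zlib.compress(small)) > len(small)        # overhead dominates
assert len(zlib.compress(large)) < len(large) // 4   # repetition pays off
```

So the break-even point depends heavily on message size, which is why measuring on real RPC traffic matters more than per-string ratios.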