Meta String: A more space-efficient string encoding than UTF-8 in Fury

chaokunyang | 75 points

When doing ad-hoc space-optimized encodings, it's usually a good idea to compare the compress+decompress speed, in addition to the resulting size, against just compressing the data. This may win on one-off encodings of individual strings within an RPC, but a general compression mechanism probably wins on the overall RPC message, unless the message is very small and you can't build an optimized baseline dictionary for small messages.

It's really hard to actually beat something like zstd with these ad-hoc compression schemes: the compressed size will be bigger than what zstd produces (with a dictionary, or on larger messages without one), AND a round trip through zstd is likely faster.
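A quick sanity check of that short-message overhead, using Python's stdlib zlib (zstd isn't in the stdlib, but the fixed per-message overhead pattern is similar):

```python
import zlib

# A short identifier-like string, the kind an RPC meta section carries.
raw = b"org.example.service.UserProfileManager"

compressed = zlib.compress(raw, 9)

# General-purpose compressors pay fixed header/checksum/block overhead,
# so very short messages often come out *bigger* without a preset dictionary.
print(len(raw), len(compressed))
```

A dictionary trained on typical identifiers is how zstd claws this overhead back on small inputs.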

vlovich123 | 12 days ago

Seems like they are just using a custom 5-bit charset for strings with a limited alphabet.

That's not really a rethink of string encoding, just using a different fixed encoding.

Given the use case, I'm not sure I understand why they don't just use static Huffman coding.

bawolff | 12 days ago

Everything old is new again. This reminds me of ZSCII, which was how Infocom games encoded string data to fit ~100kb of text data into a Commodore 64, packing 3 characters into every 2 bytes.

    --first byte-------   --second byte---
    7    6 5 4 3 2  1 0   7 6 5  4 3 2 1 0
    bit  --first--  --second---  --third--
Each 5-bit z-char was looked up in one of three character sets:

       Z-char 6789abcdef0123456789abcdef
    current   --------------------------
      A0      abcdefghijklmnopqrstuvwxyz
      A1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
      A2       ^0123456789.,!?_#'"/\-:()
              --------------------------
Z-char 0 was space in all character sets, z-char 1 was newline, z-chars 2 and 3 switched to one of the other character sets depending on which one you were already in for the next character only, and z-chars 4 and 5 switched permanently, the "shift-lock".
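The 3-characters-into-2-bytes packing can be sketched like this (a toy model of the bit layout above; a real Z-machine encoder also handles the shift characters and abbreviation tables):

```python
def pack_zchars(c1, c2, c3, last=False):
    """Pack three 5-bit z-chars into one 16-bit word (big-endian).

    Bit 15 is the end-of-string flag; bits 14-10, 9-5 and 4-0 hold
    the first, second and third z-char respectively.
    """
    word = (int(last) << 15) | (c1 << 10) | (c2 << 5) | c3
    return word.to_bytes(2, "big")

def unpack_zchars(two_bytes):
    word = int.from_bytes(two_bytes, "big")
    return (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F

# "abc" in alphabet A0 is z-chars 6, 7, 8
packed = pack_zchars(6, 7, 8, last=True)
print(unpack_zchars(packed))  # (6, 7, 8)
```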

https://inform-fiction.org/zmachine/standards/z1point0/sect0...

amiga386 | 12 days ago

It's a compression via domain-specific frequency encoding.

Good, but that's a pretty common technique for cases where every bit counts.

eps | 12 days ago

In RPC/serialization systems, we often need to send namespace/path/filename/fieldName/packageName/moduleName/className/enumValue strings between processes.

Those strings are mostly ASCII. To transfer them between processes, we encode them as UTF-8, which takes one byte for every char and is not actually space efficient.

If we take a deeper look, we find that most chars are lowercase letters, '.', '$' and '_', which can be expressed in the much smaller range 0~31. But one byte can represent the range 0~255; the significant bits are wasted, and this cost is not negligible. In a dynamic serialization framework, such meta strings take considerable space compared to the actual data.

So we proposed a new string encoding in Fury, which we call meta string encoding. It encodes most chars using 5 bits instead of the 8 bits of UTF-8, which brings a 37.5% space saving compared to UTF-8.

For strings that can't be represented in 5 bits, we also proposed a 6-bit encoding, which brings a 25% saving.
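A minimal sketch of the 5-bit packing, assuming a hypothetical 30-symbol alphabet modelled on the lowercase set described above (the real spec also writes an encoding flag and defines the exact alphabets):

```python
# Hypothetical 30-symbol alphabet: lowercase letters plus a few
# separators common in package/class/field names.
ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|"

def encode_5bit(s):
    """Pack each char as a 5-bit index into ALPHABET, MSB-first,
    padding the final byte with zero bits."""
    bits = 0
    for ch in s:
        bits = (bits << 5) | ALPHABET.index(ch)
    nbits = 5 * len(s)
    nbytes = (nbits + 7) // 8
    return (bits << (nbytes * 8 - nbits)).to_bytes(nbytes, "big")

s = "org.apache.fury"
print(len(s.encode("utf-8")), "->", len(encode_5bit(s)))  # 15 -> 10
```

Here 15 bytes of UTF-8 become 10 packed bytes (~33%); the full 37.5% saving is reached only when the bit count divides evenly into bytes.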

chaokunyang | 12 days ago

In addition to existing dictionary compression algorithms, it is very important to know the characteristics of the string contents. The proposed encoding has a very crude mechanism for describing them, but that's all (and the encoding flag is included in every encoded string, even though the schema could have carried that information instead). Namespaces would start with only a small number of fixed prefixes like `java.`, `com.` and `org.`. Namespaces, paths and file names will have a lot of overlapping substrings as well. Class and field names are more chaotic, but some heuristics can be easily extracted, as most of them obey common naming conventions.

By the way, the specification is highly confusing to interpret, to say the least. For example, is the "ref(erence) meta" a separate section in the serialization format or not? The corresponding section never mentions which bytes are to be written, and reference flags apparently describe the state of each object, so it can't be in that section anyway. Providing even a single worked example would have significantly reduced the confusion.

(Due to the inability to fully comprehend the specification, I can't give general criticism and/or advice for now. But it does seem to miss several lessons from existing serialization formats. For instance, a varint encoding based on a continuation bit is actually the worst such encoding for the given distribution, even though it is all too common! Consider an alternative like `1^n 0 bit^(7n+7)`, which can determine the continuation length from the first byte instead.)
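That alternative, sometimes called a prefix varint, can be sketched as follows (my reading of the `1^n 0 bit^(7n+7)` description, not any particular format's actual wire encoding):

```python
def encode_prefix_varint(value):
    """Encode value in n+1 bytes: the first byte starts with n one
    bits and then a zero, so the decoder learns the total length
    from byte one. Payload is 7n+7 bits (values below 2**56 here)."""
    n = 0
    while value >= 1 << (7 * n + 7):
        n += 1
    header = ((1 << n) - 1) << 1              # n ones followed by a zero
    combined = (header << (7 * n + 7)) | value
    return combined.to_bytes(n + 1, "big")

def decode_prefix_varint(data):
    n = 0
    while data[0] & (0x80 >> n):              # count leading one bits
        n += 1
    value = int.from_bytes(data[:n + 1], "big")
    return value & ((1 << (7 * n + 7)) - 1), n + 1

# Unlike a continuation-bit varint, the decoder never loops byte by
# byte testing a flag; the first byte reveals the full length.
print(decode_prefix_varint(encode_prefix_varint(300)))  # (300, 2)
```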

lifthrasiir | 11 days ago

I can't help but feel like something has gone fundamentally wrong when there are so many arbitrary strings in the system that this optimisation makes a tangible difference.

NohatCoder | 12 days ago

OOOh, very very nice. I might have to copy this into the H2 database engine. We already use variable-size encoding for numbers, and this would make a nice addition to reducing the size of row data containing text.

grandinj | 10 days ago

If you're interested in this topic, there are more string compression schemes that save transfer size or memory in different ways. Look into Thrift Compact Protocol, MessagePack, Protocol Buffers and Avro.

agilob | 12 days ago

For implementation details, https://github.com/apache/incubator-fury/blob/main/java/fury... can serve as an example.

chaokunyang | 12 days ago

What about using gzip on messages? I'd expect it to yield similar savings while keeping the message format simpler.

vbezhenar | 12 days ago

For spec details, please see fury meta string spec: https://fury.apache.org/docs/specification/fury_xlang_serial...

chaokunyang | 12 days ago

Apple's NSString apparently uses an in-memory encoding (not a wire encoding) that can go as low as 5 bits per character, as of OS X 10.10 in 2014:

https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...

If the length is between 0 and 7, store the string as raw eight-bit characters.

If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX".*

If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013"

---

How about 5 bits? This isn't totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix.
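The length-based selection quoted above can be sketched like this, using the two alphabets from the article (a simplification; the real logic lives inside CoreFoundation's tagged-pointer string code):

```python
SIXBIT = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX"
FIVEBIT = SIXBIT[:32]  # the 32 most common characters

def pack_indices(indices, width):
    """Pack fixed-width alphabet indices into bytes, MSB-first."""
    bits = 0
    for i in indices:
        bits = (bits << width) | i
    nbits = width * len(indices)
    nbytes = (nbits + 7) // 8
    return (bits << (nbytes * 8 - nbits)).to_bytes(nbytes, "big")

def tagged_payload(s):
    """Pick the densest encoding that fits, mirroring the three cases."""
    if len(s) <= 7:
        return s.encode("ascii")                        # raw 8-bit chars
    if len(s) <= 9 and all(c in SIXBIT for c in s):
        return pack_indices([SIXBIT.index(c) for c in s], 6)
    if len(s) <= 11 and all(c in FIVEBIT for c in s):
        return pack_indices([FIVEBIT.index(c) for c in s], 5)
    return None  # too long or wrong alphabet: not a tagged pointer

print(len(tagged_payload("properties")))  # 10 chars * 5 bits -> 7 bytes
```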

Related thread - https://lobste.rs/s/5417dx/storing_data_pointers#c_l4zfrv

It's interesting how network formats and in-memory formats kinda converge in design, because retrieving data from memory to CPU is so expensive now.

chubot | 12 days ago
