PDEP-13: The Pandas Logical Type System

hackandthink | 46 points

This actually addresses a huge hole right now in the ecosystem. At the moment, Arrow treats things that are logically the same as different types, for example a dictionary-encoded utf8 is different from a utf8. Really there needs to be a distinction between logical and physical types, and it’s a big source of rough edges in things like Acero. Hacking on my own query engine on the side, I’ve put some thought into how to group physical into logical types in Arrow.
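To make the logical/physical split concrete, here is a minimal sketch of one way to group physical layouts under a logical type. It is not Arrow's or pandas' API: the `Logical` and `Physical` names, the layout strings, and both helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Logical(Enum):
    """Hypothetical logical types: what the values mean."""
    STRING = "string"
    INTEGER = "integer"


@dataclass(frozen=True)
class Physical:
    """Hypothetical physical type: how the values are laid out in memory."""
    name: str
    logical: Logical


# Several physical layouts can back the same logical type.
UTF8 = Physical("utf8", Logical.STRING)
LARGE_UTF8 = Physical("large_utf8", Logical.STRING)
DICT_UTF8 = Physical("dictionary<int32, utf8>", Logical.STRING)
INT64 = Physical("int64", Logical.INTEGER)


def same_physical(a: Physical, b: Physical) -> bool:
    """Strict comparison: the layouts must match exactly."""
    return a == b


def same_logical(a: Physical, b: Physical) -> bool:
    """Relaxed comparison: only the meaning of the values must match."""
    return a.logical is b.logical


print(same_physical(UTF8, DICT_UTF8))  # layouts differ
print(same_logical(UTF8, DICT_UTF8))   # both are logically strings
```

Under this sketch, a query engine would plan joins and comparisons with `same_logical` and only consult the physical layout when choosing a kernel.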

I do worry, however, that the right place to define the logical type system is inside Arrow itself rather than in each Arrow consumer. If everyone has their own logical type system, we’re just going to end up with the same incompatibilities that Arrow was trying to solve.

sakras | 12 days ago

I've opened some issues over the years with Pandas on the transition to Arrow dtypes. They are generally very nice, but even today I have many workaround patches in our codebase to avoid bugs (e.g. to get fillna and dropna working as expected).

This PDEP doesn't discuss nullable types at all (i.e. what is currently Float32 instead of float32), which is worth mentioning (though support is pretty good now in Pandas). If any of the core developers reads this, can I suggest `float32?` instead of `Float32`? I think it's more obvious what is intended with the question mark than with capitalization.
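As a rough illustration of the two spellings, here is a small sketch that normalizes both to the same (base name, nullable) pair. Note that `float32?` is only the suggestion above, not a spelling pandas accepts today, and `parse_dtype` is a hypothetical helper. The capitalization rule is also a simplification that only covers names like `Float32` and `Int64`.

```python
from typing import Tuple


def parse_dtype(spec: str) -> Tuple[str, bool]:
    """Return (base_name, nullable) for a dtype spelling.

    Hypothetical helper: handles the proposed trailing "?" and the
    existing capitalized convention (e.g. "Float32"), nothing else.
    """
    if spec.endswith("?"):
        # Proposed spelling: the question mark marks the type as nullable.
        return spec[:-1], True
    if spec[:1].isupper():
        # Existing convention: a leading capital marks the nullable variant.
        return spec.lower(), True
    return spec, False


for spelling in ("float32", "Float32", "float32?"):
    print(spelling, parse_dtype(spelling))
```

Both nullable spellings carry the same information; the difference is only how visible the nullability is to someone skimming a schema.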

singhrac | 13 days ago

For analytics, I am a fan of type minimalism, like in SAS. You only really need to distinguish numbers and strings (each with a missing value), and the table as an overarching structure. They do everything else with input and output "formats". Perhaps you want to throw in enumerations (and sets thereof) for good measure.

That being said, typing is hard. There are many ways to consider what is a type, and I don't see a clear physical/logical distinction. Whether type should determine storage method is one thing. Another consideration is whether type is intrinsic to the value (e.g. string in UTF-8, float64, etc.) or is extrinsic (e.g. length, weight, area and so on). This is somewhat based on whether the type is implied by the operations used on the value (for example, you can subtract numbers but not strings).

Edit: Also, if you're gonna treat time as a separate type, why not also have 2D and 3D coordinates as their own types? It's just hard to make an ultimate standard.

js8 | 12 days ago

I keep on reading this as PDP-13, a la the PDP-11 computers from history

carterschonwald | 12 days ago

Does pola.rs fix all this stuff?

esafak | 12 days ago