A humble guide to database schema design

helloiloveyou | 414 points

Most of this advice is stuff they tell you in school that doesn’t matter at all in reality.

Real advice:

- Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

- You need nullable dates and nullable numbers. You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

- Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

- When in doubt, just have a column full of JSON and figure it out later.

- Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.

earthboundkid | 2 months ago

> Never store firstname and lastname in the same field

I disagree. Why would you want to force the "firstname lastname" pattern on people? What if someone has three names? Or only one? Just use a "name" field!

tasuki | 2 months ago

Design guidance aside, this is not a good article because it doesn't show how to fix the presented problem scenarios.

"This is not first normal form" - OK, how would you fix it?

"This is not second normal form" - OK, how would you fix it?

"This is neither second nor third normal form" - OK, how would you fix it?

If you're going to show problems, show solutions as well.

The rest is a big red flag list of falsehoods that programmers believe about names and addresses.

ebg13 | 2 months ago

If you're dealing with international addresses, point 3 becomes very challenging. The 'tokens' of an address are called different things everywhere, take different forms, and sometimes don't make much sense to compare. Figuring out a good balance of usability and generality can be really tricky.

Here's nearly 100 pages on the subject (from the perspective of addressing mail for USPS): http://www.columbia.edu/~fdc/postal/

tomlagier | 2 months ago

Off-topic comment / feedback about the blog post:

The author self-describes himself as "_Trusted_ Consultant" in the blog's subtitle.

The comic strips illustrating the article do not mention about the permission from identified license provider(s), and are cropped, does not help building trust - at least in my case.

Are they used with authorization from their authors?

me_bx | 2 months ago

This person would disagree about point 4: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

heisenzombie | 2 months ago

In what way is this guide 'humble'? Seems fairly self-confidently opinionated to me.

boffinism | 2 months ago

Firstname lastname is too fragile and oversimplified. Suffix, prefix and a variable number of middle names/initials can also be necessary. Occasionally, legal name vs. nickname is also needed. But if there's no pressing need for grouping or sorting by surname, why piecemeal any of it out?

awaythrower | 2 months ago

> Never store firstname and lastname in the same field

funny, where would you store the middle name then? and the second middle name? married name in cultures where the spouse changes or double it? of course in normal form you need a name table, with an order column, and so on and so long.

anyway, this article goes straight to the notions everyone can fetch on any book and skip the interesting and diffiylcult part about database design, which is understanding your domain so that no future expansion is prevented by a wrong cardinality on a relation that was simplified from reality.

because if it's a user database you might get away with just the name, so you can address people during support calls. if it's an accounting database, you might want to know someone name at the time of invoicing, having multiple records with start and end validity times, and if it's a FBI database for protected identities you might have to have multiple name aliases referring to a single person.

LoSboccacc | 2 months ago

Since everyone has to have an opinion and provide it, here's mine, maybe a retitle?

>A humble guide to transactional database schema design

There are a number of things that don't apply to analytical database design and I'm seeing the two confused fairly often, then again, this comes entirely from my point of view so take it for what it's worth.

atwebb | 2 months ago

I work on an e-commerce platform that's live in about 20 countries. One of the hardest records to normalize correctly is the address, because not only do users have very different rules depending on their country (e.g. not all addresses in Turkey have post codes, GB has some very weird building names/numbers...), we also connect with vendor API's that have their own ideas of how to model an address.

My personal opinion on this is something like address_line_1, address_line_2, city, region, post_code, with each field being varchar(max). Anyone else have an opinion on how to do this?

davedx | 2 months ago

“Beginning database design” By Clare Churcher is a really accessible book on these issues, very practical (starts off with simple spreadsheet tables!) and takes you through step by step to understanding the problems with and the need for the different levels of normalization.

ghufran_syed | 2 months ago

Recently I ran into a project where everything is stored in around 3k unique fields, with many levels of nesting, in a JSONB column with NO schema or documentation. Generated JSDOC from the data and I'm hoping to create some kind of schema.

You can use JS on the server. Just don't do it this way ...

winrid | 2 months ago

I’m just here with popcorn and for the comments.

Lucky for me SQL just clicked and when I wrote my first schema third normal form is how it came out more or less so I knew I was on to something.

I would say a study of how the internals of your databases of choice works would help in schema design.

gigatexal | 2 months ago

That's one issue I see with NoSql, data is not normalized and consistency goes out of the window if database is big enough and the volume of data is big enough. That might not matter if you work on a search engine, social network, music suggestion app, but for most business apps and for most websites, having inconsistency is really bad.

On the other hand, having some data denormalized can improve performance as you can avoid expensive joins. It's for designer to see if they need to trade consistency for performance and in what measure. I'd avoid it, I'd prefer to add more servers, use load balancing, use caches.

DeathArrow | 2 months ago

In my programming life I've barely used SQL databases so far, and there is something I've never understood about these recommendations. Maybe someone here can explain that to me.

I thought databases are designed to organize the data optimally internally and execute all operations on them as efficiently as possible. Why do you then need to organize databases so stringently? Doesn't that mean that there is something wrong with how the db works?

To me the advice always seemed similar to bit twiddling instead of relying on the compiler to optimize. Shouldn't the database lay out the data optimally internally?

jonathanstrange | 2 months ago

My humble opinion is that poor choices in database design are the source of a great deal of technical debt.

I used to avoid making schema changes as they were "scary" (until I got a decent understanding of Django Migrations). Just a quick hack to make up for it in the application layer was far easier. Then another, and another until you have real mess on your hands.

And avoid nulls when you can. Unfortunately most of the time I have had a new database design there is existing data in excel sheets that is messy and you can't avoid it.

collyw | 2 months ago

I've got 2 things to say:

1) There is no "good design". It is always a compromise of trying to satisfy various practical constraints. OLAP vs OLTP probably rings a bell here. And this Last / First name thing is laughable. The complexity of handling all possible human names can be insane. Depending on the goal the proper design can be anything starting from single field and up to some relational structure.

2) The article itself looks like let's just write something up to score some points.

FpUser | 2 months ago

> 4) Never store firstname and lastname in the same field

I have the opposite opinion: unless you have a specific reason why you need to know one's "first name" and "last name" (e.g. to interface with some other system that requires names to be split up like that), just use a single field and stop caring about whether or not the name conforms to your preconceived notions of format and structure.

yellowapple | 2 months ago

(fwiw, I like the article, and I find it suitably humble)

Just want to hang a thought here ... depending on the precise definition of First Normal Form, it is either impossible to avoid[0] or impossible to enforce[1] without further constraints.

[0] if definition is "must be rectangular", i.e. all columns have the same number of rows and all rows have the same number of columns.

[1] if the definition includes "do not allow duplicate rows"

emmelaich | 2 months ago

The Data Model Resource Book [1] is a good resource how to design database. It is a kind of reference book

[1] https://www.wiley.com/en-us/The+Data+Model+Resource+Book%2C+...

reader_1000 | 2 months ago

This article is perfect if you're just getting started.

There's lore around design tricks to get the most out of a database, but it's often implementation or domain specific.

You'll learn what rules to break, and why, as you solve real life production problems.

FWIW - you can go a long, long, long way before de-normalising is actually necessary.

rukuu001 | 2 months ago

I wish there will be a DBMS which satisfy all aspects of CAP theorem and also provide high performance.

Until then, we are stuck with RDBMS and NoSql as lesser evils based on the use cases.

DeathArrow | 2 months ago

Make the primary key the thing you query by. Treat a sql database as similar to a nosql database (no joins) and your performance will be amazing.

paulmendoza | 2 months ago

RE: not storing address in a single field, splitting an address into semantic fields feels like a premature optimisation. You’re forced to anticipate all the fields that you may want to aggregate by, before you may actually even use them.

Instead, you could just create a custom function that takes in a string and tries to identify the city or street number or whatever element you want extract from the address e.g.

    select * from users where extract_city(address) = 'london';
SPBS | 2 months ago

Use the array data type for simple collections if using Postgres because it saves a join.

paulmendoza | 2 months ago

An advice I think is worth adding to the list is: use indexing.

DeathArrow | 2 months ago

This a DDD aggregates design guide, not database schema

banq | 2 months ago

Use third normal form. - Till your system gets 10 year old and linking your user's first name to their address requires joins on 10 tables having millions of rows.

darth_avocado | 2 months ago