HACKER Q&A
📣 vanilla-almond

What are some examples of good database schema designs?


I've seen many examples of database schemas but don't always recognise a good design over a mediocre or poor design.

This is what I mean by schema (is this a good design?)

https://www.drupal.org/node/1785994

When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

What are good schema examples? And what are some poor examples?


  👤 ganonm Accepted Answer ✓
I'd highly recommend reading SQL Antipatterns. It's a very approachable book that illustrates how to design a database schema by taking commonly encountered scenarios and first showing you the naive approach. After explaining why this is bad, it then shows you the recommended way of doing it and why this way works better.

I think 'learning from how not to do something' is a really powerful pedagogical technique that should be more widespread.


👤 singingwolfboy
PostGraphile [1] is a framework for generating a GraphQL API based on the tables in your database; as a result, good database design is crucial. Graphile Starter [2] is a quickstart project that demonstrates best practices for getting up and running with PostGraphile quickly. In particular, check out the SQL migration file in that project [3]. It demonstrates:

1. Dividing up tables so that one user can have more than one email address

2. Using PostgreSQL’s “row-level security” system to restrict database results based on the logged in user

3. Dividing tables across multiple schemas for additional security

4. SQL comments for documentation

5. SQL functions for common functionality, such as logging in or verifying an email

It’s a fascinating project, and well worth a look!

[1](https://www.graphile.org/postgraphile/)

[2](https://github.com/graphile/starter)

[3](https://github.com/graphile/starter/blob/master/%40app/db/mi...)
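The pattern in point 1 — splitting email addresses into their own table so one user can have several — can be sketched in a few lines. This is a minimal illustration, not the actual Graphile Starter migration; the table and column names here are made up:

```python
import sqlite3

# Minimal sketch of "one user, many email addresses". Table and column
# names are illustrative, not taken from the Graphile Starter migration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE user_emails (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(id),
        email      TEXT NOT NULL UNIQUE,
        is_primary INTEGER NOT NULL DEFAULT 0
    );
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.executemany(
    "INSERT INTO user_emails (user_id, email, is_primary) VALUES (?, ?, ?)",
    [(1, "alice@work.example", 1), (1, "alice@home.example", 0)],
)
emails = [row[0] for row in conn.execute(
    "SELECT email FROM user_emails WHERE user_id = 1 ORDER BY is_primary DESC")]
print(emails)  # both addresses hang off a single user row
```

The point is that "email" never becomes a repeated or comma-packed column on `users`; adding a third address is just another row.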


👤 adventured
The Stack Overflow schema is decent, especially for a relative amateur to study. It's nothing special or complex, but it's a good example of something that actually works well in practice (and at massive scale). The CRUD-CMS Q&A style lends itself nicely to a basic db schema that is easy to get your head around at a glance.

https://meta.stackexchange.com/questions/2677/database-schem...

https://i.stack.imgur.com/AyIkW.png

19 million questions, 29m answers, 12m users, 10m visits per day, and most of the db action is in less than two dozen tables.


👤 DanielBMarkham
A schema is a way of designing data structures such that they are efficiently organized and easy to use.

But that leaves the obvious question: use for what?

Structures, whether it's databases or object graphs, exist for only two reasons: to do stuff and to fit into a pattern of rules you've decided to use beforehand. Without either of those, there is no way to judge a schema. We could talk about generally organizing data. That's set theory and normal forms. But that wasn't your question.

I would design the schema for a three-microservice widely-distributed high-performance application far differently than I would a local checkbook app for my brother. You gotta have more information.

Apologies. I am unable to help.


👤 coltnz
I've been using Len Silverston's Universal Data Models for 15 years. You'll be writing to lots of tables and will want views for your common aggregates. But you'll have the common tables you'll need, the patterns for those you don't, and be able to handle new requirements with minimal change.

There is no Customer table.

"The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises"


👤 JacKTrocinskI
1000 tables isn't an extraordinary amount. As far as database schema design goes, it really depends on your needs, e.g. is your database intended for OLTP or OLAP use? Depending on your needs you can decide to what degree you want to normalise your tables. There are also lots of schema designs to choose from (star, snowflake, etc.) and it's worth reading about them. Having a good schema design is about knowing the data you have and how you plan on using it; there is no one "good" solution in my opinion.

Edit: Security is also something worth taking into account, design for the applications and users that are going to access your database. You might create a separate schema for an application and separate schema for users and then grant them access to a "main" schema for instance, but this really all depends on your needs in the end.


👤 kohtatsu
On this topic: I'm in the process of making a compiler for a DSL I designed to help with the schema design process.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...

Pictured is the compiler internals; the attached .syrup is the DSL. (It started as a tangent on a project I'm doing for my girlfriend, the schema described in the .syrup has been improved a bit since)

Note: things after # are comments.

I find even just defaulting to NOT NULL and not having to worry about commas is a boon for when I create schemas.

The DSL will of course support things like compound primary keys and SQLite's WITHOUT ROWID.

I'll post the code here, likely before the weekend: https://github.com/nomsolence/syrup


👤 mkettn
10 years ago, I'd have said "at least third normal form"... but today: whatever gets the job done. When the application is not really dependent on weird queries (e.g. just a blog), screw the normal forms and design your schema to use the fewest queries for a given task. Nobody understands three lines' worth of queries with left and right joins.

On the other hand, if your bookkeeping application uses a database try to keep things as tidy as possible.


👤 tundrax
DatabaseAnswers[1] has been a valuable resource of learning for me. [1] http://www.databaseanswers.org/data_models/index.htm

👤 jacques_chester
> When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

Businesses that trade in thousands of products, employ thousands of people in dozens of countries, with hundreds of different taxation, human relations, timekeeping, payroll, health insurance and retirement savings laws, dozens of sites, inventories in hundreds of locations managed by dozens of contractors, moved by one of potentially thousands of logistics firms with at least a dozen modes of shipment, reporting standards for multiple countries, reporting standards for stock exchanges, internal management reports, bank accounts in multiple countries, in multiple currencies, with a mix of transfers via a variety of means, hundreds of cash alternatives with varying rules about whether they are cash alternatives...

Modern large businesses are very, very complex.


👤 semireg
I’ve lived 20 years as an amateur SQL database designer, and last year I designed my first production NoSQL schema on Mongo. NoSQL is a different world... collections, embedded documents. I’m not sure it’s 100% “correct” but it’s working great for me so far.

My project is a license server for my electron app. The tech stack is a joy: mongo/mongoose, express, graphql, JWT, all with a React web front end. Payments by Stripe. The actual licenses are the signed JWT vended by the graphql API (not to be confused with the totally separate JWT for auth).

The main goal is to sell software so I license by machine fingerprint (node module on electron).

It’s been running for over 6 months without issue. I’m just beginning to start architecting an update where I allow a subscription model similar to Jetbrains Perpetual Fallback license, but slightly different in favor of the user. I’ve taken a lot of notes from comments at https://news.ycombinator.com/item?id=17007036

Here’s what I’m thinking so far:

A) Non-Expiring License at $97.99. Unlimited bug fixes for that one version.

B) Expiring License at $9.79/month, and you always have access to the latest version. After 12 consecutive payments you are granted a Non-Expiring License for the most recent version at the time of each payment.

Now, to model this...


👤 gigatexal
Read up on third normal form. It’s all you need for 99% of oltp databases (the backend to a store) and read up on snowflake (The design not the company) and Star schema for datawarehouses for analytics.

👤 derefr
> I wonder what kind of problem could possibly require so many tables.

There are three main factors behind this, none of which are bad schema design:

1. Table partitioning/sharding. RDBMS storage engines don't tend to scale for common hybrid access patterns (e.g. contended inserts+reads on tables with many indices and uniqueness constraints) past a billion rows or so. You can scale them yourself, though, by splitting a table's dataset into multiple tables (all with the same schema) and getting the RDBMS to direct reads/writes appropriately.

2. ETL ingest tables for separate datasources. If you have e.g. 100 different scrapers for various APIs used in your data warehouse, all writing into your DB, then even if they all are writing data with the same schema, it's operationally very challenging to have them write to the same physical table. Much simpler to have them write into their own staging tables, which can have their own exclusive-access locks, get their own deduping, get TRUNCATEd separately between batch loads, etc. They can copy into a shared table, but this step is often unnecessary and skipped.

3. Matviews built for extremely expensive queries, which exist to run specific reports against. These are effectively materializations of subqueries shared by common customer queries. As such, there might be hundreds or thousands of these.

In the end, none of these are really tables ("relations", in the relational-algebra sense); they're all operational implementation details. Partitions are literally "parts" of larger tables; ETL ingest staging tables are distinct "inboxes" to a shared table; and reporting matviews are "caches" of queries. All of these things could be separate first-class object types, hidden inside the logic of the DBMS, not presented as "tables" per se, despite taking up storage space like tables. But—in most-any RDBMS I know of—they're not. So you see thousands of tables.
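The manual partitioning in point 1 can be sketched in application code. This is a minimal illustration with SQLite standing in for the RDBMS; the shard count, table names, and the modulo routing rule are all assumptions for the sake of the example (real systems more often use declarative partitioning or a router in front of the DB):

```python
import sqlite3

# Sketch of application-side table partitioning (point 1): one logical
# "events" table split into N physical tables with identical schemas,
# with reads/writes routed by a hash of the key.
N_SHARDS = 4
conn = sqlite3.connect(":memory:")
for i in range(N_SHARDS):
    conn.execute(f"CREATE TABLE events_{i} (user_id INTEGER, payload TEXT)")

def shard_for(user_id):
    # the routing rule: same key always lands in the same physical table
    return f"events_{user_id % N_SHARDS}"

def insert_event(user_id, payload):
    conn.execute(f"INSERT INTO {shard_for(user_id)} VALUES (?, ?)",
                 (user_id, payload))

def events_for(user_id):
    rows = conn.execute(
        f"SELECT payload FROM {shard_for(user_id)} WHERE user_id = ?",
        (user_id,))
    return [r[0] for r in rows]

insert_event(7, "login")   # 7 % 4 == 3, lands in events_3
insert_event(8, "logout")  # 8 % 4 == 0, lands in events_0
print(events_for(7))
```

Each physical table stays small enough for its indices and locks to behave, while the application still sees one logical dataset.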


👤 davismwfl
Kinda tough to give a good answer without more context, IMO. What I mean is that a good e-commerce schema that serves a single small store and runs off a single database server would look quite different than a multi-tenant or distributed data store for an e-commerce site at scale.

The one you linked is a pretty typical relational model and isn't bad, but it has trade offs that I'd personally not make, however, that doesn't make it bad.

In the end context, scale and usage all determine a good schema design. Sometimes what would be a good relational design on paper would be tragically horrid in practice once you get beyond a small dataset.


👤 blowski
There are three volumes of "The Data Model Resource Book" by Len Silverston, and despite being about 20 years old they remain industry best practices. All sorts of data model design patterns - insurance, ecommerce, data warehouses, party, finance, etc. David Hay's "Data Model Patterns" is also excellent, although less concrete.

👤 femto113
Take a look at Magento (open source e-commerce). I'll leave it to you to decide whether it's reducible to notions like "good" or "bad" but it's definitely fascinating in a WTF sort of way.

https://i.stack.imgur.com/wnwrJ.jpg


👤 codegladiator
3NF or more is good. Else it's poor. The number of tables does not matter.

👤 gerdusvz
Schemas that reflect reality, not the current specs. Flexibility is key. In my experience, adding tables and migrating existing data to them is hard; adding columns is easy. So spend extra time at the start on what tables there should be.

Spec: Product has a supplier [tables: product, supplier]

Reality: Product can be bought from multiple suppliers [tables: product, supplier, product_supplier]
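The spec-vs-reality change above amounts to swapping a plain foreign key column for a join table. A minimal sketch (table names follow the comment; columns are illustrative):

```python
import sqlite3

# The "reality" version: a product_supplier join table instead of a
# single supplier_id column on product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product_supplier (
        product_id  INTEGER NOT NULL REFERENCES product(id),
        supplier_id INTEGER NOT NULL REFERENCES supplier(id),
        PRIMARY KEY (product_id, supplier_id)
    );
""")
conn.execute("INSERT INTO product VALUES (1, 'widget')")
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [(1, 'Acme'), (2, 'Globex')])
# One product bought from two suppliers -- impossible with a plain FK column:
conn.executemany("INSERT INTO product_supplier VALUES (1, ?)", [(1,), (2,)])
suppliers = [r[0] for r in conn.execute("""
    SELECT s.name FROM supplier s
    JOIN product_supplier ps ON ps.supplier_id = s.id
    WHERE ps.product_id = 1 ORDER BY s.name
""")]
print(suppliers)
```

If the spec version had shipped, supporting a second supplier would mean a migration; here it's just another row in `product_supplier`.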


👤 sakopov
Database Answers [1] has been around for years. It doesn't exactly have complicated schemas but there are enough examples to showcase various design patterns.

[1] http://www.databaseanswers.org/data_models/


👤 fsajkdnjk
I have been working with event sourcing for the past few years, and a design I have implemented in the repositories (db) lately is to have one table for events and one table for snapshots (i.e. the objects in their current state, in serialized form). Then, depending on the needs of the application (i.e. what queries will be run), I create tables that serve as pure indices by which I can look up the aggregates (objects) I need. This gives me incredible flexibility, and I do not need to bother with a complex schema at all.

I use event reactors within a transaction context (imagine a pre-save trigger per object) to fill these tables with data (or remove data). From now on, I think this is the way to go for me for anything. Having your SQL schema match your objects/entities is very restricting and not flexible for future development.

With this approach I have the full data available (snapshots), so I don't need to hydrate each aggregate from the event stream, and I also have the ability to filter the aggregates as I need, plus a highly optimized schema for any query I desire. When something changes in the future, I can simply play through the entire event stream and fill in new indices or whatever is needed. Machines are fast these days, so storing the entire object as a snapshot in serialized form costs nothing, and it beats having to load tens of fields/columns and parse them into objects manually.

👤 thrownaway954
it really all depends on the requirements.

i used to think wordpress had the worst schema in the world. after actually using it and writing plugins for it, i've come to the conclusion that it is genius. their schema design really makes it very easy for others to "extend" the structure without having to actually alter the schema.

to elaborate on what i mean by extend: wordpress's schema is basically a posts table that has the very minimal required columns for creating a post (like the title, date, status and a couple of others). the post_meta table references that posts table and is basically used like a huge key/value store. really all it is is a post_id column (which references back to the posts table), a key column and a value column. you can add whatever you want to a post by adding rows to post_meta.
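That key/value extension pattern can be sketched in a few lines. This is a simplified illustration, not WordPress's exact schema (column names are approximate):

```python
import sqlite3

# Sketch of the WordPress-style key/value extension pattern: a minimal
# posts table plus a post_meta table. Simplified, not WordPress's
# actual column set.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'draft'
    );
    CREATE TABLE post_meta (
        post_id    INTEGER NOT NULL REFERENCES posts(id),
        meta_key   TEXT NOT NULL,
        meta_value TEXT
    );
""")
conn.execute("INSERT INTO posts (id, title) VALUES (1, 'Hello')")
# A plugin "extends" the post without any ALTER TABLE:
conn.execute("INSERT INTO post_meta VALUES (1, 'seo_description', 'A greeting')")
conn.execute("INSERT INTO post_meta VALUES (1, 'thumbnail_id', '42')")
meta = dict(conn.execute(
    "SELECT meta_key, meta_value FROM post_meta WHERE post_id = 1"))
print(meta)
```

The trade-off is exactly the one named below: everything in `meta_value` is an untyped string with no referential integrity, which is fine for a CMS and unacceptable for a bank.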

this design is copied for all other areas as well, such as users, comments and what have you.

https://codex.wordpress.org/Database_Description

now obviously this kind of design isn't going to work for something like a financial institution, where you most likely want a ton of referential integrity built into the schema to prevent accidental data loss and to validate data input, but for a CMS it works very well.

again... it all depends on what the system requirements are.


👤 davidw
I've long recommended this book:

"The Data Model Resource Book": https://amzn.to/2tXNiuF

You can look at an implementation of a number of the ideas here: https://ofbiz.apache.org/

It's kind of complicated - probably overly so for some things - but there are a lot of ideas to think about and maybe utilize at least in part.


👤 lifepillar
Take a look at http://www.softwaregems.com/Documents/Student%20Resolutions/.

I will refrain from commenting on the quality of Drupal’s schema, but that diagram just shows a bunch of tables and how those tables are connected by foreign keys. What do those connections mean? What other constraints are there on the data? Are they represented in the diagram?

A good database design conveys a lot more semantics. There is currently one ISO standard for database design (ISO 31320-2, key-based style modeling). Adopting that standard does not automagically guarantee good designs, but, if used correctly, it helps a lot (it doesn’t help if you don’t have a good grasp of the Relational Model, so I would recommend getting familiar with that first and foremost).

Most database schemas you’ll find around are riddled with useless ID attributes that are essentially record identifiers (they are not even “surrogates”, as many people call them: for the definition of a “surrogate”, read Codd’s 1979 paper) and, as a consequence, they are navigational in nature (they resemble CODASYL structures more than Relational models): to connect records in table A with records in table D, you must join with (navigate through) B and C, while in a well-designed Relational schema you might have joined A and D directly. Do you want a rule of thumb to recognize poorly designed databases? Check for ID attributes in every table (there are many other ways in which database design can suck, though).

How do you recognize good database diagrams? They can be easily translated into natural language (predicates and facts), and two (technically competent) different people would translate them in the same way, i.e., the model is unambiguous. Can you say so for Drupal’s schema?


👤 hyperpallium
disclaimer: theoretical opinion.

I think the primary problem of giving examples here is similar to teaching software engineering, which needs complex projects solving complex problems - too big for a semester project.

A good schema depends on the problem it's solving.

A secondary problem is similar to code that has sacrificed clarity for performance. The tweaks made for performance are not intrinsic to the logical problem, but are an additional constraint.

For performance on common queries, a schema can be flattened or "denormalized". The ability to do so was one of the original motivations for Codd's relational algebra.


👤 OJFord
> When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

They're probably mostly small (in terms of columns), many of them done just to have foreign key constraints, separate indexes, and arguably easier querying (joining) on just the data you need.

But I think it's a particular style, rather than a 'problem' that 'requires' so many.

(IANA database expert though, just my tuppence.)


👤 larrywright
When you think about large enterprise software systems with screen after screen full of data entry forms, with drop down lists everywhere, every single one of those drop down selectors is likely a database table. Every schema like that will have a mix of large tables (customers, products, etc) and lots of small tables.

I don’t think there’s a number above which you should not go, it’s really about how the data is organized and how easy it is to get data in and out of. I agree with another comment here about looking at anti-patterns first. There are some common ones that would be red flags.

As an example of what not to do: I once worked with a DBA who insisted that tables shouldn’t have more than 15 columns. So any table with more than 15 columns would have to be split into multiple tables, even if those tables would always have to be fetched together.


👤 dougk16
Not to be too self-promoty, but I spent a ton of time thinking about the schema for my site, and have gotten some compliments on the detail and documentation: https://aytwit.com/sql

I hope that helps spark some ideas for you.


👤 Namoshek
I came to the same conclusion when I read that a basic SAP installation comes with over 20k tables. (see: https://retool.com/blog/erp-for-engineers/)

👤 iblaine
> What are good schema examples?

Anything that doesn't break first normal form ("1NF") [1].

> And what are some poor examples?

Anything that breaks 1NF.

You break 1NF when...

* a column cannot be described by the key, ex. user.id describes user.name but not user.items_purchased.

* values in a column are repeated, ex. a user table that stores multiple addresses for the same user should be split into an addresses table.

Treat tables as classes and you'll be fine. Just like a method may not belong in a class, a column may not belong in a table.

[1](https://en.wikipedia.org/wiki/First_normal_form)


👤 ArtWomb
I always think there is some magical perfect schema that I am missing. And if I just designed things perfectly in the beginning, I wouldn't need all this complex query logic. And reverse lookup tables, and fan outs, and other sundry hacks

But action always trumps thought. And it's better to just slurp up as much data as possible.

The simplest design for keys in a dict type data store such as Redis hashes, is to just auto_increment user ids. Resolve id=user. And then all data is just flat {var:id=value}. Key type is just a string delimited by ":". Gets you in the game fast. Mine structure and relationships later ;)
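The flat key scheme described above can be sketched without a database at all. Here a plain dict stands in for Redis (so no server is needed); the counter key and attribute names are made up for the example:

```python
# Sketch of the flat {var:id = value} key scheme described above, with a
# plain dict standing in for Redis. Keys are colon-delimited strings.
store = {}

def next_user_id():
    # auto_increment user ids, kept under a counter key
    store["user:next_id"] = store.get("user:next_id", 0) + 1
    return store["user:next_id"]

def set_attr(user_id, var, value):
    store[f"{var}:{user_id}"] = value   # e.g. "name:1" -> "alice"

def get_attr(user_id, var):
    return store.get(f"{var}:{user_id}")

uid = next_user_id()
set_attr(uid, "name", "alice")
set_attr(uid, "email", "alice@example.com")
print(get_attr(uid, "name"))
```

With real Redis the same shape maps onto `INCR` for the counter and `SET`/`GET` (or hashes) for the attributes. As the comment says: it gets you in the game fast, and you mine structure later.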


👤 gfody
A database schema is the result of practicing the art of information modeling - a timeless art that will unlock the deepest mysteries of the universe if you master it. When database threads hit HN it's basically disgusting how poorly understood this is and the bad advice that flows here. Everyone here is wrong wrong wrong wrong wrong wrong wrong wrong wrong wrong wrong............. apologies I'm not myself today, posting anyways, good luck!

👤 arethuza
"what kind of problem could possibly require so many tables"

CRMs often have hundreds of tables, and ERPs have thousands or tens of thousands or more.


👤 joshdick
WikiMedia's database schema is used by Wikipedia, one of the most popular websites there is, so it's arguably a very successful database schema design:

https://www.mediawiki.org/w/index.php?title=Manual:Database_...


👤 petercooper
Vertabelo has a blog which is almost entirely devoted to modelling schemas for various contrived scenarios: https://www.vertabelo.com/blog/ .. I take umbrage with some of the decisions made but broadly they are interesting and well thought out.

👤 the-dude
Northwind Traders

👤 felixyz
The Sakila schema: https://dev.mysql.com/doc/sakila/en/

Which I learned about from the jOOQ folks: https://www.jooq.org/sakila


👤 tommaho
The Kimball book series suggests some generalized schemas by industry, for the purpose of data warehousing. Here the academic schema rules tend to bend a bit - denormalization can become a useful and often necessary technique, for example.

Might be helpful for a rounded understanding of good schema design - it can depend on the context.


👤 mshanu
Too many tables in one schema tells me that there is no clear separation of responsibilities. I would first look at clear domain boundaries and separate them. JSON-type capability is one thing I have found useful when there is a need to store non-relational data simply, without creating associative tables.

👤 auvi
This is a bit off topic maybe, but I have heard SAP works with tens of thousands of tables. Is that correct?

👤 khanguy
I've always liked how isfdb, the Internet Speculative Fiction Database, laid theirs out. http://www.isfdb.org/wiki/index.php/Database_Schema

👤 Dwolb
Dumb question, but does anyone have a recommendation for good software for generating the schema diagrams in the Drupal link but for Redshift?

👤 jaequery
I always wondered why people don't share SQL schemas like we do with code.

Anyone know if there's such a site like it or even a marketplace for it?


👤 frye0031
Take a look at StackOverflow’s schema. Freely available to study, used in a production site, and a great resource.

👤 madeofpalk
> good database schema designs?

Ones that ship.


👤 iovrthoughtthis
Where are the repositories of this info?

Just for record keeping, analysis, learning etc.


👤 mamcx
Asking this is like asking:

Is there an example of a good collection of variables/classes in Java?

For an RDBMS, the schema IS the "classes, variables and types" of other languages. It only has the massive advantage of being much more visible and easier to understand.


👤 pezo1919
My naive approach these days is avoiding relational schemas, because I'm tired of migrations. Document-based NoSQL is hard to maintain too (dupes). The only solution for me is graph databases like Neo4j or GunDB (in-memory).

👤 38911BBF
1-2-3NF is super important and superficially simple. But just to give you an example with 1NF where the devil is hidden in the details, let's have a closer look at ye good old "Address" entity. If we are to store a postal address for package delivery, the columns "Address Line 1-2-3" should suffice and enable quick entry in the UI and simple printing on letters and package labels. What's not to love?

But what if you are working in a municipality and need to be able to select all commercial buildings with more than five stories, or businesses situated in the basement of a building, for the yearly fire inspections? Then string-searching all those randomly entered address lines will quickly become a complete nightmare, whereas if the floor number were normalized and stored in its own column, the query for the fire inspector's report would be a piece of cake.
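The contrast can be made concrete in a few lines: keep the free-text line for printing labels, but give the queryable attributes their own columns. Column names here are illustrative:

```python
import sqlite3

# Sketch of the fire-inspection query: free-text address lines for
# printing, plus normalized use_type/stories columns for querying.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE building (
        id       INTEGER PRIMARY KEY,
        address  TEXT NOT NULL,      -- free-text line, fine for labels
        use_type TEXT NOT NULL,      -- 'commercial', 'residential', ...
        stories  INTEGER NOT NULL    -- normalized: trivial to query
    )
""")
conn.executemany("INSERT INTO building VALUES (?, ?, ?, ?)", [
    (1, "12 Main St, 7th floor", "commercial",  7),
    (2, "3 Oak Ave",             "residential", 9),
    (3, "9 Elm Rd",              "commercial",  3),
])
# "All commercial buildings with more than five stories" is one line:
tall_commercial = [r[0] for r in conn.execute(
    "SELECT address FROM building "
    "WHERE use_type = 'commercial' AND stories > 5")]
print(tall_commercial)
```

Against three "Address Line" columns, the same question would mean fuzzy string matching on "7th floor", "floor 7", "7. Etage", and every typo in between.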

This is a good example of why it's so hard to do cookie-cutter, implementation-ready schema design templates. It's also a good example of why data modeling is important no matter what underlying tech stack the data model is going to be implemented on.

Also, I prefer modeling the app, business or process in Chen's ERD first as I think it is much better at capturing modeling details than UML and other ERD-variants.

Also, just as each object class in OOP should do only one thing, each entity should be saved in just one table. E.g. that "Employee" table in the first chapter of every beginner database book, with a "Manager" relation as a foreign key to itself, is an absolute catastrophe. The moment your CEO decides you are now in a matrix organisation, everything breaks down data-model-wise. The Employees go into one table, the Organisational Structure into another; they are related by foreign keys, and it's not that different from good OO modeling, as people say. The tables containing organisational structure should probably also have columns with a from- and to-date and a relationship to a Department table, so different departments can be organised differently throughout their lifetime.

Also, entities which have some sort of lifecycle should also be split into different tables. So there should be a table for "Prospective Employees", "Current Employees", "Resigned Employees", "Former Employees", etc. An employee's day of resignation can now be not null and go into the right table. You can always UNION these three or four tables together into one big view or materialized table, and at the same time you will avoid a massive amount of WHERE statements that each need their own indexes, picking out just the right employees from that big generic Employee table in every effing query.
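The lifecycle split plus the UNION view can be sketched as follows. This is a minimal illustration of the idea; the table, column, and view names are made up:

```python
import sqlite3

# Sketch of lifecycle-split tables: employees stored per lifecycle stage,
# recombined with a UNION ALL view when the full set is wanted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE current_employee (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE resigned_employee (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        resigned_on TEXT NOT NULL    -- never NULL here, by construction
    );
    CREATE VIEW all_employees AS
        SELECT id, name, NULL AS resigned_on, 'current' AS stage
        FROM current_employee
        UNION ALL
        SELECT id, name, resigned_on, 'resigned' AS stage
        FROM resigned_employee;
""")
conn.execute("INSERT INTO current_employee VALUES (1, 'alice')")
conn.execute("INSERT INTO resigned_employee VALUES (2, 'bob', '2019-06-30')")
stages = dict(conn.execute("SELECT name, stage FROM all_employees"))
print(stages)
```

Queries that only care about current employees hit a small table with no `WHERE resigned_on IS NULL` filter, while the view still serves the occasional "everyone ever" report.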

Also, columns with endless NULL values are a "code smell" in a relational database. Whatever value is hiding in the few rows with values probably belong to another entity and should have been stored in another table with the name of that "thing". Eg. the employee's day of resignation again.

Also, 99% of all business datamodels can be implemented in a relational database using just tables, views and queries created in standard SQL. You will rarely if ever need user defined functions, generators, connectors, stored procedures, foreign code blobs and other exotic and vendor specific extensions.

Also, I recommend reading everything by Joe Celko.