This is what I mean by schema (is this a good design?)
https://www.drupal.org/node/1785994
When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.
What are good schema examples? And what are some poor examples?
I think 'learning from how not to do something' is a really powerful pedagogical technique that should be more widespread.
1. Dividing up tables so that one user can have more than one email address
2. Using PostgreSQL’s “row-level security” system to restrict database results based on the logged-in user
3. Dividing tables across multiple schemas for additional security
4. SQL comments for documentation
5. SQL functions for common functionality, such as logging in or verifying an email
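For a flavor of items 1 and 2, here's roughly what that looks like; a minimal sketch, assuming a PostGraphile-style setup where the server sets a `jwt.claims.user_id` setting per request (the schema, table, and policy names here are my own, not the starter's):

```sql
-- Hypothetical schema/table; RLS restricts rows to the logged-in user.
CREATE SCHEMA IF NOT EXISTS app_public;

CREATE TABLE app_public.user_emails (
  id          serial PRIMARY KEY,
  user_id     int NOT NULL,                      -- one user, many addresses (item 1)
  email       text NOT NULL UNIQUE,
  is_verified boolean NOT NULL DEFAULT false
);

ALTER TABLE app_public.user_emails ENABLE ROW LEVEL SECURITY;

-- Assumes the app sets jwt.claims.user_id per request, as PostGraphile can be
-- configured to do; rows belonging to other users simply never appear.
CREATE POLICY select_own_emails ON app_public.user_emails
  FOR SELECT
  USING (user_id = current_setting('jwt.claims.user_id', true)::int);
```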
It’s a fascinating project, and well worth a look!
[1](https://www.graphile.org/postgraphile/)
[2](https://github.com/graphile/starter)
[3](https://github.com/graphile/starter/blob/master/%40app/db/mi...)
https://meta.stackexchange.com/questions/2677/database-schem...
https://i.stack.imgur.com/AyIkW.png
19 million questions, 29 million answers, 12 million users, 10 million visits per day, and most of the db action is in less than two dozen tables.
But that leaves the obvious question: use for what?
Structures, whether it's databases or object graphs, exist for only two reasons: to do stuff and to fit into a pattern of rules you've decided to use beforehand. Without either of those, there is no way to judge a schema. We could talk about generally organizing data. That's set theory and normal forms. But that wasn't your question.
I would design the schema for a three-microservice widely-distributed high-performance application far differently than I would a local checkbook app for my brother. You gotta have more information.
Apologies. I am unable to help.
There is no Customer table.
"The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises"
Edit: Security is also worth taking into account: design for the applications and users that are going to access your database. You might create a separate schema for an application and a separate schema for users, then grant both access to a "main" schema, for instance; but this all depends on your needs in the end.
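A rough sketch of that idea in PostgreSQL (role and schema names are invented for illustration):

```sql
-- One shared schema, with different privilege levels granted per role.
CREATE SCHEMA main;

CREATE ROLE app_rw NOLOGIN;      -- the application
CREATE ROLE analyst_ro NOLOGIN;  -- human/reporting users

GRANT USAGE ON SCHEMA main TO app_rw, analyst_ro;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA main TO app_rw;
GRANT SELECT ON ALL TABLES IN SCHEMA main TO analyst_ro;
```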
https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...
Pictured are the compiler internals; the attached .syrup is the DSL. (It started as a tangent on a project I'm doing for my girlfriend; the schema described in the .syrup has been improved a bit since.)
Note: things after # are comments.
I find that even just defaulting to NOT NULL and not having to worry about commas is a boon when I create schemas.
The DSL will of course support things like compound primary keys and SQLite's WITHOUT ROWID.
I'll post the code here, likely before the weekend: https://github.com/nomsolence/syrup
On the other hand, if your bookkeeping application uses a database, try to keep things as tidy as possible.
Businesses that trade in thousands of products, employ thousands of people in dozens of countries, with hundreds of different taxation, human relations, timekeeping, payroll, health insurance and retirement savings laws, dozens of sites, inventories in hundreds of locations managed by dozens of contractors, moved by one of potentially thousands of logistics firms with at least a dozen modes of shipment, reporting standards for multiple countries, reporting standards for stock exchanges, internal management reports, bank accounts in multiple countries, in multiple currencies, with a mix of transfers via a variety of means, hundreds of cash alternatives with varying rules about whether they are cash alternatives...
Modern large businesses are very, very complex.
My project is a license server for my electron app. The tech stack is a joy: mongo/mongoose, express, graphql, JWT, all with a React web front end. Payments by Stripe. The actual licenses are the signed JWT vended by the graphql API (not to be confused with the totally separate JWT for auth).
The main goal is to sell software so I license by machine fingerprint (node module on electron).
It’s been running for over 6 months without issue. I’m just starting to architect an update that will allow a subscription model similar to JetBrains' Perpetual Fallback license, but slightly different in favor of the user. I’ve taken a lot of notes from comments at https://news.ycombinator.com/item?id=17007036
Here’s what I’m thinking so far:
A) Non-Expiring License at $97.99, with unlimited bug fixes for that one version; or
B) Expiring License at $9.79/month, where you always have access to the latest version. After 12 consecutive payments you are granted a Non-Expiring License for the most recent version at the time of each payment.
Now, to model this...
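One way to sketch the entities relationally (illustrative only; the production stack is Mongo/Mongoose, and every name here is mine):

```sql
CREATE TABLE customers (
  id    serial PRIMARY KEY,
  email text NOT NULL UNIQUE
);

-- Option B: each successful monthly payment records the app version
-- that was current at the time of that payment.
CREATE TABLE subscription_payments (
  id          serial PRIMARY KEY,
  customer_id int NOT NULL REFERENCES customers(id),
  paid_at     date NOT NULL,
  app_version text NOT NULL   -- latest version at the time of this payment
);

-- Option A, and the fallback earned after 12 consecutive option-B payments:
-- a non-expiring right to run up to some version on a given machine.
CREATE TABLE licenses (
  id                  serial PRIMARY KEY,
  customer_id         int NOT NULL REFERENCES customers(id),
  machine_fingerprint text NOT NULL,
  max_version         text NOT NULL,  -- non-expiring up to this version
  UNIQUE (customer_id, machine_fingerprint)
);
```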
There are three main factors behind this, none of which are bad schema design:
1. Table partitioning/sharding. RDBMS storage engines don't tend to scale for common hybrid access patterns (e.g. contended inserts+reads on tables with many indices and uniqueness constraints) past a billion rows or so. You can scale them yourself, though, by splitting a table's dataset into multiple tables (all with the same schema) and getting the RDBMS to direct reads/writes appropriately.
2. ETL ingest tables for separate datasources. If you have e.g. 100 different scrapers for various APIs used in your data warehouse, all writing into your DB, then even if they all are writing data with the same schema, it's operationally very challenging to have them write to the same physical table. Much simpler to have them write into their own staging tables, which can have their own exclusive-access locks, get their own deduping, get TRUNCATEd separately between batch loads, etc. They can copy into a shared table, but this step is often unnecessary and skipped.
3. Matviews built for extremely expensive queries, kept around for running specific reports against. These are effectively materializations of subqueries shared by common customer queries. As such, there might be hundreds or thousands of these.
In the end, none of these are really tables ("relations", in the relational-algebra sense); they're all operational implementation details. Partitions are literally "parts" of larger tables; ETL ingest staging tables are distinct "inboxes" to a shared table; and reporting matviews are "caches" of queries. All of these things could be separate first-class object types, hidden inside the logic of the DBMS, not presented as "tables" per se, despite taking up storage space like tables. But—in most-any RDBMS I know of—they're not. So you see thousands of tables.
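To make point 1 concrete, this is what declarative partitioning looks like in PostgreSQL (a sketch with invented names); each partition shows up in the catalog as its own table, which is exactly how table counts balloon:

```sql
-- Parent table: logically one relation, physically many.
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

-- One physical table per month; the planner routes reads/writes automatically.
CREATE TABLE events_2024_01 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
```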
The one you linked is a pretty typical relational model. It has trade-offs that I'd personally not make, but that doesn't make it bad.
In the end, context, scale and usage all determine a good schema design. Sometimes what would be a good relational design on paper turns out tragically horrid in practice once you get beyond a small dataset.
Spec: Product has a supplier [tables:product, supplier]
Reality: Product can be bought from multiple suppliers [table:product, supplier, product_supplier]
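In SQL, the junction table that the spec's wording hid looks like this (a sketch; the price column is my own addition to show why the junction table earns its keep):

```sql
CREATE TABLE product (
  id   serial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE supplier (
  id   serial PRIMARY KEY,
  name text NOT NULL
);

-- The many-to-many the spec's "has a supplier" wording concealed.
CREATE TABLE product_supplier (
  product_id  int NOT NULL REFERENCES product(id),
  supplier_id int NOT NULL REFERENCES supplier(id),
  unit_price  numeric(10,2),   -- price can differ per supplier
  PRIMARY KEY (product_id, supplier_id)
);
```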
I used to think WordPress had the worst schema in the world. After actually using it and writing plugins for it, I've come to the conclusion that it is genius: the schema design makes it very easy for others to "extend" the structure without having to actually alter the schema.
To elaborate on what I mean by "extend": WordPress's schema is basically a posts table with the minimal columns required to create a post (title, date, status and a couple of others), plus a post_meta table that references the posts table and is used like a huge key/value store. Really, all it is is a post_id column (referencing back to the posts table), a key column and a value column. You can add whatever you want to a post by adding rows to post_meta.
This design is copied for all the other areas as well, such as users, comments and what have you.
https://codex.wordpress.org/Database_Description
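Stripped way down, the pattern looks roughly like this (simplified; the real wp_posts/wp_postmeta have more columns, see the codex link above):

```sql
CREATE TABLE wp_posts (
  ID          bigint AUTO_INCREMENT PRIMARY KEY,
  post_title  text NOT NULL,
  post_date   datetime NOT NULL,
  post_status varchar(20) NOT NULL
);

-- A key/value store hanging off posts: plugins "extend" the schema
-- by inserting rows here instead of altering tables.
CREATE TABLE wp_postmeta (
  meta_id    bigint AUTO_INCREMENT PRIMARY KEY,
  post_id    bigint NOT NULL,    -- points at wp_posts.ID (core WP declares no FK)
  meta_key   varchar(255),
  meta_value longtext
);
```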
Now, obviously this kind of design isn't going to work for something like a financial institution, where you most likely want a ton of referential integrity built into the schema to prevent accidental data loss and to validate data input, but for a CMS it works very well.
Again, it all depends on what the system requirements are.
"The Data Model Resource Book": https://amzn.to/2tXNiuF
You can look at an implementation of a number of the ideas here: https://ofbiz.apache.org/
It's kind of complicated - probably overly so for some things - but there are a lot of ideas to think about and maybe utilize at least in part.
I will refrain from commenting on the quality of Drupal's schema, but that diagram just shows a bunch of tables and how those tables are connected by foreign keys. What do those connections mean? What other constraints are there on the data? Are they represented in the diagram?
A good database design conveys a lot more semantics. There is currently one ISO standard for database design (ISO 31320-2, key-based style modeling). Adopting that standard does not automagically guarantee good designs, but, used correctly, it helps a lot (it doesn't help if you don't have a good grasp of the Relational model, so I would recommend getting familiar with that first and foremost).
Most database schemas you'll find around are riddled with useless ID attributes that are essentially record identifiers (they are not even "surrogates", as many people call them: for the definition of a "surrogate", read Codd's 1979 paper) and, as a consequence, they are navigational in nature (they resemble CODASYL structures more than Relational models): to connect records in table A with records in table D, you must join with (navigate through) B and C, while in a well-designed Relational schema you might have joined A and D directly. Want a rule of thumb for recognizing poorly designed databases? Check for ID attributes in every table (there are many other ways in which database design can suck, though).
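A contrived illustration of the difference (my own example, nothing to do with Drupal): when natural keys propagate down, a "grandchild" table can be joined directly to anything keyed on a component of its key, with no navigation through intermediate tables:

```sql
CREATE TABLE enrollment (
  student_id int,
  course_id  int,
  term       text,
  PRIMARY KEY (student_id, course_id, term)
);

-- grade carries the full key of enrollment, so it joins directly to any
-- table keyed on student_id, course_id or term: no walk through B and C.
CREATE TABLE grade (
  student_id int,
  course_id  int,
  term       text,
  grade      text,
  PRIMARY KEY (student_id, course_id, term),
  FOREIGN KEY (student_id, course_id, term)
    REFERENCES enrollment (student_id, course_id, term)
);
```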
How do you recognize good database diagrams? They can be easily translated into natural language (predicates and facts), and two (technically competent) different people would translate them in the same way, i.e., the model is unambiguous. Can you say so for Drupal’s schema?
I think the primary problem of giving examples here is similar to teaching software engineering, which needs complex projects solving complex problems - too big for a semester project.
A good schema depends on the problem it's solving.
A secondary problem is similar to code that has sacrificed clarity for performance. The tweaks made for performance are not intrinsic to the logical problem, but are an additional constraint.
For performance on common queries, a schema can be flattened or "denormalized". The ability to do so was one of the original motivations for Codd's relational algebra.
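To illustrate (a sketch, assuming hypothetical orders, line_item and product tables):

```sql
-- Normalized: the report joins through product to get a price.
SELECT o.id, SUM(p.price * li.qty) AS total
FROM orders o
JOIN line_item li ON li.order_id = o.id
JOIN product   p  ON p.id = li.product_id
GROUP BY o.id;

-- Denormalized for the hot path: price copied onto line_item at order time.
-- One less join, and historical prices stay correct as a side effect.
SELECT o.id, SUM(li.price_at_order * li.qty) AS total
FROM orders o
JOIN line_item li ON li.order_id = o.id
GROUP BY o.id;
```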
They're probably mostly small (in terms of columns), many of them done just to have foreign key constraints, separate indexes, and arguably easier querying (joining) on just the data you need.
But I think it's a particular style, rather than a 'problem' that 'requires' so many.
(IANA database expert though, just my tuppence.)
I don’t think there’s a number above which you should not go, it’s really about how the data is organized and how easy it is to get data in and out of. I agree with another comment here about looking at anti-patterns first. There are some common ones that would be red flags.
As an example of what not to do: I once worked with a DBA who insisted that tables shouldn’t have more than 15 columns. So any table with more than 15 columns would have to be split into multiple tables, even if those tables would always have to be fetched together.
I hope that helps spark some ideas for you.
Anything that doesn't break the first normal form ("1NF")[1].
> And what are some poor examples?
Anything that breaks 1NF.
You break 1NF when...
* a column cannot be described by the key, ex. user.id describes user.name but not user.items_purchased.
* values in a column are repeated, ex. a user table that stores multiple addresses for the same user should be split into an addresses table.
Treat tables as classes and you'll be fine. Just like a method may not belong in a class, a column may not belong in a table.
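A sketch of the second bullet's fix (table and column names invented):

```sql
-- Violates 1NF: a repeating group crammed into one column.
CREATE TABLE users_bad (
  id        serial PRIMARY KEY,
  name      text NOT NULL,
  addresses text              -- 'home: ...; work: ...'
);

-- 1NF: one address per row, keyed back to the user.
CREATE TABLE users (
  id   serial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE addresses (
  id      serial PRIMARY KEY,
  user_id int NOT NULL REFERENCES users(id),
  kind    text NOT NULL,      -- 'home', 'work', ...
  address text NOT NULL
);
```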
But action always trumps thought, and it's better to just slurp up as much data as possible.
The simplest design for keys in a dict-type data store such as Redis hashes is to just auto-increment user ids. Resolve id=user, and then all data is just flat {var:id=value}. The key type is just a string delimited by ":". Gets you in the game fast. Mine structure and relationships later ;)
CRMs often have hundreds of tables and ERPs have thousands or tens of thousands or more.
https://www.mediawiki.org/w/index.php?title=Manual:Database_...
Which I learned about from the jOOQ folks: https://www.jooq.org/sakila
Might be helpful for a rounded understanding of good schema design - it can depend on the context.
Anyone know if there's such a site, or even a marketplace for it?
Ones that ship.
Just for record-keeping, analysis, learning, etc.
Is there an example of a good collection of variables/classes in Java?
For an RDBMS, the schema IS the "classes, variables and types" of other languages. It just has the massive advantage of being much more visible and easier to understand.
But what if you are working in a municipality and need to be able to select all commercial buildings with more than five stories, or businesses situated in the basement of a building, for the yearly fire inspections? String-searching all those randomly entered address lines will quickly become a complete nightmare, whereas if the floor number were normalized and stored in its own column, the query for the fire inspector's report would be a piece of cake.
This is a good example of why it's so hard to do cookie-cutter, implementation-ready schema design templates. It's also a good example of why data modeling is important no matter which underlying tech stack the data model is going to be implemented on.
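A sketch of the normalized version (invented names; floor stored as a number rather than buried in an address string):

```sql
CREATE TABLE building (
  id       serial PRIMARY KEY,
  street   text NOT NULL,
  number   text NOT NULL,
  stories  int,
  use_type text              -- 'commercial', 'residential', ...
);

CREATE TABLE business (
  id          serial PRIMARY KEY,
  name        text NOT NULL,
  building_id int NOT NULL REFERENCES building(id),
  floor       int NOT NULL   -- 0 = ground floor, -1 = basement
);

-- The fire inspector's yearly report becomes a trivial query.
SELECT bu.name, b.street, b.number
FROM business bu
JOIN building b ON b.id = bu.building_id
WHERE (b.use_type = 'commercial' AND b.stories > 5)
   OR bu.floor < 0;
```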
Also, I prefer modeling the app, business or process in Chen's ERD first, as I think it is much better at capturing modeling details than UML and other ERD variants.
Also, just as each object class in OOP should do only one thing, each entity should be saved in just one table. E.g. that "Employee" table in the first chapter of every beginner database book, with a "Manager" relation as a foreign key to itself, is an absolute catastrophe. The moment your CEO decides you are now a matrix organisation, everything breaks down datamodel-wise. The Employees go into one table, the Organisational Structure into another; they are related by foreign keys, and it's not that different from good OO modeling, as people say. The tables containing organisational structure should probably also have columns with a from- and to-date, and a relationship to a Department table, so different departments can be organised differently throughout their lifetime.
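A sketch of that shape (all names invented); note that a matrix organisation is just additional rows, not a schema change:

```sql
CREATE TABLE employee (
  id   serial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE department (
  id   serial PRIMARY KEY,
  name text NOT NULL
);

-- Reporting lines live in their own table with validity dates, so several
-- concurrent managers (matrix org) are simply several current rows.
CREATE TABLE org_assignment (
  employee_id   int NOT NULL REFERENCES employee(id),
  manager_id    int NOT NULL REFERENCES employee(id),
  department_id int REFERENCES department(id),
  valid_from    date NOT NULL,
  valid_to      date,          -- NULL = current
  PRIMARY KEY (employee_id, manager_id, valid_from)
);
```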
Also, entities which have some sort of lifecycle should be split into different tables. So there should be a table for "Prospective Employees", "Current Employees", "Resigned Employees", "Former Employees", etc. An employee's day of resignation can now be NOT NULL and live in the right table. You can always UNION these three or four tables together into one big view or materialized table, and at the same time you will avoid a massive number of WHERE clauses, each needing its own index, picking out just the right employees from that big generic Employee table in every effing query.
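A minimal sketch of two lifecycle tables and the view that stitches them back together (names invented):

```sql
CREATE TABLE current_employee (
  id       int PRIMARY KEY,
  name     text NOT NULL,
  hired_on date NOT NULL
);

CREATE TABLE resigned_employee (
  id          int PRIMARY KEY,
  name        text NOT NULL,
  hired_on    date NOT NULL,
  resigned_on date NOT NULL   -- NOT NULL here, instead of mostly-NULL in one big table
);

-- One unified view when you need "all employees", no WHERE gymnastics.
CREATE VIEW all_employees AS
  SELECT id, name, hired_on, NULL::date AS resigned_on, 'current' AS status
  FROM current_employee
  UNION ALL
  SELECT id, name, hired_on, resigned_on, 'resigned' AS status
  FROM resigned_employee;
```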
Also, columns with endless NULL values are a "code smell" in a relational database. Whatever value is hiding in the few rows that have one probably belongs to another entity and should have been stored in another table, named after that "thing". E.g. the employee's day of resignation again.
Also, 99% of all business datamodels can be implemented in a relational database using just tables, views and queries created in standard SQL. You will rarely if ever need user defined functions, generators, connectors, stored procedures, foreign code blobs and other exotic and vendor specific extensions.
Also, I recommend reading everything by Joe Celko.