https://twitter.com/sarahmei/status/1348477224467394560
My understanding was always that for relational data (e.g., social networks) you should use a relational database. Is the person in this tweet correct? If so what is a better option?
Once you reach a large scale, relational databases start being a problem for availability and replication of data across different availability zones. Operations become complicated (you have replication chains, master/slave setups, etc.)
If your data is relatively simple and doesn't require a lot of relations and foreign keys, then something like Cassandra can save a lot of headaches.
Btw, a common trick to make a relational database perform at scale by limiting joins is to "flatten data", i.e. replicate data across different tables to avoid joining them.
Finally, don't let yourself be fooled by anyone who claims they know "the better option." There is no better option. There is only a better option for a particular use case you're looking at, given the specific constraints at hands. That's what engineering is about, including software engineering.
If you want to learn more about designing storage systems by constraints, I recommend that you read the 2007 Dynamo paper from Amazon, and in particular section 2.3 "Design Considerations". Below is a link, you can easily find a PDF online if you need.
https://www.allthingsdistributed.com/2007/10/amazons_dynamo....
Basically all meaningful data in an application context has relationships. There is no real such thing as "non-relational data"
Instead the question really is do you want a planned, enforced schema or an unplanned, freeform one.
Use a SQL database for the former.
Take a long look in a mirror and question the decisions that made you the way you are if the latter.
Storing more context in document helps obviously because you don't have to fetch the data many times, it's actually also done in relational databases whenever needed. But you can't store a lot in one document, that doesn't scale nor work.
For example, if someone changes its avatar or want to delete its account, do you want to parse all your social network documents to update an avatar or remove the comments on a tiny subset of them ? If a post is popular, are you going to update its document thousands of times per second ?
In practice you will most likely find a mix of everything. Relational databases, in memory data stores, cache layers, perhaps a few nosql documents database, some big data stuff and a probably some excel sheets.
The limitations that Sarah Mei identifies as clownpants is using a 32 bit primary key for an identifier for an ephemeral thing. That is again nothing to do with SQL vs NoSQL. It would affect both of them the same way.
So far I never used NoSQL.
NoSQL is for incompetent people who can't figure out how to convert a JSON request to a table structure. They just put the entire JSON as it is in a DB and call it NOSQL.
Anyone using NoSQL for anything is either lying or clueless.
Besides being a good read overall, the book discusses topics like this one in detail and with a healthy attitude (people tend to have strong opinions on this)
I thought you were supposed to use a graph database for that, like dgraph. Do I remember incorrectly?
> Dgraph is a horizontally scalable and distributed GraphQL database with a graph backend.
---
Edit: found the source... According to https://www.infoworld.com/article/3251829/why-you-should-use...:
"However, as with any popular technology, there can be a tendency to apply graph databases to every problem. It’s important to make sure that you have a use case that is a good fit. For example, graphs are often applied to problem domains like:
- Social networks
- Recommendation and personalization
- Customer 360, including entity resolution (correlating user data from multiple sources)
- Fraud detection
- Asset management"
- do you need relational data, or something more simple, or something more flexible ?
- do you need transaction integrity ? Transaction integrity is a nice feature, but you can also design all your code so that if something blows "in the middle", it is somehow repaired automatically in a further event.
Maybe a third point: most of our relational / transactional database technology is quite old. Could we do something better than SQL query language, common database types, and the actual database code that was very optimized for magnetic spinning disks, but maybe is not optimized for SSD ? Maybe, we would need something like SQLV2.
And my god how much hype bullshit is inserted in those technical discussions.
Of course, you will use materialized views for even better performance.
> ...the worst are full of passionate intensity.
that said, it's difficult to feel sympathy for people supporting a platform that encourages terrorism, murder, etc.
The type of join shouldn't be a problem, SQL engines should in most cases be able to determine the best join. In the cases it can't you can go start tweaking (although tricky to get right, especially if your data evolves, it's possible, you probably want to fix your query plan). B is however tricky and a performance loss since it's really a bit silly that data is flattened into a set each time to be then (probably) put into a nested (Object-Oriented or JSON) format to provide the data to the client. This is closely related to C, in a social graph you might have nodes (popular people or tweets) who have a much higher amount of links than others. That means if you do a regular join on tweets and comments and sort it, on the tweet you might not get beyond the first person. Instead, you probably only want the first x comments. That query might result in an amount of nested groups. So it might look more like the following SQL (wrote it by heart, probably not correct):
SELECT tweet.*, jsonb_agg(to_jsonb(comment)) ->> 0 as comments, FROM tweet JOIN comment ON tweet.id = comment.tweet_id
GROUP BY tweet.id HAVING COUNT(comment.tweet_id) < 64 LIMIT 64
That obviously becomes increasingly complex if you want a feed with comments, likes, retweets, people, etc.. all in one. There are reasons why two engineers that helped to scale twitter create a new database (https://fauna.com/) where I work. Although relational, the relations are done very differently. Instead of flattening sets, you would essentially walk the tree and on each level join. I did an attempt to explain that here for the GraphQL case: https://www.infoworld.com/article/3575530/understanding-grap...
TLDR, in my opinion you can definitely use a traditional relational database. But it might not be the most efficient choice due to the impedance mismatch. Relational applies to more than traditional SQL databases though, graph database or something like fauna is also relational and would be a better match (Fauna is similar in the sense that joins are very similar to how a graph database does these). Obviously I'm biased though since I work for Fauna.