For example...
- Kafka
- RabbitMQ
- Kinesis
- Spark
- Elasticsearch
- MapReduce
- BigQuery
- InfluxDB
- Hadoop
- Teradata
- Snowflake
- Databricks
...
I understand Postgres the best, and would love to know why these and others exist, where they fit in, why they are better than PSQL and for what, and if they are cloud-only, what their alternatives are... It seems all of them just store data, which PSQL does too, so what's the difference?
Recommended by the CTO of Azure, creator of Kafka, and many HN users on other threads including me :)
[0]: https://docs.microsoft.com/en-us/azure/architecture/patterns...
a) most companies deal with small amounts of data. Small can mean dozens of megabytes to dozens or hundreds of gigabytes. A single well provisioned server will typically be able to handle that very well. Also an SQL database can do a great deal if you know what you're doing.
b) inappropriately used big data frameworks are expensive performance killers. https://adamdrake.com/command-line-tools-can-be-235x-faster-... for example (see the sketch after this list).
c) Good quality programming, as in understanding the machine, memory layout and why it matters, and a good understanding of algorithms (and a hefty dose of common sense), will often yield you more speedup than buying almost any number of new machines.
d) Hiring is often driven by fads, and companies often don't like being told 'you don't need this roomful of servers'; they like to waste money, so maybe do learn them (though the profligacy with money is likely coming to an end with the economic damage of COVID).
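To make point (b) concrete, here is a minimal sketch in the spirit of the linked "command-line tools" article: a single streaming pass over a multi-gigabyte log on one machine, no cluster required. The file name and the column layout (event type in the second tab-separated field) are made up for illustration.

```python
# Aggregate a large tab-separated log file in one streaming pass.
# Constant memory, no distributed framework needed.
from collections import Counter

def count_events(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:                          # one line at a time
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                counts[fields[1]] += 1          # tally by event type (hypothetical column)
    return counts

if __name__ == "__main__":
    print(count_events("events.log").most_common(10))
```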
Takeaway: brainpower will get you much further than horsepower
Link: https://robertheaton.com/2020/04/06/systems-design-for-advan... HN comments: https://news.ycombinator.com/item?id=23904000
When it comes to data, you are ultimately worried about 1. storing it and making sure it stays there and 2. retrieving it or asking questions about that data with certain guarantees. Speed? Consistency? Local access? Grabbing a ton of rows at once? Grabbing really old data quickly? The old adage is true here: nothing in life is free. If you want fast writes you might sacrifice read performance, or vice versa. If you dial one knob up, another knob needs to get dialed down (usually). All of the tools you listed have various trade-offs and were designed or optimized for specific workloads. Some are more general (PSQL is a great example), but looking at them all spread out on a table, the differences become clearer.
Choosing your tool will depend on how well it meets your requirements and how well it is going to play with all your other systems. Systems thinking is a lot bigger than choosing a performant tool that has the right libraries. You gotta think about long-term support: how do I do backups of my data? How do I restore data? How do I perform upgrades down the road? How do I deal with downtime, can I throw more resources at it?
Long story short: I am very glad to hear more people thinking about systems engineering but make sure you don't get too caught up in the specific tooling and libraries. Learning and practicing the concepts and fundamentals and making sure to pause to think in the abstract boxes-and-lines sense is very important, too.
Learning about 'Clean architecture' and 'hexagonal architecture' will help to reinforce good systems design patterns.
- Read Designing Data Intensive Applications. As others have said, it's a gem of a book, very readable, and it covers a lot of ground. It should answer both of your questions. Take the time to read it, take notes, and you should be well set. If you need to dive deeper into specific topics, each chapter links to several resources.
- Read some classic papers (Dynamo, Spanner, GFS). Some of these are readable while some are not-so-readable, but it'll be useful to get a sense of what problems they solve and where they fit in. You may not understand all of the terminology but that's fine.
That should give you a strong foundation that you can build upon. Beyond that, just build some systems, experiment with the ideas that you're learning. You cannot replace that experience with any amount of reading, so build something, make mistakes, struggle with implementation, and you'll reinforce what you've learned.
Backend is vast, and this helps you build a general sense of the topic. When you find a topic that you're really interested in (say stream processing, storage systems, or anything else), you can dive into that specific topic with some extra resources.
> I understand Postgres the best, and would love to know why these and others exist, where they fit in, why they are better than PSQL and for what, and if they are cloud-only, what their alternatives are... It seems all of them just store data, which PSQL does too, so what's the difference?
A lot of that depends on the way you're building a system, the amount of data you're going to store, query patterns, etc. In most cases, there are tradeoffs that you'll have to understand and account for.
For example, a lot of column-oriented databases are better suited for analytics workloads. One of the reasons for that is their storage format (as the name says, columns rather than rows). Some of the systems you mentioned are built for search; some are built from the ground up to allow easier horizontal scaling, etc.
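Here is a toy illustration of the row vs. column idea (not any particular database's on-disk format): to aggregate a single field out of many, a columnar layout only has to scan that field's values, which at scale means much less I/O and better compression.

```python
# Row-oriented: every whole row is visited even though only "amount" is needed.
rows = [
    {"user_id": 1, "country": "DE", "amount": 9.99},
    {"user_id": 2, "country": "US", "amount": 4.50},
    {"user_id": 3, "country": "US", "amount": 12.00},
]
avg_row = sum(r["amount"] for r in rows) / len(rows)

# Column-oriented: one array per column; an aggregate over "amount" scans
# just that array, and similar values sitting together compress well.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "US"],
    "amount": [9.99, 4.50, 12.00],
}
avg_col = sum(columns["amount"]) / len(columns["amount"])

assert avg_row == avg_col  # same answer, very different I/O pattern at scale
```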
- Kafka: a service for defining and managing message streams; used in service architectures that communicate by message-passing and in high-throughput data processing applications and pipelines (a minimal produce/consume sketch follows this list).
- RabbitMQ: another message queue service; less complex than Kafka.
- Kinesis: a message queue service provided by AWS.
- Spark: an in-memory distributed computation engine; a central "driver" consumes job definitions, written in code, and farms them out to "workers"; horizontally scalable; a variety of options exist for managed/hosted Spark.
- Elasticsearch: a service for indexing data; consumes data blobs and search terms to associate them with; used to build search engines; many convenient utilities for managing search terms and queries.
- MapReduce: a paradigm for defining distributed data operations; partitions of a "job" are sent to "mappers" that compute partial results, and those results then flow to "reducers" that combine the partial results into the finished output; Hadoop is the best-known implementation of this paradigm.
- BigQuery: a scalable database offered by Google as a service.
- InfluxDB: a time series database; used for storing and analyzing data that has a time component.
- Hadoop: an implementation of the MapReduce paradigm; many hosted options, or you can run it on your own hardware.
- Teradata: a company that sells various data analysis tools that run on its custom data warehouse.
- Snowflake: hosted SQL database.
- Databricks: hosted Spark.
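As promised above, a minimal Kafka produce/consume sketch. This assumes the third-party kafka-python client and a broker running on localhost:9092; the topic name "orders" and the consumer group "billing" are made up for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append a message to the "orders" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b'{"order_id": 42, "total": 19.99}')
producer.flush()

# Consumer: read the topic from the beginning as part of a consumer group.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```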
You do need to learn about tools at least superficially, but when you learn how to build the right mental models for your problems, that's when the whole picture starts to become clear and you will just "see" how the right tools will slot into your problem. Then you can deep dive into those tools.
I'd highly recommend starting with Bret Victor's demo, Up And Down The Ladder Of Abstraction: http://worrydream.com/LadderOfAbstraction/ (view on desktop) to start building the "abstraction muscle".
Then it will become more apparent what constraints might lead you to choose a message bus with a RabbitMQ broker instead of making internal HTTPS calls, for example.
[But really, as to your final paragraph, just use Postgres until you can't anymore]
https://robertheaton.com/2020/04/06/systems-design-for-advan...
https://engineering.linkedin.com/distributed-systems/log-wha...
Using those things (Kafka, Hadoop etc.) when you don't have sufficient data to justify it is like using a supertanker to do your grocery shopping.
Have a look here - https://github.com/donnemartin/system-design-primer
There are many textbooks on this subject, but if you are feeling lost then I'd suggest starting with https://www.databass.dev/ which gives a decent bird's-eye view of many concepts.
It isn't only about features. Cost and security are big factors, as are risk, disaster recovery, data management, SLAs, available APIs, and interfaces.
You have to calculate how different resources and architectures will scale with your use-case and how much they will cost to develop and maintain.
There are also other variables that are related to your organization. Internal parameters like available skills, organization structure, project life cycle, available documentation, and long-term support are big factors when making a decision.
Also, check the benchmarks, the scalability, the architecture, etc. Sometimes databases with a similar frontend (API) are very different on the backend (architecture, implementation), for example CockroachDB vs PostgreSQL, and hence have different usage. One system may be built for OLTP, another for OLAP, etc.
I recommend you learn:
- ES - for text search
- ClickHouse - simplest OLAP
- Cassandra - petabytes of data, wide-column store
- a bit about time-series DBs (analytics)
- graph DBs
RabbitMQ, Kafka or Pulsar are used for message bus/queue implementations. A simple case: producing a message takes 1 time unit but processing it takes 5, so you want a kind of parallel processing without coupling to specific hosts; you put messages on a queue and subscribe multiple readers to that queue. Read the ZeroMQ docs on all the communication patterns to learn the typical cases.
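Here is an in-process sketch of that work-queue pattern using only the Python standard library. A real deployment would put RabbitMQ/Kafka/Pulsar between the producer and the workers so they can run on different hosts; the 1:5 timing ratio just mirrors the example above.

```python
import queue
import threading
import time

tasks: "queue.Queue[int]" = queue.Queue()

def producer(n: int) -> None:
    for i in range(n):
        time.sleep(0.01)      # producing a message: 1 time unit
        tasks.put(i)

def worker(worker_id: int) -> None:
    while True:
        item = tasks.get()
        time.sleep(0.05)      # processing: 5 time units, hence ~5 workers
        print(f"worker {worker_id} handled task {item}")
        tasks.task_done()

# Five workers subscribed to the same queue keep up with one producer.
for wid in range(5):
    threading.Thread(target=worker, args=(wid,), daemon=True).start()

producer(20)
tasks.join()   # wait until every task has been processed
```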
Many people have mentioned really good books (e.g., DDIA). Such books are good for gathering general knowledge about "systems design", but you will still be clueless about the differences between Kafka and RabbitMQ until you actually read their documentation manuals.
There is no shortcut I'm afraid. If you want to "understand seemingly endless options when it comes to data handling on backend side" you will have to read the corresponding seemingly endless documentation manuals. How else would you know about the advantages or disadvantages of, let's say, InfluxDB over Postgres if you don't read their manuals?
I recommend reading this specifically; this is basically an education in production systems, and covers a lot of ground. :)