For example...
- Kafka
- RabbitMQ
- Kinesis
- Spark
- Elasticsearch
- MapReduce
- BigQuery
- InfluxDB
- Hadoop
- Teradata
- Snowflake
- Databricks
...
I understand Postgres the best, and would love to know why these and others exist, where they fit in, why they are better than PSQL and for what, and if they are cloud-only, what their alternatives are... It seems all of them just store data, which PSQL does too, so what's the difference?
Recommended by the CTO of Azure, creator of Kafka, and many HN users on other threads including me :)
[0]: https://docs.microsoft.com/en-us/azure/architecture/patterns...
a) most companies deal with small amounts of data. Small can mean dozens of megabytes to dozens or hundreds of gigabytes. A single well provisioned server will typically be able to handle that very well. Also an SQL database can do a great deal if you know what you're doing.
b) inappropriately used big data frameworks are expensive performance killers. https://adamdrake.com/command-line-tools-can-be-235x-faster-... for example (see the sketch after this list).
c) Good quality programming, as in understanding the machine, memory layout and why it matters, and a good understanding of algorithms (and a hefty dose of common sense), will often yield you more speedup than buying almost any number of new machines.
d) Hiring is often driven by fads, and companies often don't like being told 'you don't need this roomful of servers'; they like to waste money, so maybe do learn them (though the profligacy with money is likely coming to an end with the economic damage of COVID).
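To make point (b) concrete, here is a minimal sketch in the spirit of the linked "command-line tools" article: a single streaming pass over a multi-gigabyte log on one machine, no cluster required. The file name and the column layout (event type in the second tab-separated field) are made up for illustration.

```python
# Aggregate a large tab-separated log file in one streaming pass.
# Constant memory, no distributed framework needed.
from collections import Counter

def count_events(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:                          # one line at a time
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                counts[fields[1]] += 1          # tally by event type (hypothetical column)
    return counts

if __name__ == "__main__":
    print(count_events("events.log").most_common(10))
```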
Takeaway: brainpower will get you much further than horsepower
Link: https://robertheaton.com/2020/04/06/systems-design-for-advan... HN comments: https://news.ycombinator.com/item?id=23904000
When it comes to data, you are ultimately worried about 1. storing it and making sure it stays there and 2. retrieving it or asking questions about that data with certain guarantees. Speed? Consistency? Local access? Grabbing a ton of rows at once? Grabbing really old data quickly? The old adage is true here: nothing in life is free. If you want fast writes you might sacrifice read performance, or vice versa. If you dial one knob up, another knob needs to get dialed down (usually). All of the tools you listed have various trade-offs and were designed or optimized for specific workloads. Some are more general (PSQL is a great example), but looking at them all spread out on a table, the differences become clearer.
Choosing your tool will depend on how well it meets your requirements and how well it is going to play with all your other systems. Systems thinking is a lot bigger than choosing a performant tool that has the right libraries. You gotta think about long-term support: how do I do backups of my data? How do I restore data? How do I perform upgrades down the road? How do I deal with downtime, can I throw more resources at it?
Long story short: I am very glad to hear more people thinking about systems engineering but make sure you don't get too caught up in the specific tooling and libraries. Learning and practicing the concepts and fundamentals and making sure to pause to think in the abstract boxes-and-lines sense is very important, too.
Learning about 'Clean architecture' and 'hexagonal architecture' will help to reinforce good systems design patterns.
- Read Designing Data Intensive Applications. As others have said, it's a gem of a book, very readable, and it covers a lot of ground. It should answer both of your questions. Take the time to read it, take notes, and you should be well set. If you need to dive deeper into specific topics, each chapter links to several resources.
- Read some classic papers (Dynamo, Spanner, GFS). Some of these are readable while some are not-so-readable, but it'll be useful to get a sense of what problems they solve and where they fit in. You may not understand all of the terminology but that's fine.
That should give you a strong foundation that you can build upon. Beyond that, just build some systems, experiment with the ideas that you're learning. You cannot replace that experience with any amount of reading, so build something, make mistakes, struggle with implementation, and you'll reinforce what you've learned.
Backend is vast, and this helps you build a general sense of the topic. When you find a topic that you're really interested in (say stream processing, storage systems, or anything else), you can dive into that specific topic with some extra resources.
> I understand Postgres the best, and would love to know why these and others exist, where they fit in, why they are better than PSQL and for what, and if they are cloud-only, what their alternatives are... It seems all of them just store data, which PSQL does too, so what's the difference?
A lot of that depends on the way you're building a system, the amount of data you're going to store, query patterns, etc. In most cases, there are tradeoffs that you'll have to understand and account for.
For example, a lot of column-oriented databases are better suited for analytics workloads. One of the reasons for that is their storage format (as the name says, columns rather than rows). Some of the systems you mentioned are built for search; some are built from the ground up to allow easier horizontal scaling, etc.
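Here is a toy illustration of the row vs. column idea (not any particular database's on-disk format): to aggregate a single field out of many, a columnar layout only has to scan that field's values, which at scale means much less I/O and better compression.

```python
# Row-oriented: every whole row is visited even though only "amount" is needed.
rows = [
    {"user_id": 1, "country": "DE", "amount": 9.99},
    {"user_id": 2, "country": "US", "amount": 4.50},
    {"user_id": 3, "country": "US", "amount": 12.00},
]
avg_row = sum(r["amount"] for r in rows) / len(rows)

# Column-oriented: one array per column; an aggregate over "amount" scans
# just that array, and similar values sitting together compress well.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "US"],
    "amount": [9.99, 4.50, 12.00],
}
avg_col = sum(columns["amount"]) / len(columns["amount"])

assert avg_row == avg_col  # same answer, very different I/O pattern at scale
```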
- Kafka: a service for defining and managing message streams; used in service architectures that communicate by message-passing and in high-throughput data processing applications and pipelines (a minimal produce/consume sketch follows this list).
- RabbitMQ: another message queue service; less complex than Kafka.
- Kinesis: a message queue service provided by AWS.
- Spark: an in-memory distributed computation engine; a central "driver" consumes job definitions, written in code, and farms them out to "workers"; horizontally scalable; a variety of options exist for managed/hosted Spark.
- Elasticsearch: a service for indexing data; consumes data blobs and search terms to associate them with; used to build search engines; many convenient utilities for managing search terms and queries.
- MapReduce: a paradigm for defining distributed data operations; partitions of a "job" are sent to "mappers" that compute partial results, and those results then flow to "reducers" that combine the partial results into the finished output; Hadoop is the best-known implementation of this paradigm.
- BigQuery: a scalable database offered by Google as a service.
- InfluxDB: a time series database; used for storing and analyzing data that has a time component.
- Hadoop: an implementation of the MapReduce paradigm; many hosted options, or you can run it on your own hardware.
- Teradata: a company that sells various data analysis tools that run on its custom data warehouse.
- Snowflake: hosted SQL database.
- Databricks: hosted Spark.
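As promised above, a minimal Kafka produce/consume sketch. This assumes the third-party kafka-python client and a broker running on localhost:9092; the topic name "orders" and the consumer group "billing" are made up for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append a message to the "orders" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b'{"order_id": 42, "total": 19.99}')
producer.flush()

# Consumer: read the topic from the beginning as part of a consumer group.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```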
You do need to learn about tools at least superficially, but when you learn how to build the right mental models for your problems, that's when the whole picture starts to become clear and you will just "see" how the right tools will slot into your problem. Then you can deep dive into those tools.
I'd highly recommend starting with Bret Victor's demo, Up And Down The Ladder Of Abstraction: http://worrydream.com/LadderOfAbstraction/ (view on desktop) to start building the "abstraction muscle".
Then it will become more apparent what constraints might lead you to choose a message bus with a RabbitMQ broker instead of making internal HTTPS calls, for example.
[But really, as to your final paragraph, just use Postgres until you can't anymore]
https://robertheaton.com/2020/04/06/systems-design-for-advan...
https://engineering.linkedin.com/distributed-systems/log-wha...
Using those things (Kafka, Hadoop etc.) when you don't have sufficient data to justify it is like using a supertanker to do your grocery shopping.
Have a look here - https://github.com/donnemartin/system-design-primer
There are many textbooks on this subject, but if you are feeling lost then I'd suggest starting with https://www.databass.dev/ which gives a decent bird's-eye view of many concepts.
It isn't only about features. Cost and security are big factors, as are risk, disaster recovery, data management, SLAs, available APIs, and interfaces.
You have to calculate how different resources and architectures will scale with your use-case and how much they will cost to develop and maintain.
There are also other variables that are related to your organization. Internal parameters like available skills, organization structure, project life cycle, available documentation, and long-term support are big factors when making a decision.
Also, check the benchmarks, the scalability, the architecture, etc. Sometimes databases with a similar frontend (API) are very different on the backend (architecture, implementation), for example CockroachDB vs PostgreSQL, and hence have different usage. One system may be built for OLTP, another for OLAP, etc.
I recommend you learn:
- ES - for text search
- ClickHouse - simplest OLAP
- Cassandra - petabytes of data, wide-column store
- a bit about time-series DBs (analytics)
- graph DBs
RabbitMQ, Kafka or Pulsar are used for message bus/queue implementations. A simple case: producing a message takes 1 time unit but processing it takes 5, so you want a kind of parallel processing without coupling to specific hosts; you put messages on a queue and subscribe multiple readers to that queue. Read the ZeroMQ docs on all the communication patterns to learn the typical cases.
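Here is an in-process sketch of that work-queue pattern using only the Python standard library. A real deployment would put RabbitMQ/Kafka/Pulsar between the producer and the workers so they can run on different hosts; the 1:5 timing ratio just mirrors the example above.

```python
import queue
import threading
import time

tasks: "queue.Queue[int]" = queue.Queue()

def producer(n: int) -> None:
    for i in range(n):
        time.sleep(0.01)      # producing a message: 1 time unit
        tasks.put(i)

def worker(worker_id: int) -> None:
    while True:
        item = tasks.get()
        time.sleep(0.05)      # processing: 5 time units, hence ~5 workers
        print(f"worker {worker_id} handled task {item}")
        tasks.task_done()

# Five workers subscribed to the same queue keep up with one producer.
for wid in range(5):
    threading.Thread(target=worker, args=(wid,), daemon=True).start()

producer(20)
tasks.join()   # wait until every task has been processed
```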
Many people have mentioned really good books (e.g., DDIA). Such books are good for gathering general knowledge about "systems design", but you will still be clueless about the differences between Kafka and RabbitMQ until you actually read their documentation manuals.
There is no shortcut I'm afraid. If you want to "understand seemingly endless options when it comes to data handling on backend side" you will have to read the corresponding seemingly endless documentation manuals. How else would you know about the advantages or disadvantages of, let's say, InfluxDB over Postgres if you don't read their manuals?
I recommend reading this specifically; this is basically an education in production systems, and covers a lot of ground. :)