HACKER Q&A
📣 mcsoft

How PostgreSQL source code is 3x shorter than MySQL's one?


It looks like PostgreSQL is more feature-rich and reliable, though it's written in only 1491985 LOC vs 4466967 LOC for MySQL. How is that?


  👤 twotwotwo Accepted Answer ✓
So I cloned the mysql-server and postgres repos and ran sloccount. It's not the deepest dive or anything but was interesting.

I saw MySQL had...600k lines of JavaScript?

It turned out that the storage/ndb directory had a Web-based management interface for NDB, which vendors in the Dojo JavaScript framework. It also had ~50k lines of Java for the "ClusterJ" framework, which interfaces with NDB skipping the SQL layer. Overall, sloccount reports about 1.4M lines of code in storage/ndb/.

NDB is a specialized cluster database where all secondary indexes have to fit in the cluster's RAM(!). Work on it started at Ericsson, then it was spun out into a startup which MySQL AB eventually bought. I imagine MySQL management at the time hoped the future of DB clusters might end up looking more like NDB and less like...gestures at today's database landscape.

There are also ~500K lines of Unicode tables in the strings/ directory. I recall Postgres calls out to libc for locale/collation related stuff so probably doesn't need those tables in-tree.

Even accounting for those chunks you still end up with ~1M vs ~2M SLOC as measured by sloccount. (I don't want to pretend the numbers are super precise.) There are probably other differences in what's in scope for the repo or other surprises. [Edit: see johannes1234321's comment which lists some of them.]

Besides those, though, might be truth to others' comments about MySQL spending lots of code supporting drastically different old and new "worlds" in a single binary (non-transactional and transactional storage, originally very-nonstandard vs. currently more-standard SQL, statement-based and row-based replication...). And at a totally non-technical level, as a product MySQL seems to have had more money thrown at it and that tends to mean more code.

This was fun but was an incredibly quick and dirty dive into it, and I'd love to hear more from folks who can look more or just know more.


👤 johannes1234321
One can argue about the statement that PostgreSQL is more feature rich.

MySQL has more replication features, different storage engines, etc. also in MySQL GIS functionality is included and not an external plugin (like PostGIS)

The source tree you looked at probably also has ndb cluster included; if you cloned from GitHub, you also get the MySQL Router and other side components.

MySQL also bundles most external dependencies (excluding for example boost)

Any serious discussion would require deeper analysis.

P.S. I work on the MySQL team, never looked at pgsql source, so can't really judge.


👤 rmrfstar
You could probably turn this into an interesting blog post.

+ What subsystems does each project have? How do LOC stack up on a per-subsystem basis?

+ Do they have different dependencies?

+ Does LOC have the same meaning in the two projects, or are there stylistic differences that meaningfully alter line counts?

+ Are statically linked binaries similarly sized?

So, it looks like you have the right question. You just need to do the digging.


👤 userbinator
Within a single language there's already considerable stylistic variance. I used to work in CS education and marked assignments, and I would see variances in length of students' (correct) solutions, for relatively trivial problems, that spanned a range of several multiples; e.g. what someone would take 50 lines to solve, someone else might need over 300. Part of it is comments and whitespace, but unnecessary abstractions also contribute to making code more "bloaty" than it could be.

👤 thinkx
Bruce Momjian wrote a few blogs about code quality.. the phrase “brutal replacement of old code” comes to mind

👤 capableweb
When groups of people develop software, you get software that represents the group and it's dynamics (Conway's law). I think it's mostly because of the people who wrote the software, and what priorities they had when working on it.

👤 Scarbutt
MySQL has a storage abstraction and includes multiple storages.

👤 owowow
Likely different strategies for software development and less need to support legacy clients contributed to the line count being much lower.

👤 aristofun
Why did you assume there is strong correlation between LOC and features/reliability in the first place?

👤 mobilemidget
That the source is much larger I don't mind _so much_. But this recent thing mysql is doing worries me more.

ls -lah /usr/sbin/mysqld -rwxr-xr-x 1 root root 1.1G Mar 26 2020 /usr/sbin/mysqld

1.1G binary.