HACKER Q&A
📣 Jiig

How to start learning a large 20-year-old code base?


I've been tasked with learning a very old C++ code base (30k+ lines) at my job. I've heard that it's originally from the late '90s and has been updated periodically since then, as it performs some very crucial processing.

The code is commented fairly well but is missing any sort of top-level architecture documentation or explanation.

I've started by just following the flow, but I'm getting very, very lost. Are there any tips you have to help me wrap my head around this?


  👤 davismwfl Accepted Answer ✓
Hard to give a lot of detail without knowing a little more, like whether this is a GUI tool, a utility library, an API, an algorithm, etc.

But here's some general advice:

1. Find all the inputs and map their usage/effect through the code. That will help you understand what happens when an input changes.

2. Find all the outputs and do the same as #1. Now you understand which inputs produce which expected outputs.

3. Trace through the calls with a given input and build a function & class map; that'll help you see how the code interacts (a sketch of one way to do this follows this list).
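
One low-tech way to build that map, assuming you can rebuild and run the code: drop a scope-entry logger into the functions along the path and push a single input through. A minimal sketch (the two traced functions are invented placeholders):

```cpp
// Call-tracing sketch: put TRACE() at the top of each function you care
// about, run one input through, and the stderr log becomes a rough call
// map for that input.
#include <cstdio>

#define TRACE() std::fprintf(stderr, "TRACE: %s (%s:%d)\n", __func__, __FILE__, __LINE__)

void parse_header() {  // hypothetical function from the code base
    TRACE();
    // ... existing logic ...
}

void decode_block() {  // hypothetical function from the code base
    TRACE();
    // ... existing logic ...
}

int main() {
    parse_header();
    decode_block();
    return 0;
}
```

The order of the TRACE lines gives you the call sequence for that input, which you can turn directly into the function & class map.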

Where this gets harder is if the code is multi-threaded, or if it is a huge monolith with lots of places to start. In the case of threads, document each thread and the functionality it provides: which inputs are shared, accessed, or needed, and what outputs come out. Also check the timing against other dependencies. In the case of a GUI monolith type of application, pick one piece of functionality (for example, login or app startup), trace everything that happens, and repeat this for a bunch of smaller pieces of functionality until you understand how the code is put together.

As a consultant I used to walk into weird shit all the time, and whether it's because I have a knack for it or because of my process (the basics described above), I can learn code bases quickly and be productive very rapidly. Things that make it harder: lots of DI, especially when it is totally unnecessary for the problem; third-party dependencies you can't access and that aren't well documented; multi-threading; and micro-services that are not done well and have lots of interdependencies. Poorly architected event systems can also make debugging and understanding flow super hard, so I have other methods for systems like that, but they basically follow the same pattern above.

The sure-fire way to get lost fast is to try to map it all out at once. You have to pick small pieces of functionality, map them, and build it out from there.


👤 Someone
First things to check:

- Can you build it, and if so, does the produced binary work? If so, look at the makefile (or equivalent) and hunt for compilation switches. If not, spend some time trying to make it build. If you don't succeed, tell your manager that this will be a lot harder (if the code doesn't build and run, you don't even know whether you have all the code, IDEs may have trouble analysing it, etc.)

- And, given that this is a decompressor, do you have access to the compressor, too? Chances are the makefile will give it to you. If you don't have one, that isn't a showstopper, but it may make things more difficult, so inform your manager.
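
If you do get both sides, a cheap early win is a round-trip test: it confirms you've correctly identified the entry points and gives you a harness for later experiments. A minimal sketch, where `compress` and `decompress` are hypothetical stand-ins (identity stubs here) for whatever the code base actually exposes:

```cpp
// Round-trip sanity check. The two stubs below are placeholders; swap in
// the real compressor/decompressor entry points once you've found them.
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

Bytes compress(const Bytes& in)   { return in; }  // stub: replace with real call
Bytes decompress(const Bytes& in) { return in; }  // stub: replace with real call

int main() {
    std::string s = "the quick brown fox jumps over the lazy dog";
    Bytes original(s.begin(), s.end());
    // If decompress(compress(x)) != x, you have either misidentified the
    // entry points or found a real bug; both are worth knowing early.
    assert(decompress(compress(original)) == original);
    return 0;
}
```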

- Is the code under source control? If so, look at the history. Going back to older releases may give you a simpler code base to work with (given that, elsewhere, you say "Its a decompression algorithm with a thin CLI around it", that may help a lot by getting rid of various optimisations and config options).

You can use various tools to visualise the call graph, but this being a decompressor, there likely are many low-level functions whose purpose you can't tell from the graph alone. If you aren't familiar with compression algorithms, or with this algorithm in particular, try googling the names of various functions, fields, or variables.

In the end, 30k lines of C++ isn't _that_ much. It may just be a matter of grinding through. If you browse 1,000 lines an hour (3½ seconds per line), that's only 30 hours, doable in a week (and a week is not much if you inherited the code base and aren't just visiting it). Just dive in; by the time you've spent 10 hours, you'll probably have generated some questions you want answered, discovered some #defines that control compilation, etc. Eventually you will have to read every line, but don't feel obliged to initially; just follow your instincts (and, if the business side has some short-term priorities, let those guide you).

On the one hand, decompression algorithms are typically of above-average complexity, which makes things harder; on the other hand, it is highly likely that there are various CPU-specific and/or OS-specific code paths that you can (initially) ignore, significantly decreasing your line count.
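
For illustration, this is the kind of platform-gated branching to look for (every name here is invented): find out which macros your build actually defines (the makefile's -D flags will tell you), read that one branch, and set the rest aside.

```cpp
// Invented example of CPU-specific code paths you can initially ignore.
struct DecoderState { /* hypothetical decoder state */ };

void decode_block_portable(DecoderState&) { /* readable reference path */ }
void decode_block_sse2(DecoderState&)     { /* hand-optimised x86 path */ }

void decode_block(DecoderState& state) {
#if defined(USE_SSE2)               // hypothetical switch set by the makefile
    decode_block_sse2(state);       // only relevant on x86 builds
#else
    decode_block_portable(state);   // start your reading here
#endif
}
```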


👤 AnimalMuppet
You might use a tool that documents (and helps you visualize) the call graph. From that, you might get a better idea of which parts are most important.

Another approach is to run it in a debugger and just step through it, watching how it does what it does.

30K lines isn't horrible, but you shouldn't expect to understand it overnight. You should count on it taking at least a few weeks.

And, when you're done, you should leave behind a top-level architecture document, and an explanation of how it operates.


👤 thedevindevops
If they've adopted the C++ 'interface' paradigm, find the abstract classes and map out all the virtual methods and what calls them; that should give you insight into who the major players are and how they interact. (The sketch below shows the pattern to look for.)
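
A minimal sketch of that pattern (names invented): the pure virtual methods mark an interface, and the classes overriding them are the major players.

```cpp
// The C++ "interface" pattern: grep for "= 0" to find the abstract classes.
class Codec {
public:
    virtual ~Codec() {}
    virtual void decode() = 0;  // pure virtual: each override is a "player"
};

class HuffmanCodec : public Codec {
public:
    void decode() { /* one concrete implementation */ }
};
```

The call sites of `decode()` then show you how the players interact without your having to read every implementation.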

👤 Irishsteve
Maybe not applicable here, but reading the unit and integration tests is usually where I start.
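
And if there are no tests, the first thing worth writing is a characterization test that pins down current behaviour before you touch anything. A minimal sketch, with `decompress` as a hypothetical stand-in (an identity stub here) for the real entry point:

```cpp
// Characterization test: record what the code does *today* so you notice
// when a change alters behaviour. sample.z / sample.golden are files you
// capture once from the current build.
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

using Bytes = std::vector<char>;

Bytes decompress(const Bytes& in) { return in; }  // stub: swap in the real call

Bytes read_file(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return Bytes(std::istreambuf_iterator<char>(f), {});
}

int main() {
    Bytes actual = decompress(read_file("sample.z"));
    assert(actual == read_file("sample.golden"));  // golden-output check
    return 0;
}
```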

👤 probinso
There are ways to turn Makefiles into Graphviz plots. I've gotten value from generating a plot, then systematically reducing complexity from the `.dot` files. It takes a long time, though.