How do you discover features in unknown code bases?

Question

I'm realizing that one of the reasons why I don't do a lot of additional hobby programming is that I'm missing a fundamental skillset that I never developed (in a reliable way) over the years. I think I don't know how to discover features that I'm interested in, in code bases that I'm unfamiliar with. Example: In Chromium I want to find the algorithm that is building the DOM. I'm not sure if that is even part of the code base (https://github.com/chromium/chromium). How would you personally approach this problem?

lukasgraf · Accepted Answer

Version control can be a big help.
Look through closed tickets in the issue tracker, and try to find a change (bugfix or new feature) that must have, given its nature, touched the functionality you're looking for. Then try to find the changeset(s) where that ticket's change was implemented.
With some luck, the changeset will include a modification to the part of the codebase you've been looking for.
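If the ticket numbers or feature names show up in commit messages, the changeset hunt can be partly scripted. Below is a minimal sketch in Python, assuming `git` is on the PATH and that commits mention the keyword you care about; the helper names `files_touched_by` and `rank_files` and the sample paths are made up for illustration, not part of any real tool or of Chromium.

```python
import subprocess
from collections import Counter


def files_touched_by(keyword, repo="."):
    """Return raw `git log` output listing the files changed by every commit
    whose message mentions `keyword` (case-insensitive)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "-i", f"--grep={keyword}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def rank_files(log_output, top=10):
    """Count how often each path appears and return the most frequent ones."""
    paths = [line.strip() for line in log_output.splitlines() if line.strip()]
    return Counter(paths).most_common(top)


if __name__ == "__main__":
    # Hypothetical paths, standing in for real `git log` output.
    sample = "src/parser.cc\nsrc/tree.cc\n\nsrc/parser.cc\n"
    for path, count in rank_files(sample):
        print(count, path)
```

Files that keep appearing in commits about the feature are a reasonable first place to read. On a real checkout you would call `rank_files(files_touched_by("some keyword", "/path/to/repo"))`.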

sillysaurusx · Answer

I would search WebKit, not chromium. But keep in mind that there’s a difference between “unknown code bases” and “one of the largest and most complex code bases ever created.” You’re asking, essentially, “in the Windows source code, where is the window layout algorithm?”
Stuff like that is certainly possible to find. But it requires a lot of time and dedication.
I would personally search for job listings for Chromium / Firefox, then use that knowledge to find someone who works on it. Then I’d ask them where it is.
But only in this specific case. My normal workflow is to build whatever it is I’m looking at, then change things until it breaks. It’s pretty quick to narrow down what I’m looking for at that point.
That doesn’t work here because building chrome requires close to a supercomputer.
EDIT: Actually, I would try to find a crash log related to the DOM. The stack trace will point you precisely where you’re interested in. Doing that is easier said than done, but I’ve pulled that trick a couple times, so it seems worth mentioning.

tacostakohashi · Answer

As well as poking around in the source code, do not discount non-source based approaches.
For example:
* Run chromium using strace, ltrace, gdb to see what's going on at runtime.
* Do some experiments / reverse engineering, treating the application / source code as a black box. Try different HTML input, inspect the DOM in chrome, possibly automate this process via selenium or something, and discover the runtime behavior of the algorithm that way.
The thing to keep in mind is that, for all you know, the DOM building algorithm is split across thousands of source files, or is in fact in some dependency and not in chromium itself, or is split across both. Presumably there is some particular aspect of the DOM building that you are interested in, so experiment with how that works, instead of trying to find / understand the entirety of chromium DOM building.

eurasiantiger · Answer

Clone the repository, open it and let my IDE build an index of the codebase, then use ctrl-p and/or "grep -r keyword ."
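For anyone without grep at hand, roughly the same search can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for grep or an IDE index; the function name `grep_tree` and the default extension filter are assumptions made for the example.

```python
import os


def grep_tree(root, keyword, extensions=(".cc", ".h")):
    """Yield (path, line_number, line) for every line under `root`
    that contains `keyword`, ignoring case."""
    needle = keyword.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue  # skip files outside the chosen extensions
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line.lower():
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and keep searching


if __name__ == "__main__":
    for path, number, line in grep_tree(".", "keyword"):
        print(f"{path}:{number}: {line}")
```

Restricting by extension keeps the noise down in a large tree, which is the same reason people reach for path and language filters in a hosted code search.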

biggedyb · Answer

Personally I've never needed to unpick a full-blown standalone app like Chromium before. I'm usually working with repo code, or dropped into some hulking, spaghetti-heavy legacy app that needs a bugfix, where there's no one left at the company who has any idea how it works and the documentation is _lacking_. If this helps then good; otherwise, oops, sorry for the noise.
Breakpoint methodology wins for me, simple and true.
I imagine it like pathfinding the Minotaur's maze: you stand at the last place you recognise and can get back to (if that's literally the first active line then that's fine), and put something there (breakpoint, print statement, log line), run it and check you still know where you are. Then put another down as far forward from that point as you can 'see'; if that's literally one operation step then fine, spin it and check. Breakpoints are easily put down and just as easily cleared back up again. Keep only as many as you need to see which branch you are on.
Pretty soon you'll have run the damn thing so many times that you'll know its bootstrapping and foibles, and they will be second nature. You'll start seeing how it's generally laid out, and you'll know where the main startup branchings are. When they leap into async or hidden 'rooms', log lines are perfect.
When the engine of it starts moving in your head, then is the time to start throwing breakpoints, prints or log lines in places that originally were completely unknown but that you now have a feeling for. It's at this point you'll be bloody close to where you want to be.
Oh, and do future you a favour: at least jot down something as you're going through this. I find that this initial torchlighting is remarkably gratifying, but if you don't make notes, in six months' time it'll be completely gone, and you'll have to do between a quarter and a half of this all over again before the lights start coming on and you remember how it's laid out.
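For code that happens to be Python, the "drop a marker and see which branch you're on" step can be partly automated. The sketch below uses the standard `sys.settrace` hook to record which Python functions get entered during one run; the helper name `trace_calls` is made up for this example, and it will not see into C extensions or help with a C++ code base like Chromium, where a debugger does this job.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run `func` and return (result, names of Python functions entered,
    in call order)."""
    seen = []

    def tracer(frame, event, arg):
        if event == "call":
            seen.append(frame.f_code.co_name)
        return None  # no line-by-line tracing inside each call

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, seen


if __name__ == "__main__":
    # Two toy functions standing in for the unknown code path.
    def tokenize(text):
        return text.split()

    def parse(text):
        return tokenize(text)

    result, seen = trace_calls(parse, "a b")
    print(result, seen)
```

The recorded list is a rough map of the path one input takes, which gives you candidate spots for the next real breakpoint, and something concrete to paste into those notes.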

maattdd · Answer

It's hard. Normally I start with a word that I know is fairly unique to the domain I'm looking for (in your case maybe "cssselector"). And what you are looking for is in the Blink third party folder.

chimineycricket · Answer

Usually some kind of string search works. If it's frontend, search for a string that appears on the feature. If it's backend, search for a string such as a table name, an HTTP error message, anything like that.

LostRick · Answer

I'm no expert, but I'm currently getting into some coding after a bit of a break. Usually there is documentation, but you never know how relevant it still is. For this example with Chromium, it looks like each folder has a readme.md; one even links to a dev wiki/guide, and in there you can find Google Docs with diagrams of the architecture :) Other projects might focus on Doxygen, which collects the comments from the source files and puts them into, for example, HTML with class trees etc.

rad_gruchalski · Answer

I usually start by finding issues related to the code fragment I’m interested in. Those usually lead to pull requests in the code I’m interested in.

actually_a_dog · Answer

Read the tests.

simonblack · Answer

You have to know what it is that you want to find before you start discovering.
My way is to have a project. What that is is unimportant, but it needs to be big enough that you run into roadblocks.
Now you know what your weak point is, and what you need to learn to overcome that weak point. So now you also know what it is you need to search for in that code-base.

flamesofphx · Answer

Poke, probe, prod... See what changes, till you get a better idea... I mean, what else can you do when there's no comment/documentation and you're dealing with something like:
function C($a, $b, $ba, %bb, $c, $d = NULL) { //insert random garbage with eval statement.. }

charcircuit · Answer

You use code search. For Chromium it's hosted at https://source.chromium.org/chromium.
You can use filters to narrow down the results to the right languages and paths.

nottorp · Answer

Reading and drawing out the structure as you read. Boxes with arrows or whatever.
A good source structure tool will save some time but you’re not getting away without doing your own reading anyway.

teeray · Answer

Search for pull requests that fix bugs in the area you’re interested in. They’ll point you towards the responsible sections of the codebase.

iFire · Answer

Sourcetrail is still modern. It hasn't fully bitrotted.

baq · Answer

ripgrep
But to not leave a one-word answer, start searching for a feature you know about and look around from where you find it. It might help to add a super small feature yourself - when it finally works, you’ll have some idea of how the code is structured and will be able to infer where other features would live if they were there. (That might take a bit more than one addition ;))
In Chromium’s case, what you’re trying to do is more like reverse engineering… I’d start with a debugger.
