How do you search large codebases before adding a feature or fixing a bug?

Question

Why do we need to search source code?

1. To quickly learn the domain and context of the application.
2. After adding a feature, we should be aware of whether we broke anything (assume you work with code that doesn't have test cases); it helps to search the test cases too.
3. To find similar code and make sure you are improving the quality of all the similar code (not just fixing the current bug).
4. To understand how the application behaves when there are production issues.

I have mostly dealt with large inherited codebases in my career, and we often need to search for similar code or for usages of a certain variable, function, class, or module. With a statically typed language, the IDE/compiler helps to a certain extent. But we have to deal with different languages, and sometimes developers copy/paste for various reasons. Searching/grepping code and its usages seems to be very useful.

As a developer, what are all the ways you search source code before working on a feature or fixing a bug? Do you use any CLI tools other than grep? I have used OpenGrok, but sometimes it is not maintained by me or by other developers.

Below are my steps:

1. Read the relevant code and learn the domain keywords and variable names (including class/method/function names).
2. Use the Bitbucket/GitHub/git search.
3. Use grep.
4. Use git-grep.

Still, a few times I end up missing things. This (especially CLI-based search) seems like a very valuable skill to have. Do you have any tips/tools for other developers?

pyjarrett · Accepted Answer

I work on code bases with millions of lines, so I wrote a tool called Septum to help me (https://github.com/pyjarrett/septum/). This isn't to replace grep or ripgrep or silver searcher, those are all great tools you should have!

Septum is neighborhood based (context-based) search, so you can find contiguous groups of lines which contain specific things, but exclude other things. It's also interactive so you can add/remove filters as needed. This makes it useful for those cases where terms change based on their context so you can exclude terms related to the contexts you don't want to keep. It reads .septum/config which contains its normal commands to load directories and settings, so you can have different configs per project you're working on.
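To make the idea of neighborhood-based search concrete, here is a minimal Python sketch. It is not Septum's implementation or its command syntax: the function name `find_neighborhoods` and its parameters are hypothetical, and it assumes a neighborhood is simply a fixed-size window of consecutive lines matched by plain substring tests.

```python
from typing import List, Sequence, Tuple


def find_neighborhoods(
    lines: Sequence[str],
    include: Sequence[str],
    exclude: Sequence[str] = (),
    context: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start, end) line ranges (0-based, inclusive) of windows that
    contain every `include` term and none of the `exclude` terms."""
    if not lines:
        return []
    width = 2 * context + 1  # window size: a centre line plus context on each side
    matches = []
    for start in range(0, max(1, len(lines) - width + 1)):
        window = lines[start:start + width]
        text = "\n".join(window)
        has_all = all(term in text for term in include)
        has_excluded = any(term in text for term in exclude)
        if has_all and not has_excluded:
            matches.append((start, start + len(window) - 1))
    return matches


# Hypothetical sample input: two similar functions, one in a "legacy" context.
source = [
    "def load_config(path):",
    "    # legacy loader",
    "    return parse(path)",
    "",
    "def load_user(path):",
    "    cache = {}",
    "    return parse(path)",
]
hits = find_neighborhoods(source, include=["load", "parse"], exclude=["legacy"], context=1)
print(hits)
```

The exclude filter is what separates this from a plain grep: windows that mention both terms but sit next to the unwanted context are dropped.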

egberts1 · Answer

Sourcetrail (Google) has recently been my primary go-to for large multi-MLOC C and C++ projects like Mozilla Firefox. I can now insert instrumentation testpoints deep inside the Firefox JavaScript JIT engine within two days, starting from zero knowledge, and quicker from there on.
- Sourcetrail (GUI/Linux/Windows, closed-net-capable, archived) - https://github.com/CoatiSoftware/Sourcetrail
- SourceInsight (repo/web-server/closed-net) https://www.sourceinsight.com/
- OpenGrok (web server plug-in/Java/closed-net) https://oracle.github.io/opengrok/
- CLion (GUI-based IDE) - By IntelliJ/JetBrains - https://www.jetbrains.com/clion/
- SourceGraph (Web-based) https://about.sourcegraph.com (thanks, gravypod)
- Codesee.io (GitHub/web-based) - https://www.codesee.io/privacy-and-security
For free as in beer, I prefer OpenGrok so I can get more than JetBrains

ericsilly · Answer

Off the tools topic, but IMO the most important consideration: the mindset should be entirely about understanding the existing approach, conventions and philosophy, vs a critical assessment leading to "this needs to be modernized". Particularly on small-medium codebases with smaller teams, I've seen projects be fundamentally damaged by new, well-meaning devs who bypass most of the hard-slog of really understanding the existing how/why, and instead try to jump to the more comfortable space of using tooling or approaches they're more familiar with. There is certainly a place for that, but depending on the project, that might be 3 to 6 months later. Programmers need to appreciate the power and consequences for management and non-programming team members (product) when a new dev brings a condemning assessment of an existing codebase after one or two weeks.

daxaxelrod · Answer

Whenever working with a new codebase I always try to find the route definitions file first. Something that maps the API interface to the functions they call. I can then reason backwards from any service by clicking into whatever I'm interested in. After that I look for where config is defined and try to understand what's unique about this env's setup.

tuankiet65 · Answer

Whenever I work on a huge codebase (think 1M+ lines of code), I always reach for Russ Cox's codesearch https://github.com/google/codesearch. It requires indexing the codebase first, which takes 15 minutes or so, but after that searches are instant.
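For a rough feel of why indexing first makes later searches fast: codesearch is built around a trigram index. The toy Python sketch below shows that idea for literal substrings only. It is not codesearch's code or API; `TrigramIndex` and the sample file names are hypothetical, and real regular-expression support is left out.

```python
from collections import defaultdict
from typing import Dict, List, Set


def trigrams(text: str) -> Set[str]:
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, content: str) -> None:
        """Index one document: record which trigrams occur in it."""
        self.docs[name] = content
        for gram in trigrams(content):
            self.postings[gram].add(name)

    def search(self, literal: str) -> List[str]:
        """Names of documents containing `literal`, sorted."""
        grams = trigrams(literal)
        if not grams:
            # Query shorter than 3 characters: the index cannot narrow anything.
            candidates = set(self.docs)
        else:
            # Only documents containing every trigram of the query can match.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Confirm candidates with a real substring check.
        return sorted(name for name in candidates if literal in self.docs[name])


index = TrigramIndex()
index.add("a.c", "int parse_header(char *buf);")
index.add("b.c", "void render(void);")
index.add("c.c", "parse_body(); /* header is parsed elsewhere */")
print(index.search("parse_header"))
print(index.search("parse"))
```

The expensive part (building `postings`) happens once; each search then only opens the few files whose trigram sets are compatible with the query.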

sqs · Answer

I made a web site that catalogues how various companies/projects use code search:
https://codesearchguide.org/story/google
https://codesearchguide.org/story/facebook
https://codesearchguide.org/story/brave
https://codesearchguide.org/story/chromium-android
https://codesearchguide.org/story/linux
https://codesearchguide.org/story/yelp
https://codesearchguide.org/story/stripe
The Google one in particular has a great breakdown of how they use code search by use case (examples, exploration, etc.).
And here are a bunch of known code search tools: https://codesearchguide.org/tools
(Disclaimer: I am the Sourcegraph CEO and our core product is code search.)

mangamadaiyan · Answer

I almost always fall back on ag (https://github.com/ggreer/the_silver_searcher).
Honourable mentions to cscope and ctags. They work for me since most of my $dayjob involves me mucking around with C++.
All tools get invoked from within Vim. (Which _also_ works reasonably well in Windows Terminal).

weinzierl · Answer

I keep a directory with up-to-date clones of all relevant repos. This is separate from my usual working directory. I experimented with git workspaces, but it wasn't worth the trouble, especially since the set of repos I'm working on is not necessarily the same as the ones I keep in my search dir.

At the root level I maintain two scripts:

  clone.sh

  update.sh

clone.sh has one

  git clone --recursive ..

line per repo. When I run low on disk space I sometimes delete larger repos. The clone script allows me to easily re-clone everything in this case.

update.sh is similar but pulls all repos.
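A rough Python equivalent of such an update script, as a sketch only: it assumes every repo is an immediate subdirectory of one search directory, and the function names are hypothetical. It uses `git -C <dir> pull` so there is no need to change the working directory.

```python
import subprocess
from pathlib import Path
from typing import List


def find_repos(root: Path) -> List[Path]:
    """Immediate subdirectories of root that contain a .git entry."""
    return sorted(p for p in root.iterdir() if p.is_dir() and (p / ".git").exists())


def pull_commands(root: Path) -> List[List[str]]:
    """One `git -C <repo> pull` command line per repo found under root."""
    return [["git", "-C", str(repo), "pull"] for repo in find_repos(root)]


def update_all(root: Path) -> None:
    """Run the pull commands one after another."""
    for cmd in pull_commands(root):
        # check=False so that one failing repo does not stop the rest.
        subprocess.run(cmd, check=False)
```

Calling `update_all(Path("/path/to/search-dir"))` (a placeholder path) would then pull every clone in turn.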

For global search across all branches I do:

  git grep <regexp> $(git rev-list --all)

(when I forget the line I look it up in Stack Overflow [1])

This is especially useful since I work a lot with Bitbucket and to the best of my knowledge you can only search the default branches there.

When I know the branch, but want to search across all history I use git pickaxe, aka

  git log -S ...

All of this is not very sophisticated and takes a lot of disk space but it works pretty well for me.

[1] https://stackoverflow.com/a/15293283

eatonphil · Answer

At Oracle (keep in mind every org within Oracle is very different) I wrote a crappy script to grab all repos in my org and hook them up to Etsy Hound [0]. But I couldn't stick with it long enough for it to be useful.
It's surprising to me how much effort is not being put into whole-org code search. Most projects focus solely on single-repo search. If you need to make breaking changes or find examples and you don't even know where to look, single-repo search isn't so useful.
[0] https://github.com/hound-search/hound

nic0-c0 · Answer

IntelliJ / PyCharm / WebStorm ctrl-shift-F: search in the whole codebase, is what I use

vinyl7 · Answer

This was posted a few weeks ago: https://mitchellh.com/writing/contributing-to-complex-projec...

swframe2 · Answer

I wrote several scripts that temporarily add call tracing to the source code. The scripts take a few days to write but I've used them on many projects. Note there are code parsing libraries that can help you.

For example:

  def foo(args):
      trace_func("foo", args)
      # rest of the function

(NOTE There is call tracing logic built into some languages but it doesn't always work for some complex code bases; try it before you write your own.)

If the code creates html elements, my script adds attributes to the html element to link back the source code location so I can look at the html and figure out where the elements were created. If the html is built using templates, then I add html comments to the template so I can tell where they are used in the final page.

Then I test the app and look at the traces to figure how it works.

At first I trace everything but once I get to know the code I add the tracing to the areas that matter. I don't check in this code.
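As an illustration of temporary call tracing in Python, here is a small sketch. It swaps the source-rewriting scripts described above for a decorator, which is a different mechanism with a similar effect; `trace_calls`, `TRACE_LOG` and `price_with_tax` are hypothetical names made up for the example.

```python
import functools
from typing import Any, Callable, List

TRACE_LOG: List[str] = []  # collected trace lines (hypothetical in-memory sink)


def trace_calls(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record each call to func and its return value in TRACE_LOG."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        TRACE_LOG.append(f"call {func.__qualname__} args={args!r} kwargs={kwargs!r}")
        result = func(*args, **kwargs)
        TRACE_LOG.append(f"return {func.__qualname__} -> {result!r}")
        return result
    return wrapper


@trace_calls
def price_with_tax(amount: float, rate: float = 0.5) -> float:
    return amount * (1 + rate)


price_with_tax(10.0)
print("\n".join(TRACE_LOG))
```

Python also has built-in hooks for this (`sys.settrace` and `sys.setprofile`), which fits the note above about trying the language's own tracing before writing your own.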

ufmace · Answer

There's no one perfect solution for the issue. A few things that I've found helpful across many years of work, codebases, languages, and frameworks:
What if any frameworks and libraries is it using? Try to identify particularly core frameworks that tend to dictate the whole workflow of the application. Many frameworks have standards of file organization and system architecture that can help you get a handle on what goes where. They may not always have been used properly, but it's a start at least. It might even help to set up a small learning project in that framework just to get to know it better. There may also be libraries in use that influence a lot of how the application does whatever it does.
Trace control flows of the application. How does it start? Do any other processes get started in addition to the main application? Learn how to do the workflow you need to modify, or the closest one to it if you're making a new one. Trace how the command to do X first gets into the application (API call? GUI button press? Some kind of messaging system trigger?), and try to follow the code to see what it does and how it does it.
Trace data flows. Where does the application store critical data, and how does that data actually get picked up from there, transformed, and eventually used, to present to the user or get transformed and handed off to some other system or whatever?
Text search of the codebase can be useful. In strongly-typed languages, often IDE tools are better at jumping straight to the code of the actual method being called though. In less typed languages, text search might be better. Or if whoever wrote the thing did a bunch of dynamic trickery, you may need to resort to running the code, in a unit test if it actually exists, or in your test environment, and attaching a debugger or adding a bunch of log statements.
It's always helpful to understand the business logic of what the application is actually trying to do, and the perspective of developers more experienced with it, if any such people are actually available.
Usually you need to do all of the above to actually develop expertise in a new codebase. Sometimes you have to not be afraid to just jump in and try doing stuff, even if it might not be the best way.

havkom · Answer

This skill is what differentiates really good developers from not so good developers (who may wrongly believe they are good):
How good you are at reading code, working out how smaller parts fit into a larger system, and understanding the context and domain is a large part of how good you are.
Not so good developers, when they are not so good at this, often start blaming the system and people who have worked on it.
Not so good developers may however be able to handle smaller systems (in particular ones written in their favorite tech stack), and this experience leads them to erroneously believe they are good.

scrapcode · Answer

This has been a great discussion for me as someone that has been programming as a non-profession for over a decade, but is new to applying it to existing projects that are not mine through open-source.

One tool I haven't been able to find that I feel would be super helpful in the IDE is to show where code is covered in tests, like contexts when using python's `coverage`. Does anything like this exist? The benefits are two-fold: they help show me how the methods are supposed to be used, and also guide me on how and where I should test my fix or feature.

antoineMoPa · Answer

Tools I use:
- grep
- ag - Same as grep, but faster!
- find - when looking for a file by name
- helm (an epic Emacs package which does interactive search)
Used to work at a Windows shop and we used Entrian in Visual Studio. That was pretty good, but closed source and a pain to set up.

craftuser · Answer

Especially if this is long term, this is a great tool:
https://github.com/hound-search/hound#hound
It would be great if someone integrated this with tree-sitter plus something to make the search semantics a bit smarter about usages of X:
https://www.etsy.com/codeascraft/announcing-hound-a-lightnin...
Screenshots:
https://jaxenter.com/hound-go-react-code-search-engine-15008...
Another trick I use for Java: javap all the Enums out of the compiled artifacts; these indicate weird things like "modes" that you can use to start asking questions relevant to the domain. Like "why are there four ways to reprice an invoice" or finding the "types" of fees or w/e in a billing system. (assuming enum classes are used)

erwincoumans · Answer

I use a combination of:
(1) breakpoint debugging, finding the connection between program start and various features
(2) Doxygen to generate a dependency graph
(3) create JSON performance profiles, manually instrumenting functions, and navigate traces using Google Chrome about://tracing or similar tools
(4) trace and look at the data input and output, using a hex editor or over the network using Wireshark
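A sketch of manual instrumentation that produces a JSON profile in the Trace Event Format that Chrome's trace viewer reads. The helper names `span` and `dump_trace` are hypothetical, the instrumented work is a stand-in, and only "complete" events (`"ph": "X"`, timestamps in microseconds) are emitted.

```python
import json
import os
import threading
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

EVENTS: List[Dict[str, object]] = []  # collected trace events


@contextmanager
def span(name: str) -> Iterator[None]:
    """Time the enclosed block and record it as one complete ("X") event."""
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        EVENTS.append({
            "name": name,
            "ph": "X",                         # complete event: start plus duration
            "ts": start * 1_000_000,           # microseconds
            "dur": (end - start) * 1_000_000,  # microseconds
            "pid": os.getpid(),
            "tid": threading.get_ident(),
        })


def dump_trace() -> str:
    """Serialize the collected events as a trace JSON document."""
    return json.dumps({"traceEvents": EVENTS})


# Stand-in workload: nested spans show up as nested bars in the viewer.
with span("load"):
    with span("parse"):
        sum(range(1000))

trace_json = dump_trace()
```

Writing `trace_json` to a file and loading that file in the trace viewer should show `parse` nested inside `load` on a timeline.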

epirogov · Answer

I should add a small point to the story: you probably won't arrive at the right solution after the initial research. Please have an expert check your prepared idea, and repeat on failure:
1. Search with advanced tools, or scripts that you wrote, to find concrete answers in the code.
2. Draw a graph of the knowledge you have and the steps, and understand how this knowledge may help to resolve the issue.
3. Go to a reviewer with the plan.
4. If the expert decides your plan will not work, repeat from step 1.
5. Otherwise, you may implement the fix for the issue.

ravenstine · Answer

I ask other developers questions. Oh, they're busy? Well I don't really care because the sooner I get up to speed the less of a hassle I'll be to everyone in the long run. (EDIT: Yes, I'm being jocular with my use of hyperbole) All the documentation and grepping in the world can't make up for the intimate knowledge of those who've been on a project for a meaningful amount of time. It's surprising how people can point you to the right place in a codebase without searching, grepping, or any of that.

junon · Answer

Clone, ripgrep (rg). Learning how to navigate shitty code has helped me in more ways than one - one of which being I don't have to rely on extensive tooling to understand codebases.

leftbit · Answer

The first thing I do is clean up the code base. I delete unused code, fix class and method visibility according to usage, do the occasional rename if naming is inconsistent, check error handling and logging for inconsistencies, write the occasional unit test... This gives me a broad overview of the code base, improves my chances of finding anything by text search and enables me to better assess the impact of changes.

raxits · Answer

0. Try to find what & why exactly is being built
1. Try to find out which framework, architecture, design patterns used - get hold of that
2. Library dependencies
3. Database structure
4. Pick up your favourite editor (be it vim or emacs or vs code or any) in which you have mastery
5. Search for various entry points like routes, or start activity or main function etc & try step thru code (with possible debug tools open)

karmakaze · Answer

The first thing I always do is 'read' the data model. What are the tables called? What are the relationships and cardinalities? Combining that with the source can give you a head start into being able to extract relevant conceptual information that's (strangely) rarely documented.
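One way to 'read' a data model programmatically, sketched in Python under the assumption that the database is SQLite (other databases expose similar information through their own catalogs). The `describe_schema` helper and the two sample tables are hypothetical.

```python
import sqlite3
from typing import Any, Dict


def describe_schema(conn: sqlite3.Connection) -> Dict[str, Dict[str, Any]]:
    """Map each table name to its columns and outgoing foreign keys."""
    schema: Dict[str, Dict[str, Any]] = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info('{table}')")]
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        foreign_keys = [
            (row[3], row[2], row[4])
            for row in conn.execute(f"PRAGMA foreign_key_list('{table}')")
        ]
        schema[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return schema


# Hypothetical two-table model: each invoice belongs to one customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        total REAL
    );
""")
schema = describe_schema(conn)
for table, info in schema.items():
    print(table, info)
```

Each foreign key is reported as (local column, referenced table, referenced column), which is enough to sketch the relationships before opening any source file.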

superjan · Answer

I can recommend going beyond the source code and check the checkin comment or bugtracker entry of the code you intend to change (once you found it). If the code is strange or arbitrary there is often a reason, and knowing that will improve your understanding.

duped · Answer

You can almost always ask for help from a colleague who has seen the codebase.

rad_gruchalski · Answer

Look at existing tests. Existing tests usually set things up in isolation. Find tests related to what I'm looking for, break them down, go from there.

37ef_ced3 · Answer

Concatenate all the source code files into a single file, with pathnames inserted between files.
Then use Vim to read the concatenation and (regexp) search.
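A minimal Python sketch of that concatenation step. The `===== path =====` header format and the function name are assumptions made for this example, not something prescribed above.

```python
from pathlib import Path
from typing import Iterable


def concatenate_sources(root: Path, suffixes: Iterable[str], out_path: Path) -> int:
    """Write every file under root with a matching suffix into out_path,
    each preceded by a header line with its relative path.
    Returns the number of files written."""
    wanted = tuple(suffixes)
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in wanted:
                continue
            if path.resolve() == out_path.resolve():
                continue  # never include the output file itself
            out.write(f"===== {path.relative_to(root).as_posix()} =====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
            count += 1
    return count
```

With headers in this form, searching for `^=====` in Vim jumps from one file boundary to the next.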

syngrog66 · Answer

find & grep have been invaluable tools for me to begin grokking an unknown codebase, esp in the absence of more tailored tools. if I happen to have an IDE avail which fits a use case and has whizbang search or visualization, I'll use it
also: GraphViz is a great tool and CLI friendly

Irongirl1 · Answer

Codesee.io seems like it was made for this.
