HACKER Q&A
📣 azeirah

How do you modify functions over 1000 lines long?


I recently started working at a startup whose code was developed under very intense business pressure. This, combined with an attitude that code needs to "work" rather than be high quality, has led to a messy codebase. One can argue that at least if the code _does_ work then it's OK, but we're dealing with many errors daily. Luckily we do have Sentry and continuous deployment set up (no CI though), so these errors can be fixed quickly when they are seen.

Needless to say, we don't have any tests.

The code is messy in most places, not all. We're working on improving the codebase, but I still don't feel well equipped to work in functions that have tens of database queries, over 100 variables, nested if-statements, and if-statements with 5 or more conditions all over the place.

I'm trying to learn testing, but under these circumstances it's still kind of difficult.

The positive side is that I do get a lot of time to read, research and learn, and I do get the freedom to try things out if they provide some form of benefit for us.

I consider myself mid-level in terms of experience; this is my second job in programming.


  👤 tmaly Accepted Answer ✓
I would establish some tests for the code first.

Then I would look for chunks of code that could be broken out into their own functions, and extract them.

Next I would write some tests for those smaller functions.

Finally, I would run the tests used against the original on the new function that calls the smaller ones.

Depending on the complexity of the original function, you may have to write more tests.
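
For example, a first characterization test might look something like this. This is only a sketch: pytest plus a hypothetical legacy function process_order, with expected values set to whatever the code currently returns, warts and all.

    # Characterization ("golden master") tests: pin down current behavior,
    # even where it looks wrong, so refactoring can't silently change it.
    import pytest

    from billing import process_order  # hypothetical module and function

    @pytest.mark.parametrize("order, expected", [
        ({"items": [("widget", 2, 9.99)], "coupon": None},
         {"total": 19.98, "status": "ok"}),
        ({"items": [], "coupon": "SAVE10"},
         {"total": 0.0, "status": "empty"}),
    ])
    def test_process_order_characterization(order, expected):
        # Documents what the code does today, not what it should do.
        assert process_order(order) == expected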


👤 verdagon
I recommend, as a first step, identifying the "pure calculation" sections of the function and separating those out into their own functions. If there are any slight adjustments you can make to render a section more pure, do so.

That alone should help simplify the function a little.
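
A sketch of what that can look like, with hypothetical names (the database calls stand in for whatever IO the real function does):

    # Before: a pure calculation buried between database calls.
    def checkout(db, user_id):
        user = db.fetch_user(user_id)
        cart = db.fetch_cart(user_id)
        total = sum(price * qty for price, qty in cart.items)
        if user.is_member and total > 100:
            total *= 0.9
        db.save_total(user_id, total)

    # After: the calculation is pure, so it can be unit tested without a database.
    def cart_total(items, is_member):
        total = sum(price * qty for price, qty in items)
        if is_member and total > 100:
            total *= 0.9
        return total

    def checkout(db, user_id):
        user = db.fetch_user(user_id)
        cart = db.fetch_cart(user_id)
        db.save_total(user_id, cart_total(cart.items, user.is_member))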



👤 fendy3002
IMO you'll need to have tests set up: either you develop unit tests if possible, or you gather manual test case scenarios which you can replicate over and over.

Then start refactoring, usually from the last part of the code (the final 20-30% or so), moving upward. IMO it's easier to take out and refactor the last part, since the variables declared inside it aren't used anywhere else.
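
A rough sketch of why the tail is the easy part (all names hypothetical): the last section only consumes variables computed above it, so it can be moved out by passing those in as parameters.

    # Hypothetical stand-ins for helpers the real codebase would have.
    def render_template(name, **context): ...
    def send_email(to, subject, body): ...

    # The tail of the long function, extracted. `user` and `invoice` were
    # locals computed earlier; nothing after this point used the tail's own
    # locals, so the extraction is mechanical.
    def send_invoice_email(user, invoice):
        subject = f"Invoice {invoice.number}"
        body = render_template("invoice.html", invoice=invoice)
        send_email(user.email, subject, body)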


👤 muzani
There are a lot of good sources. I recommend reading books like Clean Code and Domain-Driven Design to know what good code looks like, then reading Martin Fowler's refactoring books (or even his blog) to morph it into that. There's a lot in these books that I don't agree with, so don't be too dogmatic. Just improve your toolset.

The immediate approach is to figure out what the function is trying to do. Write a test, perhaps a manual one if necessary. Manual meaning: you run the function and observe. What happens? What appears in the log? What shouldn't happen?

Is there anything in there you can write a test for? Pick the low-hanging fruit. I like to tackle validation first, because it's pretty important, easy to test, and gets messy fast.

Extract it into another function. Sometimes you want to extract it into a class. Write a test on this function (again, manual is okay). What's the input? What's the expected output?

Then you have lots of smaller functions, each with its own expected inputs and outputs.
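
For instance, validation pulled into its own function gives you an obvious input and expected output to test against (a sketch, hypothetical names):

    def validate_signup(form):
        # Pure validation: takes a dict, returns a list of error messages.
        errors = []
        if "@" not in form.get("email", ""):
            errors.append("invalid email")
        if len(form.get("password", "")) < 8:
            errors.append("password too short")
        return errors

    def test_validate_signup_rejects_short_password():
        # Input and expected output are explicit.
        assert validate_signup({"email": "a@b.com", "password": "123"}) == [
            "password too short"
        ]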


👤 EddieDante
> How do you modify functions over 1000 lines long?

I don't modify them. I refactor and replace them. I do it carefully, and with hatred in my heart for the pusbags who couldn't be bothered to get it right the first time.

First, I curse the people who let these functions get that big in the first place. My go-to is the PERP (public explosive rectal prolapse), because I think people who write such gargantuan methods deserve to shit themselves to death; for particularly egregious cases I find myself praying that the original authors' children grow up to be everything they hate and fear. Once I've vented my hatred, I document the method as deprecated and soon to be replaced, and commit that, to give every other poor bastard working on the codebase fair warning.

Next, I write a unit test against the original function, or at least try to do so, so that I have a baseline.

After I've written the test, I set about identifying in broad strokes everything the function does. I create stub methods with names representing each "step" in the procedure implemented by the original function, and call them in order from a new function. I then write tests for the new outer function as well, and I make sure that the test for the old function also runs against the new function.

Once I have my stubs in place, I reimplement the big function as a bunch of little functions called in sequence. If a function exceeds thirty lines, I break it down into smaller functions. Once I am confident that the new functions work together and provide the same results as the old function, I set about replacing every reference to the old function with calls to the new one.

Once the replacement process is done, I spend some time on regression testing and documentation. I don't remove the old function until I'm sure that it's no longer needed.
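
The shape this takes, as a sketch (step names are hypothetical and the old function's body is elided):

    import pytest

    def process_order_legacy(order_id):
        ...  # the original 1000-line function, kept around for now

    # Stubs named after each "step" the old function performs.
    def load_inputs(order_id): ...
    def apply_business_rules(data): ...
    def persist_results(result): ...

    def process_order_v2(order_id):
        # New outer function: the old body becomes calls in sequence.
        data = load_inputs(order_id)
        result = apply_business_rules(data)
        return persist_results(result)

    # The baseline test compares the two implementations directly, so the
    # replacement can be verified before the old function is removed.
    @pytest.mark.parametrize("order_id", [1, 42, 9999])
    def test_implementations_agree(order_id):
        assert process_order_v2(order_id) == process_order_legacy(order_id)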

PS: You might benefit from reading The Pragmatic Programmer. The authors dropped a new edition in 2019.


👤 mikewarot
First, print it all out, in one big long piece, use tape if you have to... so it's all on a single large piece of paper... mark it up with pencil, pen, etc.

You have to be able to physically navigate something like that if you are to have any hope of sorting it all out.

Next... some things will be poorly named, or just wrong. Fix those.

Remove all dead code, and comments. They're in your Git repo if you need them back (plus the original listing you marked up!)

Find and mark up ALL side effects. Nothing outside the module should change if the code were run with the same arguments twice. If that's not the case... it's an IO procedure in disguise.
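
A tiny sketch of that "same arguments twice" check, with a hypothetical function: if a second identical call gives a different answer, something outside the function is being mutated.

    _seen = []

    def is_new(x):
        _seen.append(x)            # hidden side effect: module-level state
        return x not in _seen[:-1]

    assert is_new("a") is True
    assert is_new("a") is False    # same argument, different result: impure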


👤 alexdowad
Aside from the (good) comments made by others:

- Before you can reliably modify any program, you absolutely must get to understand how and why it works. This is always challenging for a sizable program, but when its complexity has not been skillfully managed and broken up into digestible units, it's much worse. However, you can overcome that obstacle with time and effort.

If the people who wrote the code are accessible and willing to answer questions, that is "gold" and you should not waste it. Even a 15-minute conversation can make an enormous difference. Even if they have moved on to another company or department, you might be able to track them down and see if they are willing to talk.

If the source control history is available, looking up when a difficult piece of code was written and what else was changed at the same time can provide precious clues. If the authors left decent commit log messages, that is even better, but from your description of the codebase, I guess you can't expect that.

Don't get stuck for a long time trying to understand one small bit of code (unless it is the critical part for your immediate task at hand). To build understanding, it is more effective to "dip your toes into" a section, read through and try to understand what you can without getting stuck for too long, then move on to a different section. After a while you will start making connections between the parts of the program. Each part that you grasp will help you to understand other parts.

Stepping through the code in a debugger, watching where control flow goes and how data flows from one part to another can be very helpful. Another possibility is to add debug print statements in key places and watch the messages printed as you interact with the program.
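
For the print-statement route, even something this simple at a few key places tells you a lot (names are hypothetical):

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("spelunking")

    def reconcile(batch):
        log.debug("reconcile: %d records, first=%r", len(batch), batch[:1])
        ...  # the code being studied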

If you are able to build and run the program, you can also do "experiments" to verify your understanding. For example, if you are reading code and think to yourself: "I think feature XYZ should still work even if these lines were removed", then rather than staring at the code for an hour trying to determine whether your hunch is right, you might make faster progress by deleting the lines, then running the program and seeing what happens. Another type of experiment is where you deliberately change something just to see what compiler errors will be produced.

- Try to identify parts which "can't affect anything else" (i.e. whose effects are very local), and clean those up first.

- Every part which you are able to clean up will make what remains easier to understand. Keep expanding the borders of what you understand and shrinking the borders of what you can't. Understanding is the key thing; the actual refactoring is relatively easy after that.

- If a certain function or section of code just seems really crazy or nonsensical, maybe that's because it is. Sometimes, rather than trying to delve into its craziness, it is better to figure out what that part was intended to do, delete it, then rewrite it to do what the original author had intended.

When you do this, you may find that a bunch of strange bugs disappear.


👤 joshxyz
Reminds me of that saying: it's easier to rewrite code than to grok existing code.

👤 shoo
You may get something out of Michael Feathers' book Working Effectively with Legacy Code [1]. There's a more recent interview with the author that might be interesting [2]. Feathers' book is well suited to provide guidance in situations where there is a large amount of complicated, mission-critical code without any automated tests, which must be changed to fix defects or add necessary functionality, but no one has confidence in changing it, because the risk of introducing new defects is perceived to be too high. The book is roughly an introduction to test-driven development and to refactoring code to get it under test in a controlled way, for the situation where it isn't a greenfield software project but there is a pile of existing code that was developed without any tests.

But perhaps that's not quite the situation your employer is in, if people have confidence to change the code rapidly and iterate quickly. Perhaps the cost to your employer's business of shipping a defect is not very high, so it is somewhat reasonable to cut costs on QA and focus on rapid feedback loops, or piling on features, or so on.

Some forms of automated tests are very expensive to maintain, especially if they are not well designed and are tightly coupled to implementation details that change frequently. That produces a new low-value job: someone painstakingly updating the expected values hardcoded into the tests after each code change. Other kinds of automated tests can have a very good ROI, e.g. a basic "smoke" or "integration" test that automatically checks whether components can integrate and do something successfully along a simple happy-path scenario.

Any complex algorithmic code (business rules) or security-critical code (e.g. logic controlling authorization) should, where practical, be structured as functions (ideally pure functions) without any dependency on the environment or external integrations: the kind of code that is easy and cheap to unit test. Then write decent unit tests for that easy-to-test complicated or critical logic, where the tests assert properties that are actually functional requirements, not implementation details. Ensure the unit tests are regularly run in CI, and that development halts if any of them fail, until someone has fixed it.
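
As a sketch of the cheap-but-valuable end of that spectrum, a happy-path smoke test can be a few lines (assuming, hypothetically, a Flask-style application factory):

    from myapp import create_app  # hypothetical application factory

    def test_happy_path_smoke():
        # Checks only that the pieces wire together and respond sanely,
        # not any implementation detail.
        client = create_app().test_client()
        response = client.post("/orders", json={"item": "widget", "qty": 1})
        assert response.status_code == 200
        assert response.get_json()["status"] == "ok"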

Another interesting viewpoint about how to structure code and test code is Gary Bernhardt's "boundaries" talk [3] -- this gives some guidance on how to factor codebases into components with complex logic (that can be unit tested) and "straight-line" code that integrates stuff (which can be exercised with automated integration or smoke tests).

Another idea, orthogonal to testing, is designing the system to be more robust against failure, when failures inevitably occur. Sounds like your team is already using a few of these techniques (e.g. enough observability of issues in production to rapidly detect and fix some issues when they occur and cause impact). Michael Nygard's book Release It! [4] discusses some ways to do this.

One thing to try with books: ask your manager whether the company is willing to pay for useful technical books and make them available for staff to borrow and read. It's probably quite easy to make a business case for spending a few bucks on a book about automated testing for staff to read and discuss, provided management understands the potential business value of reducing the number of defects that ship to production.

[1] https://www.r7krecon.com/legacy-code

[2] https://www.infoq.com/podcasts/working-effectively-legacy-co...

[3] https://www.destroyallsoftware.com/talks/boundaries

[4] https://pragprog.com/titles/mnee2/release-it-second-edition/