Obviously the write-only paradigm is useful when reconciling changes with others and when reverting recent, broken changes or recovering accidentally-deleted work. But to me, it seems like there's diminishing value the further back you go. I can't imagine getting much value from trawling through two-year-old commits, much less twenty-year-old commits.
So I ask: at your company and in your experience, do you get value from source-control archaeology? And if so, what does that look like in your case?
IMHO revision history is just as valuable to a company as the code itself.
One of my favorite tricks is to make a file out of all the changes in the history:
git log -p > bigass
and then grep through the file (edit: which I like to do in Emacs, hence the file) to see every appearance of some construct. There's a lot of knowledge in there. It's particularly useful when you remember that you did something, but forget how you did it.
In fact, I use git proactively this way, to store things in the version history that I might want to remember later. For example, if I write exploratory code to test out a feature or throwaway code to do some analysis (anything I might want to use again, but don't want to commit to the codebase), I'll add it as a commit and then immediately revert the commit (i.e. make a new commit that deletes what I just added). The codebase remains unchanged, but what I just did is now there forever for future me to recover.
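A minimal sketch of the commit-then-revert trick described above, run in a throwaway repository so it can be tried anywhere. The file names (app.py, scratch_analysis.py, history.patch) and the demo identity are placeholders, not anything from the original setup.

```shell
# Throwaway repository; in practice you would do this in your own repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity so the demo commits work without global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo 'print("real code")' > app.py
git add app.py
git commit -q -m "Initial commit"

# 1. Park the exploratory script in history...
echo 'print("one-off latency analysis")' > scratch_analysis.py
git add scratch_analysis.py
git commit -q -m "Scratch: one-off latency analysis (will be reverted)"

# 2. ...and immediately revert it, so the codebase is unchanged.
git revert --no-edit HEAD > /dev/null

# 3. Later: dump every patch into one file and search it.
git log -p > history.patch
grep -n 'latency analysis' history.patch
```

The script is gone from the working tree, but both the "+" lines that added it and the "-" lines that removed it show up in the dump.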
Such an approach only works if your system is small, but I like to work on small systems and prevent them from becoming large systems. There's a beneficial feedback loop here: as you get comfortable working with history, it gives you more confidence to delete things, helping to keep the system small.
I've also found this technique useful for solving the chronic problem with documentation: that it inevitably fails to get updated. When I write something about the code, I commit it and, as above, immediately revert the commit. Now it's permanently glued to the state of the code when I wrote it. When I read it in the future, I can do so alongside a diff of the code from then to now. This makes it easy to see what has changed in the meantime, in which case I can update the document and commit/revert it again.
Two weeks ago I found something in a critical library at work (that ~every single C++ binary we run depends on: our main implementation of our custom threads' executor API) that made no sense. I couldn't understand why a variable was being rounded before being passed down to a lower layer, in a way that introduced an average 0.5 Ms of latency to many operations (I estimate that at peak, just one of the binaries that I maintain, a caching system, runs this code at least 200 million times per second), for no gain that I could see. There even was a comment attempting to explain why the rounding logic was added, but it was factually incorrect. As far as I could tell, I could just delete the rounding logic and everything would just work. I was baffled.
... until I looked at the code history! It explained it immediately (well, in like 5 to 10 minutes): the code from 2013, when the rounding was introduced, was calling into some lower level API that received parameters in a way that had limitations that ... Well, let's just say made it very clear to me why the rounding had been added.
Someone cleaned up the lower level library in 2016 or so, but the rounding remained in the upper layer.
This is just one example of many. I do this all the time.
Just two days ago, I was running scripts to extract lines-of-code by author and reviewer over different directories to get a sense of the size of the contributions of different team members, as part of the employee performance evaluation process (obviously, LOC is just one of many many many signals, and has to be taken in context). "Interesting, this person has already contributed 4k LOC to this particular directory, I didn't realize that!" Or "Source code files in the directories of the components that this person is a Tech Lead for had contributions from 131 engineers in 2019; of these, at least 56 engineers contributed more than 100 loc."
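For anyone on git, a rough sketch of the per-author part of such a script. It assumes author email is a good enough key and that paths contain no spaces; reviewer data lives in the review tool, not in git, so it is not covered. The repository, directory and authors below are made up for the demo.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Three demo commits by two hypothetical authors.
mkdir -p lib
printf 'a\nb\nc\n' > lib/one.txt
git add lib
git commit -q -m "Add one" --author="Alice <alice@example.com>"
printf 'd\ne\n' > lib/two.txt
git add lib
git commit -q -m "Add two" --author="Bob <bob@example.com>"
printf 'f\n' >> lib/one.txt
git commit -q -am "Extend one" --author="Alice <alice@example.com>"

# Lines added per author under lib/.
# --numstat prints "added<TAB>deleted<TAB>path" for each file in a commit;
# the custom format line tags each commit with its author email.
loc_by_author=$(git log --numstat --format='author %ae' -- lib |
  awk '$1 == "author" { who = $2; next }
       NF == 3 && $1 != "-" { added[who] += $1 }
       END { for (w in added) print w, added[w] }' |
  sort)
echo "$loc_by_author"
```

Adding --since and --until to the git log call restricts the count to a review period.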
I guess I'll also call out that when I find a reproducible bug that I can't explain, being able to binary search in the code history until I find the first change that exhibits the bug can be a life saver. I don't do this very often, but I estimate that, when I've done it, it has saved me days, possibly even weeks, of work.
People have a tendency to comment out unused code "in case they still need it". Or not delete unused stuff, because who knows what.
I have the feeling that I'm much more inclined to just delete a bunch of code lines that "I might still need in some situation" if I know there's version control. Because even if it's unlikely, "I can get it back if I want to" is a good feeling.
I think this leads to less cluttered code overall.
Also something that came to mind: When the shellshock vuln was discovered in bash no one really knew when and how it got introduced, because it was so old (literally decades) and there was no version control at that time. I don't think anyone suspects any malpractice with shellshock, but think about it: If you find a really strange bug that looks like a backdoor, and it's 10 or 20 years old. Wouldn't you want to know who committed that code?
The scenario isn't "I'm gonna go browse the changes that were made in March of 1994", instead it's trying to solve a specific mystery.
You see some code that doesn't make much sense, so you look at git blame to find the commit where it was written. Look at the full change, read the commit message, and now you've got some more context. Often this is enough to understand, but if not, you can check out the code at that time and read the implementation of related systems. Soon things are starting to make sense! Certainly they make much more sense than they did when you started.
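Roughly, that workflow looks like this on the command line. This is a sketch in a throwaway repo; settings.cfg and the commit messages are invented for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

printf 'timeout = 30\nretries = 3\n' > settings.cfg
git add settings.cfg
git commit -q -m "Add default settings"
printf 'timeout = 30\nretries = 5\n' > settings.cfg
git commit -q -am "Raise retries to 5 for flaky upstream"

# Which commit last touched line 2? --porcelain puts the full hash first.
commit=$(git blame -L 2,2 --porcelain settings.cfg | head -n 1 | cut -d' ' -f1)

# Read the whole change and its message, not just the one line.
git show --stat "$commit"
subject=$(git log -1 --format=%s "$commit")

# Look at a file as it was at that commit without touching the working tree.
git show "$commit:settings.cfg"
```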
On my current job, I very rarely go back to see when something was changed, because the business requirements are very straightforward. A change needs to happen, and the implications are clear. Also, no one really documents discussions systematically, commit messages are rather short etc. Not much value can be extracted.
On my last job, a system with over 15 years of history, my team was often puzzled by the existing codebase and the seemingly weird things it did. "Who wants this?", "Is there a use case for this?" and "Do any of our customers actually expect this functionality if we remove this?" were frequent questions.
Then we'd check the commit history and get the 3-4 tickets involved in the functionality's history. Long discussions and back and forth with the client, explanations why the functionality was being added etc.
This archaeology was so frequently fruitful that all the team engaged in it.
"git blame" (or the p4 equivalent) is my usual archaeologic tool in this context, but "git bisect" has been very helpful in others. For the first, it should be easy to look at your current codebase in SVN and see how far back the history goes in any particular area. I've found that bisection is most useful for relatively recent history, because I usually have wanted to build or run the software to test for a bug or something - beyond some point in history that becomes impractical.
Moving from SVN to git shouldn't require losing history though...
I was reading the business logic that triggered the bug and it made no sense.
I activated the blame view of the code, and I realized most of the code had been written in ~1998 but a couple lines had been updated in ~2007, by someone who probably never even met the original author.
Realizing that made it a lot easier to understand the context of the bug and fixing it.
There is a lot of value in knowing that two lines of code next to each other have been written decades apart by people that did not coordinate with each other. Never erase that history voluntarily.
I can't remember a specific reason why off the top of my head, but it was usually something to do with looking at the context around why some piece of code existed. The companies I've worked for also require commit messages to contain bug tracking IDs, which can provide further context.
There's also really not much of a reason to migrate from svn to git if svn is still working for your organization. Whenever the topic has come up previously in my workplaces it ended with "nah, svn is still working fine for us." OTOH I was involved in a migration from CVS to SVN because of limitations/problems with CVS.
lol, I don't think that's the reason. At the only place I worked that used SVN the real reason was that the old guys didn't want to learn something new.
For example, we once had a customer-reported issue in an older version of our product (customers were complaining that an automation script for our product started lasting minutes where it previously took a few seconds). After some investigation, it turned out someone had deleted some code which excepted the scenario in the customer script from a timeout.
The commit removing the exception had a bug attached - the QA team had been complaining about the expected timeout not applying in some cases, and someone found the exception in the code and deleted it. They had no idea why the exception was there in the first place (according to the bug chat logs) and didn't bother to look back in history to see.
Funnily enough, looking back even further in history, we found that the exception had been introduced a few years prior, after a customer had complained that... some automation scripts were taking too long... the same automation scripts that we received in the new complaint, give or take a few years worth of additions.
Some use cases are:
a) "Blame" tool, which produces a file version annotated line-by-line with who and when last changed that line, along with the commit message. "Who the f* did that s%$@#? Oh, it was me again..."
b) Searching the history of a file by keyword. Especially useful when something was deleted, and as such no longer exists in the source code, but you can find it by searching for the commit message. (knowing you can later do this gives you more confidence to actually delete things, instead of commenting them out or leaving them there in case they become necessary again)
c) "All I know about that feature is that Jenny implemented it before she left the company." Filter for Jenny's user tag.
d) "All I can find about that change is this old email saying it had just been done." Look at logs around email date.
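With git, use cases (b) to (d) map onto a few git log filters. A sketch, using a throwaway repo with two backdated commits; the authors, dates and file name are invented.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo 'legacy export' > export.txt
git add export.txt
GIT_AUTHOR_DATE='2015-03-02 12:00:00 +0000' GIT_COMMITTER_DATE='2015-03-02 12:00:00 +0000' \
  git commit -q -m "Add legacy CSV export" --author="Jenny <jenny@example.com>"
git rm -q export.txt
GIT_AUTHOR_DATE='2018-06-10 12:00:00 +0000' GIT_COMMITTER_DATE='2018-06-10 12:00:00 +0000' \
  git commit -q -m "Remove legacy CSV export" --author="Sam <sam@example.com>"

# (b) keyword search over commit messages, case-insensitive
by_keyword=$(git log --format=%s -i --grep='csv export')
# (c) everything one person committed
by_author=$(git log --format=%s --author=Jenny)
# (d) commits around a known date
by_date=$(git log --format=%s --since=2015-03-01 --until=2015-03-08)

# (b, continued) recover a deleted file from the parent of the commit that removed it
removed_in=$(git log -1 --format=%H --diff-filter=D -- export.txt)
recovered=$(git show "$removed_in^:export.txt")
echo "$recovered"
```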
I wish that when the team had migrated from SVN to git, they had used a tool that would have preserved the history. It's very easy to do! I don't know why they didn't. They did it right before I joined the company so I never had an opportunity to show them how.
A lot of developers do not write good statements. They don't even link to a ticket. But you get to know those developers real quick when you're doing spelunking using annotate. And developers who don't write good commits probably didn't leave any other documentation behind of use.
I've used this to illuminate "technical debt" from a different perspective. If you take a critical code path, find the important commits for critical logic, and then just show the "context" you're left with, you'll often be able to say "this is why your quality sucks" in a real concrete way.
Managers love proof, and showing them what little context you have for critical areas can be a very different way of looking at the quality of their systems. Otherwise, I've often seen a LOT of overconfidence largely because "we have automation in place".
It's called software archeology. It's not important if you keep exactly the same people working on the same project and they have perfect memory. But if you, say, move people between different teams, or lose people, or hire people, it's a gold mine.
(Granted, code review systems and ticket systems change over time)
Git blame on gitlab is also a good way of getting context of why something is there to begin with.
Sometimes those changes are over a decade old (of course such old changes make it more unlikely that they are still buggy, but new changes may interact with those old changes in unexpected ways).
So yes, the older a code base, the more important a complete change history becomes.
A great example is data migration. Infrastructure changes over time, even if only gradually. Databases get upgraded and moved around. Recently we realized that some data we migrated nearly a decade ago had significant inconsistencies. We didn’t have full revision history, but what we did have was enough to piece together the puzzle over a period of several weeks. If we had full revision history—which would’ve gone back about two decades—the job would’ve been much easier.
If you fix it without context, you may not actually fix anything, and actually create a broken state. This may be a new bug or a regression.
If there's more context, you're less likely to fall into that trap.
Of course, this is all moot if there's decent documentation, but I've never been employed in a place that does. Everywhere requires reverse engineering / archeological expeditions to understand the mistakes of the past, before accepting them as necessary evils, or fixing them without breaking the side effects of the mistakes.
By keeping commits larger-grained (especially if I'm deleting a functional component), it supports deleting with abandon and following the "you aren't going to need it" (YAGNI) principle, rather than having large commented-out sections (or worse, large sections of deadwood in your tree). It also allows you to restore it later if you need it again by only reverting one commit.
Reason 2: finding out WTH went wrong.
By having a master/stable branch and a development branch, if anything goes wrong, I can always diff between the branches to see what/how things broke. Sometimes it's a change to a dependency. Sometimes (ok, most of the time) it's a change I made.
This said, I think it's useful to me because I know what's in the history already. I think looking through commits from someone else with a tree that I'm not familiar with is going to be of very limited use, especially because people don't generally provide the critical answer of "why" a change was done in the commit log.
Related #protip: always try to describe why you're making a change in the commit log.
The how of the change is already there: it's the diff. Why you're either making the change or choosing a specific method over another can be invaluable to the Engineers of Tomorrow, and prevent them from a regression due to context loss/tribal knowledge loss.
I build custom automation equipment, which involves individual 100-400 hour projects. They're developed in a continuous scratch-to-complete flurry, with a few days of revision after customer review and installation, a year's warranty that typically involves 1-5 on-site days, and an annual "we never read the manual please remind us how to calibrate it again" for the next decade. Very little maintenance coding, lots of fresh feature development.
I disagree that there's diminishing value to older commits. You're more likely to forget what you did the farther back you go!
I'd estimate that I use revision history maybe 0-2 times in a typical project. But that's an easy way to recover a couple days of work that would otherwise need to be rewritten from memory, or worse, reengineered from scratch! You can write a lot of commit messages in 16 hours, so one incident where you can recover two days of work makes two months of using version control without ever referencing the history worth it. Plus, it's a nice security blanket for me, I don't worry about commenting around old code or making changes to a reference implementation I'm modifying because I know it will be in version control.
I do think it's exceedingly unlikely that you'll suddenly decide to revert to the state of your codebase from 20 years ago. If you transitioned to Git and kept the SVN repository around for the rare occasion when you need to reference it, at least in my projects, you'd be able to do so without much trouble.
You can convert a repo from SVN to Git with history intact!
There's a tool called cvs2svn that I have used to upgrade really old CVS projects to git (it can do git too), and there is also an svn2git. And, I believe there is git-svn that provides a git interface to an svn repo.
It often contains valuable clues.
Less often you will want to know "how has this code changed over time?" or "was the code like this originally, or did it used to look different at some point in the past?"
Commit messages often say why something was changed. Well, good ones do.
> At my current company they place a huge value on that history, so much so that they haven't transitioned from SVN to git solely because of the logistical challenge of migrating 30 years of commits.
I assume they've actually tried to do it? I ask because there's a bunch of tooling and at least one reasonably well understood process for achieving this and preserving history so it's pretty low investment to try it out and see if it works.
Here's Atlassian's version, for example:
https://www.atlassian.com/git/tutorials/migrating-overview
(I will grant you that figuring out how to navigate to the next page of the current tutorial at the bottom of the page is unnecessarily complex.)
I suspect with 30 years of history it's going to take a very long time to do the conversion (days to weeks), but you can set it off and leave it running. Once you have your initial migrate done you can set up syncing to git, and then you need to pick a time when everyone will stop committing to svn, allow a sync and verification window of a few days, and then everyone starts using git.
It gets more complex with multiple projects ongoing, and scheduling around releases, but making this happen is more a matter of will than battling complexity.
I strongly advise against abandoning revision history just because it is easier to just start fresh from a single git commit of the current state of the code. Especially so for code that has been in use more than a couple of years, where the developers may have forgotten the purpose or who did what.
Surely you can convert the svn repository to git with history intact? We did that when we migrated from cvs to mercurial. If it is too complicated to do directly from svn to git, maybe it is easier to convert via mercurial, i.e. first from svn to hg, then from hg to git?
The decision of migrating the repository or missing the commits is a false dichotomy. One can deal with two repositories without much of a problem, it's only a little slow down at the rare event you have to look at it.
Anyway, that applies only if you do have a reason to migrate.
2019 fix, of a 2011 breakage:
http://www.kylheku.com/cgit/txr/commit/?id=3a91828748385d8d6...
2020 removal of 2009 misfeature:
http://www.kylheku.com/cgit/txr/commit/?id=24bd936a9fa671599...
The TXR project only goes back to 2009.
We can fix these kinds of things without reference to the past, but the process would feel uninformed and impaired.
Not everything is in the code; there are sometimes questions of requirements, which are not always properly captured in documentation.
We need all the historic questions to be able to figure out the whole situation: what happened to the requirements as well as the code, and how it all relates.
Changes made more than 10 years ago help to fix bugs still present today even if the codebase has changed a lot (they have commit messages and links to old tickets with more discussion; sometimes, just the name of the committer tells a lot about what to expect from a change).
I spent a lot of effort when we started migrating from SVN to git to not lose this, knowing the pain of not having the history go far enough (some of our projects were already migrated from CVS to SVN a long time ago, and histories were lost then). The effort has been more human than technical, BTW, since not everyone was aware of the value of the history: usually bugs in the oldest parts of the codebase get through only a handful of people who have been there for a long time, and other people tend to take for granted that we understand why something is the way it is.
In all the steps we preserved the commit history, except for the final git->git. However also when we moved from SVN to git we kept the old svn server running as a historical archive for several years, as we didn’t carry across all projects (some were already EOL’d years ago).
During that time I looked at it maybe twice, and ultimately we decommissioned it.
Likewise with the new/old git repos, we still have the old git repo if we need the history.
One final thought: git blame was nice, until someone reformatted the entire codebase and committed it back in (we’ve since adopted better git workflow and code review practices!)
Once I did a git blame on a file, found that the offending code had been committed nine years previous, and was able to figure out why the code was the way it was by looking at that nine year old commit and all the other code that had changed with that commit.
The nine year old context was super useful.
I realize the value of this history is smaller for newcomers to the team.
(And Subversion and its bigger, expensive brother Perforce still make sense for game development - when you don't really want to go wild with branches or remote work, and when you need multi-terabyte-sized repositories and multi-gigabyte single commits.)
Of course, there are other options too, like migrating and then using replace, migrating and then rebasing, etc. I just want to point out that even the lowest effort option is valuable enough compared to throwing away history.
Granted, the context didn't really change what the fix needed to be, but it did provide a useful moment of reflection on the ways in which software can break through subtle changes over time that stack up, and it helped to know that the section of code that broke was indeed originally intended to work the way it did (and not that it was a bug from the very beginning).
1) We found a nasty heisenbug that crashed with a useless unrelated stack trace, but only if you didn't have a debugger attached, and only if you held the Windows 8 "charm bar" open for more than 10 seconds, and only on the main menu screen. After wasting a couple weeks trying to root cause it with logic, I eventually resorted to brute force bisecting perforce history by hand - and then the changes within the changelist to blame, as it was a large one. This let me figure out it was a bug in a seemingly completely unrelated, closed source system API, that we were calling to check internet connectivity. I had to write a standalone repro case to prove to myself it was the cause, it seemed so nonsensical. I wrote a workaround. This bug was only a few months old though, because QA was able to catch it early enough. The bug likely would've eventually gone unfixed without perforce history.
2) I went to upgrade a 3rd party dependency that we checked in, that hadn't been upgraded in years - maybe even a decade - for bugfixes and such. Except we'd made changes to said 3rd party dependency, so I needed to separate out and understand our changes to the baseline SDK so I could decide if I should re-apply them to the updated SDK (in some cases yes! I was able to drop others.) We had a web interface to an archived SVN repository containing the commits before our years-old Perforce transition - and before my employment there - which I used to help me grok it all. I might have reached as far back as a decade in this case - very low frequency of commits to that part of the code, however, so "a decade" might have meant "the past 10 or 20 commits", if that. I had to reach out to IT to even get credentials to see said history. Helped turn a nerve-wracking upgrade into a tame one.
3) We decided to port an archived, years-old project to a new platform. Just seeing the last change made to sanity check if the weird logic I'm seeing might be a "new" bug or not means looking at years old history. This has actually happened to me a couple of times.
After transitioning to freelancing and being the sole user of it, I can finally use it for its true purpose, namely refreshing my memory on some techniques. Sometimes I copy/pasta code from older revisions of a different project because that's the code I need for the current project (while the current code of that project changed due to client requirements changes). Also sometimes it's used by clients to see how the status of the project evolved over time, so it also serves as a metric.
We put Jira ticket IDs in our commits and sometimes that’s useful. But the value tends to be in the content of the tickets as much as the commits.
If we decided to squash everything more than a year or two old into a single commit I doubt it would affect us very much in practice.
Thank you OP for demonstrating the effective use of Cunningham's law: "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer."
Sometimes there are clues sometimes not, but you can often see the line change in the context of changes to other files and that can help.
For this reason I tend to make quite verbose commits with the context of why I'm making the change. A comment in the code would go out of date, and pollute readability, but a well written commit can be very useful.
On personal projects I don't use it that much.
One example I don't think I've seen mentioned in this thread is that sometimes a change touches two widely-separated parts of the code. The commit message may be your only opportunity to comment both parts at the same time - to tie them together.
Also must confess that I've never seen much profit for myself from commit messages apart from one-liners like “broken”, “savepoint” and “fixes to bar, uploaded foo”. If trunk has a problem, you can just blame and get an exact revision. If you search through a history, use a gui tool / ide that can fetch it quickly and compare to head, then bisect manually. I don’t make hundred-pagedown commits, so that’s easy enough.
For a future employer: that doesn’t mean I’m against or unable to make branches and write good commit messages. All above are just obvious shortcuts that my own “garage” projects tolerate with no downsides. Personally, I don’t get why some guys freely decide to break project rules when at work - and it was frustrating when they did it to me.
Sometimes code only makes sense if you can see its evolution.
Also knowing who wrote it you can ask those people sometimes about it.
I did this for a 15+ year old very large code base. I tried various recommended techniques but everything would fail at some point or another (usually after many, many hours) and I'd have to start over.
Finally I wrote a very simple program with logging where the logging also records the current state and on any failure I could start from any point in the log.
The idea was simple.
1. Check out SVN version N
2. Parse the list of files changed between commits N - 1 and N
3. Copy those files to the Git folder
4. Parse the commit log for version N to get the commit date, committer name, and commit message
5. Commit in Git using "[
6. Repeat with N + 1.
If it fails at any time then simply reset both SVN and Git to the last successful commit and restart. I also did a binary compare of the entire directory tree every 100 commits to ensure the copies were identical. The process took about two weeks running all day and night (since one commit at a time is very slow) but it was very robust and left a perfect version history. To deal with the fact that the SVN repo was still live, I believe I mirrored (or something like that) the repo and would sync between my local mirror and the live repo every couple of days. When my program caught up with the live repo we just stopped commits for a few hours while I wrapped everything up and then archived the SVN repo.
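The restart-from-checkpoint idea in those steps could be sketched like this. It is not the original program: the SVN side is replaced by stand-in functions (export_revision, revision_author, revision_date, revision_message are all hypothetical), and it copies a full snapshot per revision rather than only the changed files. A real version would call svn where the comments indicate.

```shell
work=$(mktemp -d)
cd "$work"
git init -q git-repo
export GIT_COMMITTER_NAME=migrator GIT_COMMITTER_EMAIL=migrator@example.com
state="$work/last_ok"     # checkpoint: last revision successfully committed to Git
echo 0 > "$state"

# Stand-ins for the SVN side (hypothetical). A real version would export
# revision $1 from the SVN repository and read its log entry.
export_revision() {       # $1 = revision, $2 = destination directory
  echo "contents at r$1" > "$2/file.txt"
}
revision_author()  { echo "dev$1 <dev$1@example.com>"; }
revision_date()    { echo "2010-01-0$1 00:00:00 +0000"; }
revision_message() { echo "svn r$1"; }

migrate_up_to() {         # $1 = newest revision to import
  n=$(( $(cat "$state") + 1 ))
  while [ "$n" -le "$1" ]; do
    # Start from a clean slate so a half-finished earlier attempt cannot leak in.
    find git-repo -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} +
    export_revision "$n" git-repo || return 1
    git -C git-repo add -A || return 1
    GIT_COMMITTER_DATE=$(revision_date "$n") \
      git -C git-repo commit -q --allow-empty \
        --author="$(revision_author "$n")" --date="$(revision_date "$n")" \
        -m "$(revision_message "$n")" || return 1
    echo "$n" > "$state"  # record progress only after the commit succeeded
    n=$((n + 1))
  done
}

migrate_up_to 3           # first run
migrate_up_to 5           # a later run resumes at r4 instead of starting over
```

Because the checkpoint is written only after a successful commit, a crash at any point just means rerunning migrate_up_to. The periodic binary compare of the two trees mentioned above is left out of the sketch.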
As for commits that are over 2 years old, they still serve a purpose. For a legacy app that I worked on, I had `git blame` run on every line (vim and vscode both have support), and I was able to see who worked on a block of code last. Sometimes, those developers are still there and available to ask questions, which has helped me greatly.
Which applies to any project that a number of people are working on. Especially when there is a bug, git blame is a lifesaver. Which potentially has a lot to do with me being annoyingly pedantic about commit messages and branches. I did however have the "pleasure" of working with a guy whose branches were commonly called "bugfix102015" and whose commit messages were along the lines of "fix some bug". In such cases there is not much you can do when shit hits the fan.
For my very personal projects - hardly. Much like you, if something has been done 2 years ago, chances are it's working fine as it is, or you are not using it at all. So for personal projects, digging years back is something I don't ever recall doing.
if you have never experienced the raw power of "git bisect" when trying to hunt down a bug, you're missing out.
using git bisect can literally save your life in terms of stress. I think it's one of THE most important tools in git that developers can learn. it shows exactly why we should commit small and commit often.
This helps a ton when refactoring code in a project with a lot of history. I know that this bit was done this way for a reason, but that reason could be anything.
I work with different code bases, some have 20+ years of history (migrated from RCS to CVS to git).
There's no week where I don't go back to look at some kind of history, usually to find out why or when something was done. Often the issue keys / ticket numbers referenced in the commit messages help me when the commit message itself is too opaque to understand.
I also like to get a sense of how often a file changes. This gives me a sense of whether the code is likely to be fragile and/or touches often-changing requirements.
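One way to get that change-frequency number out of git, as a sketch. The files hot.txt and cold.txt are made up; the one-liner counts how many commits touched each path.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

for i in 1 2 3; do
  echo "$i" >> hot.txt
  git add hot.txt
  git commit -q -m "touch hot $i"
done
echo x > cold.txt
git add cold.txt
git commit -q -m "add cold"

# Commits per path, most frequently changed first.
# The empty format suppresses commit headers so only file names remain.
churn=$(git log --pretty=format: --name-only | grep -v '^$' | sort | uniq -c | sort -rn)
echo "$churn"
```

Appending "-- some/dir" to the git log call limits the count to one directory, and --since limits it to recent history.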
There is a diminishing return for very old commits, partly because our team was much smaller back then, and communicated less in writing, partly because too much of the context has changed. But two years doesn't qualify as "very old" here, in our case the diminishing returns start more at 5 to 8 years.
That said, if I were working with SVN again, I'd likely look at the history much less, because it's that much slower and more painful.
I switched companies (FANG) to my family software company and the company had been using Microsoft Visual SourceSafe. The company was only casually using it, as one computer was used to compile customer executables (and fix compile-time linker errors) and was often times never checked into VSS. Needless to say, no one on the SWE side knew if any code actually worked or when anyone did anything.
Part of this was lack of management, part of this was inadequate tools. After I joined and learned about the horrors of VSS, I switched our company immediately to git (there was some initial resistance). While there was around 10-20 years of VSS commit history to migrate over, having git blame immediately in VisualStudio and any git client makes a world of difference. While legacy code can’t be cleaned up immediately, the team’s mindset has changed so that there’s no more commented out code (“in case I need it later”), no more new duplicate implementations of the same business logic, and a person to blame for software bugs :)
I'm at the stage where if someone suggests that we try to keep a linear history in git I push back and argue that it isn't worth the extra effort compared to the gains.
I use git blame and history at least once a week when bug sleuthing, and value it very highly.
The most common use case for archaeology is to find who made a particular source line change 10 years ago and just ask them something. I often find it’s my own code...
People usually remember at least vaguely why they wrote the code even 10 years ago.
When we need to dig into the history of a line, we use `git log -S`.
Then for our users it's like: you can have one name and one name only; if you change it, it's going to be that name forever in the past too.
Tables should have revisioning on by default.
* Dimension 1: Code layout (organization) structure
* Dimension 2: Execution (data-flow) structure
* Dimension 3: Evolution (change over time) structure
All three dimensions try to capture some of the intent of the implementer, and understanding that intent is very important when improving upon the work. Along with that comes a perspective on what assumptions the author had. Code last modified 5 years ago very likely had different assumptions than code written last week. Being able to see which lines in a function came from which era can illuminate things nicely, and that is only scratching the surface of this evolution dimension.
Our case was a tooling policy issue but it should be possible to migrate from SVN to git and keep the change history. You should investigate this option.
I’m surprised you haven’t found excellent tools to migrate from SVN to Git, considering how popular these VCS systems are.
I once worked in a company whose source tree originally predated version control. There I found an entire module that appeared to be dead code and I wanted to determine how it became dead. I did a bisect and landed on an 8-year-old commit that was the very first commit in the version-controlled tree. Yikes. So I guess I’ll never know how that module became dead.
In the first case, I don't recall ever needing to go back into the old SVN repo, spelunking for "how we used to do it". But the capability was there, with the minor hassle of not having a single repository to search. The git repo, with some minimal recentish history, soon became the authoritative source, and we never looked back.
[edited to clarify the partitioning of the first codebase]
In your specific example, git-svn works really well for maintaining that history including authorship. I have a few projects which predate Git existing and it’s been quite usable for history. You can’t direct link to a commit ID but Git searches are very fast (we’re not on 20 year old hardware) and you shouldn’t be doing this many times a day.
- Looking at root causes of undocumented weird hacks and technical decisions
- Finding the culprit, the one who caused a bug, in order to remind them to do better.
- Finding an underappreciated contributor
- Reverting changes.
- Cherry-picking changes.
A REMINDER: Revision history is great, but so is flexibility and velocity. You can always cut off history and use another tree (e.g. when moving from SVN to git), document the cutover, keep the SVN history as an archive and use git.
If a decision will boost velocity and flexibility while sacrificing something less valuable, you should do it, but make sure you will have a fallback.
In the end, flexibility is what you will need at every level (code, product, company) because the world around you (and requirements) always changes and you'll need flexibility to be adaptive.
I've also used it to do some deep archeology. I had a piece of code I inherited that was always problematic. Eventually, I went through its history to figure out what it was originally intended to do and why it changed over the years. This was invaluable for finally figuring out how to fix the damn thing once and for all.
`git bisect` to find when and how a bug was introduced, getting details when it’s time to do multi-month merges, getting stats about previous projects.
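The idea behind `git bisect` is essentially a binary search over history. Here is a toy sketch of that idea (not how git implements it; real history is a graph, not a list). It assumes commits are ordered oldest to newest, the first is known good, the last is known bad, and `is_bad` is a hypothetical check you supply, such as running a test:

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad too.
    """
    good, bad = 0, len(commits) - 1  # indexes of a known-good and a known-bad commit
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]


# Hypothetical linear history where the bug landed in "c6".
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
broken_since = history.index("c6")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_since)
print(culprit)
```

Each check halves the remaining range, which is why bisecting even years of history only takes a handful of test runs.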
But on the rare occasions I need it, I often really need it. Especially because code that has survived sufficiently unchanged for that long often has done so for important reasons.
The amortized value per commit for really old code is likely low, but you get them 'for free' because you want to do them to have them for recent code, and the overall value of having them for older code to the codebase as a whole can be significant.
I'd say the SVN history challenge is an excuse - firstly there are tools that can do it.
Alternatively you can easily enough keep the SVN repo around for those rare occasions people really need to dig.
I also used it to pinpoint the cause of a bug after updating a docker image, knowing that the bug was introduced in a certain file between certain dates.
Now I try to strictly enforce detailed tickets and ticket numbers in commit messages.
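One way to enforce that mechanically is a commit-msg hook. A minimal sketch of the check itself, assuming Jira-style ticket keys such as `PROJ-123` (the pattern, the function name, and the sample messages are my assumptions; adjust them to your tracker):

```python
import re

# Assumed ticket format: uppercase project key, dash, number (e.g. PROJ-123).
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def has_ticket(message: str) -> bool:
    """Return True if the subject (first) line of a commit message names a ticket."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    return bool(TICKET_RE.search(subject))


print(has_ticket("PROJ-123: handle empty customer names"))
print(has_ticket("fix stuff"))
```

Git passes the path of the message file as the first argument to `.git/hooks/commit-msg`, so a hook script would read that file and exit non-zero when `has_ticket` returns False. This only checks that a ticket number is present; whether the ticket itself is detailed is still up to people.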
git blame > who changed that line the last time
git log > why was it changed
You can quickly find out if this was some trivial typo fix, or an important feature was introduced.
Implicitly, it means that to get some sort of value from that kind of archaeology you need either very detailed git commit messages, or a really clear bug tracker with all the why, regression tests, etc. that were done at that time.
That being said, I think this is a bad argument for not changing/updating your VCS.
You can absolutely move to git, and keep a dump of the SVN base you can still expose and review at will.
https://github.com/jolmg/git-reblame
Last time I used it was last week to see how a particular piece of code was developed throughout the years. There was a comment that didn't explain some puzzling details, and it helped to make sense of it by seeing how the code changed from the time the comment was written.
What benefit you can get from 30 years of commits I'm not sure.
By the way - it looks to be possible to migrate history from SVN to Git, so if your company needs that, maybe start there, by creating a local git repo with intact history and showing it to them.
If a project has a clean commit history, this instantly gives me extra context and hopefully even links me to an issue thread explaining what was being solved.
In older code bases this is invaluable - I often find myself looking at history from five years ago or more.
It's also great for my own projects. Even if I wrote the code six months ago there's still a strong chance I won't fully remember the context for the change.
It’s an important communication tool. Also game companies tend not to have unit tests, but the culture is very much “don’t break _anything_, and don’t make the game worse,” so devs have to triple-check the intentions & effects of any code/script they touch, to be sure they understand what they’re changing and know it won’t introduce any unexpected changes. Timelapse view (Perforce’s version of git blame) is an essential tool for all departments, especially for anyone trying to figure out a bug.
There were other core systems that I also read sometimes that were older, and it was extremely useful to understand their construction and function.
Shitty codebase and lack of tests defining business needs require it
Also devs with headphones on not stating decisions ... That way, when people leave, no one knows why it is the way it is.
Where I work, we use it primarily as part of maintenance. Looking through what changes have been made to a section of code over time very often gives insight into what is causing a current malfunction -- sometimes it even lets you spot the problem almost immediately.
We also use it as part of development and bug tracking. All code changes are tracked by revision number. Even there, being able to look up even antique history can be very useful.
https://gregoryszorc.com/blog/2015/05/18/firefox-mercurial-r...
This becomes very valuable when maintaining projects that will be running for years, and prevents you from undoing things or going back to making the same mistakes.
BTW: if you sometimes move code around between two git repos (from multirepo to monorepo for example), I wrote a script to move a subfolder between the two and keep history:
https://github.com/jakub-g/git-move-folder-between-repos-kee...
What happened in the CVS => SVN migration? You don’t have 30 years of SVN history. Do you have an SVN mirror / backup on which you can try out git svn to import the codebase?
Besides, it may be 30 years' worth of commits, but I’d guess it’s smaller than the LLVM SVN repo was at the time of the first git mirroring. How many commits are you talking about? (Including all branches, etc.).
I've used git-svn[0] to use git within svn, it's been working flawlessly in my case.
I push to master infrequently. I keep a series of topic branches off of master, one per project phase. For changes that affect other developers, I pull those out and PR them to master, then rebase the topic branch chain once the PR completes. When I switch projects I use git reflog to remember where I was working.
Basically I take advantage of git rebase and use it like time travel constantly. Somehow I stay sane..
Especially if I have written them, since when someone asks me how something works or why it's written that way, I can read the commit body and explain it to them or refer them to the message (otherwise my answer is, I don't remember!)
Additionally if you choose your changes well and don't squash commits it can be a good guide to what else touches the thing I'm looking at.
And yes, once in a blue moon there is a change that breaks something, I need to go back and recover some old code. Much more often - I need to see how it worked before.
Plus, let's second the psychological benefit. I don't need to worry or think twice before changing code.
When you do want to convert that SVN repository, use Reposurgeon (http://www.catb.org/~esr/reposurgeon/).
More importantly: we have a clear date in this case for which versions we need to consider releasing patch fixes for.
In some cases this can be useful: even when the functional problem it causes is not easily evident in previous versions.
I did such a migration 3 years ago at a company that had a 10 year history and it was fine using the standard tool. Is there a particular problem your company has with it, or have they just not tried?
(Also, if they are really worried, nothing to stop them keeping a read only SVN server somewhere.)
You fiddle with it until you get it working the way you like, then you do an import in the background or overnight. It takes as long as it takes but you don’t care. When it’s time to make the transition you aren’t importing the whole thing, just the past week. The older stuff has already been transferred over.
It would be worth switching to git if the current technical costs outweighed the costs of the migration, yes.
I think there's more to it than just "we don't need all the history, just squash it and starting with git would be better" (or even "setup authors file, git svn fetch"), though.
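For anyone who hasn't seen that step: the "authors file" is a plain text file mapping each SVN username to a git identity, one `svnuser = Full Name <email>` entry per line, which `git svn` accepts via `--authors-file`. A small sketch for generating one. The usernames, the names, and the `example.com` fallback domain are placeholders I made up:

```python
def authors_file(svn_users, known, default_domain="example.com"):
    """Build the text of a git-svn authors file.

    svn_users: iterable of SVN usernames (e.g. collected from `svn log`).
    known: dict mapping username -> (full name, email) where you know them.
    Unknown users fall back to username@default_domain (a placeholder).
    """
    lines = []
    for user in sorted(set(svn_users)):
        name, email = known.get(user, (user, f"{user}@{default_domain}"))
        lines.append(f"{user} = {name} <{email}>")
    return "\n".join(lines) + "\n"


text = authors_file(
    ["jdoe", "build", "jdoe"],
    {"jdoe": ("Jane Doe", "jane@example.com")},
)
print(text, end="")
```

Getting this mapping right is the easy part; the harder questions (branch layout, externals, huge binaries) are where the "more to it" usually comes from.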
Recently I had to go back and find out why a particular conditional was added to the code. 10 years prior someone added in a particular conditional for a bug in IE8 (which we no longer support). There was a Jira associated with it. I then knew I could remove this odd logic as it was no longer relevant.
Yes, there is diminishing value in old commits, but they are far from worthless! Never ever destroy the commit history. Doing that is, imho, a cardinal sin.
Just bite the bullet and convert the repo from SVN to Git?
Guessing the primary issue is the time it takes to convert the repo, maybe a job to do over the holidays when most people are off?
https://en.m.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fen...
Also sometimes you want to find the author of a piece of code to ask more questions about why something was done a certain way.
Makes it easier to figure out how old a line of code is and (if the commit messages are any good) why it was introduced or changed.
And then I started looking at history and it's invaluable to have, particularly when understanding rationale or debugging issues.
In GoLand/IDEA you can look at the git history of only the selected code. I use this constantly to see how the code has been previously modified before I make my own changes.
- What is this trying to accomplish?
- Why are you doing it this way?
- Why not this other way?
Ideally these things would be answered in comments, but they often aren't. The commit message hopefully answers #1, and it links to the code review tool which may shed light on the others.
I also use history for git blame to understand when a change was introduced for debugging / intent purposes; this can go back months if not years
The sequence of diffs is much, much, more informative than the current state of the software.
Though we stand on shoulders of midgets usually, you still get a better view.
Being able to pop into our SVN history made this a trivial ask.
A wonderful reason to create atomic commits with good commit messages.
Call it insurance.
Most commit messages are only the code of the Jira issue and maybe its title, but almost never what they actually did or why. Frequently, they will have half a dozen commits with the same message -sometimes even unrelated commits because they got a bit too lazy-. Most Jira tasks don't have a description. If it's a new development, the documentation is generally elsewhere and the Jira task has no description at all. If it's a bug, it may have some screenshot attached, and it sometimes has an explanation but generally the explanation is given verbally to the developer.
A handful of developers heard The Architect say once that it's preferable to submit one commit for each changed file than to put two unrelated changes in the same commit, and so they do. They change 12 different files for a certain feature and they will make 12 separate commits, one file each. Not always one after the other but sometimes dispersed through the day. One or two developers obsessively commit each single change they do. Meaning they write a couple of lines of code, commit it, and then try it, see it wasn't correct -there was a typo, it wasn't the correct field they needed, whatever-, edit again, commit again, etc.
They have a certain backup process which stores a handful of XML log files from some processes; they store them by committing them to the SVN repo. A commit every hour, in the development branch.
They have a flow with two branches, trunk and development, and a 6 month cycle for releases... It sort of works this way:
Start (theoretical): People develop on development. Two -or two and a half- months before release, they make "the switch". Everybody commits whatever they are doing at the moment and stops for a day. They merge development into trunk, and then they all start working on trunk for the rest of the cycle until release.
In that final period, trunk is mostly "open" -more on this later- and people just commit to it and that's it. development is abandoned and deleted. A new development branch is taken from trunk but is not generally used during this period.
When release time comes, trunk is tagged with the version. Everybody switches back to the new development and development is done there. But this is not what happens because there's another period of maybe one or two months, where trunk -the released version- has a number of a. bugs, b. stuff that was unfinished, c. smaller things which "well, we could do it on trunk because it's just a small thing". So, what happens is they go on working on trunk for that month or two, and only gradually people start working on development.
Also, they don't really tag trunk at release time because it's not "done" yet. When the bug hunting season is over -or when they are just tired of it- then they tag and freeze trunk, with the version, move it into storage. Nothing in this is really planned. They just decide one day and then tell people, who just rush whatever they were doing on trunk and commit it, or abandon it and move to development.
During both pre-release and post-release periods, merges are done about once or twice a week from trunk to development. If you use SVN you'll know that these merges are seen as a single commit in the receiving branch. You can see the full history if you query the merge info, but it's not shown directly in the main "svn log".
All this means they have:
- about 40% automated commits from some backup process.
- Most changes happening in the other branch, so you need to go through mergeinfo several times.
- Main development branches deleted and created new every so often.
- Most people not explaining what they did in commit messages.
- About half of the bugs in Jira not describing the problem and almost all of the tasks not explaining the work to be done.
So... do we ever truly use the revision history?
Yes.
A few people -particularly Karen- use it to drop the blame on whoever they want. They get a bug, they open the svn log for something related, see a name they don't like much and say "Ok, just assign this to X, because they did something on that file 4 months ago".
I am using it, sometimes -with some effort and some success- to try to understand just where some heavily copy-pasted snippets come from, so that I can wipe them out for good. Also, sometimes I use it just to write in my diary and laugh a bit about it so I don't cry so much when I get up in the morning. This is probably the most valuable thing we get out of it, because it keeps me... well, insane, but at least not murderously insane.
- Why the heck is this here?!
- git blame, git log, git show
- Ahh...