Best way to prepare a codebase for open-source?

Question

Hi HN, how would you approach open-sourcing a proprietary codebase, especially when it comes to making sure that no secrets are sitting somewhere in the history? I'd be tempted to publish without history, but I feel like a lot of important context would be lost.

hobo_mark · Accepted Answer

Keep the proprietary history to yourself and publish just a squashed snapshot of the state you want to make public. Use git-replace [1] to link the public history with the (hidden) private one when you want full history, and develop only on the public one from now on.

[1] https://git-scm.com/book/en/v2/Git-Tools-Replace
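A rough sketch of that workflow, written as a small Python helper that only assembles the git commands and does not run them. The branch name `public`, the commit message, and the commit IDs passed in are illustrative assumptions; `git checkout --orphan` and `git replace --graft` are standard git commands.

```python
def snapshot_commands(branch="public", message="Initial public release"):
    """Commands that turn the current working tree into a single-commit branch."""
    return [
        ["git", "checkout", "--orphan", branch],  # new branch with no parent commits
        ["git", "add", "--all"],                  # stage the tree as it stands now
        ["git", "commit", "-m", message],         # the one commit that gets published
    ]


def graft_command(public_root, private_tip):
    """Command that makes the public root commit appear as a child of the private tip.

    Meant to be run in a private clone that has fetched both histories.
    The replacement is stored under refs/replace/, which a plain
    `git push` does not publish.
    """
    return ["git", "replace", "--graft", public_root, private_tip]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` inside the repository once you are happy with what it will do.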

marto1 · Answer

One really important part people forget about is providing a development workflow. How should people reach out to you? How can they submit patches? Report security flaws? Which design principles are you following, and are you open to changing them? How will this discussion be facilitated?

Making something publicly available is a far cry from actually having an open source project, in my opinion. Good luck in your endeavors, it's a lot of work!

johannes1234321 · Answer

Aside from the code: Know your intention.
What do you want to achieve by open sourcing? Are you trying to grow your user base, gain trust, or just throw code over the fence so users can carry on while you focus elsewhere? Do you hope for contributions, or do you want to continue driving the project yourself?
From there you can derive the community management you have to do. The more involvement you want from externals, the more you have to invest in community management. (The more you want, the quicker and more thoroughly you have to respond.)
Then be aware of all the legal things. When using libraries: are you mixing anything LGPL- or GPL-licensed with something incompatible, etc.?
And then for secrets, squashing the history is indeed best. It is a bit annoying at the beginning, but you might have code comments or commit messages referring to customers, or experiments with libraries under an unacceptable license, or whatever else in there. Limiting the review to the recent state is a lot simpler.

edderkopp · Answer

BFG Repo-Cleaner (https://rtyley.github.io/bfg-repo-cleaner/) rewrites the history to remove secrets. It does not find them on its own, so you have to tell it which files or strings to strip.

trebligdivad · Answer

As well as cleaning up/squashing history, consider:
a) Make sure it's actually all your code, not code you copied from some other closed source project, or that a contractor gave you 3 years ago.
b) Remove the questionable comments about other employees and their managers.
c) If you're removing history, you are removing some of the rationale for why things are the way they are. That does make it harder for people to change things in the future.

exikyut · Answer

Here's a possible practical idea:
1) Run the full contents of HEAD into a space-separated list of tokens considered "okay".
2) Dump out the full history into a space-separated list of tokens, filter out everything in the "okay" list, and list what's left.
You might want to set up some sort of incremental regex filter thing to chew through the list efficiently.
But if you implicitly trust HEAD as "incontrovertibly okay", this might filter out a lot of tokens for you.
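A minimal sketch of those two steps, assuming you have already dumped the HEAD contents and the full history into two strings (for example from the files at HEAD and from `git log -p --all`). The function names are made up for illustration, and splitting on whitespace is the simplest possible tokenizer.

```python
def tokens(text):
    """Split text on whitespace and return the distinct tokens."""
    return set(text.split())


def leftover_tokens(head_text, history_text):
    """Tokens that appear somewhere in history but not in the trusted HEAD."""
    return sorted(tokens(history_text) - tokens(head_text))
```

For example, `leftover_tokens("a b c", "a b password=hunter2 c")` returns `["password=hunter2"]`. The leftover list still needs a human (or a further regex pass) to decide what is actually sensitive.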

ufmace · Answer

I'd worry about the licensing too. What open-source license would you publish it under, and do you actually have permission from the owners and everyone who participated in writing it to publish under that license?

PaulAJ · Answer

On secrets: you should be cycling all your secrets on a regular basis anyway. If you aren't, now might be a good time to start.

In addition to the good advice here, you might want to check for anything potentially embarrassing, such as offensive language in comments, identifiers or commit messages. Some "tech bros" can be remarkably dumb about that stuff. Of course, if you developed all this yourself, no problem.

andrew_ · Answer

Tangential, but still relevant: once you decide on the cleanup for open sourcing, choose a path that's comfortable for YOU, the maintainer, and clearly define contributing rules/guidelines. Take breaks, and feel free to let the community know. And remember, it's only a hobby, nothing to get stressed or burnt out by.

Raed667 · Answer

99.9% of people will only care about the current version. So don't worry too much about history; people mostly just want stuff that is accessible and works.

Edit for clarification: You can delete the history for the open-source version and publish it with a fresh history, and keep the original history internally in case it is ever needed.

protomyth · Answer

If it is an old codebase and you are not sure of the origin of all the code, then you need to do a code audit and maybe even consult a lawyer.

Do not publish the history. Fresh start for open source.

hvgk · Answer

Clone it, then push it to a clean repo to destroy the history. Seriously. I understand security concerns, but the history could expose previous bad practices which can be applied to other products you run. Bad patterns are endemic.

Also, it gives you the chance to secret-scan it and remove any swearing and embarrassing comments (I haven't seen a codebase without any yet).
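A rough idea of what such a pass over the snapshot could look like. The three patterns are illustrative assumptions, not a complete rule set: one matches the usual shape of an AWS access key ID, one matches PEM private key headers, and the word list is a placeholder you would replace with your own.

```python
import re

# Illustrative patterns only; a real scan needs a much longer list.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "flagged_word": re.compile(r"\b(?:wtf|stupid|idiot)\b", re.IGNORECASE),
}


def scan(text):
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Running `scan` over each file in the snapshot gives a list of lines to review by hand before the first public push.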

Jugurtha · Answer

I recently put something on GitHub that we wrote for our product. The repo was on GitLab. You can add a git remote and push and it will keep your git history.
I wrote the library because the issue it solves in MinIO's Python client was marked as "won't fix" and it has been useful for many people (we put it on PyPI before adding it to GitHub), and I was glad a few days ago to see that MinIO added something very similar to their Python client (they added a Python wrapper just like bmc).
https://github.com/ikodotai/bmc

joshxyz · Answer

CI tests are nice, and GitHub Actions is free for public repos on the standard hosted runners.

bumblebritches5 · Answer

To remove secrets you can use git filter-repo to rewrite commits
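As a sketch of how that might be driven: git filter-repo has a `--replace-text` option that reads a file of expressions, one per line, optionally followed by `==>` and the replacement text. The helper below builds such a file from a list of known secrets; the function name and the example secret are made up, and the `literal:` prefix is used on the assumption that you want exact string matches rather than regexes.

```python
def expressions_file(secrets, placeholder="***REMOVED***"):
    """Build the contents of a --replace-text expressions file."""
    lines = []
    for secret in secrets:
        if "\n" in secret:
            # The file format is one expression per line.
            raise ValueError("secret must fit on a single line")
        lines.append(f"literal:{secret}==>{placeholder}\n")
    return "".join(lines)
```

Writing the result to, say, `expressions.txt` and running `git filter-repo --replace-text expressions.txt` on a fresh clone would then rewrite every commit that contains one of those strings. Any secret that was ever committed should still be rotated.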

mrslave · Answer

Another reason to squash into a single commit: not all of your developers want their name & work email address published without their consent.
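If you do consider keeping history, a quick way to see whose identity would be published is to list the distinct authors first. A small sketch, assuming the input is the output of `git log --all --format='%an <%ae>'`; the function name is made up for illustration.

```python
def unique_identities(log_output):
    """Distinct 'Name <email>' lines from git log output, sorted."""
    return sorted({line.strip() for line in log_output.splitlines() if line.strip()})
```

Committer identities can differ from author identities, so the same check with `%cn <%ce>` is worth running too.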