It's a git repo and everyone can pull/push directly to master.
I skim the commits once a week.
It's basically plain text files. But the md extension triggers some nice eye candy in vim and other browsers.
I think we will keep this structure forever. Maybe we will (additionally) serve the files over HTTP at some point. Maybe we'll even add edit / search / push functions over HTTP, but for now I have not planned that.
I have seen CMS come and go. And I'm tired of it. Text files are forever.
Postmortems are important to drive home what went wrong, but newcomers won’t read them all.
That’s why you need people with experience and people with tenure.
What companies sometimes do: they encode lessons into rules, which tend to survive turnover. But that comes with its own set of problems, where you end up with a lot of rules whose reason is forgotten.
An example from a previous company (not the one I'm at now): I used to work for a startup building a mobile phone network (technically, a Mobile Virtual Network Operator, MVNO, if you care about those distinctions - the point is that phone calls went through our infrastructure). In the process of changing someone's account (porting in a phone number, changing a plan, or something like that), it was very easy for the different IDs from different systems (phone number, phone's serial number, SIM's serial number, billing system ID number) to get out of sync.
So, we could have documented how to avoid this, and the document would have sat there with no one reading it. Instead we created a nightly job that went through all our accounts and verified that ID numbers were "as expected." It would output to a slack channel with whatever breakages occurred for us to look into the next day. This program also served as documentation - I could look at it to understand what IDs should match to which systems.
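For what it's worth, a checker like that can be pretty small. Here's a rough sketch of the idea in Python - the table names, schema and the Slack webhook URL are all invented for illustration, not what we actually ran:

    # Hypothetical nightly consistency check: schema, query and webhook are made up.
    import sqlite3        # stand-in for the real provisioning/billing databases
    import requests

    SLACK_WEBHOOK = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"

    def find_mismatched_accounts(conn):
        """Accounts whose phone number disagrees between the SIM and billing tables."""
        return conn.execute("""
            SELECT a.account_id
            FROM accounts a
            LEFT JOIN sims s    ON s.account_id = a.account_id
            LEFT JOIN billing b ON b.account_id = a.account_id
            WHERE s.msisdn IS NULL
               OR b.msisdn IS NULL
               OR s.msisdn <> a.phone_number
               OR b.msisdn <> a.phone_number
        """).fetchall()

    def main():
        conn = sqlite3.connect("accounts.db")
        broken = find_mismatched_accounts(conn)
        if broken:
            ids = ", ".join(str(row[0]) for row in broken)
            requests.post(SLACK_WEBHOOK, json={"text": f"ID mismatch on accounts: {ids}"})

    if __name__ == "__main__":
        main()

The documentation side effect falls out for free: the query itself records which IDs are supposed to match across which systems.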
My current employer follows the same sort of learn-by-building-mechanisms approach, but at a much larger scale.
February 2nd: the day I wiped out a production database because the Ansible playbooks had hardcoded settings. Since then we use settings repositories and confirmation dialogs when playbooks run against production.
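For anyone wondering what that gate looks like, here is a minimal sketch as a Python wrapper around ansible-playbook. The inventory naming convention and the prompt wording are assumptions, and Ansible's built-in vars_prompt can serve the same purpose:

    # Hypothetical confirmation gate before running a playbook against production.
    import subprocess
    import sys

    def run_playbook(playbook: str, inventory: str) -> None:
        # Assumption: production inventories have "prod" in their file name.
        if "prod" in inventory:
            answer = input(f"About to run {playbook} against {inventory}. Type 'yes' to continue: ")
            if answer.strip().lower() != "yes":
                sys.exit("Aborted.")
        subprocess.run(["ansible-playbook", "-i", inventory, playbook], check=True)

    if __name__ == "__main__":
        run_playbook(sys.argv[1], sys.argv[2])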
September 2nd: the day we realised we were unable to restore old backups because the media paths were not tied to the data, so when we moved all the data between servers we lost all users' images forever. Since then images are prefixed with the ID of the DB record they belong to; later we added S3 metadata for extra fields like user_id, object_id, company_id, etc., so we keep the URLs clean.
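To illustrate the convention (bucket and field names are made up; this is just the shape of it):

    # Hypothetical upload helper: the key is prefixed with the DB record id so a
    # restore can always map an object back to its row; extra context lives in
    # S3 object metadata instead of the URL.
    import boto3

    s3 = boto3.client("s3")

    def upload_image(record_id: int, filename: str, data: bytes,
                     user_id: int, company_id: int) -> str:
        key = f"{record_id}/{filename}"
        s3.put_object(
            Bucket="example-images",
            Key=key,
            Body=data,
            Metadata={
                "user_id": str(user_id),
                "company_id": str(company_id),
            },
        )
        return key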
September 10th. Inbox carnival: we had a small hack that added users to BCC to send the newsletter. Over time users started receiving each email 2 times, then 3, then 4, then 10, then 2 again. It was a threading issue where the BCC variable was effectively global in certain cases and got appended to instead of being rebuilt each time ... 2 full weeks into that. Python 3 and type annotations were the way to fix it.
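For readers who haven't been bitten by this one, a minimal reconstruction of that class of bug (the real code was different; this only illustrates the shared-list problem):

    # Toy reconstruction of the duplicate-BCC bug: a module-level list shared
    # across sends/threads, appended to instead of rebuilt each time.
    bcc: list[str] = []                       # shared mutable state

    def send_newsletter_buggy(subscribers: list[str]) -> list[str]:
        bcc.extend(subscribers)               # bug: recipients accumulate across calls
        return list(bcc)

    def send_newsletter_fixed(subscribers: list[str]) -> list[str]:
        local_bcc: list[str] = list(subscribers)   # fresh list per send
        return local_bcc

    print(len(send_newsletter_buggy(["a@x.com"])))   # 1
    print(len(send_newsletter_buggy(["a@x.com"])))   # 2 - the same address twice
    print(len(send_newsletter_fixed(["a@x.com"])))   # 1, every time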
Sharing knowledge isn't just a matter of tooling, but a matter of principle. Because we don't want the knowledge we share to just float out there as a "lesson" - we want people to use the lessons and act differently - what we're actually talking about here is governance (i.e. law!). This might sound heavy. It's not. It's just a change in orientation from "I'm passively sharing this lesson we learned" to "this lesson we learned changes the way we act."
There are 3 things to do to shift from "lessons learned" to working agreements:
1. Capture knowledge in the pattern "when this happens, our team will act this way"
2. Have a workflow for formally adopting a working agreement – could be a majority vote, consensus vote, etc.
3. Keep that knowledge someplace the team can browse, search, and update (e.g. Confluence, Notion, Google Drive, etc.)
If you do this, something magical happens: you'll begin to evolve your knowledge over time.
Have a working agreement that didn't quite cover a corner case? Update it! Have a working agreement that was too restrictive? Nuke it!
It's no smaller a shift in magnitude than when humanity switched from oral tradition to the written word. And guess what? The written word works much better when you're operating remotely.
Our remote team has been operating this way for nearly 5 years at Parabol. It's a common pattern that at the end of every retrospective we have a new working agreement we'd like to adopt. We've even come up with a Slack-based async workflow for adopting them: https://www.parabol.co/blog/async-decision-making-slack
One of the reasons why some people become valuable in long-tenure positions is because of the lessons learned. At a certain point, no one is going to read through every page in the wiki / archive / man pages / whatever is popular this year.
That's where onboarding and process come in: Management needs to make sure that lessons drive improving the process, that newcomers are onboarded with lessons from the past, and that everyone continues to follow the processes.
Now, jokes aside, in my company, the new owners decided they didn't like the people we were outsourcing with, and decided to replace them with their own outsourcing center. Now everyone's re-learning lessons that are probably tracked in our various wikis, repositories, etc. But the newcomers want to run things their own way.
That's why a few long-tenured people are important.
Generally, I don't think efforts to accumulate institutional knowledge on a website bear fruit - no one really wants to update the website, both because it's thankless and because of access time. It is much faster to tap the institutional knowledge in management by sending an email than by paging through the results of a search. For written institutional knowledge to have real value, the access time has to be small, which means someone has to take real care in curating the knowledge so it's easily accessed. Finally, we have the Brian problem. Brian was the person most likely to update our internal websites - unfortunately Brian wasn't very good and had some poor ideas regarding lessons learned - by adding them to the websites, his bad ideas were passed on to younger team members who didn't know better.
It felt good to know that stuff was neatly documented somewhere, but since no one ever knew where that was, it was of little value and few ever read it. People still tapped on shoulders and repeated the same mistakes.
It baffles me that an established company like Atlassian can’t get something as fundamental as search right. I can’t even find the content I myself created at times.
We have since switched to Nuclino (https://www.nuclino.com) and so far are having a better experience. It's not as feature-packed, but the basics work as expected and it's a lot more user-friendly.
Re-establishing a proper documentation culture in the team is still a challenge, but that’s not something a tool can solve.
Wikis - we found wikis are too heavyweight and formal to be used consistently for recording learnings.
Slack - in our experience, Slack makes capturing learnings easy but organizing and keeping track of learnings difficult.
Our goal with BB is to make recording a learning as convenient as writing a Slack message AND to make organizing and keeping track of these learnings similarly easy.
You can write bytes directly in Bytebase or save them from Slack.
Would love any feedback or ideas. Email me (cara@bytebase.io) to get access to the closed beta with HN in the subject line.
1. After every system failure, we email the entire org a postmortem Google Doc describing what went wrong, why, and what we are doing to prevent it from happening again. Postmortems also live in their own JIRA project.
2. We are diligently linking our issues into a hypertext mesh: what is related to what, what was blocked by what, what was decomposed from what. We are using milestones and epics.
3. All the commits in all the codebases are linked to issues. There is a rule on the server side that forbids pushing unlinked commits. There is a single exception for firefighting code changes, when there is no time to write a ticket first, but those commits are marked with a special sign, and the author must create an issue and link it to the commit once the problem is solved. (A sketch of such a server-side check appears after this list.)
4. Documentation and API specs live in the same or adjacent repos and change according to the same rules.
5. So, every line of code is linked to the corresponding issue via its commit, and then to other issues and commits in other repos via the hypertext mesh. When your code is clean, it is mostly self-documenting and becomes knowledge itself.
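Regarding point 3, here's a rough sketch of what such a server-side check can look like as a pre-receive hook in Python. The issue-key pattern and the HOTFIX escape hatch are invented for illustration:

    #!/usr/bin/env python3
    # Hypothetical pre-receive hook: reject pushes containing commits whose
    # messages reference no issue key (e.g. PROJ-123), unless marked as a
    # firefighting commit.
    import re
    import subprocess
    import sys

    ISSUE_RE = re.compile(r"\b[A-Z]+-\d+\b")   # assumed issue-key format
    ESCAPE_HATCH = "HOTFIX"                    # assumed marker for firefighting commits

    def check_range(old: str, new: str) -> int:
        range_spec = new if set(old) == {"0"} else f"{old}..{new}"
        revs = subprocess.run(["git", "rev-list", range_spec],
                              capture_output=True, text=True, check=True).stdout.split()
        for rev in revs:
            msg = subprocess.run(["git", "log", "-1", "--format=%B", rev],
                                 capture_output=True, text=True, check=True).stdout
            if not ISSUE_RE.search(msg) and ESCAPE_HATCH not in msg:
                print(f"rejected: commit {rev[:10]} has no linked issue")
                return 1
        return 0

    if __name__ == "__main__":
        # a pre-receive hook reads "<old-sha> <new-sha> <ref>" lines on stdin
        status = 0
        for line in sys.stdin:
            old, new, _ref = line.split()
            if set(new) == {"0"}:              # branch deletion, nothing to check
                continue
            status |= check_range(old, new)
        sys.exit(status)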
If you can't be bothered to put together documentation as you build the software (sometimes I can't be bothered), you should at least make sure to document as you troubleshoot later so you don't keep making the same mistake. We store these as "Flight Rules" for our application (or error signatures, etc - whatever you want to call them) which provides the team a single location to start their search when things go belly up.
That way, when you run your post-mortem (you run these, right?) you have a place to store the error notes which eventually builds up into a really useful document.
Lastly, I'd say having a team norm that when one person does something the others should also be able to test it (and therefore have the right instructions to do so) is a good one for continuity.
EDIT: Couple of other things that have worked:
- Checked-in ".dot" GraphViz context diagrams alongside your repos are nice and easy to update.
- Creating decision documents for a quick run-through of options with your technical team is a great way to run an effective process while also creating a searchable artifact for later, which is great for context / lessons learned.
The basic idea is to document decisions with a specific structure and keep them close to the code. The thing is, any time we can answer "why", it's a form of decision that can be documented somehow. Since it's close to the code, any search made while coding will also land on those decisions if the same terms are used.
There are several tools to help with that as presented here [2] and here [3].
[1] https://www.thoughtworks.com/radar/techniques/lightweight-ar...
[2] https://adr.github.io/
[3] https://github.com/joelparkerhenderson/architecture_decision...
https://risk-engineering.org/learning-incidents-accidents/
1. For large issues visible to customers an incident report is shared inside the company. These are written for general consumption and so lack any technically interesting aspects (they're "dumbed-down" a lot).
2. Technical "lessons learned" are curated in a Sphinx based documentation website that I started but which is starting to see more and more contributions from other tech heads in the company.
We used to have a wiki but it ossified after years of no contributions. Personally I didn't like the MoinMoin wiki engine that much but this is just personal taste of course. I started setting up the Sphinx site to encourage knowledge retention despite turnover - I kept explaining the same things again and again. Now I just share a URL when such questions come up :-).
1. General business lessons, which companies generally don't summarise or track.
2. DevOps outage post-mortems, which competent companies generally have some sort of process around.
I've never seen a rigorous post-mortem culture in tech outside of DevOps/SRE.
I guess there are a few reasons. One is probably that the DevOps/SRE space is very amenable to encoding lessons learned in scripts of various kinds, so it's actually useful to do a post-mortem exercise because the outcomes are very small, very concrete and will be somewhat actionable. Things like "errors in parsing this file shouldn't cause the server to blow up" are easily corrected and a process (unit test) put in place to formally encode that institutional knowledge.
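As a toy example of what that encoding looks like (the parser and names are invented; the point is just that the post-mortem outcome lives in the test suite rather than a wiki):

    # A "lesson learned" encoded as a test: malformed config must raise a clean
    # error instead of taking the server down. Parser and names are hypothetical.
    import pytest

    class ConfigError(Exception):
        """Raised instead of letting parse errors propagate and kill the server."""

    def load_config(text: str) -> dict:
        try:
            return dict(line.split("=", 1) for line in text.splitlines() if line)
        except ValueError as exc:
            raise ConfigError(str(exc)) from exc

    def test_malformed_config_raises_config_error_not_crash():
        with pytest.raises(ConfigError):
            load_config("this line has no equals sign")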
In regular software development there's way less reflection. This is partly because the tooling is much less home grown and malleable. Lessons are learned and they are encoded, but it happens slowly and through the mechanism of library and language design. It's generally not something you do within a single company but rather, it's an emergent consensus across the whole industry. Additionally, this is harder because lessons learned are often ambiguous or subjective. For instance I learned the lesson, many years ago, that dynamic typing leads to more mistakes than static typing. But you see many programmers still who prefer dynamically typed languages and dispute this sort of conclusion.
In the business world there's virtually never any kind of "lessons learned" repository or process. At most you get something like a formalised interview process, but even then, those are usually baked into a company from day one or never adopted at all. I've heard of very few large companies that adopted a more rigorous approach to hiring than the one they previously used. It does happen but it's rare.
At the executive/CEO level lessons learned get recorded in the form of strategy talks given at fancy conferences, if at all. Often abstracted or vague to the point of uselessness, any insight that is present gets forgotten immediately by the audience who are mostly there because it's easier than doing real work. These lessons learned are things like "innovation is key to the customer experience", which is a genuine learning in a sense (usually from observing the wreckage of firms that went up against a competent tech company). But it's not really useful in the sense of being actionable by normal employees.
The things I have learnt are:
- companies don't learn, people do
- having an internal wiki/kb helps a lot IF it's structured/indexed well enough that you can actually find information
- in an ideal world, no project should be considered done if documentation is not written
It's all about people. People learn and can recall lessons learned.
Otherwise you rely on the good faith and will of the next person to actually go through all the documentation that has been left by the people before. This person might not have the will, or they might not have the time.
The negative side of integration is the person with knowledge integrates it in their practice and ‘forgets’ it is new knowledge for everyone else, thus not propagating the new lesson learned.
Since the Telegram bot's messages are processed by my coordinator website, almost every message sent by any user to the bot is saved in the site's database. There are 2 types of options, public and private. A public message is saved in the DB and shared with all of the bot's users, with or without further feedback from other bot users. A private message is saved in the DB and visible only to the specified users.
I think it's very versatile for reports, lessons learned, etc.
Unfortunately, folks don't really read the docs (and I've learned from this thread that we're not alone ;)
Been thinking about this problem and thought to embed something like a quiz in the docs to make them interactive, yet still static - something like howtographql.com. Yet to try that approach, though.
I don't have a better answer yet other than making it a personal point of pride that my docs are always up to date and well-organized.
They did make an effort to standardize everything, but nobody seems to care.
https://github.com/pragmatismo-io/pragmatismo-io-framework/t... (Currently available in Portuguese)
They have pretty good procedures for keeping track of lessons learned. The book (https://landing.google.com/sre/sre-book/toc/) goes into some detail.
The dev team overall, across the globe: I find retrospectives after a sprint cycle really good actually; they're a good place to call out where improvements can be made too.
On a personal level: When my mess up/mistake causes grief for someone else, I make damn sure I learn from that.
Otherwise, it's mostly Confluence now, but no specific page of lessons learned, instead, those lessons are dotted around in individual documentation pages.
Non-written education is probably the most effective way to communicate and maintain important information. Written material leaves it up to the authors' ability to know where and how to communicate...
Of course nobody takes any notice.
And if there are significant failures in policy, you can just put a note after the revised policy that says "we tried X, it didn't work because of Y."
Is there any intersection between "keeping track of lessons learned" and "agile methodologies" or are the two completely unrelated/orthogonal ?
+ for being OSS
- anything written down and not enforced is almost the same as nothing.
IMO if the full answer to a question doesn't exist in a single brain, you'll have a hard time reconstructing what really happened without a Challenger-level investigation.
companies only solve problems temporarily
I'm leaving soon.