Refactoring Code to Enterprise Level

Question

I have been writing code for the past 10 years, mostly for small companies that saw a medium level of traffic (50-500 concurrently active users). Last year I moved to a startup where I had to scale code to 5,000-10,000 concurrent users, which has gone nicely so far! Now I have to refactor one of its core functionalities into enterprise-level software. In short, it will be the logistics division of my company, and its end users will be people in tier 3-4 cities of India who aren't really tech-savvy (basically, it means exception catching and validations should be foolproof). The existing code has been working well so far, but I am not happy with the 'if else' way the code has been written! Any suggestions on how to architect software for a logistics company, or how to refactor the code under the assumption that it will fail, so that it recovers automatically or at least alerts the developer/data team that something has gone wrong (there are 2-3 DBs which have to be kept in sync, thanks to microservices :/)? In short, I have never written an enterprise-grade application! Any suggestions/advice would be highly appreciated.

avl999 · Accepted Answer

This is an incredibly broad question and touches on many aspects of software engineering, DevOps and operations. For operations, I would recommend having the team read the Google SRE book (https://sre.google/workbook/table-of-contents/). It has everything one would need to set up a modern operations infrastructure and the associated best practices.
Sounds like you are unhappy with the quality of the code and error handling. Seems like it was written by junior devs? I would recommend having the team read the usually recommended books if they haven't. At the bare minimum:
* Refactoring by Fowler
* Clean Code by Robert Martin
If you are using an OO language:
* Practical Object-Oriented Design in Ruby
The book has "in Ruby" in the title, but it's a general-purpose book on what makes OO design "good".
Then follow it up with
* Patterns of Enterprise Application Architecture
* Clean Architecture
These books are not perfect or the be-all and end-all, and parts of them might be slightly dated, but they get you a large chunk of the way to the promised land.
Lastly, if the application is working, the users are happy, there is no bug infestation and you are not having issues with releasing new features, don't feel the pressure to immediately "fix" the code.

tarun_anand · Answer

The best place to start is to read open source code. Start with a small but popular repository on GitHub.

Jugurtha · Answer

Here are the things you can set up in a day or two. These are things you can do right now that will give you room to breathe so you can start to think more broadly, but most importantly, they don't require changes to the code:
Take a look at Sentry[0]. It will catch exceptions, group them by type, count them, display them with the stack trace, and more. It has integrations with Slack, GitLab/GitHub, etc. It makes creating issues and alerting you easier. You won't have to dig into dozens of log files and miss exceptions that happened a few hours ago but are completely drowned by log messages.
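To make the "group them by type, count them" idea concrete, here is a rough sketch using only the Python standard library. It is not Sentry's API or how Sentry works internally; `ErrorAggregator` and `book_shipment` are hypothetical names made up for the illustration.

```python
import traceback
from collections import defaultdict


class ErrorAggregator:
    """Toy stand-in for an error tracker (hypothetical, not Sentry's API)."""

    def __init__(self):
        # Maps exception type name -> list of formatted stack traces.
        self.groups = defaultdict(list)

    def capture(self, exc):
        # Keep the full stack trace so the failure can be inspected later.
        trace = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        self.groups[type(exc).__name__].append(trace)

    def summary(self):
        # Most frequent exception types first.
        counts = [(name, len(traces)) for name, traces in self.groups.items()]
        return sorted(counts, key=lambda item: item[1], reverse=True)


def book_shipment(order):
    # Hypothetical business function, used only to trigger errors.
    if "pincode" not in order:
        raise KeyError("pincode")
    if not str(order["pincode"]).isdigit():
        raise ValueError("pincode must be numeric")
    return "booked"


aggregator = ErrorAggregator()
for order in [{"pincode": "560001"}, {}, {"pincode": "abc"}, {}]:
    try:
        book_shipment(order)
    except Exception as exc:
        aggregator.capture(exc)

print(aggregator.summary())  # [('KeyError', 2), ('ValueError', 1)]
```

A hosted tool gives you this plus alerting and a UI, which is why it is worth setting up rather than building yourself.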
Take a look at Prometheus[1] and Grafana[2]: you can set them up and have the state of your infrastructure in a dashboard (services up or down, how many times they were down and for how long, latency, CPU, GPU, disk space, RAM, etc.). Whatever matters to you, put it there. You can create custom alerts (example: storage 90% full) to give yourself a heads up and act beforehand.
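The "storage 90% full" example boils down to a threshold check. In Prometheus you would express it as an alerting rule rather than application code, so treat the following as a plain-Python sketch of the logic only; `storage_alert` and `check_path` are hypothetical helper names and the 0.90 threshold is just the figure from the example.

```python
import shutil


def storage_alert(used_bytes, total_bytes, threshold=0.90):
    """Return True when the used fraction is at or above the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def check_path(path=".", threshold=0.90):
    # shutil.disk_usage returns a named tuple with total, used and free bytes.
    usage = shutil.disk_usage(path)
    return storage_alert(usage.used, usage.total, threshold)


if check_path("."):
    print("Heads up: storage is at least 90% full")
else:
    print("Storage is below the alert threshold")
```

The point of alerting at 90% rather than 100% is the same as in the text: you get time to act before users notice anything.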
Look at whatever you do as a team when there's something wrong, figure out all the information you usually need to troubleshoot, and have it all displayed so you can glance at something and know quickly where something is wrong. This is reactive, but it's a start and you won't waste
For development, you can have issue templates for incidents [service shut down and nobody saw it, etc], and bugs to lower the barrier to entry to write good quality incident reports. This way, people know what to write, where to write it, and how to write it. Put a tag in the template, the people in CC, whatever makes your life easier. Summary, impact, recovery, investigation, future prevention.
One benefit of that is that when you have these incident reports, patterns will emerge fast. It surfaces the most frequent and the most impactful problems pretty quickly.
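As a small sketch of how patterns fall out of structured reports, assume each report carries the template fields above plus a list of tags. The `IncidentReport` class, the tag names and the sample incidents are all made up for the illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    # Fields mirror the template: summary, impact, recovery,
    # investigation, future prevention.
    summary: str
    impact: str
    recovery: str
    investigation: str
    future_prevention: str
    tags: list = field(default_factory=list)


def most_frequent_tags(reports, top_n=3):
    """Count how often each tag appears across all reports."""
    counts = Counter(tag for report in reports for tag in report.tags)
    return counts.most_common(top_n)


# Hypothetical sample data.
reports = [
    IncidentReport("Orders missing in warehouse DB", "120 orders delayed",
                   "Manual resync", "Sync job crashed", "Add retry and alert",
                   tags=["db-sync", "orders"]),
    IncidentReport("Stale tracking status", "Wrong status shown to users",
                   "Replayed events", "Consumer fell behind", "Monitor lag",
                   tags=["db-sync"]),
    IncidentReport("Service down after release", "10 min outage",
                   "Rollback", "Bad config", "Config validation in CI",
                   tags=["deploy"]),
]

print(most_frequent_tags(reports, top_n=1))  # [('db-sync', 2)]
```

Even a spreadsheet works for this; what matters is that every report has the same fields and tags so they can be counted.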
Once this is done, it will save you time and effort that you can put into reading more on the subject. Search for "Site Reliability Engineering", or "SRE"[3]. There are a few books, some more abstract and others more practical[4][5].
Take a look at Enterprise Ready[6]. It talks about the most common requirements and features in an enterprise product (SSO, RBAC, etc).
- [0]: https://sentry.io
- [1]: https://prometheus.io/
- [2]: https://grafana.com/
- [3]: https://en.wikipedia.org/wiki/Site_reliability_engineering
- [4]: https://sre.google/books/
- [5]: "Seeking SRE, Conversations about Running Production Systems at Scale"
- [6]: https://www.enterpriseready.io/