(by unusable, I mean I can never find the page I am looking for when I search. Basically, I have to maintain my own wiki of important links I may need to reference in the future)
* First, you only have a few occurrences of any given search query in your search history (because only a few people searched for similar words in the past)
* You can't use synonyms or remove stop words to recommend better content either ("IT" can mean "information technology" or the pronoun "it"; "THE" can be an acronym, ...).
So basically the only thing you can do is match words. Confluence is worse than that because it tries to remove stop words and does things that break exact-match search. But this is a difficult job. Ways to improve search: allow multiple titles, index with tags and attributes, only do exact word matches, allow users to suggest content for a specific search query, search autocompletion, live search while typing ... (many things that Confluence doesn't care about). You also have to respect rights when returning documents: each document can have rights from its folder or from the document itself, inherited from team access or user access, so this is really computation-intensive too, unless you pre-compute rights.
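One of the improvements listed above, search autocompletion, can be sketched with nothing more than a sorted log of past queries and a binary search (the query log here is invented; a real system would also weight suggestions by frequency):

```python
import bisect

# Hypothetical log of past search queries, kept sorted for prefix lookup.
query_log = sorted([
    "confluence search",
    "confluence permissions",
    "deploy checklist",
    "deployment pipeline",
])

def autocomplete(prefix, limit=5):
    """Return up to `limit` logged queries starting with `prefix`."""
    i = bisect.bisect_left(query_log, prefix)
    out = []
    while i < len(query_log) and query_log[i].startswith(prefix):
        out.append(query_log[i])
        i += 1
        if len(out) == limit:
            break
    return out

print(autocomplete("deploy"))  # ['deploy checklist', 'deployment pipeline']
```

Because the log is sorted, each keystroke costs a single O(log n) lookup plus a short scan, which is why live-while-typing search is cheap to offer.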
(Working on a competitor [0] of Confluence, I have put plenty of hours of work into that specific issue, and I can tell you this is really hard)
I wrote a custom search engine that worked by running on cron, pulling in all of the content from Confluence and writing it into a SQLite table with SQLite full-text search enabled (using https://sqlite-utils.datasette.io/en/stable/python-api.html#...), then sticking a https://datasette.io/ interface in front of it.
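sqlite-utils is a wrapper over SQLite's built-in full-text search; the core of that approach looks roughly like this using only the stdlib (table and column names are made up, and this assumes your SQLite build ships the FTS5 extension, which Python's bundled one normally does):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS5 virtual table indexes every column it declares.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(title, body)")
conn.executemany(
    "INSERT INTO pages (title, body) VALUES (?, ?)",
    [
        ("Deploy checklist", "Steps to deploy the API to production"),
        ("Team lunch notes", "Where we ate last Friday"),
    ],
)

# MATCH runs a full-text query; bm25() ranks hits (lower is better).
rows = conn.execute(
    "SELECT title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("deploy",),
).fetchall()
print(rows)  # [('Deploy checklist',)]
```

The cron job in the setup above just re-inserts the pulled Confluence content into a table like this, and Datasette serves the query side.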
This methodology works
https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20EVA...
and I used it to tune up the relevance of a search engine for patents to the point where users could immediately perceive that it worked better than other products.
After I worked on that I wound up talking to the developers and/or marketing people for many enterprise search engines and few of them, if any, did any kind of formal benchmarking of relevance.
People at one firm told me that they used to go to TREC conferences because they thought it got them visibility but that they decided it didn't so they quit going.
A message I got repeatedly was that these firms thought that the people who bought the search engines didn't care much about relevance, but they did care about there being 200 or more plug-ins to import data from various sources.
In principle the tuning is unique to the text corpus. One reason for that is that there is a balancing act of having a search engine that prefers small documents (they have spiky vectors that look more like query vectors) or large documents (they have so many words they match everything.) Different corpuses have different distributions of document sizes, not to mention different distributions of words that appear.
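That small-vs-large balancing act is exactly what the length-normalization parameter b controls in BM25-style scoring; a toy scorer (hand-rolled for illustration, not any particular engine's implementation) shows how it trades off:

```python
import math

def bm25_term_score(tf, doc_len, avg_len, n_docs, df, k1=1.2, b=0.75):
    """Score one query term against one document, BM25-style.

    b=0 ignores document length entirely; b=1 fully normalizes by it.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * doc_len / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Same term frequency, very different document lengths.
short = bm25_term_score(tf=2, doc_len=50, avg_len=500, n_docs=1000, df=10)
long_ = bm25_term_score(tf=2, doc_len=5000, avg_len=500, n_docs=1000, df=10)
print(short > long_)  # True: with b=0.75 the short document wins

# With b=0, length no longer matters and the scores tie.
tie_a = bm25_term_score(2, 50, 500, 1000, 10, b=0)
tie_b = bm25_term_score(2, 5000, 500, 1000, 10, b=0)
print(tie_a == tie_b)  # True
```

Tuning against a corpus is largely about picking k1 and b (and field weights) so that neither the spiky short documents nor the match-everything long ones dominate.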
Few organizations are willing to do the work to tune up a search engine (you have to decide about the relevance of 10,000+ document hits), but I've had the experience that you can beat the pants off the defaults even using a generic tuning. For instance that patent search engine was tuned up against the GOV2 corpus instead of a patent corpus. A small patent corpus showed us we were on the right track, however.
Aside from the organizational issues, I think there's a problem where basically no search system can be good for every org, with any kind of internal info and different queries from perhaps several distinct types of users with different goals. To get good, a system needs to improve through at least rudimentary ML. At its simplest: if Alice searches for X today and clicks doc3, then when Bob searches for X tomorrow, doc3 should rank higher. This requires collecting and aggregating clickstream data and using those counts (with cardinality #docs x #queries) at search time. Sometimes it requires a richer model relating search terms to terms in relevant (clicked) docs, optimizing for some measure of search quality (NDCG, etc.). All of this requires detailed access to docs, search/click histories, and a fair amount of computation and storage. But customers have legitimate reasons for wanting these docs to be accessible only by their own employees, and they don't want to dedicate their own staff to improving such a system. No one wants to hear that their model retraining ran out of memory, etc. So shipping a simple system which doesn't improve but has no moving parts becomes a local optimum.
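The simplest version of that click-feedback loop is a counter keyed on (query, doc) that re-ranks later searches — a sketch only (doc names and scores invented; a real system would decay counts over time, guard against feedback loops, and apply per-user permissions):

```python
from collections import Counter

clicks = Counter()  # (query, doc_id) -> observed click count

def record_click(query, doc_id):
    clicks[(query, doc_id)] += 1

def rerank(query, scored_docs, weight=0.1):
    """scored_docs: list of (doc_id, base_score) pairs.
    Boost documents that earlier users clicked for the same query."""
    return sorted(
        scored_docs,
        key=lambda d: d[1] + weight * clicks[(query, d[0])],
        reverse=True,
    )

# Alice searches "vpn setup" and clicks doc3...
record_click("vpn setup", "doc3")
# ...so when Bob runs the same query tomorrow, doc3 moves up.
results = rerank("vpn setup", [("doc1", 1.0), ("doc3", 0.95)])
print(results[0][0])  # doc3
```

Even this trivial version already needs the #docs x #queries count store available at query time, which is the storage and access burden described above.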
OTOH I'm also a believer that you should be able to navigate to the right information.
People seem to think that writing pages is sufficient. A library works because pages are gathered in books, organised by sections, and there is an army of librarians to keep it running smoothly.
I treat documentation like code - DRY and refactoring apply just the same. e.g. I might split a page up so that some common part can be re-used. I'll cull obsolete information or mark it obsolete. I'll _also_ update headings to help pages show up in searches.
I can gain far more functionality with a properly implemented self-hosted MediaWiki server (the same code that runs Wikipedia itself) with a number of useful plugins installed and enabled.
It doesn't require rocket-science-level apache2+php7+mariadb knowledge to set up. The instructions are really quite straightforward.
Honestly I wish I knew more but it was like pulling teeth trying to get people there to speak openly about why it's so hard when it is solved in so many other products.
Some existing tooling:
Google cloud search has a confluence connector https://developers.google.com/cloud-search/docs/connector-di...
Elastic workplace search has a connector. https://www.elastic.co/guide/en/workplace-search/current/wor...
Lessonly has (or had) a thing called Obie https://www.lessonly.com/blog/how-to-search-better-in-conflu...
Raytion https://www.raytion.com/connectors/raytion-confluence-connec...
"Atlassian Tools" is on my list of automatic rejections for companies I'm thinking of working at for this reason.
The organization of most teams' documentation is horrendous at my company. There are at least 3 different pages I have to go to for how-to articles and that's just within my current team's space. Not to mention there's limited information on those pages.
Documentation is an afterthought. We've also seen a lot of attrition this year. I'm the senior person on my team as a midlevel. I have one contractor whose term is up in a couple months and one junior. They can't fill the 4 positions that have been open for 2-3 months.
What’s the prevailing wisdom these days on the best solution for an internal knowledge base/wiki platform?
My colleagues and I have been grumbling for ages that our instance of Confluence must be really badly configured. If you put in a single-word search term, there will be lots of results, but no guarantee that any pages containing that word in the title (or body) will appear above ones where it doesn't.
The search problem was solved long ago by Apache Solr/Lucene. Although this may not be true for multiple languages.
1. Give pages labels. This lets you insert a label-based index, and also makes it possible to narrow search by label.
2. Use spaces. Separate the content into spaces based on who is likeliest to need that information. You can narrow search by space, and put a search box on the page in the space.
3. Use the hierarchy. You have to put the pages somewhere in the hierarchy anyway, so try to make it reasonable.
4. Make useful index pages. Obviously, this doesn't scale, but if you can provide people with useful starting points, it will help them. For example, at Khan Academy we have a space for the whole org with a front page to get you to every team's front page. The engineering team has a front page with a small collection of useful & commonly-used links.
5. If you have a page in your hierarchy with a lot of content underneath it, add a search box on that page that constrains the search to that set of pages.
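Labels, spaces, and subtree restrictions all map onto Confluence's CQL query language, which the REST search endpoint accepts; building such a request looks roughly like this (the site, space key, and label are invented, and the endpoint path can differ between Cloud and Server):

```python
from urllib.parse import urlencode

# Hypothetical space key and label; CQL narrows the text search to both.
cql = 'space = "ENG" and label = "how-to" and text ~ "deploy"'
url = (
    "https://example.atlassian.net/wiki/rest/api/content/search?"
    + urlencode({"cql": cql, "limit": 10})
)
print(url)
```

Sending a GET request to that URL (with authentication) returns the matching pages, so the narrowing tricks above are scriptable, not just things you do in the search box.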
The biggest problem Confluence search has is that it's terrible with relevance, and using its tools to narrow down the search can improve the relevance of the results considerably.
It does partial matches anywhere in a word, supports every language even within the same document, and even has regex support for those who need it. It updates instantly, with instant filters.
It can find things like 168.0 in 192.168.0.1, for example, which the existing Confluence search cannot. Or search for AKIA credentials with /AKIA[A-Z0-9]{16}/. I have heard people describe it as Algolia for Confluence, which makes me happy.
https://marketplace.atlassian.com/apps/1225034/better-instan...
As for why their search is so bad? It's probably due to how they apply permissions. Every permission needs to be applied per search, per user. That makes the system complex and hard to change, which in turn makes it hard to improve. I imagine it's one of those parts of Confluence that is a major pain to work with.
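"Per search, per user" essentially means intersecting each candidate hit's ACL with the searcher's principals, either at query time (slow, as described) or pre-computed into the index; a query-time sketch, with all document IDs, groups, and ACLs invented:

```python
# Each document carries the set of principals allowed to read it,
# already resolved from folder/space inheritance ahead of time.
doc_acl = {
    "doc1": {"group:eng", "user:alice"},
    "doc2": {"group:hr"},
    "doc3": {"group:eng", "group:hr"},
}

def visible_hits(hits, user_principals):
    """Drop search hits the user may not read (query-time ACL check)."""
    return [d for d in hits if doc_acl[d] & user_principals]

alice = {"user:alice", "group:eng"}
print(visible_hits(["doc1", "doc2", "doc3"], alice))  # ['doc1', 'doc3']
```

The pre-computed alternative bakes these principal sets into the index as filterable fields, which is faster at query time but means every permission change triggers re-indexing — one plausible reason changes are painful to ship.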
I think a lot of this is also due to their cloud migration. With the server version they let you host yourself, you could store the index on disk. With cloud they suddenly need to keep the index state somewhere persistent, while also scaling up and down dynamically.
Lastly, they also apply stop words, stemming and such, using out-of-the-box Lucene. Lucene is a great tool, but it can also be a pain to work with. You can see problems when you mix languages on a page, such as having Thai, Chinese and English on a single page, which confuses the Lucene tokeniser.
When choosing 3 years ago, we used the following criteria:
* WYSIWYG editor. Writing documentation must take minimal effort for any user
* Flexible access permissions to various parts of the documentation. Public documentation is open to anonymous users, the internal one is divided into many sections with access for certain groups
* Multilingual support. Not out of the box, but possible with plugins
* Multilingual pdf export. In some markets, some customers prefer to have exported manuals
* The ability to inherit (reuse) articles. We need to be able to make edits once instead of duplicating the same articles
* A relatively modern appearance. Wiki engines are familiar to many because the whole world uses Wikipedia, but that does not make them more pleasing to the eye, if I can say so
3 years have passed and I periodically look at alternatives; so far only wiki.js seems like a good candidate, but it's not even close yet.
I use Confluence and Jira because, again, we use them at work. So I guess I'm using them because I have to. I also understand it's a pain to move our company from one to another (oh we've had discussions to move to Coda and others) but again, I'm not taking on that project. Again, UI/UX, search - all meh - they are working and I got used to it.
The inconvenience of using them does not justify the amount of time I would need to spend to overcome that inconvenience. Some things you just have to let slide.
I think most search engine designers want to make the index as broad as possible, but the problem seems to be that people rarely want such broad searches. What they really want are very detailed indices and metadata implications over well trodden folders.
Maybe this is something Google should take on: a search plugin for Confluence where Google crawlers log in from time to time for internal crawling, enabling non-public search requests on that data. That would boost knowledge workers' efficiency a lot. I hope somebody from Google reads this and takes on the challenge. I'm sure companies would pay a lot for this.
If you want the best of both worlds, you can use the "Favorite Pages Macro" on any page to reference all of the pages that you have saved for later, which makes keeping that page up to date with your latest changes to saved pages trivial.