HACKER Q&A
📣 jve

Google not completely indexing github.com issues?


Given https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/issues/3033

For the query "cannot persist Microsoft authentication token cache securely!" (with quotes), Google returns a single result, written in Chinese. Luckily I opened that Chinese result and spotted a link to that issue.

duck.com does find the issue at hand.

I mean, GitHub is no small site, and somehow I expected that Google would find ANY public string on the internet.

Not that it doesn't find issues at all - but I stumbled upon this one that got left out.


  👤 ghughes Accepted Answer ✓
Recently I've noticed that the GitHub and Stack Overflow scraper clones will often be the only result for this kind of query. It looks like the blackhats have found a way to rank higher than the content they're cloning. I suspect this has the side effect of tricking Google's anti-spam system into punishing the canonical domains and URLs, because it thinks they're copies of higher-ranked content.

👤 danuker
It happened to me also, not just for GitHub, but also Stack Overflow. I no longer rely on Google, and search the sites directly.

Sometimes I can't find anything non-commercial. For instance, I wanted to find out where the phrase "milk and honey" comes from. Googling it only yields a book for sale. But a Wikipedia search for it, which is one click away in Firefox, gives me this page [0] which is exactly what I wanted to know.

Looks like Google no longer acts on the mission to organize the world's information, but focuses on making money.

[0] - https://en.wikipedia.org/wiki/Milk_and_Honey


👤 AtNightWeCode
It seems like people with skills and ambitions don't want to work on search. The search feature of GitHub is also garbage. The ranking is ridiculously bad. Stack Overflow search is also a joke. Almost forgot about the docs at MS: cloning the docs and using a find-in-files tool will beat the MS docs sites. Also, I challenge anybody to write a serious version of find in files on Windows that is slower than the native one.

My tip is to try to avoid searching. Ask yourself where the information may be located. Then try to find that location if you do not already know it.
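
The "clone and find in files" approach mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `find_in_files`, the default `.md` suffix filter, and the case-insensitive substring match are hypothetical choices, not a description of any particular tool.

```python
from pathlib import Path


def find_in_files(root, needle, suffixes=(".md",)):
    """Yield (path, line_number, line) for each line under root containing needle.

    Assumption: the docs were cloned locally and are plain-text files.
    Matching is a case-insensitive substring test, nothing smarter.
    """
    needle_lower = needle.lower()
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and files whose extension is not of interest.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: ignore and move on
        for number, line in enumerate(text.splitlines(), start=1):
            if needle_lower in line.lower():
                yield path, number, line.strip()
```

Calling `list(find_in_files("path/to/cloned-docs", "token cache"))` would return every matching line with its file and line number; the path here is a placeholder.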


👤 pendar747
I have come across this before: searching for a particular problem but not finding the relevant GitHub issue in Google's search results, which made me use GitHub's own search inside the relevant repo to find what I was looking for. I'm not sure why Google wouldn't be indexing GitHub issues, though.

👤 marginalia_nu
My guesswork, based on the fact that my crawler became aware of so many GitHub URLs that I had to add special logic to exclude URLs that look like commit hashes:

GitHub is absurdly large in terms of the number of documents, since there's ostensibly a "document" per file and commit. It's likely that the crawl budget Google affords GitHub simply runs out before they've crawled it all, unless they implement special logic to prioritize issues and about-pages.
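
A minimal sketch of what such an exclusion filter could look like, in Python. The function names, the regular expression, and the list of path markers are assumptions for illustration, not the crawler's actual logic; it is a heuristic and will also reject a branch whose name happens to be all hex digits.

```python
import re
from urllib.parse import urlparse

# Assumption: a path segment of 7 to 40 hex characters is treated as a commit hash.
_HEX_SEGMENT = re.compile(r"^[0-9a-f]{7,40}$")

# Assumption: these path markers precede a ref in GitHub-style URLs.
_REF_MARKERS = {"commit", "commits", "blob", "tree", "blame", "raw"}


def looks_like_commit_url(url: str) -> bool:
    """Return True if a ref marker in the path is followed by a hex-looking segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Walk adjacent pairs: (marker, candidate hash).
    for marker, candidate in zip(segments, segments[1:]):
        if marker in _REF_MARKERS and _HEX_SEGMENT.match(candidate.lower()):
            return True
    return False


def filter_crawl_queue(urls):
    """Keep only URLs that do not look like per-commit documents."""
    return [u for u in urls if not looks_like_commit_url(u)]
```

With this heuristic, an issue URL such as the one in the original question passes through, while `/commit/<hash>` and `/blob/<hash>/...` URLs are dropped before they consume crawl budget.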


👤 Thaxll
> but somehow I expected that Google will find my ANY public string on the internet.

Is that the case, though? Does Google index everything? I don't think so.


👤 fxtentacle
I believe it's quite simple: Ad money. Those search queries that cannot reliably be monetized with ads are more or less worthless to Google. So they skip indexing those search terms. That probably also implies that search queries where the majority of users use an ad-blocker (that's us developers) will be a very low priority.

I'm not even sure I can blame them. Creating the search index is a huge investment on Google's side, both in terms of time (can't crawl too fast or else they knock the target offline) and in terms of bandwidth, CPU power, and storage space. But Google needs to operate their search profitably, on average, or else they will go bust. In the long term, I predict that this leads to ever more SEO spam and ever more ads as the internet grows and more and more NLP and AI processing is needed to separate the good results from the bad ones. Filtering gets more expensive => more false positives + need more ads to pay for it.

I've been trying hard (maybe a bit too hard?) to stir up a discussion about people actually paying for the search index, because that would allow them to get the results that they want - private and ad-free - no matter if those results are monetizable with ads or not.

In my opinion, that would make for a nice open source project. In case you're curious:

https://news.ycombinator.com/item?id=30374611

https://news.ycombinator.com/item?id=30361385


👤 shadowgovt
Bing has this indexed, so I'm assuming DDG is finding it from there. (Bing is, unfortunately, also returning results that do not have the string present, which, depending on how one tunes one's success metrics, is worse... Remember the bad old days of sifting through search hits to not find the information the engine claimed was on the page?)

There's not a lot one can glean about the operation of a search engine from one anecdote; any number of things (including the backing store on which the one copy of this data is indexed being temporarily unavailable) can explain a failure-to-find like this.

But this story does point to both an advantage a meta-search like DDG has and a best practice: different search engines will give different answers. Keep several in your back pocket; don't assume Google is omniscient.


👤 bliteben
I suspect this boils down to the fact that these types of searches do not make money.

👤 nshm
It is clear Google has issues with indexing GitHub. A couple of years ago they somehow retrieved the GitHub mobile page instead of the desktop page. As a result, no comments, no READMEs, and no code were indexed at all.

A solution is to move most of the documentation to a self-hosted website, leaving only references in the code itself.


👤 bitwize
If it can't be turned into a purchase by the end user, Google has no reason to index it fully. It's just waste heat that doesn't generate ad revenue.

👤 JimWestergren
Google only indexes a fraction of the public internet.

👤 cma
Microsoft recently started putting GitHub behind a Reddit/Twitter/Instagram-style login wall: comments for open source community projects are truncated unless you log in. Maybe related?