What are these low quality “code snippet” sites?
Whenever I try to google a code issue I have, there are countless low-quality sites just showing SO threads with no added value whatsoever.
It is so annoying it actually drives me mad.
Does anyone know what's up with that?
I am really disappointed because the guys creating these sites (I guess for some kind of monetization) must have some relation to coding. But I feel this is an attack against all of us. Every programmer should be grateful for the opportunity to find good-quality content quickly. Now my search results are flooded with copy-and-paste from SO.
They are killing that.
Am I the only one experiencing this or being that annoyed by it?
P.S: I don't name URLs because if you don't know what I am talking about already, you probably don't have that issue.
For years now I've run a programming site (stackabuse.com) and have closely followed the state of Google SERPs when it comes to programming content. A few thoughts/ramblings:
- The search results for programming content have been very volatile for the last year or so. Google has released a lot of core algorithm updates in the last year, which has caused a lot of high-quality sites to either lose traffic or stagnate.
- These low-quality code snippet sites have always been around, but their traffic has exploded this year after the algorithm changes. Just look at traffic estimates for one of the worst offenders - they get an estimated 18M views each month now, which has grown almost 10x in 12 months. Compare that to SO, which has stayed flat or even dropped in the same time-frame.
- The new algorithm updates seem to actually hurt a lot of high-quality sites as it seemingly favors code snippets, exact-match phrases, and lots of internal linking. Great sites with well-written long-form content, like RealPython.com, don't get as much attention as they deserve, IMO. We try to publish useful content, but consistently have our traffic slashed by Google's updates, which end up favoring copy-pasted code from SO, GitHub, and even our own articles.
- The programming content "industry" is highly fragmented (outside of SO) and difficult to monetize, which is why so many sites are covered in ads. Because of this, it's a land grab for traffic and increasing RPMs with more ads, hence these low-quality snippet sites. Admittedly, we monetize with ads but are actively trying to move away from it with paid content. It's a difficult task as it's hard to convince programmers to pay for anything, so the barrier to entry is high unless you monetize with ads.
- I'll admit that this is likely a difficult problem because of how programmers use Google. My guess is that because we often search for obscure errors/problems/code, their algorithm favors exact-match phrases to better find the solution. They might then give higher priority to pages that seem like they're dedicated to whatever you searched for (i.e. the low-quality snippet sites) over a GitHub repo that contains that snippet _and_ a bunch of other unrelated code.
Just my two cents. Interested to hear your thoughts :)
Somewhat tangential but I believe Google Search is going downhill, they seem to be losing the content junk spam SEO fight. Recently, I've had to append wiki/reddit/hn to queries I search for because everything near the top is shallow copied content marketing.
Not only SO threads, I particularly hate the ones that mirror GitHub Issues. They don't even link back to the original thread, for Christ's sake!
Sites that auto-translate original SO threads and pretend it's their original content are the worst. Google sometimes prefers to give me those results instead of the actual SO thread because I'm not in an English-speaking country. I have to waste time figuring out that it's just a stolen SO thread. And it's not even that useful, because some of them AUTO TRANSLATE THE CODE.
IMHO the biggest offender:
https://www.geeksforgeeks.org/
They are driving me insane with the modal login demand. I wonder if Google has downgraded the authoritative standing of Stack Overflow?
Just to chip in with a very minor annoyance, I hate how Google puts w3schools results above MDN for anything related to JS/HTML.
Really ticks me off that Google allows itself to be so easily gamed, it's your core business for christ's sake.
The answer is, if someone can make money by doing something shitty but not illegal, they will do it.
Almost everything on the web is some scheme to put ads in your face so someone can make some money.
My pet peeve is ApiDock, which has managed to SEO itself so high up the rankings when searching for anything connected to Ruby or Rails that it is actually quite difficult to get to the legitimate, official documentation.
What's worse is most of the results are outdated so you're looking at web-scraped API docs for Rails 3 or something.
Really frustrating.
It’s an easy way to make money. Scrape a popular site like Stack Overflow or Wikipedia and add a bunch of advertisements.
One of the many ways that scum ruin the web.
I really hate these. Especially when I'm trying to figure something out and I'm struggling to find answers, I end up haplessly finding the exact same wrong answer on three different sites.
An index of these sites would be helpful for mass-blacklisting them with the uBlacklist extension.
Anyone up for creating one so everyone can contribute to it?
The extension allows subscribing to blacklists via links, so a single txt file would be enough.
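If I remember the format right, a uBlacklist subscription file is just plain text with one match pattern per line (the domains below are placeholders, not an actual list of offenders):

```
*://*.example-snippet-clone.com/*
*://*.another-so-mirror.net/*
```

Host that file anywhere with a raw URL (e.g. in a GitHub repo) and point the extension's subscription settings at it, and everyone gets the updates.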
I spoke to a VP at Google in 2006 in London and discussed using a combination of curation and entropy to flush out duplicates. He seemed pretty excited by the idea, but I don't think anything materialized. Which is another way of saying this is not new - in those days these sites were copying newsgroups too.
Well, as I wrote, I understand that they try to monetize it.
But: why the sudden explosion?
I feel like more of these sites are going live all the time.
Many times they make up 80% of the first pages of search results, just repeating the SO threads listed above them.
So it‘s really getting difficult.
Something must be done…
The biggest problem is that they waste your time even when the content is ostensibly helpful, since the search result is usually listed after the Stack Overflow page it crawled from anyway. Each click steals a few seconds of developers’ time, which adds up, given how frequently these types of results pop up on Google search lately. That makes them worse than useless: they actively subtract value.
Not related to coding, but I've noticed a lot of "best of" and "top ten" sites that appear to be of the same ilk, possibly automated, that just combine pictures and paragraphs ranging from ad copy to pure drivel. On topics ranging from bicycles to Linux distros.
Important tidbit: SO's content is CC-licensed and this is probably completely legal (apart from those who fail to add a link to the original). Not that I don't want those sites to burn in hell, but they are not even in a grey zone legally.
As others have said: SO content is ripped-off (poorly) and mirrored. The page games Google's algorithm and shows up as a 'legit' result.
Probably more complex than simple keyword stuffing, which isn't supposed to work these days.
Scrape-and-paste is one of the easiest ways to make significant money if it takes off, and that’s why these sites are made.
I think Google does well in general with coding or SO questions, but will show you these low-quality sites when the questions are new, or very specific and difficult to answer. Maybe it's time to apply your head more.
*been on both sides
People are crawling content that is searched for frequently, then using SEO to rank higher in the results than the original content to make money from the ad revenue.
Code and recipes are two examples.
I'm also seeing politicians posting Tweets containing a link to their personal website, which has ads.
I've found that Bing does a better job at detecting spam like this. Not perfect, just better.
I’ve had to switch to !py on ddg because the official Python docs never make the first page. It’s really frustrating. :/
I wonder if there's a market for a software-engineering-specific search engine. Skip the shitty content farms, include code from open source projects, and potentially be smarter about finding package usages.
I've noticed that Google Alerts for my open source projects have been useless for years. Full of snippet sites as well as outright scam sites which take code from SO or my blog or just mixed up tech words and repost it.
Here's the Google Alert from yesterday (scammy URLs redacted):
Guestmount qcow2 - Casino en ligne fiable
It uses libguestfs for access to the guest filesystem, and FUSE (the
``filesystem in userspace'') to make it appear as a mountable device.
Stdin 1 libguestfs error usr bin supermin exited with error status 1 -
Aritco
Since libguestfs 1. sudo apt-get install libguestfs-tools mkdir sysroot #
Just a test file. Supermin and Docker-Toolbox #14. DIF/DIX increases the ...
Edit Qcow2 Image - A-ONE HEALTH BRIDGE
The libguestfs is a C library and a collection of tools on this library to
create, view, access and modify virtual machine disk images in Linux. img
Another thing I've noticed recently is that a lot of queries about computer graphics, especially tied to Unity's render pipelines, bring up what look like blog posts full of code snippets but the actual "article" seems nonsensical and impossible to follow. I suspect they are machine translated and they're really annoying.
edit: after doing a single Google search for "urp rendercontext" I found this: https://programmerall.com/article/71251053239/
Looking at it closely there seems to be some red thread and the images and code snippets do seem to follow a logical progression, but the text itself is a complete mess. I can tell it sometimes references things from the code snippets and hints at things I can see in the image, but it's certainly not informative.
Their site description says "Programmer All, we have been working hard to make a technical sharing website that all programmers love." I'm sorry, but I really don't.
I am with you on this. Lately I've noticed that I google for an issue, find a low-quality site with relevant results, and later discover that it's just a copy of the GitHub issues page from the original project. Why didn't the GitHub issue make it to the first page of Google, and why did this crappy knockoff, with no link back to the source material, beat it? So frustrating.
Just putting this out there... try brave search. The best answers from stackoverflow etc are all snippets and their results are getting better and better every day. I got sick of google after they made the BERT update. Really happy I switched (except for google maps data. google is still winning that game)
I mean, while we're at it, can we get rid of blogspam? Try googling for instructions for installing cellulose insulation. It takes AGES to find a site that isn't just garbage vague content. It should be possible to detect and demote this stuff. It is so obvious.
One thing that makes SO an easy target for this is that they let you download all their data and you don't even need to crawl and scrape the content from the website. Just download a dump, put it in an database, slap an HTML template on top of it, splash a few ads, and boom.
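To illustrate how low the barrier is, here is a minimal sketch of that dump-to-site pipeline. The sample XML is invented, but it mimics the flat `<row>` attribute shape of the Posts.xml file in the real Stack Exchange data dumps (questions have `PostTypeId="1"`, answers `PostTypeId="2"` with a `ParentId`):

```python
# Hypothetical sketch: turn a Posts.xml-style dump into static HTML pages.
import xml.etree.ElementTree as ET
from html import escape

SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I reverse a list in Python?"
       Body="&lt;p&gt;What is the idiomatic way?&lt;/p&gt;" Score="42"/>
  <row Id="2" PostTypeId="2" ParentId="1" Score="17"
       Body="&lt;p&gt;Use &lt;code&gt;list(reversed(xs))&lt;/code&gt;.&lt;/p&gt;"/>
</posts>"""

def render_question_pages(posts_xml: str) -> dict[int, str]:
    """Return {question_id: html_page} for every question in the dump."""
    root = ET.fromstring(posts_xml)
    questions, answers = {}, {}
    for row in root.iter("row"):
        attrs = row.attrib
        if attrs["PostTypeId"] == "1":  # question row
            questions[int(attrs["Id"])] = attrs
        else:                           # answer row, keyed by its question
            answers.setdefault(int(attrs["ParentId"]), []).append(attrs)
    pages = {}
    for qid, q in questions.items():
        # Bodies in the dump are already HTML, stored in the attribute.
        parts = [f"<h1>{escape(q['Title'])}</h1>", q["Body"]]
        for a in sorted(answers.get(qid, []), key=lambda a: -int(a["Score"])):
            parts.append(a["Body"])
        pages[qid] = "\n".join(parts)
    return pages

pages = render_question_pages(SAMPLE_POSTS_XML)
print(pages[1])
```

A few dozen lines like this, plus an ad script tag in the template, is the entire "product" of these sites.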
The worst of all are those websites that only show the content to the search engine. When you click the link to their webpages, you only see random text that has nothing to do with the search result at all. There must be some really narcissistic programmers behind these.
At least the SO clones will still probably have content that you can make use of in some way; what's worse is when you search for an error code and you get back tons of pages that don't even have the exact code you searched for, which seems to be increasingly common recently.
It also used to be the case that you could dig into the second, third, ... sometimes even 20-30 pages in and hit the jackpot. Now, the results are even less relevant there, you soon get to "the end", and if you change your query slightly and try again a few times to search harder for what you want, you'll get hit with the unsolvable CAPTCHA hellban.
Same thing with various GitHub issues lookalikes.
There was an SO outage this last year, and I only found that out when I tried to go to SO and couldn't get to it. I checked page 2 of Google and found one of the mirrors that you're talking about. I grabbed the content from there and continued with work.
I think it's a matter of perspective. If you _know_ that you want a specific site, use google's 'site:' specifier. If you're looking and find something that is from SO, redo the search and get to the SO Q/A. As for me, I'm moderately grateful for the decentralized backups.
The issue is actually pretty old. There was a time when Google introduced blacklisting of search results and revenue of those sites dived. Sadly, later Google rolled back the blacklist.
All user-contributed content on Stack Overflow is under a CC-BY-SA license. So what the sites are doing is allowed under the license, as long as they're providing attribution.
Is it annoying? Sure. But neither Stack Overflow nor the authors of the content can do anything about it since they gave away a license to do it.
One of the things you have to accept when you release something under an open-source or Creative Commons license is that other people can take it and use it in ways that you don't like.
They get fed into a web crawler and then into a giant hopper whence they become the backbone of that shiny "No Code" technology you've been hearing about.
I have this problem, and contrary to a lot of people I don't protect a lot of my PI from Google. Google used to be good at giving me stuff I wanted in ads, especially in Gmail, but they don't really anymore. You would think that the more they knew about you, the better results they could give you, so maybe a large-scale test should be done: if Google knows your PI, do you get better or worse results, or doesn't it matter at all?
I suspect they were always there, but google and ddg are getting gamed more now. The quality of results has dropped quite a bit in the past 4 or 5 years in this regard.
If I recall correctly, the Stack Overflow dataset is open source, or at least made available to download, so I assume all these sites just download that data regularly.
Does anyone know of a good list of these copy sites? I just came across this Firefox extension which makes it possible to filter sites from search results: https://addons.mozilla.org/en-US/firefox/addon/hohser/. A community blocklist, like those for Pi-hole, would be great.
Same here; I even use an extension mainly to hide this stuff.
Not only mirroring SO, but also its siblings (like Server Fault and Ask Ubuntu), and others like GitHub.
But the most annoying part is that it keeps showing that mirrored and machine-translated stuff, which offers little to no benefit to me, and I've already been force-trained enough to identify it at first glance.
It even shows up when I'm searching in other languages, ahrr.
edit: formatting
- 1990 no big data, no data, google indexing a porn and no ads and black hack market, no open source code, no seo articles, no market, bbs only - $
- 2000 censorship, business, big data and ads ads - $$
- 2010 code learning projects, quora, reddit, iphone, spam indexing and seo ads ads ads ads - $$$
- 2020 ai indexing everywhere, no-code indexing and code is a porn of no-code now so ads and ads ads ads seo seo seo - $$$$$$
- 2030 profit $googleplex?
Plugging my own FF/Chrome browser extension that lets you add domains you want to block and will simply prepend matching text links with an angry emoji and prompt you to confirm whether you want to visit the page or not:
https://github.com/fnune/nay
This reminds me of Yahoo! Answers clones from 10+ years ago. To get traffic and cheat the search engine, they would index the Yahoo! Answers site for a specific niche category, create a garbage website with the questions and answers without crediting the source, and cram the site with ads everywhere to earn massive revenue.
I believe Google have hit a sweet spot (for them) where they can keep you browsing a specific topic for a long time while still showing you mildly interesting results. Since the results are consistently on topic, you are shown ads that are interesting to you time and time again, which results in a lot of clicks and a lot of revenue.
This is why I often search for solutions directly on Stack Overflow, and not via Google. Or I add "site:stackoverflow.com" to my search. Generally SO has all I ever need... I find vendor forums to be a total wasteland for help (i.e. Power BI forums), so I don't need them as part of Google results.
> Every programmer should be grateful for the opportunity to find good quality content quickly
Totally. There should be a better way to index SO.
you.com seems to try doing it that way. For most code issues, it's easier to navigate and decide what's worth reading from You than from google IMO.
Recently I've found duckduckgo to provide better coding results than Google, which really surprised me. I was only using duckduckgo at home for privacy, and Google at work because "best tool for the job", but I think that might not be the case anymore.
It's quite simple. SO has a huge easily indexable database of answers, and SEO scammers can make a quick buck by copying it all and making it seem like they have an answer for unanswered questions. Nothing to see here, blame your search engine.
I wonder if you are logged in to Google and allow your search history to be saved (assuming you're talking about Google search)? Because I don’t have that kind of problem, and I know Google uses your search history to improve your personal results.
I’ve got into the habit of clicking the 3-dot icon next to the search entry (often number 1 in Googles’s results) and reporting these sites as scrapers, stealing content from SO.
Maybe if we all did this, Google might eventually take notice?
It will probably take some time, but sites such as roseindia and expertsexchange also clouded search results in a similar manner. They are now history because Google and others deranked them to the depths of hell.
I do a lot of this sort of search
I have no idea what you are talking about (except for Apple's efforts at astroturfing, but that is not what you mean, I think).
I use DuckDuckGo, is that why this does not bother me?
I have not used Google for search for years.
Spammers. They mirror SO stuff on their own sites and put google ads on them
Google search is going downhill in my experience.
My question: What is the alternative?
Reminds me a little of an oldie but a goodie: expertsexchange.com
Annoyed by that too. Weirdly enough, they pop up for certain queries and not for others. I have also seen that for GitHub issues.
But I switched to https://www.ecosia.org/ as my main search engine, and I like the results much more than on Google - nothing special, but somehow more reliable/predictable. And meanwhile you plant some trees :P
It's for all kinds of sites lately. The uBlacklist extension has solved it for me - one click and you can remove an entire domain from future results.
This is just because Google doesn't need real search anymore. They're now the portal. Market cap is what drives them, not some geeks' needs.
Why don't you name the URL? Share so we know to avoid. It is not like we are going to dox the guy or something.
It is just Google, with its amazing algorithms that rank established websites way higher than a random spam website.
They are doing that because it's technically not violating the license the way they do it.
I hate those websites which just proxy github or npm with a different stylesheet so much.
The content farms get ahead of the organic results in many other areas too. Search for programming questions isn't so bad; at least the garbage is easy to recognize. Queries about products and services probably have the worst results.
Maybe code snippets are "enriched" in these sites?
This is an obvious case of SEO spam. But there are tons and tons of other examples worth mentioning.
For example, many news sites have soft paywalls that can easily be circumvented with a few clicks. The reason they don't have an _actual_ paywall is likely to come up in search results. So essentially they spam search results and obscure the content for technically illiterate users instead of just paying for ads. They want to have their cake and eat it too.
Now this whole dynamic is super weird. We often talk about these issues as if Google was some kind of public service that should make useful and fair search suggestions. Sure they have the incentive to do so, but they have conflicting interests at the same time.
Baeldung is the worst offender for the Java ecosystem.
stackoverflow.com doesn't have google ads. Those copy sites do. What is google's motive to fix that?
This shit is just the same as Quora, Pinterest, W3Schools, and ApiDock.
site:stackoverflow.com
That's what I do.
Is it bad that I've actually found my answer on some of these sites haha. But yeah, they're pretty low quality in general.
This is more of an issue with google results than the content itself.
Google is a shit product and you get shit results when you use it.
Just whitelist Stack Overflow in your head and avoid splogs / spamblogs?