The low quality of results has been a problem for a while now and has become worse lately thanks to all those StackOverflow and Github clones. So I was wondering if we could come together and contribute to a single blacklist hosted somewhere and then import it into each of our browsers. Who knows? We might end up improving the quality of the results we all get.
Lists to get rid of the StackOverflow and Github clones already exist. [1]
I would love to contribute to a project like this, but won't be able to be a maintainer due to time constraints. Would greatly appreciate it if someone could host this. A simple txt file on github would do.
What do you say, HN?
[0]: https://github.com/iorate/ublacklist [1]: https://github.com/rjaus/awesome-ublacklist
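For anyone unfamiliar with the uBlacklist subscription format linked in [0]: a shared list really can be a plain txt file on GitHub, with one match pattern per line. A minimal sketch (the domains below are made-up placeholders, not real clone sites):

```
*://*.example-so-clone.com/*
*://*.example-gh-mirror.org/*
*://copy.example.net/*
```

Each line uses the browser-extension match-pattern syntax, so `*://*.example.com/*` hides every result from that domain and its subdomains.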
A Google search showing some of these leech-type sites:
https://www.google.com/search?q=%22code+that+protects+users+...
For me, "farath.com" is outranking stackoverflow.
[1] https://github.com/darekkay/config-files/blob/master/adblock...
On a much smaller scale, if anyone is interested, I maintain a blacklist focused on those code-snippet content farms that get in the way when you're searching for some error message or a particular function: https://github.com/jhchabran/code-search-blacklist.
Can we just nudge them to do so under the threat of an influential minority leaving due to their use case being affected?
Specifically blocking github clones seems doable. Adding anything else needs equally specific criteria or it will quickly become subjective and unfair.
As another commenter here said, "Google does not make money by helping you find what you are searching, it makes money by keeping you searching". That only works when there is no competition. But if Apple were in the game, people would use whatever presented them with the better results. Right now, I don't feel there is real competition.
https://news.ycombinator.com/item?id=29546433#29549855
and the resulting uBlock Origin list, which is what I'm using as the best solution so far for this problem:
https://github.com/stroobants-dev/ublock-origin-shitty-copie...
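For anyone curious how a list like that works: uBlock Origin can hide individual Google result cards with procedural cosmetic filters. A rough sketch (the domains are placeholders, and the `.g` result-container selector depends on Google's current markup, so treat this as illustrative rather than a working filter set):

```
! Hide Google result cards that link to clone sites (placeholder domains)
google.com##.g:has(a[href*="example-so-clone.com"])
google.com##.g:has(a[href*="example-gh-mirror.org"])
```

Lines starting with `!` are comments; `##` introduces a cosmetic filter scoped to the listed site.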
but it will need curation and updates over time, which I'm not sure the author is willing to do or has the time for.
I find it difficult to believe that relatively beginner NLP projects get posted here all the time, yet no one has adapted that stuff to create a new search index.
Personally I don’t know enough to really do this well, but I can tell that just blocking sites from Google’s results isn’t the way.
https://raw.githubusercontent.com/arosh/ublacklist-github-tr...
I sense that in the near future the paradigm of search engines will go from the current “index everything and become a universal answer engine” to “index a small subset of the Internet and become an answer engine honed towards a specific topic/domain”.
We make $0 doing this, they make...astronomical profits screwing it up. So we invest a bunch of time so they can continue to take in astronomical amounts of money while abandoning "don't be evil"?
Absolutely not.
Google doesn't deserve my time or my eyes.
Imagine you had a position in a huge market that was as close to unassailable as there has ever been. Imagine also that you have a controlling position over the mechanisms that allow people to participate in that market.
Now try to make a case against optimizing for squeezing every last cent out at the cost of the user experience.
In 10 years we will regard Google the way we regard cable companies today. Maybe even worse since we need to be able to search for answers more than we ever needed cable TV.
This is the goal of Entfer (Show HN thread: https://news.ycombinator.com/item?id=29799867)
Entfer will in the future also allow you to bulk export and import your personal rankings, so that they can be shared on GitHub, for example.
Are you only going to filter obvious spam and sites that republish other’s content, or are you going to block sites that are “harmful” or disseminate “disinformation”?
Who will get to decide which media bubble I’m in?
Though the proposed solution borrows heavily from concepts long used against email and Usenet spam, there are a few critical distinctions in SEO SERP[1] spam which make a widely-crowdsourced listing both less applicable and less necessary.
In the case of email, your inbox is an unlimited resource to the spammers --- there's effectively no limit to how much spam they can throw at it. As there is also an effectively limitless set of source addresses (by either domain name or IPv6 address), and because email/Usenet spam is itself a quantity/numbers game with rapidly shifting origins, collectively sourced and curated blocklists have value.[2]
A SERP is itself a finite resource --- the default is to display 10 results, and not making it into the top ten provides little reward. Moreover, a high search ranking takes some time and effort to achieve; it's not like email, where a new server can spin up and immediately start deluging targets.
My experience with annoyances of this sort (stream-based social media is one example) is that blocking a relatively small number of high-profile offenders hugely improves the signal/noise ratio. And I think that will be the case with SERPs as well. There are a half-dozen or so sites which tend to dominate results in most cases, and those can be individually blocklisted (if the capability exists). If more appear, they can similarly be removed.
The other factor is that quite a few sites which some people find exceedingly annoying and spammish, others find appealing. Coming to agreement on what to block, and classifications of such domains / sites, is likely to be difficult and/or contentious. There may be exceptions in specific instances (hence: specific classifications of unwanted results), but less so in the general case.
I might be wrong. The case of DNS adblocking, with Pi-hole as the classic example, shows that very large lists can be compiled and used. My own Web adblock / malware-block configurations have typically had from ~10k to ~100k entries. That said, the really heavy lifting is typically done by a much smaller fraction of the total. Power laws and Zipf functions work to your advantage here.
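That power-law point can be illustrated with a toy simulation (synthetic data, not real SERP measurements): if spam-domain frequencies follow a Zipf-like 1/k distribution, a small handful of top domains accounts for a large share of all sightings, so a short blocklist does most of the work.

```python
import random

random.seed(0)

# Toy model: 1000 spam domains whose frequencies follow a Zipf-like
# power law (domain k appears with weight proportional to 1/k).
num_domains = 1000
weights = [1.0 / k for k in range(1, num_domains + 1)]

# Sample 100k simulated "spam sightings" across SERPs.
sightings = random.choices(range(num_domains), weights=weights, k=100_000)

# Count sightings per domain and sort descending.
counts = [0] * num_domains
for d in sightings:
    counts[d] += 1
counts.sort(reverse=True)

# Share of all sightings covered by blocking only the top 10 domains.
total = sum(counts)
top10_share = sum(counts[:10]) / total
print(f"top 10 of {num_domains} domains cover {top10_share:.0%} of sightings")
```

With a 1/k weighting over 1,000 domains, the top 10 cover roughly 40% of all sampled sightings (the ratio of the 10th to the 1,000th harmonic number), which is why a short, curated blocklist goes a long way.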
________________________________
Notes:
1. Search engine results page, that is, what you see in response to a query.
2. Even in the case of email spam, the principal value comes largely from curated lists, usually maintained by experts, e.g., Spamhaus.