The low quality of results has been a problem for a while now and has become worse lately thanks to all those StackOverflow and Github clones. So I was wondering if we could come together and contribute to a single blacklist hosted somewhere and then import it into each of our browsers. Who knows? We might end up improving the quality of the results we all get.
Lists to get rid of the StackOverflow and Github clones already exist. [1]
I would love to contribute to a project like this, but won't be able to be a maintainer due to time constraints. Would greatly appreciate it if someone could host this. A simple txt file on github would do.
What do you say, HN?
[0]: https://github.com/iorate/ublacklist [1]: https://github.com/rjaus/awesome-ublacklist
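For anyone unfamiliar with the uBlacklist subscription format linked in [0]: a shared list really can be a plain txt file on GitHub, with one match pattern per line. A minimal sketch (the domains below are made-up placeholders, not real clone sites):

```
*://*.example-so-clone.com/*
*://*.example-gh-mirror.org/*
*://copy.example.net/*
```

Each line uses the browser-extension match-pattern syntax, so `*://*.example.com/*` hides every result from that domain and its subdomains.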
A Google search showing some of these leech-type sites:
https://www.google.com/search?q=%22code+that+protects+users+...
For me, "farath.com" is outranking stackoverflow.
[1] https://github.com/darekkay/config-files/blob/master/adblock...
On a much smaller scale, if anyone is interested, I maintain a blacklist focused on those code-snippet content farms that get in the way when you're searching for some error message or a particular function: https://github.com/jhchabran/code-search-blacklist.
Can we just nudge them to do so under the threat of an influential minority leaving due to their use case being affected?
Specifically blocking github clones seems doable. Adding anything else needs equally specific criteria or it will quickly become subjective and unfair.
As another commenter here said, "Google does not make money by helping you find what you are searching, it makes money by keeping you searching". That only works when there is no competition. But if Apple were in the game, people would use whatever presented them with the better results. Right now, I don't feel there is real competition.
https://news.ycombinator.com/item?id=29546433#29549855
and the resulting uBlock Origin list, which is what I'm using as the best solution so far for this problem:
https://github.com/stroobants-dev/ublock-origin-shitty-copie...
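For anyone curious how a list like that works: uBlock Origin can hide individual Google result cards with procedural cosmetic filters. A rough sketch (the domains are placeholders, and the `.g` result-container selector depends on Google's current markup, so treat this as illustrative rather than a working filter set):

```
! Hide Google result cards that link to clone sites (placeholder domains)
google.com##.g:has(a[href*="example-so-clone.com"])
google.com##.g:has(a[href*="example-gh-mirror.org"])
```

Lines starting with `!` are comments; `##` introduces a cosmetic filter scoped to the listed site.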
but it will need curation and updates over time, which I'm not sure the author is willing to do or has the time for.
I find it difficult to believe that relatively beginner NLP projects get posted here all the time, yet no one has adapted that stuff to create a new search index.
Personally I don’t know enough to really do this well, but I can tell that just blocking sites from Google’s results isn’t the way.
https://raw.githubusercontent.com/arosh/ublacklist-github-tr...
I sense that in the near future the paradigm of search engines will go from the current “index everything and become a universal answer engine” to “index a small subset of the Internet and become an answer engine honed towards a specific topic/domain”.
We make $0 doing this, they make...astronomical profits screwing it up. So we invest a bunch of time so they can continue to take in astronomical amounts of money while abandoning "don't be evil"?
Absolutely not.
Google doesn't deserve my time or my eyes.
Imagine you had a position in a huge market that was as close to unassailable as there has ever been. Imagine also that you have a controlling position over the mechanisms that allow people to participate in that market.
Now try to make a case against optimizing for squeezing every last cent out at the cost of the user experience.
In 10 years we will regard Google the way we regard cable companies today. Maybe even worse since we need to be able to search for answers more than we ever needed cable TV.
This is the goal of Entfer (Show HN thread: https://news.ycombinator.com/item?id=29799867)
Entfer will in the future also allow you to bulk export and import your personal rankings, so that they can be shared on GitHub, for example.
Are you only going to filter obvious spam and sites that republish other’s content, or are you going to block sites that are “harmful” or disseminate “disinformation”?
Who will get to decide which media bubble I’m in?
Though the proposed solution borrows heavily from concepts long used against email and Usenet spam, there are a few critical distinctions in SEO SERP[1] spam which make a widely-crowdsourced listing both less applicable and less necessary.
In the case of email, your inbox is an unlimited resource to the spammers --- there's effectively no limit to how much spam they can throw at it. As there is also an effectively limitless set of source addresses (by either domain name or IPv6 address), and because email/Usenet spam is itself a quantity/numbers game with rapidly shifting origins, collectively sourced and curated blocklists have value.[2]
A SERP is itself a finite resource --- the default is to display 10 results, and not making it into the top ten provides little reward. Moreover, a high search ranking takes some time and effort to achieve; it's not like email, where a new server can spin up and immediately start deluging targets.
My experience with annoyances of this sort (stream-based social media is one example) is that blocking a relatively small number of high-profile offenders hugely improves the signal/noise ratio. And I think that will be the case with SERPs as well. There are a half-dozen or so sites which tend to dominate results in most cases, and those can be individually blocklisted (if the capability exists). If more appear, they can similarly be removed.
The other factor is that quite a few sites which some people find exceedingly annoying and spammish, others find appealing. Coming to agreement on what to block, and classifications of such domains / sites, is likely to be difficult and/or contentious. There may be exceptions in specific instances (hence: specific classifications of unwanted results), but less so in the general case.
I might be wrong. The case of DNS adblocking, with Pi-hole as the classic example, shows that very large lists can be compiled and used. My own Web adblock / malware-block configurations have typically had from ~10k to ~100k entries. That said, the really heavy lifting is typically done by a much smaller fraction of the total. Power laws and Zipf functions work to your advantage here.
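That power-law point can be illustrated with a toy simulation (synthetic data, not real SERP measurements): if spam-domain frequencies follow a Zipf-like 1/k distribution, a small handful of top domains accounts for a large share of all sightings, so a short blocklist does most of the work.

```python
import random

random.seed(0)

# Toy model: 1000 spam domains whose frequencies follow a Zipf-like
# power law (domain k appears with weight proportional to 1/k).
num_domains = 1000
weights = [1.0 / k for k in range(1, num_domains + 1)]

# Sample 100k simulated "spam sightings" across SERPs.
sightings = random.choices(range(num_domains), weights=weights, k=100_000)

# Count sightings per domain and sort descending.
counts = [0] * num_domains
for d in sightings:
    counts[d] += 1
counts.sort(reverse=True)

# Share of all sightings covered by blocking only the top 10 domains.
total = sum(counts)
top10_share = sum(counts[:10]) / total
print(f"top 10 of {num_domains} domains cover {top10_share:.0%} of sightings")
```

With a 1/k weighting over 1,000 domains, the top 10 cover roughly 40% of all sampled sightings (the ratio of the 10th to the 1,000th harmonic number), which is why a short, curated blocklist goes a long way.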
________________________________
Notes:
1. Search engine results page, that is, what you see in response to a query.
2. Even in the case of email spam, the principal value comes largely from curated lists, usually maintained by experts, e.g., Spamhaus.