I believe my algorithms are decent, but the biggest problem for Gigablast is now the index size. You do a search on Gigablast and say, well, why didn't it get this result that Google got? And that's because the index isn't big enough, because I don't have the cash for the hardware. Btw, I've been working on this engine for over 20 years and have coded probably 1-2M lines of code on it.
What if instead of even trying to index the entire web, we moved one step back towards the curated directories of the early web? Give users a search engine and indexer that they control and host. Allow them to "follow" domains (or any partial URLs, like subreddits) that they trust.
Make it so that you can configure how many hops it is allowed to take from those trusted sources, similar to LinkedIn's levels of connections. If I'm hosting on my laptop, I might set it at 1 step removed, but if I've got an S3 bucket for my index I might go as far as 3 or 4 steps removed.
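A minimal sketch of the hop limit, assuming the link graph is available through some `get_links` callable (hypothetical here, standing in for a real fetch-and-parse step). All URLs and the toy graph are made up for illustration; a real instance would also need politeness delays, robots.txt handling, and deduplication.

```python
from collections import deque

def pages_within_hops(seeds, get_links, max_hops):
    """Breadth-first walk from trusted seeds; returns {url: hop_distance}."""
    distance = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if distance[url] >= max_hops:
            continue  # at the hop limit: index this page but do not follow its links
        for target in get_links(url):
            if target not in distance:
                distance[target] = distance[url] + 1
                queue.append(target)
    return distance

# Toy link graph standing in for the real web (illustrative only).
TOY_LINKS = {
    "https://trusted.example/": ["https://a.example/", "https://b.example/"],
    "https://a.example/": ["https://c.example/"],
    "https://c.example/": ["https://d.example/"],
}

def toy_get_links(url):
    return TOY_LINKS.get(url, [])

# A laptop-sized index (1 hop) versus a bucket-sized index (3 hops).
laptop_index = pages_within_hops(["https://trusted.example/"], toy_get_links, max_hops=1)
bucket_index = pages_within_hops(["https://trusted.example/"], toy_get_links, max_hops=3)
```

Breadth-first order matters here: it guarantees each page is recorded at its shortest distance from a trusted source, so the hop setting behaves like the "levels of connections" analogy.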
There are further optimizations that you could do, such as having your instance not index Wikipedia or Stack Overflow or whatever (instead using the built-in search and aggregating results).
I'm sure there are technical challenges I'm not thinking of, and this would absolutely be a tool that would best serve power users and programmers rather than average internet users. Such an engine wouldn't ever replace Google, but I'd think it would go a long way to making a better search engine for a single user's (or a certain subset of users') everyday web experience.
While that may be good for most people, there is still a lot of power and utility in simple keyword-driven searches. Sadly, it seems like every major search engine has to follow Google's lead.
It's a bit like the car industry - you could run a startup from your garage in the early days but you need titanic amounts of capital to compete now thanks to vertical integration.
Major governments and billionaires can compete, but everybody else is locked out of the market (most "startups" use Bing's index).
I’ve thought the same about pre-ad Twitter and Facebook.
Early on, startups with free services look a lot like non-profits and just maximize user benefit to grow. The problem is they’re not non-profits, and have to make money at some point. That has tended to mean ads.
I’d easily pay, say, $9/mo to have access to an ad-free search engine that made me feel the way 1999 Google did.
Part of the problem is that there's a lot more low-quality content to wade through now than there was in 2005. I think the Google of 2005 would have trouble delivering quality results today also.
And that specifically blocked Pinterest, Quora, most non-personal “blogs”, etc.
People suggest DDG ! operators, but I don’t want to use a site’s (bad, single-site) search box. I want a multi-site SERP that only displays results from known good sites, which are customizable.
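One way to picture the customizable "known good sites" idea is as a post-filter over whatever result URLs an engine returns. This is only a sketch: `filter_results` and `host_matches` are hypothetical names, and the domain lists are examples, not a recommendation.

```python
from urllib.parse import urlsplit

def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)

def filter_results(results, allowed, blocked=()):
    """Keep only URLs on allowed domains, dropping anything on a blocked one."""
    kept = []
    for url in results:
        host = (urlsplit(url).hostname or "").lower()
        if any(host_matches(host, d) for d in blocked):
            continue
        if any(host_matches(host, d) for d in allowed):
            kept.append(url)
    return kept

# Example result list (made-up URLs on real domain names).
results = [
    "https://www.pinterest.com/pin/1",
    "https://docs.python.org/3/library/urllib.parse.html",
    "https://notpython.org/fake",
    "https://en.wikipedia.org/wiki/Web_crawler",
]
serp = filter_results(
    results,
    allowed=["python.org", "wikipedia.org"],
    blocked=["pinterest.com", "quora.com"],
)
```

Matching on the dot-prefixed suffix rather than a plain substring keeps look-alike hosts such as `notpython.org` out of the allowed set.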
Also, Gmail used to have the best spam filters out there; now it's utter crap. Emails from my Google Analytics account, for whatever reason and regardless of how many times I have clicked "Not Spam", go to spam, and it's their own service, while messages that are textbook spam ("Hi, I just got some inheritance ...") go to my inbox.
AI (in its current state) is crap. When is the industry going to accept that these are the emperor's new clothes?
Thus, the Google of today, which is optimized to extract that money from us.
We also need to be aware that when we remember past times, the memory usually carries a romantic, nostalgic note. The web is very different than it was 15 years ago, and the problem of search has evolved.
What you are looking for is basically 'grep for the web', but that is just one facet of search as we use it today. 15 years ago you would not get an instant answer to a question like you do now, and many users could not live without that today. There are also maps and location-based answers, all sorts of widgets like translation, etc. Also, the world has become more polarized, so an objective best search result has become more difficult to produce, especially for events covered in the news, which means bias inevitably starts to creep in.
This is not to say that Google is good or bad today; it is what it is, and they are doing the best they can. Startups like ours see an opportunity in the market, in large part to help savvy users find what they want.
[1] https://kagi.com
“Information Neutrality is the principle to treat all information provided (by a service) equally. The information provided, after being processed by an information-neutral service, is the same for every user requesting it, independent of the user’s attributes, including, e.g., origin, history or personal preferences and independent of the financial or influential interest of the service provider, as well as independent of the timeliness of information."
I wrote about this in relation to search [0]. We need to be allowed more freedom to choose search engines and services. One (default or selected) choice for search is unhealthy. We shouldn't have to choose between Google or Bing; DuckDuckGo or Startpage; Brave or Ecosia; Mojeek or Gigablast ... Personally I use all 8 of these and more, as also explained in [0].
[0] https://blog.mojeek.com/2021/09/multiple-choice-in-search.ht...
https://www.burda.com/en/news/cliqz-closes-areas-browser-and...
https://news.ycombinator.com/item?id=23031520
https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...
Also, user preferences have changed in the last decade or so. I know millennials and users in their late 30s or early 40s still yearn for the old web, where they would type a search term and the correct results would astonish them. However, younger users tend to gravitate to videos, and that is why a large portion of Google's results are now video results.
It is called Poe's law, and Google returned it at #4. Bing and DuckDuckGo don't have a clue...
2) They have years of user data: for a specific term, they can see what users clicked most, and so which results were perceived as most relevant. It is hard to catch up if you don't have such data.
3) They developed anti-spam tools over years of fighting SEO spammers.
Search engine isn’t singular, it’s plural.
(1) Search engine for something I know exists.
(2) Search engine for finding something new.
There’s a market for both, but you don’t have to solve both problems with the same product.
Sometimes I switch to Google for the former, but the latter works well enough for me that I don’t care what else Google would’ve shown me.
More often than not, my feeling is Google would only have shown me more ads in addition to whatever I could already find elsewhere.
Now that Google exists, you can't create another one. There's only room for one.
Another thing is the rise of "content sites", like this one (Hacker News). I'm sure YCombinator doesn't like getting hit by dozens of crawlers. The impulse to ban everything that crawls except (Google|Bing|Baidu|VK) is too great.
A lot of alternative suggestions are being thrown into this discussion. Let me throw in mine: Reverse the concept of the "crawler". Instead of following links around the internet randomly, require sites to register with you and request to be crawled and/or submit a sitemap. It would be hard to get started, but once something like this gained momentum, I believe that there's room for several of these reverse-search-engines to compete.
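As a rough sketch of the "sites submit a sitemap" idea: the standard sitemap format (the `urlset`/`url`/`loc` elements and the sitemaps.org namespace) is real, but `urls_from_sitemap`, the registration step, and the example host are assumptions made for illustration. The host check is one possible safeguard so a registrant cannot queue pages on someone else's site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text, registered_host):
    """Return the <loc> URLs from a submitted sitemap that belong to the registered host."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if urlsplit(url).hostname == registered_host:
            urls.append(url)
    return urls

# A submitted sitemap (made-up site) that also tries to list a foreign URL.
submitted = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://blog.example/</loc></url>
  <url><loc>https://blog.example/posts/search</loc></url>
  <url><loc>https://other.example/not-mine</loc></url>
</urlset>
""".strip()

crawl_queue = urls_from_sitemap(submitted, registered_host="blog.example")
```

With this shape, the engine only ever fetches what a site has asked it to fetch, which is the reversal the comment describes.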
Just let me type stuff into the search box -- including typo corrections and modifications to what I'm searching for -- and hit ENTER to start the actual search.
When I'm ready to start my search I'll hit the fucking ENTER key. Stop annoying me with your stupid assumptions about what I'm looking for.
This ONE THING is why I switched to Webcrawler.com two years ago. I type in five or ten words with ZERO craptastic guesses flashing around on my screen, hit ENTER, and THEN it returns what I'm looking for.
Even in 2021, despite how bad it's become, it's still miles ahead of other competitors.
In that era, Google would return a match based on words that appear in the links to a URL but not in the article itself, meaning that it was easy to produce "Googlebombs". For example, from 2003 to 2007 the top hit for "miserable failure" was the official White House biography of George W. Bush.
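To see why matching on link text makes this kind of bombing easy, here is a toy anchor-text index. It is a deliberately naive sketch, not Google's actual ranking: `build_anchor_index` and `top_hit` are hypothetical names, the URLs are invented, and real engines weigh many more signals.

```python
from collections import Counter, defaultdict

def build_anchor_index(links):
    """links: iterable of (source_url, anchor_text, target_url) tuples."""
    index = defaultdict(Counter)
    for _source, anchor_text, target in links:
        phrase = " ".join(anchor_text.lower().split())  # normalize case and spacing
        index[phrase][target] += 1
    return index

def top_hit(index, query):
    """Return the URL most often linked with this exact phrase, or None."""
    phrase = " ".join(query.lower().split())
    ranked = index.get(phrase)
    if not ranked:
        return None
    return ranked.most_common(1)[0][0]

# Many unrelated pages agree to use the same link text for one target.
links = [
    ("https://blog1.example/", "miserable failure", "https://target.example/bio"),
    ("https://blog2.example/", "Miserable  Failure", "https://target.example/bio"),
    ("https://blog3.example/", "miserable failure", "https://target.example/bio"),
    ("https://essay.example/", "miserable failure", "https://dictionary.example/failure"),
]
anchor_index = build_anchor_index(links)
bombed = top_hit(anchor_index, "miserable failure")
```

The target page never has to contain the phrase; a handful of coordinated links is enough to outvote the one honest use of it.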
See https://www.screamingfrog.co.uk/google-bombs/ for some of the "better" ones.
I hear HN constantly crying over its deteriorating quality, but I am not noticing it that much; not better, not worse, it just does its job.
Creating an '05 Google easily takes billions of dollars and years of investment before people will take you seriously.
The only reason we haven't gotten an '05 Google can be that it is not profitable. An attempt by some nation state to demonopolize the search engine business might work, but I wouldn't expect any for-profit organization to attempt this lightly, let alone individual hobbyists.
The parent company, Tiscali, was a huge hit in the 1990s, as it provided internet access to millions of Italians. It went through some struggle for several years, but lately the original founder, Renato Soru, came back to run the company.
The company is based in Cagliari, the capital of Sardinia, Italy.
Imo, 2005 Google got its initial traction because of its indexing of tech forum posts; as I remember, I switched to it because it became an extension of, and then a replacement for, manpages. In that sense, what made it good was that it reflected the consensus of what its incredibly influential userbase thought was important, and it just managed that really well. The demographic impact of the U.S. Gen X all using it at once didn't hurt either.
The equivalent today, as a lot of us say, is that blockchains are in the 1997 internet phase, and the service that makes the content of those as navigable as the 90's internet, will likely grow in a similar way.
Search that provides young people with privacy and freedom to pursue their true interests will be the dominant strategy. Its success will be because it's a product that rides growth, and not because it "solved a problem." Imo, we all index too much on the privacy pattern because the freedom pattern is too risky.
What's changed since that time are the maturity of things like Bloom and other probabilistic filters, Apple's private set intersection, differential privacy, zksnarks, and everybody you'd ask an opinion from now gets their content through mobile devices. Apple's ecosystem is equipped to do this kind of search, but they're too exposed politically to get into it. Meta will likely go there, but nobody's going to trust them willingly.
A protocol that generated a cryptographically strong anonymous index from your browsing - and instead of putting it on Google's servers, it was on a chain, or the content index information and its evolving consensus score was included in something like a DNS record - may still unseat these ensconced interests. IPFS and other P2P or torrents might do something like that as well. Blockchains may be good for that consensus/desire score.
It's not something you architect and design top down that has to solve all cases, it will be just another useful product that grows while riding a demographic change. It would be on the level of inventing HTML/HTTP again, which, when you think about it, was just another dude making a thing he needed.
Rather than being told "No, there are only eight pages of results on anything in the goddamned world. Really. Would I lie to you?"
Gigablast Search Engine - https://news.ycombinator.com/item?id=29421898 - Dec 2021 (10 comments)
* Don't use JS
* Don't use Google analytics
* Don't weigh more than a few kB per page
* Don't show any sites with ads
That would be a place to begin.
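A starting point for the criteria above could look like the following. This is a sketch under assumptions: `is_lightweight_page` is a hypothetical name, the 8 kB default is an arbitrary reading of "a few kB", and the check treats "no script tags" as a proxy for "no JS, no analytics, no script-served ads", which will miss ads delivered as plain images or iframes.

```python
from html.parser import HTMLParser

class ScriptDetector(HTMLParser):
    """Records whether any <script> tag appears in the document."""

    def __init__(self):
        super().__init__()
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.has_script = True

def is_lightweight_page(html, max_bytes=8 * 1024):
    """True if the page is small and contains no <script> tags at all."""
    if len(html.encode("utf-8")) > max_bytes:
        return False
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return not detector.has_script

plain_page = "<html><body><h1>Notes</h1><p>Just text.</p></body></html>"
tracked_page = (
    "<html><head><script src='https://www.googletagmanager.com/gtag/js'></script>"
    "</head><body>Hi</body></html>"
)
heavy_page = "<html><body>" + "x" * 20000 + "</body></html>"
```

An indexer applying this would simply skip any fetched page for which `is_lightweight_page` returns `False`.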
Because the universe being searched isn't the internet of 2005 and earlier, and because user expectations have moved on, too.
Plus the index expense.
For example if my search term appears in the URL I can almost guarantee I don’t want that page.
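That rule of thumb could be expressed as a re-ranking pass. A minimal sketch, assuming results arrive as a list of URLs already in relevance order; `demote_url_matches` is a hypothetical helper and the URLs are invented.

```python
def demote_url_matches(query, urls):
    """Push URLs that contain query terms toward the bottom, keeping order otherwise."""
    terms = query.lower().split()

    def url_hits(url):
        lowered = url.lower()
        return sum(term in lowered for term in terms)

    # sorted() is stable, so URLs with equal hit counts keep their original order.
    return sorted(urls, key=url_hits)

reranked = demote_url_matches(
    "best laptop",
    [
        "https://spam.example/best-laptop-2021",
        "https://review.example/post/4821",
        "https://shop.example/laptop",
    ],
)
```

This demotes rather than drops, since a term in the URL is only a hint that the page was written for the keyword.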
I'd gladly pool in some of my CPU time if it helps build a better search.
Well, it's Wikipedia. So just create a search engine for that, since their search sucks rocks.
Knuth's "Sorting and Searching" volume desperately needs an update.
Ask HN: Has Google search become quantitatively worse?
https://news.ycombinator.com/item?id=29392702
Inviting all the paranoid/speculative/hearsay/personal experience responses. Lame Ask HNs!!!!!
DDG is pretty useless though unfortunately.