Would you do what DuckDuckGo did, which is to use somebody else's index and ranking, or would you build your own index from Common Crawl? How are Ecosia and Startpage.com able to stay profitable without doing either?
Does that mean we can have many niche search engines? Can we crawl Common Crawl and build an index for less than $10K (what one individual can do out of pocket)?
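A back-of-envelope estimate suggests the answer may be yes. Every figure below is an assumption for illustration (crawl size, page count, spot prices, and throughput are rough guesses, not measured numbers), but even with generous padding the total lands well under $10K:

```python
# Back-of-envelope: can one person index a Common Crawl snapshot for < $10K?
# ALL figures below are assumptions, not measured numbers.

CRAWL_SIZE_TB = 100                 # assumed size of one monthly crawl (compressed WARC)
PAGES = 3_000_000_000               # assumed page count per crawl

SPOT_PER_HOUR = 0.10                # assumed spot price for a mid-size instance, USD
PAGES_PER_INSTANCE_HOUR = 500_000   # assumed parse + index throughput per instance

compute_hours = PAGES / PAGES_PER_INSTANCE_HOUR
compute_cost = compute_hours * SPOT_PER_HOUR

INDEX_RATIO = 0.1                   # assume the inverted index is ~10% of raw crawl size
storage_tb = CRAWL_SIZE_TB * INDEX_RATIO
STORAGE_PER_TB_MONTH = 20           # assumed block-storage cost, USD per TB-month
storage_cost = storage_tb * STORAGE_PER_TB_MONTH  # one month of serving

total = compute_cost + storage_cost
print(f"compute ~${compute_cost:,.0f}, storage ~${storage_cost:,.0f}/mo, total ~${total:,.0f}")
```

Under these assumptions compute is a few hundred dollars and storage a couple hundred per month, so even if the guesses are off by 5x you stay inside a $10K budget. The hard part is quality of ranking, not raw cost.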
I think my takeaway from the last 10 years is that a lot of the info on websites that was written by real people has disappeared, and you have a lot of spammy blogs and heavily commercial approaches. Most of the real info has gone into Facebook groups, Quora, Reddit comments, Slack, Twitter, and so on.
The problem is those are all closed-door ecosystems in a lot of ways, and the lasting knowledge is hard to separate from the ephemeral messaging around it.
I think if I were going to approach this, I would build software that users run, or browser add-ons that let users tag and save information in some kind of format, which then contributes to a searchable knowledge base.
For example, I am a member of several FB groups for expats where I live. There are great pieces of wisdom and hard-to-find info in there. I'd love to be able to say, with a Chrome extension, "save this, and here is a little context" (or, even better, if it could infer the context from the post format).
Then try to figure out how to make that public and searchable.
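The extension's save format could be something as simple as the record below. This is a hypothetical schema, not anything that exists; the URL and field names are made up for illustration:

```python
# A hypothetical record format for snippets saved by the browser extension.
from dataclasses import dataclass, asdict, field
import json
import time

@dataclass
class SavedSnippet:
    url: str                # where the snippet was found
    text: str               # the content worth preserving
    context: str = ""       # the user's one-line note
    tags: list = field(default_factory=list)
    saved_at: float = field(default_factory=time.time)

snippet = SavedSnippet(
    url="https://example.com/groups/expats/posts/123",  # hypothetical URL
    text="The immigration office only takes cash on Fridays.",
    context="Hard-won expat wisdom",
    tags=["expat", "bureaucracy"],
)

# Serialize for upload to a shared, searchable index.
record = json.dumps(asdict(snippet))
```

Because each record carries the source URL, user context, and tags, the public search side only needs a document store plus full-text search over `text` and `context`.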
- indie websites only (no news, no Medium, no ecommerce, etc.), for those who want to find individuals who still maintain their own website and say something interesting on it
- low-size websites only, for people with very low bandwidth; anything above a certain size (e.g. 1 MB) doesn't get indexed
- recipes (but there are some niche websites for this already)
- websites with no ads on them (but this may conflict with your business model, if you have one)
- websites focused on a certain geographic area (e.g. websites with information by, for, and about Texas, or Slovakia, or Buenos Aires)
- websites with no JavaScript on them (for people who want to be able to turn off JavaScript but don't have a good way of finding out which websites they can still use to get a particular piece of info)
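Several of those niches reduce to a per-page admission rule at crawl time. A minimal sketch, with crude string heuristics standing in for real HTML parsing (the `adsbygoogle` marker is just one common ad tag, chosen as an example):

```python
# Hypothetical admission rules for a niche index, per the list above.

def admit(page_bytes: int, html: str, max_bytes: int = 1_000_000) -> bool:
    """Return True if the page qualifies for a low-size, no-JS, no-ads index."""
    if page_bytes > max_bytes:        # low-size rule (e.g. 1 MB cap)
        return False
    lowered = html.lower()
    if "<script" in lowered:          # no-JavaScript rule (crude heuristic)
        return False
    if "adsbygoogle" in lowered:      # no-ads rule (matches one common ad tag)
        return False
    return True

# A plain page passes; a page pulling in a script does not.
admit(20_000, "<html><body>plain page</body></html>")          # True
admit(20_000, "<html><script src='x.js'></script></html>")     # False
```

In practice you would parse the DOM and fetch linked resources to measure true page weight, but the point is that each niche from the list is just another predicate composed into `admit`.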
Every time I repeat a search query and end up finding the answer on the same website I visited before, I wish I had marked that link as the "definitive answer."
The next time I searched for the same information, the extension would point me directly to my previously marked link.
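The core of that extension is just a map from a normalized query to the marked link. A toy sketch (the normalization is deliberately crude, and the URL is hypothetical):

```python
# Sketch: a query -> "definitive answer" store, as the extension might keep it.

definitive = {}  # normalized query -> URL the user marked

def normalize(query: str) -> str:
    # Crude normalization so near-identical queries hit the same entry.
    return " ".join(sorted(query.lower().split()))

def mark(query: str, url: str) -> None:
    definitive[normalize(query)] = url

def lookup(query: str):
    return definitive.get(normalize(query))

mark("python sort stable", "https://example.com/answer")  # hypothetical URL
lookup("stable sort python")  # finds the marked URL despite different word order
```

Sharing these maps between users is exactly the "subscribe to each other's answers" idea below: your `definitive` store becomes a ranking signal for people who trust you.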
Maybe by letting people subscribe to each other's answers I could bootstrap a better Google. Developers seem like a good initial target market. Students too...
Obviously, the easiest option is to build on top of an existing index. This in turn makes it a purely marketing play (DDG's marketing play is "privacy").
Here is one exploration of the following concept: a search engine built on top of high-quality sites only (as vetted by HN submission history):
https://cse.google.com/cse?cx=014479775183020491825:c2lrlzro...
Described in full here: https://news.ycombinator.com/item?id=21209358
Imagine instead that you start writing out your ideas in natural form. Documents appear that are relevant to your ideas, but with the goal of diversifying by category. Instead of the top 100 results, you might get just a minimal set of results per category.
As you continue to write, the set of relevant documents becomes more constrained, but the engine continues to try to maximize diversity.
I don't know if this would work for everything, but it would be interesting.
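The category-diversified retrieval described above can be sketched as a greedy selection that admits at most N results per category. The scores and categories below are toy data, and real systems would use something like maximal marginal relevance rather than hard category caps:

```python
# Sketch of category-diversified retrieval: instead of the flat top-100,
# greedily pick at most `per_category` results from each category.

def diversify(results, per_category=1):
    """results: list of (score, category, doc) tuples; higher score is better."""
    picked, seen = [], {}
    for score, cat, doc in sorted(results, reverse=True):
        if seen.get(cat, 0) < per_category:
            picked.append(doc)
            seen[cat] = seen.get(cat, 0) + 1
    return picked

results = [
    (0.9, "blog", "a"), (0.8, "blog", "b"),
    (0.7, "forum", "c"), (0.6, "docs", "d"),
]
diversify(results)  # -> ["a", "c", "d"]: one result per category, "b" suppressed
```

As the user keeps writing, you would re-run this over a shrinking candidate set: the relevance filter tightens, while the per-category cap keeps the remaining results spread across categories.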
So I would start with curated and crowdsourced first-page results for the top x% of these knowledge-based queries, with Wikipedia-like guidelines and moderation to ensure the quality of the sites mentioned is up to scratch, coupled with a built-in feedback mechanism for people browsing. I think Wikipedia has proven that, while difficult, a scheme like this is indeed possible.
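One simple way to wire the feedback mechanism into the curated results is to let curators pick the candidate set and let reader votes order it. A toy sketch with made-up URLs:

```python
# Toy sketch of the feedback loop: curated results re-ranked by reader votes.
from collections import defaultdict

votes = defaultdict(int)  # url -> net "was this helpful?" votes from browsers

def feedback(url: str, helpful: bool) -> None:
    votes[url] += 1 if helpful else -1

def rank(curated: list) -> list:
    # Curators choose what is eligible; reader feedback decides the order.
    return sorted(curated, key=lambda url: votes[url], reverse=True)

feedback("https://good.example", True)
feedback("https://good.example", True)
feedback("https://meh.example", False)
rank(["https://meh.example", "https://good.example"])
```

A real deployment would need vote-fraud defenses and time decay, but the division of labor mirrors Wikipedia's: humans set the boundaries, lightweight signals do the day-to-day ordering.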
I also think you can start very niche and play with the results structure. For example, I like motorcycles, and after years of browsing I have discovered the best places for reviews and information. Even just this use case could benefit from a better-structured results page and the removal of all the spammy sites. The same goes for other niches like cooking and programming languages.
The same problem applies to niche search engines: unless they have some unique features/properties that Google doesn't have, you are plainly better off with Google.
Google has a monopoly on search, which, looking at the market, will hold up for the foreseeable future.
So unless you can offer a specific feature (or feature set) for a niche, or you just want to build it for the hell of it, I wouldn't recommend anyone go into search engines.
I would probably go with what DuckDuckGo does, but offer unique features that are useful for one or more niches.