HACKER Q&A
📣 PaulHoule

Consolidated RSS feeds for independent blogs?


I've been experimenting with a smart RSS reader and using Superfeedr to ingest content. Superfeedr monitors RSS feeds and calls a webhook when content arrives; I stuff these items into an Amazon SQS queue and then spool them out at my leisure.

My system requires a fairly high volume (thousands of articles) to train a machine learning model, so I have been focusing on high-volume feeds like preprints from arXiv, news from the Guardian, etc.

Now Superfeedr charges 10¢ per feed per month, and my bill right now is $3 a month. Subscribing to about 100 feeds on Superfeedr seems reasonable, but subscribing to 1000 feeds seems pricey to me.
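To put numbers on that, here is a rough sketch of the arithmetic. It assumes the 10¢ price is flat per feed with no volume discount or base fee, which the post does not confirm; the function names are just for illustration.

```python
# Price quoted in the post: 10 cents per feed per month.
# Assumption: flat per-feed pricing, no volume discount or base fee.
PRICE_CENTS_PER_FEED = 10


def monthly_cost_dollars(feeds: int) -> float:
    """Monthly bill in dollars for a given number of subscribed feeds."""
    return feeds * PRICE_CENTS_PER_FEED / 100


def feeds_for_bill(dollars: float) -> int:
    """Number of feeds implied by a given monthly bill."""
    return round(dollars * 100 / PRICE_CENTS_PER_FEED)


print(feeds_for_bill(3))           # 30 feeds behind the current $3 bill
print(monthly_cost_dollars(100))   # 10.0
print(monthly_cost_dollars(1000))  # 100.0
```

Under that assumption, 1000 feeds would cost about $100 a month, ten times the 100-feed case.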

There are a lot of independent blogs out there that publish an article every week or every six months. What they all have in common is that somebody has to poll an RSS file many, many times per article ingested. With Superfeedr that makes for high costs, but it would be a hassle even if I built my own RSS ingest system.
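As a rough illustration of "many polls per article": the ratio is just the blog's posting interval divided by the polling interval. The hourly polling interval below is an assumption for the sake of the example, not something Superfeedr documents here, and six months is approximated as 182 days.

```python
from datetime import timedelta


def polls_per_article(post_interval: timedelta, poll_interval: timedelta) -> float:
    """Average number of polls made for each new article that shows up."""
    return post_interval / poll_interval


HOURLY = timedelta(hours=1)  # assumed polling interval

print(polls_per_article(timedelta(weeks=1), HOURLY))    # 168.0
print(polls_per_article(timedelta(days=182), HOURLY))   # 4368.0
```

Polling less often shrinks the ratio proportionally, at the price of seeing new posts later.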

One thing that would help would be consolidated RSS feeds that aggregate posts from a large number of blogs. Are there good ones out there? Are there other answers to the problem of polling hundreds or thousands of independent blogs?


  👤 WorldMaker Accepted Answer ✓
Back in the day these used to be called "Planets" and at one point there were lots of them with all sorts of content-specific niches and various levels of automation/moderation. (The best were opt-in/by-invite-only and extremely respectful to the bloggers they aggregated.)

https://en.wikipedia.org/wiki/Planet_(software)

You may want to be careful about what exactly from RSS feeds you train an ML model on. If you are planning to commercialize your ML model, that may be a direct infringement of the CC licenses under which many blogs provide their content, especially to RSS readers and aggregators like "Planets". (This applies to using feed content you've already aggregated from Superfeedr as well.) Please use ML responsibly and ethically.


👤 kevincox
Why don't you just get a list of feeds and scrape them? It doesn't sound like you need real-time updates for your use case so polling shouldn't be very expensive. You can just poll weekly or less. Polling a thousand blogs shouldn't take long at all.
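A minimal sketch of that kind of batch poll, using only the standard library. The names `fetch` and `poll_all`, the User-Agent string, and the worker count are all made up for illustration; running it weekly (cron or similar) is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Union
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one feed document with a plain GET (no caching)."""
    req = Request(url, headers={"User-Agent": "feed-poller-sketch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def poll_all(
    urls: List[str],
    fetcher: Callable[[str], bytes] = fetch,
    workers: int = 16,
) -> Dict[str, Union[bytes, Exception]]:
    """Fetch every feed concurrently; map each URL to its body or its error."""

    def safe(url: str) -> Union[bytes, Exception]:
        # One dead blog should not abort the whole run, so errors are
        # returned as values instead of being raised.
        try:
            return fetcher(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(safe, urls)))
```

Something like `poll_all(feed_urls)` returns a dict you can then hand to a feed parser. Because the work is network-bound, a small thread pool should be enough for a list of this size.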

👤 kaffeeringe
A good feedreader wouldn't poll again when it receives a 304 response.
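For context, a 304 is what a server sends back to a conditional GET when the feed hasn't changed, so the poll costs almost nothing. A rough sketch with `urllib` follows; the function names are invented for illustration, and it assumes the blog's server actually sends `ETag` or `Last-Modified` headers, which not all do.

```python
from typing import Dict, Optional, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build the validators to send back from the previous successful fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
    timeout: float = 10.0,
) -> Tuple[Optional[bytes], Optional[str], Optional[str]]:
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    req = Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; treat it as
        # "nothing new" and keep the validators we already had.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

You would store the returned `etag` and `last_modified` per feed and pass them back in on the next poll.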

👤 is_true
Lambda, Cloudflare Workers (probably cheaper), or GitHub Actions + a repo