Makes me think they are all wrappers around some other service, but I can't find anything else. Does anyone happen to know how this works?
These wire agencies must cost more than the amount you mentioned. But they would be considered primary sources, a "layer 1" (L1) for news and information. A lot of news websites essentially just repackage L1 facts, add some stories about famous people's tweets, and throw in a handful of original reporting.
Cheaper wire services that deal more with business plugs and PR would be businessnewswire and prnewswire - these are inherently promotional.
I doubt this answers your question, but I'd be interested to know how accessible this essential L1 information would be to ordinary users (not companies). While you may end up paying for Reuters directly, another website that includes a Reuters feed could give you the same content for less.
On a scale of 1 to 10, the difficulty is like a 2 or 3
The list itself isn’t particularly hard to maintain. What’s hard is the myriad of rules and configurations required to crawl and scrape each publisher. We built a model that extracts article data, and it does a good job of figuring out headlines, images, authors, and text.
Scraping rules are easy enough to manage yourself if you're only crawling a few publishers, but it gets exponentially more difficult to crawl hundreds.
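To make the "rules and configurations per publisher" point concrete, here's a minimal sketch of how such a rule table might look. Everything here (the `ScrapeRules` fields, the domains, the selectors) is my own illustration, not anything the commenters actually run - the pain they describe is that every entry like this breaks whenever a site redesigns.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Hypothetical per-publisher rule set. Each site needs its own selectors,
# crawl delay, and quirks, which is why the maintenance cost grows with
# the number of publishers rather than with the size of the URL list.
@dataclass(frozen=True)
class ScrapeRules:
    headline_selector: str
    body_selector: str
    author_selector: str
    crawl_delay_s: float = 1.0

# Illustrative entries only; real selectors rot constantly.
RULES_BY_DOMAIN = {
    "example-news.com": ScrapeRules("h1.headline", "div.article-body p", "span.byline"),
    "other-paper.net": ScrapeRules("h1", "article p", "a[rel=author]", crawl_delay_s=2.0),
}

def rules_for(url: str) -> Optional[ScrapeRules]:
    """Look up the rule set for a URL's domain, if we have one configured."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return RULES_BY_DOMAIN.get(host)
```

With a handful of publishers this table is trivial; with hundreds, keeping every selector current as sites change is a full-time job, which matches the comment above.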
Buyers are often a bit old school and frankly far more expensive than it's worth. Also, good luck getting a usable, affordable API. I'm looking here at people like Meltwater, LexisNexis, etc., who have licensing agreements with publishers.
Then there are the scrapers. The one I use is newsapi.ai, and I can broadly recommend them. They've got a decent selection, are happy to add stuff for you, and have lots of nice goodies baked in (e.g., NERD).
Most of the other ones you'll find with a cursory "news api" search also fall into this category AFAICT, but few, if any, provide full text, which is what I need.
From conversations I've had with my supplier, I believe they've got a Scrapy box running somewhere, pulling largely off RSS feeds. I wouldn't want their job, to be honest - so much to look after.
This approach is fine for some needs, but you can literally see the gaps in the time series where something has fallen over.
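Those gaps are easy to spot programmatically: when the scraper falls over, consecutive article timestamps are suddenly far apart. A minimal sketch of that check (the function name and the six-hour threshold are my assumptions, not anything from the thread):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(hours=6)):
    """Return (start, end) pairs where consecutive article timestamps are
    further apart than max_gap - the visible holes left in a feed's time
    series when a scraper has fallen over."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]
```

Run over a publisher's article datetimes, any pair this returns marks a window where the feed went quiet - which for a busy outlet usually means the collector, not the newsroom, stopped.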
I'm very interested in this space and would love to hear others' experiences.