How does archive.today bypass paywalls?
I can understand how it does it for simple paywalls that can be bypassed using a user agent pretending to be Google.
But how does it work for firmer paywalls, e.g. the NYT's?
Mostly user agents: sites want to be crawled so they appear in search results. From time to time, though, it does actively log in to sites (LinkedIn was one that stopped working, presumably because the account was suspended, as logging in that way likely violates the T&C).
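To make the user-agent idea concrete, here is a minimal sketch of a request that announces itself as a crawler. The string is Googlebot's published user agent; the example.com URL and the build_request helper are made up for illustration, and nothing here says this is what archive.today actually sends. Sites that verify crawlers (for example via reverse DNS on the requesting IP) would not be fooled by the header alone.

```python
import urllib.request

# Googlebot's published user-agent string. Whether any particular site
# treats it specially is an assumption, not something confirmed here.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str = CRAWLER_UA) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, used only to show the header being set.
req = build_request("https://example.com/article")
print(req.get_header("User-agent"))
```

The request is only constructed, not sent; passing it to urllib.request.urlopen would perform the actual fetch.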
I wondered the same thing. I just figured they were whitelisted by the paywalled sites, but I don't know exactly why they would be. Perhaps archive.today has accounts on these paywalled sites, although I am not sure that's the case.
There is another site, 12ft.io, that used to bypass many of these sites' paywalls, but it got pushback and no longer offers a bypass for sites like WSJ and NYT.
I mean, you could kind of do the same thing locally if you use ArchiveBox. I think it's just the way it scrapes the web page.
Sometimes the paywall is nothing more than a couple of HTML divs.
You can programmatically remove them and then return the page.
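A rough sketch of that kind of div removal, using only Python's standard html.parser. The class name "paywall" and the strip_divs helper are hypothetical; real sites use their own markup, and this only helps when the article text is already in the HTML that was served. Comments and doctype declarations are dropped by this simple version.

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Re-emit HTML, omitting any <div> (and its contents) with a given class."""

    def __init__(self, blocked_class: str):
        super().__init__(convert_charrefs=False)
        self.blocked_class = blocked_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside a blocked <div>

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # track nested divs inside the blocked one
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.blocked_class in classes:
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unless skipped.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_divs(html: str, blocked_class: str = "paywall") -> str:
    """Return html with every <div class="...blocked_class..."> removed."""
    parser = DivStripper(blocked_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = (
    '<p>Intro</p>'
    '<div class="paywall overlay"><div>Subscribe</div></div>'
    '<p>Body &amp; more</p>'
)
cleaned = strip_divs(page)
print(cleaned)
```

The depth counter only counts div tags, so void elements like br or img inside the blocked region don't throw it off.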
I would guess most paywalled sites allow bots to scrape content to improve site discoverability, whether it is Archive.org, Archive.today, Google, Google News...
Perhaps they subscribe (gasp!) to sites with paywalled content, sign in, and begin scraping.
My guess: stolen usernames/passwords. Or actual paid subs.