Have you ever used anti-detect browsers for web scraping?

Question

I've been in the web scraping industry for a while, and I often spend time building my own "Swiss Army knife" with Playwright or Selenium in case things get tough. Thanks to a niche Substack I follow, I only discovered today that anti-detect browsers like GoLogin exist. From what I can see, they look like a good solution for small projects but are hard to scale to larger ones because of licensing and infrastructure costs (most of them require a Windows machine to run). Do any of you, smarter than me, use these browsers at large scale? What does your tech stack look like?

jmt_ · Accepted Answer

How would you actually use an anti-detect browser programmatically? Would you need to write a custom Selenium driver for it, or the equivalent for Playwright? Even if the browser is built on something like Chrome, you'd still need a way to interact with the anti-detect features.
A good trick I discovered is using WebKit through Playwright to bypass fingerprinting and related anti-bot measures. Firefox and Chrome simply leak too much information, even with various "stealth" modifications. For example, I have been able to reliably scrape a well-known company's site that implemented a "state of the art, AI-powered, behavioral analysis, etc." anti-bot product. Using Chrome/Firefox plus stealth measures in Playwright did not work; simply switching to WebKit with no further modifications did the trick.
Not exactly what you're asking, but my point is that, with a little time and effort, I've usually been able to find fairly simple holes in most anti-bot measures. It probably wouldn't be terribly hard (especially since you're versed in scraping) to build out something similar to what you're looking to achieve without having to pay for sketchy anti-detect browsers.
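
For anyone wanting to try the WebKit switch described above, here is a minimal sketch using Playwright's Python sync API. The helper name `fetch_html`, the `ENGINES` tuple, and the example URL are assumptions made for illustration; whether WebKit gets past a given anti-bot product will depend on the site.

```python
# Sketch only: fetch a page with a chosen Playwright engine, WebKit by default.
# Assumes `pip install playwright` and `playwright install webkit` were run.

ENGINES = ("webkit", "chromium", "firefox")  # engines bundled with Playwright


def fetch_html(url, engine="webkit", timeout_ms=30_000):
    """Return the rendered HTML of `url` using the given engine (hypothetical helper)."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")

    # Imported lazily so the module loads even where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser_type = getattr(p, engine)  # p.webkit, p.chromium or p.firefox
        browser = browser_type.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            return page.content()
        finally:
            browser.close()


if __name__ == "__main__":
    # Placeholder URL; swap in the page you are testing against.
    html = fetch_html("https://example.com")
    print(len(html))
```

Keeping the engine as a parameter makes it cheap to run the same job against Chromium or Firefox and compare which one gets blocked.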

fxtentacle · Answer

I've found that it's almost never needed. Most of the "advanced AI human detection" things are glorified IP reputation systems. So you just need a few IPs that would be way too painful to block, for example US residential IPs, and you're good. But if you really want to make sure, it's pretty easy to remote-control a cheap Android phone. Plus, detection thresholds tend to be much higher on mobile, because filling out a ReCaptcha with a touch screen is just such a horrible user experience.
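
To illustrate the "few good IPs" idea, here is a rough sketch that rotates requests across a small proxy pool using only the Python standard library. The proxy URLs, `proxy_cycle`, `build_opener_for`, and `fetch` are all hypothetical names made up for this example; you would substitute whatever residential proxy endpoints you actually have.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with real residential proxies.
PROXIES = [
    "http://user:pass@proxy-1.example.net:8000",
    "http://user:pass@proxy-2.example.net:8000",
]


def proxy_cycle(proxies):
    """Endless round-robin iterator over the proxy list."""
    return itertools.cycle(proxies)


def build_opener_for(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic via one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, rotation, timeout=30):
    """Fetch `url` through the next proxy in the rotation and return the body."""
    opener = build_opener_for(next(rotation))
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    rotation = proxy_cycle(PROXIES)
    # Each call goes out through the next proxy in the pool.
    body = fetch("https://example.com", rotation)
    print(len(body))
```

Plain round-robin is the simplest policy; a real setup would probably also drop proxies that start returning blocks or captchas.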

darkpatterns · Answer

There's a good community on this topic called Scraping Enthusiasts (https://discord.gg/4fGEPZzs), plus a curated list of research papers if you want to go deep on the subject matter: https://github.com/prescience-data/dark-knowledge

splatzone · Answer

The Hero browser is designed for this kind of sneaky scraping. It's very interesting: https://github.com/ulixee/hero

ffgh · Answer

Can you share the substack?

jnk345u8dfg9hjk · Answer

this smells like an ad for GoLogin

decide1000 · Answer

What do you mean by "most of them require a windows machine to run"?

QuadmasterXLII · Answer

If you don't want to be detected, run Chrome in a VM and move the mouse around with PyUserInput.
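
A rough sketch of the mouse-moving part, assuming PyUserInput's `PyMouse` class with its `position()` and `move(x, y)` methods. The helpers `mouse_path` and `move_mouse`, the smoothstep easing, and the jitter values are my own assumptions for illustration, not something PyUserInput provides, and there's no guarantee this particular motion looks human to any given detector.

```python
import random
import time


def mouse_path(start, end, steps=25, jitter=3, rng=None):
    """Return `steps` points easing from start to end with small random jitter."""
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        eased = t * t * (3 - 2 * t)  # smoothstep: slow start, slow finish
        x = x0 + (x1 - x0) * eased
        y = y0 + (y1 - y0) * eased
        if i < steps:  # no jitter on the last point so we land on the target
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((round(x), round(y)))
    return points


def move_mouse(end, steps=25):
    """Move the real cursor along a jittered path (needs PyUserInput and a display)."""
    from pymouse import PyMouse  # shipped as part of PyUserInput

    mouse = PyMouse()
    for x, y in mouse_path(mouse.position(), end, steps):
        mouse.move(x, y)
        time.sleep(random.uniform(0.005, 0.03))  # uneven pauses between steps


if __name__ == "__main__":
    move_mouse((400, 300))
```

The path generation is kept separate from the cursor calls so it can be checked without a display, and swapped for a different curve later.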