I'm trying to deal with a very interesting (to me) case. Someone is proxy-mirroring all content of my website under a different domain name.
- Original: https://www.saashub.com
- Abuser/Proxy-mirror: https://sukuns.us.to
My ideas of resolution:
1) Block them by IP - That doesn't work as they are rotating the IP from which the request is coming.
2) Block them by User Agent - They are duplicating the user-agent of the person making the request to sukuns.us.to
3) Add some JavaScript to redirect to the original domain-name - They are stripping all JS.
4) Use absolute URLs everywhere - they are rewriting every occurrence of www.saashub.com to their domain name.
In other words, I'm out of ideas. Any suggestions would be highly appreciated.
P.S. What's more, Bing is indexing all of SaaSHub's content under sukuns.us.to ¯\_(ツ)_/¯. I've reported a copyright infringement, but I have a feeling that it could take ages to get resolved.
I wrote an HN post about it as well: https://news.ycombinator.com/item?id=26105890, but to spare you the irrelevant details and the digging through comments for updates, here is what worked for me. You can block all their IPs, even though they may have A LOT and can change them on each call:
1) I prepared a fake URL that no legitimate user will ever visit (like website_proxying_mine.com/search?search=proxy_mirroring_hacker_tag)
2) I loaded that URL like 30 thousand times
3) from my logs, I extracted all IPs that searched for "proxy_mirroring_hacker_tag" (which, from memory, was something like 4 or 5k unique IPs)
4) I blocked all of them (a sketch of the extraction step follows below)
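For reference, a minimal sketch of steps 3-4 in Python, assuming an nginx-style access log where the client IP is the first field (the paths and tag are placeholders; adjust to your setup):

    import re

    LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path
    TAG = "proxy_mirroring_hacker_tag"

    ips = set()
    with open(LOG_PATH) as log:
        for line in log:
            if TAG in line:
                # combined log format starts with the client IP
                match = re.match(r"^(\S+)", line)
                if match:
                    ips.add(match.group(1))

    # emit nginx deny directives; `include blocked_ips.conf;` in your server block
    with open("blocked_ips.conf", "w") as out:
        for ip in sorted(ips):
            out.write(f"deny {ip};\n")

    print(f"{len(ips)} unique IPs flagged")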
After doing the above, the offending domains were showing errors for 2-3 days and then they switched to something else and left me alone.
I still go back and check them every few months or so ...
P.S. My advice is to remove their URL from your post here. Linking to them only helps search engines pick up their domain and rank it with your content ...
1. Create fake HTML elements with unique strings inside. You can then search for those strings in search engines to find similar fake sites on other domains.
2. Create a fake HTML element containing all the request details in encrypted form. Then visit the adversary's website, look for that element, and flag the IP or the headers it reveals (a sketch follows this list).
3. Buy proxy databases, and whenever a user requests your webpage, check if it's a proxy.
4. Instead of banning them, return fake content (fake titles, fake images, etc.) if a proxy is detected or the IP is flagged.
5. Don't ban the flagged IPs; they'll just find another one. Make them and their users angry so they give up on you.
6. Maybe write some bad words to the user in random places in the HTML when you detect flagged IPs :D The users will leave the site, which will reduce the adversary's SEO score and get them downranked.
7. Enable image hotlinking protection. Increase the cost of proxying for them.
8. Use @document CSS to hide the content when the URL is different (note that @document is non-standard and has very limited browser support).
9. Send an abuse report to the hosting provider.
10. Send an abuse report to the domain provider.
11. Look at the flagged IPs and try to identify the proxy provider. If you find one, email them too.
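A rough sketch of idea 2, using Flask and the third-party `cryptography` package purely for illustration. Each page embeds the requester's IP and User-Agent encrypted in a hidden element; fetch any page through the mirror yourself, decrypt the element, and you know exactly which IP and headers the proxy used:

    from cryptography.fernet import Fernet
    from flask import Flask, request

    app = Flask(__name__)
    fernet = Fernet(Fernet.generate_key())   # in practice, persist the key

    @app.route("/some-page")
    def page():
        details = f"{request.remote_addr}|{request.headers.get('User-Agent', '')}"
        token = fernet.encrypt(details.encode()).decode()
        # invisible to users; survives proxying unless they strip unknown markup
        return f"<html><body>normal content<span hidden>{token}</span></body></html>"

    # later, with the token scraped from the mirror:
    #   fernet.decrypt(token.encode())  ->  b"1.2.3.4|Mozilla/5.0 ..."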
Edit: more ideas came to mind while I was in the toilet:
1. Create big fake CSS files (10 MB etc.) and repeatedly download them through the adversary's website. This should cost them a lot of money on proxies.
2. When you detect a proxy, return huge fake HTML files (10 GB etc.). That could crash their server if it loads the whole HTML into memory while parsing.
https://webmasters.stackexchange.com/questions/56326/canonic...
I noticed that the other domain is hotlinking your images. So you can disable image hotlinking by only allowing certain domains as the referers (a sketch follows the link below). If you block hotlinked images, the other domain will not look as good. Remember to do it for SVGs too.
https://ubiq.co/tech-blog/prevent-image-hotlinking-nginx/
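That article covers the nginx approach; as a hedged sketch, here is the same referer check in Flask (the allowed domain list and extensions are assumptions):

    from flask import Flask, abort, request

    app = Flask(__name__)
    ALLOWED_REFERERS = ("saashub.com",)          # matches www.saashub.com too
    IMAGE_EXTENSIONS = (".png", ".jpg", ".gif", ".svg")

    @app.before_request
    def block_hotlinking():
        if request.path.lower().endswith(IMAGE_EXTENSIONS):
            referer = request.headers.get("Referer", "")
            # empty referers stay allowed so direct views and RSS readers work
            if referer and not any(d in referer for d in ALLOWED_REFERERS):
                abort(403)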
Finally, I also see they are using a CDN called Statically to host some assets off your domain. You can block their scrapers by the user agents listed here:
If the TLS ciphers the client proposes for negotiation don't align with the client's User-Agent, they get a CAPTCHA.
I would suspect that whoever is doing this proxy-mirroring isn’t smart enough to ensure the TLS ciphers align with the User-Agent they’re passing through.
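There is no portable way for the application itself to see the TLS handshake, so any sketch has to assume the TLS terminator or CDN passes a fingerprint (e.g., a JA3 hash) down in a header. The header name and the hash values below are placeholders, not real data:

    from flask import Flask, request

    app = Flask(__name__)

    # hypothetical table of JA3 hashes observed from real browser families
    KNOWN_GOOD = {
        "chrome":  {"aa56c057ad164ec4fdcb7a5a283be9fc"},   # placeholder
        "firefox": {"b20b44b18b853ef29ab773e921b03422"},   # placeholder
    }

    def ua_family(ua: str) -> str:
        if "Firefox" in ua:
            return "firefox"
        if "Chrome" in ua:
            return "chrome"
        return "other"

    @app.before_request
    def check_tls_vs_ua():
        ja3 = request.headers.get("X-JA3-Hash", "")   # assumed injected upstream
        family = ua_family(request.headers.get("User-Agent", ""))
        if family in KNOWN_GOOD and ja3 and ja3 not in KNOWN_GOOD[family]:
            # claims to be a browser but negotiates TLS like something else
            return "please solve this CAPTCHA", 403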
By the way, I've also reported the abuser as a phishing/fraud website through https://safebrowsing.google.com/safebrowsing/report_phish/?u...
Instead, plot a few different changes and throw them in all at once. Preferably in a way where they will have to solve all of the changes at the same time to figure out what happened and get things working again. Also, favor changes that are harder to detect. E.g., pure IP blocks are easier to detect than tarpitting and returning fake/corrupted content. The longer their feedback loops, the more likely it is that they'll just give up and go be a parasite somewhere else.
FIND THE IP FOR THE DOMAIN
PS > ping sukuns.us.to
Pinging sukuns.us.to [45.86.61.166] with 32 bytes of data:
Reply from 45.86.61.166: bytes=32 time=319ms TTL=39
...
REVERSE DNS TO FIND HOST https://dnschecker.org/ip-whois-lookup.php?query=45.86.61.166
Apparently it's "Dedipath". And that WHOIS lookup gives an abuse email address:
"Abuse contact for '45.86.60.0 - 45.86.61.255' is 'abuse@dedipath.com'"
So you could try emailing that address. They may take the site down, or hopefully more than that...
Are they now?
Add a `visibility: hidden` to random elements on the page, and show them with JavaScript.
OR
Are they removing _all_ JS? Have you checked whether they also strip inline event-handler attributes? You can try script injection _into your own site_ to see if their mirroring software is smart enough to deal with all the different XSS vectors.
Bonus points: if they remove your `onhover` attribute, add a style like:
body { display: none }
body[onhover='the js code that they will remove'] { display: block }
The page is hidden by default and only shown when the attribute survives, so if the mirror strips it, their copy stays blank.
Then instead of blocking the fingerprint, poison the data. Introduce errors that are hard to detect. Maybe corrupt the URLs, or use incorrect descriptions or categories. Be creative, but make it kind of shit.
It's easy to work around blocks. Working around poisoned data is much harder.
The nice thing about this is it can be made arbitrarily complex. For example you can make the page actually blank and fetch all the normal, real content with JS after validating the user's browser as much as you like on both client and server. That's what Cloudflare's bot shield stuff does. Since JS is Turing complete there is no shortcut that the proxy can take to avoid running your real JS if you obfuscate it enough. They would have to solve the halting problem.
What a determined adversary would do is run your code in a sandbox that spoofs the URL, so then your job becomes detecting the sandbox. But it's unlikely they would escalate to this point when there are so many other sites on the internet to copy.
Put brandings/personalizations/signatures in your pages that are not easy to remove automatically. Include your site URL if possible. The idea is that if a visitor sees these on a different site, it becomes obvious that the content doesn't belong there.
Write an article page about these things happening, specifically mentioning the mirroring site URLs, and see if they will also blindly mirror it.
The copy is using ZeroSSL, which seems to use a mechanism similar to Let's Encrypt's to verify certs. Maybe you could get their certificate by serving the response to their challenge from your server. No idea how to proceed from there, though.
Or activate Google's webmaster tools (Search Console). Maybe there's some setting like "remove from index" or "upload sitemap" that could reduce its visibility on Google.
Put it under a URL only you know, then start DoS-ing it.
Of course, that requires you to be able to serve a prepared gzipped response; it depends on your stack.
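Preparing the response itself is a few lines with the standard library. gzip's compression ratio tops out around 1032:1, so roughly 10 MB on disk inflates to about 10 GB (the path and size here are arbitrary). Serve the file with a `Content-Encoding: gzip` header, and only to flagged clients:

    import gzip

    SIZE_GB = 10
    with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
        chunk = b"\0" * (1024 * 1024)      # 1 MB of zeros compresses extremely well
        for _ in range(SIZE_GB * 1024):    # 10240 chunks = 10 GB uncompressed
            f.write(chunk)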
Bots will trigger it by walking through all the pages, but a real human would not click in, since the paging and the titles are nonsense.
The first line of defense is contacting the relevant authorities. This means search engines, the hosting provider, and the owner of the domain (who may not be the abuser). Be polite and provide relevant evidence. Make it easy for them to act on it. There'll be some turnaround time and it's not always successful, but it's the best way to get a meaningful resolution to the issue.
What about in the meantime? If all the source IPs are from one ASN, just temporarily block all IPs originating from that ASN. There'll be some collateral damage, but most of your users won't be affected.
What struck me, though, is that the copycat website is waaaay faster than your original. If I were in your shoes, I would invest my time and effort into speeding up the site. Unlike hunting some script kiddies, that will bring palpable benefits.
I have a website doing this to one of my domains. I have let it slide for now since I get value out of users that use their site too, but I have thought about packing their content with advertisements to turn the tables a bit.
If you change subtle details of spelling, spacing, formatting, etc. by source IP, then you can look at one of their pages and figure out which IP it was scraped from (a sketch follows below).
Then, just add goatse to all pages requested by that IP. Alternatively, replace every other sentence with GPT-generated nonsense.
EDIT: it should be quite easy to use JS to fingerprint the scraper. The downside is that you will also block all NoScript users.
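A toy sketch of the spacing variant: derive bits from a hash of the client IP and encode them as single vs. double spaces between sentences. To trace a scraped page, regenerate the watermark for every candidate IP in your logs and compare:

    import hashlib

    def watermark(html: str, ip: str) -> str:
        # 32 bits of sha256(ip), encoded as single vs. double spaces after full stops
        digest = hashlib.sha256(ip.encode()).digest()
        bits = bin(int.from_bytes(digest[:4], "big"))[2:].zfill(32)
        pieces = html.split(". ")
        out = pieces[0]
        for i, piece in enumerate(pieces[1:]):
            out += ".  " if bits[i % 32] == "1" else ". "
            out += piece
        return out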
1. Grab the list of IPs that you've already identified and feed them through nrich (https://gitlab.com/shodan-public/nrich): "nrich bad-ips.txt"
2. See if all of the offending IPs share a common open port/ service/ provider/ hostname/ etc. Your regular visitors probably connect from IPs that don't have any open ports exposed to the Internet (or just 7547).
3. If the IPs share a fingerprint then you could lazily enrich client IPs using https://internetdb.shodan.io and block them in near real-time. You could also do the IP enrichment before returning content but then you're adding some latency (<40ms) to every page load which isn't ideal.
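A sketch of step 3 against that endpoint; the port-7547 exception mirrors the note in step 2, and in practice you'd cache the result per IP rather than call it inline:

    import requests

    def looks_like_server(ip: str) -> bool:
        resp = requests.get(f"https://internetdb.shodan.io/{ip}", timeout=2)
        if resp.status_code != 200:
            return False              # no data: likely a normal residential client
        ports = resp.json().get("ports", [])
        # ignore TR-069 on home routers
        return any(p != 7547 for p in ports)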
I seem to recall someone doing something similar at one point: hosting resources that get pulled down only by flagged IPs, such as a 300 KB gzip-encoded file that tries to expand to 100 TB.
You may be able to claim their domain out from under them and then mess with search settings (e.g. In Google Search Console you can remove URLs from search results).
Extra points if you can cause legal trouble for whoever runs the site. If you're hosting rather large files, then you can also hide content by default that will never be loaded on your site, but will load on the other site. Add a large file to your site, then reference that file a few thousand times with query params to ensure cache busting, and then make the browser load it all using CSS when it detects that it runs on the other site.
- "bait urls" that other crawlers won't touch
- trigger by request volume and filter out legit crawlers
- find something unique about the headers in their requests that you can identify them with
One additional suggestion is to not block them, but rather, serve up different content to them. Like have a big pool of fake pages and randomly return that content. If they get a 200/OK and some content, they are less likely to check that anything is wrong.
Another idea is to serve them something that you can then report as some type of violation to Google, or something (think SafeSearch) that gets their site filtered.
a[href*="sukuns"] { font-size: 500px!important; color: LimeGreen!important; }
pretty much destroys the page. I guess eventually they would give up in the specificity battle.
Probably more stuff you could do with CSS to mess with them.
(Less economical if they're not caching anything.)
Base64 encoding images with watermarks may also be worth a shout.
Love the zip bombing.
Long shot, but I wonder if it's possible to execute some script on their server.
Happened to me back in the days of blogging.
Posted an image of me mocking them on my blog. Sure enough they published it and they didn't notice for a while. They stopped it soon after :)
1) Add a watermark to your images when the request comes through their proxy (a sketch follows below):
Stolen image from {url}
2) Add a JS script that, when the URL differs from yours, displays a message and redirects.
More examples here from long long ago.. http://www.ex-parrot.com/~pete/upside-down-ternet.html
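A minimal Pillow sketch of idea 1, assuming you already detect proxied requests elsewhere (by referer or flagged IP); the font and placement are kept deliberately simple:

    from PIL import Image, ImageDraw

    def stamp(path: str, url: str) -> Image.Image:
        img = Image.open(path).convert("RGB")
        draw = ImageDraw.Draw(img)
        text = f"Stolen image from {url}"
        # red text with a black shadow near the bottom-left corner
        draw.text((11, img.height - 19), text, fill="black")
        draw.text((10, img.height - 20), text, fill="red")
        return img

    # stamp("logo.png", "https://www.saashub.com").save("logo_stamped.png")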
Luckily, I am at home, and my children are at school.
I have no idea what happened, or why I got redirected, but I can certainly suggest not taking up the idea to serve disgusting content (given I clicked a link that someone on HN posted, I shouldn't be subjected to that).
Even with IP rotation, a proxy website would probably generate more traffic than normal from those few IPs. Tweak the fail2ban variables to make it less likely to trigger on false positives (a larger number of requests over a larger amount of time), but block the violating IPs for a long period, a few days for example (a sketch follows below).
I hope it helps
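As a hedged sketch, a jail.local entry tuned along those lines; the jail and filter names are hypothetical, and you would still need a filter.d regex that matches the offending requests in your log:

    [proxy-mirror]
    enabled  = true
    port     = http,https
    filter   = proxy-mirror
    logpath  = /var/log/nginx/access.log
    # allow up to 2000 requests per hour before triggering
    maxretry = 2000
    findtime = 3600
    # then ban for 3 days
    bantime  = 259200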
You're already using Cloudflare, you could try talking to their support or just turning up settings to make it more strict for bots.
Instead of blocking their IPs, detect if the traffic is coming from the abuser's IPs, and serve different content -- blank, irrelevant, offensive, copyright violations, etc.
https://bgp.he.net/AS35913#_prefixes
The IPs they switch between may all be from this pool.
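If so, the membership check is a few lines with the standard library. The prefix below is just the range from the WHOIS result earlier in the thread; pull the ASN's full list from the bgp.he.net page above and refresh it periodically:

    import ipaddress

    SUSPECT_PREFIXES = [ipaddress.ip_network("45.86.60.0/23")]   # from the WHOIS range

    def in_suspect_asn(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in SUSPECT_PREFIXES)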
More simply, you could just make all the HTML links broken unless some obfuscated or server-backed algorithm is run on them. Think Google search results (a sketch follows below).
Potential downsides: SEO.
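One hedged way to sketch the server-backed variant in Flask, with a hypothetical signing key: hrefs become opaque signed tokens, and the resolver refuses them when the Referer isn't your own host, so every link followed on the mirror's copy is dead:

    import base64, hashlib, hmac
    from flask import Flask, abort, redirect, request

    app = Flask(__name__)
    SECRET = b"rotate-me"             # hypothetical signing key
    MY_HOST = "www.saashub.com"

    def link_token(target: str) -> str:
        # emit this as the href instead of the real target
        mac = hmac.new(SECRET, target.encode(), hashlib.sha256).hexdigest()[:16]
        blob = base64.urlsafe_b64encode(target.encode()).decode()
        return f"/l/{blob}.{mac}"

    @app.route("/l/<token>")
    def resolve(token):
        blob, _, mac = token.rpartition(".")
        try:
            target = base64.urlsafe_b64decode(blob.encode()).decode()
        except Exception:
            abort(404)
        good = hmac.new(SECRET, target.encode(), hashlib.sha256).hexdigest()[:16]
        if not hmac.compare_digest(mac, good):
            abort(404)
        if MY_HOST not in request.headers.get("Referer", ""):
            abort(404)                # followed from the mirror: dead link
        return redirect(target)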
Does copied content even rank in Google? How are they driving the traffic to it?
https://blog.cloudflare.com/introducing-scrapeshield-discove...
How would one go about finding out?
Access denied. Error code 1020.
You do not have access to www.saashub.com.
The site owner may have set restrictions that prevent you from accessing the site.
They're serving your copyrighted content. Seems like what it was made for.
Make your site only work with JS. Easy.
The current IP 45.86.61.166 is likely a compromised host [1] which tells me you are dealing with one of the gangs that create watering holes for phishing attacks and plan to use your content to lure people in. They probably have several thousand compromised hosts to play with. Since others mentioned you could change the content on your site, I would suggest adding the EICAR string [2] throughout the proxied content as well so that people using anti-malware software might block it. They are probably parking multiple phishing sites on the same compromised hosts [3].
This would also be a game of whack-a-mole but if you can find a bunch of their watering hole sites and get the certificate fingerprints and domains into a text file, give them to ZeroSSL and see if they can mass revoke them. Not many browsers validate this but it might get another set of eyes on the gang abusing their free certs.
If you have a lot of spare time on your hands, you could automate scripting the gathering of the compromised proxy hosts they are using and submit the IP, server name, domain name to the hosting provider with the subject "Host: ${IP}, ${Hostname}, compromised for phishing watering hole attacks". Only do this if you can automate it as many server providers have so many of these complaints they end up in a low priority bucket. Use the abuse@, legal@ and security@ aliases for the hosting company along with whatever they have on their abuse contact page. Send these emails from a domain you do not care about as it will get flagged as spam.
Another option would be to draft a very easy to understand email that explains what is occurring and give that to Google and Bing. Even better would be if we could get the eyes of Tavis Ormandy from Google's vulnerability research team to think of ways to break this type of plagiarized content. Perhaps ping him on Twitter and see if he is up to the challenge of solving this in a generalized way to defeat the watering holes.
I can think of a few other things that would trip up their proxies but no point in mentioning it here since the attackers are reading this.
[1] - https://www.shodan.io/host/45.86.61.166
[2] - https://www.eicar.org/download-anti-malware-testfile/
[3] - https://urlscan.io/result/af93fb90-f676-4300-838f-adc5d16b47...