Danbooru

Cloudflare Walls despite API and Authentication Keys

Posted under General

I use automated tooling to pull certain images out of Danbooru and onto my devices for portable access to favorite images. Unfortunately, my tooling and many tools like it no longer work.

Initially, the Cloudflare wall was meant to stop tooling from masquerading as a browser. That is completely understandable. As we would expect, tools then changed their user agent strings to include usernames and the like, which makes it easier to track who is using the tooling and to counter scraper bots that have no actual user behind them.

Then we had to start copying our authentication cookies into the tooling in order to get past the wall. That made sense too, and until recently it worked very well. Now, however, despite UA string changes, session cookies, and HTTP Basic authentication with account-holder API keys, no tooling works. Even taking the cookies and sending them to the site with plain Python Requests, with all of these adjustments made, hits a Cloudflare wall.

Were additional changes, such as a transparent CAPTCHA, added in a way that also shuts out those of us who are simply collecting our favorites for private enjoyment? It feels like we are being blocked unnecessarily as part of the incessant anti-bot, anti-AI push.

Who suggested copying authentication cookies? That sounds like a terrible idea and I would expect Cloudflare to block you just for that alone. It’s easily detectable and reeks of access restrictions circumvention.

Are you accessing only JSON/XML endpoints or HTML endpoints too? Are you adding delay between requests or are you querying the site as fast as you can?

I sometimes have the problem that Cloudflare decides to give me a bot check for .js and .css files only, but not for any HTML content. That's a problem because it never shows me any captcha I could solve; I just get a broken site with no JavaScript and CSS. In that case, I have to copy the URL of a failing request from the browser console and open that directly to get the captcha. Afterwards, everything loads fine. Based on that, I wouldn’t be surprised if the IP used by your tooling is subject to a bot check that you need to pass once using a browser. If you already solved a bot check on that device in a browser, maybe your browser uses IPv6, which gets your IPv6 address cleared, but your tooling uses IPv4 and your IPv4 address never gets cleared because you never use it in a browser?

The tooling I use can hit any of those endpoints. Imgbrd-grabber tries XML, JSON, then HTML. My Python tooling is based on the Python Requests library and JSON decoding, so it uses the JSON endpoint.
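For anyone wiring up something similar, here is a minimal sketch of the kind of request discussed above: a JSON endpoint, an honest User-Agent that names the tool and the account, HTTP Basic auth with the API key, and a pause between requests. It uses only the standard library rather than Requests so it runs anywhere. The base URL, the tool name, the `Throttle` helper and the one-second interval are my assumptions for illustration, not anything confirmed in this thread.

```python
import base64
import time
import urllib.parse
import urllib.request

# Assumption: base URL used for illustration only.
BASE_URL = "https://danbooru.donmai.us"


def build_request(path, params, login, api_key, tool_name):
    """Build a JSON API request with an honest User-Agent and Basic auth."""
    url = f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"
    token = base64.b64encode(f"{login}:{api_key}".encode()).decode()
    return urllib.request.Request(url, headers={
        # Identify the tool and the account instead of imitating a browser.
        "User-Agent": f"{tool_name} (Danbooru user {login})",
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })


class Throttle:
    """Hypothetical helper: enforce a minimum interval between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Typical use would be to call `throttle.wait()` before each `urllib.request.urlopen(build_request(...))`, so a long backlog is spread out instead of sent as one burst.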

Without session cookies, but with a User-Agent header set to "Danbooru user DarknessEternal" or just "DarknessEternal", it still hits the CF wall, even on separate networks. All I have is a failure code sent back by Cloudflare, which gives me no useful information. The session cookie suggestion came from multiple threads *on these forums* historically as well as from imgbrd-grabber issue suggestions.

I'm not sure whether this is a CF problem, but it sounds like CF is erroneously blocking people without reason. It's also possible that something in the Danbooru site config blocks certain subnets. I had to change IP ranges due to an ISP change: I could connect from the previous /28 range with the old provider, but the /29 I now use with the new provider doesn't work. So I wonder if something is blocked at that level as well.

I'm using my /29 to view this page right now, which means it has gotten through the CF wall at least once before. It's the same egress IP; heck, it's the same system. I don't have IPv6 on this network (because that's an extra $115/month for my business-grade connections).

What exactly does the error page say? Judging by my logs, you're not being blocked by Cloudflare, you're getting 403 Access Denied errors from Danbooru. Double check the permissions on your API key. It looks like your API key is set to be restricted to certain IP addresses, and I think you're using a different IP.
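Since some tools report any 403 as a "Cloudflare wall", it can help to look at the response itself before guessing. The sketch below is a rough heuristic, not an authoritative test: the `cf-mitigated` header and the page-title strings are assumptions about what typical Cloudflare challenge and block pages look like, and the function name and return labels are made up for illustration. Note that a `Server: cloudflare` header appears on any response proxied through Cloudflare, so on its own it says nothing about who denied the request.

```python
def classify_denial(status, headers, body=""):
    """Guess whether a denied response came from Cloudflare or the origin app.

    Heuristic only: the markers checked here are assumptions about typical
    Cloudflare responses, not something confirmed in this thread.
    """
    h = {k.lower(): v for k, v in headers.items()}
    if status not in (403, 429, 503):
        return "not-a-denial"
    # Assumed marker: Cloudflare challenge responses carry this header.
    if h.get("cf-mitigated", "").lower() == "challenge":
        return "cloudflare-challenge"
    text = body.lower()
    # Assumed markers: titles commonly seen on Cloudflare interstitial pages.
    if "just a moment" in text or "attention required" in text:
        return "cloudflare-page"
    # A JSON error body suggests the application itself answered.
    if "json" in h.get("content-type", "").lower():
        return "origin-app"
    return "unknown"
```

A 403 with a JSON body would come back as `"origin-app"`, which is the case to check against API key permissions first.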

The only times you should be blocked by Cloudflare are when:

  • You're manually IP banned by me, usually for abusive levels of traffic, like doing 500k+ API requests per day, or trying to scrape the entire site one page at a time. On rare occasions entire subnets or ASNs get blocked if there are high levels of abuse (usually one guy rotating between tons of IPs on the same network).
  • Your User-Agent HTTP header is blank.
  • Your User-Agent is set to spoof a browser when you're not. Some tools and HTTP libraries do this by default. You shouldn't go out of your way to pretend to be a browser when you're not. That's a sign you're a bad actor.
  • Your User-Agent is manually blocked by me. This usually happens when a particular user agent is the source of a lot of bad traffic. This is rare. Most of these are bots spoofing old browsers, or abusive bots using particular versions of common libraries like python-requests or node-fetch. A lot of the time someone writes baby's first bot in Python or JavaScript, and they do stupid things like trying to scrape the entire site one page at a time as fast as possible. Then when they get IP banned they start rotating IPs and I have no choice but to block python-requests as a whole.
  • Your IP has a high threat score according to Cloudflare. This usually happens when using Tor or proxies or VPNs that have been abused by spammers. This is rare.
  • Your request trips some WAF rule. Usually this happens when your request looks like an SQL injection attempt, or an exploit for a PHP or WordPress vulnerability. This is very rare.
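The User-Agent items in the list above can be turned into a quick self-check for a tool before it ever hits the site. This is only a sketch: the patterns are illustrative guesses at what "browser-like" and "library default" strings look like, and they are not the site's actual rules. The function name is hypothetical.

```python
import re

# Assumption: illustrative patterns only, not the site's real block rules.
BROWSER_LIKE = re.compile(r"^Mozilla/|Chrome/|Safari/|Firefox/", re.I)
LIBRARY_DEFAULT = re.compile(
    r"^(python-requests|python-urllib|node-fetch|curl|Go-http-client)\b", re.I
)


def user_agent_problems(ua):
    """Return a list of reasons this User-Agent might get a tool blocked."""
    if not ua or not ua.strip():
        return ["blank"]
    problems = []
    if BROWSER_LIKE.search(ua):
        problems.append("looks like a browser")
    if LIBRARY_DEFAULT.search(ua):
        problems.append("library default")
    return problems
```

An empty list does not guarantee access; it only means the string avoids the obvious cases named above.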

DERP well I feel stupid. Didn't even realize I put IP restriction on. Guess that one's on me since my IP ranges changed (RIP me!).

Good to know the cases in which CF walling becomes permanent, thanks for sharing!

In any case, I updated my IP restriction settings, so let's see what happens there. (Odd, though, that some tools assume it's a CF wall.)

Update: Being the *derp* I am, I didn't realize the IP restriction was in place. OOPS!

Updating the IP range solved the connection chaos.
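For anyone else who restricts an API key by IP and later changes providers, a check like the following would have caught this quickly. It is a sketch under assumptions: the thread does not say what format the API key settings accept, so the CIDR notation, the function name, and the documentation-range addresses in the usage note are all illustrative.

```python
import ipaddress


def ip_allowed(current_ip, allowed_ranges):
    """Return True if current_ip falls inside any of the allowed ranges.

    allowed_ranges may hold single addresses or CIDR blocks (an assumption
    about the restriction format, for illustration only).
    """
    ip = ipaddress.ip_address(current_ip)
    return any(
        ip in ipaddress.ip_network(entry, strict=False)
        for entry in allowed_ranges
    )
```

With an old allow list of `["203.0.113.0/28"]` and a new egress address of `198.51.100.10`, `ip_allowed` returns `False`, which matches the 403 seen here; adding the new `/29` to the list makes it return `True`.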

Now to catch up on my backlog of images to collect, which is likely to be a big one. Apologies in advance for the several thousand API hits in one go as I requeue the categories for downloading/checking. Fortunately, my tooling checks for dupes before redownloading, but still.
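The dupe check is roughly the following idea, shown here as a simplified sketch rather than my actual tooling. It assumes each post record carries an `md5` field and that files were saved as `<md5>.<ext>`; both are assumptions for illustration, as are the function names. Posts without an `md5` are simply skipped in this sketch.

```python
from pathlib import Path


def downloaded_hashes(dest_dir):
    """Collect the file stems already present in the download folder."""
    return {p.stem for p in Path(dest_dir).iterdir() if p.is_file()}


def posts_to_download(posts, existing_hashes):
    """Keep only posts whose md5 is neither on disk nor already queued."""
    seen = set(existing_hashes)
    queue = []
    for post in posts:
        md5 = post.get("md5")
        if not md5 or md5 in seen:
            continue  # no hash to key on, or already have it
        seen.add(md5)
        queue.append(post)
    return queue
```

This only avoids re-fetching image files; the listing requests themselves still count as API hits, which is why the backlog run is noisy either way.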

------

From a security perspective, though, do you analyze and compare the session ID against the user agent and IP? If not, you might want to add that check. I've been able to impersonate myself in vanilla browsers just by copying over the right session cookie.

(Disclaimer: I do IT security research and work for my day job, so that's why I asked.)
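To make the suggestion concrete, here is a minimal sketch of session binding, written in Python purely for illustration. It is not how Danbooru works and makes no claim about its codebase; the secret, the function names, and the choice to compare only a network prefix (so that ordinary address changes within one network don't log people out) are all my assumptions.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical server-side secret, for illustration only.
SECRET = b"replace-with-a-real-server-secret"


def client_fingerprint(user_agent, ip, v4_prefix=24, v6_prefix=64):
    """Derive a keyed hash from the User-Agent and the client's network."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets a host address stand in for its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    msg = f"{user_agent}\n{network}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def session_matches(stored_fingerprint, user_agent, ip):
    """Compare the fingerprint stored at login with the current request."""
    current = client_fingerprint(user_agent, ip)
    return hmac.compare_digest(stored_fingerprint, current)
```

The fingerprint would be stored alongside the session at login, and a mismatch on a later request could trigger re-authentication rather than a hard block. A copied cookie replayed from a different network or client would then fail the check, at the cost of occasionally bothering users who roam between networks.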
