
Getting 403 error while trying to download images

Posted under Bugs & Features

kittey said:

Hopefully, you can post some info on the forum about what would be considered “benign”, especially if delays or rate limits are required to stay below some otherwise invisible threshold.

Agreed, I'd be more than happy to put some delays in my scripts if I knew what to aim for.

I appreciate the intent, but I do hope the restrictions are relaxed soon. I mostly use Hydrus to selectively download and organize pictures and this change has broken that completely.

I don't download a ton of images, but it's still a significant amount (maybe 1,000 a month at the very most). I'm sorry if I was part of the problem; I follow the guidelines for the API and would gladly follow stricter guidelines for image downloads. I didn't realize it was that bad. I'll hold off on downloading for a while and honor the idea of not masquerading as a browser until more updates are given.

I would gladly pay a decent amount for the ability to download a certain number of images through a script, and I'll gladly limit my scripts heavily. Please keep us updated on the options. I want to support the site and its maintainers. Thank you for the work.

I was making a "remaster" of Danbooru with a newer UI and a built-in proxy (to bypass Danbooru being blocked in some countries) as a hobby project. It would be sad to have to give up on the idea, because my friend and I spent a lot of time on it ;-;

Menko said:

Agreed, I'd be more than happy to put some delays in my scripts if I knew what to aim for.

Just out of interest. Did you have any limit set?
Please do not answer "but there isn't a written guideline" or any variation on that statement

I was wondering all day whether it was me breaking something with my bot, heh. Both happy (and afraid) to see how this plays out.

Maybe rate limiting via API authentication (which already exists for the REST API, but not for the image CDN) could be a solution?

I have been using links to donmai images in posts on Lemmy (a Reddit-like social network). The image shows up in users' feeds for them to view (they don't have to click through to the donmai page). Is this still an intended use case for the future? Or should I quit doing that?

redtails said:

Just out of interest. Did you have any limit set?
Please do not answer "but there isn't a written guideline" or any variation on that statement

At the moment, no. I recently updated my scripts to download through multiple threads, and I haven't worked out how to put a delay or limit on that multithreaded process. I could put sleep timers into the download function itself, but that would only add a delay between the end of one call and the start of the next within the same thread. It wouldn't stop all the threads from starting simultaneously when the program launches, or stop different threads from making requests at nearly the same time if that happened to coincide. Before I switched to multiple threads I just had a delay built into the download function, but I found that the time spent downloading and connecting for the next download was greater than the sleep time I had set anyway.
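
One way to handle that is a single rate limiter shared by every thread, so each download has to claim the next free time slot before it starts. Here's a rough sketch in Python; the two-second interval, the example User-Agent, and the file paths are placeholders I made up, not anything Danbooru prescribes.

```python
import threading
import time
import urllib.request

class RateLimiter:
    """Allow at most one request every `interval` seconds across all threads."""

    def __init__(self, interval: float):
        self.interval = interval
        self.lock = threading.Lock()
        self.next_slot = 0.0

    def wait(self) -> None:
        # Each thread reserves the next free time slot under the lock,
        # then sleeps outside the lock until that slot arrives.
        with self.lock:
            slot = max(time.monotonic(), self.next_slot)
            self.next_slot = slot + self.interval
        time.sleep(max(0.0, slot - time.monotonic()))

limiter = RateLimiter(interval=2.0)  # placeholder pacing, not an official number

def download(url: str, path: str) -> None:
    # Identify the bot instead of copying a browser User-Agent string.
    req = urllib.request.Request(
        url, headers={"User-Agent": "example-downloader/1.0 (contact@example.com)"}
    )
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())

def worker(jobs: list[tuple[str, str]]) -> None:
    for url, path in jobs:
        limiter.wait()   # every thread blocks here until its turn
        download(url, path)
```

Because the limiter is shared, it also covers the startup case: all threads can launch at once, but their first requests are still spaced `interval` seconds apart.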

soredeii said:

I have been using links to donmai images in posts on Lemmy (a Reddit-like social network). The image shows up in users' feeds for them to view (they don't have to click through to the donmai page). Is this still an intended use case for the future? Or should I quit doing that?

You should stop. If you really want to, just download it yourself and attach it to the post on Lemmy.

Over 30% of our bandwidth is taken up by bots and scrapers now, and the image servers can no longer take it.

I run a rather low-volume scraper from time to time, and I'd like to be a good citizen. For example, after fetching an image my scraper goes to sleep and doesn't do anything until it wakes up two seconds later.

If you suggest any guidelines for robots and scrapers, I'll make sure to comply.

Thanks.

I've dialed it down to only blocking bots that try to impersonate web browsers. If you set a custom User-Agent header, you won't be blocked (unless you're downloading too much). If you try to disguise your traffic as human traffic, you will be blocked.

You may be blocked manually if you try to download too much. "Too much" is not a hard line, it depends on what you're doing and how much spare bandwidth the site has at the time. Basically, if the site is feeling slow, that's when I start going down the list of the top downloaders and blocking people. If you're downloading less than 50GB per day, then I probably won't take notice of you. If you're downloading more than 50GB per day, that's when I start looking at what you're doing and potentially blocking you if it doesn't seem reasonable.

You should set your User-Agent to something containing your bot's name and/or your name or contact info. You're less likely to get blocked if I can check your bot's code on Github or if there's some way to contact you. You're more likely to get blocked if I can't tell what you're doing and I have no way of contacting you. If you try to disguise your traffic as a browser or as human traffic, you will be blocked.
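
As a concrete illustration (the bot name, repo URL, and contact address below are invented placeholders, not a required format), a script using the third-party requests library might set its headers like this:

```python
import requests

# Identify the bot and its operator; never reuse a browser's User-Agent string.
HEADERS = {
    "User-Agent": "example-archiver/1.0 (+https://github.com/exampleuser/example-archiver; contact@example.com)",
}

resp = requests.get(
    "https://danbooru.donmai.us/posts.json",
    params={"limit": 20},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
posts = resp.json()
```

The same header should go on the image downloads themselves, since those are the requests that hit the image servers.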

Hotlinking is allowed within reason. Things like posting images for friends on social media, forums, or personal sites are allowed. Things like building apps and alternate UIs for Danbooru are allowed, as long as they're not monetized, they don't contain "Danbooru" in their name, they don't harvest user passwords or API keys, and they don't remove the Referer header. Things like building competing sites that leech our bandwidth aren't allowed, especially if they're monetized (i.e. building a hentai site that hotlinks or proxies all our images and surrounds them with ads).

If you're downloading images for AI purposes, get the 720x720 samples instead of the full size original images. The full set of posts is 9.5 TB. Downloading that much data will take too long and use too much bandwidth. It would take nearly a day even if you could download at a full 1Gbps, which you can't. Just writing that much data to a hard drive would take nearly a day, even if you were copying it straight from one drive to another.
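
To illustrate one way of doing that (this is a sketch, not an official client: it assumes each post's JSON exposes a media_asset.variants list with a "720x720" entry, which is what current API responses appear to include, so check your own output before relying on it):

```python
import requests

HEADERS = {"User-Agent": "example-dataset-builder/0.1 (contact@example.com)"}  # invented identifier

def sample_urls(tags: str, limit: int = 20) -> list[str]:
    """Return the 720x720 sample URL for each post in a tag search, skipping posts without one."""
    resp = requests.get(
        "https://danbooru.donmai.us/posts.json",
        params={"tags": tags, "limit": limit},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    urls = []
    for post in resp.json():
        for variant in post.get("media_asset", {}).get("variants", []):
            if variant.get("type") == "720x720":
                urls.append(variant["url"])
                break
    return urls
```

For scale, the "nearly a day" figure checks out: 9.5 TB is about 7.6 × 10^13 bits, and at 1 Gbps that is roughly 76,000 seconds, or about 21 hours.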

For reference, something like 7 TB of traffic per day is from bots and other things that aren't real people browsing the site. About half of that is from search crawlers like Google and Bing, from hotlinks on other sites (mainly Google Images), and from Discord embeds (1.5 TB alone is from Discord, probably mostly from bots endlessly dumping random images in some shitty unread server somewhere). The other half is from downloaders and scrapers. This traffic is harder on the servers because most of it can't be cached. With human traffic, most of it can be cached because you have a large number of people viewing a small number of images. With bot traffic, most of it can't be cached because you have a large number of bots downloading a large number of random images, so it all goes straight to disk and the disks eventually can't keep up.


evazion said:

You should set your User-Agent to something containing your bot's name and/or your name or contact info. You're less likely to get blocked if I can check your bot's code on Github or if there's some way to contact you. You're more likely to get blocked if I can't tell what you're doing and I have no way of contacting you. If you try to disguise your traffic as a browser or as human traffic, you will be blocked.

So I tried setting my User-Agent to "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko Danbooru User Id#120517 Rathurue" but it still doesn't work. I'm using IDM at the moment, because I've got pending downloads there that I added... months ago? without actually downloading them. Do I need to set the User-Agent only in IDM, or also in the browser (Firefox) and in my OS?

Rathurue said:

So I tried setting my User-Agent to "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko Danbooru User Id#120517 Rathurue" but it still doesn't work. I'm using IDM at the moment, because I've got pending downloads there that I added... months ago? without actually downloading them. Do I need to set the User-Agent only in IDM, or also in the browser (Firefox) and in my OS?

That's a browser-like user agent. It counts as pretending to be a browser. Just set it to "Danbooru user #120517" and remove the fake Mozilla stuff.

nonamethanks said:

That's a browser-like user agent. It counts as pretending to be a browser. Just set it to "Danbooru user #120517" and remove the fake Mozilla stuff.

Still returns 403 Forbidden *on older downloads.
Eh, at least it works for newer downloads now, but that backlog of downloads is toast.
