policy on leeching

Posted under General

Hi,

so I was thinking about making meself an offline copy of the site on a dedicated drive (please don't ban this IP for the mere question, it's part of a large ISP's pool - you would only penalize the person who has it after me), so I have a large library of images to play with. (Seriously, that's the whole reason. Exclusively private usage).

Anyway, I was wondering if people have attempted this before, and if there's an official policy on doing things like this. I could see several stances:

* OMGWTFHAXBAN?! Get your grubby fingers off our data.
* Please don't.
* Meh
* No problem, as long as you keep your bandwidth below $K

If the officials have a problem with it, I won't do it. Feedback much appreciated.

--feep

PS: in related news, how many deca-gigs do you think I'd need, presuming duplicates are automatically symlinked?
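
Something like this is roughly what I mean by "automatically symlinked" - an untested sketch, assuming that identical content hash = duplicate and that everything sits under one archive directory (the path below is just a placeholder):

#!/bin/bash
# Untested sketch: replace byte-identical duplicates under the archive
# directory with symlinks to the first copy seen, keyed on SHA-1 of the
# file contents. Needs bash 4 for the associative array.
ARCHIVE=${1:-./images}      # placeholder archive directory
declare -A seen             # content hash -> absolute path of first copy

while IFS= read -r -d '' file; do
    hash=$(sha1sum "$file" | cut -d' ' -f1)
    if [[ -n ${seen[$hash]} ]]; then
        # Same bytes as an earlier file: drop this copy, link to the original.
        rm -- "$file" && ln -s "${seen[$hash]}" "$file"
    else
        seen[$hash]=$(readlink -f "$file")
    fi
done < <(find "$ARCHIVE" -type f -print0)

(On a single filesystem, hardlinks - plain ln - would arguably be a better fit, since the copies then don't break if the original ever gets moved.)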

There should be various people around who have an old copy of the database. About... 16 months? ago, the site shut down for a while. At the time it had been using a mirroring system, so most of the mirrors still have that old stuff. If all you want is a large collection of images, that should probably suffice. I think the tag data for them was also made available by albert; maybe you can find that dataset somewhere.

Sounds sweet! Although, of course, downloading the live site would be vastly more fun. :)

Sorry, I have a bit of an unhealthy addiction to downloading threads from image sites. (like 4chan .. I still have the shell scripts somewhere)

I presume this copy is a subset of the current list of images? In that case, sorting them by tags (even if the actual tags aren't available) should be as easy as walking through the current tag list over a few days.

Actually, presuming I start out with this corpus thanks to the generosity of random people, how much would I need to download to get it "up-to-date"? (addict, remember)

Or would that defeat the whole point?

(Note, again, that if you or anybody says "No leeching", I'm perfectly fine with that)

As far as I can figure out, what you want to do would be roughly equivalent to having 5 or so more regular visitors. I can't see how that's going to put us over the edge. As long as you don't make the script hammer the site incessantly, I don't see a problem with it, but I guess you should hold on for albert's final word on it.

From a practical perspective, I would recommend getting the old copy and then adding the new stuff from here, just because of the 503s and such.

FeepingCreature said:
I presume this copy is a subset of the current list of images? In this case, sorting them by tags (even if the actual tags aren't available), should be as easy as walking through the current tag list over a few days.

Pretty much. There will invariably be a few posts in that corpus that have been deleted within the last 16 months, though.

FeepingCreature said:
Actually, presuming I start out with this corpus thanks to the generosity of random people, how much would I need to download to get it "up-to-date"? (addict, remember)

Well, going by post IDs, it seems you're looking at roughly 114331 posts then versus 280053 now - so roughly 166,000 posts to catch up on.

In a related question, I'm wondering if it would be possible to get a dump of the tagging information, in a machine-readable format. I'm especially interested in the tag-type and tag-post relations.

I'm considering trying my hand at a "tag predictor", which would be a program that tries to predict tags based on other tags. For example, it would be nice to be able to enter "da_capo cat_ears maid" and have it predict "sagisawa_yoriko" for me.
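
As a first, dumb cut I'm picturing plain co-occurrence counting - untested sketch, assuming a hypothetical dump file posts_tags.txt with one post per line and its tags separated by single spaces:

#!/bin/bash
# Untested sketch of the dumbest possible "tag predictor": rank tags by how
# often they co-occur with all of the query tags. Assumes tags contain no
# spaces and no regex metacharacters.
DUMP=posts_tags.txt             # hypothetical dump: one post per line
QUERY=(da_capo cat_ears maid)

# Keep only the posts that carry every query tag.
matches=$(cat "$DUMP")
for t in "${QUERY[@]}"; do
    matches=$(printf '%s\n' "$matches" | grep -E "(^| )$t( |\$)")
done

# Count the remaining tags and print the ten most frequent suggestions,
# leaving out the query tags themselves.
printf '%s\n' "$matches" | tr ' ' '\n' \
    | grep -vxF -f <(printf '%s\n' "${QUERY[@]}") \
    | sort | uniq -c | sort -rn | head -10

Anything smarter (conditional probabilities, some naive-Bayes-style weighting) could be layered on top of the same counts later.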

In another related question: if I grab those images, I'll need a good place to store them. What filesystem would you recommend, specifically for speed of access?

I don't have any experience with the more exotic file systems, having used ext2/reiser exclusively so far :)

--feep

[edit] Can't get a response either. Seems ded. :pokes:
[edit2] tomomaru: sounds neat! :)

I don't care. It would be nice if you showed some manners and only downloaded a thousand posts a day, but I can't stop you from leeching more.

You can get the tags if you use the API (http://danbooru.donmai.us/help/api). As for a database dump, I'm not keen on doing that because I'd have to sanitize sensitive information, and that's too much work for something you're not even making public.
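
For the tag types specifically, paging through the tag index should be enough. Untested sketch - the endpoint and parameters below are from memory, so double-check them against /help/api before relying on it:

#!/bin/bash
# Untested sketch: walk the tag index via the API and save the raw JSON.
# The endpoint, parameters and the empty-page check are assumptions --
# verify against http://danbooru.donmai.us/help/api.
BASE="http://danbooru.donmai.us"
page=1
while :; do
    out="tags_page_$page.json"
    curl -s "$BASE/tag/index.json?limit=100&page=$page" -o "$out"
    # An empty JSON array means we've run past the last page.
    [[ $(cat "$out") == "[]" ]] && break
    page=$((page + 1))
    sleep 5     # keep the request rate polite
done

The post listing should hand back each post's tag string the same way, which covers the tag-post relation.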

Thanks albert!

Okay, it's running.

Source is here: http://paste.dprogramming.com/dpn384ul

I'm not using the official API because .. well, once I have the page source it's easier to just grep it directly.

Since the script doesn't do any bandwidth limiting by itself, I'm running it like this:

crunchy src # for ((i=0; i < 100000; i += 64)); do echo Fetch $i to $((i+63)); ./fetch_page .. $i $((i+63)); echo Sleeping; sleep 1h; done

Let's hope any bugs get found before the first 1000 images :)

[edit] Since this is rather slow, it doesn't matter that I'm starting at the front - as long as whoever is willing to donate a copy of the old image archive does so within the next week or so, there won't be much overlap.
[edit2] Bugfix - permissions were off. Link updated.
