Danbooru

Image Sample Cleanup Project


Mikaeri said:

@BrokenEagle98 Will it be possible to have a script to detect e-hentai samples? post #2657508, for example.

Quite possible; however, it would all depend on the standard link format for those sources on Danbooru...

Type 1

No problem...

Type 2

Shouldn't be a problem...

Type 3

Nope... there's a good chance of having to search through more than 1000 images. If anything, I'd mark these with bad source, although I have mostly used this format myself since checking sample sizes wasn't a thought, so I'm not sure which way to go on these.

Type 4

Not even >:(

Others

Are there any that I'm missing...?

sweetpeɐ said:

It would also be useful if the source was for e-hentai.org for non-restricted content, since not all users are up and running with exhentai

I only use ex, but if I'm remembering correctly, I'm fairly sure that if sadpanda occurs on one it will occur on the other. Regardless, circumventing sadpanda is as easy as getting the sadpanda extension.

BrokenEagle98 said:

Yeah... so e-hentai has extremely sensitive anti-bot triggers... after being IP banned twice (this last time for 24 hours), I'm calling a full stop for now... :/

Yes, if you browse pages too quickly (I've been burned by this one a few times) or refresh too often in a given period, you'll get flagged as a bot. For link checks you'd probably have to set a delay between actions of like 5-10s.

CodeKyuubi said:

Yes, if you browse pages too quickly (I've been burned by this one a few times) or refresh too often in a given period, you'll get flagged as a bot. For link checks you'd probably have to set a delay between actions of like 5-10s.

Yeah, the first time was just me manually trying out different cookies to see which were necessary to get what I wanted. The second time was because I had mis-coded the delay of 60 seconds I had put in (due to that first IP ban) so it never actually got called. Unfortunately, screw-ups like the above are a lot more costly when there is such a severe penalty... :/
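For anyone scripting link checks, here is a minimal Python sketch of a throttle that enforces a minimum gap between requests, so the delay can't be silently skipped. The `Throttle` class, the 60-second default, and the injectable `clock`/`sleep` parameters are all made up for illustration; this is not the actual script, and it never touches the network.

```python
import time


class Throttle:
    """Block until at least `min_gap` seconds have passed since the last call."""

    def __init__(self, min_gap=60.0, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the delay logic can be tested
        # without waiting or risking a real request (hypothetical design).
        self.min_gap = min_gap
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep if needed; return the number of seconds actually waited."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_gap - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

Calling `wait()` right before every page fetch, and checking its return value against a fake clock first, would be one way to catch a delay that never actually gets called before it costs an IP ban.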

BrokenEagle98 said:

Are there any that I'm missing...?

Some Type 4 urls include an "xres=" parameter which seems to be followed by "org" for originals or a number for samples.
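A rough Python sketch of how a script might read that parameter, assuming the value sits right after "xres=" and ends at a ";", "&", "/" or the end of the URL. The function name `classify_xres` and the example URLs are hypothetical; the real link format would need to be checked against actual sources.

```python
import re

# Assumption: the value follows "xres=" and stops at ';', '&', '/' or end of string.
_XRES = re.compile(r"xres=([^;&/]+)")


def classify_xres(url):
    """Return 'original', 'sample', or 'unknown' based on the xres= value."""
    match = _XRES.search(url)
    if match is None:
        return "unknown"  # no xres= parameter at all
    value = match.group(1)
    if value == "org":
        return "original"
    if value.isdigit():
        return "sample"  # a number appears to mean a resized sample
    return "unknown"
```

Anything that comes back as "unknown" would still need a manual look, since the "org"/number pattern is only what the parameter seems to do.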

Also a reminder that e-hentai has an IQDB-like function if you click "Show File Search". It lists the results as galleries rather than individual pictures, though. (Type 3)

Yeah, you can.

@BrokenEagle98 Want to adjust your script for that? That reminds me, though: we should get through as many Twitter samples as we can before artists start deleting their tweets or losing their accounts. usashiro mani recently had their original Twitter account suspended for some reason, and it's starting to worry me.

Adding the tags manually is fine.

Just for reference, I only have my script checking every post 5 mins after they're uploaded while I'm awake (0900-2300,Z-0400), then I batch the ones I missed while I was asleep.

Although I plan on it, I haven't nailed down a schedule for regular check-ups...

I know that full checkups will only be done every couple of months since they usually take a couple of days per site source.

Maybe, check posts after a week has passed, then a month, then not again until a full checkup...?

Thoughts?
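The proposed schedule could be sketched in Python roughly like this. The offsets come from the post above (5 minutes, a week, a month), with "a month" assumed to be 30 days; the names `CHECK_OFFSETS`, `check_times`, and `due_checks` are hypothetical and not from the actual script.

```python
from datetime import datetime, timedelta

# Assumed offsets after upload: 5 minutes, one week, one month (30 days here).
CHECK_OFFSETS = (timedelta(minutes=5), timedelta(weeks=1), timedelta(days=30))


def check_times(uploaded_at):
    """Return the datetimes at which a post would be re-checked."""
    return [uploaded_at + offset for offset in CHECK_OFFSETS]


def due_checks(uploaded_at, last_checked, now):
    """Return scheduled checks that fall after last_checked and at or before now."""
    return [t for t in check_times(uploaded_at) if last_checked < t <= now]
```

Posts with no due checks would simply wait for the next full checkup, and the offsets are easy to adjust once it's clear how effective they are.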

BrokenEagle98 said:

Adding the tags manually is fine.

Just for reference, I only have my script checking every post 5 mins after they're uploaded while I'm awake (0900-2300,Z-0400), then I batch the ones I missed while I was asleep.

Although I plan on it, I haven't nailed down a schedule for regular check-ups...

I know that full checkups will only be done every couple of months since they usually take a couple of days per site source.

Maybe, check posts after a week has passed, then a month, then not again until a full checkup...?

Thoughts?

Sounds like a good idea to me. Could adjust as you go, to see how effective it is.

Sacriven said:

I'm interested to help, but I don't know from where I should start. Guidance please?

@Sacriven Sure thing. Read up on the basics of image sample if you have the time, and then go through some of the useful searches listed.

I'd say Twitter samples are the most at risk of being deleted soon, so they're a pretty high priority.

I'm considering yandere samples low priority at the moment since they rarely delete images. It'll also be difficult if we run into images that exceed our booru's 25 MB upload limit. howto:yandere also hasn't been created yet (which I plan to do soon).

Feel free to bump this thread or Dmail me if you have any more questions.

EDIT: I'm hoping there's a way we can give users more incentive to clean up image samples, as it's a daunting process for only a few users to handle given how many images we have to go through. I think it's an extremely easy way to rack up uploads and increase a user's upload/deletion ratio.


Updated the original post so users who want to contribute might have a sense of direction as to which posts they should focus on first. Basically, it shows how many active samples we have at the moment.

A lot of the "source:*twitter.com/ md5_mismatch -upscaled status:active" ones seem to be images that are visibly identical, except the Danbooru one has a larger filesize than the Twitter one. Do we really need to be replacing all these things?

@kuuderes_shadow Yeah. I realized that Twitter changed their recompression algorithm ~9-12 months ago, so when we started to replace Twitter samples, there were some Twitter ":large" samples with a different MD5 than the current ":large" samples and the same dimensions as the original image. I did a quick diff to see where the color changes were, and found that the larger, older Twitter ":large" samples were worse (poorer color, artifacts).

That's the reason I updated the search.

EDIT: It's also the reason why a lot of the Twitter samples older than 9 months back being deleted are matching md5_mismatch downscaled, not twitter_sample. Should I put a note there?

EDIT 2: Note that some sources may just plain be incorrect. I've been working through images quickly and checking the md5s as accurately as I can, but there are bound to be some mistakes.
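For reference, the MD5 comparison itself is simple to sketch in Python with the standard `hashlib` module. The helper names `md5_hex` and `is_md5_mismatch` are made up for illustration; how the stored MD5 and the source bytes get fetched is left out, and this only compares hashes, not visual quality.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 of a bytes object as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def is_md5_mismatch(stored_md5, source_bytes):
    """True when the source's bytes hash differently from the stored post MD5."""
    return stored_md5.lower() != md5_hex(source_bytes)
```

A mismatch only says the files differ byte for byte; as noted above, deciding which copy is actually better still takes a diff or a visual check.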

