Danbooru

Image Sample Cleanup Project

Posted under General

Alright, just so no one gets confused, I'll elaborate on this. BrokenEagle98's script probably produced a false positive, and because I work through image samples quickly, I sometimes assume from the get-go that its output is correct without double-checking visually.

But let me note, again, that Twitter used to use a different image sampling method than it does now. Take post #1642836, for example. Going by the link itself it's a Twitter sample, but not the kind we currently recognize Twitter samples by. Quick comparison:

post #1642836

File: sunny_milk_touhou_drawn_by_spirytus_tarou__1413787eb145c38f7febee495cd95f3e.jpg
CRC-32: a0158e4e
MD4: 4bc0e7c8509bbc6aa801d20a37fdcb7e
MD5: 1413787eb145c38f7febee495cd95f3e
SHA-1: f49037244fcbf7482434b63d46281119f03ed898

That is the sample as it was when he uploaded it. The sample served from the current link has this information:

File: BjPxhHJCMAEiAhH.jpg-large
CRC-32: 30920fba
MD4: 2f9b492d501c814414ae5a17d1fa1cff
MD5: bd8d460b18a0bfcbfad2deb6db77e653
SHA-1: 8dd84e804f45d21d594faabe2661c05d0fe845b1

This also applies to posts that are sampled but still have the same resolution as their parent. Take post #1949989 and its parent:

post #1949989

File: kawashiro_nitori_touhou_drawn_by_maturiuta_sorato__5bc14be330d85ca045ede0cdbb2c4f08.jpg
CRC-32: 8d1781b2
MD4: 7a4e271d175be5d05e4dc52f8ffdc559
MD5: 5bc14be330d85ca045ede0cdbb2c4f08
SHA-1: 8ad963a411a590004f48e3dec61258164dbf5f29

Which is a sample, but different from the current sample below.

Current sample from this link

File: B9bwU1rCMAAwsgb.jpg-large
CRC-32: a9aa3d65
MD4: f22d435cf8ee0080c223d0d7a7117015
MD5: 2cb244fa234f233e1a63490b450ff73a
SHA-1: a67ecead58dcb3d433412fa93ea9a5fde9ee3ad6

And of course, the original image.

post #2669391 (post I replaced with)

File: kawashiro_nitori_touhou_drawn_by_maturiuta_sorato__587f1d2cefe4821bfb484262f9676050.jpg
CRC-32: 45986f34
MD4: cdea1f9b60a1447ebb36fcafe199d186
MD5: 587f1d2cefe4821bfb484262f9676050
SHA-1: a050ea5d13319664196f44ce2e2d8ed61d1e5418
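For anyone who wants to reproduce listings like the ones above, here is a minimal Python sketch that computes the same kinds of digests from a file's bytes. The function names are made up for illustration and this is not BrokenEagle98's script. MD4 is left out because Python's hashlib does not provide it on every system.

```python
import hashlib
import zlib


def file_digests(data: bytes) -> dict:
    """Return CRC-32, MD5 and SHA-1 of the given bytes as lowercase hex."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def same_file(a: bytes, b: bytes) -> bool:
    """Two downloads are byte-identical only if all of their digests agree."""
    return file_digests(a) == file_digests(b)
```

Reading both files with open(path, "rb").read() and passing the bytes to same_file() tells you whether an old sample and the currently served sample are the same file. It says nothing about whether they look the same, which still needs a visual check.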

If you're confused, don't be. Let's just say that Twitter's kind of stupid. Although they've tried to improve their sampling algorithm, it still produces generally crappy results for lower-resolution images, with near-zero bandwidth savings and a lot more storage cost on their end.

The mistake here was that BrokenEagle98's script derped, and I didn't make a thorough visual comparison (I did not notice that her pasties were removed). But we still consider Twitter images that are md5 mismatches to be samples (especially if they have more blatant artifacting) as long as they are visually identical to the original. That last part is important, and I hope I made note of it in twitter sample. If not, I'll revise it real quick.

Pinging relevant users @Sacriven @chinatsu, although I think the latter got the message in the Discord general chat.

For more examples, you can look here:

CodeKyuubi said:

So what you're saying is, the old original had a different md5 than the current original, so the bot detected the new original's md5 and got an md5 mismatch with the old original's md5?

Originals stay the same for Twitter samples. This image is still the same as when it was posted -- nothing has changed. But what did change was this image, the sample. Previously it had an MD5 of 5bc14be330d85ca045ede0cdbb2c4f08 but currently, it has an MD5 of 2cb244fa234f233e1a63490b450ff73a.

I can do a diff comparison later if desired, but let's just say that whatever they did when they switched over to the new sampling algorithm is kinda weird, just to keep things short.

Mikaeri said:

Originals stay the same for Twitter samples. This image is still the same as when it was posted -- nothing has changed. But what did change was this image, the sample. Previously it had an MD5 of 5bc14be330d85ca045ede0cdbb2c4f08 but currently, it has an MD5 of 2cb244fa234f233e1a63490b450ff73a.

I can do a diff comparison later if desired, but let's just say that whatever they did when they switched over to the new sampling algorithm is kinda weird, just to keep things short.

But if the original was the image on Danbooru like what Sacriven posted, why was there a false positive when it only applies to samples?

CodeKyuubi said:

But if the original was the image on Danbooru like what Sacriven posted, why was there a false positive when it only applies to samples?

He was comparing the wrong image, that's why. The comment was deleted, but he was trying to compare the second image (the one without pasties) to the first (the one with them). Of course that would result in an md5 mismatch, since they're different images with noticeably different visual features (again, the pasties). But he reran the script and found a match, so the comment was removed.

It was just a derp on his end. It's also partly my fault for not making the visual comparison (again, I work fast), but I told him to rerun the script on all Twitter posts currently tagged md5 mismatch in case there are more false positives like that. If there are, they'll be edited accordingly and we can move on from there.

Full scan report

  • Site: Twitter
  • Type: Recheck of Twitter sample, MD5 mismatch, bad Twitter ID
  • Start time: 2017-05-20 23:41 Z
  • End time: 2017-05-21 22:46 Z
  • Tag changes:
    • "-bad_id -bad_twitter_id": 89
    • "-downscaled -md5_mismatch -resized": 3
    • "-image_sample -twitter_sample": 32
    • "-md5_mismatch": 14
    • "bad_id bad_twitter_id -downscaled -md5_mismatch -resized": 37
    • "bad_id bad_twitter_id -image_sample -twitter_sample": 14
    • "bad_id bad_twitter_id -md5_mismatch": 18
    • "bad_id bad_twitter_id -md5_mismatch -resized": 1
    • "bad_id bad_twitter_id inactive_account -image_sample -twitter_sample": 1
    • "bad_id bad_twitter_id inactive_account -md5_mismatch": 1
    • "bad_source": 5
    • "bad_source -bad_id -bad_twitter_id": 1
    • "bad_source -downscaled -md5_mismatch -resized": 1
    • "bad_source -md5_mismatch": 2
    • "bad_source revision -bad_id -bad_twitter_id": 1
    • "bad_source revision -downscaled -md5_mismatch -resized": 3
    • "bad_source revision -md5_mismatch": 2
    • "bad_source revision -md5_mismatch -resized -upscaled": 2
    • "cropped -downscaled -resized": 1
    • "cropped -resized": 1
    • "cropped md5_mismatch -bad_id -bad_twitter_id": 1
    • "downscaled md5_mismatch resized -bad_id -bad_twitter_id": 7
    • "downscaled md5_mismatch resized -image_sample -twitter_sample": 83
    • "downscaled resized": 35
    • "downscaled resized -image_sample -twitter_sample": 1
    • "image_sample twitter_sample": 1
    • "md5_mismatch": 2
    • "md5_mismatch resized upscaled -bad_id -bad_twitter_id": 2
    • "md5_mismatch stitched": 1
    • "protected_link -downscaled -md5_mismatch -resized": 6
    • "protected_link -image_sample -twitter_sample": 4
    • "protected_link -md5_mismatch": 12
    • "resized upscaled": 3
    • "resized upscaled -bad_id": 2
    • "upscaled -downscaled": 1
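To make the report easier to read, here is a minimal Python sketch of how one of the quoted tag-change strings could be applied to a post's tags. It assumes each string works like a Danbooru tag edit, where a bare tag is added and a tag with a leading "-" is removed. The function name is hypothetical and not taken from the scanning script.

```python
def apply_tag_changes(tags: set, change: str) -> set:
    """Apply a space-separated tag-change string to a set of tags.

    Tokens starting with "-" remove that tag; all other tokens add a tag.
    The input set is left untouched and a new set is returned.
    """
    result = set(tags)
    for token in change.split():
        if token.startswith("-"):
            result.discard(token[1:])  # removing an absent tag is a no-op
        else:
            result.add(token)
    return result
```

For example, applying "bad_id bad_twitter_id -downscaled -md5_mismatch -resized" to a post tagged downscaled, md5_mismatch and resized leaves it with bad_id and bad_twitter_id plus whatever unrelated tags it already had.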

What's with all these images in the mod queue that are flagged as samples but are clearly not the exact same images? What do we do here?

Also, shouldn't these be replaced when they are the same image, rather than reuploaded and flagged?

Provence. said:

If that's the case then those users should be told to stop.

Maybe. They could just be handling posts that have notes, which can't be replaced yet because of current limitations of the new feature. I'll double-check some of them now that I'm awake.

That, or they're misconstruing what md5 mismatch actually means outside of a Twitter use case. There are md5 mismatches out there that are bona fide image samples (from user modification, or sampled/cropped from a third-party website), but these should be few and far between, and they would call for a full image replacement done without the feature.

pool #12319 It's this pool, it's not really either of those things. It appears the user just flagged every old post of each of those images as sample with no regards to the actual contents of the images themselves.

pool #3324 and this pool, without going through the entire queue it might be even more pools, in fact.

edit: wait this is a bot. This makes it even worse.

Log said:

pool #12319 It's this pool, it's not really either of those things. It appears the user just flagged every old post of each of those images as sample with no regards to the actual contents of the images themselves.

pool #3324 and this pool, without going through the entire queue it might be even more pools, in fact.

edit: wait this is a bot. This makes it even worse.

I'm confused -- what's this about a bot? Also, I double-checked some of these unsourced images, and some of them are sampled from yandere (post #750087; its child is probably a sample too).

Let me get in contact with the user... sec.

Full scan report

  • Site: Tumblr
  • Type: Recheck of Tumblr sample, MD5 mismatch, bad Tumblr ID
  • Start time: 2017-05-30 22:50 Z
  • End time: 2017-05-31 05:22 Z
  • Tag changes:
    • "-bad_id -bad_tumblr_id": 31
    • "-image_sample -tumblr_sample": 5
    • "-md5_mismatch": 24
    • "-md5_mismatch -resized -upscaled": 86
    • "-resized -upscaled": 2
    • "bad_id bad_tumblr_id -md5_mismatch": 81
    • "bad_id bad_tumblr_id -md5_mismatch -resized -upscaled": 2
    • "md5_mismatch -bad_id -bad_tumblr_id": 13
    • "protected_link -bad_id -bad_tumblr_id": 4
    • "protected_link -md5_mismatch": 3

Other

Special thanks to @chinatsu for pointing out that some Tumblr links contain the MD5 hash, which let me remove a lot of the MD5 mismatches. I noticed these types of links mostly on the faulty 68.media.tumblr.com. I wonder if the MD5 hash links were implemented to overcome that faultiness...? Regardless, I'll be adding a comment to issue #2938 documenting this new discovery.
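A rough Python sketch of that link check, for illustration only. It assumes the hash shows up as a 32-character hex path segment in the link, which is my guess at the layout rather than anything Tumblr documents. The function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: the MD5 appears as a full 32-hex-character path segment.
HEX32 = re.compile(r"[0-9a-f]{32}")


def md5_from_url(url: str):
    """Return the first 32-hex-character path segment of the URL, or None."""
    for segment in urlparse(url).path.split("/"):
        if HEX32.fullmatch(segment.lower()):
            return segment.lower()
    return None


def url_confirms_md5(url: str, post_md5: str):
    """True/False when the link carries a hash, None when it carries none."""
    found = md5_from_url(url)
    if found is None:
        return None  # nothing to compare against; fall back to other checks
    return found == post_md5.lower()
```

A None result only means the link has no hash segment, so the post can't be cleared or condemned from the link alone.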

I discovered that the GET response headers for Tumblr images include a value called "Etag" that, most of the time, matches the MD5 hash of the image. After incorporating that into the scanning script, here are the details of the rescan of MD5 mismatches.

  • Tag changes:
    • "-md5_mismatch -resized -upscaled": 17
    • "-md5_mismatch": 22
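The header check described above could look roughly like this in Python. This is a sketch under assumptions, not the actual scanning script: the function names are hypothetical, the headers are taken as a plain dict, and the value is treated as a standard HTTP entity tag (optionally prefixed with W/ and wrapped in double quotes).

```python
def etag_value(headers: dict):
    """Return the ETag header value without quotes or weak prefix, or None."""
    for name, value in headers.items():
        if name.lower() == "etag":  # header names are case-insensitive
            tag = value.strip()
            if tag.startswith("W/"):
                tag = tag[2:]
            return tag.strip('"').lower()
    return None


def etag_matches_md5(headers: dict, post_md5: str):
    """True/False when an ETag is present, None when the header is missing."""
    tag = etag_value(headers)
    if tag is None:
        return None
    return tag == post_md5.lower()
```

Since the header only matches most of the time, a False here is a hint to look closer rather than proof of a mismatch.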

@reiyasona @RaisingK Do you guys think it'd be possible to make it a regular job to replace samples on your machines? I've added a new row in topic #14119 so we can consider automation for sample replacements, so approvers won't have to do it by hand a lot of the time.

Also pinging @BrokenEagle98 for ideas. Some things:

  • We can replace images with notes now.
    • This makes it possible to replace the rest of the Twitter samples and Yandere samples with notes.
  • Uploading samples from Nicoseiga and Pawoo is exceedingly difficult (you'd basically have to dig for it), so automated jobs might not be required there.

EDIT: Part of me wonders if we could offload the scripts to DanbooruBot so everything's fairly persistent, but let's see.

Mikaeri said:

Do you guys think it'd be possible to make it a regular job to replace samples on your machines? I've added a new row in topic #14119 so we can consider automation for sample replacements, so approvers won't have to do it by hand a lot of the time.

I already do that with pixiv samples, which I can verify sample-ness for directly, and I'm relying on BrokenEagle98's comments to verify the tumblr samples for me. I'm considering stopping the tumblr replacements once the backlog is cleared, though; I don't like automating something like replacements unless I know how to (and am willing to make the effort to) automate verification myself.

Fair enough, but I think you could talk to him about it. He might share his verification scripts with you so you can run them on your own computer. Maybe you could even help improve the scripts? We're just chatting in the #technical channel of the Discord right now.