Danbooru

Image Sample Cleanup Project

Posted under General

It seems like all the pieces are set in place to open up a topic like this, especially as we have an overarching image sample tag that covers all of the image samples from major sites.

For starters, if you don't know what an image sample is, it's best you look at the wiki page linked there. This is especially important if one of your images has been flagged for such a reason.

Image samples shouldn't be uploaded. Image samples, or downsized md5 mismatches for that matter, will almost always be worse than their original counterpart. This does not count edits, revisions, etc.
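For anyone unsure what "md5 mismatch" means in practice, here is a minimal sketch, assuming you already have the raw bytes of both the uploaded file and the file at the source. The function names are made up for illustration and are not part of Danbooru.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def is_md5_mismatch(uploaded: bytes, source: bytes) -> bool:
    """True when the uploaded file is not byte-identical to the source file.

    Hypothetical helper: a mismatch only says the bytes differ, not why
    (sample, resize, re-encode, metadata change, ...).
    """
    return md5_hex(uploaded) != md5_hex(source)
```

A mismatch on its own does not prove the upload is a sample; it only tells you the file is worth a closer look.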

I think this is important because the sooner a new uploader notices that they're uploading samples, the less likely they are to keep doing it. So it's better to flag and warn them early rather than later, when they've already uploaded a lot of content.

Current Status on Samples

High Count

Should focus on these first.

Low Count

Adding the metatag parent:any to any of these searches may also yield duplicates that can be flagged/deleted.

Special Cases
Tumblr

Cleaning Up Samples

If you have a sample to replace, tag it replaceme if the source is still live, or post it in topic #14156 with a working alternate link.

Only approvers should be replacing images as of right now. Approvers have a new feature available to replace image samples -- there's more information in topic #14063.

This does not count md5 mismatches (such as from Twitter), lossy conversions (png -> unsourced jpg), or lossy-lossless conversions (lossy-lossless) -- those are handled separately.

Wrap-up

If you have any questions or want to help contribute, feel free to post here and bump this topic, or DM me for anything you need help with if you feel that it's too specific.

Other relevant topics:

Changes

03.30.2017 - Updated post count estimates. Added information regarding forum #129017.
04.03.2017 - Notice for Janitors to not use the 'move favorites' function removed. More information in forum #128927.
05.02.2017 - Updated counts. Small adjustment to searches.
05.15.2017 - New feature added to replace images in place! Updated counts.
05.20.2017 - Updated counts.
05.21.2017 - Updated counts.
06.21.2017 - Changed; only approvers should replace samples. Added special section for Tumblr.
06.22.2017 - Counts updated. Link to topic #14119 added. Adjusted searches for Tumblr md5 mismatches.
09.16.2017 - Counts updated. Edited.

sweetpeɐ said:

I don't see why these would be tagged with downscaled to begin with.

Because there's really no other way to indicate that a larger image exists at the source since they do not meet the criterion for image sample...

Edit:

As for uploading the larger images, I'm still a firm believer in manual checking of the image before uploading, primarily because you may not consider the image worthy to have been uploaded in the first place. That's why I never created a script to upload the full-size images. I believe RaisingK also checks the image before uploading the replacement.

Besides that, there's the question of how the images should be uploaded. Should I use my ability to bypass the queue, should I send it through the queue, or should I create a new member-level account to upload the images...? One thought would be if the image score exceeds a certain value, then it would be okay to bypass the queue, otherwise upload it with the alternate account.
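The score idea above could be sketched roughly like this. The threshold value and the route names are placeholders I made up; the thread does not settle on an actual number or on how the routing would be implemented.

```python
SCORE_THRESHOLD = 10  # hypothetical cutoff; no value is named in this thread


def upload_route(score: int, threshold: int = SCORE_THRESHOLD) -> str:
    """Pick how a replacement upload is submitted, based on the old post's score.

    Returns "bypass_queue" when the score exceeds the threshold, otherwise
    "alternate_account". Both labels are illustrative, not site features.
    """
    if score > threshold:
        return "bypass_queue"
    return "alternate_account"
```

Note that a score exactly equal to the threshold does not "exceed" it, so it goes to the alternate account in this sketch.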

Thoughts?

BrokenEagle98 said:
Because there's really no other way to indicate that a larger image exists at the source since they do not meet the criterion for image sample...

Umm... duplicate. Being a sample does not imply that the sample is smaller than the original. And judging by the wiki page for downscaled, it's in fact misattributed when used on samples...

BrokenEagle98 said:
manual checking of the image before uploading

Indeed there should be no such automatic uploading. The higher res sometimes in fact is worse than the sample; it should always be up to human judgement lest grainy, artifacty images bleed through.

BrokenEagle98 said:
Should I use my ability to bypass the queue, should I send it through the queue

I think this would be warranted and a good usage of the permission. However, if the concern is that you don't want to take credit for simply uploading the correct version and feel a separate account is needed to distinguish your personal uploads from fulfilling a site service, then I would go in that direction.

Also the antecedent post is already active so the uploaded one should be just as good in most cases. This raises the question though of whether the manual check should be to evaluate the quality of the image or just to make sure the image properties are good (no jpeg artifacts).

sweetpeɐ said:

Umm... duplicate. Being a sample does not imply that the sample is smaller than the original. And judging by the wiki page for downscaled, it's in fact misattributed when used on samples...

I think that's exactly what he means though, assuming you're talking about entries that match md5_mismatch downscaled. Samples shouldn't be tagged downscaled even if they technically are "downscales" of the original because such an image exists on that site already (which is why, again, we call them samples). Downscales on the other hand, we may not know if they are samples from a previous version of the source site or if it's user-resized. They're typically not safe to keep if the user has provided a source and we know for a fact that it isn't uploaded correctly. Hopefully this isn't incentive to start uploading without sources, as that would be horrible. I've clarified that in help:image source for posterity.

Indeed there should be no such automatic uploading. The higher res sometimes in fact is worse than the sample; it should always be up to human judgement lest grainy, artifacty images bleed through.

Well, I should have clarified -- semi-automatic uploading would be better. Always checking an image beforehand for quality standards is a good bet, even if the image was approved or upped by a contributor before. As for your latter comment, I'm not really sure how to feel about that -- I've only ever seen it once (post #2558234), and typically if artifacts are present in an original then they'll be there in a sample too, just to a less visible degree. Anyway, I feel like there should be some clarification on the "The higher res sometimes in fact is worse than the sample" statement, because it's not quite true unless the original artist rescaled it himself or pumped his work through waifu2x, and why anyone would do that is beyond me... but who knows. Going to ping @☆♪ to clarify this, hopefully he doesn't mind.

Besides that, there's the question of how the images should be uploaded. Should I use my ability to bypass the queue, should I send it through the queue, or should I create a new member-level account to upload the images...? One thought would be if the image score exceeds a certain value, then it would be okay to bypass the queue, otherwise upload it with the alternate account.

I think this would be warranted and a good usage of the permission. However, if the concern is that you don't want to take credit for simply uploading the correct version and feel a separate account is needed to distinguish your personal uploads from fulfilling a site service, then I would go in that direction.

Also the antecedent post is already active so the uploaded one should be just as good in most cases. This raises the question though of whether the manual check should be to evaluate the quality of the image or just to make sure the image properties are good (no jpeg artifacts).

It would be good, I'm just not sure how the mods would prefer it if we had two separate contributor accounts for either of these purposes, given we have the option to send uploads back into the queue. There's RaisingK's second account RazingK, but that account doesn't have unlimited upload permissions. If an image has already made its rounds on the site (and as BrokenEagle98 has mentioned, a good score/favcount is a fairly good indicator of which images to keep, at least for these kinds of things) then I think it can skip the queue altogether. After all, if we do make it go through the queue, that's more things for the Janitor+'s to go through, which doesn't seem all that efficient.

Perhaps it's something of personal choice though, having a second account. Somehow there's this feeling that I want to just keep my uploads on this account filled with uploads I generally like. It makes it easier for other users to search through my uploads too, instead of having to sort through the bunches of originals from sampled images that I typically wouldn't look at in the first place.

BrokenEagle98 said:

Duplicate doesn't indicate size difference...

This was one of the qualms I had about the tag, which is why I asked about this in forum #125044. I almost NEVER tag duplicate on anything aside from maybe images with extra junk metadata included. I see it as a very rarely used tag in my portfolio. If we had implications for resized, image_sample -> duplicate, it would get extremely messy. But the fact of the matter is that exact duplicates (same md5 and all) cannot be uploaded anymore.

I suppose the first step then would be reuploading the larger versions of all of the posts tagged image sample if and only if they appear worthy to be reuploaded...

Also just as a note, I've started going through some of the pictures and noticed that image sample is not always "inferior", i.e. lesser resolution.

post #2498862
post #2497911
post #2497826

The above posts are just a few I found where the "large" and "orig" versions were the same dimensions, and were only off by a few bytes as to the filesize, so not really worth reuploading.

Would it be helpful if I had a script go through all of the "image samples" and add a comment showing the difference in filesize and resolution...? That way time could be saved by not going to the source for the manual checking.

I think that would be a good idea. I'm still a hardcore believer in the fact that the original should be uploaded anyway, though. You could do a more in-depth analysis to see if there are more differences than what's visible to the naked eye.
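As a rough idea of what such a comment could contain, here is a minimal sketch, assuming the script has already fetched the dimensions and filesize of both the sample and the original. The `ImageInfo` type and the output wording are my own invention, not what any existing script produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record of the properties being compared."""
    width: int
    height: int
    filesize: int  # in bytes


def compare(sample: ImageInfo, original: ImageInfo) -> str:
    """Summarise the resolution and filesize difference for a comment."""
    same_dims = (sample.width, sample.height) == (original.width, original.height)
    if same_dims:
        dims = "same dimensions"
    else:
        dims = (f"{sample.width}x{sample.height} -> "
                f"{original.width}x{original.height}")
    # Positive means the original is larger than the sample.
    byte_diff = original.filesize - sample.filesize
    return f"{dims}, filesize difference {byte_diff:+d} bytes"
```

For the "same dimensions, off by a few bytes" cases mentioned above, `compare(ImageInfo(1200, 800, 100000), ImageInfo(1200, 800, 100012))` would give `same dimensions, filesize difference +12 bytes`.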

Hmm, I like it... Though it links to the sampled image and not the original, is that intended? Because I prefer it with :orig suffixed, so users get less confused looking at it.

In the meantime I guess I can go think up some ideas on how to observe sample/md5 mismatch differences with Twitter...

EDIT: I'm exploring how post #2602087 is different from its sample right now with what tools exist for this kind of purpose, but I'm sort of new to image comparing. Probably going to finally start using ExifTool and ImageMagick too...

Mikaeri said:

Hmm, I like it... Though it links to the sampled image and not the original, is that intended? Because I prefer it with :orig suffixed, so users get less confused looking at it.

In the meantime I guess I can go think up some ideas on how to observe sample/md5 mismatch differences with Twitter...

EDIT: I'm exploring how post #2602087 is different from its sample right now with what tools exist for this kind of purpose, but I'm sort of new to image comparing. Probably going to finally start using ExifTool and ImageMagick too...

Performing a diff between the original and its sample gave me this image: https://i.imgur.com/R7JVhUG.jpg, red indicates where there was change

Full album is here. https://imgur.com/a/ubPcE
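For anyone wanting to try a similar diff, here is a minimal sketch of the idea in plain Python. It is not the tool that produced the images above; it assumes both images are already decoded into rows of (R, G, B) tuples with the same dimensions, and the function names and the `tolerance` parameter are made up for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
RED: Pixel = (255, 0, 0)


def _check_same_shape(a: Image, b: Image) -> None:
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")


def _changed(pa: Pixel, pb: Pixel, tolerance: int) -> bool:
    # A pixel counts as changed if any channel differs by more than tolerance.
    return any(abs(ca - cb) > tolerance for ca, cb in zip(pa, pb))


def diff_mask(a: Image, b: Image, tolerance: int = 0) -> Image:
    """Copy of image a with every changed pixel painted red."""
    _check_same_shape(a, b)
    return [
        [RED if _changed(pa, pb, tolerance) else pa for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


def changed_fraction(a: Image, b: Image, tolerance: int = 0) -> float:
    """Fraction of pixels that differ between a and b (0.0 for empty images)."""
    _check_same_shape(a, b)
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    changed = sum(
        _changed(pa, pb, tolerance)
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
    )
    return changed / total
```

Raising `tolerance` hides the tiniest per-channel differences, which helps separate "re-encoded" from "visibly different".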

I'm not really familiar with Twitter, but I do know that it does some resampling of the image. However, if the :orig is the first round of resampling, is every other image size (large,medium,small,thumb) a resample of that resample...?

If that's the case, then that would make a stronger argument for replacing samples with the original, even if they are the same dimensions.
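The size suffixes mentioned here can be normalised with a small helper. This is only a sketch: it assumes the size is appended to the image URL as `:name`, using the five names listed above, and the example URL in the note below is illustrative rather than a real post.

```python
SIZE_SUFFIXES = ("orig", "large", "medium", "small", "thumb")


def to_orig(url: str) -> str:
    """Return the image URL with its size suffix set to :orig.

    Hypothetical helper: swaps a known trailing size suffix for :orig,
    or appends :orig when the URL has no size suffix at all.
    """
    base, sep, tail = url.rpartition(":")
    if sep and tail in SIZE_SUFFIXES:
        return f"{base}:orig"
    return f"{url}:orig"
```

So `to_orig("https://pbs.twimg.com/media/abc.jpg:large")` and `to_orig("https://pbs.twimg.com/media/abc.jpg")` would both give the `:orig` form of the same URL.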

That's what I think. They do a resampling of the image from :orig to large even if the dimensions remain the same, interestingly enough.

So yeah, it makes a very strong argument for replacing twitter samples that are the same dimensions as their original, even if the differences aren't visible to the naked eye. I haven't tried with png yet, though.

EDIT: And for the record, I've settled that it is a fact that samples are always worse than their originals. So I have to disagree with you, sweetpea.

BrokenEagle98 said:

It also works from the image url... as long as you're using the bookmarklet.
If you go directly to the uploads page and plug in the URL, then it will not work.

Hmm, even when you're on the non-prefixed URL link? That's interesting, never knew. I don't usually do that since I don't like having to continuously type :orig at the end, but it could be useful...

Oh, right, I'm going to go ahead and link this thread in image sample if that's fine.

BrokenEagle98 said:

It also works from the image url... as long as you're using the bookmarklet.
If you go directly to the uploads page and plug in the URL, then it will not work.

This is wrong, the uploads page also draws the :orig file automatically. I know because I do everything via the uploads page and not the bookmarklet.

And Mikaeri, the 'sample' may be "worse", but as you said, it isn't visible to the naked eye so it only matters for the purposes of completion and has no bearing for the average user who visits the site. That said, both the 'sample' and 'original' tend to be shit-all in quality anyways due to twitter's compression (Especially their reds).

My opinion is that if it doesn't benefit users or uploaders in some way, it's unnecessary. I won't fight your efforts because there is precedent for deleting pixiv and imageboard samples. However, I support those because the samples are clearly a smaller resolution. I oppose twitter sample deletions only for images that are identical to the 'original' except in microscopic compression differences and/or md5 hashes, which I find overly and exceptionally pedantic, an opinion I have stated previously in regard to the topic on lossless twitter compression.
