Danbooru

Image Sample Cleanup Project


CodeKyuubi said:

This is wrong, the uploads page also draws the :orig file automatically. I know because I do everything via the uploads page and not the bookmarklet.

Disregard, see forum #125941
Did not know this, and never tried it since I did not want to accidentally upload the lesser quality image. However, having said that I wanted to test the above, so I went to Testbooru and uploaded straight from the uploads page...

1. Without URL addons (i.e. no :orig)

2. With a size addon (:medium)

So it looks like if the size modifier is in the URL input, then it uploads that size instead of the :orig. Also, I do recall seeing a lot of posts that got tagged with Twitter sample that had a :large or :medium modifier in the source URL.

Not sure what Danbooru uses to test and then add the :orig modifier, but the following regex is what I use to test for valid Twitter URLs and retrieve only the URL portion without the size modifier.

(https?://pbs\.twimg\.com/media/[^.]+\.(?:png|gif|jpg))(?::(?:orig|large|medium|small))?
Edit:

Disregarded the above with new findings in forum #125941.
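For illustration, here is a minimal Python sketch of how the regex quoted above could be used to drop a size modifier and request the :orig file instead. This is not Danbooru's actual logic, which is unknown here. The helper name `to_orig` and the example filename are made up for the sketch.

```python
import re

# Pattern quoted from the post above: group 1 is the bare image URL,
# and the optional tail consumes a size modifier if one is present.
TWITTER_MEDIA = re.compile(
    r"(https?://pbs\.twimg\.com/media/[^.]+\.(?:png|gif|jpg))"
    r"(?::(?:orig|large|medium|small))?"
)


def to_orig(url):
    """Return the :orig form of a Twitter media URL, or None if it does not match.

    Hypothetical helper for illustration only.
    """
    match = TWITTER_MEDIA.fullmatch(url)
    if match is None:
        return None
    return match.group(1) + ":orig"


# Hypothetical filename, used only to exercise the pattern.
print(to_orig("https://pbs.twimg.com/media/ABC123.jpg:medium"))
```

Using `fullmatch` means anything other than a bare media URL with an optional known size modifier is rejected rather than partially matched.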


CodeKyuubi said:

And Mikaeri, the 'sample' may be "worse", but as you said, it isn't visible to the naked eye so it only matters for the purposes of completion and has no bearing for the average user who visits the site. That said, both the 'sample' and 'original' tend to be shit-all in quality anyways due to twitter's compression (Especially their reds).

My opinion is if it doesn't benefit users or uploaders in some way, it's unnecessary. I won't fight your efforts because there is precedent for deleting pixiv and imageboard samples. However, I support those because the samples are clearly a smaller resolution. I oppose twitter sample deletions only of images that are identical to the 'original' except in microscopic compression differences and/or md5 hashes, which I find overly and exceptionally pedantic, an opinion I have made previously in regard to the topic of lossless twitter compression.

As much as I understand your point and want to agree, I still disagree with you. But I do appreciate you playing the other side.

The thing is though, in regards to pixiv samples, even pixiv samples can be the same resolution as their originals, and exhibit the same 'microscopic compression differences and/or md5 hashes' you speak of. Take a look at post #2592839 for example. When you take this image (hotlinked, won't work but still) and diff it with the original, you get this. Even if there are no visible differences to the naked eye, it still matters that we preserve the original and the original first. Why would we want resampled images in the first place?

So this isn't a comparison of apples and oranges. Twitter will always compress the image it's given, but we should always keep images that are as close as possible to what the author/artist uploaded, regardless of the 'microscopic compression differences and/or md5 hashes'. Neither should this be an argument against pedantry or OCDness. If it's overly and exceptionally pedantic for us to do this, why don't users appeal all the image samples that have the same resolution as their original "because the average user can't see the difference?" Sounds a bit ludicrous to me, especially taking into consideration that if you ask any user which image they'd like to keep, obviously it's the original image if they know it's the original.

Moreover, your argument is against twitter sample deletions on the premise of if they should be flagged or not. The real question is, however, "should Twitter samples be reapproved?" and my answer to that, and probably of the other janitors' answers to that, is a resounding "no."

And maybe you're more concerned about the health of users' uploads, but in hindsight, uploaders shouldn't fear their deletion count going up because old samples get deleted. If they've worked hard to earn the contributor status, then they'll keep it anyway. That shouldn't be a deciding factor in their status unless they repeatedly upload samples after the first couple or so flags. I feel like you might be trying to simplify one problem to another, but the fact of the matter is that that isn't the case here.

My opinion is if it doesn't benefit users or uploaders in some way, it's unnecessary.

And my opinion is that it does. Always. There is always a user out there who will see benefit in having the original rather than a resample.


I'm not arguing to reapprove deleted images, just against the practice of reuploading identical images and deleting the older one, regardless of if it's pixiv or twitter. In my opinion, Danbooru is a site by the users, for the users (in majority), and I struggle to find how perfect md5s between two images help the vast majority of users in their day-to-day of favoriting and downloading images.

I don't know how to code. This is just my opinion, but I think that md5s are only going to have some use to the handful of people on this website that can code and, for whatever reason, need the md5 on a specific image to match instead of using a batch. I think if an image is being uploaded for the first time, it only makes sense to get the original given the choice, and to upload better quality versions when given (twitter vs pixiv), but I also think it's anti-user (in majority) to delete an identical image on the basis of the md5. But that's just me.

I'd be open to you giving me examples of how regular users (in majority) can benefit from the original md5, and what purposes more code-capable users would use the md5s for.

Edit: Your argument on originals can easily be extended to the deletion of twitter images once pixiv versions are uploaded, as the pixiv version is the closest to the true original the artist intended (the absolute true original being their psd file).

Edit2: Regardless, I don't intend to fight you on this, I'm just setting out my point and disagreement clearly. I'm only here to give the users the best that I can in my limited time, and arguing in circular fashion on the forum isn't conducive to that, so this'll be my last post on the matter.


CodeKyuubi said:

I'm not arguing to reapprove deleted images, just against the practice of reuploading identical images and deleting the older one, regardless of if it's pixiv or twitter. In my opinion, Danbooru is a site by the users, for the users (in majority), and I struggle to find how perfect md5s between two images help the vast majority of users in their day-to-day of favoriting and downloading images.

I don't know how to code. This is just my opinion, but I think that md5s are only going to have some use to the handful of people on this website that can code and, for whatever reason, need the md5 on a specific image to match instead of using a batch. I think if an image is being uploaded for the first time, it only makes sense to get the original given the choice, and to upload better quality versions when given (twitter vs pixiv), but I also think it's anti-user (in majority) to delete an identical image on the basis of the md5. But that's just me.

I'd be open to you giving me examples of how regular users (in majority) can benefit from the original md5, and what purposes more code-capable users would use the md5s for.

To put it in layman's terms, an md5 is a quick way of telling if one image is the exact same as another -- aka a way of finding duplicates. The md5 algorithm produces a hash of a file, which basically says "if you see the same hash on any other file, it's the exact same file as me." It's kind of like how you can tell whether you've downloaded the original of an application installer by checking it against the hash that many sites list.
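As a rough sketch of that idea, here is how two files could be compared by md5 using Python's standard `hashlib` module. The helper names `md5_hex` and `same_file` are made up for this example, and the byte strings stand in for real image data.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the md5 hash of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def same_file(a: bytes, b: bytes) -> bool:
    """Treat two byte strings as the same file when their md5 hashes match."""
    return md5_hex(a) == md5_hex(b)


# Placeholder byte strings standing in for image files (assumption).
original = b"pretend these are the bytes of the original image"
resample = b"pretend these are the bytes of a recompressed sample"

print(same_file(original, original))  # identical bytes, identical hash
print(same_file(original, resample))  # any change in the bytes changes the hash
```

Note that the hash only says whether the bytes are identical. It says nothing about how different two images look, which is why a visually identical sample still gets a different md5.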

See, the thing about md5s is that with a sampled image (which will inherently carry a different md5 than the original), you will always know they're worse. You'll know that there is some mild artifacting somewhere in the image that makes it not worth keeping. Most casual users won't notice -- you're right on this part. But some will -- and you have to account for them too. Perhaps they're really into digital art or photoshop and it's their prerogative to look high and low for the best images that they can work on and/or improve.

Image samples shouldn't be confused for duplicates, as sweetpea might have mentioned before. A twitter original (not sampled) upload with potential artifacts compared to its superior pixiv upload is not grounds for flagging, and I would dispute and appeal that if someone were to flag an image as such -- as would many others. But with a sample, on the other hand, we know for a fact there is ALWAYS a better image to upload.

As for how regular users can find this useful, let's take this as an example: you're a user that wants to make a wallpaper out of a relatively lowres image for your desktop (1920x1080, 1920x1200, etc.). You're going to want the best image you can get your hands on and maybe pipe it through waifu2x. The less artifacting there is in the original image, the higher the quality of the image you're going to get out of such a tool. Artifacts expand when you upscale an image, that's just a fact. We call those resizing artifacts, and it's usually just visible pixelation. But if anything is already blocky or exhibits mosquito noise, then it's only going to get worse as you upscale such an image.

tapnek said:

I'm not sure if this is the right thread, but shouldn't we relegate these md5 mismatch notice comments to another account? DanbooruBot doesn't seem to see much use.

I was just following RaisingK's lead, as he does all of his Pixiv commenting from his main account...

If desired, I could always create an alt account, though I'd have to wait a week before being able to comment...

Edit:

Just wondering though, is there an issue from doing the MD5 mismatch and image sample notice commenting from our main accounts...? Perhaps I'm just not seeing the issue...


I don't see an issue with it, but if users prefer comments like these all go under one account, then we could adopt the suggestion given we already have a bot named DanbooruBot. It would definitely look a lot more "official" at the very least, and that would also free up anyone who wants to look through your account comments without having to see all your script comments everywhere.

EDIT: Semi-relevant to topic #13649

BrokenEagle98 said:

I was just following RaisingK's lead, as he does all of his Pixiv commenting from his main account...

I've pondered switching that to my alt account before, for the sake of tidiness if nothing else, but inertia is a large factor, here, with so many comments in my name already before I finally considered an alt account. Being able to move ("recycle") comments from post to post used to be one reason for staying, but even with that bug/feature patched out, manual management is more convenient when they're under my main account, and needing to account in my code for all the old comments under the original account might be problematic.

tapnek said:

It's just a bot account. What's wrong with sharing its API key with only a few people?

A shared account sounds too problematic, and I like doing my own thing anyway.

Mikaeri summed up one of my concerns with this kind of hash checking work. I would go into RaisingK's comments for new stuff to upload but there's always the usual comment that was actually written by the guy and not some script doing the work. And then there's the deletion of comments which can leave some pages with only just one comment but I think that's for another thread.

I have a question.

One of the artists I follow (barbariank) often posts his works on 4chan, then uploads them to his tumblr. Tumblr downscales them to 1280, so I upload 4chan versions (which are full-sized) and add Tumblr posts as sources. Now many of those posts have been marked upscaled, while in fact their listed source is downscaled.

I'm going to remove resized/upscaled from them anyway, but what should I do to prevent further confusion? Should I not source them at all?

If you ask me, I would just link them to the direct URL to the image on 4chan where they originally came from, or an archival site that keeps track of that stuff. I know it can be understandably difficult now given 4chan images have an expiry date, but aside from that, you could just put 4chan by itself in the source I suppose. Then you could just comment the tumblr link if you'd like.

Consider tagging bad id if your link is correct, but I would regard it as optional in cases like these where you're not all too sure.

Mikaeri said:

I think that's exactly what he means though, assuming you're talking about entries that match md5_mismatch downscaled. Samples shouldn't be tagged downscaled even if they technically are "downscales" of the original because such an image exists on that site already (which is why, again, we call them samples). Downscales on the other hand, we may not know if they are samples from a previous version of the source site or if it's user-resized. They're typically not safe to keep if the user has provided a source and we know for a fact that it isn't uploaded correctly. Hopefully this isn't incentive to start uploading without sources, as that would be horrible. I've clarified that in help:image source for posterity.

What exactly does the downscaled tag legitimately apply to? If it's not a sample, then how do you know it's downscaled? Under what conditions would you know something is a downscale but not have the original available to upload instead? The only thing that comes to mind is that Sombra image where the full size didn't fit in the site's filesize limit (post #2561360), and that didn't even have the tag (I just added it). Edit: Never mind, it's not downscaled, just recompressed. Removed the tag again.

Mikaeri said:

sweetpeɐ said:

... The higher res sometimes in fact is worse than the sample, it should always be up to human judgement lest grainy artifacty images bleed through.

... I'm not really sure how to feel about that -- I've only ever seen it once (post #2558234), and typically if artifacts are present in an original then they'll be there in a sample too, just to a less visible degree. Anyways I feel like there should be some clarification on the "The higher res sometimes in fact is worse than the sample" statement because it's not quite true unless the original artist rescaled it themselves or pumped their work through waifu2x, which would be beyond me why they would do that... but who knows. Going to ping @☆♪ to clarify this, hopefully he doesn't mind.

A sample is strictly and by definition inferior to the original: you can recreate a sample from the original at any time, but not the other way around. And samples are almost always lossily compressed, meaning they have additional artifacts even if they're the same size.
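To make the one-way nature of sampling concrete, here is a toy Python sketch on a tiny grayscale grid. It assumes a simple 2x2 box average for the downscale and a nearest-neighbor upscale, which is not necessarily what any source site uses, and it ignores lossy compression entirely. The function names and pixel values are invented for the example.

```python
def downscale_2x(pixels):
    """Box-filter downscale: average each 2x2 block (integer division).

    Assumes the grid has even width and height.
    """
    return [
        [
            (pixels[y][x] + pixels[y][x + 1]
             + pixels[y + 1][x] + pixels[y + 1][x + 1]) // 4
            for x in range(0, len(pixels[0]), 2)
        ]
        for y in range(0, len(pixels), 2)
    ]


def upscale_2x_nearest(pixels):
    """Nearest-neighbor upscale: repeat every pixel into a 2x2 block."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Made-up 4x4 grayscale "original".
original = [
    [0, 255, 10, 20],
    [255, 0, 30, 40],
    [100, 100, 0, 0],
    [100, 100, 0, 255],
]

sample = downscale_2x(original)        # always reproducible from the original
restored = upscale_2x_nearest(sample)  # same size as the original again

print(sample)
print(restored == original)
```

The sample can be regenerated from the original whenever it is needed, but scaling the sample back up only repeats the averaged values, so the detail inside each 2x2 block is gone.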

I don't even agree with your example (post #2558234). It's true that the full size is uncomfortably aliased, but by no means is it worse than the small deleted child. If you scale the large image down to the small one's size, it looks just as good or better. Someone said it looked like a nearest neighbor upscale, but here's what that actually looks like. If you scale both images to a size in between the two and compare, it's clear that the larger one is sharper. In fact, that's really all it is. Rather than being scaled by the artist, I bet it was just exported without (much) anti-aliasing. It's really based on preference to some extent, and on how you're viewing it. On a high-res display (meaning dense pixels, not just a lot of them) at a reasonable distance, the original actually looks perfectly fine. Plenty of artists post images with noticeably aliased edges, like nogi takayoshi.

(And I definitely don't mind being pinged. It makes me feel important! :P If I hadn't been so busy this week I'd have probably replied to this thread anyway.)

Mikaeri said:

I'm still a hardcore believer in the fact that the original should be uploaded anyway, though. You could do a more in-depth analysis to see if there are more differences than what's visible to the naked eye.

I 100% agree with this. To address CodeKyuubi's arguments as well: There are so many reasons why something that looks the same to you might not to someone else. Obviously some people's eyes are better than others, but there's much more than that. People have different hardware, for example. A high quality monitor, especially if it's calibrated, can make clear as day things that are all but invisible on an average one. The ambient light in the room makes a huge difference too, much more than you'd probably expect if you've never explored that. You really have no way of knowing that the difference won't be noticeable to anyone. If you accept a sample instead of the original, you're throwing away irreplaceable information. If the source goes down later, which is really quite common, that information could be lost forever. At the risk of being dramatic, it's like burning a library.


Ah, I knew I could count on you :)

Anyways, thanks for providing useful information once again.

☆♪ said:

What exactly does the downscaled tag legitimately apply to? If it's not a sample, then how do you know it's downscaled? Under what conditions would you know something is a downscale but not have the original available to upload instead? The only thing that comes to mind is that Sombra image where the full size didn't fit in the site's filesize limit (post #2561360), and that didn't even have the tag (I just added it).

Quite honestly, I'm not sure either. It is a relatively new tag, but we haven't tagged resized on pixiv samples either, for example. My reasoning is that since a sample is a sample, it should be treated as if it were an image existing on the site itself, as it is very possible to have a direct URL link to such a sample on pixiv -- regardless of the source image. Clear uses of it are when an image is either downscaled from the source, or knowingly downscaled from a parent image by the artist on another website. It's a somewhat sketchy tag to use outside of those two use cases though.

Hmm, interesting note about nogi takayoshi. I've uploaded works from him before, and there is definitely some visible aliasing on his work, especially on the edges of hair. It's visible to a noticeable degree, but it doesn't detract too much from the image.

I 100% agree with this. To address CodeKyuubi's arguments as well: There are so many reasons why something that looks the same to you might not to someone else. Obviously some people's eyes are better than others, but there's much more than that. People have different hardware, for example. A high quality monitor, especially if it's calibrated, can make clear as day things that are all but invisible on an average one. The ambient light in the room makes a huge difference too, much more than you'd probably expect if you've never explored that. You really have no way of knowing that the difference won't be noticeable to anyone. If you accept a sample instead of the original, you're throwing away irreplaceable information. If the source goes down later, which is really quite common, that information could be lost forever. At the risk of being dramatic, it's like burning a library.

Heh, well maybe a better analogy would be that it's like having all the books in a library be rewritten by 4th graders and then the originals discarded. There are bound to be some mistakes in the "copied" source material, and once the original book is gone, we don't know to what degree such a "copy" accurately represents the original, regardless of how close it looks. It actually kind of reminds me of how many times the bible was translated and rewritten by hand, but that's probably an analogy to save for another day.

EDIT: Edited downscaled for more information. Hopefully the usage will be clearer from here on out.


Mikaeri said:

If you ask me, I would just link them to the direct URL to the image on 4chan where they originally came from, or an archival site that keeps track of that stuff. I know it can be understandably difficult now given 4chan images have an expiry date, but aside from that, you could just put 4chan by itself in the source I suppose. Then you could just comment the tumblr link if you'd like.

Consider tagging bad id if your link is correct, but I would regard it as optional in cases like these where you're not all too sure.

Is there a wiki with the above guidance, and if not, should there be...?

Also, for non-working links that do not have an indexed ID system like the sites I have already tackled, I was thinking of using an alternate tag, like bad link. Thoughts?

@Mikaeri: Sorry, I screwed up. The Sombra image is actually not downscaled, it's just recompressed. I could see an uploader downscaling in that situation, but I can't find any examples of that having happened.

BrokenEagle98 said:

Also, for non-working links that do not have an indexed ID system like the sites I have already tackled, I was thinking of using an alternate tag, like bad link. Thoughts?

I don't really see the need for a separate tag if the distinction is just in the way the source addresses things - semantically it's the same tag, no? If anything, bad_id could be renamed to something like source_gone.

Also, relating to worldendDominator's original question, I'm generally against rewriting sources altogether. The source should be where you got the image from, that's what it means, even if that source is no longer available. Linking to a source with a different version of the image is liable to cause confusion, as seen here. If you want to provide an alternate source for context, use a no-bump comment. Obviously not everyone may share my attitude on that, so what does everyone else think? It doesn't seem like there's a clear policy on the site about that; maybe we can decide on one.

Edit: So there are a couple of my old uploads that BrokenEagle98's script caught (guess I didn't know :orig was a thing at that time). I'm uploading the originals where appropriate. Am I then supposed to flag the samples?


☆♪ said:

So there are a couple of my old uploads that BrokenEagle98's script caught (guess I didn't know :orig was a thing at that time). I'm uploading the originals where appropriate. Am I then supposed to flag the samples?

Yes, that's the idea... see image sample for the full guidance.
