Danbooru

Image Sample Cleanup Project

Fred1515 said:

I assume Qpax omitted them, though I can't fathom why. At least, I think it was Qpax, since his uploads had their favorites/scores migrated within the first minute and the samples were immediately deleted, so I assume he was moving them himself.

My guess is he forgot?

In any case, we sure had a huge influx of Western-style art yesterday.

CodeKyuubi said:

My guess is he forgot?

It was a mistake on my end. At some point yesterday I was reuploading those samples on autopilot, and I'm sure that for a couple of posts like that I forgot to click the button that moves the favorites. It happened because of my absentmindedness, so I am sorry about that.

I also started to take care of the deviantArt samples yesterday, but at some point I got tired and went to sleep, because I had been reuploading samples for more than 12 hours while taking care of other things in real life. Anyway, I hope the technical guys take care of the current issues, so I'll hold off and wait for good news about the replacement of samples in the near future.

Keep up the good work, mates o/

Qpax said:

It was a mistake on my end. At some point yesterday I was reuploading those samples on autopilot, and I'm sure that for a couple of posts like that I forgot to click the button that moves the favorites. It happened because of my absentmindedness, so I am sorry about that.

I also started to take care of the deviantArt samples yesterday, but at some point I got tired and went to sleep, because I had been reuploading samples for more than 12 hours while taking care of other things in real life. Anyway, I hope the technical guys take care of the current issues, so I'll hold off and wait for good news about the replacement of samples in the near future.

Keep up the good work, mates o/

Quite a hero, since you are the most active Approver right now. I can only repeat what I wrote to you some hours ago: great work.

Type-kun said:

OK, guys, you should stop flagging samples for now. If issue #3015 gets merged it'll be possible to replace the image file without creating a new post.

I already commented on the above issue, but basically, since we're dealing with the deletion of files, IMO we should tread more carefully with the merging of that issue. I believe it would be wise to put the image replace function on a test site for a while, and let the users on this site work out most of the kinks before the commits go live.

Thoughts?

I have it running on http://feat-replace-images.devbooru.ml. login: albert / password: password. Feel free to test there.

Quoting myself from issue #3015:

Deleting the old files isn't strictly necessary. Actually I don't think it would hurt to keep them, beyond leaving some loose files on the server and wasting a little disk space. I don't know how albert feels about that though, so I set it to delete the old files in a delayed job three days later. The grace period could be extended if we want.

If there's a mistake, the mod action has a link to the old file still on the server, so it's possible to redownload it before it's deleted. It should also be possible to cancel the delayed job to prevent the deletion, although that would need albert's intervention. It would also be possible to have the delayed job list which files it's going to delete, although I didn't do this.

Did some testing on the image replace function and I have some initial feedback.

1. No automatic removal of bad tags

Could lead to conditions where multiple people replace the same image over and over and over...

2 Concurrency concerns

When someone clicks "Replace Image", does the system lock down that function for that image so that it doesn't lead to race conditions?

3 Cooldown period

Related to the two above, but there should be a cooldown period before an image can be replaced again.
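To make points 2 and 3 concrete, here is a minimal Python sketch of a guard that rejects a replacement while another one is in progress for the same post, or while a cooldown is still running. It is only an illustration: `ReplacementGuard`, `try_begin`, and `finish` are hypothetical names, the state lives in memory inside one process, and none of this is Danbooru's actual code. A real multi-process site would need something like a database row lock and a stored timestamp instead.

```python
import threading
import time


class ReplacementGuard:
    """In-memory guard: one replacement per post at a time, plus a cooldown."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock                # injectable so the guard can be tested
        self._mutex = threading.Lock()     # protects the two containers below
        self._in_progress = set()          # post IDs currently being replaced
        self._last_replaced = {}           # post ID -> time of last successful replacement

    def try_begin(self, post_id):
        """Return True if the caller may start replacing this post's image."""
        with self._mutex:
            if post_id in self._in_progress:
                return False               # someone else is already replacing it
            last = self._last_replaced.get(post_id)
            if last is not None and self._clock() - last < self._cooldown:
                return False               # still inside the cooldown window
            self._in_progress.add(post_id)
            return True

    def finish(self, post_id, succeeded):
        """Release the post; a successful replacement starts the cooldown."""
        with self._mutex:
            self._in_progress.discard(post_id)
            if succeeded:
                self._last_replaced[post_id] = self._clock()
```

A caller would wrap the replacement in `try_begin(post_id)` and `finish(post_id, succeeded=...)`, and show an error to the user whenever `try_begin` returns False.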

4 Replacement of source

If using an image link, it replaces the source with that link, even if the source was the post source and therefore the correct source to use. This will lead to manual fixing of the source every time, or people will forget and leave the substandard source in place.

Ex: http://feat-replace-images.devbooru.ml/posts/2498862

I wasn't able to confirm with the above since you don't have the archive service running, but does replacing an image create a new post version when the source is changed?

5 No confirmation stage

The fact that the replacement is a one-stage step makes me a bit leery. There should be a confirmation step where the site pulls in the source from the supplied URL and shows the standard image information, including thumbnail, dimensions, MD5 hash, and exact filesize. No KB or MB please, though, as images can differ by only a few bytes, which can get lost in the conversion to KB and MB.

If the user can confirm that everything is correct, then they can submit the changes to go forward.
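As a rough illustration of why the exact byte count and the MD5 matter, the Python sketch below computes both for a file's contents and compares them with a rounded KB figure. The helper names `file_info` and `rounded_kb` are made up for this example, and the two byte strings are stand-ins for real image files, not data from the site.

```python
import hashlib


def file_info(data):
    """Return the exact size in bytes and the MD5 hex digest of a file's contents."""
    return {"size_bytes": len(data), "md5": hashlib.md5(data).hexdigest()}


def rounded_kb(size_bytes):
    """Format a size the lossy way, rounded to one decimal place in KB."""
    return f"{size_bytes / 1024:.1f} KB"


if __name__ == "__main__":
    # Two stand-in "files" that differ by only three bytes.
    old = b"\x00" * 204_800
    new = b"\x00" * 204_803
    for label, data in (("old", old), ("new", new)):
        info = file_info(data)
        print(label, info["size_bytes"], rounded_kb(info["size_bytes"]), info["md5"])
```

Both stand-in files display the same rounded KB value, while the byte counts and MD5 hashes differ, which is the distinction a confirmation page would need to expose.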

6 No batch upload selection

For when post links are used for the replace image, there is no option to choose which image on the post will be used. This affects sites like Twitter and Pixiv.

Example:

http://feat-replace-images.devbooru.ml/posts/2497818

The above pulled the first image available, even though it might be desired to upload the second.

7 No upload file

Certain sources like Tinami do not allow image retrieval without a few modifications to the HTTP request, such as a Referer header or a cookie. For these sources, I've always had to download the file to disk, then upload the file and set the source when uploading.

8 No IQDB confirmation

There is no check to see if the image being uploaded is similar at all to the image being replaced.

Examples:

http://feat-replace-images.devbooru.ml/posts/2498862
http://feat-replace-images.devbooru.ml/posts/2497818

9 Delay Job deletion

What about cases where an image is mistakenly replaced? In these cases, the original file under Mod Actions can be used to change the file back. However, what about the delayed job that was created? Will it still delete the original file? Will the replaced file also be deleted? This could lead to posts that have broken images.

Also, the information given by the delayed job page isn't very informative. I can't tell what the jobs are going to do... only that they are going to "delete old files", plus the post # they are going to delete from.

Example:

http://feat-replace-images.devbooru.ml/posts/2497911

10 [Feature Request] Order mechanism to find these posts

This is mostly related to my and RaisingK's scripts, but it would be helpful if it were possible to order posts whose image has been replaced by most recent replacement. The reason is that I currently perform three checks when an image is uploaded: at the five-minute mark, one-week mark, and one-month mark. Without a way to easily find these posts, they may have to wait until I do the full gamut, which I only do once every several months.

Final

The above were just some items off the top of my head. I'd like to do some more testing once I've had some additional time to think on this, as well as after some of the above issues get addressed.

1. No automatic removal of bad tags

I'm not a huge fan of embedding knowledge of particular tags directly in the code, as things like *_sample tags are subject to change. There are also tags like jpeg_artifacts that would still need to be handled manually.

In general, replacement can be used for other things besides replacing samples. Think of things like replacing scans, or forcing old thumbnails to regenerate.

2 Concurrency concerns

Well, I think it's safe, but it's pretty difficult to test to know for certain. Danbooru suffers from concurrency bugs practically everywhere, so to an extent it's par for the course. I don't think it's necessarily riskier than the normal upload process though. Duplicates will be detected if two people try to replace something with the same file.

3 Cooldown period

As it stands, there's no real way to track the history to see when something was last replaced beyond the mod actions log, which is not suitable for this. Albert brought up having a new model that would allow for tracking replacements.

4 Replacement of source

If using an image link, it replaces the source with that link, even if the source was the post source and therefore the correct source to use. This will lead to manual fixing of the source every time, or people will forget and leave the substandard source in place.

It should set the source to the same thing that it would be set to as if it were uploaded normally. We don't have a referer url in this context though, so it could be getting messed up by that.

I wasn't able to confirm with the above since you don't have the archive service running, but does replacing an image create a new post version when the source is changed?

It should create a version if the tags change (which can happen due to filetype/dimensions autotagging) or if the source changes. I suppose if the source didn't change (say the source is the html page which is the same for the sample and the full size) it won't create a post version.

5 No confirmation stage

The fact that the replacement is a one-stage step makes me a bit leery. There should be a confirmation step where the site pulls in the source from the supplied URL and shows the standard image information, including thumbnail, dimensions, MD5 hash, and exact filesize. No KB or MB please, though, as images can differ by only a few bytes, which can get lost in the conversion to KB and MB.

The upload page can do some of this re: getting the filesize and thumbnail, but it's intertwined heavily with the upload page. Basically it needs to be factored out and moved into /source.json?url=... so that it will be available in this context. I was thinking that a preview/comparison could either be included in the dialog box, or perhaps in the "Fetch source data" box.

6 No batch upload selection

It does the same thing that the regular upload page does, which is to take the first image in the gallery when you give it a gallery source. This should perhaps raise an error instead, as this is error-prone in normal uploading too. The "Fetch source data" warns you when something is a gallery, but I think people don't always pay attention to that.

7 No upload file

Certain sources like Tinami do not allow image retrieval without a few modifications to the HTTP request, such as a Referer header or a cookie. For these sources, I've always had to download the file to disk, then upload the file and set the source when uploading.

I avoided dealing with file uploading for now to keep the first pass at this simple. The problem with Tinami specifically is that I think we only spoof the Referer for Pixiv as a special case, but really we should spoof it by default for everything. I haven't tried that but I think that would solve hotlinking problems for all sites.
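For illustration only, here is a minimal Python sketch of what "spoof the Referer by default" could look like. It is not the site's actual downloader: `build_image_request` is a hypothetical helper, and falling back to the image host's root as the Referer when no page URL is known is an assumption, not confirmed behaviour.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_image_request(image_url, page_url=None):
    """Build a request for an image with a Referer header always set.

    If the page the image came from is known, use it as the Referer.
    Otherwise fall back to the root of the image's own host (an assumption).
    """
    referer = page_url
    if referer is None:
        parts = urlsplit(image_url)
        referer = f"{parts.scheme}://{parts.netloc}/"
    return Request(image_url, headers={"Referer": referer})


if __name__ == "__main__":
    # Placeholder URL, not a real image location.
    req = build_image_request("https://img.example.com/works/123.jpg")
    print(req.full_url, req.get_header("Referer"))
```

The request object can then be passed to `urllib.request.urlopen`. Whether a given site accepts its own root as a Referer would still have to be tested per site.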

8 No IQDB confirmation

There is no check to see if the image being uploaded is similar at all to the image being replaced.

Are we talking about preventing mistakes or preventing malicious use? If it's the former, then I think showing the IQDB hits should be sufficient. If it's the latter, and a replacement has to have some minimum similarity score, then you get into problems with things being rejected when they shouldn't. Think of replacing a scan that was flipped horizontally for some reason. I don't think this would be detected by IQDB.

9 Delay Job deletion

What about cases where an image is mistakenly replaced? In these cases, the original file under Mod Actions can be used to change the file back. However, what about the delayed job that was created? Will it still delete the original file? Will the replaced file also be deleted? This could lead to posts that have broken images.

Also, the information given by the delayed job page isn't very informative. I can't tell what the jobs are going to do... only that they are going to "delete old files", plus the post # they are going to delete from.

Hmm, it would still delete the original file. The delayed job would need to be canceled, which would be possible to do but not yet implemented.

Jobs have more info that could be displayed on the /delayed_jobs page, but currently that info is only shown to admins. I think albert wanted to be careful not to display too much in case something is sensitive. So we would need to look through all the job types and decide what's safe to expose.

10 [Feature Request] Order mechanism to find these posts

This is mostly related to my and RaisingK's scripts, but it would be helpful if it were possible to order posts whose image has been replaced by most recent replacement. The reason is that I currently perform three checks when an image is uploaded: at the five-minute mark, one-week mark, and one-month mark. Without a way to easily find these posts, they may have to wait until I do the full gamut, which I only do once every several months.

Albert brought up having a model for tracking purposes, which should allow for this. As it currently stands, you would have to check /uploads.json or /mod_actions.json to monitor replacements.

evazion said on #8:

Are we talking about preventing mistakes or preventing malicious use? If it's the former, then I think showing the IQDB hits should be sufficient. If it's the latter, and a replacement has to have some minimum similarity score, then you get into problems with things being rejected when they shouldn't. Think of replacing a scan that was flipped horizontally for some reason. I don't think this would be detected by IQDB.

Hmmm... I was thinking the image replacement function was mostly for image samples, and such instances should have a high similarity score. For anything else, you should upload the new source and set the parent/child relationship.

On #10:

As it currently stands, you would have to check /uploads.json or /mod_actions.json to monitor replacements.

Unfortunately, those sources would only be good for three days... :/ I could however store these in a file so that I can reattack them at a later point.
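A rough Python sketch of the "store these in a file" idea, for anyone curious how small it could be. Everything here is an assumption made for illustration: the JSON-lines file layout, the function names `record_replacement` and `due_checks`, and treating "one month" as 30 days. It does not talk to the site at all. It only remembers when a replacement was seen and reports which of the three check marks (five minutes, one week, one month) have passed.

```python
import json
from datetime import datetime, timedelta

# The three check marks; "one month" is approximated as 30 days (assumption).
CHECK_OFFSETS = [timedelta(minutes=5), timedelta(weeks=1), timedelta(days=30)]


def record_replacement(path, post_id, replaced_at):
    """Append one replacement to a JSON-lines file so it outlives the three-day window."""
    entry = {"post_id": post_id, "replaced_at": replaced_at.isoformat()}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def due_checks(path, now):
    """Return (post_id, check_index) pairs whose scheduled time has passed.

    This sketch does not remember which checks were already performed;
    a real script would also have to store that.
    """
    due = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            replaced_at = datetime.fromisoformat(entry["replaced_at"])
            for index, offset in enumerate(CHECK_OFFSETS):
                if now >= replaced_at + offset:
                    due.append((entry["post_id"], index))
    return due
```

The timestamps passed to both functions should be consistently timezone-aware or consistently naive, since Python refuses to compare the two kinds.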

On #6:

It does the same thing that the regular upload page does, which is to take the first image in the gallery when you give it a gallery source.

Sorry, I was referring to the functionality of the Batch bookmarklet, i.e. /uploads/batch. If this functionality was available, then images other than the first image could be replaced without needing to use the direct image link and then having to go in and fix the source afterwards. (See #4 from my previous post).

-------------------------------------------------------

After thinking about it, I agree with your stance on #1. If I'm able to figure out a good way to handle these image replacements, then my script will automatically remove the bad tags when it processes them. This is already done a lot by those that have been uploading the replacement images, as they sometimes forget to remove the bad tags when copying them over.

Maybe for #9, just don't delete files at all. Admins could perhaps be given a way to locate and delete these orphan images on demand.

No worries about #7. I know this will probably be a multi-stage release.

I have no additional comments on the rest. I just wanted to throw these all out so that they're on the table.

BrokenEagle98 said:

Yeah, I don't know... I just check both sources manually. There is a difference, even though the filesizes are the same.

Hm, maybe we'll finally be able to use the image replace function on posts like that. AFAIK Pawoo posts can't change, so it might've been from a different deleted Toot originally.

Had some extra time recently, and I've made some minor progress with Ehentai... post #2531642 was my first successful tag. Though I still don't know if my script will trigger the spam block IP ban as the above was tagged pretty early. For reference, I currently have the checks set to a 2 minute spread, and I'm hoping that'll avoid any detection issues.

Also, I'd like to hold off on wiki/implication creation until I can successfully scan enough posts...

Edit

*le sigh*... even with a 2 minute spread, I still get an IP ban for... 24 HOURS!!!
