Danbooru

Image Sample Cleanup Project

Posted under General

I'm inclined to agree with the comment above... If they are the exact same image pixel-by-pixel then so be it, but then one has to wonder what makes up the rest of that file bulk. I've never tried it, but it seems a little counterintuitive to believe that the sampled image might actually be better than the original image in that regard. But even then, it's been settled for some time now that :orig is the most original image available, regardless of whether :large is actually better.

Maybe I'll test it later if I have time.

☆♪ said:

Does it really matter what site did the sampling? It's still an inferior downscale and ought to be replaced and deleted. Why make the distinction?

I think it does. Let's say for instance a user found an image and could not trace down a proper source so they uploaded the best image they could find using a reverse image search engine. That wouldn't be an image sample but would be downscaled.

Here's why it matters. *_image_sample refers to images sampled from specific sources, and we have documentation of how, to the best of our knowledge, these images are sampled, how to avoid using them, and how to replace them with the true source. Thumbnails aren't from such a source. To alias thumbnail to image sample would needlessly erase information about the images tagged as such and would muddy the meaning of image sample.

Sure, thumbnails should be flagged if they are poor quality or a better version of an image has been found, but let's not needlessly conflate terms. I disagree with the wiki change that thumbnails are 'a subset of image samples.' Can we agree that thumbnails, 'while similar to image samples, are in contrast downscaled images that do not come from a specific documented source'? They should still be flagged if the quality is poor or an appropriate alternative image is found.

Nudging @Mikaeri since I mentioned your wiki changes.

Hmm... I get the feeling the problem lies more with the existence of thumbnail rather than its usage. Where exactly would the tag thumbnail be fittingly used by most users? I can agree that they are downscaled images, but it's strange because another question is perhaps this: Under what circumstances would a thumbnail not be considered an image sample aside from having no source and a better parent? What would make it well-defined (and distinct from image samples) aside from either very low resolutions or very small image samples from existing sites such as nicoseiga or deviantart (or even pixiv's 150x150 thumbs when you're viewing an artist's works)?

post #98431 is an especially strange anomaly. At first glance it can definitely be considered a thumbnail (by the bare definition and it's lowres), but it's not an image sample because we don't know where or if it was resampled by some other site.

I'm not exactly sure what to do with the tag's existence. If we're being very loose with the definition, thumbnails are really just very small resolutions of high detail images, including very small image samples with such detail.

EDIT: And in a broader respect for what we're discussing right now, the reason why I included these two cases in image samples:

is because for the first, this basically explains the same ambiguity that an image could just be resized by the user and provided an unintentionally incorrect source (perhaps also by another user). We'll know this happens from pixiv when the source isn't correctly a direct image URL for pixiv and seiga, for example. The second is for images that have no clear source and are downscaled/lower quality. We can always reapprove these images in the case there is a mistake, but for now that's what it is.


Mikaeri said:

Though I do wonder, because I haven't been paying attention all too closely: are there users that misappropriate image sample and downscaled? That's a little bit concerning.

Mostly the latter, I think. I've seen it tagged on images that are just smaller than their parents, but are e.g. from a different source and not (third-party) downscaled. And then there's this tagging of WIPs as samples that was mentioned in this thread. It's probably not as bad as I make it sound, but it just worries me a little that something legitimate might get flagged and buried.

CodeKyuubi said:

Does that include post #1966693? The last time a conversation was started on twitter pngs, they decided twitter pngs, though visually lossless, were inferior in quality to conventional pngs. Following that train of thought, even if the resolution is the same and the filetype is lossless, md5 not matching is md5 not matching, and isn't the one uploaded by the artist and therefore sampled and inferior.

I compared the actual pixel values for that image and the original from Twitter, and they're identical. There's no difference in metadata either; the only difference is that Twitter has recompressed it, losslessly, at a lower compression level. (Why they do that, I have no idea.) That is not the same as sampling. Although it would have been nicer to have the original, nothing would actually be lost here if the source disappeared. In other words, I don't worship the file the artist created, but the image. Preferring the artist's original file is a good general rule, because that makes things a lot simpler - it's a much easier distinction to make, and second-hand copies are lossy more often than they're not. Even with PNG being a lossless format, a sample could still have been altered before encoding. But in this particular case, after determining that the information that really mattered was there, I guess I figured it wasn't worth it to mess up favorites or whatever. All that said, if someone does want to upload the original, that post would be fair game for flagging, and I wouldn't have a problem with that. Maybe I'll even do it later, now that I've spent so much time on this anyway...

Mikaeri said:

I'm inclined to agree with the comment above... If they are the exact same image pixel-by-pixel then so be it, but then one has to wonder what makes up the rest of that file bulk. I've never tried it, but it seems a little counterintuitive to believe that the sampled image might actually be better than the original image in that regard. But even then, it's been settled for some time now that :orig is the most original image available, regardless of whether :large is actually better.

Maybe I'll test it later if I have time.

It's not so much a matter of "rest of that file bulk", as in the larger image is the smaller one plus something. It's two different ways of storing the same thing, that take up different amounts of space. When you compress a PNG, you can spend more time in order to get better compression, resulting in a smaller file, or you can do it quickly and get a bigger file, because it isn't compressed as much. Twitter opted for the latter. But when you decompress the two files, you get exactly the same output. It's like if you put one file in a .zip and one in a .7z - those archive files will be different, but the file inside will be exactly the same when you take it back out of either one. You can even try a program like optipng, which spends a long time trying to find good parameters to compress a png to a smaller filesize - if you run it on each of the two files, it will make them identical (shrinking them both).
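To make the "two ways of storing the same thing" point concrete, here is a minimal sketch using Python's zlib module, which implements the same DEFLATE compression that PNG uses for its image data. The byte string is only a stand-in for pixel data and the two levels are arbitrary picks for illustration, not what Twitter actually uses.

```python
import hashlib
import zlib

# Stand-in for raw pixel data (an assumption; any bytes work here).
pixels = bytes(range(256)) * 64 + b"\x00" * 4096

# Same input, two compression efforts: quick versus thorough.
fast = zlib.compress(pixels, 1)
slow = zlib.compress(pixels, 9)


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# The stored bytes differ, so the checksums differ...
print("compressed sizes:", len(fast), len(slow))
print("md5 of stored bytes match:", md5_hex(fast) == md5_hex(slow))

# ...but decompressing either one gives back exactly the same data.
same_output = zlib.decompress(fast) == zlib.decompress(slow) == pixels
print("identical after decompression:", same_output)
```

This is the same situation as the two PNGs: an md5 mismatch between the stored files, with no difference in what comes out when they are decoded.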

The :large is not "better" than the :orig, ever. Larger filesize does not automatically mean better. Take for example post #2553638, which is bigger in filesize than its parent but actually worse, because the second time it went through the jpeg encoder, it treated the artifacts left from the first time as data, and had to encode those too (which takes up space), but for the actual data the best it could do was the same as the original, and in some places it's worse. You can't do better than the original unless you guess at what should be there, at which point you're re-drawing rather than just processing. (Maybe waifu2x should be an artist tag...)

sweetpeɐ said:

I think it does. Let's say for instance a user found an image and could not trace down a proper source so they uploaded the best image they could find using a reverse image search engine. That wouldn't be an image sample but would be downscaled.

Here's why it matters. *_image_sample refers to images sampled from specific sources, and we have documentation of how, to the best of our knowledge, these images are sampled, how to avoid using them, and how to replace them with the true source. Thumbnails aren't from such a source. To alias thumbnail to image sample would needlessly erase information about the images tagged as such and would muddy the meaning of image sample.

Okay, that's fair. You've convinced me. I was thinking of "image sample" technically, like a thumbnail is an image sample by definition, but our tag has a more specific meaning. I'm not sure if the distinction can be made reliably, but I do see the purpose of it.

I suppose you're right then. Still though, Twitter is such a strange anomaly to upload from. There are clear-cut cases where with jpgs Twitter samples are blatantly worse, but for png you would think the same thing... except nothing happens but a lower compression (which explains the larger filesize).

Well, what I meant when I said :large is better is that the compression is better, at no cost to the original image data. But yes, you are correct. Filesize isn't really any indicator of image quality.

I'm still sort of on muddy ground as to what thumbnail should be used for, and what we can trust users to attribute it to. Perhaps @henmere can give their opinion?

EDIT: And @☆♪ do you think we should revise twitter sample to add a disclaimer there then? In the case where Twitter png samples have the same resolution as the original, that uploading the original makes no difference? Might be something extra for an approver to remember though, but ehhm...

Mikaeri said:

I suppose you're right then. Still though, Twitter is such a strange anomaly to upload from. There are clear-cut cases where with jpgs Twitter samples are blatantly worse, but for png you would think the same thing... except nothing happens but a lower compression (which explains the larger filesize).

Well, what I meant when I said :large is better is that the compression is better, at no cost to the original image data. But yes, you are correct. Filesize isn't really any indicator of image quality.

Oh, okay, I see what you were saying now. Sorry if I told you a bunch of stuff you already knew. Uploading from Twitter certainly can get "interesting". What's funny is that Twitter has the opportunity to compress better (and some sites do use png optimizers in similar situations), they go through the trouble of recompressing, but actually get a worse compression ratio the vast majority of the time, and don't even default to the original in that case. This may suggest that pngs are relatively rare on Twitter, because clearly they aren't paying attention. I guess jpgs are a lot more common.

Mikaeri said:

I'm still sort of on muddy ground as to what thumbnail should be used for, and what we can trust users to attribute it to.

Yeah, you might also have people misusing the tag just for images that are small and square or something, and then they'd get tagged as samples, leaving future viewers with bad information. So the implication could be a bit dangerous. Maybe image sample should refer only to things that have a defined mapping, so that it's friendly to automated tools, which a generic thumbnail would not be.
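As a rough illustration of what a "defined mapping" could look like for an automated tool, here is a minimal sketch for the Twitter case. The host name, the list of size suffixes other than :orig and :large, and the function names are all assumptions for the example, not the site's actual rewrite logic.

```python
import re

# Assumed size suffixes; only :orig and :large are mentioned in this thread.
SIZE_SUFFIX = re.compile(r":(?:orig|large|medium|small|thumb)$")

# Assumed path prefix for Twitter-hosted media.
TWITTER_MEDIA = "pbs.twimg.com/media/"


def to_orig(url: str) -> str:
    """Map a Twitter media URL to its :orig form; leave other URLs unchanged."""
    if TWITTER_MEDIA not in url:
        return url
    return SIZE_SUFFIX.sub("", url) + ":orig"


def is_sample_url(url: str) -> bool:
    """Treat a URL as a sample if the mapping would change it."""
    return to_orig(url) != url


print(to_orig("https://pbs.twimg.com/media/EXAMPLE.png:large"))
print(is_sample_url("https://pbs.twimg.com/media/EXAMPLE.png:orig"))
```

A generic thumbnail has no rule like this, which is why a tool could not trace it back to a better version.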

Mikaeri said:

EDIT: And @☆♪ do you think we should revise twitter sample to add a disclaimer there then? In the case where Twitter png samples have the same resolution as the original, that uploading the original makes no difference? Might be something extra for an approver to remember though, but ehhm...

No, I don't think anything should be changed with the sample guidelines. A png sample can still lose information if the original is large enough and maybe other conditions we don't know. If we make the guidelines more complicated, it's more likely that people will misinterpret them / make mistakes. The original is still better anyway, and now that the site rewrites automatically (I think) it shouldn't happen too often.

Side note: it seems I didn't get a mention notification this time.

On Twitter I believe it's largely jpg, yup. Maybe one could bother writing something in their developer forum regarding that.

I'm thinking right now the best course of action would be to nuke thumbnail and say not to use it in lieu of lowres and/or image sample. That sound better?

@☆♪ I suppose then, you would have to replace those image samples (despite the bloated filesize) with the originals? It might seem unnecessary, but I suspect someone might do it in the future despite their good intentions, so I think it's better just to clear it up sooner rather than later.

And also, can we get special highlighting in the mod queue for images tagged image sample and md5 mismatch? @evazion @BrokenEagle98 There are a few images that have recently been approved by @Nitrogen09 and @Qpax, for example, that should have been reupped instead. I think it would be better this way.

Ah, yeah, I noticed that part. Although not all the queue moderators are using that CSS, so I'm worried if they might accidentally approve something that should have been reupped.

Mmm... I would prefer it, since the mod queue is quite healthy due to the efforts of all the new moderators. Qpax, especially, has been approving almost as frequently as Provence these days :o

And it would be good if we could also send automated messages to uploaders that upload these image samples (like with pixiv URL sources).

The main problem with the majority of such tags is that they are manually added, meaning that there is a high likelihood that a post would get approved before it gets tagged.

The only exception I can think of would be flagged images, but do those appear any different in the queue, or do they look like all of the rest?

They don't have any special highlighting in the queue save for duplicates (whether that be a flag or pending approval)

My thoughts would be to categorize the highlighting as such:

  • Posts marked with yellow: duplicate (current), image sample, md5 mismatch, resized, upscaled, downscaled, waifu2x (although this is manually attributed so maybe shouldn't be included -- upscaled may already fulfill this)
    • These posts would cover images that if normally accepted, would tell the approver "upload the original instead" when available, or simply ignore the post in case of a duplicate
  • Posts marked some other color (maybe orange?): hard translated, self upload, nude filter, photoshop, screencap
    • These would cover images that should be treated with much more scrutiny as they're modifications of their original source (with screencap, that would be the episode frame(s), which can vary)
  • As for bad anatomy, bad proportions, and the rest of the bad_* tags, I think there's still room to discuss their uses currently (topic #13673, forum #127040) so I would hold off until there's more consensus.
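The proposed grouping can be written down as a simple lookup. This is only a sketch: the underscore tag spellings, the function name, and the rule that orange wins when a post has tags from both groups are assumptions, since the list above doesn't say which color takes priority.

```python
# "Upload the original instead" or ignore-as-duplicate group.
YELLOW_TAGS = {
    "duplicate", "image_sample", "md5_mismatch",
    "resized", "upscaled", "downscaled", "waifu2x",
}

# Modified-from-source group that needs more scrutiny.
ORANGE_TAGS = {
    "hard_translated", "self_upload", "nude_filter",
    "photoshop", "screencap",
}


def highlight_color(tags):
    """Return the queue highlight for a post's tags, or None for no highlight."""
    tags = set(tags)
    # Assumption: the higher-scrutiny group takes priority over yellow.
    if tags & ORANGE_TAGS:
        return "orange"
    if tags & YELLOW_TAGS:
        return "yellow"
    return None


print(highlight_color(["1girl", "image_sample", "md5_mismatch"]))
print(highlight_color(["screencap", "resized"]))
print(highlight_color(["1girl", "solo"]))
```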


Okay, I've updated my CSS code in forum #127054 to include some of the suggestions above.

BrokenEagle98 said:

The only exception I can think of would be flagged images, but do those appear any different in the queue, or do they look like all of the rest?

Flagged posts in the mod queue still have the usual, noticeable enough red border, so there's no need to mark them with a background color.


md5_mismatch matters when the source has a different image than the one submitted, whether it's better or worse. It's a precaution -- if the image is an md5_mismatch, then we know either that the source is incorrect or that the user has modified the image in some way.
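For anyone unsure what the check amounts to, here is a minimal sketch of an md5 comparison between an uploaded file and the file at its source. The function names and the sample byte strings are made up for the example; this is not the site's actual code.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def is_md5_mismatch(uploaded: bytes, source: bytes) -> bool:
    """True when the uploaded file is not byte-for-byte the source file."""
    return md5_hex(uploaded) != md5_hex(source)


# Hypothetical file contents: the second differs by a single byte.
source_file = b"example image bytes"
modified_file = b"example image bytez"

print(is_md5_mismatch(source_file, source_file))
print(is_md5_mismatch(modified_file, source_file))
```

Note that the check only says the files differ. It can't say whether the upload is better or worse than the source, which is why a mismatch is a prompt to look closer rather than a verdict.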

The one thing I suppose we'll all have to be more careful about is if users forget to remove the relevant tags while they're replacing image samples, but usually we can at least quickly tell if the image has a child by the green border around it.

As long as they're image samples and the full version is uploaded, then it is fine to go ahead and delete the image sample copy.

This stance is open to change if abuse occurs and things other than image samples end up getting deleted though.

I got a PM regarding a post I flagged for being an unsourced duplicate (post #2526507, likely a deleted tweet because of the bloated filesize/poor png compression compared to the pixiv original), so I'd like to revisit forum #127532.

The reason I consider unsourced (and potentially downscaled) duplicates to be samples is that allowing them provides a dangerous incentive to NOT upload with sources (which I consider to warrant neutral or negative feedback). Even images that are detexted or use photoshop should reference their original image somewhere, whether through the parent-child relationship or through the source itself.

The current system for detecting these image samples and md5 mismatches only works if users are willing to add sources to their own images. Which they should, because not doing so would be fairly obstructionist. This is why I elaborated in help:image source that if an image is not self-provided, a source MUST be provided to the best of that user's ability. There are a handful of uploaders with unlimited permissions that really skirt the lines of this rule (for example, only providing the work/copyright title as the source), but at the very least something should be provided.

Because I see it as this. If, say, a user wanted to pad their uploads with high quality uploads that eventually turned out to be duplicates (with extremely minute changes) or hard-to-tell waifu2x upscales, and they weren't required to provide a source for their image (not citing the software they used if they modified the image, etc) then I'm of the opinion that those posts should be deleted.

But perhaps better to ask the question what can stay (for identical images)? My opinion is these:

  • The latest revision (as long as it stays active)
  • All other known images that have been revised or are unique, known derivatives
  • Unsourced duplicates that are uploaded before an original is uploaded, and can be neither confirmed nor denied as a sample due to a deleted source. These shouldn't have noticeable resizing artifacts and/or other sorts of defects.

EDIT: Thanks to NWF Renim for pointing this out, but there should also be caution with unsourced duplicates that are uploaded before a sourced original is uploaded so as to encourage users to upload their images even if they don't have a source provided (say a bad id, if their image was saved and artist deleted it off the site).

EDIT 2: Would also highlight the danger of uploading from other imageboards (yandere in particular, post #2638827), as we can't always confirm an image's "originality" if we upload from yandere themselves. Their sources neither confirm nor deny an image's authenticity the way ours do with the scripts that run. So it's always a better idea to just upload from the most original source, rather than a derivative source.

EDIT 3: Looks like my hunch was right. Digging through some tumblr reblogs from GIS, it turns out that particular image was a deleted tweet.
