Pixiv: md5_mismatch scan, revisions

RaisingK

As described in revision, artists can upload revisions that replace the original and are indicated by "?[unix timestamp]" appended to the image URL. As I understand it, the original can still be accessed by dropping the timestamp, but only for a limited time--sometimes a day, sometimes longer.

Anyway, I kicked off a one-time(?) scan to find md5 mismatch pixiv posts using HEAD requests, and most seem to be because of revisions, which in a few cases are even completely different images. There are already thousands of new md5 mismatch posts, with the accompanying comparison comments. Scan speed is only 100 to 120 posts per minute over the 774457 posts matching status:any source:pixiv/* -bad_id; 54 hours remaining.
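
For illustration only, here is a minimal sketch of how a HEAD-based check like this might work. It is not the actual scan script. It assumes the check compares the Content-Length returned by a HEAD request against the file size stored for the Danbooru post, which can only flag a probable mismatch (a true MD5 comparison needs the full download). The function names, the size heuristic, and the Referer value are all assumptions.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def strip_revision(url):
    """Drop a '?<unix timestamp>' revision marker from an image URL."""
    parts = urlsplit(url)
    if parts.query.isdigit():
        parts = parts._replace(query="")
    return urlunsplit(parts)


def head_content_length(url, referer="http://www.pixiv.net/"):
    """Send a HEAD request and return Content-Length as an int, or None.

    The Referer value is an assumption made for this sketch.
    """
    req = urllib.request.Request(url, method="HEAD", headers={"Referer": referer})
    with urllib.request.urlopen(req, timeout=30) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None


def probable_mismatch(stored_size, remote_size):
    """Flag a post when the remote size is known and differs from the stored one."""
    return remote_size is not None and remote_size != stored_size
```

A post would then be flagged with something like probable_mismatch(post_file_size, head_content_length(source_url)), where post_file_size and source_url are hypothetical names for values read from the post.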

I hope this is not too unwelcome a course of action...

See also:

revision
md5 mismatch
topic #8648 - Mass uploading revisions
topic #7792 - Revised images

Updated by RaisingK over 11 years ago

OOZ662

over 11 years ago

My only beef with the system is the flood of bumped comments rendering the Comments feed a bit useless; I used to use it to find images interesting enough to spur discussion, but nowadays it's generally half (or more) MD5 Mismatch notifications.

Toks

over 11 years ago

I get the tag but are the 5000 extra comments really necessary? They're not providing any useful information that simply adding the md5_mismatch tag itself doesn't already provide. And they're beginning to get annoying with how many of them I'm seeing all over the place.

RaisingK

over 11 years ago

OOZ662 said:
My only beef with the system is the flood of bumped comments rendering the Comments feed a bit useless; I used to use it to find images interesting enough to spur discussion, but nowadays it's generally half (or more) MD5 Mismatch notifications.

Wait, what? Only sample/thumb comments are supposed to be bumping. I did some debugging, and it isn't bumping for me. I don't get it...

Toks said:
I get the tag but are the 5000 extra comments really necessary? They're not providing any useful information that simply adding the md5_mismatch tag itself doesn't already provide

Width/height/filesize of the two versions isn't useful to you? It's certainly useful to me.

Updated by RaisingK over 11 years ago

Dbx

over 11 years ago

Should these really be classified as md5 mismatches? While the image at the original url will eventually be changed, I'm not aware of that url being made available by the Pixiv frontend any longer; the revised image gets a new url. When a normal user uploads a revised image to Danbooru, it'll have the new source url. The fact that the original url will link to the revised image instead of a 404 or redirect seems like a behind the scenes implementation choice that we shouldn't be concerned with.

RaisingK

over 11 years ago

Yes. Whatever the reason, the MD5 hash of the image pointed to by the source link (with or without the ?###) does not match the MD5 of the Danbooru post. A mismatch. Simple.

Dbx

over 11 years ago

You wouldn't just consider it to be erased because it's unreachable unless you tamper with the referer header?

Toks

over 11 years ago

RaisingK said:
Wait, what? Only sample/thumb comments are supposed to be bumping. I don't get it...

Did you forget that order:comment ignores whether the comment was a bumping comment or not (issue #1351)? All comments bump posts for the purposes of order:comment, so all these comments of yours are making order:comment far less useful.

RaisingK said:
Width/height/filesize of the two versions isn't useful to you?

No, not really. But even if I did need them I could just check the images myself to get them easily. I don't need that information in the form of a comment every time I view an image.

RaisingK

over 11 years ago

Toks said:
Did you forget that order:comment ignores whether the comment was a bumping comment or not (issue #1351)? All comments bump posts for the purposes of order:comment, so all these comments of yours are making order:comment far less useful.

OOZ662 said the "Comments feed", though. Doesn't that refer to http://danbooru.donmai.us/comments?group_by=post and not order:comment?

Toks said:
No, not really. But even if I did need them I could just check the images myself to get them easily. I don't need that information in the form of a comment every time I view an image.

Right, so you have to manually load every image to get those three attributes, for every image, to see if it's enough to be worth re-uploading. With the comments, that information is there at a glance. Just "md5 mismatch" could be +0.1 KB, or it could be +1000px/+1000px. I want to find the big differences.
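
As a rough illustration of the kind of at-a-glance summary meant here (not the actual comment format the script posts), the width/height/filesize difference can be reduced to one short string. The function name and the tuple layout are assumptions for this sketch.

```python
def describe_difference(danbooru, pixiv):
    """Summarize how the pixiv version differs from the Danbooru post.

    Each argument is a (width, height, filesize_in_bytes) tuple.
    """
    dw = pixiv[0] - danbooru[0]
    dh = pixiv[1] - danbooru[1]
    dkb = (pixiv[2] - danbooru[2]) / 1024
    return f"{dw:+d}px/{dh:+d}px, {dkb:+.1f} KB"
```

With that, a trivial re-save and a much larger revision look clearly different even though both carry the same md5_mismatch tag.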

Updated by RaisingK over 11 years ago

EB

over 11 years ago

Toks said:
But even if I did need them I could just check the images myself to get them easily.

Not if the image is revised again, or it becomes a bad_id. Granted, there's no getting the image at all at that point, but I could see it as potentially useful in that case if it happens to get uploaded from a third-party source (to help verify that it's not a third-party alteration).

Toks

over 11 years ago

RaisingK said:
OOZ662 said the "Comments feed", though--doesn't that refer to http://danbooru.donmai.us/comments?group_by=post and not order:comment?

It could mean either one but in this context it must mean order:comment.

And regardless of what OOZ662 is talking about, I find the decreased usefulness of order:comment annoying.

RaisingK said:
Right, so you have to manually load the entire image to find those three attributes. For every image, to see if it's enough to be worth re-uploading. With the comments, that information is there at a glance. A change of 0.1 KB is a bit different than 1.0 MB, but you wouldn't know it from just "md5 mismatch".

I can see how that would matter to you since you're going through thousands of images and deciding which are worth re-uploading. But it wouldn't matter to most people since most people aren't going to go through thousands of images and re-upload them.

I only have two md5_mismatch uploads so I would have no problems manually loading two entire images and looking at them normally. Non-uploaders would care even less about it.

Why not just output this information to a text file so you can look at it without spamming comments?

Type-kun

over 11 years ago

Toks said:
I only have two md5_mismatch uploads so I would have no problems manually loading two entire images and looking at them normally. Non-uploaders would care even less about it.

Except you would never know that your images have an md5 mismatch unless you monitor them periodically. Also, there are different causes of md5 mismatch, namely uploading from another source and changing it to pixiv later.
Maybe the script should bulk-collect the info and then PM it to uploaders and/or anyone who does post gardening, or post it in some specific thread? A text file seems fine only if RaisingK would take care of all mismatches personally, which isn't as efficient as leaving it to the community.

Toks

over 11 years ago

Type-kun said:
Maybe the script should bulk-collect the info and then PM it to uploaders

I agree, just had the same idea actually. Doing that would make the information available to the only two people who might want it (RaisingK and the uploader), while avoiding comment spam and messing up order:comment.

Though the "bulk" part of that is important - only one dmail should be sent to each uploader; sending a separate dmail for every md5_mismatch post would be even more annoying than comments to people who have a lot of uploads.
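
A minimal sketch of the "bulk" idea, assuming the scan produces (uploader, post id) pairs: group the mismatches by uploader first, then build a single message body per uploader. The function names and the message wording are made up for illustration, and actually sending the dmail is left out.

```python
from collections import defaultdict


def group_by_uploader(mismatches):
    """Collect post ids per uploader from (uploader_name, post_id) pairs."""
    grouped = defaultdict(list)
    for uploader, post_id in mismatches:
        grouped[uploader].append(post_id)
    return dict(grouped)


def dmail_body(post_ids):
    """Build one message listing every mismatched post for a single uploader."""
    lines = ["The following uploads no longer match their pixiv source:"]
    lines += [f"post #{pid}" for pid in sorted(post_ids)]
    return "\n".join(lines)
```

Each uploader then gets exactly one message, however many of their posts were flagged.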

OOZ662

over 11 years ago

RaisingK said:
Wait, what? Only sample/thumb comments are supposed to be bumping. I did some debugging, and it isn't bumping for me. I don't get it...

Sorry, I hadn't noticed the distinction. I was referring to http://danbooru.donmai.us/comments?group_by=post

RaisingK

over 11 years ago

OOZ662 said:
Sorry, I hadn't noticed the distinction. I was referring to http://danbooru.donmai.us/comments?group_by=post

As was I, and I can't reproduce the issue on demand.

Toks said:
I agree, just had the same idea actually. Doing that would make the information available to the only two people who might want it (RaisingK and the uploader), while avoiding comment spam and messing up order:comment.
Though the "bulk" part of that is important - only one dmail should be sent to each uploader; sending a separate dmail for every md5_mismatch post would be even more annoying than comments to people who have a lot of uploads.

I and the uploader of a given image can't be the only ones who might re-upload that image based on the difference described in the comment. I doubt uploaders in general are going to care about their specific uploads, and I am not going to send out dmails on the same scale as comments anyway.

I'm not doing this scan because I'm looking for more posts to upload; I have plenty of those as it is. I just wanted to put out more information on mismatches to anyone else who might be interested (well, that and because I could).

The comment spam is a one-time thing, and order:comment will clear up soon afterwards, right? I don't see any harm in leaving a quick automated comment on md5 mismatch posts apart from that; I've been doing that for quite some time, just not on this scale.

user 11314

over 11 years ago

I don't think your script should be bumping posts if the Danbooru image is LARGER (filesize) than the current Pixiv post, as this is most likely due to Pixiv compressing the JPG on their server.

RaisingK

over 11 years ago

It already shouldn't be bumping mismatches (on the Comments index) period--comment[do_not_bump_post] is set to 1; nothing I can do about order:comment.
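
For reference, a sketch of what a non-bumping comment submission could look like as form data. Only comment[do_not_bump_post] is taken from the post above; the other field names and the helper itself are assumptions for illustration, and nothing is actually sent.

```python
from urllib.parse import urlencode


def comment_form(post_id, body, bump=False):
    """Return URL-encoded form data for a comment, non-bumping by default."""
    fields = {
        "comment[post_id]": post_id,  # assumed field name
        "comment[body]": body,        # assumed field name
    }
    if not bump:
        fields["comment[do_not_bump_post]"] = 1
    return urlencode(fields)
```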

OOZ662

over 11 years ago

RaisingK said:
As was I, and I can't reproduce the issue on demand.

I hadn't made the distinction between the MD5 Mismatch comments (which don't bump themselves now that I really looked at it) and the ones that do bump, i.e. the pixiv manga sample and pixiv thumbnails ones. Those are the ones that flood through every once in a while.

I just saw the username on them all and assumed they were all bumping.

Toks

over 11 years ago

RaisingK said:
nothing I can do about order:comment.

You can stop spamming comments.

jxh2154

over 11 years ago

Ehhhhhh, I think the comments are useful but I kinda see the complaints about comment spam. I don't know, maybe it's because I almost never read comments, let alone use order:comment (which I didn't know existed until this thread), so I'm not a good person to ask. I don't really have a snap decision on this, it's purely a matter of opinion and what features you happen to use.

I don't think limiting the rate per day would help either, 774457 would require like two years even at 1000/day. And 1000 a day for 2 years would be a much bigger issue than one lump run.
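
The arithmetic behind that estimate, as a quick sketch (the helper name is made up; the numbers are the ones quoted in this thread):

```python
def days_needed(total_posts, posts_per_day):
    """Days required to work through a batch at a fixed daily rate."""
    return total_posts / posts_per_day


days = days_needed(774457, 1000)
years = days / 365
print(f"{days:.0f} days, about {years:.1f} years")
```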

When there's no clear answer I guess the status quo wins by default, with a negative effect on an existing feature being a bigger problem than not implementing a new feature (or rather, not expanding an existing one). So, I guess the comments should not be made (on the 700k legacy batch anyway; on newly uploaded stuff it's fine).
