petopeto said:
I'm just cutting off at a fixed value: 90% by default, 30% on "show more", or whatever's specified in the "threshold" parameter. If you include a value for the default threshold, I can use it as the default and as the dupe check threshold.
You might as well not use any threshold at all for "show more", and just show all 16 results. Using 30% means you either get none of the other results (for grayscale images where the noise level is often below 30%), or you get all of them anyway.
I do now return the suggested threshold level anyway. In my tests it was always slightly above the first noise pic, though I'm open to tuning the std.dev. threshold of 5% if needed. In case of no relevant match, I return a 90% threshold. You should use the returned threshold to decide match relevance, and use no threshold (or 0%) for "show more".
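To make that concrete, here is a rough client-side sketch of how the returned threshold could be applied. The function name, the "similarity" and "md5" field names, and the sample numbers are all made up for illustration; they are not the actual response format.

```python
def split_matches(results, suggested_threshold, show_more=False):
    """Split results into (shown, hidden) by similarity percentage.

    `results` is a list of dicts with a hypothetical "similarity" field (0-100).
    """
    # "show more" applies no threshold (0%), so every result is shown.
    cutoff = 0.0 if show_more else suggested_threshold
    shown = [r for r in results if r["similarity"] >= cutoff]
    hidden = [r for r in results if r["similarity"] < cutoff]
    return shown, hidden


# Invented sample data: one real match followed by two noise results.
results = [
    {"md5": "a", "similarity": 96.0},
    {"md5": "b", "similarity": 41.0},
    {"md5": "c", "similarity": 38.5},
]

# Use the threshold the service returned to decide relevance...
relevant, rest = split_matches(results, suggested_threshold=45.0)
# ...and no threshold at all for "show more".
everything, nothing = split_matches(results, suggested_threshold=45.0, show_more=True)
```

With a fixed 30% cutoff, both noise results in this sample would have been shown as matches, which is the problem described above.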
Another good post to test this is http://moe.imouto.org/post/similar/609?services=all
This is trickier, but: is there any way we could get tags for results?
I really don't want to attempt to mirror Danbooru's DB. I can return the image resolution because it never changes, and because the image DB has fields for it anyway. But tracking tag and rating changes would be a nightmare: I'd have to retrieve the entire Danbooru DB at regular intervals. For tags, one could implement a JSON/XML interface to query recent tag changes, but there isn't even an internal change log for ratings, so retrieving those would be impossible. Even if it were feasible, I'd like to avoid this if at all possible.
You could however retrieve tags and ratings with a standard JSON/XML call for each relevant match. You might even cache the result. I decided not to do this for myself because it adds a delay to showing the matches, especially if one of the services is down. If you do it in JS in the browser, it would not slow down the results at all, but the user might briefly see images he does not want to see.
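A minimal sketch of the per-match lookup with a cache, assuming you supply your own fetch function that does the actual JSON/XML call. TagCache and the fetch signature are hypothetical names, not part of any existing API. Failures are not cached, so a service that is down only costs you the tags for that request.

```python
class TagCache:
    """Cache tag/rating lookups per (service, md5) pair."""

    def __init__(self, fetch):
        # fetch: callable (service, md5) -> dict of tag/rating info.
        # It is expected to raise if the service is unreachable.
        self._fetch = fetch
        self._cache = {}

    def get(self, service, md5):
        key = (service, md5)
        if key not in self._cache:
            try:
                self._cache[key] = self._fetch(service, md5)
            except Exception:
                # Service down: return nothing and do not cache the failure,
                # so the match can still be shown without tags.
                return None
        return self._cache[key]


# Stand-in for the real network call, for demonstration only.
calls = []

def fake_fetch(service, md5):
    calls.append((service, md5))
    if service == "down":
        raise IOError("service unavailable")
    return {"tags": ["example_tag"], "rating": "s"}


cache = TagCache(fake_fetch)
first = cache.get("danbooru", "abc")
second = cache.get("danbooru", "abc")   # served from the cache, no second call
missing = cache.get("down", "abc")      # failure yields None
```

Note that cached tags and ratings can go stale, so in practice the entries would probably need an expiry time.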
If you decide to do it either way, it would be useful to implement a JSON/XML query interface in the danbooru engine that takes multiple MD5s for all relevant matches from that service, so you only have to make one call per service instead of one for each match. This should not be too much work.
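On the calling side, the batching could look roughly like this. Since the multi-MD5 interface does not exist yet, the "md5" parameter name and the comma-separated format are pure assumptions, as are the "service" and "md5" fields on each match.

```python
from collections import defaultdict
from urllib.parse import urlencode


def batch_queries(matches):
    """Group relevant matches by service and build one query string per service."""
    by_service = defaultdict(list)
    for m in matches:
        by_service[m["service"]].append(m["md5"])
    # One query per service, carrying all of that service's MD5s at once.
    # The "md5" parameter and comma separator are hypothetical.
    return {
        service: urlencode({"md5": ",".join(md5s)})
        for service, md5s in by_service.items()
    }


# Invented sample matches from two services.
matches = [
    {"service": "danbooru", "md5": "aa"},
    {"service": "danbooru", "md5": "bb"},
    {"service": "moe.imouto", "md5": "cc"},
]
queries = batch_queries(matches)
```

Three matches from two services turn into two calls instead of three, and the saving grows with the number of relevant matches per service.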