Image search

petopeto said:
Ah, yeah, I forgot that the thumb is always a JPEG, OK. I can construct the URL, but having it in the XML would be nice so it doesn't break later if it changes, and so supporting extra services is easy.

OK, I've added the preview URL now.

Any reason you're waiting to release the code? (GPL means "release the code, don't ask first", though of course it's polite to send patches upstream too.) I have the XML interface integrated and will send it off to dovac soon to try it on moe. As I was testing it locally, I found that danbooru itself makes a decent personal image database. Being able to run the image search locally on my own files (in addition to remotely for other sites) would be handy.

I could also try to have it push updates to your server when a new file is uploaded, so searching can be up to date, e.g. by querying http://haruhidoujins.yi.org/search-update.php?service=moe.imouto.org&id=55&md5=0a01723647d685444736fcedfa41a936 when that post is uploaded or deleted. Let me know if you want to do that. Of course, it'd only work where the code is merged (don't know if rq is interested), but it'd also be useful for local repositories if you release the code. I'd much rather have it update when I upload something than have to run a cron job.
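On the danbooru side it'd be something trivial like this (just a sketch; the endpoint and parameters are the example URL above, and file_get_contents stands in for whatever HTTP call the real code would use):

// Hypothetical sketch: ping the search server after an upload or deletion.
// URL format taken from the example above.
function notify_search_update($service, $id, $md5) {
    $url = 'http://haruhidoujins.yi.org/search-update.php'
         . '?service=' . urlencode($service)
         . '&id=' . urlencode($id)
         . '&md5=' . urlencode($md5);
    @file_get_contents($url);   // fire-and-forget GET, response ignored
}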

petopeto said:
Any reason you're waiting to release the code? (GPL means "release the code, don't ask first", though of course it's polite to send patches upstream too.)

You're right of course. Besides, I've realized that my changes would probably not be useful for the imgSeek folks, since I tore out several things they rely on (such as keywords and Python bindings) and redesigned the API.

However I still need to come up with a good name for this fork. It's no longer compatible, so naming it the same would cause a lot of confusion.

I could also try to have it push updates to your server when a new file is uploaded, so searching can be up to date

For faster queries and less memory usage, I convert the database to a read-only structure after loading it; basically, I resolve all image IDs up front so I don't have to look them up for each query. But that also means that online updates are impossible. Updating the Danbooru image DB takes ~30 seconds, and reloading it in the DB server another ~10 seconds, so doing that for every pic would probably be excessive.

It would be possible to write some code to support adding and removing pics even in "read-only" mode, but that would also make queries a little slower. I'm not sure it's really all that necessary.

OK, here's the code. For now I renamed the project to "iqdb", for "image query database". I have the best imagination ever.

http://haruhidoujins.yi.org/iqdb-20080218.tar.bz2

I didn't include most of my PHP code because a large part of it is for the Haruhi doujins my site was originally about. That'd just be confusing. I did include a PHP script to query the database server and parse the reply; it should provide everything you need to get it running, or to convert it to whatever language you prefer.
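In case it helps, the plumbing in that script boils down to something like this (a sketch only; the port number and the command syntax here are placeholders, the included script has the real protocol):

// Sketch of talking to the iqdb server over a TCP socket.
// Port and the exact query command are assumptions;
// see the bundled PHP script for the real syntax.
$fp = fsockopen('localhost', 5566, $errno, $errstr, 5);
if (!$fp) die("connect failed: $errstr ($errno)\n");
fwrite($fp, "query /tmp/query-thumb.jpg\n");   // hypothetical command line
while (($line = fgets($fp)) !== false) {
    echo $line;   // each reply line carries an image ID and its similarity
}
fclose($fp);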

[edit]

petopeto said:
I think it's important for local indexing, but since I might be the only one who wants that, I'll look at it when you release it.

For a local copy you could just run it in "normal" mode; searches will be a little slower, but you probably won't care if it takes 500 ms instead of 80 ms. And that's with the Danbooru image set; a smaller collection would be substantially faster anyway.

Updated

It takes 1.75 seconds to run "iqdb query" for me; I definitely care about that over 0.08. (If I were to use danbooru as a general image organizer--which I don't think is too insane, though it'd need some enhancements to pools, at least, and a more sophisticated version of piclens--I'd have around 80k images in there.)

Can you put up the code for the XML queries? The idea is for others to be able to get the same HTTP/XML API running, so I can take my service list (e.g. "danbooru.donmai.us" => "http://haruhidoujins.yi.org/multi-search.xml") and add "me" => "http://localhost/multi-search.xml", and have image searches on my local server search my own repository in addition to the ones you're indexing, using the same API. This way, anyone running a danbooru (public or personal) can have their "find similar images" link check whichever servers they want.
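Roughly what I have in mind on my end (an illustrative sketch, not real code; the XML element and attribute names are assumptions):

// Sketch: query each service's multi-search.xml for the same image URL
// and merge the matches into one list.
$services = array(
    'danbooru.donmai.us' => 'http://haruhidoujins.yi.org/multi-search.xml',
    'me'                 => 'http://localhost/multi-search.xml',
);
$matches = array();
foreach ($services as $name => $endpoint) {
    $xml = @simplexml_load_file($endpoint . '?url=' . urlencode($image_url));
    if ($xml === false) continue;
    foreach ($xml->matches as $m) {   // element name assumed
        $matches[] = array('service' => $name, 'match' => $m);
    }
}
// Sort the merged list by similarity, best match first ('sim' attribute assumed).
function by_similarity($a, $b) {
    $d = (float)$b['match']['sim'] - (float)$a['match']['sim'];
    return $d > 0 ? 1 : ($d < 0 ? -1 : 0);
}
usort($matches, 'by_similarity');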

petopeto said:
It takes 1.75 seconds to run "iqdb query" for me; I definitely care about that over 0.08.

That's mostly the time it takes to load the database; you can't avoid that, and in that case the query time is the least of your problems. If you run it in server mode, though, it only has to load the DB once instead of for every query. Then you'll see the time it actually takes to query the DB.

On the other hand, it would be possible to query the DB without loading all of it first, but that would require a different way of storing it on disk.

Can you put up the code for the XML queries?

You mean for querying local files? Or also querying external URLs? The former would just be a handful of lines of code, the latter is quite a bit more complicated. Just wondering if it would be needed, since it would also require Curl to be installed for PHP.

Though for running it locally I think you should really skip the webserver+PHP+XML part entirely, and just query the DB directly, like in my sample PHP script. You'd need to make separate queries to my search and your local search anyway, and then merge them somehow.

piespy said:
That's mostly the time it takes to load the database; you can't avoid that, and in that case the query time is the least of your problems. If you run it in server mode, though, it only has to load the DB once instead of for every query. Then you'll see the time it actually takes to query the DB.

I'm not sure what you meant by "normal mode", then. As far as I can tell, there's only one way to load the server in listen mode.

piespy said:
You mean for querying local files? Or also querying external URLs? The former would just be a handful of lines of code, the latter is quite a bit more complicated. Just wondering if it would be needed, since it would also require Curl to be installed for PHP.

I have a local copy of danbooru running, with a /post/similar interface that queries your image server via multi-search.xml for a specified image and shows the results in the same format as danbooru's /post/index (well, using moe.imouto.org's layout, since I need a place to put the similarity and size). In addition to finding posts on danbooru/moe/kona, I want it to be able to find images uploaded onto my local copy of danbooru. The code uses the PHP interface; it doesn't talk to the image server directly. Merging the results of multiple queries is already handled.

It uploads files via POST; it never sends a URL.

I'll release the full code soon, but here's the current low-level code, in case you're curious: http://pastebin.com/m3839b374
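The upload itself is just the usual cURL multipart POST, something like this (a sketch; the endpoint and field name are assumptions, the real thing is in the pastebin):

// Sketch: upload a file to the search endpoint via multipart POST.
$ch = curl_init('http://haruhidoujins.yi.org/multi-search.xml');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
    'file' => '@/tmp/query-thumb.jpg',   // '@path' = file upload in old PHP/cURL
));
$response = curl_exec($ch);
curl_close($ch);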

petopeto said:
I'm not sure what you meant by "normal mode", then. As far as I can tell, there's only one way to load the server in listen mode.

If you want to query, that's all you need. If you want to modify the DB online, you'll have to modify iqdb.cpp and change imgdb::dbSpace::mode_simple to ::mode_normal in the server(...) function. And if you want to save the DB, you need to send it the saveas command. Or copy the DO_QUITANDSAVE code from the command(...) function to save it when quitting.

It uploads files via POST; it never sends a URL.

OK, that just takes another couple of lines in PHP. Try this: http://haruhidoujins.yi.org/iqdb-xml.php.gz

I'm going to allow the internal interface to take URLs, too, so I can use the same interface for all image searching. Can you have the XML interface include the dimensions and thumbnail URL of the file downloaded (maybe as attributes to <matches>)? I'm just passing the URL through, so I can't do this server-side.

ed: MD5 of the file wouldn't hurt, either (since it's available for all other source and result images, it eliminates an exception)

Here's a quick GreaseMonkey script: alt-shift-click any image, and it'll load up an image search in another tab. Edit GM_openInTab to point to the search you want. http://pastebin.com/f7c53e61b

petopeto said:
Can you have the XML interface include the dimensions and thumbnail URL of the file downloaded (maybe as attributes to <matches>)? I'm just passing the URL through, so I can't do this server-side.

ed: MD5 of the file wouldn't hurt, either

You'd only get the thumbnail's dimensions, and the MD5 of the thumbnail, rather than those of the image it represents. The one exception is the very special case where the thumbnail is for a service I index; then it would theoretically be possible to return this information. However, since I load the DB in simple mode for faster searching and lower memory usage, it's not possible to look up image data by ID directly, short of searching the entire DB for it. So I'm not sure that's really worth it.

OK, I've added it as a preview field on the matches tag. Let me know if there's a more convenient way.

It's not tested a lot, so I hope I didn't miss any cases where no thumbnail is generated; otherwise you'll probably get a broken URL. Let me know if that happens too.
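Reading it on your side should just be a SimpleXML one-liner, something like this (a sketch; it assumes preview ends up as an attribute on each matches tag):

// Sketch: pull the preview URL out of each match.
$xml = simplexml_load_string($response);
foreach ($xml->matches as $m) {
    echo (string)$m['preview'], "\n";
}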

Updated

Whoa, that's pretty amazing.

Are you using a fixed similarity threshold to detect relevant matches? I've found that a standard-deviation-based threshold works better, since the similarity noise level varies depending on the source pic: basically, how common its major wavelet coefficients are (and especially whether or not it's grayscale or set to discard colors).

For my own search, I start at the least similar match and keep going until the std.dev. rises above a certain threshold, in my case 5%. (Note that for GET queries of images on danbooru et al. the exact match is discarded.) If the threshold is reached, there is a relevant match; otherwise all of it is noise. The actual similarity threshold is then 5% above the noise level, which is the average similarity of the pics seen before the std.dev. exceeded 5%.

That way you can find a barely-relevant match even with varying noise levels. For example compare these results, both querying the same thumbnail:

http://moe.imouto.org/post/similar/6210?services=all

http://haruhidoujins.yi.org/multi-search.php?url=http://moe.imouto.org/preview/4c/65/4c6511401f0d4029a540c6fb42c894e2.jpg
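In code, the cutoff works roughly like this (a sketch of the algorithm as described; it assumes similarities are percentages, passed in least similar first):

// Sketch of the deviation-based cutoff described above.
// $sims: similarities of all matches, least similar first, in percent.
function relevance_threshold(array $sims) {
    $noise = array();
    foreach ($sims as $s) {
        $trial = $noise;
        $trial[] = $s;
        $mean = array_sum($trial) / count($trial);
        $var = 0;
        foreach ($trial as $t) $var += ($t - $mean) * ($t - $mean);
        if (sqrt($var / count($trial)) > 5)
            break;               // std.dev. exceeded 5%: we left the noise
        $noise[] = $s;
    }
    if (count($noise) == count($sims))
        return false;            // std.dev. never exceeded 5%: all noise
    // Noise level = average similarity of the pics before the break.
    return array_sum($noise) / count($noise) + 5;
}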

By the way, if you prefer, I can indicate in the XML result whether I consider a match relevant according to this algorithm, so that you don't have to code it yourself. Or I can at least return the appropriate similarity threshold and let you figure out which pics are above it.

I'm just cutting off at a fixed value; 90% by default, 30% on "show more", or whatever's specified in the "threshold" parameter. If you include a value for the default threshold, I can use it as the default and as the dupe-check threshold.

I think the only feature not present in this interface is accepting uploaded files directly, which I'm already finding to be a problem, so I'll probably try to implement that soon, too...

This is trickier, but: is there any way we could get tags for results? It'd probably mean having to scan the tag history to keep them updated. (Ratings, too, but that probably needs a rating history, which doesn't exist yet.)

The problem is that without that info, user blacklists and can_see_post can't be honored, so image search comes up with results that people don't want to see.

I know this is nontrivial; I'm open to ideas...
