This repository has been archived by the owner on Mar 3, 2022. It is now read-only.

[WIP] Add preliminary phash ETL #76

Open
wants to merge 7 commits into
base: master


@imalsogreg imalsogreg commented Jul 8, 2018

Proof of concept phash ETL job (#74)

This adds data to a phash column in the ads table, with the perceptual hash of each image in the ad. A prereq for doing reverse image search.

Written in Haskell to make it easy for me to experiment, but I'm happy to rewrite in the other languages from the project if that will help with maintainability. The code here is more for demo of the approach.

There is a single executable with 3 subcommands: one to update the ads table with phashes for all images in each ad, one to reset that column, and one to check database connectivity.

[facebook-ad-image-hashes]$ hashes-cli populate-phashes --help
Usage: hashes-cli populate-phashes [-h|--dbhost ARG] [-p|--dbport ARG]
                                   [-U|--dbuser ARG] [-p|--dbpass ARG]
                                   [-d|--dbname ARG]
  Compute phashes for images in the ads database

Available options:
  -h,--dbhost ARG          Database Host
  -p,--dbport ARG          Database Port
  -U,--dbuser ARG          Database User
  -p,--dbpass ARG          Database Password
  -d,--dbname ARG          Database Name
  -h,--help                Show this help text

The implementation is careful to stream rows from ads and to batch writes back into the phash column, since your database may be larger than what fits in memory on the cron machine driving this. It also keeps a least-recently-used cache of phashes for urls, since we expect a lot of redundancy in ad image urls across different users' ads, and we don't want to repeatedly download those. I did some light testing with my local database and verified that the basic logic and caching work. I still need to verify that the threading works and gives a speedup.
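To make the streaming, batching and caching idea concrete, here is a minimal Python sketch. It is not the Haskell code in this PR: `LruCache`, `batches`, `hash_ads` and the `compute_phash` callback are hypothetical names, and the row shape `(ad_id, image_urls)` is an assumption.

```python
from collections import OrderedDict
from itertools import islice


class LruCache:
    """Least-recently-used cache mapping image URL -> phash (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get_or_compute(self, url, compute):
        if url in self._items:
            self._items.move_to_end(url)  # mark as most recently used
            return self._items[url]
        value = compute(url)  # cache miss: download and hash the image
        self._items[url] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return value


def batches(rows, size):
    """Yield lists of at most `size` rows without materializing `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def hash_ads(ads, compute_phash, cache, batch_size):
    """Stream (ad_id, image_urls) pairs; yield batches of (ad_id, phashes)."""
    hashed = (
        (ad_id, [cache.get_or_compute(u, compute_phash) for u in urls])
        for ad_id, urls in ads
    )
    yield from batches(hashed, batch_size)
```

Each yielded batch could then be written back in one UPDATE round trip, so memory use is bounded by the batch size and the cache capacity rather than by the size of the ads table.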

The code in the PR has a dependency on phash bindings that might not be available on your system. I've been using nix locally to manage the dependencies across languages. This (and a few other things) may prevent this branch from running on your machine. I wanted to get comments on the approach before doing the work of trying to pick versions that work with the package managers on the machines you use in dev. If you don't mind using nix for this, I find that a very clean way of sharing dependencies across machines. But of course if you don't like nix or your environment rules it out, I can get things working in your current standard way.

Future directions: it would be very natural to add one or two commands to the executable for image-search-by-similarity. That would be too expensive a query for users to run from scratch (full table scan for every request) - I have a couple ideas for how to store state across runs to make that search cheap.

Any comments/thoughts appreciated! :) Looking forward to getting to the next steps.

@imalsogreg
Author

a3a08c9 is a little bit of cleanup, and also manually wraps the whole action in a transaction. I discovered this is necessary after running the ETL against a database loaded with ads downloaded from the ProPublica data store. Under load, the reads and writes got a chance to overlap in time, which uncovered an error in the code before a3a08c9: my streaming SQL library was automatically creating a read-only transaction if it detected it was being run with no transaction. Manually wrapping the whole thing in my own transaction allows reads and writes to be interleaved.

Running 10000 ads took about 15 minutes with the default parallelism settings.

@jeremybmerrill
Contributor

this looks awesome and I intend to take a deeper look tomorrow, hopefully in the morning. thanks!!

@imalsogreg
Author

:) Great! No hurry!

@imalsogreg
Author

imalsogreg commented Jul 14, 2018

Hope I'm not jumping the gun too much, but I wanted to explore some search ideas. Implemented in ca9bfd6. There's also some code to generate a view of the search results, which isn't intended for public consumption, but just for debugging the search and giving a sense of how phash distances map to perceptual similarity. I'm not thrilled with the ability to find similar images, but probably I need to import more images before I can get a sense for whether search-by-meme will work when backed by phash.

Same caveats as before: needs cleanup and documentation, open to rewriting in a language already used in facebook-political-ads.

Example CLI call and output:

dist/build/hashes-cli/hashes-cli search --range-bounds '[10,15,20,25,30,35,40,50]' --n-examples 3 --out index.html --url https://pp-facebook-ads.s3.amazonaws.com/v/t45.1600-4/c0.0.476.249/p476x249/24293757_23842685712150427_3314323092913782784_n.png

("range-bounds" search type finds some images at various phash distances from the query target - really useful for seeing what phash considers a good match vs. a terrible match)
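For readers unfamiliar with phash distances, the sketch below shows one way a range-bounds style report could be built in Python. It assumes 64-bit hashes compared by Hamming distance (which is how pHash compares its DCT image hashes) and assumes each bound is the upper edge of a range. `hamming_distance` and `range_examples` are hypothetical names, and the actual semantics of `--range-bounds` in hashes-cli may differ.

```python
def hamming_distance(a, b):
    """Number of differing bits between two hashes given as ints."""
    return bin(a ^ b).count("1")


def range_examples(query, candidates, bounds, n_examples):
    """Group candidates by distance range and keep a few examples per range.

    `bounds` must be sorted ascending. A candidate at distance d lands in
    the first range with d <= bound; candidates beyond the last bound are
    dropped.
    """
    buckets = {bound: [] for bound in bounds}
    for url, phash in candidates:
        d = hamming_distance(query, phash)
        for bound in bounds:
            if d <= bound:
                if len(buckets[bound]) < n_examples:
                    buckets[bound].append((d, url))
                break  # a candidate belongs to exactly one range
    return buckets
```

Looking at a few examples per range side by side is what gives a feel for which distances still look like "the same picture" and which do not.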

[Screenshot: search_ranges_screengrab]

The top match doesn't say anything about phash usability unfortunately, because my search query was a url taken directly from the ads table, so the search found that exact image :). I only have about 10k rows in my database. Hopefully when I ingest the rest there will be more meaningful stuff.

Oh and I see the LICENSE file has a Greg Hale copyright. That's just autogenerated boilerplate. During cleanup of course ProPublica can have the copyright and pick the license.

@jeremybmerrill
Contributor

Hey @imalsogreg, this looks awesome. Seriously impressive! (Especially the parallelism and speed you mention.)

I don't know much about nix. We use Docker almost exclusively and however I end up deploying this, it'll be as a Docker image. So that means Haskell is probably fine for now, since it won't have to share dependencies or live on the same server as anything else. Maintainability-wise, something like Ruby or Python always makes everyone's lives easier, but if you find that you've scratched your itch and don't want to port it, Haskell can work in production.

I'm working on getting it installed now, locally, with homebrew. I'll keep some notes, but so far it's just been zeromq and I'm going to have to take a look at installing phash from source, it looks like. I'll keep you posted, but I love the work so far!

@imalsogreg
Author

@jeremybmerrill ooh I shudder thinking of trying to wrangle the dependencies in homebrew from scratch - but if you're happy to try, it may work! I can build this into a docker container, if you're interested in playing with it sooner than later and dependencies get frustrating.

Thanks for the kind words! I'll stick around as a contributor until this gets used and finally retired. So feel free to set priorities in that context - if the most useful thing for me to do is to port to ruby or python in order to help you and other contributors, that's totally fine. If other priorities come first, also fine :)

A hint for building phash - use the configure flags and post-configure steps here: https://github.com/NixOS/nixpkgs/blob/98c1ad879a34954944972ac7465343d325f7156b/pkgs/development/libraries/phash/default.nix#L16-L19
(./configure --enable-video-hash=no --enable-audio-hash=no, make, make install, cp path/to/CImg.h path/to/phash/include/). Video and audio hashing use some weird C stuff that I think may be outdated, and the library expects that CImg.h file to be manually copied into the build output. imagemagick also needs to be installed globally - the library shells out to it.

@jeremybmerrill
Contributor

jeremybmerrill commented Jul 16, 2018

Ah, thanks for those phash hints! I think I'm almost there getting this running in Docker. I gave up on doing it locally once I realized I'd have to build pHash from source; not going to try to do that twice :).

Right now, it looks like the Haskell app almost builds. It complains about not being able to find exec/Main. Do you know what this might be?

Configuring facebook-ad-image-hashes-0.1.0.0...
Warning: 'hs-source-dirs: exec' directory does not exist.
Preprocessing library for facebook-ad-image-hashes-0.1.0.0..
Building library for facebook-ad-image-hashes-0.1.0.0..
[1 of 5] Compiling Queries          ( src/Queries.hs, dist/dist-sandbox-626f19ae/build/Queries.o )
[2 of 5] Compiling Search           ( src/Search.hs, dist/dist-sandbox-626f19ae/build/Search.o )
[3 of 5] Compiling Report           ( src/Report.hs, dist/dist-sandbox-626f19ae/build/Report.o )
[4 of 5] Compiling CliOptions       ( src/CliOptions.hs, dist/dist-sandbox-626f19ae/build/CliOptions.o )
[5 of 5] Compiling RunCli           ( src/RunCli.hs, dist/dist-sandbox-626f19ae/build/RunCli.o )
Preprocessing executable 'hashes-cli' for facebook-ad-image-hashes-0.1.0.0..
cabal: can't find source for Main in exec

cabal: Leaving directory '.'

@imalsogreg
Author

Nice. I forgot to add exec/Main.hs to the git repo :) Just pushed the fix.

@jeremybmerrill
Contributor

Ha, I've done it a million times. Thanks! I hope to pick this up again tomorrow. I got it building correctly in a Dockerfile and hashes-cli populate-phashes --help works, so I know it's successfully executing your code.

@jeremybmerrill
Contributor

Oh hey, and it looks like it's calculating phashes. Seems to require me to run reset-phashes before populate-phashes... otherwise populate-phashes just ends right away, silently, having done nothing. (Maybe you had a default set on the column where mine were just nulls?)

@jeremybmerrill
Contributor

I get this error occasionally. Do you know what it is?

convert-im6.q16: no decode delegate for this image format `' @ error/constitute.c/ReadImage/504.
convert-im6.q16: no images defined `pnm:-' @ error/convert.c/ConvertImageCommand/3258.
sh: 1: gm: not found

[CImg] *** CImgIOException *** [instance(0,0,0,0,(nil),non-shared)] CImg<unsigned char>::load(): Failed to recognize format of file '/tmp/fbp-images/17858'.

@imalsogreg
Author

You got the dependencies sorted? Rad!

I haven't seen your error, but sh: gm: not found sounds like a missing system binary. I found another github issue mentioning this: tj/node-gify#9 - gm seems to come from the graphicsmagick package.

Quick disclaimer that I wouldn't trust the cli tool with prod database yet 😅

Yah, reset-phashes right now is a needed prereq for populate, since populate isn't smart about choosing the rows to recompute - it uses a very specific sentinel value that reset- sets. Another thing to fix up.

@jeremybmerrill
Contributor

Got it. Here's an easy question: Right now, I kick off all the Haskell stuff with cabal sandbox init && cabal update && cabal install, but this causes the final compiled binaries to show up somewhere like dist/dist-sandbox-626f19ae/build/hashes-cli/hashes-cli instead of the far prettier and far more predictable dist/build/hashes-cli/hashes-cli, which is what you have. Should I do something different (different than cabal sandbox init, I assume) to get that same path?

@imalsogreg
Author

@jeremybmerrill a couple options for the sandboxed build:

cabal install in your scenario should put the binary in ./.cabal-sandbox/bin/hashes-cli
cabal install --only-dep && cabal build should put it in ./dist/build/hashes-cli/hashes-cli I think - I'm surprised that your setup didn't put one there too actually.

@jeremybmerrill
Contributor

@imalsogreg ah, great! I see it there in ./.cabal-sandbox/bin/hashes-cli and now that i've run cabal install --only-dep && cabal build it puts it in ./dist/build/hashes-cli/hashes-cli. I've modified the Dockerfile.

Which, FYI, is this:

FROM haskell:8.2
RUN apt-get update
RUN apt-get install -y cimg-dev wget libczmq-dev graphicsmagick postgresql-client libpq-dev
RUN wget http://www.phash.org/releases/pHash-0.9.6.tar.gz && tar -xf pHash-0.9.6.tar.gz && cd pHash-0.9.6 && ./configure --disable-video-hash --disable-audio-hash LDFLAGS='-lpthread' --prefix=/usr/ && make && make install ; cd ..
ADD . "/hasher/"
WORKDIR "/hasher/"
RUN cabal sandbox init && cabal update && cabal install --only-dep && cabal build

@jeremybmerrill
Contributor

Looks like installing graphicsmagick fixes that gm issue. I think I can get rid of the imagemagick dependency by figuring out what in debian includes convert from graphicsmagick, rather than requiring imagemagick alongside it.

Incidentally, the populate-phashes command seems to error out when it can't download a file from S3. Would it be possible for it to give up in that instance and keep going? I suspect it was a network hiccup, but I wouldn't be shocked if there were some missing links in the database.

I'm trying to search now ./dist/build/hashes-cli/hashes-cli search --url https://pp-facebook-ads.s3.amazonaws.com/v/t45.1600-4/c0.0.476.249/p476x249/24293757_23842685712150427_3314323092913782784_n.png --out index.html --n-examples 3 --range-bounds '[10,15,20,25,30,35,40,50]' --threshold 0 --overwrite-cache.

Unfortunately, I can't get anything but a blank HTML page. (I mean, it has HTML content, but the search results are nothing but the exact image I passed in.) I wonder if the problem is the threshold value? What's a reasonable value to try for this?

Null is the new sentinel value for computing phashes during ETL
During search, only non-null-phash rows are considered
All considered rows must have images length == phash length
Number of considered rows must be greater than 0
@imalsogreg
Author

imalsogreg commented Jul 19, 2018

@jeremybmerrill I think probably the hard failure on image download could be responsible for both issues?

When the download fails, the whole transaction populating ads will fail, so no phashes will end up in the database for later use by the search.

I just pushed a commit that silences download errors, so they won't bother the data load.
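As an illustration of letting one bad download pass without aborting the whole load, here is a small Python sketch. It is not the commit's Haskell code: `hash_images`, `download` and `compute_phash` are hypothetical names, and recording `None` for a failed image is an assumption about how a failure might be represented.

```python
import sys


def hash_images(urls, download, compute_phash):
    """Return one entry per URL: the phash, or None if the download failed."""
    results = []
    for url in urls:
        try:
            data = download(url)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            # Log and move on instead of failing the surrounding transaction.
            print(f"skipping {url}: {err}", file=sys.stderr)
            results.append(None)
            continue
        results.append(compute_phash(data))
    return results
```

Keeping a placeholder per failed image means the result list stays the same length as the list of image URLs.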

I also changed the behavior of reset-phashes and populate-phashes - they now use null as the sentinel value for "needs to be recomputed".

search now ignores any row with phash IS NULL, and for all non-null rows, it asserts that the array length of phash is the same as the array length of images.

If the number of valid rows from the db for a search is zero, search now throws an error. Hopefully that will help diagnose the cause of the blank index.html you're getting.
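The three search-side rules described above (skip null phashes, require matching array lengths, fail loudly on zero rows) can be sketched in Python as follows. `valid_search_rows` and the `(ad_id, images, phash)` row shape are hypothetical, and the error messages are illustrative rather than what hashes-cli prints.

```python
def valid_search_rows(rows):
    """Filter (ad_id, images, phash) rows before building a search index."""
    # Rows whose phash is NULL have not been hashed yet, so skip them.
    considered = [row for row in rows if row[2] is not None]
    for ad_id, images, phash in considered:
        # One phash per image: the two arrays must line up.
        if len(images) != len(phash):
            raise ValueError(
                f"ad {ad_id}: {len(images)} images but {len(phash)} phashes"
            )
    if not considered:
        raise ValueError("no rows with a non-null phash to search")
    return considered
```

Raising on an empty result turns a silent blank report into an explicit signal that the populate step has not produced any hashes yet.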

Could you rerun the search with no --out index.html? That will just print the raw results, which could have more clues.

Sorry, bugs :(

A good threshold is 0, in theory. What 0 would do is make it so that the search tree considers every distinct image as a distinct point in the search tree - a problem when the tree would be too large, which is apparently the opposite of our situation. (Higher threshold would more aggressively group images, making the tree easier to fit in memory)
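To illustrate the effect of the threshold, here is a toy Python sketch that greedily groups hashes under a representative when they are within `threshold` bits of it. This is only an assumption about the general idea: the thread does not describe the actual search tree, and `group_hashes` is a hypothetical name.

```python
def group_hashes(hashes, threshold):
    """Greedily assign each hash to the first representative within threshold."""
    groups = []  # list of (representative, members)
    for h in hashes:
        for rep, members in groups:
            if bin(rep ^ h).count("1") <= threshold:
                members.append(h)  # close enough: reuse the existing point
                break
        else:
            groups.append((h, [h]))  # no representative nearby: new point
    return groups
```

With a threshold of 0 every distinct hash becomes its own point, and raising the threshold merges nearby hashes so there are fewer points to hold in memory.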

@jeremybmerrill
Contributor

Hey @imalsogreg, I think it was my bad with the search function, I think I had made some modification that messed it up.

Now I'm a little bit confused. Is the phash column supposed to be text or text[] in psql?

@imalsogreg
Author

@jeremybmerrill phash :: text[]. I read images out of images :: text[] and make one phash per image.
(ps - happy to use another means of communicating if you want faster feedback) imalsogreg at gmail.com if you want to exchange contact info. Github is fine too if you don't mind some of my responses being delayed.

Without the annotations, some possible combination of psql, haskell
postgres libraries, and libpq was resulting in a type error during
the sql query (`phash` treated as `text` rather than `text[]`)

2 participants