Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] Add a new match command line option (--rotations) #1672

Merged
merged 21 commits into from
Nov 9, 2024

Conversation

haianhng31
Copy link
Contributor

@haianhng31 haianhng31 commented Oct 30, 2024

Summary


Files changed:

  • Main changes are in file: match_cmd.py

Test Plan

$ tx match --rotations text -- "try text"
--rotations flag is only available for Photo content type

$ tx hash --rotations photo pdq/data/bridge-mods/aaa-orig.jpg
$ tx match --rotations photo pdq/data/bridge-mods/aaa-orig.jpg
pdq ORIGINAL 16 (Sample Signals) INVESTIGATION_SEED
$ tx match --rotations photo pdq/data/bridge-mods/aaa-rotated180.jpg
pdq ROTATE180 16 (Sample Signals) INVESTIGATION_SEED

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your test plan, showing the ouput of the match function will be critical. Here's what I want to see:

Using https://github.com/facebook/ThreatExchange/blob/main/pdq/data/bridge-mods/aaa-orig.jpg

# Showing hash of original
$ tx hash photo aaa-orig.jpg
pdq facefacefafce...

# Here we use imageMagick, but any editor will work
$ magick aaa-org.jpg -rotate 90 aaa-90.jpg

# index sample data, which includes aaa-orig
$ tx fetch 

# Now match against rotated
$ tx match photo aaa-90.jpg
pdq sample 0[90º] INGVESTIGATION_SEED

You can also consider adding --rotations to tx hash as well, which will be simpler.

python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
Comment on lines 229 to 232
# Add rotation information if possible
rotation_info = ""
if hasattr(r.similarity_info, "rotation"):
rotation_info = f" [{r.similarity_info.rotation.name}]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While clever, and valid python, this will make typechecking unhappy (try running mypy locally as described in https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/CONTRIBUTING.md).

Instead, change the return type of these functions from IndexMatch to a new object that contains an index match and the rotation info.


# Add rotation information if any matches were found
for match in matches:
match.similarity_info.rotation = rotation_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: I think this might be an illegal operation, and I'm unsure whether we should add rotation to the base concept of similarity.

We have a few options, but the simplest one might be change the return of this function from IndexMatch to instead return a new dataclass that might contains the rotation information.

I'll accept any solution that:

  1. Doesn't change the base interface (since I don't think we yet know how to handle rotations in general - you are pioneering our approach!)
  2. Can show which rotation was used in tx match.

results: t.Sequence[IndexMatchWithRotation] = [
IndexMatchWithRotation(match=match)
for match in results_from_hashes
]
Copy link
Contributor Author

@haianhng31 haianhng31 Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new dataclass IndexMatchWithRotation which will be the return type of _match_file.

Q: @Dcallies Should I also make it the return type of _match_hashes? Just so the return types of these 2 are similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the python typing pass you may need to be consistent between them (see what mypy says)

@haianhng31 haianhng31 changed the title Add a new command line option (--rotations) [py-tx] Add a new command line option (--rotations) Oct 30, 2024
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice progress! Starting to get close.

  1. Let's not print the rotation unless --rotations is provided - this is why your tests are failing that check the output
  2. Some notes inline about how to handle that.
  3. If you haven't done it yet, it will be easier to add --rotations to tx hash than it will be to match.

results: t.Sequence[IndexMatchWithRotation] = [
IndexMatchWithRotation(match=match)
for match in results_from_hashes
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the python typing pass you may need to be consistent between them (see what mypy says)

python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/signal_type/index.py Outdated Show resolved Hide resolved
@@ -82,7 +83,7 @@ def flip_minus1(cls, image_data: bytes) -> bytes:
return buffer.getvalue()

@classmethod
def all_simple_rotations(cls, image_data: bytes):
def all_simple_rotations(cls, image_data: bytes) -> t.Dict[RotationType, bytes]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

@dataclass
class IndexMatchWithRotation(t.Generic[T]):
match: IndexMatchUntyped[SignalSimilarityInfo, T]
rotation_type: RotationType = RotationType.ORIGINAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you move it into the match command, consider making this optional, then adding a __str__ command which handles the output of the match command. If rotation_type is None, then no rotation is printed. If the rotation is there, then it can format the string you put to string.

@Dcallies Dcallies changed the title [py-tx] Add a new command line option (--rotations) [py-tx] Add a new match command line option (--rotations) Nov 2, 2024
@Dcallies
Copy link
Contributor

Dcallies commented Nov 4, 2024

Use "Edit" to add your test plan into the summary!

edit

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer, but maybe let's break down the problem a little more.

It may help to break this down into two PRs - one for the hash command, and one for the match command.

Hash will be more straightforward!

In order to do that, you'll need to use some git tricks to "split" the stack, which can be daunting.

One way we could do that is the following (ref)

  1. Checkout the main branch with git checkout main
  2. Create a new branch for only the hash command git checkout -b hash_cmd_rotations
  3. Pull only the contents of the hash command to your new branch git checkout photo-helpers -- python-threatexchange/threatexchange/cli/hash_cmd.py
  4. Create a new commit on our new branch `git commit -am '[pytx] add --rotations to hash_cmd'
  5. Push this branch to start a new PR

python-threatexchange/threatexchange/cli/hash_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
from threatexchange.content_type.video import VideoContent


class TestHashCommand(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: We are using the py.test framework for this library, not unittest.

Check out: https://docs.pytest.org/en/stable/

and this reference test: https://github.com/haianhng31/ThreatExchange/blob/8d10599cda78afa49f763c08ca556db94f4ae332/python-threatexchange/threatexchange/cli/tests/match_cmd_test.py

We should probably just use the hash and match_cmd tests rather than creating new tests.

python-threatexchange/threatexchange/signal_type/index.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job splitting your PR! This is quite difficult to do any many people have a lot more trouble their first time!

python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved

for rotation_type, rotated_bytes in rotated_images.items():
# Create a temporary file to hold the image bytes
with tempfile.NamedTemporaryFile() as temp_file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This is going to write a lot of files over the course of an execution! We are going to call this method for every single match type.

Another approach you could do is refactor this so that the rotated images are inserted higher up in the stack, and then rather than taking rotations: bool, you could pass in the path of the rotation, and an optional enum representing which enum it can take.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of change s_type to be subclass of BytesHasher so that I can use bytes directly with hash_from_bytes. All the tests passed and mypy doesn't complain. What do you think of this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the last PR, I mentioned the downside - not all photo formats are knowable without the extension. There's likely a workaround, but let's stay with the current course and see if we can fix it in a followup instead.

@@ -196,18 +230,23 @@ def execute(self, settings: CLISettings) -> None:
for s_type, index in indices:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before processing this list of files, if rotations is true here, generate new files that you iterate through.

Here's one way to do that

def handle_rotations() -> Iterator[(Path, t.Optional[RotationType)]:
  for file in self.files:
    if not self.rotations:
      yield file, None
      continue
   for rot, dat in PhotoContent.blah():
      with NamedTemporary(...) as f:
         yield Path(f.name), rot

for path in handle_rotations(self.files):

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All that's missing now:

  1. We should update the unittest for match to also include a case for --rotations
  2. From your test plan it looks like it's not matching hashes that it should, since the rotated image should match the rotated hash.

Here's a e2e test case:

$ threatexchange hash --rotations photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true | grep FLIPX | cut -d ' ' -f 2- >> ~/file.txt
$ threatexchange config collab edit local_file --filename ~/file.txt 'file.txt' --create
$ threatexchange fetch
$ threatexchange match photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
# No match

$ threatexchange match --rotations photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
pdq - (file.txt) INVESTIGATION_SEED

Comment on lines 46 to 66
def test_photo_hash_with_rotations(self):
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg")

hash_cmd = ThreatExchangeCLIE2eHelper()
hash_cmd.COMMON_CALL_ARGS = ("hash",)
hash_cmd._state_dir = pathlib.Path()

hash = hash_cmd.cli_call("photo", str(test_file))
assert hash == "pdq fb4eed46cb8a6c78819ca06b756c541f7b07ef6d02c82fccd00f862166272cda\n"

# rotated_images = PhotoContent.all_simple_rotations(test_file.read_bytes())

# img = rotated_images[RotationType.ROTATE90] #try with 1 rotated image first

# with tempfile.NamedTemporaryFile() as tmp_file:
# img = rotated_images[RotationType.ROTATE90]
# tmp_file.write(img)
# self.assert_cli_output(
# ("--rotations", "photo", "--", tmp_file.name),
# "video_md5 - (Sample Signals) INVESTIGATION_SEED",
# )
Copy link
Contributor Author

@haianhng31 haianhng31 Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My current approach for this test:

  1. Hash an image
  2. Fetch
  3. Call match --rotations the rotated version of the original image

Q: @Dcallies

  • I got the hash pdq of the image (line 53) but how do i fetch indexes and call match command on this hash?
  • If there is a match found, what should the output look like? When I try it with the CLI it keeps output
    pdq 16 (Sample Signals) INVESTIGATION_SEED
    or if i'm using the config collab:
    pdq 0 (file.txt) INVESTIGATION_SEED

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a match, then the output is the matching collaboration and distance.

Consider this ouptput:

pdq 16 (Sample Signals) INVESTIGATION_SEED

signal_type distance collab name confidence
pdq 16 Sample Signals INVESTIGATION_SEED

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I did:

First I sanity checked that we expect the produced hashes to match

$ tx hash --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg 
ORIGINAL pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
ROTATE90 pdq 30a10efd71cc3d429013d48d0ffffc52e34e0e17ada952a9d29685211ea9e5af
ROTATE180 pdq adad5a64b5a142e75b62a09857da895ae63b847fc23794b766b319361bc93188
ROTATE270 pdq a5f0a457a48995e8c9065c275aaa5498b61ba4bdf8fcf80387c32f8b1bfc4f05
FLIPX pdq f8f80f31e0f417b20e37f5cd028f980fb36ed02a9662c1e233e64c634e9c64dd
FLIPY pdq 0dad2599b1a1bd1a5362576742da32a5e63b7380c2374b4866b366c91bc9ce77
FLIPPLUS1 pdq f0a5e102f1ccc0bd945308720fff038de34ef1e8ada9a956d2967ade5ea91a50
FLIPMINUS1 pdq a5f05aa8a4896a17c906a2d85aaaab07b61b5b42f8fc07fc87c3d0741bfcb0fa

# Check original
$ tx match photo -H f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
pdq 16 (Sample Signals) INVESTIGATION_SEED
# Note - this should be distance 0, but it's stopping at the first match, which is one of the variations

# Check with rotation
$ tx match --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg
pdq 16 (Sample Signals) INVESTIGATION_SEED

However, I can see an error here - why isn't the rotation printed? When I looked to answer this question I found a bug in your code, which I found by adding some helpful print statements. I believe you can find that bug too!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment! I found the bug. I shouldn't have checked s_type == PhotoContent since it's SignalType not ContentType

@haianhng31
Copy link
Contributor Author

@Dcallies
Because the hash command output the Rotation_Type at the front, the match command cannot function correctly. Where do you suggest me looking at to fix this? (I've been jumping around files and was confused where the matching implementation is)

@Dcallies
Copy link
Contributor

Dcallies commented Nov 6, 2024

Because the hash command output the Rotation_Type at the front, the match command cannot function correctly.

Hey @haianhng31, what do you mean? The hash command and the match command are different commands, and so one should not break the other.

My guess is that you mean that the output of the rotation type makes it not read correctly from a file. Take a closer look at the command I gave you before:

threatexchange hash --rotations photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true | grep FLIPX | cut -d ' ' -f 2- >> ~/file.txt

Let me try patching your code on my machine locally to see if I can see what you mean...

Comment on lines 46 to 66
def test_photo_hash_with_rotations(self):
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg")

hash_cmd = ThreatExchangeCLIE2eHelper()
hash_cmd.COMMON_CALL_ARGS = ("hash",)
hash_cmd._state_dir = pathlib.Path()

hash = hash_cmd.cli_call("photo", str(test_file))
assert hash == "pdq fb4eed46cb8a6c78819ca06b756c541f7b07ef6d02c82fccd00f862166272cda\n"

# rotated_images = PhotoContent.all_simple_rotations(test_file.read_bytes())

# img = rotated_images[RotationType.ROTATE90] #try with 1 rotated image first

# with tempfile.NamedTemporaryFile() as tmp_file:
# img = rotated_images[RotationType.ROTATE90]
# tmp_file.write(img)
# self.assert_cli_output(
# ("--rotations", "photo", "--", tmp_file.name),
# "video_md5 - (Sample Signals) INVESTIGATION_SEED",
# )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I did:

First I sanity checked that we expect the produced hashes to match

$ tx hash --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg 
ORIGINAL pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
ROTATE90 pdq 30a10efd71cc3d429013d48d0ffffc52e34e0e17ada952a9d29685211ea9e5af
ROTATE180 pdq adad5a64b5a142e75b62a09857da895ae63b847fc23794b766b319361bc93188
ROTATE270 pdq a5f0a457a48995e8c9065c275aaa5498b61ba4bdf8fcf80387c32f8b1bfc4f05
FLIPX pdq f8f80f31e0f417b20e37f5cd028f980fb36ed02a9662c1e233e64c634e9c64dd
FLIPY pdq 0dad2599b1a1bd1a5362576742da32a5e63b7380c2374b4866b366c91bc9ce77
FLIPPLUS1 pdq f0a5e102f1ccc0bd945308720fff038de34ef1e8ada9a956d2967ade5ea91a50
FLIPMINUS1 pdq a5f05aa8a4896a17c906a2d85aaaab07b61b5b42f8fc07fc87c3d0741bfcb0fa

# Check original
$ tx match photo -H f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
pdq 16 (Sample Signals) INVESTIGATION_SEED
# Note - this should be distance 0, but it's stopping at the first match, which is one of the variations

# Check with rotation
$ tx match --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg
pdq 16 (Sample Signals) INVESTIGATION_SEED

However, I can see an error here - why isn't the rotation printed? When I looked to answer this question I found a bug in your code, which I found by adding some helpful print statements. I believe you can find that bug too!

)

def test_photo_hash_with_rotations(self):
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Danger! This photo might not correspond to one of the sample signals. Check out the comments on https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/signal_type/pdq/signal.py#L90

Additionally, this makes the test not work if run from different directories! Here's how I fixed it:

Suggested change
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg")
test_file = pathlib.Path(
__file__ + "../../../../../../pdq/data/bridge-mods/aaa-orig.jpg"
).resolve()

@haianhng31 haianhng31 marked this pull request as ready for review November 6, 2024 21:31
@haianhng31
Copy link
Contributor Author

@Dcallies thank you for all the comments and suggestions, they are really helpful! I have completed the test cases for match cmd with rotations (updated in the PR's summary).
The match command now also run correctly with the rotation type printed.

@haianhng31
Copy link
Contributor Author

@Dcallies
When you have time can you help me review this PR? Also I saw you mentioned about adding -save option in the PR with Mackey, can you tell me more about that?

@Dcallies
Copy link
Contributor

Dcallies commented Nov 9, 2024

Sorry for the delay @haianhng31! You can mark the PR as ready for review by hitting the refresh button in the upper right here:
image

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one potentially blocking comment, but I think it's fine to land and merge as is.

This was a tough stack, nice job @haianhng31 !

Comment on lines +61 to +64
if rotation == RotationType.ROTATE270:
rotation = RotationType.ROTATE90
elif rotation == RotationType.ROTATE90:
rotation = RotationType.ROTATE270
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking q: What is the purpose of this block? It looks like you still end up with all the rotations, so does this just change the order?

If so, why?

If not, consider adding an explanation!


self.assert_cli_output(
("--rotations", "photo", tmp_file.name),
f"pdq {rotation.name} 16 (Sample Signals) INVESTIGATION_SEED",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the distance here will be stable (I'm actually surprised this works), but if it passes, I think this is okay for now.

@Dcallies Dcallies merged commit 21f4f41 into facebook:main Nov 9, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants