[HMA] Wiki backup (#1523)

facebook · Jan 25, 2024 · b6c2f15
1 parent 8b93875
commit b6c2f15

Showing 38 changed files with 1,277 additions and 0 deletions.
diff --git a/hasher-matcher-actioner/wiki/.gitignore b/hasher-matcher-actioner/wiki/.gitignore
@@ -0,0 +1 @@
+.DS_Store
diff --git a/hasher-matcher-actioner/wiki/Action-Rule-Evaluation.md b/hasher-matcher-actioner/wiki/Action-Rule-Evaluation.md
@@ -0,0 +1,48 @@
+The [Actioner](Glossary#actioner) component of HMA controls how external systems are notified when a [Match](Glossary#matcher) is created. This is primarily used to communicate with the [Platform](Glossary#terms-and-concepts-used-in-hma) that deployed HMA. This document describes how the Actioner makes decisions about who to notify and how it notifies them. 
+
+The decision layer of the Actioner is the [ActionRules](Glossary#actioner) framework. You can think of an ActionRule as an algorithm that takes a Match as input and outputs either a specific Action or no action. When a Match is created, it has various [Classifications](Glossary#matcher) (sometimes also called Labels) that describe the Match, such as where the [Content](Glossary#hasher) came from, which [Dataset](Glossary#matcher) it matched against, where that Dataset originated, and more. The ActionRules framework reads these Classifications and determines which Actions should be run.
+
+Actions, in turn, specify how to notify the external system (like your Platform). Currently, there is only one method of notifying an external system, [Webhooks](Webhooks-Reference). Read about how to set up Webhooks [here](https://github.com/facebook/ThreatExchange/wiki/Tutorial:-How-to-Notify-My-System-when-My-Content-Matches-a-Dataset).
+
+Let's look at an example where you have 2 Datasets, one for Cats and one for Dogs. Your Platform has 3 systems for enforcing your [Community Standards](Glossary#other-terminology-used-in-content-moderation): High Severity (HS), Low Severity (LS), and Record Keeping (RK). On your Platform, cat images are a high severity violation, so they must be sent to the HS system. Dog images, however, are a low severity violation and can be handled by the LS system. For both cat and dog matches, we want to notify our Record Keeping system, RK.
+
+For each system, you'll need to create an Action that communicates with that system. This is done via Webhooks, and you can find the details [here](Webhooks-Reference). We then need to set up a series of ActionRules for our logic. Specifically, we need 4 rules.
+
+Here you can see how our 4 ActionRules and 3 Actions would work to notify HighSeverity and RecordKeeping systems of a cat Match:
+![](https://github.com/facebook/ThreatExchange/blob/main/hasher-matcher-actioner/docs/images/ActionRule%20Cat%20Match%20Example.png)
+
+Here you can see how our 4 ActionRules and 3 Actions would work to notify LowSeverity and RecordKeeping systems of a dog Match:
+![](https://github.com/facebook/ThreatExchange/blob/main/hasher-matcher-actioner/docs/images/ActionRule%20Dog%20Match%20Example.png)
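The routing in this example can be sketched in a few lines of Python. This is a minimal illustration, not HMA's actual implementation: the `ActionRule` class, the label strings, the dog Dataset ID `67890`, and the assumed set of four rules (cat to HS, dog to LS, cat to RK, dog to RK) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRule:
    """Hypothetical stand-in for an ActionRule: label conditions plus an Action name."""
    name: str
    action: str
    must_have: frozenset = frozenset()      # labels that must be on the Match ("=")
    must_not_have: frozenset = frozenset()  # labels that must be absent ("≠")

    def applies_to(self, classifications: set) -> bool:
        # Every required label is present and no excluded label is present.
        return self.must_have <= classifications and not (self.must_not_have & classifications)


CAT = "DatasetID=12345"  # assumed label for the cat Dataset
DOG = "DatasetID=67890"  # assumed label for the dog Dataset

# Assumed set of four rules for the cat/dog example.
RULES = [
    ActionRule("Notify HS System If Cat", "Notify HighSeverity", frozenset({CAT})),
    ActionRule("Notify LS System If Dog", "Notify LowSeverity", frozenset({DOG})),
    ActionRule("Notify RK System If Cat", "Notify RecordKeeping", frozenset({CAT})),
    ActionRule("Notify RK System If Dog", "Notify RecordKeeping", frozenset({DOG})),
]


def actions_for(classifications: set) -> list:
    """Return the sorted, de-duplicated Action names triggered by a Match's labels."""
    return sorted({rule.action for rule in RULES if rule.applies_to(classifications)})
```

With these assumptions, `actions_for({CAT})` selects the HighSeverity and RecordKeeping Actions, and `actions_for({DOG})` selects LowSeverity and RecordKeeping.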
+
+## How to specify ActionRules
+ActionRules can be created and modified on the [ActionRules tab of the Settings page](The-Action-Rules-Page). [This tutorial](https://github.com/facebook/ThreatExchange/wiki/Tutorial:-How-to-Notify-My-System-when-My-Content-Matches-a-Dataset#step-3---create-actionrule-to-trigger-your-action-when-a-match-occurs) explains how to use the page to define ActionRules.
+
+Returning to our example, for an ActionRule like "Notify HS system if Cat" we'd specify it as follows:
+- **Name** : `Notify HS System If Cat`
+    - A unique name for the ActionRule
+- **Classifications** : `DatasetID = 12345`
+    - This field allows you to specify the logic of the ActionRule in terms of what must be present or not present on the Match in order to trigger the specified Action. If our cat [Dataset](Glossary#matcher) has id `12345`, we should specify that DatasetID must equal "12345" to guarantee that all matches to the cat Dataset trigger the specified Action. See below for a list of available Classifications.
+- **Action** : `Notify HighSeverity`
+    - This field specifies which Action should be called. The options here are auto-populated based on the Actions you have configured on the Actions tab of the Settings page. You can see how to create Actions [here](https://github.com/facebook/ThreatExchange/wiki/Tutorial:-How-to-Notify-My-System-when-My-Content-Matches-a-Dataset#step-2---create-an-action-to-send-a-webhook).
+
+## What Classifications are available and how to specify them
+There are several types of Classification options that you can add to an ActionRule to define its logic. Currently, we support the following types of Classifications:
+
+* DatasetID
+   * If used with `=`, the Action will only be run on a Match if the [MatchedSignal](Glossary#matcher) is in the given Dataset. You can find the ID for a Dataset on the [ThreatExchange Settings page](ThreatExchange-Setting-Page).
+   * If used with `≠`, the Action will not be run on a Match if the [MatchedSignal](Glossary#matcher) is in the given Dataset.
+
+* Dataset Source
+   * If used with `=`, the Action will only be run on a Match if the [MatchedSignal](Glossary#matcher) is in a Dataset that was from the given [Source](Glossary#fetcher)
+   * If used with `≠`, the Action will not be run on a Match if the [MatchedSignal](Glossary#matcher) is from that Source
+   * The following sources are available:
+       * `te` for ThreatExchange
+
+* MatchedSignal ID
+   * If used with `=`, the Action will only be run on a Match if the [MatchedSignal](Glossary#matcher) has the given ID. This ID will be different for different Sources. If the Source of the Dataset is ThreatExchange (`te`) the MatchedSignal ID is the [Indicator](Glossary#fetcher) ID. You can view the MatchedSignal ID for a Match between a piece of Content and a Signal on the [Content Details page](Content-Details). The MatchedSignal ID is at the bottom of the page under "Matches"
+   * If used with `≠`, the Action will not be run on a Match if the [MatchedSignal](Glossary#matcher) has the given ID.
+
+* MatchedSignal
+   * If used with `=`, the Action will only be run on a Match if the [MatchedSignal](Glossary#matcher) has the Classification. MatchedSignal objects can have one or more string Classifications associated with them, such as `true_positive`. These Classifications are provided by the Source of the Dataset. If the Source is ThreatExchange, the MatchedSignal Classifications will be [Tags](Glossary#fetcher) on the [Indicator](Glossary#fetcher).
+   * If used with `≠`, the Action will only be run on a Match if the [MatchedSignal](Glossary#matcher) does not have the Classification.
+
diff --git a/hasher-matcher-actioner/wiki/Content-Details.md b/hasher-matcher-actioner/wiki/Content-Details.md
@@ -0,0 +1,17 @@
+This page displays the details for the submitted piece of [Content](Glossary#hasher). 
+
+There are two ways to navigate to this details page.
+- After successfully submitting on the [Submit Content Page](Submit-Content), a link will be shown
+- From the [Matches page](The-Matches-Page), after selecting a [Match](Glossary#matcher), you can click "Open the details page for content"
+
+This page has four distinct sections:
+- Image/Content (blurred unless hovered over)
+- Content Details
+  - Such as timestamps, provided fields, generated hashes
+- [Action](Glossary#actioner) Events
+  - History of [Actions](The-Action-Page) taken by the system
+- [Signals](Glossary#hasher) Matched
+  - This page can be used to review submitted Content that resulted in a Match and for [reporting matches](Reporting-Opinions) as [FalsePositive](Glossary#writebacker) or [TruePositive](Glossary#writebacker) to [ThreatExchange](Glossary#fetcher).
+
+
+![](https://github.com/facebook/ThreatExchange/blob/master/hasher-matcher-actioner/docs/images/Content%20Details%20Page.png)
diff --git a/hasher-matcher-actioner/wiki/Content-Submissions-API.md b/hasher-matcher-actioner/wiki/Content-Submissions-API.md
@@ -0,0 +1,115 @@
+The Content Submissions API allows you to send [Content](Glossary#content) from your platform to be ingested into the HMA system.
+
+Since HMA can't see into your platform, it will use your own platform's content ID schema for deduplication and logging, and the resultant actions from content evaluation will include the content ID (as well as any `additional_fields` you've added at submission time). As long as you upload the same content with the same parameters, the API is idempotent (given no other state changes). Re-using the same id for different submission endpoints or parameters results in undefined behavior.
+
+See Also:
+The [Submit Content UI page](Submit-Content) that calls this API.
+
+## The Submission API has 4 endpoints
+
+### Common fields
+
+All submission endpoints have 4 common fields/parameters (two always required and two optional):
+
+- Parameters:
+  - `content_id` (string)
+    - your platform's id for this content
+  - `content_type` (string, one of):
+    - `photo`
+    - `video`
+  - `additional_fields` (optional list of strings)
+    - Added as metadata on the Content
+  - `force_resubmit` (optional boolean [default=False])
+    - Must be set to true for a submission to succeed when its `content_id` already exists in the system
+
+
+### 1) Submitting a URL to content
+
+- Endpoint:
+  - `/submit/url/`
+- Endpoint specific parameters:
+  - `content_url`
+    - URL to which a GET request is sent to fetch the content media
+- Response
+  - `Status: 200 OK` - The content was successfully ingested into HMA
+  - `Status: 400 Bad Request` - One of the parameters had an issue - see the returned message
+- Notes:
+  - Does not store a copy of the content. The hashing function requests the bytes and only records hash and metadata.
+
+### 2) Submitting bytes of content directly
+
+- Endpoint:
+  - `/submit/bytes/`
+- Endpoint specific parameters:
+  - `content_bytes`
+    - Base64-encoded bytes of an image, which are decoded and copied to S3
+- Response
+  - `Status: 200 OK` - The content was successfully ingested into HMA
+  - `Status: 400 Bad Request` - One of the parameters had an issue - see the returned message
+- Notes:
+  - Request size limits mean that images larger than 3.5MB are likely to fail.
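Preparing `content_bytes` on the client side might look like the sketch below. It assumes the expected encoding is standard Base64 and treats the 3.5MB figure as a limit on the raw image size. Both the helper name and the exact byte threshold are illustrative, not part of HMA.

```python
import base64

# Approximate limit taken from the note above; the exact cutoff is an assumption.
MAX_IMAGE_BYTES = int(3.5 * 1024 * 1024)


def encode_content_bytes(raw: bytes) -> str:
    """Base64-encode raw image bytes for the `content_bytes` parameter."""
    if len(raw) > MAX_IMAGE_BYTES:
        # Larger media is better submitted by URL or through a presigned upload.
        raise ValueError(f"image is {len(raw)} bytes, above the ~3.5MB request limit")
    return base64.b64encode(raw).decode("ascii")
```

Checking the size before encoding avoids building a request that is likely to be rejected. Base64 also inflates the payload by roughly a third, so the request body is larger than the image itself.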
+
+### 3) Submitting a hash of content
+
+- Endpoint:
+  - `/submit/hash/`
+- Endpoint specific parameters:
+  - `signal_value` (string)
+    - Hash of the content corresponding to `content_id`
+  - `signal_type` (string, one of):
+    - `pdq`
+    - `video_md5`
+  - `content_url` (optional)
+    - URL to the content media corresponding to the hash
+- Response
+  - `Status: 200 OK` - The content's hash was successfully ingested into HMA
+  - `Status: 400 Bad Request` - One of the parameters had an issue - see the returned message
+- Notes:
+  - This submission endpoint is a two-step process for the client. The initial request creates the content record and the post_url given in the response.
+  - The second request uses the given URL to upload the content media.  
+
+### 4) Submitting via returned put url
+
+- Endpoint:
+  - `/submit/put-url/`
+- Endpoint specific parameters:
+  - `file_type`
+    - Type of file the client wishes to upload directly to S3. Used to create and return the signed URL
+- Response
+  - `Status: 200 OK` - The content was successfully ingested into HMA
+    - response contains presigned_url allowing client to upload corresponding file to HMA's s3 storage
+  - `Status: 400 Bad Request` - One of the parameters had an issue - see the returned message
+- Notes:
+  - This submission endpoint is a two-step process for the client. The initial request creates the content record and the upload URL returned in the response (`presigned_url`).
+  - The second request uses the given URL to upload the content media.  
+
+## Examples
+
+In cases where the status is not 200, a payload with more context is returned.
+
+```json
+{
+  "message": "error"
+}
+```
+
+The payload of the response depends on the endpoint:
+
+`/submit/bytes/` & `/submit/url/` & `/submit/hash/`
+
+```json
+{
+  "content_id": "12345",
+  "submit_successful": "True"
+}
+```
+
+`/submit/put-url/`
+
+```json
+{
+  "content_id": "12345",
+  "file_type": "image/jpeg",
+  "presigned_url": "www.example.com"
+}
+```
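A client might interpret these payloads as in the following sketch. The helper names are hypothetical, and using `PUT` for the second-step upload is an assumption based on the endpoint name. The upload request is only constructed, not sent.

```python
import json
import urllib.request


def submission_succeeded(body: str) -> bool:
    """Interpret the payload returned by /submit/bytes/, /submit/url/ and /submit/hash/."""
    payload = json.loads(body)
    # In the example payload the flag is the string "True", not a JSON boolean,
    # so compare case-insensitively on its string form.
    return str(payload.get("submit_successful")).lower() == "true"


def build_upload_request(put_url_body: str, media: bytes) -> urllib.request.Request:
    """Build (but do not send) the second-step upload for a /submit/put-url/ response."""
    payload = json.loads(put_url_body)
    return urllib.request.Request(
        url=payload["presigned_url"],  # must be an absolute URL including its scheme
        data=media,
        headers={"Content-Type": payload["file_type"]},
        method="PUT",  # assumption, based on the endpoint name "put-url"
    )
```

An error payload such as `{"message": "error"}` has no `submit_successful` key, so `submission_succeeded` returns `False` for it.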
diff --git a/hasher-matcher-actioner/wiki/Creating-and-Managing-Banks.md b/hasher-matcher-actioner/wiki/Creating-and-Managing-Banks.md
@@ -0,0 +1,47 @@
+
+In this article, you'll learn:
+1. What are banks?
+2. How you can use banks to scan your platform for copies of media.
+
+---
+
+To scan your submissions for copies of photos or videos, you use the Banks feature in HMA. `Banks` are collections of `BankMembers`. Each BankMember is a photo or a video that you want to scan for.
+
+Once you upload a photo or a video, all future submissions to HMA are scanned for copies. Copies, if found, can be reported to your APIs.
+
+![](https://github.com/facebook/ThreatExchange/blob/main/hasher-matcher-actioner/docs/images/HMA-Banks-illustration.png?raw=true)
+
+## Creating a bank
+
+Head over to the HMA home page for a deployed instance. Click on 'Banks' in the left side bar. Now, click on the 'Add Bank' or 'Create Bank' button.
+
+In the form:
+1. Add a bank name. Bank names are required to be unique.
+2. Add a description. This helps others in your team understand what this bank does.
+3. Turn matching on or keep it off. If matching is kept off, matches against this bank's members will not be reported. You can always turn this on later.
+4. Optionally, add tags. These tags help the [[actions|The-Action-Rules-Page]] determine whether to report a match or not. We recommend using the bank's category, or the reason why the members are violating your policies, as tags. For example, 'puppies' is a great tag if images of puppies are not allowed on your site.
+
+## Adding Bank Members
+
+Once a bank is created, it shows up when you click 'Banks' on the HMA sidebar. You can click on the bank's tile to see its details. There are three tabs when you open a bank: Bank Details, Video Memberships and Photo Memberships.
+
+To add photos, click on Photo Memberships → 'Add Member', or to add videos, click on Video Memberships → 'Add Member'.
+
+## Rebuild index manually
+
+Within a minute of adding a member, its fingerprints (or hashes) are extracted. To start matching against these new fingerprints, HMA needs to rebuild the index. HMA automatically rebuilds the index every fifteen minutes. However, if it is important to you that the rebuild happens sooner (if you are dealing with a crisis, for example), rebuild the index manually by heading over to the 'Settings' page on the sidebar. Click on the 'Indexes' tab and then click on 'Rebuild Indexes'.
+
+Note: HMA will only match if the bank's 'Active' toggle is set to true. However, you don't need to rebuild indexes when changing the 'Active' toggle.
+
+## Configuring an action rule for banks
+
+See [[The-Action-Rules-Page]] for how to create action rules. For bank matches, the conditions look like this:
+
+| Condition Name | Condition Value  |
+|-----------------|------|
+| `Dataset Source` | `bnk`|
+| `Dataset ID` | `<bank_id>` ⬅️ Get this from the Bank's details page. |
+| `Matched Signal ID` | `<bank_member's id>` ⬅️ Get this by clicking on 'View Member' on any Video or Photo Member. |
+| `Matched Signal` | Tags attached to banks and bank members go here. Use one 'Matched Signal' row per tag. |
+
+Any action rule configured using the source, bank_id or bank_member_id can be used along with a specific action. You can define actions as shown in [[The-Action-Page]] and then use an action to notify when submissions from your site match bank members.
diff --git a/hasher-matcher-actioner/wiki/Dockerfile b/hasher-matcher-actioner/wiki/Dockerfile
@@ -0,0 +1,12 @@
+FROM ruby:2.7
+
+RUN apt-get -y update && apt-get -y install libicu-dev cmake git && rm -rf /var/lib/apt/lists/*
+
+RUN gem install github-linguist
+RUN gem install gollum
+RUN gem install org-ruby  # optional
+
+WORKDIR /var/wiki
+ENTRYPOINT ["gollum", "--port", "80"]
+EXPOSE 80
+
diff --git a/hasher-matcher-actioner/wiki/DynamoDB-Design.md b/hasher-matcher-actioner/wiki/DynamoDB-Design.md
@@ -0,0 +1,34 @@
+## Overview
+
+DynamoDB is the primary database for the HMA system. As with any NoSQL database, the schema design affects which queries can be run efficiently, or at all.
+
+# HashRecords
+
+Of all the record types, this is the most common one. So, the design for these items is especially crucial.
+
+## Primary Index
+* Partition Key: `c#{key}`. Here, `key` is the content key.
+* Sort Key: `s#{indicatorSource}#{descriptorId}`. For MVP, the indicator source can only be "te".
+
+## Global Secondary Index I
+* Partition Key: `s#{indicatorSource}#{descriptorId}`
+* Sort Key: `c#{key}`
+
+This is the reverse of the primary index.
+
+## Global Secondary Index II
+* Partition Key: `type#{hashingMethod}` 
+
+
+# Issues with using FilterExpressions (@schatten's expeditions; very WIP)
+
+It appears using `FilterExpressions` can be tremendously costly. I'm yet to figure out the impact of SKs being fully utilized. This is a WIP of findings and potential next steps.
+
+Since the PK is used by Dynamo to physically partition, and it is hashed, the quickest and **only** Dynamo style query using this key is the "get me all matches for this content" query.
+
+## Secondary Indexes
+Because our SK is "s#{indicatorSource}#{descriptorId}", equality on that can be used to narrow down the data rapidly.
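The key layout above can be sketched as plain string builders, plus a query that narrows results through key conditions rather than a `FilterExpression`, which DynamoDB applies only after the items have been read. The attribute names `PK` and `SK` and the helper names are assumptions for illustration. The returned dict follows the shape of keyword arguments accepted by boto3's `Table.query`, which is not called here.

```python
from typing import Optional


def content_pk(content_key: str) -> str:
    """Partition key of the primary index: c#{key}."""
    return f"c#{content_key}"


def signal_sk(indicator_source: str, descriptor_id: str) -> str:
    """Sort key of the primary index: s#{indicatorSource}#{descriptorId}."""
    return f"s#{indicator_source}#{descriptor_id}"


def matches_for_content_query(content_key: str,
                              indicator_source: Optional[str] = None) -> dict:
    """Build Query arguments for "all matches for this content" using keys only."""
    condition = "PK = :pk"  # assumed attribute name for the partition key
    values = {":pk": content_pk(content_key)}
    if indicator_source is not None:
        # Narrow by sort-key prefix so that only matching items are read.
        condition += " AND begins_with(SK, :sk_prefix)"
        values[":sk_prefix"] = f"s#{indicator_source}#"
    return {"KeyConditionExpression": condition, "ExpressionAttributeValues": values}
```

For example, `matches_for_content_query("abc", "te")` restricts the read to ThreatExchange-sourced records for content `abc`.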
+
+**References**
+1: [https://www.alexdebrie.com/posts/dynamodb-filter-expressions/](https://www.alexdebrie.com/posts/dynamodb-filter-expressions/)
+
diff --git a/hasher-matcher-actioner/wiki/Fetching-data-from-ThreatExchange.md b/hasher-matcher-actioner/wiki/Fetching-data-from-ThreatExchange.md
@@ -0,0 +1,17 @@
+HMA uses the [Fetcher](Glossary#fetcher) to read [Signals](Glossary#hasher) and [Opinions](Glossary#writebacker) from [ThreatExchange](https://developers.facebook.com/docs/threat-exchange/getting-started) to build [Datasets](Glossary#matcher) and keep them in sync. The [Fetcher](Glossary#fetcher) (an AWS Lambda function) runs every 15 minutes by default, fetching data through the [ThreatExchange API](https://developers.facebook.com/docs/threat-exchange/reference/apis/threat-privacy-groups/v11.0) based on which [PrivacyGroups](Glossary#fetcher) you have access to. For each PrivacyGroup, the Fetcher keeps an HMA Dataset in sync. You can find the Python code [here](https://github.com/facebook/ThreatExchange/blob/master/hasher-matcher-actioner/hmalib/lambdas/fetcher.py).
+
+To fetch Datasets from ThreatExchange, you need to ensure HMA has your ThreatExchange credentials. If you haven't already, you can [follow these steps](Installation#connect-to-threatexchange) to connect HMA to ThreatExchange.
+
+# Change the frequency of the Fetcher
+A Terraform variable configures how often data is fetched from [ThreatExchange](https://developers.facebook.com/docs/threat-exchange/getting-started). To change it, go to ```/ThreatExchange/hasher-matcher-actioner/terraform/terraform.tfvars```, change the value of ```fetch_frequency```, then rebuild and deploy the image. For example, to change it to 10 minutes, follow these steps:
+* Update ```fetch_frequency = "10 minutes"``` in **terraform.tfvars**
+
+* Go to the terraform folder and run ```terraform apply```
+# Start/Stop Fetching data from ThreatExchange
+You can start/stop fetching data from ThreatExchange for specific datasets.   
+* Select _**Settings**_ at the bottom left, then go to the ThreatExchange tab
+
+![](https://github.com/facebook/ThreatExchange/blob/31d8c61a3f5c8f746db772157bf13f311bf1969c/hasher-matcher-actioner/docs/images/ThreatExchange%20tab.png)
+* Toggle off _**Fetcher Active**_ for the specific datasets
+
+![](https://github.com/facebook/ThreatExchange/blob/31d8c61a3f5c8f746db772157bf13f311bf1969c/hasher-matcher-actioner/docs/images/Fetcher%20Active.png)