Merge branch 'master' into bug_fix

alibaba · Sep 20, 2023 · 28b2170 · 28b2170
2 parents 5bbead1 + 1af9b36
commit 28b2170
Showing 9 changed files with 637 additions and 22 deletions.
diff --git a/docs/source/feature/fg_docs/ComboFeature.md b/docs/source/feature/fg_docs/ComboFeature.md
@@ -0,0 +1,33 @@
+# combo_feature
+
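+## Overview
+
+combo_feature is the combination (i.e. the Cartesian product) of several fields (or expressions). id_feature can be viewed as a special case of combo_feature in which only one field takes part in the cross. Typically the crossed fields come from different tables (for example, crossing a user feature with an item feature).
+
+## Configuration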
+
+```
+{
+   "feature_type" : "combo_feature",
+   "feature_name" : "comb_u_age_item",
+   "expression" : ["user:age_class", "item:item_id"]
+}
+```
+
+## Examples
+
+^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters.
+
+| Value of user:age_class | Value of item:item_id | Output feature |
+| --- | --- | --- |
+| 123 | 45678 | comb_u_age_item_123_45678 |
+| abc, bcd | 45678 | comb_u_age_item_abc_45678, comb_u_age_item_bcd_45678 |
+| abc, bcd | 12345^\]45678 | comb_u_age_item_abc_12345, comb_u_age_item_abc_45678, comb_u_age_item_bcd_12345, comb_u_age_item_bcd_45678 |
+
+The number of output features equals
+
+```
+|F1| * |F2| * ... * |Fn|
+```
+
+where |Fi| denotes the number of values of the i-th dependent field.
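Reviewer note on ComboFeature.md: the Cartesian-product behavior described above can be illustrated with a minimal Python sketch. The function name `combo_feature` and its list-of-lists input are hypothetical and chosen only for illustration; the real fg implementation is not shown in this commit.

```python
from itertools import product


def combo_feature(feature_name, field_values):
    """Cross every value of every field and prefix each combination.

    field_values holds one list of string values per dependent field
    (hypothetical input shape; the real fg input format may differ).
    """
    return ["_".join((feature_name, *combo)) for combo in product(*field_values)]


# Third row of the example table: two user values crossed with two item values.
features = combo_feature("comb_u_age_item", [["abc", "bcd"], ["12345", "45678"]])
print(features)
```

The output length is the product of the per-field value counts, which matches the `|F1| * |F2| * ... * |Fn|` formula in the doc.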
diff --git a/docs/source/feature/fg_docs/IdFeature.md b/docs/source/feature/fg_docs/IdFeature.md
@@ -0,0 +1,32 @@
+# id_feature
+
+## Overview
+
+id_feature is a sparse feature and the simplest kind of discrete feature: it simply concatenates the value of a field with the user-configured feature name.
+
+## Configuration
+
+```json
+{
+  "feature_type" : "id_feature",
+  "feature_name" : "item_is_main",
+  "expression" : "item:is_main"
+}
+```
+
+| Field | Meaning |
+| --- | --- |
+| feature_name | Required. feature_name is used as the prefix of the final output feature. |
+| expression | Required. expression describes the source field this feature depends on. |
+| need_prefix | Optional. true means feature_name is prepended as a prefix, false means it is not. Defaults to true; false is typically used in shared_embedding scenarios. |
+| invalid_values | Optional. Values in this list are output as null. A list of strings; for example, \[""\] turns every empty string into null. |
+
+Examples (^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters.)
+
+| Type | Value of item:is_main | Output feature |
+| --- | --- | --- |
+| int64_t | 100 | (item_is_main_100, 1) |
+| double | 5.2 | (item_is_main_5, 1) (the fractional part is truncated) |
+| string | abc | (item_is_main_abc, 1) |
+| multi-value string | abc^\]bcd | (item_is_main_abc, 1),(item_is_main_bcd, 1) |
+| multi-value int | 123^\]456 | (item_is_main_123, 1),(item_is_main_456, 1) |
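Reviewer note on IdFeature.md: a minimal Python sketch of the behavior in the table above. The function `id_feature` and its signature are hypothetical. Dropping invalid values (rather than emitting an explicit null) is an assumption made to keep the sketch short.

```python
SEP = "\x1d"  # the multi-value separator, a single character (ASCII 0x1D)


def id_feature(feature_name, raw, need_prefix=True, invalid_values=()):
    """Return (feature, 1) pairs for a raw field value (hypothetical helper)."""
    if isinstance(raw, float):
        raw = int(raw)  # the doc states the fractional part is truncated
    pairs = []
    for value in str(raw).split(SEP):
        if value in invalid_values:
            continue  # assumption: an invalid value produces no output here
        name = f"{feature_name}_{value}" if need_prefix else value
        pairs.append((name, 1))
    return pairs


print(id_feature("item_is_main", 5.2))
print(id_feature("item_is_main", "abc\x1dbcd"))
```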
diff --git a/docs/source/feature/fg_docs/LookupFeature.md b/docs/source/feature/fg_docs/LookupFeature.md
@@ -0,0 +1,112 @@
+# lookup_feature
+
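+## Overview
+
+If offline generation does not produce the expected output, first switch to the latest offline fg package.
+
+lookup_feature is similar to match_feature: it matches the needed result from a set of kv pairs.
+
+lookup_feature depends on two fields, map and key. map is a multi-value string (MultiString) field in which each string looks like "k1:v2"; key can be a field of any type. To generate the feature, the value of key is first taken and converted to a string, then matched against the kv pairs held by the map field to obtain the final feature.
+
+map and key may come from any combination of item, user, and context. For online input, multiple item values are separated by the multi-value separator char(29), while multi-value user and context fields are passed as lists when accessed through tpp. This feature only supports JSON configuration.
+
+## Configuration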
+
+```json
+{
+    "features" : [
+        {
+            "feature_type" : "lookup_feature",
+            "feature_name" : "item_match_item",
+            "map" : "item:item_attr",
+            "key" : "item:item_value",
+            "needDiscrete" : true
+        }
+    ]
+}
+```
+
+For the configuration above, suppose that for some doc:
+
+```
+item_attr : "k1:v1^]k2:v2^]k3:v3"
+```
+
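+^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters. It is entered with C-q C-5 in emacs and with C-v C-5 in vi. Here item_attr is a multi-value string. Keep in mind that when map represents several kv pairs it is a multi-value string, not a plain string!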
+
+```
+item_value : "k2"
+```
+
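+The resulting feature is item_match_item_k2_v2. Because needDiscrete is true, the result is the discretized form.
+
+## Notes
+
+match_feature and lookup_feature are both matching-type features: they match a result out of kv pairs. The difference is that the matched field (user) of match_feature must be a field passed in through qinfo, i.e. its value is the same for all docs within one query, whereas the key and map of lookup_feature have no restriction on their source.
+
+## Configuration details
+
+The default configuration is `needDiscrete = true, needWeighting = false, needKey = true, combiner = "sum"`.
+
+### Default output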
+
+### needWeighting == true
+
+```
+feature_name:fg
+map:{{"k1:123", "k2:234", "k3:3"}}
+key:{"k1"}
+Result: feature={"fg_k1", 123}
+```
+
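+In this case the string part is used to look up the weight table, and the weight is then multiplied by the corresponding feature value, for use in LR models.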
+
+### needDiscrete == true
+
+```
+feature_name:fg
+map:{{"k1:123", "k2:234", "k3:3"}}
+key:{"k1"}
+Result: feature={"fg_123"}
+```
+
+### needDiscrete == false
+
+```
+map:{{"k1:123", "k2:234", "k3:3"}}
+key:{"k1"}
+Result: feature={123}
+```
+
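+When there are multiple keys, a combiner can be configured to combine the looked-up values. Possible settings are `sum, mean, max, min`. Note: to use a combiner, needDiscrete must be set to false; only dense features can be combined, and the generated value is numeric.
+
+A sample configuration (updated on 2021.04.15):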
+
+```json
+"kv_fields_encode": [
+    {
+      "name": "cnty_dense_features",
+      "dimension": 99,
+      "min_hash_type": 0,
+      "use_sparse": true
+    },
+    {
+      "name": "cross_a_tag",
+      "dimension": 12,
+      "min_hash_type": 0,
+      "use_sparse": true
+    },
+    {
+      "name": "cross_gender",
+      "dimension": 12,
+      "min_hash_type": 0,
+      "use_sparse": true
+    },
+    {
+      "name": "cross_purchasing_power",
+      "dimension": 12,
+      "min_hash_type": 0,
+      "use_sparse": true
+    }
+  ]
+```
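Reviewer note on LookupFeature.md: the lookup procedure (convert key to string, match it against the kv pairs in map, then discretize or combine) can be sketched in Python as below. The function `lookup_feature` is hypothetical. How `needKey` interacts with `needDiscrete` is an assumption inferred from the two doc examples (`item_match_item_k2_v2` and `fg_123`), not confirmed behavior, and `needWeighting` is left out.

```python
SEP = "\x1d"  # multi-value separator, char(29)

COMBINERS = {
    "sum": sum,
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
    "min": min,
}


def lookup_feature(feature_name, map_field, keys,
                   need_discrete=True, need_key=True, combiner="sum"):
    """Match keys against a multi-value "k:v" string (hypothetical helper)."""
    kv = dict(item.split(":", 1) for item in map_field.split(SEP) if ":" in item)
    hits = [(str(k), kv[str(k)]) for k in keys if str(k) in kv]
    if need_discrete:
        # Assumption: need_key controls whether the key appears in the name.
        return ["_".join([feature_name, k, v] if need_key else [feature_name, v])
                for k, v in hits]
    if not hits:
        return None
    # Dense output: combine the numeric values of all matched keys.
    return COMBINERS[combiner]([float(v) for _, v in hits])


print(lookup_feature("item_match_item", "k1:v1\x1dk2:v2\x1dk3:v3", ["k2"]))
print(lookup_feature("fg", "k1:123\x1dk2:234\x1dk3:3", ["k1", "k3"],
                     need_discrete=False, combiner="mean"))
```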
diff --git a/docs/source/feature/fg_docs/MatchFeature.md b/docs/source/feature/fg_docs/MatchFeature.md
@@ -0,0 +1,100 @@
+# match_feature
+
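+## Overview
+
+match_feature is generally used to express a matching relationship between features and uses the values of three fields: user, item, and category.
+match_feature supports two types, hit and multi hit.
+match_feature is essentially a match over a two-level map. The user field describes the two-level map as a string: `|` separates the entries of the first-level map, `^` separates a first-level key from its value, `,` separates the entries of the second-level map, and `:` separates a second-level key from its value. For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 converts to the following two-level map: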
+
+```json
+{
+	"50011740" : {
+		"50011740" : 0.2,
+		"36806676" : 0.3,
+		"122572685" : 0.5
+	},
+	"50006842" : {
+		"16788" : 0.1
+	}
+}
+```
+
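+With hit matching, the value of category is looked up in the first-level map, then the value of item is looked up in the second-level map, yielding a single result. If only one level of matching is needed, put ALL as the first-level key of the map and also set category to "ALL" in the fg configuration. See the first example for details.
+
+## Configuration
+
+JSON configuration file: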
+
+```json
+{
+    "feature_name": "user__l1_ctr_1",
+    "feature_type": "match_feature",
+    "category": "ALL",
+    "needDiscrete": false,
+    "item": "item:category_level1",
+    "user": "user:l1_ctr_1",
+    "matchType": "hit"
+}
+```
+
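+needDiscrete: when true, the model uses the feature name output by match_feature and ignores the feature value. The default is true.
+needDiscrete: when false, the model uses the feature value output by match_feature and ignores the feature name.
+
+matchType:
+hit: outputs the matched feature
+
+XML configuration file: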
+
+```xml
+<features name="matched_features">
+    <feature name="brand_hit" dependencies="user:user_brand_tags_hit1,item:brand_id" category="item:auction_root_category" type="hit"/>
+    <feature name="brand_matched_hit" dependencies="user:user_brand_tags_cos1,item:brand_id" category="ALL" type="hit"/>
+</features>
+```
+
+dependencies: the two features to be matched
+
+category: the category feature field. category="ALL" means matching is not done per category.
+
+## Normalizer
+
+match_feature supports the same normalizers as raw_feature; see [raw_feature](./RawFeature.md) for details.
+
+## Configuration details
+
+### hit
+
+For the configuration below
+
+```json
+{
+    "feature_name": "brand_hit",
+    "feature_type": "match_feature",
+    "category": "item:auction_root_category",
+    "needDiscrete": true,
+    "item": "item:brand_id",
+    "user": "user:user_brand_tags_hit",
+    "matchType": "hit"
+}
+```
+
+Suppose the fields have the following values:
+
+| user_brand_tags_hit   | `50011740^107287172:0.2,36806676:0.3,122572685:0.5\|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19` |
+| --------------------- | ------------------------------------------------------------------------------------------------------- |
+| brand_id              | 30068                                                                                                   |
+| auction_root_category | 50006842                                                                                                |
+
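+If needDiscrete=true, the result is \<brand_hit_50006842_30068_19, 1.0>
+If needDiscrete=false, the result is \<brand_hit, 19.0>
+
+If only one level of matching is needed, change the value of category in the configuration above to ALL. In that case, lookup_feature is also worth considering. Suppose the fields have the following values: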
+
+| user_brand_tags_hit | ALL^16788816:40,10122:40,29889:20,30068:20 |
+| ------------------- | ------------------------------------------ |
+| brand_id            | 30068                                      |
+
+If needDiscrete=true, the result is \<brand_hit_ALL_30068_20, 1.0>. If needDiscrete=false, the result is \<brand_hit, 20.0>.
+
+### multihit
+
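+multihit allows the category and item values to be ALL (note: the values passed in, not the configured values) for wildcard matching, which can match multiple values. The output is similar to hit.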
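Reviewer note on MatchFeature.md: the two-level map parsing and the hit match described above can be sketched in Python as follows. The helpers `parse_two_level_map` and `hit_match` are hypothetical names. The sketch covers only `hit`, not `multihit` or normalizers, and returning `None` on a miss is an assumption.

```python
def parse_two_level_map(user_value):
    """Parse "cat^k:v,k:v|cat^k:v" into {cat: {k: v}} with string values."""
    result = {}
    for group in user_value.split("|"):
        category, _, items = group.partition("^")
        result[category] = dict(
            item.split(":", 1) for item in items.split(",") if ":" in item
        )
    return result


def hit_match(feature_name, user_value, category, item, need_discrete=True):
    """Look up category in the first level, then item in the second level."""
    value = parse_two_level_map(user_value).get(category, {}).get(item)
    if value is None:
        return None  # assumption: no feature is emitted on a miss
    if need_discrete:
        return (f"{feature_name}_{category}_{item}_{value}", 1.0)
    return (feature_name, float(value))


user = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
        "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
print(hit_match("brand_hit", user, "50006842", "30068"))
print(hit_match("brand_hit", user, "50006842", "30068", need_discrete=False))
```

The same function handles the single-level case when both the map key and the `category` argument are the literal string `ALL`.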
diff --git a/docs/source/feature/fg_docs/OverLapFeature.md b/docs/source/feature/fg_docs/OverLapFeature.md
@@ -0,0 +1,56 @@
+# overlap_feature
+
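+## Overview
+
+A feature that outputs word-level matching information between strings.
+
+For offline use, version 1.3.56-SNAPSHOT is recommended. Note: when writing the fg config, mind the dimensions: the dimension of title must be greater than or equal to that of query. (Put simply, if title is a user feature, query must also be a user feature; the batch size of a user feature is 1, and the batch size of an item feature is the number of items.)
+
+| Method | Description | Notes |
+| --- | --- | --- |
+| common_word | Computes the terms shared by query and title and outputs fg_common1_common2 | The shared count does not exceed the number of query terms |
+| diff_word | Computes the query terms not shared with title and outputs fg_diff1_diff2 | The non-shared count does not exceed the number of query terms |
+| query_common_ratio | Ratio of shared terms to the number of query terms, multiplied by 10 and floored | Value in \[0,10\] |
+| title_common_ratio | Ratio of shared terms to the number of title terms, multiplied by 100 and floored | Value in \[0,100\] |
+| is_contain | Whether query is entirely contained in title, preserving order | 0 means not contained, 1 means contained |
+| is_equal | Whether query is exactly equal to title | 0 means not exactly equal, 1 means exactly equal |
+| common_word_divided | Computes the terms shared by query and title and outputs fg_common1, fg_common2 | The shared count does not exceed the number of query terms |
+| diff_word_divided | Computes the query terms not shared with title and outputs fg_diff1, fg_diff2 | The non-shared count does not exceed the number of query terms |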
+
+## 配置方法
+
+```json
+{
+    "feature_type" : "overlap_feature",
+    "feature_name" : "is_contain",
+    "query" : "user:attr1",
+    "title" : "item:attr2",
+    "method" : "is_contain",
+    "separator" : " "
+}
+```
+
+| Field | Meaning |
+| --- | --- |
+| feature_type | Required. Describes the type of this feature. |
+| feature_name | Required. feature_name is used as the prefix of the final output feature. |
+| query | Required. The table query depends on; attr1 is a multi-value string whose separator is chr(29). |
+| title | Required. The table title depends on; attr2 is a multi-value string. |
+| method | Accepts common_word, diff_word, query_common_ratio, title_common_ratio, is_contain, etc.; see the method table above. |
+| separator | The separator character used in the output. Defaults to \_ when omitted; users may also customize it, see the examples. |
+
+## Examples
+
+query is high,high2,fiberglass,abc
+title is high,quality,fiberglass,tube,for,golf,bag
+
+| method              | separator | feature                    |
+| ------------------- | --------- | -------------------------- |
+| common_word         |           | name_high_fiberglass       |
+| diff_word           | " "       | name high2 abc             |
+| query_common_ratio  |           | name_5                     |
+| title_common_ratio  |           | name_28                    |
+| is_contain          |           | name_0                     |
+| is_equal            |           | name_0                     |
+| common_word_divided |           | name_high, name_fiberglass |
+| diff_word_divided   |           | name_high2, name_abc       |
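Reviewer note on OverLapFeature.md: a minimal Python sketch that reproduces the example table for the non-divided methods. The function `overlap_feature` is hypothetical and takes already-split term lists. Treating `is_contain` as an ordered-subsequence check is an assumption, since the doc only says order is preserved.

```python
def overlap_feature(name, query, title, method, separator="_"):
    """Compute one overlap feature from term lists (hypothetical helper).

    Assumes query and title are non-empty lists of terms.
    """
    title_set = set(title)
    common = [t for t in query if t in title_set]
    diff = [t for t in query if t not in title_set]
    if method == "common_word":
        parts = common
    elif method == "diff_word":
        parts = diff
    elif method == "query_common_ratio":
        parts = [str(len(common) * 10 // len(query))]   # floor(ratio * 10)
    elif method == "title_common_ratio":
        parts = [str(len(common) * 100 // len(title))]  # floor(ratio * 100)
    elif method == "is_contain":
        remaining = iter(title)
        # Assumption: every query term must appear in title in the same order.
        parts = [str(int(all(t in remaining for t in query)))]
    elif method == "is_equal":
        parts = [str(int(query == title))]
    else:
        raise ValueError(f"unsupported method: {method}")
    return separator.join([name, *parts])


query = ["high", "high2", "fiberglass", "abc"]
title = ["high", "quality", "fiberglass", "tube", "for", "golf", "bag"]
print(overlap_feature("name", query, title, "common_word"))
print(overlap_feature("name", query, title, "diff_word", separator=" "))
```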
diff --git a/docs/source/feature/fg_docs/RawFeature.md b/docs/source/feature/fg_docs/RawFeature.md
@@ -0,0 +1,58 @@
+# raw_feature
+
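+## Overview
+
+raw_feature is a dense feature that uses the value of the original field directly as the feature value. raw_feature supports only numeric types such as int, float, and double; for non-numeric features use id_feature.
+
+## Configuration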
+
+```json
+{
+ "feature_type" : "raw_feature",
+ "feature_name" : "ctr",
+ "expression" : "item:ctr",
+ "normalizer" : "method=log10"
+}
+```
+
+| Field | Meaning |
+| --- | --- |
+| feature_name | Required. In normal use this field has no effect, because it is mainly the feature value that takes part in downstream computation; when debugging, however, it lets you see the value for the corresponding feature name. |
+| expression | Required. expression describes the source field this feature depends on. |
+| value_dimension | Optional. Defaults to 1; the dimension of the output field. |
+| normalizer | Optional. The normalization method; see below. |
+
+## Examples
+
+^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters.
+
+| Type | Value of item:ctr | Output feature |
+| --- | --- | --- |
+| int64_t | 100 | (ctr, 100) |
+| double | 100.1 | (ctr, 100.1) |
+| multi-value int | 123^\]456 | (ctr, (123,456)) (note: the input field must match the configured dimension) |
+
+## Normalizer
+
+raw_feature and match_feature support three normalizers: `minmax, zscore, log10`. Their configuration and formulas are as follows:
+
+### log10
+
+```
+Config example: method=log10,threshold=1e-10,default=-10
+Formula: x = x > threshold ? log10(x) : default;
+```
+
+### zscore
+
+```
+Config example: method=zscore,mean=0.0,standard_deviation=10.0
+Formula: x = (x - mean) / standard_deviation
+```
+
+### minmax
+
+```
+Config example: method=minmax,min=2.1,max=2.2
+Formula: x = (x - min) / (max - min)
+```