Showing 9 changed files with 637 additions and 22 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# combo_feature | ||
|
||
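## Overview | ||
|
||
A combo_feature is the combination (that is, the Cartesian product) of several fields or expressions. An id_feature can be seen as a special case of combo_feature in which only one field takes part in the cross. Typically, the crossed fields come from different tables, for example a user feature crossed with an item feature. | ||
|
||
## Configuration | ||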
|
||
``` | ||
{ | ||
"feature_type" : "combo_feature", | ||
"feature_name" : "comb_u_age_item", | ||
"expression" : ["user:age_class", "item:item_id"] | ||
} | ||
``` | ||
|
||
## Example | ||
|
||
^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters. | ||
|
||
| Value of user:age_class | Value of item:item_id | Output features | | ||
| ----------------- | --------------- | ---------------------------------------------------------------------------------------------------------- | | ||
| 123 | 45678 | comb_u_age_item_123_45678 | | ||
| abc, bcd | 45678 | comb_u_age_item_abc_45678, comb_u_age_item_bcd_45678 | | ||
| abc, bcd | 12345^\]45678 | comb_u_age_item_abc_12345, comb_u_age_item_abc_45678, comb_u_age_item_bcd_12345, comb_u_age_item_bcd_45678 | | ||
|
||
The number of output features equals: | ||
|
||
``` | ||
|F1| * |F2| * ... * |Fn| | ||
``` | ||
|
||
where |Fn| is the number of values in the n-th field that the feature depends on. |
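
The Cartesian product described above can be sketched in a few lines of Python. This is only an illustration, not the fg implementation: the function name `combo_feature` and its argument layout are hypothetical, and it assumes every input field arrives as a string whose multiple values are joined by the `\x1D` separator.

```python
from itertools import product

SEP = "\x1d"  # multi-value separator (a single character, ASCII 0x1D)


def combo_feature(feature_name, field_values):
    """Cross the values of several fields (hypothetical helper).

    field_values holds one raw string per entry of "expression";
    multiple values inside a string are separated by SEP.
    """
    value_lists = [raw.split(SEP) for raw in field_values]
    # One output feature per element of the Cartesian product.
    return ["_".join([feature_name, *combo]) for combo in product(*value_lists)]


features = combo_feature("comb_u_age_item", ["abc\x1dbcd", "12345\x1d45678"])
print(features)
print(len(features))  # |F1| * |F2| = 2 * 2
```

With two values in each field, the sketch yields four features, matching the last row of the table.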
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# id_feature | ||
|
||
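## Overview | ||
|
||
An id_feature is a sparse feature and the simplest kind of discrete feature: it simply concatenates the value of a field with the user-configured feature name. | ||
|
||
## Configuration | ||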
|
||
```json | ||
{ | ||
"feature_type" : "id_feature", | ||
"feature_name" : "item_is_main", | ||
"expression" : "item:is_main" | ||
} | ||
``` | ||
|
||
| Field | Description | | ||
| -------------- | ----------- | | ||
| feature_name | Required. feature_name is used as the prefix of the final output feature. | | ||
| expression | Required. expression describes the source field that the feature depends on. | | ||
| need_prefix | Optional. true prepends feature_name as a prefix, false does not. Defaults to true; false is typically used in shared_embedding scenarios. | | ||
| invalid_values | Optional. The listed values are output as null. Takes a list of strings; for example, \[""\] turns every empty string into null. | | ||
|
||
Example (^\] denotes the multi-value separator; note that it is a single character with ASCII code "\\x1D", not two characters) | ||
|
||
| Type | Value of item:is_main | Output feature | | ||
| -------- | --------------- | ------------------------------------------- | | ||
| int64_t | 100 | (item_is_main_100, 1) | | ||
| double | 5.2 | (item_is_main_5, 1) (the fractional part is truncated) | | ||
| string | abc | (item_is_main_abc, 1) | | ||
| multi-value string | abc^\]bcd | (item_is_main_abc, 1),(item_is_main_bcd, 1) | | ||
| multi-value int | 123^\]456 | (item_is_main_123, 1),(item_is_main_456, 1) | |
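
The rows above can be reproduced with a small Python sketch. It is a rough model rather than the real fg code: the helper name `id_feature` is hypothetical, and treating a value listed in `invalid_values` as "skipped" is an assumption standing in for the null output described in the table.

```python
SEP = "\x1d"  # multi-value separator (a single character, ASCII 0x1D)


def id_feature(feature_name, value, need_prefix=True, invalid_values=()):
    """Build (feature, 1) pairs from one field value (hypothetical helper)."""
    if isinstance(value, float):
        value = int(value)  # the fractional part is truncated, as in the table
    pairs = []
    for part in str(value).split(SEP):
        if part in invalid_values:
            continue  # assumption: a null output is modelled by skipping the value
        name = f"{feature_name}_{part}" if need_prefix else part
        pairs.append((name, 1))
    return pairs


print(id_feature("item_is_main", 100))
print(id_feature("item_is_main", 5.2))
print(id_feature("item_is_main", "abc\x1dbcd"))
print(id_feature("item_is_main", "abc", need_prefix=False))
```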
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
# lookup_feature | ||
|
||
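## Overview | ||
|
||
If offline generation does not produce the expected results, first switch to the latest offline fg package. | ||
|
||
lookup_feature is similar to match_feature: it matches the result it needs from a set of key-value pairs. | ||
|
||
lookup_feature depends on two fields, map and key. map is a multi-value string (MultiString) field in which each string looks like "k1:v1"; key can be a field of any type. To generate the feature, the value of key is first read and converted to a string, then matched against the key-value pairs held by the map field to obtain the final feature. | ||
|
||
The sources of map and key can be any combination of item, user, and context. For online input, multiple values of an item field are separated by the multi-value separator char(29), while multiple values of user and context fields are passed as a list when accessed through tpp. This feature can only be configured in JSON form. | ||
|
||
## Configuration | ||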
|
||
```json | ||
{ | ||
"features" : [ | ||
{ | ||
"feature_type" : "lookup_feature", | ||
"feature_name" : "item_match_item", | ||
"map" : "item:item_attr", | ||
"key" : "item:item_value", | ||
"needDiscrete" : true | ||
} | ||
] | ||
} | ||
``` | ||
|
||
For the configuration above, suppose that for some doc: | ||
|
||
``` | ||
item_attr : "k1:v1^]k2:v2^]k3:v3" | ||
``` | ||
|
||
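^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters. It is typed as C-q C-5 in emacs and as C-v C-5 in vi. Here item_attr is a multi-value string. Keep in mind that when map represents multiple key-value pairs, it is a multi-value string, not a plain string! | ||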
|
||
``` | ||
item_value : "k2" | ||
``` | ||
|
||
The feature result is item_match_item_k2_v2. Because needDiscrete is true, the result is the discretized form. | ||
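
As a rough illustration of the lookup just described, the following Python sketch splits the map field on the `\x1D` separator, builds a dictionary, and looks up the key. The helper names `parse_map` and `lookup_discrete` are hypothetical, and returning `None` when the key is missing is an assumption; the source does not say what fg outputs in that case.

```python
SEP = "\x1d"  # multi-value separator (a single character, ASCII 0x1D)


def parse_map(raw_map):
    """Turn "k1:v1<SEP>k2:v2" into {"k1": "v1", "k2": "v2"}."""
    pairs = {}
    for entry in raw_map.split(SEP):
        key, _, value = entry.partition(":")
        pairs[key] = value
    return pairs


def lookup_discrete(feature_name, raw_map, key):
    """Discrete lookup result in the form shown above (hypothetical helper)."""
    key = str(key)  # the key is converted to a string before matching
    value = parse_map(raw_map).get(key)
    if value is None:
        return None  # assumption: no feature when the key is absent
    return f"{feature_name}_{key}_{value}"


item_attr = "k1:v1\x1dk2:v2\x1dk3:v3"
print(lookup_discrete("item_match_item", item_attr, "k2"))
```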
|
||
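## Other notes | ||
|
||
match_feature and lookup_feature are both matching-type features, that is, they match a result from key-value pairs. The difference is that the matched field user of match_feature must be a field passed in through qinfo, meaning its value is the same for every doc within a single query, whereas the key and map of lookup_feature have no restriction on their source. | ||
|
||
## Configuration details | ||
|
||
The default configuration is `needDiscrete = true, needWeighting = false, needKey = true, combiner = "sum"` | ||
|
||
### Default output | ||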
|
||
### needWeighting == true | ||
|
||
``` | ||
feature_name:fg | ||
map:{{"k1:123", "k2:234", "k3:3"}} | ||
key:{"k1"} | ||
result: feature={"fg_k1", 123} | ||
``` | ||
|
||
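In this case the string part is used to look up the weight table, and the weight is then multiplied by the corresponding feature value for use in an LR model. | ||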
|
||
### needDiscrete == true | ||
|
||
``` | ||
feature_name:fg | ||
map:{{"k1:123", "k2:234", "k3:3"}} | ||
key:{"k1"} | ||
result: feature={"fg_123"} | ||
``` | ||
|
||
### needDiscrete == false | ||
|
||
``` | ||
map:{{"k1:123", "k2:234", "k3:3"}} | ||
key:{"k1"} | ||
result: feature={123} | ||
``` | ||
|
||
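When there are multiple keys, combiner can be configured to combine the looked-up values. Supported settings are `sum, mean, max, min`. Note: to use combiner, needDiscrete must be set to false; only dense features can be combined, and the generated value is numeric. | ||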
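
The effect of combiner on a dense lookup (needDiscrete = false) might be sketched as follows. This is an assumption-laden illustration: the helper name `lookup_dense` is hypothetical, the keys are assumed to arrive as a Python list, missing keys are assumed to be ignored, and returning `None` when nothing matches is a placeholder rather than documented behaviour.

```python
def lookup_dense(map_entries, keys, combiner="sum"):
    """Combine the numeric values found for several keys (hypothetical helper)."""
    table = {}
    for entry in map_entries:
        key, _, value = entry.partition(":")
        table[key] = float(value)

    hits = [table[k] for k in keys if k in table]  # assumption: skip missing keys
    if not hits:
        return None  # assumption: no value when nothing matches

    if combiner == "sum":
        return sum(hits)
    if combiner == "mean":
        return sum(hits) / len(hits)
    if combiner == "max":
        return max(hits)
    if combiner == "min":
        return min(hits)
    raise ValueError(f"unsupported combiner: {combiner}")


entries = ["k1:123", "k2:234", "k3:3"]
print(lookup_dense(entries, ["k1"]))
print(lookup_dense(entries, ["k1", "k3"], combiner="mean"))
```

With a single key the combiner has no visible effect, which matches the needDiscrete == false example above.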
|
||
A configuration example (updated on 2021.04.15) | ||
|
||
```json | ||
"kv_fields_encode": [ | ||
{ | ||
"name": "cnty_dense_features", | ||
"dimension": 99, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
}, | ||
{ | ||
"name": "cross_a_tag", | ||
"dimension": 12, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
}, | ||
{ | ||
"name": "cross_gender", | ||
"dimension": 12, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
}, | ||
{ | ||
"name": "cross_purchasing_power", | ||
"dimension": 12, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
} | ||
] | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
# match_feature | ||
|
||
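## Overview | ||
|
||
match_feature is generally used to express matching relationships between features, and uses the values of three fields: user, item, and category. | ||
match_feature supports two types, hit and multi hit. | ||
In essence, match_feature is a match against a two-level map. The user field describes the two-level map as a string: `|` separates the entries of the first-level map, `^` separates a first-level key from its value, `,` separates the entries of the second-level map, and `:` separates a second-level key from its value. For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 converts to the following two-level map: | ||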
|
||
```json | ||
{ | ||
"50011740" : { | ||
"50011740" : 0.2, | ||
"36806676" : 0.3, | ||
"122572685" : 0.5 | ||
}, | ||
"50006842" : { | ||
"16788" : 0.1 | ||
} | ||
} | ||
``` | ||
|
||
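With hit matching, the value of category is looked up in the first-level map, then the value of item is looked up in the second-level map, yielding a single result. If only one level of matching is needed rather than two, put ALL as the first-level key of the map and also set category to "ALL" in the fg configuration. See Example 1 for details. | ||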
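
To make the string format concrete, here is a minimal Python sketch that parses such a string into a nested dictionary. The function name `parse_two_level_map` is hypothetical, and converting the innermost values to `float` is an assumption made for illustration.

```python
def parse_two_level_map(raw):
    """Parse "cat^k:v,k:v|cat^k:v" into {cat: {k: v}} (hypothetical helper)."""
    result = {}
    for group in raw.split("|"):               # entries of the first-level map
        category, _, items = group.partition("^")
        inner = {}
        for item in items.split(","):          # entries of the second-level map
            key, _, value = item.partition(":")
            inner[key] = float(value)          # assumption: values are numeric
        result[category] = inner
    return result


raw = "50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1"
nested = parse_two_level_map(raw)
print(nested)
```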
|
||
## Configuration | ||
|
||
JSON configuration file: | ||
|
||
```json | ||
{ | ||
"feature_name": "user__l1_ctr_1", | ||
"feature_type": "match_feature", | ||
"category": "ALL", | ||
"needDiscrete": false, | ||
"item": "item:category_level1", | ||
"user": "user:l1_ctr_1", | ||
"matchType": "hit" | ||
} | ||
``` | ||
|
||
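needDiscrete: when true, the model uses the feature name output by match_feature and ignores the feature value. Defaults to true. | ||
needDiscrete: when false, the model uses the feature value output by match_feature and ignores the feature name. | ||
|
||
matchType: | ||
hit: outputs the matched feature | ||
|
||
XML configuration file: | ||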
|
||
```xml | ||
<features name="matched_features"> | ||
<feature name="brand_hit" dependencies="user:user_brand_tags_hit1,item:brand_id" category="item:auction_root_category" type="hit"/> | ||
<feature name="brand_matched_hit" dependencies="user:user_brand_tags_cos1,item:brand_id" category="ALL" type="hit"/> | ||
</features> | ||
``` | ||
|
||
dependencies: the two features to be matched | ||
|
||
category: the category feature field. category="ALL" means matching is not split by category. | ||
|
||
## Normalizer | ||
|
||
match_feature supports the same normalizers as raw_feature; see [raw_feature](./RawFeature.md) for details. | ||
|
||
## Configuration details | ||
|
||
### hit | ||
|
||
For the following configuration: | ||
|
||
```json | ||
{ | ||
"feature_name": "brand_hit", | ||
"feature_type": "match_feature", | ||
"category": "item:auction_root_category", | ||
"needDiscrete": true, | ||
"item": "item:brand_id", | ||
"user": "user:user_brand_tags_hit", | ||
"matchType": "hit" | ||
} | ||
``` | ||
|
||
Suppose the fields have the following values: | ||
|
||
| user_brand_tags_hit | `50011740^107287172:0.2,36806676:0.3,122572685:0.5\|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19` | | ||
| --------------------- | ------------------------------------------------------------------------------------------------------- | | ||
| brand_id | 30068 | | ||
| auction_root_category | 50006842 | | ||
|
||
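If needDiscrete=true, the result is \<brand_hit_50006842_30068_19,1.0> | ||
If needDiscrete=false, the result is \<brand_hit,19.0> | ||
If only one level of matching is needed, change the value of category in the configuration above to ALL. In that case, you may also consider using lookup_feature. Suppose the fields have the following values: | ||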
|
||
| user_brand_tags_hit | ALL^16788816:40,10122:40,29889:20,30068:20 | | ||
| ------------------- | ------------------------------------------ | | ||
| brand_id | 30068 | | ||
|
||
If needDiscrete=true, the result is \<brand_hit_ALL_30068_20, 1.0>. If needDiscrete=false, the result is \<brand_hit, 20.0>. | ||
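
Both hit examples above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the fg implementation: the helper names `parse_user_map` and `hit_match` are hypothetical, results are modelled as `(name, value)` tuples, and returning `None` for a miss is a placeholder.

```python
def parse_user_map(raw):
    """Parse "cat^k:v,k:v|cat^k:v", keeping values as written (hypothetical)."""
    nested = {}
    for group in raw.split("|"):
        category, _, items = group.partition("^")
        nested[category] = dict(item.split(":", 1) for item in items.split(","))
    return nested


def hit_match(feature_name, user_raw, category, item, need_discrete=True):
    """Look up category in the first level, then item in the second level."""
    value = parse_user_map(user_raw).get(category, {}).get(item)
    if value is None:
        return None  # assumption: no output when nothing is hit
    if need_discrete:
        # The feature name carries the information; the value is fixed at 1.0.
        return (f"{feature_name}_{category}_{item}_{value}", 1.0)
    return (feature_name, float(value))


two_level = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
             "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
one_level = "ALL^16788816:40,10122:40,29889:20,30068:20"

print(hit_match("brand_hit", two_level, "50006842", "30068"))
print(hit_match("brand_hit", two_level, "50006842", "30068", need_discrete=False))
print(hit_match("brand_hit", one_level, "ALL", "30068"))
print(hit_match("brand_hit", one_level, "ALL", "30068", need_discrete=False))
```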
|
||
### multihit | ||
|
||
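Allows the two values category and item to be ALL (note: this refers to the values passed in, not the configured values) for wildcard matching, which can match multiple values. The output is similar to that of hit. |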
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# overlap_feature | ||
|
||
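## Overview | ||
|
||
A feature that outputs term-matching information between strings. | ||
|
||
For offline use, version 1.3.56-SNAPSHOT is recommended. Note: when writing the fg configuration, pay attention to dimensions: the dimension of title must be greater than or equal to that of query. Put simply, if title is a user feature, query can only be a user feature as well; the batch size of user features is 1, while the batch size of item features is the number of items. | ||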
|
||
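| Method | Description | Remarks | | ||
| ------------------- | ----------------------------------------------- | ------------------ | | ||
| common_word | Finds the terms shared by query and title and outputs them as fg_common1_common2 | The number of shared terms does not exceed the number of query terms | | ||
| diff_word | Finds the terms not shared by query and title and outputs them as fg_diff1_diff2 | The number of non-shared terms does not exceed the number of query terms | | ||
| query_common_ratio | Computes the share of query terms that also appear in title, multiplied by 10 and rounded down | Value in \[0,10\] | | ||
| title_common_ratio | Computes the share of title terms that also appear in query, multiplied by 100 and rounded down | Value in \[0,100\] | | ||
| is_contain | Checks whether query is entirely contained in title, preserving order | 0 means not contained, 1 means contained | | ||
| is_equal | Checks whether query is identical to title | 0 means not identical, 1 means identical | | ||
| common_word_divided | Finds the terms shared by query and title and outputs them as fg_common1, fg_common2 | The number of shared terms does not exceed the number of query terms | | ||
| diff_word_divided | Finds the terms not shared by query and title and outputs them as fg_diff1, fg_diff2 | The number of non-shared terms does not exceed the number of query terms | | ||
|
||
## Configuration | ||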
|
||
```json | ||
{ | ||
"feature_type" : "overlap_feature", | ||
"feature_name" : "is_contain", | ||
"query" : "user:attr1", | ||
"title" : "item:attr2", | ||
"method" : "is_contain", | ||
"separator" : " " | ||
} | ||
``` | ||
|
||
| Field | Description | | ||
| ------------ | ----------- | | ||
| feature_type | Required. Describes the type of this feature. | | ||
| feature_name | Required. feature_name is used as the prefix of the final output feature. | | ||
| query | Required. The table that query depends on; attr1 is a multi-value string whose values are separated by chr(29). | | ||
| title | Required. The table that title depends on; attr2 is a multi-value string. | | ||
| method | One of common_word, diff_word, query_common_ratio, title_common_ratio, is_contain, as described in the table above (which also lists is_equal, common_word_divided, and diff_word_divided). | | ||
| separator | The separator character used in the output. Defaults to \_ when omitted, but can be customized; see the example. | | ||
|
||
## Example | ||
|
||
query is high,high2,fiberglass,abc | ||
title is high,quality,fiberglass,tube,for,golf,bag | ||
|
||
| method | separator | feature | | ||
| ------------------- | --------- | -------------------------- | | ||
| common_word | | name_high_fiberglass | | ||
| diff_word | " " | name high2 abc | | ||
| query_common_ratio | | name_5 | | ||
| title_common_ratio | | name_28 | | ||
| is_contain | | name_0 | | ||
| is_equal | | name_0 | | ||
| common_word_divided | | name_high, name_fiberglass | | ||
| diff_word_divided | | name_high2, name_abc | |
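
The example table can be reproduced with the Python sketch below. It is a simplified model, not the fg implementation: the function name `overlap_feature` is hypothetical, terms are assumed to be already split into lists, and is_contain is interpreted here as "query is an in-order subsequence of title", which is only one possible reading of "preserving order".

```python
def overlap_feature(name, query, title, method, separator="_"):
    """Compute one overlap feature from two term lists (hypothetical helper)."""
    title_terms = set(title)
    common = [t for t in query if t in title_terms]      # shared terms, query order
    diff = [t for t in query if t not in title_terms]    # query terms missing from title

    if method == "common_word":
        return separator.join([name, *common])
    if method == "diff_word":
        return separator.join([name, *diff])
    if method == "common_word_divided":
        return [f"{name}{separator}{t}" for t in common]
    if method == "diff_word_divided":
        return [f"{name}{separator}{t}" for t in diff]

    if method == "query_common_ratio":
        score = len(common) * 10 // len(query) if query else 0    # in [0, 10]
    elif method == "title_common_ratio":
        score = len(common) * 100 // len(title) if title else 0   # in [0, 100]
    elif method == "is_contain":
        remaining = iter(title)
        # Assumption: every query term must appear in title in the same order.
        score = int(all(term in remaining for term in query))
    elif method == "is_equal":
        score = int(query == title)
    else:
        raise ValueError(f"unknown method: {method}")
    return f"{name}{separator}{score}"


query = "high,high2,fiberglass,abc".split(",")
title = "high,quality,fiberglass,tube,for,golf,bag".split(",")
for method in ("common_word", "query_common_ratio", "title_common_ratio", "is_contain"):
    print(method, overlap_feature("name", query, title, method))
print(overlap_feature("name", query, title, "diff_word", separator=" "))
```

Integer arithmetic is used for the two ratios so that rounding down does not depend on floating-point behaviour.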
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# raw_feature | ||
|
||
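## Overview | ||
|
||
A raw_feature is a dense feature that directly uses the value of the original field as the feature value. raw_feature supports only numeric types such as int, float, and double; for non-numeric features, use id_feature instead. | ||
|
||
## Configuration | ||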
|
||
```json | ||
{ | ||
"feature_type" : "raw_feature", | ||
"feature_name" : "ctr", | ||
"expression" : "item:ctr", | ||
"normalizer" : "method=log10" | ||
} | ||
``` | ||
|
||
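| Field | Description | | ||
| --------------- | ----------- | | ||
| feature_name | Required. In normal use this option has no effect, because what takes part in subsequent computation is mainly the feature value; when debugging, however, it lets you see the value for the corresponding feature name. | | ||
| expression | Required. expression describes the source field that the feature depends on. | | ||
| value_dimension | Optional, defaults to 1. The dimension of the output field. | | ||
| normalizer | Optional. The normalization method; see below. | | ||
|
||
## Example | ||
|
||
^\] denotes the multi-value separator. Note that it is a single character with ASCII code "\\x1D", not two characters. | ||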
|
||
| Type | Value of item:ctr | Output feature | | ||
| ------- | ----------- | ---------------------------------------------- | | ||
| int64_t | 100 | (ctr, 100) | | ||
| double | 100.1 | (ctr, 100.1) | | ||
| multi-value int | 123^\]456 | (ctr, (123,456)) (note: the input field must match the configured dimension) | | ||
|
||
## Normalizer | ||
|
||
raw_feature and match_feature support three normalizers: `minmax,zscore,log10`. Their configuration and formulas are as follows: | ||
|
||
### log10 | ||
|
||
``` | ||
Config example: method=log10,threshold=1e-10,default=-10 | ||
Formula: x = x > threshold ? log10(x) : default; | ||
``` | ||
|
||
### zscore | ||
|
||
``` | ||
Config example: method=zscore,mean=0.0,standard_deviation=10.0 | ||
Formula: x = (x - mean) / standard_deviation | ||
``` | ||
|
||
### minmax | ||
|
||
``` | ||
Config example: method=minmax,min=2.1,max=2.2 | ||
Formula: x = (x - min) / (max - min) | ||
``` |
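
The three formulas translate directly into Python. The function names below are hypothetical, and the default arguments of `log10_norm` are simply copied from the config example above; they are not claimed to be the defaults used by fg.

```python
import math


def log10_norm(x, threshold=1e-10, default=-10.0):
    """x > threshold ? log10(x) : default"""
    return math.log10(x) if x > threshold else default


def zscore_norm(x, mean, standard_deviation):
    """(x - mean) / standard_deviation"""
    return (x - mean) / standard_deviation


def minmax_norm(x, min_value, max_value):
    """(x - min) / (max - min)"""
    return (x - min_value) / (max_value - min_value)


print(log10_norm(100.0))            # values above the threshold go through log10
print(log10_norm(0.0))              # values at or below the threshold use the default
print(zscore_norm(5.0, 0.0, 10.0))
print(minmax_norm(2.15, 2.1, 2.2))
```

Note that minmax divides by `max - min`, so a configuration with equal min and max would fail in this sketch; how fg handles that case is not stated in the source.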