Filtering words composed of more than 1 token #4

mataney · 2019-12-09T09:42:46Z

Hi, thanks for the great work.

I see that you are filtering out words that are composed of more than one token:

PPLM/run_pplm.py

Line 390 in 5f27e19

single_bow = list(filter(lambda x: len(x) <= 1, single_bow))

This filters out quite a few words (including every term that consists of more than one word).

Do you have any idea how to handle this when we want to use these multi-token words?

Cheers.
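To see which entries the filter quoted above drops, one option is to partition the encoded bag of words instead of discarding entries in place. The following is only a minimal sketch: `split_by_token_count` and `toy_encode` are hypothetical names that are not part of PPLM, and `toy_encode` merely stands in for `tokenizer.encode` so the snippet runs without downloading a model.

```python
from typing import Callable, Dict, List, Tuple


def split_by_token_count(
    words: List[str],
    encode: Callable[[str], List[int]],
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Return (single_token, multi_token) maps from word to token ids."""
    single: Dict[str, List[int]] = {}
    multi: Dict[str, List[int]] = {}
    for word in words:
        ids = encode(word.strip())
        # Same criterion as the quoted filter: at most one id counts as "single".
        if len(ids) <= 1:
            single[word] = ids
        else:
            multi[word] = ids
    return single, multi


# Hypothetical stand-in for tokenizer.encode: one id per whitespace-separated piece.
_TOY_VOCAB = {"science": 0, "computer": 1, "physics": 2}


def toy_encode(text: str) -> List[int]:
    return [_TOY_VOCAB[piece] for piece in text.split()]


single, multi = split_by_token_count(["physics", "computer science"], toy_encode)
```

With a real tokenizer, `multi` would list the words and phrases that the current filter silently removes from the bag of words.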

dathath · 2019-12-10T00:03:35Z

I think one option would be to compute the probability of the multiple tokens being generated and use it the same way the single-token probability is used.

Say a word splits into two tokens s1, s2: instead of p(w|x) in equation 5, you could potentially use p(s1|x)*p(s2|s1,x), and I suspect that should work with everything else as is.

I haven't tested this; if you have any luck with it, let us know. Alternatively, I plan on testing it at some point soon and can get back to you (and will update the code accordingly).
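The chain-rule product suggested above, p(s1|x)*p(s2|s1,x), can be sketched as follows. This is an untested-in-PPLM illustration under stated assumptions: `multi_token_log_prob`, `NextTokenDist`, and `toy_dist` are hypothetical names, and `toy_dist` is a hand-written stand-in for a language model's next-token distribution, not the model used in the repository.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: maps a context (token ids) to a
# {token_id: probability} distribution over the next token.
NextTokenDist = Callable[[Sequence[int]], Dict[int, float]]


def multi_token_log_prob(
    next_dist: NextTokenDist,
    context: Sequence[int],
    tokens: Sequence[int],
) -> float:
    """Sum log p(s_i | s_1..s_{i-1}, x) over the word's tokens (chain rule)."""
    ctx: List[int] = list(context)
    total = 0.0
    for tok in tokens:
        # Assumes the model assigns non-zero probability to every token.
        total += math.log(next_dist(ctx)[tok])
        ctx.append(tok)  # later tokens are conditioned on earlier ones
    return total


def toy_dist(context: Sequence[int]) -> Dict[int, float]:
    # Toy model: token 2 is likely right after token 1, otherwise uniform.
    if context and context[-1] == 1:
        return {0: 0.1, 1: 0.1, 2: 0.7, 3: 0.1}
    return {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}


# Word that splits into tokens s1=1, s2=2, given context x=[0].
log_p = multi_token_log_prob(toy_dist, context=[0], tokens=[1, 2])
p = math.exp(log_p)  # p(s1|x) * p(s2|s1,x)
```

Working in log space avoids underflow for longer phrases. Whether this quantity behaves well as a drop-in for p(w|x) in the loss is exactly the part that remains untested in this thread.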

monkdou0 · 2021-05-06T13:17:00Z

    bow_indices.append(
        [tokenizer.encode(word.strip(),
                          add_prefix_space=True,
                          add_special_tokens=False)
         for word in words])

I tried to run this code, and all words are composed of more than one token.
I think this is because add_prefix_space=True.
Did I do something wrong?

vaibhavvarshney0 · 2021-06-14T16:43:39Z

Hi,
any update on this?

janleemark · 2021-10-28T13:28:49Z

Hi,
Is there any implementation of this for phrases (or words of more than one token)?

yananchen1989 · 2022-01-11T20:52:12Z

    bow_indices.append(
        [tokenizer.encode(word.strip(),
                          add_prefix_space=True,
                          add_special_tokens=False)
         for word in words])
I tried to run this code, and all words are composed of more than one token. I think this is because add_prefix_space=True. Did I do something wrong?

@monkdou0

Setting add_prefix_space to True will not split a word into more token ids.

dathath self-assigned this Dec 10, 2019
