fairseq\fairseq\data\dictionary.py", line 259, in add_from_file raise ValueError( ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]' #2

samrudh · 2020-11-15T16:14:07Z

OS: Windows
While creating a class, the error below occurred:
q\fairseq\data\dictionary.py", line 246, in add_from_file
count = int(field)
ValueError: invalid literal for int() with base 10: 'https://git-lfs.github.com/spec/v1'

Stacktrace shows:
fairseq\data\dictionary.py", line 259, in add_from_file
raise ValueError(
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

I believe the wrong value of the field is being set somewhere

Code:

from asamiasami import Hi2EnTranslator
hi2EnObj = Hi2EnTranslator()


swapniljadhav1921 · 2020-11-18T05:37:59Z

This is caused by the fairseq version or the git-lfs version. Please try the updated installation steps, and make sure you are using the latest version of the repo.

swapniljadhav1921 · 2020-11-29T12:51:22Z

Closing the issue due to inactivity. Please reopen it if you have further questions.

akshay951228 · 2021-03-21T13:43:37Z

Hi,
Thanks for your great work!
I'm still facing the same error, even after following the exact setup in the README.
Error:
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

lacls · 2021-05-06T06:21:34Z

> Hi,
> Thanks for your great work!
> I'm still facing the same error, even after following the exact setup in the README.
> Error:
> ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

This is exactly what I am facing too. In my case the error comes from the tokenization code of a specific model: I am using PhoBERT, and the file is stored at transformers/models/PhoBERT/Tokenization_PhoBert. Its loader looks like this:


def add_from_file(self, f):
    """
    Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
    """
    # A path was given: open it and recurse with the file object.
    if isinstance(f, str):
        try:
            with open(f, "r", encoding="utf-8") as fd:
                self.add_from_file(fd)
        except FileNotFoundError as fnfe:
            raise fnfe
        except UnicodeError:
            raise Exception(f"Incorrect encoding detected in {f}, please rebuild the dataset")
        return
    lines = f.readlines()
    for lineTmp in lines:
        line = lineTmp.strip()
        # Split at the last space; everything before it is the token.
        idx = line.rfind(" ")
        if idx == -1:
            raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
        word = line[:idx]
        # Only the token is stored; the trailing <cnt> field is discarded.
        self.encoder[word] = len(self.encoder)

As far as I can tell, this is because the loader only appends the token and drops the <cnt> field (I don't really know what that field represents).

If you have any other approach, please share it. I would really appreciate it.
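One way to narrow this down is to find the first line of the dictionary file that does not match the '<token> <cnt> [flags]' shape named in the error message. The following is a minimal sketch, not fairseq's own code: the helper name `find_bad_dict_line` is hypothetical, and it assumes the only flag in use is `#fairseq:overwrite`, which may differ in your installed fairseq version.

```python
from typing import Iterable, Optional, Tuple

# Assumption: the only optional flag is "#fairseq:overwrite".
OVERWRITE_FLAG = "#fairseq:overwrite"


def find_bad_dict_line(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (line_number, text) of the first malformed line, or None."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip()
        # Split at the last space: "<token> <cnt>" or "<token> <cnt> <flag>".
        parts = line.rsplit(" ", 1)
        if len(parts) == 2 and parts[1] == OVERWRITE_FLAG:
            # Drop the flag and split again to reach the count.
            parts = parts[0].rsplit(" ", 1)
        if len(parts) != 2:
            # No space at all, e.g. a tab-separated or empty line.
            return lineno, line
        try:
            int(parts[1])
        except ValueError:
            # The last field is not an integer count.
            return lineno, line
    return None
```

To check a file, open it with `encoding="utf-8"` and pass the file object to `find_bad_dict_line`; the returned line text usually shows what is actually in the file.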

swapniljadhav1921 · 2021-05-22T12:15:13Z

Reopening the issue. This is happening because I initially used git-lfs for file storage and later removed it.
That replaced the contents of the dict files with different text. You can check, for example, dict.en.txt: it no longer contains a dictionary, hence the code fails.
I will make sure to provide the correct files soon.

Sorry for the very delayed reply; I didn't notice the comments because the issue was closed. @akshay951228 @lacls
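The string in the original traceback, 'https://git-lfs.github.com/spec/v1', is the version line that Git LFS writes at the top of its pointer files, so a dictionary file that starts with it is a pointer rather than the real data. A rough sketch for spotting such files in a checkout follows; the function names are hypothetical and the check only looks at the first bytes of each file.

```python
from pathlib import Path

# First line of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root) -> list:
    """List every regular file under root that looks like an LFS pointer."""
    return sorted(
        str(p) for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

Running `find_lfs_pointers` on the model folder should return an empty list once the real files are in place.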

swapniljadhav1921 · 2021-05-22T13:57:33Z

Issues with LFS

The files were initially added to LFS and later removed, which left unstable file versions in the repo.
The files are large, and the free tier of GitHub has size limitations.
I propose using the files from this location -> https://drive.google.com/drive/folders/18x_vGGa5v3jT-Zx73u0eKFfDGyw9M_aB?usp=sharing
The folder structure is the same: please replace the git files with these files, and then LFS is not required.
If you find any issue, please report it here -> #2
This is not a very efficient approach, but I will make it more usable later.

swapniljadhav1921 · 2021-05-22T13:58:25Z

@lacls @akshay951228 please check and let me know if there is any issue. I have checked this on my setup.

Pogayo · 2022-07-12T20:59:12Z

I got this error because my dictionary file, generated by the sentencepiece tokenizer, was tab-separated. Replacing the tabs with spaces solved it for me. Remember to also remove the unknown, bos, and eos tokens from the dictionary if you are using sentencepiece.
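The conversion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `spm_vocab_to_dict_lines` is hypothetical, the special tokens are assumed to use the default names `<unk>`, `<s>`, and `</s>`, and each vocab line is assumed to be `piece<TAB>score`. Because the score column is typically a float rather than an integer count, the sketch goes one step beyond replacing tabs and writes a placeholder integer instead.

```python
# Assumption: default SentencePiece names for the unknown, bos, and eos tokens.
SPECIAL_TOKENS = {"<unk>", "<s>", "</s>"}


def spm_vocab_to_dict_lines(vocab_lines, placeholder_count=1):
    """Turn 'piece<TAB>score' lines into '<token> <cnt>' lines."""
    out = []
    for raw in vocab_lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Keep only the piece; the score column is discarded.
        token = line.split("\t", 1)[0]
        if token in SPECIAL_TOKENS:
            continue  # drop unknown, bos, and eos entries
        # The count is a placeholder, not a real frequency.
        out.append(f"{token} {placeholder_count}")
    return out
```

Write the returned lines to the dictionary file with one entry per line and UTF-8 encoding.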


swapniljadhav1921 self-assigned this Nov 18, 2020

swapniljadhav1921 closed this as completed Nov 29, 2020

swapniljadhav1921 reopened this May 22, 2021

swapniljadhav1921 mentioned this issue May 22, 2021

unable to download files from git-lfs #4

Closed

su0315 added a commit to su0315/contextual-mt that referenced this issue Nov 30, 2022

Update spm.en.nopretok.vocab

f2011a3

deleted unk, s, /s, to avoid the error, swapniljadhav1921/asamiasami#2
