fairseq\fairseq\data\dictionary.py", line 259, in add_from_file raise ValueError( ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]' #2

Open
samrudh opened this issue Nov 15, 2020 · 8 comments

@samrudh

samrudh commented Nov 15, 2020

OS: Windows
While creating an instance of the class, the error below occurred:
q\fairseq\data\dictionary.py", line 246, in add_from_file
count = int(field)
ValueError: invalid literal for int() with base 10: 'https://git-lfs.github.com/spec/v1'

Stacktrace shows:
fairseq\data\dictionary.py", line 259, in add_from_file
raise ValueError(
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

I believe a wrong value is being assigned to the field somewhere.

Code:

from asamiasami import Hi2EnTranslator
hi2EnObj = Hi2EnTranslator()
@swapniljadhav1921
Owner

swapniljadhav1921 commented Nov 18, 2020

This is caused by either the fairseq version or the git-lfs version. Please try the updated installation steps, and make sure you are using the latest version of the repo.

@swapniljadhav1921 swapniljadhav1921 self-assigned this Nov 18, 2020
@swapniljadhav1921
Owner

Closing the issue due to inactivity. Please reopen it if you have further questions.

@akshay951228

akshay951228 commented Mar 21, 2021

Hi,
thanks for your great work!
I'm still facing the same error, even after following the exact setup in the README.
Error:
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

@lacls

lacls commented May 6, 2021

> Hi,
> thanks for your great work!
> I'm still facing the same error, even after following the exact setup in the README.
> Error:
> ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

This is exactly what I am facing too. In my case it comes from the model-specific tokenizer (I am using PhoBERT; the file is stored at transformers/models/PhoBERT/Tokenization_PhoBert):


def add_from_file(self, f):
    """
    Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
    """
    # A path was given: open it and recurse with the file object.
    if isinstance(f, str):
        try:
            with open(f, "r", encoding="utf-8") as fd:
                self.add_from_file(fd)
        except FileNotFoundError as fnfe:
            raise fnfe
        except UnicodeError:
            raise Exception(f"Incorrect encoding detected in {f}, please rebuild the dataset")
        return
    lines = f.readlines()
    for lineTmp in lines:
        line = lineTmp.strip()
        # Split on the last space: everything before it is the token.
        idx = line.rfind(" ")
        if idx == -1:
            raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
        word = line[:idx]
        # Only the token is stored; the count after the space is ignored.
        self.encoder[word] = len(self.encoder)

This seems to be because only the token is added, without the <cnt> field (I don't really know what that field represents).

If anyone has another approach, please share it. I would really appreciate it.
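
For reference, the `<cnt>` field in the format named by the error message is an integer count that follows the token after a space. A minimal sketch of that parsing rule is below. `parse_dict_line` is a hypothetical helper written for illustration, not the fairseq or transformers implementation, and it does not handle the optional `[flags]` field.

```python
def parse_dict_line(line):
    """Split one '<token> <cnt>' line into (token, count).

    Hypothetical helper for illustration only; the optional '[flags]'
    field from the fairseq error message is not handled here.
    """
    line = line.rstrip()
    # Split on the last space, so the token keeps any earlier spaces.
    token, sep, field = line.rpartition(" ")
    if not sep or not token:
        raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
    try:
        count = int(field)
    except ValueError:
        raise ValueError(
            f"Incorrect dictionary format, count is not an integer: {field!r}"
        ) from None
    return token, count


print(parse_dict_line("hello 42"))  # a well-formed line

# The first line of a Git LFS pointer file is rejected, because the
# second field is a URL rather than an integer.
try:
    parse_dict_line("version https://git-lfs.github.com/spec/v1")
except ValueError as err:
    print(err)
```

A line with no space at all (for example, a tab-separated line) is rejected by the same check, since there is nothing to split on.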

@swapniljadhav1921
Owner

Reopening the issue. This is happening because I initially used git-lfs for file storage and later removed it.
That replaced the contents of the dict files with different text. You can check, for example, the dict.en.txt file: there is no dictionary in it, hence the code is failing.
I will make sure to provide the correct files soon.

Sorry for the very delayed reply; I didn't notice because the issue was closed. @akshay951228 @lacls
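
One way to confirm this locally is to look for leftover Git LFS pointer files. A pointer file is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, which matches the string in the original traceback. The sketch below assumes the dictionary files are named `dict.*.txt` (as in `dict.en.txt`); the helper names and the directory argument are hypothetical.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(root, pattern="dict.*.txt"):
    """List files under root matching pattern that are LFS pointers."""
    return sorted(
        p for p in Path(root).rglob(pattern) if p.is_file() and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    # "." is a placeholder; point this at the cloned repo directory.
    for path in find_lfs_pointers("."):
        print(f"LFS pointer, not a real dictionary: {path}")
```

Any file this reports still holds pointer text rather than dictionary entries and needs to be replaced with the real file.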

@swapniljadhav1921
Owner

Issues with LFS

Due to various issues with LFS, the files were initially added to LFS and later removed, which left unstable file versions in the repo.
The files are large, and the free GitHub plan has size limits.
I propose using the files from this location -> https://drive.google.com/drive/folders/18x_vGGa5v3jT-Zx73u0eKFfDGyw9M_aB?usp=sharing
The folder structure is the same. Please replace the files from git with these files; LFS is then not required.
If you find any issue, please report it here -> #2
This is a very inefficient approach, but I will make it more usable later.

@swapniljadhav1921
Owner

@lacls @akshay951228 please check and let me know if you run into any issue. I have verified it on my setup.

@Pogayo

Pogayo commented Jul 12, 2022

I got this error because the dictionary file generated by my sentencepiece tokenizer was tab-separated. Replacing the tabs with spaces solved it for me. Remember to also remove the unknown, bos, and eos tokens from the dictionary if you are using sentencepiece.
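
The conversion described above can be sketched as follows. This assumes the sentencepiece vocabulary file has one `piece<TAB>score` pair per line and that the special tokens use sentencepiece's default names (`<unk>`, `<s>`, `</s>`). Because the loader in the traceback calls `int()` on the second field, the sketch writes a placeholder integer count instead of copying the score; that choice and the function name are assumptions for illustration, not part of the fix reported here.

```python
def spm_vocab_to_dict_lines(vocab_lines, specials=("<unk>", "<s>", "</s>"), count=1):
    """Convert 'piece<TAB>score' lines into '<token> <cnt>' lines.

    Hypothetical helper: drops the special tokens and replaces the
    score with a placeholder integer count.
    """
    out = []
    for raw in vocab_lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # The piece is everything before the first tab.
        piece = line.split("\t", 1)[0]
        if piece in specials:
            continue  # special tokens must not appear in the dictionary
        out.append(f"{piece} {count}")
    return out


sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "\u2581the\t-3.2", "ing\t-4.1"]
print(spm_vocab_to_dict_lines(sample))
```

To convert a real file, read its lines with `encoding="utf-8"`, pass them to the function, and write the result back out joined by newlines.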

su0315 added a commit to su0315/contextual-mt that referenced this issue Nov 30, 2022
deleted <unk>, <s>, </s> to avoid the error, swapniljadhav1921/asamiasami#2