Getting this error during generation of embeddings:
Traceback (most recent call last):
  File "/home/user/.local/bin/sem", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 84, in main
    query_func(args)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 38, in query_func
    do_query(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/query.py", line 51, in do_query
    do_embed(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 82, in do_embed
    functions = _get_repo_functions(
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 71, in _get_repo_functions
    file_content = f.read()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte
This happened after it had already successfully processed quite a few files.
As a workaround, I just wrapped the affected lines in a try/except:
def _get_repo_functions(root, supported_file_extensions, relevant_node_types):
    functions = []
    print('Extracting functions from {}'.format(root))
    for fp in tqdm([root + '/' + f for f in os.popen('git -C {} ls-files'.format(root)).read().split('\n')]):
        if not os.path.isfile(fp):
            continue
        with open(fp, 'r') as f:
            lang = supported_file_extensions.get(fp[fp.rfind('.'):])
            if lang:
                try:
                    parser = get_parser(lang)
                    file_content = f.read()
                    tree = parser.parse(bytes(file_content, 'utf8'))
                    all_nodes = list(_traverse_tree(tree.root_node))
                    functions.extend(_extract_functions(
                        all_nodes, fp, file_content, relevant_node_types))
                except Exception as e:
                    print(f"Hit error while parsing {fp}: {e}")
    return functions
With this workaround in place, the error is reported for quite a lot of third-party files in my repo. Since these are third-party, I cannot update or fix them. Should sem be made robust against such issues?
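A narrower alternative to the blanket try/except might be to handle only the decoding step. The sketch below is just an illustration, not what sem does today: `read_text_or_none` is a hypothetical helper name, and it assumes that files which are not valid UTF-8 can simply be skipped.

```python
from typing import Optional


def read_text_or_none(path: str, encoding: str = 'utf-8') -> Optional[str]:
    """Return the decoded file content, or None if it is not valid `encoding`.

    Hypothetical helper for illustration; it is not part of sem.
    """
    # Read raw bytes first so that opening the file never triggers decoding.
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: a file that cannot be decoded is skipped by the caller.
        return None
```

Inside `_get_repo_functions`, the result could replace `f.read()`, with a `continue` when it is `None`. That way only undecodable files are skipped, and other exceptions from parsing still surface instead of being swallowed.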