Garbled output when fine-tuning Whisper v3 #13

Kevinarcsin001 · 2024-12-21T07:46:53Z

I am using two RTX 4090 GPUs, and the training data is AISHELL-1 with punctuation.
The environment is as follows:
Package Version Editable project location

accelerate 0.28.0
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.7.0
async-timeout 5.0.1
attrs 24.2.0
audioread 3.0.1
av 14.0.1
bitsandbytes 0.41.3
Brotli 1.0.9
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
coloredlogs 15.0.1
ctranslate2 4.5.0
dataclasses 0.6
datasets 3.2.0
decorator 5.1.1
dill 0.3.8
evaluate 0.4.3
exceptiongroup 1.2.2
fastapi 0.115.6
faster-whisper 1.1.0
filelock 3.13.1
flatbuffers 24.3.25
frozenlist 1.5.0
fsspec 2024.9.0
gmpy2 2.1.2
h11 0.14.0
huggingface-hub 0.26.5
humanfriendly 10.0
idna 3.10
Jinja2 3.1.4
jiwer 3.0.5
joblib 1.4.2
lazy_loader 0.4
librosa 0.10.2.post1
llvmlite 0.43.0
MarkupSafe 3.0.2
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.2.1
numba 0.60.0
numpy 2.0.1
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
onnxruntime 1.16.3
packaging 24.2
pandas 2.2.3
peft 0.7.0 # my own local path; installed from source
pillow 11.0.0
pip 24.2
platformdirs 4.3.6
pooch 1.8.2
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
pyarrow 18.1.0
pycparser 2.22
pydantic 2.10.3
pydantic_core 2.27.1
pydub 0.25.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
RapidFuzz 3.10.1
regex 2024.11.6
requests 2.32.3
safetensors 0.4.5
scikit-learn 1.5.2
scipy 1.13.1
setuptools 75.1.0
six 1.17.0
sniffio 1.3.1
SoundCard 0.4.3
soundfile 0.12.1
soxr 0.5.0.post1
starlette 0.41.3
sympy 1.13.1
tensorboardX 2.6.2.2
threadpoolctl 3.5.0
tokenizers 0.21.0
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
tqdm 4.67.1
transformers 4.47.0
triton 3.1.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.1
wheel 0.44.0
xxhash 3.5.0
yarl 1.18.3
zhconv 1.4.3

The results are as follows:

Could you offer some suggestions? Thanks in advance.

gody7334 · 2024-12-28T02:13:15Z

I have the same issue.
I reckon it's due to a discrepancy between the training (and test) data and the real data, with the model overfitting to the training data.

You need to start thinking about what kind of data (or augmentation) would better represent your real data.

shuaijiang · 2025-01-02T10:06:39Z

Have you checked that the training data (text) is encoded in UTF-8?
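
One way to run that check is sketched below. It reads a transcript file as raw bytes and reports every line that does not decode as UTF-8. The function name `find_non_utf8_lines` is hypothetical, and the sketch assumes the transcripts are stored as line-oriented text (for example a plain-text or JSON-lines manifest); the thread does not say how the training data is laid out, so adapt it to the actual files.

```python
def find_non_utf8_lines(path):
    """Return (line_number, raw_bytes) for each line that is not valid UTF-8.

    Hypothetical helper: `path` is any line-oriented transcript file.
    """
    bad_lines = []
    # Open in binary mode so Python does not decode (or silently replace) bytes.
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                # For example, GBK/GB2312-encoded Chinese text fails here.
                bad_lines.append((line_number, raw))
    return bad_lines
```

Calling `find_non_utf8_lines("train.txt")` (a placeholder path) returns an empty list when every line decodes cleanly. Any reported line is worth re-encoding to UTF-8 before training, since mis-encoded transcripts are one plausible source of garbled text in the labels.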
