Team members (HW): 심하민, 성원희, 윤재선
Team members (SW): 박지완, 박은주, 남경현
[Notion]
- Listen for requests (especially POST requests for our project)
- When a POST request hits the 'rapa' endpoint, upload the file content received from the request to /server_uploaded (removed when the request ends)
- Read the file from that path and process the data (model part)
- Finally, return the text answer (generated by the GPT chat API with prompting); a minimal sketch of this flow follows this list
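The flow above can be written as a minimal sketch, assuming FastAPI (the `api` object name matches the `uvicorn server:api --reload` command further down); the endpoint body and the `process_audio()` helper are illustrative stand-ins, not the exact server.py code.

```python
# Minimal sketch of the server flow, assuming FastAPI; process_audio() is a
# hypothetical stand-in for the model part (STT -> speaker info -> GPT answer).
import os
import shutil
from fastapi import FastAPI, UploadFile

api = FastAPI()
UPLOAD_DIR = "server_uploaded"

def process_audio(path: str) -> str:
    """Hypothetical stand-in for virtual_model.py (STT, speaker info, GPT)."""
    ...

@api.post("/rapa")
async def rapa(file: UploadFile):
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    path = os.path.join(UPLOAD_DIR, file.filename)
    with open(path, "wb") as out:
        shutil.copyfileobj(file.file, out)      # save the posted .wav
    try:
        answer = process_audio(path)            # model part
    finally:
        os.remove(path)                         # upload is removed when the request ends
    return {"answer": answer}
```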
- Request tasks from the server (especially posting the recorded file for our project)
- When a .wav file is recorded by the user (=human) via the button, the client asks the server to produce an appropriate response (=text) for that input audio, considering the user's sex and age
- Finally, it gets the answer text (=response) from the server, converts that answer into an audio file with Google TTS, and speaks it to the user (=human); see the client sketch after this list
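A minimal sketch of the client request, assuming the server runs locally on the default uvicorn port and that the response carries the answer under an "answer" key (both assumptions, not confirmed by client.py); TTS and playback are handled separately (tts.py / finrecord.py).

```python
# Sketch of the client-side request; the URL, form field name, and response key
# are assumptions for illustration.
import requests

SERVER_URL = "http://localhost:8000/rapa"   # assumed host/port

def ask_server(wav_path: str) -> str:
    with open(wav_path, "rb") as f:
        resp = requests.post(SERVER_URL, files={"file": (wav_path, f, "audio/wav")})
    resp.raise_for_status()
    return resp.json()["answer"]            # text answer produced by the server

if __name__ == "__main__":
    answer = ask_server("recorded_audio/input.wav")  # hypothetical recorded file path
    print(answer)  # the text is then synthesized with Google TTS and spoken to the user
```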
- Get personal info (sex, age): Whisper model, fine-tuned with Korean audio data
- Recognize the text of the audio: use the Google Speech-to-Text (STT) API (sketched after this list)
- Generate an appropriate answer text by prompting GPT
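For the STT step, a minimal sketch with the Google Cloud Speech-to-Text client is shown below; the encoding, sample rate, and language code are assumptions, and GOOGLE_APPLICATION_CREDENTIALS must point at the service-account JSON file (see stt.py below).

```python
# Sketch of the Google Cloud STT call; encoding, sample rate, and language
# code are assumed values, not confirmed by stt.py.
from google.cloud import speech

def transcribe(wav_path: str) -> str:
    client = speech.SpeechClient()          # needs GOOGLE_APPLICATION_CREDENTIALS set
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,            # assumed recording rate
        language_code="ko-KR",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```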
- server.py: runs the server; to start it, open a terminal and type
uvicorn server:api --reload
- virtual_model.py: executed when the server receives a request; first calls STT, infers sex and age from the audio file, then generates an answer with the prompted GPT
- client.py: after the .wav file (=audio input) is recorded, save it to recorded_audio and run this file (with the appropriate path set)
- stt.py: Google Cloud Speech-to-Text; the auth credentials (=JSON file) must be activated before use
- tts.py: same as stt.py (Google Cloud); the credentials must be activated first
- ask_gpt.py: requires extra info (=sex, age) plus the message from the audio input transcribed by the STT code; a prompting sketch follows this list
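The rule-based prompting in ask_gpt.py might look roughly like the sketch below, assuming the `openai` Python client; the model name, prompt wording, and function signature are illustrative only.

```python
# Sketch of prompting GPT with the speaker's sex and age; the prompt text and
# model name are assumptions, not the team's exact code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt(message: str, sex: str, age: str) -> str:
    system_prompt = (
        f"You are a friendly Korean voice assistant. The speaker is a {age} {sex}. "
        "Reply in short, natural spoken Korean suited to that speaker."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": message},
        ],
    )
    return resp.choices[0].message.content
```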
- Initial plan: take the audio file received from the HW team, run morphological, syntactic, semantic, and discourse analysis to generate an appropriate response, convert it back into an audio file, and hand it back to the HW team.
https://openai.com/research/whisper
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.
- Use Whisper STT to convert the speech to text, then extract speaker information through the model
- Apply rule-based prompt tuning based on that speaker information and use the GPT API to generate text matching the input text
- Convert the generated text into an audio file using Google TTS
- The dataset was important for classifying speakers.
- First, to distinguish child and adult speakers, we used AI Hub's children's speech dataset. Children were labeled 0 and adults 1.
- Going beyond distinguishing only children and adults, we also distinguished male and female speech.
- To classify gender, we used AI Hub's Korean speech dataset. Women were labeled 0 and men 1.
(openai/whisper-large)
Fine-tuning Whisper
Freeze the model's feature extractor and encoder, and train only a randomly initialized linear layer (sketched below)
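A sketch of this setup, assuming the Hugging Face `transformers` Whisper implementation; the mean-pooling and the 2-class head are illustrative choices, not the team's exact training code.

```python
# Freeze the pretrained Whisper encoder and train only a new linear head,
# as described above; pooling and head size are assumptions.
import torch
import torch.nn as nn
from transformers import WhisperModel

class SpeakerClassifier(nn.Module):
    def __init__(self, checkpoint: str = "openai/whisper-large", num_labels: int = 2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        for p in self.encoder.parameters():          # freeze feature extractor + encoder
            p.requires_grad = False
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)  # randomly initialized

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: log-Mel batch produced by WhisperFeatureExtractor
        hidden = self.encoder(input_features).last_hidden_state
        return self.head(hidden.mean(dim=1))         # e.g. child=0 / adult=1 logits
```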
- audiolist.py: identifies the current audio input and output device numbers (see the device-listing sketch after this list)
- stt.py: uses Google Cloud's Text-to-Speech API to convert text into speech and save it as a file
- finrecord.py: records audio, sends it to the server, then receives the appropriate audio file and plays it through the speaker
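audiolist.py presumably enumerates devices along these lines, assuming PyAudio is the audio backend (an assumption; a sounddevice-based version would look similar).

```python
# Sketch of listing audio devices to find the input/output device numbers
# used by the recording and playback scripts; PyAudio is assumed.
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    print(i, info["name"],
          "in:", info["maxInputChannels"],
          "out:", info["maxOutputChannels"])
pa.terminate()
```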