.TH klatt 1 "April 1994"
.SH NAME
klatt \- Klatt cascade-parallel formant synthesizer (v3.03)
.SH SYNTAX
.B klatt
[
.B \-i
.I input filename
][
.B \-o
.I output filename
][
.B \-q
][
.B \-t
.I output waveform type
][
.B \-c
][
.B \-n
.I number of formants in the cascade branch
][
.B \-s
.I sample rate
][
.B \-f
.I number of milliseconds per frame
][
.B \-v
.I voicing source
][
.B \-V
.I sampled voicing filename
][
.B \-r
.I raw samples output type
][
.B \-F
.I percent f0 flutter
]
.SH DESCRIPTION
.de EX \"Begin example
.ne 5
.if n .sp 1
.if t .sp .5
.nf
.ta +8u*\w'\0'u +8u*\w'\0'u +8u*\w'\0'u +8u*\w'\0'u +8u*\w'\0'u +8u*\w'\0'u
..
.de EE
.fi
.if n .sp 1
.if t .sp .5
..
The
.I klatt
software is an implementation of a speech synthesizer first
described by Dennis Klatt in 1980 [1]. The object of the program is to convert
a set of parameter values into a waveform representing speech sound. The
following pages describe the command line options available to the user, and
the format of the input and output data files. Details of the history
of this code and the modifications that have been made can be found in
the README file in the distribution.
.SH OPTIONS
.TP
.B \-h
Displays a help message
.TP
.B \-i
.I filename
User specified input filename. The file specified will contain ASCII
data in a format described in a later section. If no filename is
specified then input is assumed to be from stdin.
.TP
.B \-o
.I filename
User specified output filename. The output speech waveform will be written to
this file. The output waveform may be written as signed 16 bit integers in
raw binary samples, or as a file of ASCII integers. This second format is
suitable for plotting the waveform using gnuplot etc. If no filename
is specified, then output is assumed to be to stdout.
.TP
.B \-q
Run in quiet mode; no output messages are produced. The default is to
run in verbose mode, where details of the current frame of parameters
being processed are displayed on the screen.
.TP
.B \-t
.I Output Waveform Type
This option selects the type of waveform that is
written to the output file. The default is the complete
speech waveform. Note that
a value must be set at compilation time to enable the code which
generates the other output waveforms. This code may be disabled
to improve the speed and efficiency of the program overall. The
available options are:
.br
.B 1
voicing.
.br
.B 2
aspiration.
.br
.B 3
frication.
.br
.B 4
cascade glottal output.
.br
.B 5
parallel glottal output.
.br
.B 6
bypass output.
.br
.B 7
all excitation (frication, voicing etc.).
.TP
.B \-c
This flag selects full cascade-parallel operation. The default setting
of the synthesizer is parallel branch only.
.TP
.B \-n
.I Number of Formants.
This option is used to set the number of formants in the cascade
branch of the synthesizer. The default number is 5.
.TP
.B \-s
.I Sample Rate
Sets the sample rate used for the output waveform. The default is
10000 Hz (10 kHz).
.TP
.B \-f
.I Number of Milliseconds per Frame
This value specifies the number of milliseconds of output waveform
each frame of synthesizer parameters represents. The default is 10.
.TP
.B \-v
.I Voicing Source
Three types of voicing source are available:
.br
.B 1
.I Impulse train.
This is just a series of regular pulses. The pulses are
smoothed using low-pass filtering to remove unwanted "sharp"
transitions. These pulses do not resemble the excitation waveform
derived from natural voicing.
.br
.B 2
.I Natural Simulation.
This voicing source is an idealised version of the natural excitation
waveform. It is in fact the mirror image (reversed from left to right) of a real
excitation waveform, although this should not be a problem unless
total accuracy in modelling natural speech is required.
.br
.B 3
.I Sampled Natural Excitation.
The software provides the ability to utilise the excitation waveform
measured from the voice of a real speaker. The easiest way to get this
is through inverse filtering. A default sampled excitation waveform is
supplied, although I make no claims for its usefulness!
.TP
.B \-V
.I Sampled Natural Excitation Filename
The sampled excitation waveform used by the software can be loaded
from a file. The file is expected to contain ASCII text in the
following order: first, an integer giving the total number of
samples; second, a floating point value by which the samples
are scaled when used; finally, the samples themselves, as that number of
integers.
.TP
.B \-r
.I Raw Samples Output Type
Selecting this flag writes the output waveform as a raw binary
file, rather than as ASCII integers. Two types are available: type 1
writes each sample high byte first (big-endian), and type 2 writes each
sample low byte first (little-endian).
.TP
.B \-F
.I Percent f0 Flutter
The percentage of f0 flutter to be applied to the synthesized speech,
as described in [2]. f0 flutter is an attempt to reduce the lack of
naturalness that constant values of f0 introduce into synthetic speech. A
small amount of quasi-random f0 flutter is applied when this value is
greater than 0.
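.PP
As a minimal sketch of the sampled excitation file layout described under
the \-V option, the following Python fragment builds and re-reads such a file.
It assumes the values are separated by whitespace, which the description
does not state explicitly. The function names are hypothetical and are not
part of the klatt distribution.
.EX
```python
def format_excitation(samples, scale):
    """Return the file text: sample count, scale factor, then the samples."""
    fields = [str(len(samples)), repr(float(scale))]
    fields.extend(str(int(s)) for s in samples)
    return " ".join(fields)


def parse_excitation(text):
    """Parse the text back into (samples, scale), checking the count."""
    fields = text.split()
    count = int(fields[0])
    scale = float(fields[1])
    samples = [int(f) for f in fields[2:2 + count]]
    if len(samples) != count:
        raise ValueError("file holds fewer samples than its header claims")
    return samples, scale


# Assumed layout: count, scale, then the integer samples.
text = format_excitation([0, 120, 340, 120, 0, -200], 0.5)
print(text)  # 6 0.5 0 120 340 120 0 -200
```
.EE
Checking the declared count against the samples actually present catches
truncated files before they reach the synthesizer.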
.SH INPUT FILE FORMAT
The input file consists of a series of parameter frames. Each frame of
parameters usually represents 10 ms of audio output, although this
figure can be adjusted down to 1 ms per frame. The parameters in each
frame are described below. To avoid confusion, note that the cascade
and parallel branches of the synthesizer duplicate some of the control
parameters.
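.PP
The number of output samples covered by one frame follows from the sample
rate and the frame length. The Python sketch below is plain arithmetic
rather than code taken from the synthesizer. The function name is
hypothetical, and truncating to a whole number of samples is an assumption.
.EX
```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of output samples covered by one parameter frame."""
    # Assumption: any fractional sample is truncated.
    return sample_rate_hz * frame_ms // 1000


# Defaults: 10000 Hz and 10 ms per frame.
print(samples_per_frame(10000, 10))  # 100
# Settings used in the EXAMPLES section: 16000 Hz and 5 ms per frame.
print(samples_per_frame(16000, 5))   # 80
```
.EE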
.TP
.B \ f0
This is the fundamental frequency (pitch) of the utterance.
It is specified in steps of 0.1 Hz, so 100 Hz
is represented by a value of 1000.
.TP
.B \ av
Amplitude of voicing for the cascade branch of the
synthesizer in dB. Range 0-70, value usually about 60 for a vowel sound.
.TP
.B \ f1
First formant frequency. Range usually 200-1300 Hz.
.TP
.B \ b1
Cascade branch, bandwidth of first formant. Range usually 40-1000 Hz.
.TP
.B \ f2
Second formant frequency. Range usually 550-3000 Hz.
.TP
.B \ b2
Cascade branch, bandwidth of second formant. Range usually 40-1000 Hz.
.TP
.B \ f3
Third formant frequency. Range usually 1200-4999 Hz.
.TP
.B \ b3
Cascade branch bandwidth of third formant. Range usually 40-1000 Hz.
.TP
.B \ f4
Fourth formant frequency. Range usually 1200-4999 Hz.
.TP
.B \ b4
Cascade branch, bandwidth of fourth formant. Range usually 40-1000 Hz.
.TP
.B \ f5
Fifth formant frequency. Range usually 1200-4999 Hz.
.TP
.B \ b5
Cascade branch, bandwidth of fifth formant. Range usually 40-1000 Hz.
.TP
.B \ f6
Sixth formant frequency. Range usually 1200-4999 Hz.
.TP
.B \ b6
Cascade branch, bandwidth of sixth formant. Range usually 40-2000 Hz.
.TP
.B \ fnz
Frequency of the nasal zero. Range usually 248-528 Hz (cascade branch only).
.TP
.B \ bnz
Bandwidth of the nasal zero. Range usually 40-1000 Hz (cascade branch only).
.TP
.B \ fnp
Frequency of the nasal pole. Range usually 248-528 Hz.
.TP
.B \ bnp
Bandwidth of the nasal pole. Range usually 40-1000 Hz.
.TP
.B \ asp
Amplitude of aspiration in dB, range 0-70.
.TP
.B \ kopen
Open quotient of voicing waveform, range 0-60, usually 30.
Will influence the gravelly or smooth quality of the voice.
Only works with impulse and natural simulations. For the
sampled glottal excitation waveform the open quotient is fixed.
.TP
.B \ aturb
Amplitude of turbulence in dB, range 0-80. A value of 40 is useful. Can be
used to simulate "breathy" voice quality.
.TP
.B \ tilt
Spectral tilt in dB, range 0-24. Tilts down the output spectrum.
The value refers to dB down at 3 kHz. Increasing the value emphasizes
the low frequency content of the speech and attenuates the high
frequency content.
.TP
.B \ af
Amplitude of frication in dB, range 0-80 (parallel branch).
.TP
.B \ skew
Spectral skew: skewness of alternate periods, range 0-40.
.TP
.B \ a1
Amplitude of first formant in the parallel branch in dB, range 0-80.
.TP
.B \ b1p
Bandwidth of the first formant in the parallel branch, in Hz.
.TP
.B \ a2
Amplitude of parallel branch second formant.
.TP
.B \ b2p
Bandwidth of parallel branch second formant.
.TP
.B \ a3
Amplitude of parallel branch third formant.
.TP
.B \ b3p
Bandwidth of parallel branch third formant.
.TP
.B \ a4
Amplitude of parallel branch fourth formant.
.TP
.B \ b4p
Bandwidth of parallel branch fourth formant.
.TP
.B \ a5
Amplitude of parallel branch fifth formant.
.TP
.B \ b5p
Bandwidth of parallel branch fifth formant.
.TP
.B \ a6
Amplitude of parallel branch sixth formant.
.TP
.B \ b6p
Bandwidth of parallel branch sixth formant.
.TP
.B \ anp
Amplitude of the parallel branch nasal formant.
.TP
.B \ ab
Amplitude of bypass frication in dB, range 0-80.
.TP
.B \ avp
Amplitude of voicing for the parallel branch in dB, range 0-70.
.TP
.B \ gain
Overall gain in dB, range 0-80.
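.PP
A minimal sketch of reading a parameter file follows. It assumes that each
frame is stored as 40 whitespace-separated integers in the order in which the
parameters are listed above; check the example parameter files in the
distribution before relying on this. The names PARAMS, read_frames and
f0_in_hz are hypothetical.
.EX
```python
# Parameter names in the order listed in this section (40 in total).
PARAMS = (
    "f0 av f1 b1 f2 b2 f3 b3 f4 b4 f5 b5 f6 b6 "
    "fnz bnz fnp bnp asp kopen aturb tilt af skew "
    "a1 b1p a2 b2p a3 b3p a4 b4p a5 b5p a6 b6p anp ab avp gain"
).split()


def read_frames(text):
    """Split the text into frames, one dict of integers per frame."""
    values = [int(v) for v in text.split()]
    if len(values) % len(PARAMS) != 0:
        raise ValueError("parameter count is not a whole number of frames")
    return [
        dict(zip(PARAMS, values[i:i + len(PARAMS)]))
        for i in range(0, len(values), len(PARAMS))
    ]


def f0_in_hz(frame):
    """f0 is stored in steps of 0.1 Hz, so 1000 means 100 Hz."""
    return frame["f0"] / 10.0


# One frame with f0 = 1000 (100 Hz), av = 60 and every other value 0.
frame_text = " ".join(["1000", "60"] + ["0"] * 38)
frames = read_frames(frame_text)
print(len(frames), f0_in_hz(frames[0]))  # 1 100.0
```
.EE
Keying each frame by parameter name avoids positional mistakes when the
cascade and parallel branches use similar names such as b1 and b1p.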
.SH EXAMPLES
Included with the distribution are two example parameter files. They may be
synthesized using the command line:
.EX
klatt -i example.par -o example.raw -f 5 -v 2 -s 16000 -r 1
.EE
This produces raw 16 bit signed integers. A package like sox can be
used to convert to your favourite audio format. For example,
conversion to the ulaw encoded format used by Sun Sparc SLCs is given
below.
.EX
sox -r 16000 -s -w example.raw -r 8000 -b -U example.au
.EE
Beware of the byte ordering of your machine: if the above procedure
produces distorted rubbish, try using -r 2 instead of -r 1. This just
reverses the byte ordering in the raw binary output file. Note also
that the conversion above reduces the quality of the output,
because both the sampling rate and the number of bits per sample
are halved. Ideally output should be at 16kHz with 16 bits per
sample.
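.PP
The raw output can also be wrapped in a WAV header using only the Python
standard library. The sketch below assumes mono, signed 16 bit samples
written with -r 1 (high byte first), as in the klatt command line above.
The function names are hypothetical.
.EX
```python
import array
import sys
import wave


def raw_to_samples(data, high_byte_first=True):
    """Decode raw signed 16 bit samples into native-order integers."""
    samples = array.array("h")
    samples.frombytes(data)
    # Swap bytes only when the file order differs from the machine order.
    if high_byte_first != (sys.byteorder == "big"):
        samples.byteswap()
    return samples


def write_wav(path, samples, sample_rate):
    """Write mono 16 bit samples to a WAV file (always little-endian)."""
    if sys.byteorder == "big":
        samples = array.array("h", samples)
        samples.byteswap()
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(sample_rate)
        out.writeframes(samples.tobytes())


# Bytes 01 00 and FF FE, high byte first, decode to 256 and -2.
print(list(raw_to_samples(bytes([0x01, 0x00, 0xFF, 0xFE]))))  # [256, -2]

# Typical use, assuming example.raw was produced as shown above:
# with open("example.raw", "rb") as f:
#     write_wav("example.wav", raw_to_samples(f.read()), 16000)
```
.EE
Passing high_byte_first=False handles files written with -r 2 instead.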
.SH BUGS
I have not had a chance to test loading a sampled excitation waveform
from a file. Please let me know if there are problems.
.PP
My research does not (yet) require me to use the synthesizer in its
primary mode, which is combined cascade-parallel operation. I have
primarily used the synthesizer in parallel only mode. I would
appreciate any comments regarding use of the cascade branch.
.PP
Finally, there is no protection against rapid parameter changes. Large
jumps in many of the parameters will cause clicks and pops in the
output. This may be remedied in future with some form of parameter
clamping that becomes effective when parameters exceed a set rate of
change.
.PP
All bug reports and queries to Jon Iles, ([email protected])
University of Birmingham, School of Computer Science, Edgbaston,
Birmingham. B29 7PY. UK.
.SH AUTHORS
Jon Iles ([email protected])
.br
Nick Ing-Simmons ([email protected])
.SH ACKNOWLEDGEMENTS
Many thanks to Tony Robinson for his help and support.
Thanks also to Alan Black, Paul Callaghan, Johannes Kiehl, Arthur
Dirksen and Gary Murphy for prompt bug spotting and feedback, and to Mark
Thornton for help with C7.
.SH REFERENCES
.B [1]
Klatt, D.H. Software for a cascade/parallel formant synthesizer.
Journal of the Acoustical Society of America, volume 67,
number 3, pages 971-995, March 1980.
.PP
.B [2]
Klatt, D.H. and Klatt, L.C. Analysis, synthesis and perception of voice
quality variations among female and male talkers. Journal of
the Acoustical Society of America, volume 87, number 2, pages
820-857, February 1990.