Python

Model

class Model(*args, **kwargs)

Class holding a DeepSpeech model

Parameters

aModelPath (str) – Path to model file to load
aAlphabetConfigPath (str) – Path to alphabet file to load
aBeamWidth (int) – Decoder beam width

createStream(sample_rate=16000)

Create a new streaming inference state. The streaming state returned by this function can then be passed to feedAudioContent() and finishStream().

Parameters: sample_rate (int) – The sample rate of the audio signal, in Hz.
Returns: Object holding the stream
Raises: RuntimeError on error

enableDecoderWithLM(*args, **kwargs)

Enable decoding using beam scoring with a KenLM language model.

Parameters

aLMPath (str) – The path to the language model binary file.
aTriePath (str) – The path to the trie file built from the same vocabulary as the language model binary.
aLMAlpha (float) – The alpha hyperparameter of the CTC decoder: the language model weight.
aLMBeta (float) – The beta hyperparameter of the CTC decoder: the word insertion weight.

Returns

Zero on success, non-zero on failure (invalid arguments).

Type

int
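
Because enableDecoderWithLM() reports failure through its return value rather than an exception, a caller may want to convert a non-zero status into an error. The following is a minimal sketch, not part of the bindings: `enable_lm_or_raise` is a hypothetical helper, and the argument order is assumed to follow the parameter list above.

```python
def enable_lm_or_raise(model, lm_path, trie_path, lm_alpha, lm_beta):
    """Enable the KenLM-backed decoder and fail loudly on a non-zero status.

    `model` is assumed to be a Model instance; the positional order
    (LM path, trie path, alpha, beta) mirrors the documented parameters.
    """
    status = model.enableDecoderWithLM(lm_path, trie_path, lm_alpha, lm_beta)
    if status != 0:
        # The reference only says "non-zero on failure (invalid arguments)",
        # so the status code is surfaced as-is.
        raise ValueError(
            "enableDecoderWithLM failed with status %d" % status
        )
```

Raising early keeps a misconfigured language model from silently degrading later transcriptions.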

feedAudioContent(*args, **kwargs)

Feed audio samples to an ongoing streaming inference.

Parameters

aSctx (object) – A streaming state pointer returned by createStream().
aBuffer (int array) – An array of 16-bit, mono raw audio samples at the appropriate sample rate.
aBufferSize (int) – The number of samples in aBuffer.

finishStream(*args, **kwargs)

Signal the end of an audio signal to an ongoing streaming inference and return the STT result over the whole audio signal.

Parameters: aSctx (object) – A streaming state pointer returned by createStream().
Returns: The STT result.
Type: str
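
The three streaming calls described above (createStream(), feedAudioContent(), finishStream()) are meant to be used together. The sketch below shows one possible way to chain them; `transcribe_stream` and its `chunk_size` default are hypothetical, and it assumes the Python wrapper derives the buffer size from the buffer itself, so only the state and the samples are passed.

```python
def transcribe_stream(model, samples, chunk_size=320):
    """Feed `samples` to a streaming state in fixed-size chunks.

    `samples` is any sliceable sequence of 16-bit mono samples. With the
    real bindings this is assumed to be a NumPy int16 array; the
    reference itself only says "int array".
    `chunk_size` is an arbitrary illustrative value (320 samples is
    20 ms of audio at 16 kHz).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    stream = model.createStream()
    for start in range(0, len(samples), chunk_size):
        # Slicing past the end is safe: the last chunk is simply shorter.
        model.feedAudioContent(stream, samples[start:start + chunk_size])
    # finishStream() ends the stream and returns the full transcript.
    return model.finishStream(stream)
```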

finishStreamWithMetadata(*args, **kwargs)

Signal the end of an audio signal to an ongoing streaming inference and return per-letter metadata.

Parameters: aSctx (object) – A streaming state pointer returned by createStream().
Returns: A struct of individual letters along with their timing information.
Type: Metadata()

intermediateDecode(*args, **kwargs)

Compute the intermediate decoding of an ongoing streaming inference. This is an expensive process as the decoder implementation isn’t currently capable of streaming, so it always starts from the beginning of the audio.

Parameters: aSctx (object) – A streaming state pointer returned by createStream().
Returns: The STT intermediate result.
Type: str
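
Since each intermediateDecode() call restarts decoding from the beginning of the audio, it can be worth requesting partial results only every few chunks. The generator below is a rough sketch of that idea; `stream_with_partials`, `chunk_size`, and `decode_every` are hypothetical names and values, and the call shapes are assumed to match the streaming methods documented above.

```python
def stream_with_partials(model, samples, chunk_size=320, decode_every=50):
    """Yield (samples_fed, text) pairs while streaming `samples`.

    A partial transcript is requested only after every `decode_every`
    chunks to limit how often the costly full re-decode runs. The last
    pair yielded carries the final result from finishStream().
    """
    if chunk_size <= 0 or decode_every <= 0:
        raise ValueError("chunk_size and decode_every must be positive")
    stream = model.createStream()
    fed = 0
    starts = range(0, len(samples), chunk_size)
    for index, start in enumerate(starts, start=1):
        chunk = samples[start:start + chunk_size]
        model.feedAudioContent(stream, chunk)
        fed += len(chunk)
        if index % decode_every == 0:
            yield fed, model.intermediateDecode(stream)
    yield fed, model.finishStream(stream)
```

Raising `decode_every` trades responsiveness of the partial transcript for less repeated decoding work.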

stt(*args, **kwargs)

Use the DeepSpeech model to perform Speech-To-Text.

Parameters

aBuffer (int array) – A 16-bit, mono raw audio signal at the appropriate sample rate.
aBufferSize (int) – The number of samples in the audio signal.
aSampleRate (int) – The sample-rate of the audio signal.

Returns

The STT result.

Type

str
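
stt() expects a 16-bit, mono raw audio signal, so audio stored in a WAV file has to be validated and unpacked first. The following sketch uses only the standard library; `read_wav_int16` is a hypothetical helper, not part of the bindings.

```python
import array
import sys
import wave


def read_wav_int16(path):
    """Return (samples, sample_rate) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples, sample_rate
```

The returned samples and sample rate correspond to the aBuffer and aSampleRate parameters above. The real bindings may require converting the samples to a NumPy int16 array first; that is an assumption, as this reference does not say.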

sttWithMetadata(*args, **kwargs)

Use the DeepSpeech model to perform Speech-To-Text and output metadata about the results.

Parameters

aBuffer (int array) – A 16-bit, mono raw audio signal at the appropriate sample rate.
aBufferSize (int) – The number of samples in the audio signal.
aSampleRate (int) – The sample-rate of the audio signal.

Returns

A struct of individual letters along with their timing information.

Type

Metadata()

Metadata

class Metadata

Stores the entire CTC output as an array of character metadata objects

confidence(): Approximated confidence value for this transcription. This is roughly the sum of the acoustic model logit values for each timestep/character that contributed to the creation of this transcription.

items()

List of items

Returns: A list of MetadataItem() elements
Type: list

num_items()

Size of the list of items

Returns: Size of the list of items
Type: int
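
A Metadata object can be turned back into a plain transcript by joining the characters of its items. The sketch below is illustrative only: `metadata_to_text` and `_read` are hypothetical helpers. This reference lists items, num_items, and character as methods, but the generated bindings may expose them as plain attributes (an assumption), so `_read` accepts either form.

```python
def _read(obj, name):
    """Return obj.<name>, calling it first if it is exposed as a method."""
    value = getattr(obj, name)
    return value() if callable(value) else value


def metadata_to_text(metadata):
    """Concatenate the characters of the first num_items entries."""
    items = _read(metadata, "items")
    count = _read(metadata, "num_items")
    return "".join(_read(item, "character") for item in items[:count])
```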

MetadataItem

class MetadataItem

Stores each individual character, along with its timing information

character(): The character generated for transcription

start_time(): Position of the character in seconds

timestep(): Position of the character in units of 20 ms
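
Per-character timing can be grouped into per-word timing by splitting on spaces and converting the timestep of each word's first character into seconds (one timestep being 20 ms, as stated above). This is a minimal sketch: `words_with_start_times`, `_read`, and `TIMESTEP_SECONDS` are hypothetical names, and `_read` tolerates the fields being either methods (as listed here) or plain attributes (an assumption about the bindings).

```python
TIMESTEP_SECONDS = 0.02  # one timestep is 20 ms, per the timestep() description


def _read(obj, name):
    """Return obj.<name>, calling it first if it is exposed as a method."""
    value = getattr(obj, name)
    return value() if callable(value) else value


def words_with_start_times(items):
    """Group per-character items into (word, start_seconds) pairs."""
    words = []
    current = []
    word_start = None
    for item in items:
        char = _read(item, "character")
        if char == " ":
            # A space closes the current word, if there is one.
            if current:
                words.append(("".join(current), word_start))
                current = []
            continue
        if not current:
            # The first character of a word fixes the word's start time.
            word_start = _read(item, "timestep") * TIMESTEP_SECONDS
        current.append(char)
    if current:
        words.append(("".join(current), word_start))
    return words
```

The same grouping could use start_time() directly instead of converting timestep(); the conversion is shown only to make the 20 ms unit explicit.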