SDK Contents and Technical Information
The Neurotechnology AI SDK is intended for developers who want to use our automated speech recognition and speaker diarization engines in their on-premises systems. The SDK allows rapid development of speech-to-text applications. Developers provide their own audio input, text data, etc., and have complete control over the output data, so the Neurotechnology AI SDK functions can be used with any user interface or integrated into third-party systems.
List of components
The table below lists the components of the Neurotechnology AI SDK:
| Components | Microsoft Windows | Linux |
|---|---|---|
| Automated speech recognition engine (more info) | | |
| • ASR Dev. Edition | 1 single computer license | |
| • ASR-10 | Optionally available | |
| • ASR-30 | Optionally available | |
| • ASR-100 | Optionally available | |
| Speaker diarization engine (more info) | | |
| • Diarization Dev. Edition | 1 single computer license | |
| • Diarization-20 | Optionally available | |
| • Diarization-60 | Optionally available | |
| • Diarization-200 | Optionally available | |
| Wrappers for programming languages and platforms | | |
| • Python | + | + |
| • C++ | + | + |
| • .NET | + | |
| • Java | + | + |
| Simple programming samples | | |
| • .NET | + | |
| • Java | + | + |
| Diarization programming samples | | |
| • Java | + | + |
| Documentation | | |
| • Neurotechnology AI SDK documentation | + | |
Automated Speech Recognition Engine
The Automatic Speech Recognition (ASR) engine is responsible for transcribing audio samples into text. The ASR engine can use the output of the Speaker Diarization engine to process recordings with multiple speakers.
The ASR engine is available as these components, which differ in performance capabilities:
- ASR Dev. Edition – intended for initial integration, development, testing, and small-scale production workloads. It is designed to run on a PC with ADD: CPU, GPU and RAM requirements. One license is included with the Neurotechnology AI SDK.
- ASR-10 – intended for systems with small-scale processing capabilities, such as transcribing up to several thousand short phone calls per day. It is designed to run on a server with ADD: CPU, GPU and RAM requirements. Licenses for this component are optionally available with the Neurotechnology AI SDK.
- ASR-30 – intended for systems with moderate throughput, such as immediate phone call processing. It is designed to run on a server with ADD: CPU, GPU and RAM requirements. Licenses for this component are optionally available with the Neurotechnology AI SDK.
- ASR-100 – intended for large-scale systems that handle high volumes of audio recordings, such as immediate processing of recent TV or radio shows, or of audio/video uploaded to social media platforms. It is designed to run on a server with ADD: CPU, GPU and RAM requirements. Licenses for this component are optionally available with the Neurotechnology AI SDK.
- A custom version of the ASR engine with even higher performance is available upon request.
Automated speech recognition engine performance

| Component | ASR Dev. Edition | ASR-10 | ASR-30 | ASR-100 |
|---|---|---|---|---|
| Processing speed (seconds of recording processed per one real-time second) | 2 | 10 | 30 | 100 |
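To make the processing-speed factors above concrete, here is a minimal sketch (illustrative only, not part of the SDK) that converts a recording length and a speed factor from the table into an estimated transcription time:

```python
# Illustrative only: estimates transcription time from the processing-speed
# factors in the table above (seconds of recording per one real-time second).
ASR_SPEED_FACTORS = {
    "ASR Dev. Edition": 2,
    "ASR-10": 10,
    "ASR-30": 30,
    "ASR-100": 100,
}

def estimated_transcription_seconds(recording_seconds: float, component: str) -> float:
    """Rough time estimate for transcribing a recording with the given ASR component."""
    return recording_seconds / ASR_SPEED_FACTORS[component]

# Example: a one-hour recording (3600 s) with ASR-30 takes about 3600 / 30 = 120 s.
print(estimated_transcription_seconds(3600, "ASR-30"))  # 120.0
```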
Speaker Diarization Engine
The Speaker Diarization engine is responsible for recognizing who is speaking and when in a recording, and for marking the detected speakers with timestamps in that recording. The output of the diarization engine can be used by the ASR engine to process recordings with multiple speakers, as in the sketch below.
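A minimal sketch of this diarization-assisted transcription flow, assuming the Python wrapper, is given below. The module, class and method names (`neurotech_ai`, `SpeakerDiarizer`, `SpeechRecognizer`, `diarize`, `transcribe`) are hypothetical placeholders, not the actual SDK API; refer to the SDK documentation and the programming samples for the real interface.

```python
# Hypothetical sketch: diarization output (speaker-labelled time segments) is used
# to drive speech recognition on a multi-speaker recording. All names below are
# placeholders, NOT the real Neurotechnology AI SDK API.
from neurotech_ai import SpeakerDiarizer, SpeechRecognizer  # hypothetical module and classes

def transcribe_multi_speaker(audio_path: str) -> list[tuple[str, float, float, str]]:
    diarizer = SpeakerDiarizer()      # hypothetical
    recognizer = SpeechRecognizer()   # hypothetical

    # Step 1: diarization - who speaks, and when (speaker label + start/end timestamps).
    segments = diarizer.diarize(audio_path)  # hypothetical call

    # Step 2: speech recognition - transcribe each speaker-labelled segment.
    results = []
    for segment in segments:
        text = recognizer.transcribe(audio_path,
                                     start=segment.start,
                                     end=segment.end)  # hypothetical call
        results.append((segment.speaker, segment.start, segment.end, text))
    return results
```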
The Speaker Diarization engine is available as these components, which differ in performance capabilities:
- Diarization Dev. Edition – intended for initial integration, development, testing, and small-scale production workloads. It is designed to run on a PC with ADD: CPU, GPU and RAM requirements. One license is included with the Neurotechnology AI SDK.
- Diarization-20 – intended for systems with small-scale processing capabilities, such as transcribing up to several thousand short phone calls per day. It is designed to run on a server with ADD: CPU, GPU and RAM requirements. Licenses for this component are optionally available with the Neurotechnology AI SDK.
- Diarization-60 – intended for systems with moderate throughput, such as immediate phone call processing. It is designed to run on a server with ADD: CPU, GPU and RAM requirements. Licenses for this component are optionally available with the Neurotechnology AI SDK.
- Diarization-200 – intended for large-scale systems that handle high volumes of audio recordings, such as immediate processing of recent TV or radio shows, or of audio/video uploaded to social media platforms. It is designed to run on a server with ADD: CPU, GPU and RAM requirements. Licenses for this component are optionally available with the Neurotechnology AI SDK.
- A custom version of the Speaker Diarization engine with even higher performance is available upon request.
Speaker diarization engine performance

| Component | Diarization Dev. Edition | Diarization-20 | Diarization-60 | Diarization-200 |
|---|---|---|---|---|
| Processing speed (seconds of recording processed per one real-time second) | 4 | 20 | 60 | 200 |
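Since diarization and speech recognition are typically run one after the other on the same recording, the two performance tables can be combined into a rough end-to-end estimate. The sketch below is illustrative only; the sequential-processing assumption and the helper function are not taken from the SDK, while the speed factors come from the tables above.

```python
# Illustrative only: rough end-to-end estimate assuming diarization runs first
# and speech recognition runs afterwards on the same recording.
def estimated_pipeline_seconds(recording_seconds: float,
                               diarization_factor: float,
                               asr_factor: float) -> float:
    """Speed factors are 'seconds of recording per one real-time second' from the tables."""
    return recording_seconds / diarization_factor + recording_seconds / asr_factor

# Example: a 5-minute phone call (300 s) with Diarization-20 and ASR-10:
# 300 / 20 + 300 / 10 = 15 + 30 = 45 seconds.
print(estimated_pipeline_seconds(300, diarization_factor=20, asr_factor=10))  # 45.0
```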
Usage Recommendations
- The recording should be stored in RAM before speaker diarization and speech recognition are performed, so enough free RAM should be ensured. Very long, multi-hour recordings can be divided into shorter chunks (see the pre-processing sketch after this list).
- Voice recordings of at least 1 second in length are recommended to ensure the quality of speech recognition.
- Microphones – there are no particular constraints on models or manufacturers when using regular PC microphones, headsets or the built-in microphones in laptops, smartphones and tablets.
- A constant sound level (loudness) is recommended to ensure the quality of speech recognition.
- Settings for clear sound must be ensured, as some audio software, hardware or drivers may have sound modification enabled by default. For example, Microsoft Windows usually has sound boost enabled by default.
- A sampling rate of at least 8000 Hz, with at least 16-bit depth, should be used during voice recording.
- Environment constraints – in general, the speaker diarization and speech recognition engines produce the best results when clear voice recordings are provided. These specific considerations help ensure better recognition quality:
  - Background noise, which interferes with the speaker's voice, can affect the recognition results; third-party or custom solutions for background noise reduction can be used to pre-process the voice recordings.
  - Multiple people speaking at the same time can affect the recognition results.
  - Short overlaps of speech are acceptable.
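The sketch below illustrates the pre-processing steps mentioned in the recommendations above: checking the sampling rate and splitting a long recording into shorter chunks. It uses the third-party `soundfile` package as an example, and the 10-minute chunk length is an arbitrary illustration, not an SDK requirement.

```python
# Illustrative pre-processing sketch: verify the sampling rate and split a long
# recording into shorter chunks before diarization / speech recognition.
# Uses the third-party `soundfile` package; the 10-minute chunk length is an
# arbitrary example, not an SDK requirement.
import soundfile as sf

MIN_SAMPLE_RATE = 8000   # minimum recommended sampling rate (Hz)
CHUNK_SECONDS = 10 * 60  # example chunk length: 10 minutes

def split_recording(path: str, out_prefix: str) -> list[str]:
    data, sample_rate = sf.read(path)
    if sample_rate < MIN_SAMPLE_RATE:
        raise ValueError(f"Sampling rate {sample_rate} Hz is below the recommended 8000 Hz")

    chunk_len = CHUNK_SECONDS * sample_rate
    paths = []
    for i in range(0, len(data), chunk_len):
        chunk_path = f"{out_prefix}_{i // chunk_len:03d}.wav"
        sf.write(chunk_path, data[i:i + chunk_len], sample_rate)
        paths.append(chunk_path)
    return paths
```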