Speaker recognition for stand-alone or Web applications
VeriSpeak voice identification technology is designed for biometric system developers and integrators. The text-dependent speaker recognition algorithm assures system security by checking both voice and phrase authenticity. Voiceprint templates can be matched in 1-to-1 (verification) and 1-to-many (identification) modes.
VeriSpeak is available as a software development kit (SDK) that enables the development of stand-alone and Web-based speaker recognition applications on Microsoft Windows, Linux, macOS, iOS and Android platforms.
Technical Specifications and Usage Recommendations
- The speaker recognition accuracy of MegaMatcher depends on the audio quality during enrollment and identification.
- Voice samples at least 2 seconds long are recommended to assure speaker recognition quality.
- A passphrase should be kept secret and not spoken in an environment where others may hear it if the speaker recognition system is used in a scenario with unique phrases for each user.
- Text-independent speaker recognition may be vulnerable to an attack that uses a covertly recorded phrase from a person. Passphrase verification or two-factor authentication (e.g. a requirement to type a password) will increase the overall system security.
Microphones: there are no particular constraints on models or manufacturers when using regular PC microphones, headsets or the built-in microphones in laptops, smartphones and tablets.
However, these factors should be noted:
- The same microphone model is recommended (if possible) for use during both enrollment and recognition, as different models may produce different sound quality. Some models may also introduce specific noise or distortion into the audio, or may include certain hardware sound processing, which will not be present when using a different model. This is also the recommended procedure when using smartphones or tablets, as different device models may alter the recording of the voice in different ways.
- The same microphone position and distance are recommended during enrollment and recognition. Headsets provide optimal distance between user and microphone; this distance is recommended when non-headset microphones are used.
- Microphones built into webcams should be used with care, as they are usually positioned rather far from the user and may provide lower sound quality. The sound quality may also be affected if users subsequently change their position relative to the webcam.
- Settings for clear sound must be ensured; some audio software, hardware or drivers may have sound modification enabled by default. For example, the Microsoft Windows OS usually has, by default, sound boost enabled.
- A minimum 11025 Hz sampling rate, with at least 16-bit depth, should be used during voice recording.
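The recording recommendations above can be checked automatically before a sample is passed on for enrollment or recognition. The following is a minimal sketch using only Python's standard `wave` module, and it assumes the recording is stored as an uncompressed PCM WAV file. The function name `check_recording` and the constant names are illustrative and are not part of the VeriSpeak SDK; the 2-second minimum comes from the voice sample length recommendation earlier on this page.

```python
import wave

# Thresholds taken from the recommendations on this page.
MIN_SAMPLE_RATE_HZ = 11025    # minimum sampling rate
MIN_SAMPLE_WIDTH_BYTES = 2    # 16-bit depth = 2 bytes per sample
MIN_DURATION_S = 2.0          # recommended minimum voice sample length


def check_recording(path):
    """Return a list of problems found in a PCM WAV file (empty if none).

    Illustrative helper only; it is not an SDK function.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()

    duration = frames / rate if rate else 0.0
    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"bit depth {width * 8} is below 16 bits")
    if duration < MIN_DURATION_S:
        problems.append(f"duration {duration:.2f} s is below {MIN_DURATION_S} s")
    return problems
```

A call such as `check_recording("enroll.wav")` (a hypothetical file name) returns an empty list when the file meets all three recommendations. The check covers only the container parameters; it says nothing about noise or about sound modification applied by drivers.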
Environment constraints: the MegaMatcher speaker recognition engine is sensitive to noise or loud voices in the background; they may interfere with the user's voice and affect the recognition results.
These solutions may be considered to reduce or eliminate these problems:
- A quiet environment for enrollment and recognition.
- Several samples of the same phrase recorded in different environments can be stored in a biometric template. Later the user will be matched against these samples with much higher recognition quality.
- Close-range microphones (like those in headsets or smartphones) that are not affected by distant sources of sound.
- Third-party or custom solutions for background noise reduction, such as using two separate microphones for recording user voice and background sound, and later subtracting the background noise from the recording.
User behavior and voice changes:
Natural voice changes may affect speaker recognition accuracy:
- a temporarily hoarse voice caused by a cold or other sickness;
- different emotional states that affect voice (e.g. a cheerful voice versus a tired voice);
- different pronunciation speeds during enrollment and identification.
The aforementioned voice and user behavior changes can be managed in two ways:
- separate enrollments for the altered voice, storing the records in the same person's template;
- a controlled, neutral voice during enrollment and identification.
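The first option, keeping several enrollment records in one person's template, can be pictured as taking the best score over all stored records. The sketch below is a toy model under stated assumptions: `PersonTemplate`, `toy_score` and `best_score` are hypothetical names, a record is represented as a plain sequence of numbers, and the similarity function is a placeholder based on Euclidean distance. None of this reflects the actual VeriSpeak template format or matching algorithm.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical stand-in for one voiceprint record.
Record = Sequence[float]


@dataclass
class PersonTemplate:
    """One person's template holding one record per enrollment."""
    person_id: str
    records: List[Record] = field(default_factory=list)


def toy_score(a: Record, b: Record) -> float:
    """Placeholder similarity (higher is more similar); not the real matcher."""
    return -math.dist(a, b)


def best_score(probe: Record, template: PersonTemplate,
               score_fn: Callable[[Record, Record], float] = toy_score) -> float:
    """Score a probe against every stored record and keep the best result."""
    if not template.records:
        raise ValueError("template has no enrolled records")
    return max(score_fn(probe, record) for record in template.records)
```

Under this model, adding a record enrolled with the altered voice can only raise or preserve the best score for a probe spoken in that voice, which is the intuition behind enrolling the same person more than once.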
All voice templates should be loaded into RAM before identification, so the maximum voice template database size is limited by the amount of available RAM.
The voiceprint template size depends linearly on the voice sample length. For example, voice samples that are half as long produce templates that are half as large.
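As a rough capacity estimate, the linear relationship can be combined with the record size quoted in the specifications table on this page (3,500 to 4,500 bytes for a 5-second sample). The sketch below assumes that size scales in direct proportion to sample length, as the halving example suggests, and ignores any per-template or database overhead. The function and constant names are illustrative only.

```python
# Reference point from the specifications table: a 5-second voice sample
# yields a single voiceprint record of 3,500 to 4,500 bytes.
REFERENCE_SAMPLE_S = 5.0
REFERENCE_RECORD_BYTES = (3500, 4500)


def record_size_bytes(sample_seconds):
    """Estimate the (low, high) record size for a given sample length.

    Assumes size is directly proportional to sample length.
    """
    scale = sample_seconds / REFERENCE_SAMPLE_S
    low, high = REFERENCE_RECORD_BYTES
    return (low * scale, high * scale)


def max_records_in_ram(ram_bytes, sample_seconds):
    """Conservative count of records that fit in the given amount of RAM.

    Uses the upper size estimate and ignores all other memory use.
    """
    _, high = record_size_bytes(sample_seconds)
    return int(ram_bytes // high)
```

For instance, `record_size_bytes(2.5)` gives half the reference range, and `max_records_in_ram(2**30, 5.0)` estimates how many 5-second records fit in 1 GiB. Real deployments need headroom for the operating system and the application itself, so the result is an upper bound rather than a sizing guarantee.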
The VeriSpeak 11.2 text-dependent engine can perform template matching in two modes:
- Fixed phrase – each subject in the database has recorded the same phrase. This mode provides faster matching, but lower reliability.
- Unique phrase – each subject in the database has recorded a unique phrase. This mode provides higher reliability, but slower matching.
The VeriSpeak biometric template extraction and matching algorithm is designed to run on multi-core processors, which allows it to reach the maximum possible performance on the available hardware.
|VeriSpeak 11.2 text-dependent voiceprint engine specifications|
|Template extraction components||Mobile
|Template extraction time (seconds)||1.34 (1)||1.20 (1)||1.34 (2)||0.60 (2)|
|Template matching components||Mobile Voice Matcher||Voice Matcher|
|Template matching speed, fixed phrase mode (voiceprints per second)||100 (1)||8,000 (2)|
|Template matching speed, unique phrase mode (voiceprints per second)||20 (1)||1,700 (2)|
|Single voiceprint record size in a template, when 5-second voice samples are used (bytes)||3,500 - 4,500|
(1) Requires an Android device based on at least a Snapdragon S4 system-on-chip with a Krait 300 processor (4 cores, 1.51 GHz).
(2) Requires a PC or laptop with at least an Intel Core i7-4771 processor.
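The matching speeds in the table give a quick way to estimate how long a 1-to-many identification might take for a given database size. The sketch below simply divides the number of stored voiceprints by the quoted speed. It assumes the whole database is scanned once, that the hardware matches the footnoted reference devices, and that template extraction time is counted separately. The dictionary keys and function name are illustrative.

```python
# Matching speeds (voiceprints per second) from the specifications table.
MATCH_SPEED = {
    ("mobile", "fixed"): 100,     # Mobile Voice Matcher, fixed phrase mode
    ("mobile", "unique"): 20,     # Mobile Voice Matcher, unique phrase mode
    ("pc", "fixed"): 8000,        # Voice Matcher, fixed phrase mode
    ("pc", "unique"): 1700,       # Voice Matcher, unique phrase mode
}


def identification_seconds(voiceprints, platform, mode):
    """Estimate the time to match one probe against the whole database."""
    return voiceprints / MATCH_SPEED[(platform, mode)]
```

For example, `identification_seconds(8000, "pc", "fixed")` corresponds to one second of matching on the reference PC, while the same database in unique phrase mode takes several times longer, reflecting the speed difference between the two modes in the table.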