VeriSpeak SDK

Speaker recognition for stand-alone or Web applications

VeriSpeak voice identification technology is designed for biometric system developers and integrators. The text-dependent speaker recognition algorithm assures system security by checking both voice and phrase authenticity. Voiceprint templates can be matched in 1-to-1 (verification) and 1-to-many (identification) modes.
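The difference between the two matching modes can be sketched in a few lines. This is an illustration only, not the VeriSpeak API: real voiceprint templates are opaque, so plain feature vectors, the cosine score and the 0.8 threshold used here are all assumed stand-ins, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Stand-in similarity score between two feature vectors (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(probe, enrolled, threshold=0.8):
    """1-to-1: compare the probe with the single template of the claimed user."""
    return cosine_similarity(probe, enrolled) >= threshold


def identify(probe, gallery, threshold=0.8):
    """1-to-many: compare the probe with every template in the gallery.

    Returns the best-scoring user id, or None if no score reaches the threshold.
    """
    best_id, best_score = None, threshold
    for user_id, enrolled in gallery.items():
        score = cosine_similarity(probe, enrolled)
        if score >= best_score:
            best_id, best_score = user_id, score
    return best_id
```

Verification answers "is this the person they claim to be?" with one comparison, while identification answers "who is this?" and its cost grows with the number of enrolled templates.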

VeriSpeak is available as a software development kit for building stand-alone and Web-based speaker recognition applications on Microsoft Windows, Linux, Mac OS X, iOS and Android platforms.

Basic Recommendations for Speaker Recognition

The speaker recognition accuracy of VeriSpeak depends on the audio quality during enrollment and identification. The constraints below should be considered before and during integration of the algorithm into a speaker recognition system. Some sources of variation can be offset by enrolling the same phrase in several different environments.

Voice samples of at least 2 seconds in length are recommended to assure recognition quality.

General Security

If the speaker recognition system is used in a scenario with a unique phrase for each user, the passphrase should be kept secret and not spoken where others may hear it.

Text-independent speaker recognition may be vulnerable to an attack that uses a covertly recorded phrase from the person. Passphrase verification or two-factor authentication (e.g. requiring a typed password) will increase the overall system security.

Microphones

There are no particular constraints on models or manufacturers when using regular PC microphones, headsets or the built-in microphones in laptops, smartphones and tablets. However, these factors should be noted:

  • The same microphone model is recommended (if possible) for use during both enrollment and recognition, as different models may produce different sound quality. Some models may also introduce specific noise or distortion into the audio, or may include certain hardware sound processing, which will not be present when using a different model. This is also the recommended procedure when using smartphones or tablets, as different device models may alter the recording of the voice in different ways.
  • The same microphone position and distance are recommended during enrollment and recognition. Headsets provide optimal distance between user and microphone; this distance is recommended when non-headset microphones are used.
  • Built-in webcam microphones should be used with care, as they are usually positioned relatively far from the user and may provide lower sound quality. The sound quality may also change if users later shift their position relative to the webcam.

Sound Settings

Sound settings must be configured for clear, unprocessed audio; some audio software, hardware or drivers have sound modification enabled by default. For example, Microsoft Windows usually has sound boost enabled by default.

A minimum 11025 Hz sampling rate, with at least 16-bit depth, should be used during voice recording.
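A recording can be checked against these figures before it is passed on for enrollment or matching. The sketch below uses only Python's standard `wave` module and assumes the audio is stored as an uncompressed PCM WAV file; the function name and constants are illustrative, and the duration check measures the whole recording, not the amount of speech in it.

```python
import wave

MIN_SAMPLE_RATE_HZ = 11025   # minimum sampling rate recommended above
MIN_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
MIN_DURATION_S = 2.0         # recommended minimum voice sample length


def check_recording(source):
    """Return a list of problems found in a PCM WAV recording (empty if none).

    `source` is a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()

    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample depth {8 * width} bits is below 16 bits")
    duration = frames / rate
    if duration < MIN_DURATION_S:
        problems.append(f"duration {duration:.2f} s is below {MIN_DURATION_S} s")
    return problems
```

Rejecting a recording at capture time, with a message the user can act on, is usually cheaper than diagnosing a failed match later.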

Environment Constraints

The VeriSpeak speaker recognition algorithm is sensitive to noise or loud voices in the background; they may interfere with the user's voice and affect the recognition results. The following measures may reduce or eliminate these problems:

  • A quiet environment for enrollment and recognition.
  • Several samples of the same phrase recorded in different environments can be stored in a biometric template. The user is later matched against all of these samples, which improves recognition quality.
  • Close-range microphones (like those in headsets or smartphones) that are not affected by distant sources of sound.
  • Third-party or custom solutions for background noise reduction, such as using two separate microphones for recording user voice and background sound, and later subtracting the background noise from the recording.
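The two-microphone idea in the last item can be illustrated with a deliberately simple sketch. It assumes both channels are sample-aligned lists of numbers and that the background noise reaches the voice microphone as a scaled copy of what the second microphone records; the function name is hypothetical. Practical noise cancellation typically relies on adaptive filtering instead, so this is not a production method.

```python
def subtract_reference_noise(primary, reference):
    """Remove the part of `primary` that is a scaled copy of `reference`.

    primary   -- samples from the voice microphone (voice + noise)
    reference -- samples from the background microphone (noise only)
    """
    if len(primary) != len(reference):
        raise ValueError("channels must have the same number of samples")

    energy = sum(r * r for r in reference)
    if energy == 0:
        return list(primary)  # silent reference: nothing to subtract

    # Least-squares estimate of how strongly the noise leaks into `primary`.
    gain = sum(p * r for p, r in zip(primary, reference)) / energy
    return [p - gain * r for p, r in zip(primary, reference)]
```

The gain estimate is only accurate when the voice and the noise are uncorrelated over the analysed window; any voice picked up by the reference microphone would be partly subtracted as well.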

User Behavior and Voice Changes

Natural voice changes may affect speaker recognition accuracy:

  • a temporarily hoarse voice caused by a cold or other sickness;
  • different emotional states that affect voice (e.g. a cheerful voice versus a tired voice);
  • different pronunciation speeds during enrollment and identification.

The aforementioned voice and user behavior changes can be managed in two ways:

  • separate enrollments for the altered voice, storing the records in the same person's template;
  • a controlled, neutral voice during enrollment and identification.
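Storing several enrollments in one person's template implies that a new sample is compared with each stored record. One common way to combine the results, assumed here and not taken from the VeriSpeak documentation, is to keep the best score. The `score` callable, the threshold and the function names are placeholders for whatever matcher the system actually uses.

```python
def best_score(probe, template_samples, score):
    """Score the probe against every enrolled sample and keep the highest."""
    if not template_samples:
        raise ValueError("template has no enrolled samples")
    return max(score(probe, sample) for sample in template_samples)


def matches(probe, template_samples, score, threshold):
    """Accept the probe if any enrolled sample scores at or above the threshold."""
    return best_score(probe, template_samples, score) >= threshold
```

With this rule, a hoarse-voice enrollment lets a later hoarse sample match even when the normal-voice enrollment scores poorly. Because each additional record gives every probe another chance to pass, the threshold may need to be re-evaluated as templates grow.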