phonexia product portfolio - pangea.ek-solutions.co.za


Post on 04-Jan-2022


Phonexia Product Portfolio

Turning Voice into Knowledge

TABLE OF CONTENTS

About Phonexia
Phonexia Speaker Identification
Phonexia Language Identification
Phonexia Gender Identification
Phonexia Keyword Spotting
Phonexia Speech Transcription
Phonexia Speaker Diarization
Phonexia Voice Activity Detection
Phonexia Speech Quality Estimation
Phonexia Age Estimation
Phonexia Voice Inspector
Phonexia Denoiser
Integration Possibilities and Licensing

About Phonexia

Phonexia transforms voice into knowledge with its speech analytics and voice biometrics technologies. Its Phonexia Speech Platform is the first on the market to use deep neural networks exclusively for speaker identification, with highly accurate and fast results. The Phonexia Speech Platform packs a wide range of speech technologies into a single, highly modular platform that is easy to integrate with other solutions. Phonexia technologies are available through its network of integration partners. A university spin-off, Phonexia has been delivering its technologies to call centers, financial institutions and security agencies in more than 60 countries since 2006.

Phonexia Products

Phonexia Voice Biometrics identifies or searches for a speaker by comparing a recording with a previously created voiceprint. Like a fingerprint, a voiceprint can be used for voice authentication, fraud detection, speaker search or speaker spotting.

Phonexia Speech Analytics provides ready-to-analyze data on speech content using full Speech Transcription, Keyword Spotting (a phonetically based keyword search) or Language Identification.

Phonexia Voice Inspector is an out-of-the-box solution providing police forces and forensic experts with a highly accurate Speaker Identification tool that supports criminal investigations.

Phonexia Denoiser software cleans audio signals of reverberation and other noise to make them more audible to the human ear.

13 years on the market
Based in CZ, the European Union
Projects in 60 countries

Phonexia Speaker Identification

Output

XML/JSON format with all results, or results files with a log-likelihood ratio in the range (-∞; +∞) and/or a percentage score <0–100%>
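The two score types named under Output can be related by a simple mapping. The brochure does not say how a log-likelihood ratio is converted to a percentage, so the following is only a minimal sketch that assumes a plain logistic function (one common choice); the function name `llr_to_percent` is hypothetical.

```python
import math


def llr_to_percent(llr: float) -> float:
    """Map a log-likelihood ratio in (-inf, +inf) to a score in <0, 100>.

    Assumption: a plain logistic mapping. The product's actual
    mapping is not documented in the source and may differ.
    """
    # Two branches keep exp() from overflowing for large |llr|.
    if llr >= 0:
        return 100.0 / (1.0 + math.exp(-llr))
    e = math.exp(llr)
    return 100.0 * e / (1.0 + e)
```

Under this assumption an LLR of 0 (no evidence either way) maps to 50%, and larger positive values approach 100%.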

Accuracy and speed

Achieves more than 99% accuracy (0.96% Equal

Error Rate based on NIST evaluation data set).

Up to 182× faster than real-time processing

on 1 CPU core with the most precise model—

for example, a standard 8 CPU core server

processes up to 28,992 hours of audio in one day

of computing time.
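The per-server figures in this brochure can be approximated with simple arithmetic. The sketch below assumes throughput scales linearly with the number of CPU cores; the function name and its defaults are illustrative, not part of any Phonexia tool.

```python
def audio_hours_per_day(speed_factor: float, cpu_cores: int = 8,
                        hours_of_compute: float = 24.0) -> float:
    """Estimate hours of audio processed in a given compute window.

    speed_factor is the "N times faster than real time" figure for
    one CPU core. Assumption: perfect linear scaling across cores.
    """
    return speed_factor * cpu_cores * hours_of_compute


# 20x on one core, 8 cores, 24 hours of computing time
print(audio_hours_per_day(20))  # 3840.0
```

Most figures quoted in this document follow this formula (for example, 20× gives 3,840 hours for Language Identification). The "up to 28,992 hours" quoted here for 182× is lower than the 34,944 hours that strict linear scaling would give, so the formula is best read as an upper-bound estimate.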

Technology

• A calibration tool for even higher accuracy

• 1:1 (verification), 1:n and n:m (identification)

comparison possible

• The technology is language-, accent-, text-, and channel-independent

• Uses deep neural networks to generate highly

representative voiceprints

• Applies state-of-the-art channel compensation

techniques, verified by NIST evaluation

• Compatible with the widest range of

audio sources possible (applies channel

compensation techniques): GSM/CDMA, 3G,

VoIP, landlines, satphones, etc.

Input

Input format for processing:

WAV or RAW (PCM unsigned 8 or 16 bits, IEEE

float 32-bit, A-law or Mu-law, ADPCM), FLAC,

OPUS; 8 kHz+ sampling (other audio formats

automatically converted)

Minimum speech signal for enrolment:

recommended 20+ secs

Minimum speech signal for identification:

recommended 7+ secs

In specific use cases the time required for

the speaker enrolment and identification can

be much shorter.
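The input requirements above can be checked before a file is submitted. The following is a rough sketch using Python's standard `wave` module, which reads only uncompressed PCM WAV files (A-law, Mu-law, FLAC and OPUS inputs would need other tooling). The function name and the returned keys are hypothetical, and the file duration is only an upper bound on the amount of net speech.

```python
import wave

MIN_SAMPLE_RATE_HZ = 8000  # "8 kHz+ sampling" from the input specification


def check_wav(path: str, min_seconds: float = 7.0) -> dict:
    """Check sample rate and total duration of a PCM WAV file.

    min_seconds defaults to the 7 s recommended for identification;
    pass 20.0 for the enrolment recommendation.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return {
        "sample_rate_ok": rate >= MIN_SAMPLE_RATE_HZ,
        "duration_ok": duration >= min_seconds,
        "duration_s": duration,
    }
```

A recording that passes this check may still contain too little speech, because silence and technical signals count toward the file duration.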

Phonexia Speaker Identification uses the power of voice biometrics to recognize a speaker automatically by their voice. Its latest generation, called Deep Embeddings™, uses deep neural networks for even greater performance.


Phonexia Language Identification

The Phonexia Language Identification (LID) system allows the automatic detection of spoken language or dialect.

Technology

• The technology is text- and channel-independent
• Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Supported languages

Afan_Oromo, Albanian, Amharic, Arabic, Arabic_Gulf, Arabic_Iraqi, Arabic_Levantine, Arabic_Maghrebi, Arabic_MSA, Azerbaijani, Bangla_Bengali, Bosnian, Burmese, Chinese_Cantonese, Chinese_Dialects, Chinese_Mandarin, Creole, Croatian, Czech, Dari, English_American, English_British, English_Indian, Farsi, French, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Khmer, Kirundi_Kinyarwanda, Korean, Lao, Macedonian, Ndebele, Pashto, Polish, Portuguese, Punjabi, Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Tigrigna, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese

A user can add new languages to the system without assistance from Phonexia. Approximately 20 hours of audio recordings are recommended for training a new language.

Input

Input format for processing:

WAV or RAW (PCM unsigned 8 or 16 bits, IEEE

float 32-bit, A-law or Mu-law, ADPCM), FLAC,

OPUS; 8 kHz+ sampling (other audio formats

automatically converted)

Minimum speech signal for identification:

recommended 7+ secs

Output

XML/JSON format with all results, or results files with a logarithm-of-probabilities score in the range (-∞; 0> and/or a percentage score <0–100%>
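A log-probability score in (-∞; 0> and a percentage in <0–100%> carry the same information. As a minimal sketch, assuming natural logarithms (the brochure does not state the base) and a hypothetical mapping of language name to score, the conversion and ranking could look like this:

```python
import math


def rank_languages(log_probs: dict) -> list:
    """Return (language, percentage) pairs sorted best first.

    Assumption: scores are natural-log probabilities, so
    percentage = 100 * exp(score). The input shape is illustrative.
    """
    scored = [(lang, 100.0 * math.exp(lp)) for lang, lp in log_probs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A score of 0 corresponds to 100%, and increasingly negative scores approach 0%.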

Processing speed examples

Approx. 20× faster than real-time processing on 1 CPU core with the most precise model; for example, a standard 8 CPU core server processes 3,840 hours of audio in 1 day of computing time.

Phonexia Gender Identification

Phonexia Gender Identification (GID) automatically recognizes the gender of a speaker.

Technology

• Uses the acoustic characteristics of speech
• Speech is converted to frequency spectra and modeled with advanced statistical methods
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Minimum speech signal for identification: recommended 7+ secs

Output

XML/JSON format with all results or results files with processed information (scores for male and female)

Processing speed

Approx. 200× faster than real-time processing on 1 CPU core; for example, a standard 8 CPU core server processes 38,400 hours of audio in 1 day of computing time.


Phonexia Keyword Spotting

Phonexia Keyword Spotting (KWS) identifies the occurrences of keywords and/or key phrases in audio recordings.

Technology

• Robust acoustic-based technology, even with noisy recordings
• Keywords are automatically converted into phonemes and searched for
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted). List of keywords or key phrases to be searched for.

Output

XML/JSON format with all results or results files generated with detected keywords (containing the keyword, start/end time, path, probability, etc.)

Processing speed

The 5th generation is approximately 30× faster than real-time processing on 1 CPU core; for example, a standard 8 CPU core server processes 5,760 hours of audio in one day of computing time.

The 4th generation is approximately 10× faster than real-time processing on 1 CPU core.

Supported languages

Language                  Code    Note
Arabic                    ar-KW   4th Gen.
Chinese                   zh-CN   4th Gen. – Beta
Croatian                  hr-HR   4th Gen.
Czech                     cs-CZ   5th Gen.
Dutch                     nl-NL   5th Gen.
English UK                en-UK   4th Gen.
English US                en-US   5th Gen.
Farsi                     fa-IR   4th Gen. – Beta
French                    fr-FR   4th Gen.
German                    de-DE   4th Gen.
Hungarian                 hu-HU   4th Gen. – Beta
Italian                   it-IT   4th Gen.
Pashtu                    ps-AR   4th Gen.
Polish                    pl-PL   5th Gen.
Russian                   ru-RU   5th Gen.
Slovak                    sk-SK   5th Gen.
Spanish – Latin America   es-LA   5th Gen.
Turkish                   tr-TR   4th Gen. – Beta

A user can add an unlimited number of keywords

to the system, as well as an unlimited number of

pronunciation variants for each keyword.
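Keyword Spotting results list each detected keyword with its start/end time and probability, so a typical post-processing step is to keep only confident hits. The sketch below is illustrative: the JSON shape, the field names and the sample values are assumptions made for the example, not the actual Phonexia result schema.

```python
import json

# Hypothetical result document; real field names may differ.
SAMPLE_RESULT = """
{"detections": [
  {"keyword": "invoice", "start": 12.4, "end": 12.9, "probability": 0.91},
  {"keyword": "refund",  "start": 40.0, "end": 40.5, "probability": 0.42},
  {"keyword": "invoice", "start": 75.2, "end": 75.8, "probability": 0.67}
]}
"""


def confident_hits(result_json: str, threshold: float = 0.6) -> list:
    """Return detections at or above the threshold, ordered by start time."""
    detections = json.loads(result_json)["detections"]
    hits = [d for d in detections if d["probability"] >= threshold]
    return sorted(hits, key=lambda d: d["start"])


for hit in confident_hits(SAMPLE_RESULT):
    print(hit["keyword"], hit["start"], hit["probability"])
```

The threshold trades missed keywords against false alarms and would normally be tuned on the user's own recordings.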

Phonexia Speech Transcription

Phonexia Speech Transcription (STT) converts speech signals into plain text.

Technology

• In the fifth generation, a Language Model Customization tool is available for optionally adding desired words to the model
• Trained with an emphasis on spontaneous telephone conversation
• Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results or results files with:

• One-best transcription, i.e., a file with a time-aligned speech transcript (the start and end time of each word)
• n-best transcription, i.e., a confusion network with hypotheses for words at each moment
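The difference between the one-best and n-best outputs can be illustrated with a toy confusion network. In this sketch each slot holds competing word hypotheses with probabilities; the data structure, the sample words and the function names are illustrative assumptions and do not reflect the actual output format.

```python
# Each slot lists (word, probability) hypotheses for one moment in time.
# Illustrative values only, not product output.
NETWORK = [
    [("please", 0.8), ("police", 0.2)],
    [("call", 0.6), ("hall", 0.3), ("all", 0.1)],
    [("back", 0.9), ("pack", 0.1)],
]


def one_best(network):
    """Pick the most probable word in every slot."""
    return [max(slot, key=lambda h: h[1])[0] for slot in network]


def n_best(network, n):
    """Return the n most probable word sequences with joint probabilities.

    Slots are treated as independent, so keeping only the top n
    partial paths after each slot is exact (up to ties).
    """
    paths = [([], 1.0)]
    for slot in network:
        paths = [(words + [w], p * q) for words, p in paths for w, q in slot]
        paths.sort(key=lambda item: item[1], reverse=True)
        paths = paths[:n]
    return paths


print(" ".join(one_best(NETWORK)))  # please call back
```

The n-best list is useful when a downstream search should also match plausible alternatives, such as "please hall back" in this toy example.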

Processing speed

The 5th generation is approximately 7× faster than real-time processing on 1 CPU core; for example, a standard 8 CPU core server processes 1,344 hours of audio in one day of computing time.

The 4th generation is approximately 1.2× faster than real-time processing on 1 CPU core.

Supported languages

Language                  Code    Note
Arabic                    ar-KW   4th Gen. – Beta
Chinese                   zh-CN   4th Gen. – Beta
Czech                     cs-CZ   5th Gen.
Dutch                     nl-NL   5th Gen.
English UK                en-UK   4th Gen.
English US                en-US   5th Gen.
Farsi                     fa-IR   4th Gen. – Beta
French                    fr-FR   4th Gen.
German                    de-DE   4th Gen.
Italian                   it-IT   4th Gen.
Polish                    pl-PL   5th Gen.
Russian                   ru-RU   5th Gen.
Slovak                    sk-SK   5th Gen.
Spanish – Latin America   es-LA   5th Gen.


Phonexia Speaker Diarization

Phonexia Speaker Diarization (DIAR) segments the voices of individual speakers in a single mono-channel audio recording.

Technology

• Trained with an emphasis on spontaneous telephone conversation
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results or results files with segmentation of speech, silence, and technical signals (allowing the elimination of phone-line beeps, DTMF tones, music, etc.)

An audio file is extracted for each speaker.
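Diarization output is a segmentation, so a common first step is to total the talk time per speaker. The following sketch assumes a hypothetical list of (speaker label, start, end) tuples in seconds; the real result format and labels may differ.

```python
from collections import defaultdict

# Hypothetical segments: (speaker label, start seconds, end seconds).
SEGMENTS = [
    ("speaker_1", 0.0, 4.5),
    ("speaker_2", 4.5, 10.0),
    ("speaker_1", 10.0, 12.5),
]


def talk_time(segments):
    """Sum the duration of all segments for each speaker label."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


print(talk_time(SEGMENTS))  # {'speaker_1': 7.0, 'speaker_2': 5.5}
```

Per-speaker totals like these can indicate whether each extracted speaker has enough speech for a later identification step.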

Processing speed

Approx. 50× faster than real-time processing on 1 CPU core; for example, a standard 8 CPU core server processes 9,600 hours of audio in 1 day of computing time.

Phonexia Voice Activity Detection

Phonexia Voice Activity Detection (VAD) identifies parts of audio recordings with speech content vs. non-speech content.

Technology

• Trained with an emphasis on spontaneous telephone conversation
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results or results files with labels (speech vs. non-speech segments)
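Speech/non-speech labels make it possible to measure the net speech in a recording, which is what the minimum-speech recommendations elsewhere in this document (7+ seconds for identification, 20+ seconds for enrolment) refer to. A minimal sketch, assuming a hypothetical list of (label, start, end) tuples in seconds:

```python
# Hypothetical VAD labels: (label, start seconds, end seconds).
LABELS = [
    ("non-speech", 0.0, 1.5),
    ("speech", 1.5, 6.0),
    ("non-speech", 6.0, 8.0),
    ("speech", 8.0, 11.0),
]


def net_speech_seconds(labels):
    """Total duration of all segments labelled as speech."""
    return sum(end - start for label, start, end in labels if label == "speech")


def enough_speech(labels, minimum_s=7.0):
    """True if the net speech meets the given minimum (default 7 s)."""
    return net_speech_seconds(labels) >= minimum_s


print(net_speech_seconds(LABELS))  # 7.5
```

Here the 11-second recording contains 7.5 seconds of speech, enough for the identification recommendation but not for enrolment.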

Processing speed

Approx. 150× faster than real-time processing on 1 CPU core; for example, a standard 8 CPU core server processes 28,800 hours of audio in 1 day of computing time.


Phonexia Speech Quality Estimation

Phonexia Speech Quality Estimator (SQE) measures the quality parameters of the speech in an audio recording.

Technology

• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results or results files with:

• Global score, i.e., a percentage expression of audio quality (range <0; 100>); by default, the global score is calculated from the waveform_n_bits and waveform_snr variables
• Detailed outputs, i.e., clipped signal, amplitude, sample values, sampling frequency, SNR, technical signal, encoding, etc.

Processing speed

Approx. 2,000× faster than real-time processing on 1 CPU core; for example, a standard 8 CPU core server processes 384,000 hours of audio in 1 day of computing time.

Phonexia Age Estimation

Phonexia Age Estimation (AGE) estimates the age of a speaker from an audio recording.

Technology

• Trained with an emphasis on spontaneous telephone conversation
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results or results files with age estimates

Processing speed

Up to 182× faster than real-time processing

on 1 CPU core with the most precise model—

for example, a standard 8 CPU core server

processes up to 28,992 hours of audio in one day

of computing time.


Phonexia Voice Inspector

Phonexia Voice Inspector is an out-of-the-box solution providing police forces and forensic experts with highly accurate, AI-powered automatic speaker recognition to support criminal investigations.

Technology

• Deep Embeddings™ - uses deep neural

networks to generate highly representative

voiceprints

• Applies state-of-the-art channel compensation

techniques, verified by NIST evaluation

• Compatibility with the widest range of audio

sources possible: GSM/CDMA, 3G, VoIP,

landlines, etc.

• Independent of language, accent, text and

channel

Input

• WAV (8 or 16 bits linear coding), A-law and Mu-law, PCM, 8 kHz+ sampling

• 7 seconds recommended minimum speech

signal duration for a questioned recording

• 20 seconds recommended minimum speech

signal duration for a suspected speaker

Features and Benefits

• 1:1 speaker comparison in accordance with

ENFSI guidelines

• 1:N speaker identification for more

complex cases

• Automatic Forensic Voice Comparison

• A diarization tool to make working with audio

recordings containing multiple speakers easier

• A phoneme recognizer for the searching and

visualization of the same phoneme sequences

across multiple audio files

• An evaluation tool for the measurement of

accuracy in a user’s data sets

• A waveform editor with tools such as

a spectrum panel, voice activity detection

and more

• Easy management of investigation cases

Output

• Scoring to a likelihood ratio (LR), log-likelihood

ratio (LLR) and verbal presentation of results

• Graphic presentation of the likelihood ratio (LR)

• Detailed report output (expert opinion

template automatically generated) for

presentation of results (to a court or an

investigation team)

Figure: Phonexia Voice Inspector user interface

Figure: A visualization of scores from a sample case


Phonexia Denoiser

Phonexia Denoiser software cleans audio signals of reverberation and other noise to make them more audible to the human ear.

Technology

• Denoiser is distributed as a part of Phonexia

Speech Engine and is accessible via REST

API. Its algorithms use deep neural networks

to achieve the automatic cleaning and

reconstruction of the processed audio signals.

Removing noises and enhancing the speech

signal provide better audibility and the ability

to understand the speech content. For each

denoised file, information is provided about the

difference of the signal-to-noise ratio to indicate

the improvement in the signal achieved by the

process of denoising.
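The signal-to-noise ratio difference mentioned above can be expressed with the standard decibel definition, SNR = 10·log10(P_signal / P_noise). The brochure does not describe how the product estimates signal and noise power, so the sketch below simply takes those powers as given; the function names are illustrative.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two power values."""
    return 10.0 * math.log10(signal_power / noise_power)


def snr_improvement_db(before: tuple, after: tuple) -> float:
    """Difference in SNR (dB) between denoised and original audio.

    before and after are (signal_power, noise_power) pairs.
    """
    return snr_db(*after) - snr_db(*before)


# Noise power reduced 100-fold while signal power is unchanged.
print(snr_improvement_db(before=(10.0, 1.0), after=(10.0, 0.01)))
```

A positive difference means the denoised file has less noise relative to the speech than the original.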

Input

• A WAVE (*.wav) container including any of the following:
  • signed 8-bit PCM (s8)
  • signed 16-bit PCM (s16le)
  • IEEE float 32-bit (f32le)
  • IEEE float 64-bit (f64le)
  • A-law (alaw)
  • µ-law (mulaw)
  • ADPCM
• FLAC codec inside a FLAC (*.flac) container
• OPUS codec inside an OGG (*.opus) container

Output

• A RAW or WAV audio file (8 or 16 bits)

The processed audio is to be listened to and

examined by an expert and is not to be used as an

input for other automatic processing.

Integration Possibilities and Licensing

Phonexia offers multiple integration and licensing possibilities, as well as custom development.

Interfaces

• REST API interface
• Command line interface
• Graphical user interface (GUI) for evaluation

Supported OS

• Windows 64 bit (x86_64)
• Linux 64 bit (x86_64)

Licensing options

• USB dongle licensing key (offline license, on-premises installation)
• HW profile licensing key (offline license, on-premises installation)
• Licensing server (offline license, on-premises installation, used for high availability)
• NET-based license (for demo purposes)

Recommended hardware

For a production system, a 64-bit server-class processor with a large L3 cache is recommended (the larger, the better), e.g., Intel® Xeon® Processor E5 or E7, or Intel® Core™ i5 or i7. Phonexia technologies also work in a virtualized environment.

Detailed consultation on hardware configuration is provided on request for a specific deployment.

Customization

Phonexia provides research and development

services such as speech technology optimization

for target channels, development of new language

versions, etc. Phonexia also offers multiple

engines balancing speed and accuracy according

to the specific use case. Contact our team for

more details.

More information

If you would like more information about Phonexia technologies, please contact us at [email protected]


Phonexia s.r.o.

+420 511 205 265
[email protected]
Chaloupkova 3002/1a, 612 00 Brno, Czech Republic, European Union

phonexia.com

PARTNER:

Pangea Communications

Anche Botha
Pangea Communications (Pty) [email protected]
+27 82 570 5862