2020/02/17

Media Resource Control Protocol - MRCP

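MRCP is the protocol a speech server uses to offer services such as speech recognition and speech synthesis to clients. MRCP cannot run on its own: it relies on RTSP or SIP to set up the control session and the audio streams. It is a text-based protocol similar to HTTP, and each message has three parts: a first line, headers, and a body.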

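Like HTTP, MRCP follows a request/response model. For example, an MRCP client sends a request asking to stream audio data to the server for speech recognition, and the server replies with a message that includes the port number on which it will receive the data. MRCP does not define how the audio itself is transported; that part is handled by RTP.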

MRCP v2 (RFC 6787) uses SIP to manage the session and the audio streams, while v1 (RFC 4463) does not mandate a particular protocol for this. Most current discussion concerns MRCP v2. MRCPv1 relied on RTSP (RFC 2326), but during the MRCP v2 work the consensus was that this use of RTSP would cause backward-compatibility problems, something forbidden by Section 3.2 of RFC 4313 (Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources). This is why MRCPv2 does not run over RTSP.

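MRCP v2 uses SIP to establish separate media and control sessions to the speech media resources, adds support for speaker verification and identification engines, and leaves room for future extensions.

The architecture diagram from the MRCP v2 specification: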

     MRCPv2 client                   MRCPv2 Media Resource Server
|--------------------|            |------------------------------------|
||------------------||            ||----------------------------------||
|| Application Layer||            ||Synthesis|Recognition|Verification||
||------------------||            || Engine  |  Engine   |   Engine   ||
||Media Resource API||            ||    ||   |    ||     |    ||      ||
||------------------||            ||Synthesis|Recognizer |  Verifier  ||
|| SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
||Stack |           ||            ||     Media Resource Management    ||
||      |           ||            ||----------------------------------||
||------------------||            ||   SIP  |        MRCPv2           ||
||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
||                  ||            ||----------------------------------||
||------------------||----SIP-----||           TCP/IP Stack           ||
|--------------------|            ||                                  ||
         |                        ||----------------------------------||
        SIP                       |------------------------------------|
         |                          /
|-------------------|             RTP
|                   |             /
| Media Source/Sink |------------/
|                   |
|-------------------|

                      Figure 1: Architectural Diagram

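In 1999 the W3C formed the Voice Browser Working Group (VBWG) to study how the web could support speech recognition and DTMF handling. It went on to publish a web-based speech interface framework whose core is VoiceXML.

The W3C Speech Recognition Grammar Specification (SRGS) is an XML standard for writing speech grammars, that is, rules describing the words and phrases a recognizer should accept. Closely related is the W3C Semantic Interpretation for Speech Recognition (SISR), which annotates speech grammars with semantic information and provides a basic format for natural language understanding. The W3C Speech Synthesis Markup Language (SSML) is an XML-based way to specify content for speech synthesis and to control properties of the speech such as volume, pronunciation, pauses, and speaking rate.

SRGS and SSML are complemented by the W3C Pronunciation Lexicon Specification (PLS), which both can reference. PLS uses a standard phonetic alphabet to specify how words and short phrases are pronounced.

Used together with MRCP, a VoiceXML platform can work with a range of third-party speech recognition and synthesis engines.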

MRCPv2 Media Resource Types

An MRCPv2 server is a SIP server, so it is addressed by a SIP URI (sip:mrcpv2@example.net or sips:mrcpv2@example.net). It can offer the following media processing resources to clients:

  • Basic Synthesizer

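    Generates its speech media stream by concatenating audio clips. The speech data is described with a limited subset of the Speech Synthesis Markup Language (SSML); even the simplest synthesizer must support these SSML tags: <speak>, <audio>, <say-as>, <mark>

  • Speech Synthesizer

    Has full TTS capability and must fully support SSML.

  • Recorder

    Records audio and provides a URI for the recording. It must support suppressing silence at the beginning and end of a recording, and may optionally suppress silence in the middle. If silence is suppressed, it must keep timing metadata so that the actual timestamps of speech in the original media can be recovered.

  • DTMF Recognizer

    Detects Dual-Tone Multi-Frequency (DTMF) digits in the media stream and matches them against a supplied digit grammar.

  • Speech Recognizer

    A full speech recognition resource that receives an audio media stream and produces recognition results. It also includes a natural language semantic interpreter that post-processes the recognized text into the semantic data defined by the grammar.

  • Speaker Verifier

    Verifies a speaker by matching spoken input against a pre-existing voiceprint.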

Resource Type    Resource Description
-------------    --------------------
speechrecog      Speech Recognizer
dtmfrecog        DTMF Recognizer
speechsynth      Speech Synthesizer
basicsynth       Basic Synthesizer
speakverify      Speaker Verification
recorder         Speech Recorder

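According to the MRCPv2 specification, a session is used as follows:

  1. The MRCP client uses SIP and SDP to establish an MRCP control channel with the MRCP server. Each channel is uniquely identified by an MRCP channel ID, which the server assigns in the a=channel attribute of its 200 response.

  2. A SIP re-INVITE can add or remove MRCP control channels within a session, so one session can hold several MRCP control channels (for example, an ASR channel and a TTS channel at the same time).

  3. Several MRCP control channels can share the same TCP connection.

  4. An MRCP message carries exactly one MRCP channel ID.

  5. MRCP control messages cannot change the state of the SIP dialog.

  6. MRCP itself does not guarantee reliable delivery, so it must run over TCP or TLS.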

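resource control channel

MRCPv2 channels are negotiated in the SDP carried by SIP. The client connects to the MRCPv2 server with a SIP INVITE, which creates a SIP dialog. Through SDP the two endpoints negotiate every resource control channel to be set up, and establish the media session between the server and the source/sink of the audio.

The client needs a separate MRCPv2 resource control channel for each media resource it controls within the SIP dialog, so each channel gets a unique channel identifier string.

The SDP must contain one "m=" line for each MRCPv2 resource used in the session. The transport type must be "TCP/MRCPv2" or "TCP/TLS/MRCPv2"; the client connects to the MRCPv2 server over TCP or TCP/TLS accordingly.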
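The SDP rules above can be sketched in Python. This is only an illustration: `build_mrcp_offer` is a hypothetical helper, the session id and version in the "o=" line are arbitrary values, and the use of port 9 with a=setup:active follows the INVITE example in this post rather than a full reading of the specification.

```python
def build_mrcp_offer(resource, client_ip, rtp_port,
                     direction="recvonly", secure=False):
    """Return an SDP offer with one MRCPv2 control channel and one audio stream."""
    # The transport type must be TCP/MRCPv2 or TCP/TLS/MRCPv2.
    transport = "TCP/TLS/MRCPv2" if secure else "TCP/MRCPv2"
    lines = [
        "v=0",
        # Session id and version are arbitrary illustrative values.
        f"o=- 2890844526 2890844526 IN IP4 {client_ip}",
        "s=-",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # One m= line per MRCPv2 resource; the client is the active TCP
        # side, so the port is a placeholder (9) as in the example below.
        f"m=application 9 {transport} 1",
        "a=setup:active",
        "a=connection:new",
        f"a=resource:{resource}",
        "a=cmid:1",
        # The audio stream that the control channel refers to via cmid/mid.
        f"m=audio {rtp_port} RTP/AVP 0",
        "a=rtpmap:0 pcmu/8000",
        f"a={direction}",
        "a=mid:1",
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_mrcp_offer("speechsynth", "192.0.2.12", 49170))
```

A real client would hand this body to its SIP stack as the content of the INVITE.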

example:

An example of connecting to a synthesizer; the server produces a one-way audio stream sent to the client.

  1. Create the synthesizer control channel

   C->S:  INVITE sip:mresources@example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf1
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314161 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2890844526 2890844526 IN IP4 192.0.2.12
          s=-
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:new
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=recvonly
          a=mid:1

   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf1;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314161 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 2890842808 2890842808 IN IP4 192.0.2.11
          s=-
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:new
          a=channel:32AECB234338@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=sendonly
          a=mid:1

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf2
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314161 ACK
          Content-Length:0

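Keeping the RTP stream above, the client then requests an additional resource control channel for a recognizer and switches the audio to sendrecv so that it flows in both directions.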

   C->S:  INVITE sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf3
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314162 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2890844526 2890844527 IN IP4 192.0.2.12
          s=-
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:existing
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0 96
          a=rtpmap:0 pcmu/8000
          a=rtpmap:96 telephone-event/8000
          a=fmtp:96 0-15
          a=sendrecv
          a=mid:1
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:existing
          a=resource:speechrecog
          a=cmid:1

   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf3;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314162 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 2890842808 2890842809 IN IP4 192.0.2.11
          s=-
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:existing
          a=channel:32AECB234338@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0 96
          a=rtpmap:0 pcmu/8000
          a=rtpmap:96 telephone-event/8000
          a=fmtp:96 0-15
          a=sendrecv
          a=mid:1
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:existing
          a=channel:32AECB234338@speechrecog
          a=cmid:1

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf4
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314162 ACK
          Content-Length:0

Release the recognizer channel and switch the audio back to recvonly.

   C->S:  INVITE sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf5
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314163 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2890844526 2890844528 IN IP4 192.0.2.12
          s=-
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=recvonly
          a=mid:1
          m=application 0 TCP/MRCPv2 1
          a=resource:speechrecog
          a=cmid:1


   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf5;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314163 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 2890842808 2890842810 IN IP4 192.0.2.11
          s=-
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=channel:32AECB234338@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=sendonly
          a=mid:1
          m=application 0 TCP/MRCPv2 1
          a=channel:32AECB234338@speechrecog
          a=cmid:1

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf6
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314163 ACK
          Content-Length:0
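After each 200 OK the client needs the server's TCP port and the channel identifier of every control channel. A minimal sketch of pulling them out of the SDP answer, assuming the layout seen in the examples above (`parse_mrcp_channels` is a hypothetical helper, and treating port 0 as "released" is taken from the last exchange):

```python
def parse_mrcp_channels(sdp):
    """Map resource type -> (server TCP port, channel id) from an SDP answer."""
    channels = {}
    port = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()
            # Only MRCPv2 control m-lines carry a=channel attributes.
            is_mrcp = fields[0] == "application" and fields[2].endswith("MRCPv2")
            port = int(fields[1]) if is_mrcp else None
        elif line.startswith("a=channel:") and port:
            # Port 0 marks a released channel, so it is skipped here.
            channel_id = line[len("a=channel:"):]
            resource = channel_id.split("@", 1)[1]
            channels[resource] = (port, channel_id)
    return channels


if __name__ == "__main__":
    answer = (
        "v=0\r\n"
        "m=application 32416 TCP/MRCPv2 1\r\n"
        "a=channel:32AECB234338@speechsynth\r\n"
        "m=audio 48260 RTP/AVP 0\r\n"
        "m=application 0 TCP/MRCPv2 1\r\n"
        "a=channel:32AECB234338@speechrecog\r\n"
    )
    print(parse_mrcp_channels(answer))
```

The returned channel id is what the client later puts in the Channel-Identifier header of every MRCPv2 message on that channel.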

MRCPv2 message

MRCPv2 messages comprise requests from the client to the server, and responses and asynchronous events from the server to the client. Each message consists of a start-line, one or more headers, an empty line marking the end of the headers, and an optional message body, much like HTTP.

generic-message  =    start-line
                      message-header
                      CRLF
                      [ message-body ]

message-body     =    *OCTET

start-line       =    request-line / response-line / event-line

message-header   =  1*(generic-header / resource-header / generic-field)

resource-header  =    synthesizer-header
                 /    recognizer-header
                 /    recorder-header
                 /    verifier-header

ex:

   C->S:   MRCP/2.0 877 INTERPRET 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Interpret-Text:may I speak to Andre Roy
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:661

           <?xml version="1.0"?>
           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">
           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>
           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 82 543266 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 634 INTERPRETATION-COMPLETE 543266 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:441

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

The request-line format is:

   request-line   =    mrcp-version SP message-length SP method-name SP request-id CRLF

   method-name    =    generic-method
                  /    synthesizer-method
                  /    recognizer-method
                  /    recorder-method
                  /    verifier-method
                  
   request-id     =    1*10DIGIT

The response-line format is:

response-line  =    mrcp-version SP message-length SP request-id
                       SP status-code SP request-state CRLF
status-code     =    3DIGIT
request-state    =  "COMPLETE"
                    /  "IN-PROGRESS"
                    /  "PENDING"

The event-line format is:

event-line       =  mrcp-version SP message-length SP event-name
                       SP request-id SP request-state CRLF
event-name       =  synthesizer-event
                    /  recognizer-event
                    /  recorder-event
                    /  verifier-event
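The three start-line forms can be told apart by token count and by whether the third token is numeric. A minimal Python sketch under that assumption (`parse_start_line` and the dictionary keys are hypothetical names, and no method or event names are validated):

```python
REQUEST_STATES = {"COMPLETE", "IN-PROGRESS", "PENDING"}


def parse_start_line(line):
    """Classify an MRCPv2 start-line as a request, response, or event."""
    parts = line.strip().split(" ")
    if len(parts) not in (4, 5) or parts[0] != "MRCP/2.0":
        raise ValueError(f"not an MRCPv2 start-line: {line!r}")
    length = int(parts[1])
    if len(parts) == 4:
        # request-line: version, length, method-name, request-id
        return {"kind": "request", "length": length,
                "method": parts[2], "request_id": int(parts[3])}
    if parts[4] not in REQUEST_STATES:
        raise ValueError(f"unknown request-state: {parts[4]!r}")
    if parts[2].isdigit():
        # response-line: version, length, request-id, status-code, state
        return {"kind": "response", "length": length,
                "request_id": int(parts[2]),
                "status": int(parts[3]), "state": parts[4]}
    # event-line: version, length, event-name, request-id, state
    return {"kind": "event", "length": length, "event": parts[2],
            "request_id": int(parts[3]), "state": parts[4]}


if __name__ == "__main__":
    print(parse_start_line("MRCP/2.0 877 INTERPRET 543266"))
    print(parse_start_line("MRCP/2.0 82 543266 200 IN-PROGRESS"))
```

A client would use the request_id in responses and events to correlate them with the request it sent.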

Note that the message format defines separate methods, headers, and events for each of the four resource types: synthesizer, recognizer, recorder, and verifier.

Generic Methods, Headers, Result Structure

Methods and headers common to all resources.

MRCPv2 defines two generic methods for reading and writing the state associated with a resource:

   generic-method      =    "SET-PARAMS"
                       /    "GET-PARAMS"

SET-PARAMS

Sent by the client to the server to set parameters on the MRCPv2 resource for the session.

   C->S:  MRCP/2.0 ... SET-PARAMS 543256
          Channel-Identifier:32AECB23433802@speechsynth
          Voice-gender:female
          Voice-variant:3

   S->C:  MRCP/2.0 ... 543256 200 COMPLETE
          Channel-Identifier:32AECB23433802@speechsynth

GET-PARAMS

Sent by the client to the server to retrieve the current session parameters of the MRCPv2 resource.

   C->S:   MRCP/2.0 ... GET-PARAMS 543256
           Channel-Identifier:32AECB23433802@speechsynth
           Voice-gender:
           Voice-variant:
           Vendor-Specific-Parameters:com.example.param1;
                         com.example.param2

   S->C:   MRCP/2.0 ... 543256 200 COMPLETE
           Channel-Identifier:32AECB23433802@speechsynth
           Voice-gender:female
           Voice-variant:3
           Vendor-Specific-Parameters:com.example.param1="Company Name";
                         com.example.param2="124324234@example.com"

MRCPv2 headers fall into generic headers and resource-specific headers.

Header fields are defined as follows:

   generic-field  = field-name ":" [ field-value ]
   field-name     = token
   field-value    = *LWS field-content *( CRLF 1*LWS field-content)
   field-content  = <the OCTETs making up the field-value
                    and consisting of either *TEXT or combinations
                    of token, separators, and quoted-string>

The generic headers are:

   generic-header      =    channel-identifier
                       /    accept
                       /    active-request-id-list
                       /    proxy-sync-id
                       /    accept-charset
                       /    content-type
                       /    content-id
                       /    content-base
                       /    content-encoding
                       /    content-location
                       /    content-length
                       /    fetch-timeout
                       /    cache-control
                       /    logging-tag
                       /    set-cookie
                       /    vendor-specific

  • Channel-Identifier

    When a control channel is created, the server assigns it a channel ID.

   channel-identifier  = "Channel-Identifier" ":" channel-id CRLF
   channel-id          = 1*alphanum "@" 1*alphanum

  • Accept

  • Active-Request-Id-List

    In a request, this header lists the request-ids the request applies to. In a response, it lists the request-ids the method affected.

       active-request-id-list  =  "Active-Request-Id-List" ":"
                                  request-id *("," request-id) CRLF

  • Proxy-Sync-Id

    When a server resource generates a "barge-in-able" event, it also generates a unique tag, which is sent to the client in this header of the event.

       proxy-sync-id    =  "Proxy-Sync-Id" ":" 1*VCHAR CRLF

  • Accept-Charset

    In a request, specifies the character sets that are acceptable in the response or in events.

    For example, it can specify the character set to use for the Natural Language Semantic Markup Language (NLSML) results of a RECOGNITION-COMPLETE event.

  • Content-Type

    MRCPv2 content supports a limited set of media types, for example speech markup, grammars, and recognition results.

       content-type     =    "Content-Type" ":" media-type-value CRLF
    
       media-type-value =    type "/" subtype *( ";" parameter )
    
       type             =    token
    
       subtype          =    token
    
       parameter        =    attribute "=" value
    
       attribute        =    token
    
       value            =    token / quoted-string

  • Content-ID

    An ID or name by which the content can be referenced.

  • Content-Base

    Specifies the base URI against which relative URIs in the content are resolved.

       content-base      = "Content-Base" ":" absoluteURI CRLF

  • Content-Encoding

    Additional information qualifying the Content-Type, for example Content-Encoding:gzip

       content-encoding  = "Content-Encoding" ":"
                           *WSP content-coding
                           *(*WSP "," *WSP content-coding *WSP )
                           CRLF

  • Content-Location

       content-location  =  "Content-Location" ":"
                            ( absoluteURI / relativeURI ) CRLF

  • Content-Length

    The length of the message body in octets.

       content-length  =  "Content-Length" ":" 1*19DIGIT CRLF

  • Fetch-Timeout

    When a recognizer or synthesizer needs to fetch documents or other resources, this defines the timeout for the server's network fetch.

       fetch-timeout       =   "Fetch-Timeout" ":" 1*19DIGIT CRLF

  • Cache-Control

    If the server supports content caching, it follows the HTTP/1.1 caching rules.

       cache-control    =    "Cache-Control" ":"
                             [*WSP cache-directive
                             *( *WSP "," *WSP cache-directive *WSP )]
                             CRLF
    
       cache-directive     = "max-age" "=" delta-seconds
                           / "max-stale" [ "=" delta-seconds ]
                           / "min-fresh" "=" delta-seconds
    
       delta-seconds       = 1*19DIGIT

  • Logging-Tag

    A header for the SET-PARAMS/GET-PARAMS methods that sets or retrieves the logging tag applied to logs generated by the server.

       logging-tag    = "Logging-Tag" ":" 1*UTFCHAR CRLF

  • Set-Cookie

    Similar to an HTTP cookie: lets the server store cookie values on the client.

  • Vendor-Specific Parameters

    ex:

       com.example.companyA.paramxyz=256
       com.example.companyA.paramabc=High
       com.example.companyB.paramxyz=Low
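The generic-field grammar above allows a field-value to continue on a following line that starts with whitespace, as in the Vendor-Specific-Parameters lines of the GET-PARAMS example. A minimal Python sketch of reading a header block that way (`parse_headers` is a hypothetical helper; it keeps header names as written and does no case normalization):

```python
def parse_headers(block):
    """Parse an MRCPv2 header block into a dict, unfolding continuation lines."""
    headers = {}
    name = None
    for line in block.split("\r\n"):
        if not line:
            break  # the empty line ends the header section
        if line[0] in " \t" and name is not None:
            # A line starting with whitespace continues the previous value.
            headers[name] += line.strip()
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        headers[name] = value.strip()
    return headers


if __name__ == "__main__":
    block = (
        "Channel-Identifier:32AECB23433802@speechsynth\r\n"
        "Voice-gender:female\r\n"
        "Vendor-Specific-Parameters:com.example.param1=\"Company Name\";\r\n"
        "              com.example.param2=\"124324234@example.com\"\r\n"
        "\r\n"
    )
    print(parse_headers(block))
```

An empty value, as sent in a GET-PARAMS request, simply comes back as an empty string.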

Generic Result Structure

Result data produced by recognizer and verifier resources is delivered in Natural Language Semantics Markup Language (NLSML) format.

ex:

   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://theYesNoGrammar">
       <interpretation>
           <instance>
                   <ex:response>yes</ex:response>
           </instance>
           <input>OK</input>
       </interpretation>
   </result>
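A result like the one above can be read with Python's standard xml.etree.ElementTree. This is a sketch, not a complete NLSML reader: `parse_nlsml` is a hypothetical helper that only looks at the direct children of <instance> and ignores attributes such as confidence.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <result> element in the example above.
NLSML_NS = {"nl": "urn:ietf:params:xml:ns:mrcpv2"}


def parse_nlsml(xml_text):
    """Return one dict per <interpretation> in an NLSML result document."""
    root = ET.fromstring(xml_text)
    results = []
    for interp in root.findall("nl:interpretation", NLSML_NS):
        instance = interp.find("nl:instance", NLSML_NS)
        spoken = interp.find("nl:input", NLSML_NS)
        children = list(instance) if instance is not None else []
        results.append({
            "grammar": root.get("grammar"),
            "input": (spoken.text or "").strip() if spoken is not None else None,
            # Text of each direct child of <instance>, keyed by local name.
            "instance": {c.tag.split("}")[-1]: (c.text or "").strip()
                         for c in children},
        })
    return results


if __name__ == "__main__":
    doc = (
        '<?xml version="1.0"?>'
        '<result xmlns="urn:ietf:params:xml:ns:mrcpv2"'
        ' xmlns:ex="http://www.example.com/example"'
        ' grammar="http://theYesNoGrammar">'
        "<interpretation><instance><ex:response>yes</ex:response></instance>"
        "<input>OK</input></interpretation></result>"
    )
    print(parse_nlsml(doc))
```

Note that the XML text must start directly with the declaration; leading whitespace before `<?xml` makes the parser reject it.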

Resource Discovery

A client queries the server's capabilities with SIP OPTIONS.

The server must describe its capabilities in SDP, including the media type and transport type (m=application 0 TCP/TLS/MRCPv2 1) and the resources (a=resource:speechsynth).

ex:

   C->S:
        OPTIONS sip:mrcp@server.example.com SIP/2.0
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7
        Max-Forwards:6
        To:<sip:mrcp@example.com>
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:sarvi@client.example.com>
        Accept:application/sdp
        Content-Length:0


   S->C:
        SIP/2.0 200 OK
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7;received=192.0.32.10
        To:<sip:mrcp@example.com>;tag=62784
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:mrcp@server.example.com>
        Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
        Accept:application/sdp
        Accept-Encoding:gzip
        Accept-Language:en
        Supported:foo
        Content-Type:application/sdp
        Content-Length:...

        v=0
        o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
        s=-
        i=MRCPv2 server capabilities
        c=IN IP4 192.0.2.12/127
        t=0 0
        m=application 0 TCP/TLS/MRCPv2 1
        a=resource:speechsynth
        a=resource:speechrecog
        a=resource:speakverify
        m=audio 0 RTP/AVP 0 3
        a=rtpmap:0 PCMU/8000
        a=rtpmap:3 GSM/8000
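On the client side, discovering which resources a server offers comes down to collecting the a=resource attributes under the MRCPv2 m-line of that SDP. A small Python sketch, with `advertised_resources` as a hypothetical helper name:

```python
def advertised_resources(sdp):
    """Return the MRCPv2 resource types listed in a capabilities SDP."""
    resources = []
    in_mrcp_section = False
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # Attributes belong to the most recent m= line.
            in_mrcp_section = "MRCPv2" in line
        elif in_mrcp_section and line.startswith("a=resource:"):
            resources.append(line[len("a=resource:"):])
    return resources


if __name__ == "__main__":
    sdp = (
        "v=0\r\n"
        "m=application 0 TCP/TLS/MRCPv2 1\r\n"
        "a=resource:speechsynth\r\n"
        "a=resource:speechrecog\r\n"
        "a=resource:speakverify\r\n"
        "m=audio 0 RTP/AVP 0 3\r\n"
        "a=rtpmap:0 PCMU/8000\r\n"
    )
    print(advertised_resources(sdp))
```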

Speech Synthesizer Resource

The client sends text markup and the server generates an audio stream in real time. Synthesis parameters such as voice characteristics and speaking rate can be specified.

There are two types: speechsynth and basicsynth.

Synthesizer State Machine

A pending SPEAK request can be deleted or stopped.

   Idle                    Speaking                  Paused
   State                   State                     State
     |                        |                          |
     |----------SPEAK-------->|                 |--------|
     |<------STOP-------------|             CONTROL      |
     |<----SPEAK-COMPLETE-----|                 |------->|
     |<----BARGE-IN-OCCURRED--|                          |
     |              |---------|                          |
     |          CONTROL       |-----------PAUSE--------->|
     |              |-------->|<----------RESUME---------|
     |                        |               |----------|
     |----------|             |              PAUSE       |
     |    BARGE-IN-OCCURRED   |               |--------->|
     |<---------|             |----------|               |
     |                        |      SPEECH-MARKER       |
     |                        |<---------|               |
     |----------|             |----------|               |
     |         STOP           |       RESUME             |
     |          |             |<---------|               |
     |<---------|             |                          |
     |<---------------------STOP-------------------------|
     |----------|             |                          |
     |     DEFINE-LEXICON     |                          |
     |          |             |                          |
     |<---------|             |                          |
     |<---------------BARGE-IN-OCCURRED------------------|
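The diagram can be written down as a transition table. The Python sketch below is one reading of the arrows above, with lowercase state names and `next_state` as a hypothetical helper; it models only the state changes, not the queueing of pending SPEAK requests.

```python
# (current state, method or event) -> next state, read off the diagram above.
SYNTH_TRANSITIONS = {
    ("idle", "SPEAK"): "speaking",
    ("idle", "STOP"): "idle",
    ("idle", "BARGE-IN-OCCURRED"): "idle",
    ("idle", "DEFINE-LEXICON"): "idle",
    ("speaking", "STOP"): "idle",
    ("speaking", "SPEAK-COMPLETE"): "idle",
    ("speaking", "BARGE-IN-OCCURRED"): "idle",
    ("speaking", "CONTROL"): "speaking",
    ("speaking", "SPEECH-MARKER"): "speaking",
    ("speaking", "RESUME"): "speaking",
    ("speaking", "PAUSE"): "paused",
    ("paused", "RESUME"): "speaking",
    ("paused", "CONTROL"): "paused",
    ("paused", "PAUSE"): "paused",
    ("paused", "STOP"): "idle",
    ("paused", "BARGE-IN-OCCURRED"): "idle",
}


def next_state(state, message):
    """Return the state after `message`, or raise if the diagram has no arrow."""
    try:
        return SYNTH_TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message} is not valid in state {state}") from None


if __name__ == "__main__":
    state = "idle"
    for message in ["SPEAK", "PAUSE", "RESUME", "SPEAK-COMPLETE"]:
        state = next_state(state, message)
        print(message, "->", state)
```

Keeping the table as data makes it easy to compare against the diagram line by line.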

Synthesizer Methods

   synthesizer-method   =  "SPEAK"
                        /  "STOP"
                        /  "PAUSE"
                        /  "RESUME"
                        /  "BARGE-IN-OCCURRED"
                        /  "CONTROL"
                        /  "DEFINE-LEXICON"

Synthesizer Events

   synthesizer-event    =  "SPEECH-MARKER"
                        /  "SPEAK-COMPLETE"

Synthesizer Header Fields

   synthesizer-header  =  jump-size
                       /  kill-on-barge-in
                       /  speaker-profile
                       /  completion-cause
                       /  completion-reason
                       /  voice-parameter
                       /  prosody-parameter
                       /  speech-marker
                       /  speech-language
                       /  fetch-hint
                       /  audio-fetch-hint
                       /  failed-uri
                       /  failed-uri-cause
                       /  speak-restart
                       /  speak-length
                       /  load-lexicon
                       /  lexicon-search-order

Example:

The text is synthesized and played into the media stream. The resource replies with an IN-PROGRESS response and later sends a SPEAK-COMPLETE event.

   C->S: MRCP/2.0 ... SPEAK 543257
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
            <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.
                </s>
             <s>The subject is
                    <prosody rate="-20%">ski trip</prosody>
             </s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206027059

Speech Recognizer Resource

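Receives a voice stream from the client and converts it to text.

There are two types: speechrecog and dtmfrecog.

A recognizer resource has the following capabilities:

  1. Normal Mode Recognition

    Tries to match the entire speech or DTMF input against the grammar.

  2. Hotword Mode Recognition

    Looks for a match against a specific speech grammar or DTMF sequence, ignoring input that does not match.

  3. Voice Enrolled Grammars

    (optional) Enrollment is performed with a person's voice. For example, the server can maintain a list of contacts, each holding a name recorded in the caller's voice. This technique is also called speaker-dependent recognition.

  4. Interpretation

    Natural language interpretation: takes text instead of speech as input and produces an interpretation of that text according to the supplied grammar.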

Recognizer State Machine

   Idle                   Recognizing               Recognized
   State                  State                     State
    |                       |                          |
    |---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
    |<------STOP------------|<-----RECOGNIZE-----------|
    |                       |                          |
    |              |--------|              |-----------|
    |       START-OF-INPUT  |       GET-RESULT         |
    |              |------->|              |---------->|
    |------------|          |                          |
    |      DEFINE-GRAMMAR   |----------|               |
    |<-----------|          | START-INPUT-TIMERS       |
    |                       |<---------|               |
    |------|                |                          |
    |  INTERPRET            |                          |
    |<-----|                |------|                   |
    |                       |   RECOGNIZE              |
    |-------|               |<-----|                   |
    |      STOP                                        |
    |<------|                                          |
    |<-------------------STOP--------------------------|
    |<-------------------DEFINE-GRAMMAR----------------|

Recognizer Methods

   recognizer-method    =  recog-only-method
                        /  enrollment-method
   recog-only-method    =  "DEFINE-GRAMMAR"
                        /  "RECOGNIZE"
                        /  "INTERPRET"
                        /  "GET-RESULT"
                        /  "START-INPUT-TIMERS"
                        /  "STOP"
   enrollment-method    =  "START-PHRASE-ENROLLMENT"
                        /  "ENROLLMENT-ROLLBACK"
                        /  "END-PHRASE-ENROLLMENT"
                        /  "MODIFY-PHRASE"
                        /  "DELETE-PHRASE"

Recognizer Events

   recognizer-event     =  "START-OF-INPUT"
                        /  "RECOGNITION-COMPLETE"
                        /  "INTERPRETATION-COMPLETE"

Recognizer Header Fields

   recognizer-header    =  recog-only-header
                        /  enrollment-header

   recog-only-header    =  confidence-threshold
                        /  sensitivity-level
                        /  speed-vs-accuracy
                        /  n-best-list-length
                        /  no-input-timeout
                        /  input-type
                        /  recognition-timeout
                        /  waveform-uri
                        /  input-waveform-uri
                        /  completion-cause
                        /  completion-reason
                        /  recognizer-context-block
                        /  start-input-timers
                        /  speech-complete-timeout
                        /  speech-incomplete-timeout
                        /  dtmf-interdigit-timeout
                        /  dtmf-term-timeout
                        /  dtmf-term-char
                        /  failed-uri
                        /  failed-uri-cause
                        /  save-waveform
                        /  media-type
                        /  new-audio-channel
                        /  speech-language
                        /  ver-buffer-utterance
                        /  recognition-mode
                        /  cancel-if-queue
                        /  hotword-max-duration
                        /  hotword-min-duration
                        /  interpret-text
                        /  dtmf-buffer-time
                        /  clear-dtmf-buffer
                        /  early-no-match
                        
   enrollment-header    =  num-min-consistent-pronunciations
                        /  consistency-threshold
                        /  clash-threshold
                        /  personal-grammar-uri
                        /  enroll-utterance
                        /  phrase-id
                        /  phrase-nl
                        /  weight
                        /  save-best-waveform
                        /  new-phrase-id
                        /  confusable-phrases-uri
                        /  abort-phrase-enrollment

Example

   C->S:MRCP/2.0 ... RECOGNIZE 543257
   Channel-Identifier:32AECB23433801@speechrecog
   Confidence-Threshold:0.9
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0" root="request">

   <!-- single language attachment to tokens -->
       <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

     </grammar>

   S->C:MRCP/2.0 ... 543257 200 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success
   Waveform-URI:<http://web.media.com/session123/audio.wav>;
                 size=424252;duration=2543
   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="session:request1@form-level.store">
       <interpretation>
           <instance name="Person">
               <ex:Person>
                   <ex:Name> Andre Roy </ex:Name>
               </ex:Person>
           </instance>
               <input>   may I speak to Andre Roy </input>
       </interpretation>
   </result>
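
The NLSML body of RECOGNITION-COMPLETE is ordinary namespaced XML, so a client can read the recognized text with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on the result from the example above. `parse_nlsml` is a hypothetical helper, and the `ex:Person/ex:Name` path is specific to this example's grammar; a real application would walk whatever instance structure its own grammar produces.

```python
import xml.etree.ElementTree as ET

# NLSML body copied from the RECOGNITION-COMPLETE example above.
NLSML = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="session:request1@form-level.store">
    <interpretation>
        <instance name="Person">
            <ex:Person>
                <ex:Name> Andre Roy </ex:Name>
            </ex:Person>
        </instance>
            <input>   may I speak to Andre Roy </input>
    </interpretation>
</result>"""

# Prefix-to-namespace map used by find()/findall(); the prefixes are local names.
NS = {
    "mrcp": "urn:ietf:params:xml:ns:mrcpv2",
    "ex": "http://www.example.com/example",
}


def _text(element):
    """Return stripped element text, or None if the element or text is missing."""
    if element is None or element.text is None:
        return None
    return element.text.strip()


def parse_nlsml(xml_text):
    """Extract the grammar reference and each interpretation's input and name."""
    root = ET.fromstring(xml_text)
    interpretations = []
    for interp in root.findall("mrcp:interpretation", NS):
        interpretations.append({
            "input": _text(interp.find("mrcp:input", NS)),
            # Example-specific path: depends on the grammar's semantic result.
            "name": _text(interp.find("mrcp:instance/ex:Person/ex:Name", NS)),
        })
    return {"grammar": root.get("grammar"), "interpretations": interpretations}
```

Calling `parse_nlsml(NLSML)` yields one interpretation whose `input` is `may I speak to Andre Roy` and whose `name` is `Andre Roy`.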

Recorder Resource

The recorder resource stores the received audio/video at a specified URI.

Recorder State Machine

   Idle                   Recording
   State                  State
    |                       |
    |---------RECORD------->|
    |                       |
    |<------STOP------------|
    |                       |
    |<--RECORD-COMPLETE-----|
    |                       |
    |              |--------|
    |       START-OF-INPUT  |
    |              |------->|
    |                       |
    |              |--------|
    |    START-INPUT-TIMERS |
    |              |------->|
    |                       |

Recorder Methods

   recorder-method      =  "RECORD"
                        /  "STOP"
                        /  "START-INPUT-TIMERS"

Recorder Events

   recorder-event       =  "START-OF-INPUT"
                        /  "RECORD-COMPLETE"

Recorder Header Fields

   recorder-header      =  sensitivity-level
                        /  no-input-timeout
                        /  completion-cause
                        /  completion-reason
                        /  failed-uri
                        /  failed-uri-cause
                        /  record-uri
                        /  media-type
                        /  max-time
                        /  trim-length
                        /  final-silence
                        /  capture-on-speech
                        /  ver-buffer-utterance
                        /  start-input-timers
                        /  new-audio-channel

Example

   C->S:  MRCP/2.0 ... RECORD 543257
          Channel-Identifier:32AECB23433802@recorder
          Record-URI:<file://mediaserver/recordings/myfile.wav>
          Media-Type:audio/wav
          Capture-On-Speech:true
          Final-Silence:300
          Max-Time:6000

   S->C:  MRCP/2.0 ... 543257 200 IN-PROGRESS
          Channel-Identifier:32AECB23433802@recorder

   S->C:  MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
          Channel-Identifier:32AECB23433802@recorder

   S->C:  MRCP/2.0 ... RECORD-COMPLETE 543257 COMPLETE
          Channel-Identifier:32AECB23433802@recorder
          Completion-Cause:000 success-silence
          Record-URI:<file://mediaserver/recordings/myfile.wav>;
                     size=242552;duration=25645
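
The examples elide the message-length field as `...`. In RFC 6787 that field counts the octets of the entire message, start line included, so its value depends on its own number of digits. The following is a sketch of serializing the RECORD request above with the length filled in, assuming a request without a body. `build_request` is a hypothetical helper, not an API of any MRCP library.

```python
from typing import List, Tuple

CRLF = "\r\n"


def build_request(method: str, request_id: int,
                  headers: List[Tuple[str, str]]) -> bytes:
    """Serialize a body-less MRCPv2 request with a self-consistent length."""
    head = b"MRCP/2.0 "
    tail = (
        f" {method} {request_id}{CRLF}"
        + "".join(f"{name}:{value}{CRLF}" for name, value in headers)
        + CRLF  # empty line that ends the header section
    ).encode("utf-8")

    # The length field counts itself, so try 1, 2, 3... digits until the
    # total length printed in decimal has exactly that many digits.
    digits = 1
    while True:
        total = len(head) + digits + len(tail)
        if len(str(total)) == digits:
            break
        digits += 1
    return head + str(total).encode("ascii") + tail


# Header values copied from the RECORD example above.
record_request = build_request("RECORD", 543257, [
    ("Channel-Identifier", "32AECB23433802@recorder"),
    ("Record-URI", "<file://mediaserver/recordings/myfile.wav>"),
    ("Media-Type", "audio/wav"),
    ("Capture-On-Speech", "true"),
    ("Final-Silence", "300"),
    ("Max-Time", "6000"),
])
```

The second token of the first line of `record_request` then equals `len(record_request)`. A request that carries a body would also need a `Content-Length` header, which this sketch leaves out.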

Speaker Verification and Identification

The verifier resource verifies or identifies the identity of the speaker.

Speaker Verification State Machine

     Idle              Session Opened       Verifying/Training
     State             State                State
      |                   |                         |
      |--START-SESSION--->|                         |
      |                   |                         |
      |                   |----------|              |
      |                   |     START-SESSION       |
      |                   |<---------|              |
      |                   |                         |
      |<--END-SESSION-----|                         |
      |                   |                         |
      |                   |---------VERIFY--------->|
      |                   |                         |
      |                   |---VERIFY-FROM-BUFFER--->|
      |                   |                         |
      |                   |----------|              |
      |                   |  VERIFY-ROLLBACK        |
      |                   |<---------|              |
      |                   |                         |
      |                   |                |--------|
      |                   | GET-INTERMEDIATE-RESULT |
      |                   |                |------->|
      |                   |                         |
      |                   |                |--------|
      |                   |     START-INPUT-TIMERS  |
      |                   |                |------->|
      |                   |                         |
      |                   |                |--------|
      |                   |         START-OF-INPUT  |
      |                   |                |------->|
      |                   |                         |
      |                   |<-VERIFICATION-COMPLETE--|
      |                   |                         |
      |                   |<--------STOP------------|
      |                   |                         |
      |                   |----------|              |
      |                   |         STOP            |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      |         STOP      |                         |
      |<---------|        |                         |
      |                   |----------|              |
      |                   |    CLEAR-BUFFER         |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      |   CLEAR-BUFFER    |                         |
      |<---------|        |                         |
      |                   |                         |
      |                   |----------|              |
      |                   |   QUERY-VOICEPRINT      |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      | QUERY-VOICEPRINT  |                         |
      |<---------|        |                         |
      |                   |                         |
      |                   |----------|              |
      |                   |  DELETE-VOICEPRINT      |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      | DELETE-VOICEPRINT |                         |
      |<---------|        |                         |
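
The diagram above is easier to check when written as a transition table. The sketch below transcribes its arrows into a Python dictionary and rejects any method or event that the diagram does not show for the current state. The names `TRANSITIONS`, `next_state`, and `run` are hypothetical, and the table reflects only the diagram as drawn here, not every rule in the RFC text.

```python
IDLE = "Idle"
OPENED = "Session Opened"
VERIFYING = "Verifying/Training"

# state -> {method or event -> next state}, one entry per arrow in the diagram.
TRANSITIONS = {
    IDLE: {
        "START-SESSION": OPENED,
        "STOP": IDLE,
        "CLEAR-BUFFER": IDLE,
        "QUERY-VOICEPRINT": IDLE,
        "DELETE-VOICEPRINT": IDLE,
    },
    OPENED: {
        "START-SESSION": OPENED,
        "END-SESSION": IDLE,
        "VERIFY": VERIFYING,
        "VERIFY-FROM-BUFFER": VERIFYING,
        "VERIFY-ROLLBACK": OPENED,
        "STOP": OPENED,
        "CLEAR-BUFFER": OPENED,
        "QUERY-VOICEPRINT": OPENED,
        "DELETE-VOICEPRINT": OPENED,
    },
    VERIFYING: {
        "GET-INTERMEDIATE-RESULT": VERIFYING,
        "START-INPUT-TIMERS": VERIFYING,
        "START-OF-INPUT": VERIFYING,          # event
        "VERIFICATION-COMPLETE": OPENED,      # event
        "STOP": OPENED,
    },
}


def next_state(state: str, message: str) -> str:
    """Return the state reached from `state` on `message`, or raise ValueError."""
    try:
        return TRANSITIONS[state][message]
    except KeyError:
        raise ValueError(f"{message} is not shown for state {state}") from None


def run(messages, state: str = IDLE) -> str:
    """Apply a sequence of methods/events and return the final state."""
    for message in messages:
        state = next_state(state, message)
    return state
```

For example, `run(["START-SESSION", "VERIFY", "START-OF-INPUT", "VERIFICATION-COMPLETE", "END-SESSION"])` ends back in the Idle state, while `next_state(IDLE, "VERIFY")` raises `ValueError` because VERIFY requires an open session.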

Speaker Verification Methods

   verifier-method          =  "START-SESSION"
                            / "END-SESSION"
                            / "QUERY-VOICEPRINT"
                            / "DELETE-VOICEPRINT"
                            / "VERIFY"
                            / "VERIFY-FROM-BUFFER"
                            / "VERIFY-ROLLBACK"
                            / "STOP"
                            / "CLEAR-BUFFER"
                            / "START-INPUT-TIMERS"
                            / "GET-INTERMEDIATE-RESULT"

Verification Events

   verifier-event       =  "VERIFICATION-COMPLETE"
                        /  "START-OF-INPUT"

Verification Header Fields

   verification-header      =  repository-uri
                            /  voiceprint-identifier
                            /  verification-mode
                            /  adapt-model
                            /  abort-model
                            /  min-verification-score
                            /  num-min-verification-phrases
                            /  num-max-verification-phrases
                            /  no-input-timeout
                            /  save-waveform
                            /  media-type
                            /  waveform-uri
                            /  voiceprint-exists
                            /  ver-buffer-utterance
                            /  input-waveform-uri
                            /  completion-cause
                            /  completion-reason
                            /  speech-complete-timeout
                            /  new-audio-channel
                            /  abort-verification
                            /  start-input-timers

References

MRCP wiki

MRCP協議學習筆記-MRCP背景知識介紹

MRCP學習筆記-語音識別資源的事件和Headers詳解

MRCP協議學習筆記-語音識別資源的概括和全部Methods

MRCP協議學習筆記-關於媒體資源伺服器的定位路由策略

MRCPv2概述

MRCPv2在電信智能語音識別業務中的應用

MRCPv2 - Speech Synthesizer Resource

MRCP v2.0 規範 - RFC6787中文翻譯(1)

cisco 使用MRCPv1 ASR/TTS的IOS語音XML網關到CVP呼叫流
