2020/02/17

Media Resource Control Protocol - MRCP

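MRCP is the protocol through which a speech server offers services such as speech recognition and speech synthesis to a client. MRCP cannot run on its own: it relies on RTSP or SIP to set up the control session and the audio streams. Like HTTP, MRCP is a text-based protocol, and every message has three parts: a first line, headers, and a body.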

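MRCP follows the same request/response model as HTTP. For example, an MRCP client sends a request saying that it wants to send audio data to the server for speech recognition, and the server replies with a message containing the port number on which it will receive the data. MRCP does not define how the audio itself is carried; that part is handled by RTP.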

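MRCPv2 (RFC 6787) uses SIP to manage the session and the audio streams. MRCPv1 (RFC 4463) leaves session management to another protocol and in practice depends on RTSP (RFC 2326). MRCPv2 is the version most commonly discussed today. While MRCPv2 was being designed, the consensus was that using RTSP in this way would cause backward-compatibility problems, which Section 3.2 of RFC 4313 (Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources) forbids. That is why MRCPv2 cannot run over RTSP.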

MRCPv2 uses SIP to set up separate media and control sessions to the speech media resources, adds support for speaker verification and identification engines, and leaves room for future extensions.

The architecture diagram from the MRCPv2 specification:

     MRCPv2 client                   MRCPv2 Media Resource Server
|--------------------|            |------------------------------------|
||------------------||            ||----------------------------------||
|| Application Layer||            ||Synthesis|Recognition|Verification||
||------------------||            || Engine  |  Engine   |   Engine   ||
||Media Resource API||            ||    ||   |    ||     |    ||      ||
||------------------||            ||Synthesis|Recognizer |  Verifier  ||
|| SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
||Stack |           ||            ||     Media Resource Management    ||
||      |           ||            ||----------------------------------||
||------------------||            ||   SIP  |        MRCPv2           ||
||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
||                  ||            ||----------------------------------||
||------------------||----SIP-----||           TCP/IP Stack           ||
|--------------------|            ||                                  ||
         |                        ||----------------------------------||
        SIP                       |------------------------------------|
         |                          /
|-------------------|             RTP
|                   |             /
| Media Source/Sink |------------/
|                   |
|-------------------|

                      Figure 1: Architectural Diagram

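In 1999 the W3C set up the Voice Browser Working Group (VBWG) to study how the web could support speech recognition and DTMF handling. It went on to publish a web-based speech interface framework with VoiceXML at its core.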

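The W3C Speech Recognition Grammar Specification (SRGS) is an XML standard for writing speech grammars, that is, the rules describing the words and phrases a recognizer should accept. Closely related is W3C Semantic Interpretation for Speech Recognition (SISR), which annotates a speech grammar with semantic information and forms the basic format for natural language understanding. The W3C Speech Synthesis Markup Language (SSML) is an XML-based way of specifying content for speech synthesis; it controls properties of the generated speech such as volume, pronunciation, pauses, and speaking rate.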

SRGS and SSML are complemented by the W3C Pronunciation Lexicon Specification (PLS). PLS uses a standard phonetic alphabet to specify how words and short phrases are pronounced.

Combined with MRCP, a VoiceXML platform can work with speech recognition and synthesis engines from many third-party vendors.

MRCPv2 Media Resource Types

An MRCPv2 server is a SIP server, so it is addressed by a SIP URI (sip:mrcpv2@example.net or sips:mrcpv2@example.net). It can offer the following media processing resources to clients:

  • Basic Synthesizer

    Produces a speech media stream by concatenating audio clips. The speech data is described with a limited subset of the Speech Synthesis Markup Language (SSML); even the simplest synthesizer must support these SSML tags: <speak>, <audio>, <say-as>, <mark>

  • Speech Synthesizer

    Has full TTS capability and must support SSML completely

  • Recorder

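    Records audio and provides a URI pointing to the recording. It must support suppressing silence at the beginning and end of a recording; suppressing silence in the middle is optional. If silence is suppressed, the recorder must keep timing metadata so that the timestamps at which speech actually occurred in the original media can be recovered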

  • DTMF Recognizer

    Detects Dual-Tone Multi-Frequency (DTMF) digits in the media stream and matches them against a supplied digit grammar

  • Speech Recognizer

    A full speech recognition resource that receives an audio media stream and returns recognition results. It also includes a natural language semantic interpreter that post-processes the recognized text into the semantic data defined by the grammar

  • Speaker Verifier

    Verifies a speaker against an existing voice print

   Resource Type   Resource Description
   -------------   --------------------
   speechrecog     Speech Recognizer
   dtmfrecog       DTMF Recognizer
   speechsynth     Speech Synthesizer
   basicsynth      Basic Synthesizer
   speakverify     Speaker Verification
   recorder        Speech Recorder
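
The resource type tokens listed above are the values that appear in `a=resource:` and after the `@` of a channel identifier. A minimal sketch of keeping them in a lookup table follows; the helper name `describe_resource` is hypothetical and only serves the illustration.

```python
# Resource type tokens and descriptions, copied from the table above.
RESOURCE_TYPES = {
    "speechrecog": "Speech Recognizer",
    "dtmfrecog": "DTMF Recognizer",
    "speechsynth": "Speech Synthesizer",
    "basicsynth": "Basic Synthesizer",
    "speakverify": "Speaker Verification",
    "recorder": "Speech Recorder",
}


def describe_resource(token: str) -> str:
    """Return the description for a resource type token (hypothetical helper)."""
    try:
        return RESOURCE_TYPES[token]
    except KeyError:
        raise ValueError(f"unknown MRCPv2 resource type: {token!r}") from None
```

Rejecting unknown tokens early keeps a typo such as `speachsynth` from reaching the SDP offer.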

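According to the MRCPv2 specification, an application uses the protocol as follows:

  1. The MRCP client uses SIP and SDP to set up an MRCP control channel with the MRCP server. The channel is uniquely identified by an MRCP channel ID, which the server assigns in the a=channel attribute of its 200 OK response.

  2. A SIP re-INVITE can add or remove MRCP control channels within a session, so one session can have several control channels (for example, an ASR channel and a TTS channel at the same time).

  3. Several MRCP control channels can share one TCP connection.

  4. An MRCP message carries exactly one MRCP channel ID.

  5. MRCP control messages cannot change the state of the SIP dialog.

  6. MRCP has no reliability mechanism of its own, so it must run over TCP or TLS.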
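
Because several control channels can share one TCP connection and each message carries exactly one channel ID, a receiver has to cut the byte stream into messages and route each one by its Channel-Identifier header. The following is a minimal sketch of that step, assuming the second token of the start line is the total message length in bytes (the message-length field of the start-line grammar). The function names `split_messages` and `channel_of` are hypothetical.

```python
def split_messages(buffer: bytes):
    """Split a TCP byte buffer into complete MRCPv2 messages plus leftover bytes."""
    messages = []
    while True:
        first_line_end = buffer.find(b"\r\n")
        if first_line_end < 0:
            break  # start line not complete yet
        parts = buffer[:first_line_end].split(b" ")
        if len(parts) < 3 or not parts[1].isdigit():
            raise ValueError("malformed MRCPv2 start line")
        length = int(parts[1])
        if length <= first_line_end:
            raise ValueError("message-length shorter than the start line")
        if len(buffer) < length:
            break  # wait for more bytes
        messages.append(buffer[:length])
        buffer = buffer[length:]
    return messages, buffer


def channel_of(message: bytes) -> str:
    """Return the Channel-Identifier value of one complete message."""
    for line in message.split(b"\r\n")[1:]:
        if not line:
            break  # empty line: end of the header section
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"channel-identifier":
            return value.strip().decode("ascii")
    raise ValueError("message has no Channel-Identifier header")
```

A connection handler could keep the leftover bytes, append the next `recv()` result, and call `split_messages` again.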

resource control channel

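MRCPv2 channels are negotiated in the SDP carried by SIP. The client sends a SIP INVITE to the MRCPv2 server, which creates a SIP dialog. Through SDP the two endpoints negotiate every resource control channel to be set up, and also the media session between the server and the source/sink of audio.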

The client sets up a separate MRCPv2 resource control channel for each media resource it controls within the SIP dialog, so each channel is given a unique channel identifier string.

The SDP must contain one "m=" line for each MRCPv2 resource used in the session. Its transport type must be "TCP/MRCPv2" or "TCP/TLS/MRCPv2"; the client then connects to the MRCPv2 server over TCP or TLS.
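
As a rough illustration of the rule above, the sketch below builds the SDP lines of one control-channel media section for a client offer. The attribute values (port 9, `a=setup:active`, `a=connection:new`) are taken from the synthesizer example in this section; the function name `mrcp_offer_media` and its parameters are hypothetical.

```python
def mrcp_offer_media(resource: str, cmid: int = 1, tls: bool = False) -> list:
    """Return the SDP lines for one MRCPv2 control channel in a client offer."""
    proto = "TCP/TLS/MRCPv2" if tls else "TCP/MRCPv2"
    return [
        f"m=application 9 {proto} 1",  # port 9 as used in the client offer example
        "a=setup:active",              # the client opens the TCP connection
        "a=connection:new",
        f"a=resource:{resource}",
        f"a=cmid:{cmid}",              # ties the channel to the audio stream's a=mid
    ]
```

For a session with both a synthesizer and a recognizer, the caller would emit one such section per resource, each with its own "m=" line.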

example:

Example of connecting to a synthesizer; the server generates a one-way audio stream and sends it to the client

  1. Create the synthesizer control channel

   C->S:  INVITE sip:mresources@example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf1
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314161 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2890844526 2890844526 IN IP4 192.0.2.12
          s=-
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:new
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=recvonly
          a=mid:1

   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf1;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314161 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 2890842808 2890842808 IN IP4 192.0.2.11
          s=-
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:new
          a=channel:32AECB234338@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=sendonly
          a=mid:1

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf2
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314161 ACK
          Content-Length:0
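
After the 200 OK arrives, the client needs two things from the answer SDP: the TCP port of the control channel and the channel identifier assigned by the server. A minimal sketch of pulling these out follows. It assumes the `a=channel:` attribute always follows the "m=" line it belongs to, as in the exchange above; `control_channels` is a hypothetical name.

```python
def control_channels(sdp: str) -> list:
    """Return (port, channel_id, resource) for each MRCPv2 media section."""
    found = []
    port = None  # port of the current MRCPv2 "m=" section, if any
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()
            is_mrcp = len(fields) >= 3 and fields[2].endswith("MRCPv2")
            port = int(fields[1]) if is_mrcp else None
        elif line.startswith("a=channel:") and port is not None:
            channel_id = line[len("a=channel:"):]
            _, _, resource = channel_id.partition("@")
            found.append((port, channel_id, resource))
    return found
```

The returned channel identifier is the value the client later places in the Channel-Identifier header of every MRCPv2 message on that channel.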

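Reusing the RTP stream above, the client then requests an additional resource control channel for a recognizer and switches the audio to sendrecv so that speech flows in both directions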

   C->S:  INVITE sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf3
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314162 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2890844526 2890844527 IN IP4 192.0.2.12
          s=-
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:existing
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0 96
          a=rtpmap:0 pcmu/8000
          a=rtpmap:96 telephone-event/8000
          a=fmtp:96 0-15
          a=sendrecv
          a=mid:1
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:existing
          a=resource:speechrecog
          a=cmid:1

   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf3;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314162 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 2890842808 2890842809 IN IP4 192.0.2.11
          s=-
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:existing
          a=channel:32AECB234338@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0 96
          a=rtpmap:0 pcmu/8000
          a=rtpmap:96 telephone-event/8000
          a=fmtp:96 0-15
          a=sendrecv
          a=mid:1
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:existing
          a=channel:32AECB234338@speechrecog
          a=cmid:1

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf4
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314162 ACK
          Content-Length:0

Release the recognizer channel and switch the audio back to recvonly

   C->S:  INVITE sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf5
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314163 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2890844526 2890844528 IN IP4 192.0.2.12
          s=-
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=recvonly
          a=mid:1
          m=application 0 TCP/MRCPv2 1
          a=resource:speechrecog
          a=cmid:1


   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf5;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314163 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 2890842808 2890842810 IN IP4 192.0.2.11
          s=-
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=channel:32AECB234338@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=sendonly
          a=mid:1
          m=application 0 TCP/MRCPv2 1
          a=channel:32AECB234338@speechrecog
          a=cmid:1

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf6
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:314163 ACK
          Content-Length:0

MRCPv2 message

MRCPv2 messages are the requests a client sends to the server, and the responses and asynchronous events the server sends to the client. Like HTTP, a message consists of a start-line, one or more headers, an empty line that ends the header section, and an optional message body

generic-message  =    start-line
                      message-header
                      CRLF
                      [ message-body ]

message-body     =    *OCTET

start-line       =    request-line / response-line / event-line

message-header   =  1*(generic-header / resource-header / generic-field)

resource-header  =    synthesizer-header
                 /    recognizer-header
                 /    recorder-header
                 /    verifier-header
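
The grammar above can be read as "start line, header lines, empty line, body". A minimal sketch of splitting a complete message along those lines follows. It assumes CRLF line endings and folds continuation lines (lines starting with whitespace) into the previous header; `parse_message` is a hypothetical name, and header names are kept exactly as written.

```python
def parse_message(raw: bytes):
    """Split one MRCPv2 message into (start_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")  # first empty line ends the headers
    lines = head.decode("utf-8").split("\r\n")
    start_line = lines[0]
    headers = {}
    last = None
    for line in lines[1:]:
        if line[:1] in (" ", "\t") and last is not None:
            # Continuation of the previous header value.
            headers[last] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last = name.strip()
        headers[last] = value.strip()
    return start_line, headers, body
```

The body is left as bytes because its interpretation depends on the Content-Type header.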

ex:

   C->S:   MRCP/2.0 877 INTERPRET 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Interpret-Text:may I speak to Andre Roy
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:661

           <?xml version="1.0"?>
           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">
           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>
           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 82 543266 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 634 INTERPRETATION-COMPLETE 543266 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:441

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

The request-line format is

   request-line   =    mrcp-version SP message-length SP method-name SP request-id CRLF

   method-name    =    generic-method
                  /    synthesizer-method
                  /    recognizer-method
                  /    recorder-method
                  /    verifier-method
                  
   request-id     =    1*10DIGIT
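
One subtlety of the request-line is that message-length counts the whole message, including the digits of the length field itself. The sketch below shows one way to handle this, by iterating until the number of digits stops changing. The function name `build_request` and its parameters are hypothetical, and the Content-Length header is added only when a body is present.

```python
def build_request(method: str, request_id: int, headers, body: bytes = b"") -> bytes:
    """Serialize an MRCPv2 request; `headers` is a list of (name, value) pairs."""
    header_lines = [f"{name}:{value}" for name, value in headers]
    if body:
        header_lines.append(f"Content-Length:{len(body)}")
    head = "".join(line + "\r\n" for line in header_lines)
    rest = f" {method} {request_id}\r\n{head}\r\n".encode("utf-8") + body
    # Everything except the digits of the length field itself.
    base = len(b"MRCP/2.0 ") + len(rest)
    length = base
    # Adding the length digits can itself add a digit (e.g. 98 -> 101),
    # so repeat until the value is self-consistent.
    while base + len(str(length)) != length:
        length = base + len(str(length))
    return b"MRCP/2.0 " + str(length).encode("ascii") + rest
```

For example, `build_request("SET-PARAMS", 543256, [("Channel-Identifier", "32AECB23433802@speechsynth"), ("Voice-gender", "female")])` yields a message whose second token equals its own byte length.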

The response-line format is

   response-line  =    mrcp-version SP message-length SP request-id
                       SP status-code SP request-state CRLF

   status-code    =    3DIGIT

   request-state  =    "COMPLETE"
                  /    "IN-PROGRESS"
                  /    "PENDING"

The event-line format is

   event-line     =    mrcp-version SP message-length SP event-name
                       SP request-id SP request-state CRLF

   event-name     =    synthesizer-event
                  /    recognizer-event
                  /    recorder-event
                  /    verifier-event
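
The three start-line forms can be told apart by their token layout: a response has a numeric request-id right after the length, a request has two tokens after the length, and an event has three tokens starting with a name. A minimal sketch of that classification, following the grammars above, is shown here; `classify_start_line` and the dictionary keys are hypothetical.

```python
def classify_start_line(line: str) -> dict:
    """Classify an MRCPv2 start line as request, response, or event."""
    tokens = line.split(" ")
    if len(tokens) < 4 or tokens[0] != "MRCP/2.0":
        raise ValueError(f"not an MRCPv2 start line: {line!r}")
    length = int(tokens[1])
    rest = tokens[2:]
    if len(rest) == 3 and rest[0].isdigit():
        # response-line: request-id, status-code, request-state
        return {"kind": "response", "length": length, "request_id": int(rest[0]),
                "status": int(rest[1]), "state": rest[2]}
    if len(rest) == 2:
        # request-line: method-name, request-id
        return {"kind": "request", "length": length,
                "method": rest[0], "request_id": int(rest[1])}
    if len(rest) == 3:
        # event-line: event-name, request-id, request-state
        return {"kind": "event", "length": length, "event": rest[0],
                "request_id": int(rest[1]), "state": rest[2]}
    raise ValueError(f"unrecognized start line: {line!r}")
```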

Note that the message format defines separate methods, headers, and events for each of the four resource types: synthesizer, recognizer, recorder, and verifier

Generic Methods, Headers, Result Structure

Methods and headers common to all resources

MRCPv2 defines two generic methods for reading and writing the state associated with a resource

   generic-method      =    "SET-PARAMS"
                       /    "GET-PARAMS"

SET-PARAMS

Sent by the client to ask the server to set parameters for the MRCPv2 resource in this session

   C->S:  MRCP/2.0 ... SET-PARAMS 543256
          Channel-Identifier:32AECB23433802@speechsynth
          Voice-gender:female
          Voice-variant:3

   S->C:  MRCP/2.0 ... 543256 200 COMPLETE
          Channel-Identifier:32AECB23433802@speechsynth

GET-PARAMS

Sent by the client to ask the server for the current session parameters of the MRCPv2 resource

   C->S:   MRCP/2.0 ... GET-PARAMS 543256
           Channel-Identifier:32AECB23433802@speechsynth
           Voice-gender:
           Voice-variant:
           Vendor-Specific-Parameters:com.example.param1;
                         com.example.param2
   S->C:   MRCP/2.0 ... 543256 200 COMPLETE
           Channel-Identifier:32AECB23433802@speechsynth
           Voice-gender:female
           Voice-variant:3
           Vendor-Specific-Parameters:com.example.param1="Company Name";
                         com.example.param2="124324234@example.com"

MRCPv2 headers fall into two groups: generic headers and resource-specific headers

Header fields are defined as follows

   generic-field  = field-name ":" [ field-value ]
   field-name     = token
   field-value    = *LWS field-content *( CRLF 1*LWS field-content)
   field-content  = <the OCTETs making up the field-value
                    and consisting of either *TEXT or combinations
                    of token, separators, and quoted-string>

The generic headers are

   generic-header      =    channel-identifier
                       /    accept
                       /    active-request-id-list
                       /    proxy-sync-id
                       /    accept-charset
                       /    content-type
                       /    content-id
                       /    content-base
                       /    content-encoding
                       /    content-location
                       /    content-length
                       /    fetch-timeout
                       /    cache-control
                       /    logging-tag
                       /    set-cookie
                       /    vendor-specific
  • Channel-Identifier

    Assigned by the server when a control channel is created

   channel-identifier  = "Channel-Identifier" ":" channel-id CRLF
   channel-id          = 1*alphanum "@" 1*alphanum
  • Accept

  • Active-Request-Id-List

    In a request, this header lists the request-ids that the request applies to. In a response, it lists the request-ids that the method affected

       active-request-id-list  =  "Active-Request-Id-List" ":"
                                  request-id *("," request-id) CRLF
  • Proxy-Sync-Id

    When a server resource generates a "barge-in-able" event, it also generates a unique tag, which it passes to the client in this header of the event

       proxy-sync-id    =  "Proxy-Sync-Id" ":" 1*VCHAR CRLF
  • Accept-Charset

    Used in a request to specify the character sets that are acceptable in the response or in events.

    For example, it can specify the character set for the Natural Language Semantic Markup Language (NLSML) results carried by a RECOGNITION-COMPLETE event

  • Content-Type

    MRCPv2 content is restricted to a small set of media types, for example speech markup, grammars, and recognition results

       content-type     =    "Content-Type" ":" media-type-value CRLF
    
       media-type-value =    type "/" subtype *( ";" parameter )
    
       type             =    token
    
       subtype          =    token
    
       parameter        =    attribute "=" value
    
       attribute        =    token
    
       value            =    token / quoted-string
  • Content-ID

    An ID or name by which the content can be referenced

  • Content-Base

    Specifies the base URI against which relative URIs in the message body are resolved

    content-base      = "Content-Base" ":" absoluteURI CRLF
  • Content-Encoding

    Additional information qualifying the Content-Type, for example Content-Encoding:gzip

       content-encoding  = "Content-Encoding" ":"
                           *WSP content-coding
                           *(*WSP "," *WSP content-coding *WSP )
                           CRLF
  • Content-Location

       content-location  =  "Content-Location" ":"
                            ( absoluteURI / relativeURI ) CRLF
  • Content-Length

    Length of the message body

       content-length  =  "Content-Length" ":" 1*19DIGIT CRLF
  • Fetch-Timeout

    When a recognizer or synthesizer needs to fetch documents or other resources, this defines how long the server waits when retrieving them over the network

       fetch-timeout       =   "Fetch-Timeout" ":" 1*19DIGIT CRLF
  • Cache-Control

    If the server supports content caching, it caches according to the HTTP 1.1 rules

       cache-control    =    "Cache-Control" ":"
                             [*WSP cache-directive
                             *( *WSP "," *WSP cache-directive *WSP )]
                             CRLF
    
       cache-directive     = "max-age" "=" delta-seconds
                           / "max-stale" [ "=" delta-seconds ]
                           / "min-fresh" "=" delta-seconds
    
       delta-seconds       = 1*19DIGIT
  • Logging-Tag

    A header for the SET-PARAMS/GET-PARAMS methods that sets or retrieves the logging tag applied to logs generated by the server

       logging-tag    = "Logging-Tag" ":" 1*UTFCHAR CRLF
  • Set-Cookie

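    Modeled on HTTP cookies. Because the server fetches HTTP resources on the client's behalf, this header lets the client and the server keep their cookie values in sync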

  • Vendor-Specific Parameters

    ex:

       com.example.companyA.paramxyz=256
       com.example.companyA.paramabc=High
       com.example.companyB.paramxyz=Low
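
The Vendor-Specific-Parameters value seen in the GET-PARAMS exchange is a ";"-separated list of `name` or `name=value` items, with values optionally in double quotes. A minimal sketch of turning such a value into a dictionary follows. It assumes no ";" appears inside a quoted value, which a stricter parser would have to handle; `parse_vendor_params` is a hypothetical name.

```python
def parse_vendor_params(value: str) -> dict:
    """Parse a Vendor-Specific-Parameters header value into {name: value}."""
    params = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, raw = item.partition("=")
        raw = raw.strip()
        if len(raw) >= 2 and raw[0] == raw[-1] == '"':
            raw = raw[1:-1]  # drop the surrounding double quotes
        params[name.strip()] = raw
    return params
```

Names listed without a value, as in a GET-PARAMS request, map to an empty string.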

Generic Result Structure

The result data produced by recognizer and verifier resources is delivered in Natural Language Semantics Markup Language (NLSML) format

ex:

   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://theYesNoGrammar">
       <interpretation>
           <instance>
                   <ex:response>yes</ex:response>
           </instance>
           <input>OK</input>
       </interpretation>
   </result>
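
A client usually wants the recognized input and the semantic instance out of such a result. The sketch below reads them with the standard library's `xml.etree.ElementTree`, assuming the `urn:ietf:params:xml:ns:mrcpv2` namespace used in the example above. The function name `summarize_nlsml` and the shape of its return value are hypothetical.

```python
import xml.etree.ElementTree as ET

NLSML_NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}


def summarize_nlsml(xml_text: str):
    """Return (grammar, interpretations) from an NLSML result document."""
    root = ET.fromstring(xml_text)
    interpretations = []
    for interp in root.findall("n:interpretation", NLSML_NS):
        input_el = interp.find("n:input", NLSML_NS)
        instance = interp.find("n:instance", NLSML_NS)
        values = []
        if instance is not None:
            for child in instance:
                # Strip the "{namespace}" prefix ElementTree puts on tag names.
                values.append((child.tag.split("}")[-1], (child.text or "").strip()))
        interpretations.append({
            "input": (input_el.text or "").strip() if input_el is not None else None,
            "instance": values,
        })
    return root.get("grammar"), interpretations
```

Only the direct children of `<instance>` are collected here; nested instance data, as in the `ex:Person` example earlier, would need a recursive walk.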

Resource Discovery

The client queries the server's capabilities with SIP OPTIONS

The server must describe its capabilities in SDP, including the media type and transport type (m=application 0 TCP/TLS/MRCPv2 1) and the resources (a=resource:speechsynth)

ex:

   C->S:
        OPTIONS sip:mrcp@server.example.com SIP/2.0
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7
        Max-Forwards:6
        To:<sip:mrcp@example.com>
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:sarvi@client.example.com>
        Accept:application/sdp
        Content-Length:0


   S->C:
        SIP/2.0 200 OK
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7;received=192.0.32.10
        To:<sip:mrcp@example.com>;tag=62784
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:mrcp@server.example.com>
        Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
        Accept:application/sdp
        Accept-Encoding:gzip
        Accept-Language:en
        Supported:foo
        Content-Type:application/sdp
        Content-Length:...

        v=0
        o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
        s=-
        i=MRCPv2 server capabilities
        c=IN IP4 192.0.2.12/127
        t=0 0
        m=application 0 TCP/TLS/MRCPv2 1
        a=resource:speechsynth
        a=resource:speechrecog
        a=resource:speakverify
        m=audio 0 RTP/AVP 0 3
        a=rtpmap:0 PCMU/8000
        a=rtpmap:3 GSM/8000
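
Given the SDP body of such an OPTIONS response, a client can check whether the server offers the resource it needs before sending an INVITE. A minimal sketch follows; the names `advertised_resources` and `supports` are hypothetical.

```python
def advertised_resources(sdp: str) -> list:
    """List the resource types named in a=resource: lines of a capability SDP."""
    prefix = "a=resource:"
    return [line.strip()[len(prefix):]
            for line in sdp.splitlines()
            if line.strip().startswith(prefix)]


def supports(sdp: str, resource: str) -> bool:
    """True if the capability SDP advertises the given resource type."""
    return resource in advertised_resources(sdp)
```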

Speech Synthesizer Resource

The client sends text markup and the server generates an audio stream in real time. Synthesis parameters such as voice characteristics and speaking rate can be specified

There are two kinds: speechsynth and basicsynth

Synthesizer State Machine

A SPEAK request that is still in the PENDING state can be deleted or stopped

   Idle                    Speaking                  Paused
   State                   State                     State
     |                        |                          |
     |----------SPEAK-------->|                 |--------|
     |<------STOP-------------|             CONTROL      |
     |<----SPEAK-COMPLETE-----|                 |------->|
     |<----BARGE-IN-OCCURRED--|                          |
     |              |---------|                          |
     |          CONTROL       |-----------PAUSE--------->|
     |              |-------->|<----------RESUME---------|
     |                        |               |----------|
     |----------|             |              PAUSE       |
     |    BARGE-IN-OCCURRED   |               |--------->|
     |<---------|             |----------|               |
     |                        |      SPEECH-MARKER       |
     |                        |<---------|               |
     |----------|             |----------|               |
     |         STOP           |       RESUME             |
     |          |             |<---------|               |
     |<---------|             |                          |
     |<---------------------STOP-------------------------|
     |----------|             |                          |
     |     DEFINE-LEXICON     |                          |
     |          |             |                          |
     |<---------|             |                          |
     |<---------------BARGE-IN-OCCURRED------------------|
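
The diagram above can be transcribed into a transition table keyed by (state, message). The sketch below is one reading of the arrows in that diagram, not a normative table; the lowercase state names and the helper `next_synth_state` are hypothetical.

```python
# (current state, method or event) -> next state, read off the diagram above.
SYNTH_TRANSITIONS = {
    ("idle", "SPEAK"): "speaking",
    ("idle", "STOP"): "idle",
    ("idle", "BARGE-IN-OCCURRED"): "idle",
    ("idle", "DEFINE-LEXICON"): "idle",
    ("speaking", "STOP"): "idle",
    ("speaking", "SPEAK-COMPLETE"): "idle",
    ("speaking", "BARGE-IN-OCCURRED"): "idle",
    ("speaking", "CONTROL"): "speaking",
    ("speaking", "SPEECH-MARKER"): "speaking",
    ("speaking", "RESUME"): "speaking",
    ("speaking", "PAUSE"): "paused",
    ("paused", "RESUME"): "speaking",
    ("paused", "PAUSE"): "paused",
    ("paused", "CONTROL"): "paused",
    ("paused", "STOP"): "idle",
    ("paused", "BARGE-IN-OCCURRED"): "idle",
}


def next_synth_state(state: str, message: str) -> str:
    """Return the state after `message`, or raise if the diagram has no such arrow."""
    try:
        return SYNTH_TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message} is not shown for state {state!r}") from None
```

A client can use such a table to decide locally whether, say, RESUME makes sense before sending it.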

Synthesizer Methods

   synthesizer-method   =  "SPEAK"
                        /  "STOP"
                        /  "PAUSE"
                        /  "RESUME"
                        /  "BARGE-IN-OCCURRED"
                        /  "CONTROL"
                        /  "DEFINE-LEXICON"

Synthesizer Events

   synthesizer-event    =  "SPEECH-MARKER"
                        /  "SPEAK-COMPLETE"

Synthesizer Header Fields

   synthesizer-header  =  jump-size
                       /  kill-on-barge-in
                       /  speaker-profile
                       /  completion-cause
                       /  completion-reason
                       /  voice-parameter
                       /  prosody-parameter
                       /  speech-marker
                       /  speech-language
                       /  fetch-hint
                       /  audio-fetch-hint
                       /  failed-uri
                       /  failed-uri-cause
                       /  speak-restart
                       /  speak-length
                       /  load-lexicon
                       /  lexicon-search-order

Example:

The text is synthesized and played to the media stream. The resource answers with an IN-PROGRESS response and later sends a SPEAK-COMPLETE event

   C->S: MRCP/2.0 ... SPEAK 543257
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
            <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.
                </s>
             <s>The subject is
                    <prosody rate="-20%">ski trip</prosody>
             </s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206027059
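
When the text to speak comes from user data, it has to be escaped before it is placed in the SSML body of a SPEAK request. A minimal sketch of wrapping plain text in a `<speak>` element follows, using the namespace from the example above. The function name `ssml_for` is hypothetical and the `lang` default is an assumption.

```python
from xml.sax.saxutils import escape


def ssml_for(text: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document, escaping &, < and >."""
    return (
        '<?xml version="1.0"?>\n'
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{escape(text)}</speak>'
    )
```

The returned string, encoded as UTF-8, would be sent with Content-Type:application/ssml+xml.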

Speech Recognizer Resource

Receives a voice stream from the client and converts it to text

There are two kinds: speechrecog and dtmfrecog

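A recognizer resource has the following capabilities:

  1. Normal Mode Recognition

    Tries to match the entire speech or DTMF input against the grammar

  2. Hotword Mode Recognition

    Looks for a specific speech grammar or DTMF sequence anywhere in the input

  3. Voice Enrolled Grammars

    (optional) Enrollment is performed with a person's voice. For example, the server can maintain a list of contacts, each with a name recorded in that voice. This technique is also called speaker-dependent recognition

  4. Interpretation

    Natural language interpretation

    Takes text as input and produces an interpretation of that text according to the grammar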

Recognizer State Machine

   Idle                   Recognizing               Recognized
   State                  State                     State
    |                       |                          |
    |---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
    |<------STOP------------|<-----RECOGNIZE-----------|
    |                       |                          |
    |              |--------|              |-----------|
    |       START-OF-INPUT  |       GET-RESULT         |
    |              |------->|              |---------->|
    |------------|          |                          |
    |      DEFINE-GRAMMAR   |----------|               |
    |<-----------|          | START-INPUT-TIMERS       |
    |                       |<---------|               |
    |------|                |                          |
    |  INTERPRET            |                          |
    |<-----|                |------|                   |
    |                       |   RECOGNIZE              |
    |-------|               |<-----|                   |
    |      STOP                                        |
    |<------|                                          |
    |<-------------------STOP--------------------------|
    |<-------------------DEFINE-GRAMMAR----------------|

Recognizer Methods

   recognizer-method    =  recog-only-method
                        /  enrollment-method
   recog-only-method    =  "DEFINE-GRAMMAR"
                        /  "RECOGNIZE"
                        /  "INTERPRET"
                        /  "GET-RESULT"
                        /  "START-INPUT-TIMERS"
                        /  "STOP"
   enrollment-method    =  "START-PHRASE-ENROLLMENT"
                        /  "ENROLLMENT-ROLLBACK"
                        /  "END-PHRASE-ENROLLMENT"
                        /  "MODIFY-PHRASE"
                        /  "DELETE-PHRASE"

Recognizer Events

   recognizer-event     =  "START-OF-INPUT"
                        /  "RECOGNITION-COMPLETE"
                        /  "INTERPRETATION-COMPLETE"

Recognizer Header Fields

   recognizer-header    =  recog-only-header
                        /  enrollment-header

   recog-only-header    =  confidence-threshold
                        /  sensitivity-level
                        /  speed-vs-accuracy
                        /  n-best-list-length
                        /  no-input-timeout
                        /  input-type
                        /  recognition-timeout
                        /  waveform-uri
                        /  input-waveform-uri
                        /  completion-cause
                        /  completion-reason
                        /  recognizer-context-block
                        /  start-input-timers
                        /  speech-complete-timeout
                        /  speech-incomplete-timeout
                        /  dtmf-interdigit-timeout
                        /  dtmf-term-timeout
                        /  dtmf-term-char
                        /  failed-uri
                        /  failed-uri-cause
                        /  save-waveform
                        /  media-type
                        /  new-audio-channel
                        /  speech-language
                        /  ver-buffer-utterance
                        /  recognition-mode
                        /  cancel-if-queue
                        /  hotword-max-duration
                        /  hotword-min-duration
                        /  interpret-text
                        /  dtmf-buffer-time
                        /  clear-dtmf-buffer
                        /  early-no-match
                        
   enrollment-header    =  num-min-consistent-pronunciations
                        /  consistency-threshold
                        /  clash-threshold
                        /  personal-grammar-uri
                        /  enroll-utterance
                        /  phrase-id
                        /  phrase-nl
                        /  weight
                        /  save-best-waveform
                        /  new-phrase-id
                        /  confusable-phrases-uri
                        /  abort-phrase-enrollment

Example

   C->S:MRCP/2.0 ... RECOGNIZE 543257
   Channel-Identifier:32AECB23433801@speechrecog
   Confidence-Threshold:0.9
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0" root="request">

   <!-- single language attachment to tokens -->
       <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

     </grammar>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success
   Waveform-URI:<http://web.media.com/session123/audio.wav>;
                 size=424252;duration=2543
   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="session:request1@form-level.store">
       <interpretation>
           <instance name="Person">
               <ex:Person>
                   <ex:Name> Andre Roy </ex:Name>
               </ex:Person>
           </instance>
               <input>   may I speak to Andre Roy </input>
       </interpretation>
   </result>
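
The start lines in the exchange above come in three shapes: a request carries a method name and a request ID, a response carries the request ID, a status code, and a request state, and an event carries an event name, the request ID, and a request state. As a rough illustration, the Python sketch below splits one message into those parts. The function name parse_mrcp_message and the layout of the returned dictionary are hypothetical, and the sketch assumes every header sits on a single line (it does not handle folded continuation lines such as the Waveform-URI parameters above).

```python
def parse_mrcp_message(raw):
    """Split one MRCPv2 message into start-line fields, headers, and body."""
    text = raw.replace("\r\n", "\n")
    head, _, body = text.partition("\n\n")
    lines = [line.strip() for line in head.split("\n") if line.strip()]
    if not lines:
        raise ValueError("empty message")

    tokens = lines[0].split()
    if len(tokens) < 4 or tokens[0] != "MRCP/2.0":
        raise ValueError("not an MRCPv2 start line: %r" % lines[0])

    rest = tokens[2:]  # tokens[1] is the message length ("..." in the examples)
    if len(rest) == 3 and rest[0].isdigit():
        # response: request-id status-code request-state
        msg = {"kind": "response", "request_id": rest[0],
               "status_code": int(rest[1]), "request_state": rest[2]}
    elif len(rest) == 3:
        # event: event-name request-id request-state
        msg = {"kind": "event", "name": rest[0],
               "request_id": rest[1], "request_state": rest[2]}
    elif len(rest) == 2:
        # request: method-name request-id
        msg = {"kind": "request", "name": rest[0], "request_id": rest[1]}
    else:
        raise ValueError("unexpected start line: %r" % lines[0])

    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            raise ValueError("malformed header: %r" % line)
        headers[name.strip()] = value.strip()

    msg["headers"] = headers
    msg["body"] = body
    return msg
```

Splitting each header at the first colon keeps URI values such as http://... intact.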

Recorder Resource

Stores the received audio/video at the specified URI.

Recorder State Machine

   Idle                   Recording
   State                  State
    |                       |
    |---------RECORD------->|
    |                       |
    |<------STOP------------|
    |                       |
    |<--RECORD-COMPLETE-----|
    |                       |
    |              |--------|
    |       START-OF-INPUT  |
    |              |------->|
    |                       |
    |              |--------|
    |    START-INPUT-TIMERS |
    |              |------->|
    |                       |

Recorder Methods

   recorder-method      =  "RECORD"
                        /  "STOP"
                        /  "START-INPUT-TIMERS"

Recorder Events

   recorder-event       =  "START-OF-INPUT"
                        /  "RECORD-COMPLETE"

Recorder Header Fields

   recorder-header      =  sensitivity-level
                        /  no-input-timeout
                        /  completion-cause
                        /  completion-reason
                        /  failed-uri
                        /  failed-uri-cause
                        /  record-uri
                        /  media-type
                        /  max-time
                        /  trim-length
                        /  final-silence
                        /  capture-on-speech
                        /  ver-buffer-utterance
                        /  start-input-timers
                        /  new-audio-channel

Example

   C->S:  MRCP/2.0 ... RECORD 543257
          Channel-Identifier:32AECB23433802@recorder
          Record-URI:<file://mediaserver/recordings/myfile.wav>
          Media-Type:audio/wav
          Capture-On-Speech:true
          Final-Silence:300
          Max-Time:6000

   S->C:  MRCP/2.0 ... 543257 200 IN-PROGRESS
          Channel-Identifier:32AECB23433802@recorder

   S->C:  MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
          Channel-Identifier:32AECB23433802@recorder

   S->C:  MRCP/2.0 ... RECORD-COMPLETE 543257 COMPLETE
          Channel-Identifier:32AECB23433802@recorder
          Completion-Cause:000 success-silence
          Record-URI:<file://mediaserver/recordings/myfile.wav>;
                     size=242552;duration=25645
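
The Record-URI value returned with RECORD-COMPLETE packs a URI in angle brackets followed by size and duration parameters. The sketch below shows one way a client might pull those apart. The function name parse_uri_header is hypothetical, and the sketch assumes the folded continuation line has already been joined to the header value (the embedded newline is tolerated as whitespace).

```python
import re


def parse_uri_header(value):
    """Split '<uri>; key=val;key=val' into the URI and a dict of parameters."""
    match = re.match(r"\s*<([^>]*)>\s*;?\s*(.*)$", value, re.S)
    if match is None:
        raise ValueError("no <uri> found in %r" % value)
    uri, rest = match.groups()

    params = {}
    for part in rest.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, val = part.partition("=")
        val = val.strip()
        # size and duration are plain digit strings; keep anything else as text
        params[key.strip()] = int(val) if val.isdigit() else val
    return uri, params
```

The same shape is used by the Waveform-URI header in the recognizer example, so the helper applies to both.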

Speaker Verification and Identification

Verifies or identifies who the speaker is.

Speaker Verification State Machine

     Idle              Session Opened       Verifying/Training
     State             State                State
      |                   |                         |
      |--START-SESSION--->|                         |
      |                   |                         |
      |                   |----------|              |
      |                   |     START-SESSION       |
      |                   |<---------|              |
      |                   |                         |
      |<--END-SESSION-----|                         |
      |                   |                         |
      |                   |---------VERIFY--------->|
      |                   |                         |
      |                   |---VERIFY-FROM-BUFFER--->|
      |                   |                         |
      |                   |----------|              |
      |                   |  VERIFY-ROLLBACK        |
      |                   |<---------|              |
      |                   |                         |
      |                   |                |--------|
      |                   | GET-INTERMEDIATE-RESULT |
      |                   |                |------->|
      |                   |                         |
      |                   |                |--------|
      |                   |     START-INPUT-TIMERS  |
      |                   |                |------->|
      |                   |                         |
      |                   |                |--------|
      |                   |         START-OF-INPUT  |
      |                   |                |------->|
      |                   |                         |
      |                   |<-VERIFICATION-COMPLETE--|
      |                   |                         |
      |                   |<--------STOP------------|
      |                   |                         |
      |                   |----------|              |
      |                   |         STOP            |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      |         STOP      |                         |
      |<---------|        |                         |
      |                   |----------|              |
      |                   |    CLEAR-BUFFER         |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      |   CLEAR-BUFFER    |                         |
      |<---------|        |                         |
      |                   |                         |
      |                   |----------|              |
      |                   |   QUERY-VOICEPRINT      |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      | QUERY-VOICEPRINT  |                         |
      |<---------|        |                         |
      |                   |                         |
      |                   |----------|              |
      |                   |  DELETE-VOICEPRINT      |
      |                   |<---------|              |
      |                   |                         |
      |----------|        |                         |
      | DELETE-VOICEPRINT |                         |
      |<---------|        |                         |

Speaker Verification Methods

   verifier-method          =  "START-SESSION"
                            / "END-SESSION"
                            / "QUERY-VOICEPRINT"
                            / "DELETE-VOICEPRINT"
                            / "VERIFY"
                            / "VERIFY-FROM-BUFFER"
                            / "VERIFY-ROLLBACK"
                            / "STOP"
                            / "CLEAR-BUFFER"
                            / "START-INPUT-TIMERS"
                            / "GET-INTERMEDIATE-RESULT"

Verification Events

   verifier-event       =  "VERIFICATION-COMPLETE"
                        /  "START-OF-INPUT"

Verification Header Fields

   verification-header      =  repository-uri
                            /  voiceprint-identifier
                            /  verification-mode
                            /  adapt-model
                            /  abort-model
                            /  min-verification-score
                            /  num-min-verification-phrases
                            /  num-max-verification-phrases
                            /  no-input-timeout
                            /  save-waveform
                            /  media-type
                            /  waveform-uri
                            /  voiceprint-exists
                            /  ver-buffer-utterance
                            /  input-waveform-uri
                            /  completion-cause
                            /  completion-reason
                            /  speech-complete-timeout
                            /  new-audio-channel
                            /  abort-verification
                            /  start-input-timers

2020/02/10

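How to Write Math Formulas and Symbols in Markdown

Use LaTeX syntax to write math formulas and symbols in Markdown.

Math Formulas

1. Inserting a formula

There are two kinds of formula: inline and display.

$ inline formula $

$$ display formula $$

ex:

Inline formula \(F=ma\)

Display formula \[F=ma\]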

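2. Superscripts and subscripts

Superscript: use ^, ex: $x^2$ renders as \(x^2\)

Subscript: use _, ex: $x_2$ renders as \( x_2 \)

Grouping: use {}, ex: $x_{12}$ renders as \(x_{12}\)

To put superscripts and subscripts on both the left and the right of a symbol, use the \sideset command.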

$$ \sideset{^1_2}{^3_4}\bigotimes $$

\[ \sideset{^1_2}{^3_4}\bigotimes \]

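3. Brackets

( ) [ ] | stand for themselves; use \{ and \} to produce { }. To display larger brackets or delimiters that scale with their content, use the \left and \right commands.

Some special brackets:

Input Display Input Display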
\langle \(\langle\) \rangle \(\rangle\)
\lceil \(\lceil\) \rceil \(\rceil\)
\lfloor \(\lfloor\) \rfloor \(\rfloor\)
\lbrace \(\lbrace\) \rbrace \(\rbrace\)

ex1:

$$ f(x,y,z) = 3y^2z \left( 3+\frac{7x+5}{1+y^2} \right) $$

\[ f(x,y,z) = 3y^2z \left( 3+\frac{7x+5}{1+y^2} \right) \]

ex2:

$$ \left. \frac{{\rm d}u}{{\rm d}x} \right| _{x=0} $$

\[ \left. \frac{{\rm d}u}{{\rm d}x} \right| _{x=0} \]
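
As a further illustration (not taken from the original notes), the special brackets from the table can be combined with \left and \right so that they stretch around a fraction. The identity below holds for any integer n:

```latex
$$ \left\lfloor \frac{n}{2} \right\rfloor + \left\lceil \frac{n}{2} \right\rceil = n $$
```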

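4. Fractions

Use \frac{numerator}{denominator} to produce a fraction; fractions can be nested. Typing \frac ab is a shortcut that produces \(\frac ab\). For a complicated fraction you can also write {numerator \over denominator}; in that case the fraction has only one level.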

ex:

$$\frac{a-1}{b-1} \quad and \quad {a+1\over b+1}$$

\[\frac{a-1}{b-1} \quad and \quad {a+1\over b+1}\]

5. Roots

Use \sqrt[index]{radicand} to enter a root; the index defaults to 2 when omitted.

ex:

$$\sqrt{2} \quad and \quad \sqrt[n]{3}$$

\[\sqrt{2} \quad and \quad \sqrt[n]{3}\]
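
Fractions and roots are often nested. As an added illustration, the quadratic formula puts a \sqrt inside the numerator of a \frac:

```latex
$$ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $$
```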

6. Ellipses

Two kinds of ellipsis are common in formulas: \ldots is aligned with the text baseline, and \cdots is aligned with the vertical center of the line.

ex:

$$ f(x_1,x_2,\underbrace{\ldots}_{\rm ldots} ,x_n) = x_1^2 + x_2^2 + \underbrace{\cdots}_{\rm cdots} + x_n^2 $$

\[f(x_1,x_2,\underbrace{\ldots}_{\rm ldots} ,x_n) = x_1^2 + x_2^2 + \underbrace{\cdots}_{\rm cdots} + x_n^2\]

7. Vectors

\vec{a} produces a vector. You can also use \overrightarrow and the related commands to place an arrow over several letters.

ex:

$$\vec{a} \cdot \vec{b}=0$$

\[\vec{a} \cdot \vec{b}=0\]

ex:

$$\overleftarrow{xy} \quad and \quad \overleftrightarrow{xy} \quad and \quad \overrightarrow{xy}$$

\[\overleftarrow{xy} \quad and \quad \overleftrightarrow{xy} \quad and \quad \overrightarrow{xy}\]

8. Calculus

\int_{lower limit}^{upper limit} {integrand}

ex:

$$\int_0^1 {x^2} \,{\rm d}x $$

\[\int_0^1 {x^2} \,{\rm d}x\]

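In this example the \,{\rm d} part can be replaced by a plain d, but keeping it is recommended because it makes the formula look better.

\[\int_0^1 {x^2} dx \] Notice that the d here is set slightly differently from the one above (italic, and without the thin space).

\partial produces the partial-derivative symbol.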

ex:

\frac{\partial x}{\partial y}   

\(\frac{\partial x}{\partial y}\)

9. Limits

\lim_{variable \to value} expression

If needed, \to can be replaced by any other symbol.

ex:

$$ \lim_{n \to +\infty} \frac{1}{n(n+1)} \quad and \quad \lim_{x\leftarrow{sample}} \frac{1}{n(n+1)} $$

\[ \lim_{n \to +\infty} \frac{1}{n(n+1)} \quad and \quad \lim_{x\leftarrow{sample}} \frac{1}{n(n+1)} \]

10. Series

\sum_{lower bound}^{upper bound} {summand} enters a sum. Similarly, use \prod, \bigcup, and \bigcap to enter a product, a union, and an intersection.

ex:

$$\sum_{i=1}^n \frac{1}{i^2} \quad and \quad \prod_{i=1}^n \frac{1}{i^2} \quad and \quad \bigcup_{i=1}^{2} R$$

\[\sum_{i=1}^n \frac{1}{i^2} \quad and \quad \prod_{i=1}^n \frac{1}{i^2} \quad and \quad \bigcup_{i=1}^{2} R\]
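
Limits and sums combine naturally. As an added illustration using the same summand as above, the limit of the partial sums of 1/i^2 is the classical Basel result:

```latex
$$ \lim_{n \to +\infty} \sum_{i=1}^{n} \frac{1}{i^2} = \frac{\pi^2}{6} $$
```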

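11. Greek letters

Type \name (the letter's English name in lowercase) for a lowercase Greek letter and \Name (the name with its first letter capitalized) for an uppercase one. Where an uppercase Greek letter looks the same as a Latin letter, type the Latin capital directly; this also keeps the formula source shorter.

Input Display Input Display Input Display Input Display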
\alpha \(\alpha\) A \(A\) \beta \(\beta\) B \(B\)
\gamma \(\gamma\) \Gamma \(\Gamma\) \delta \(\delta\) \Delta \(\Delta\)
\epsilon \(\epsilon\) E \(E\) \zeta \(\zeta\) Z \(Z\)
\eta \(\eta\) H \(H\) \theta \(\theta\) \Theta \(\Theta\)
\iota \(\iota\) I \(I\) \kappa \(\kappa\) K \(K\)
\lambda \(\lambda\) \Lambda \(\Lambda\) \mu \(\mu\) M \(M\)
\nu \(\nu\) N \(N\) \xi \(\xi\) \Xi \(\Xi\)
o \(o\) O \(O\) \pi \(\pi\) \Pi \(\Pi\)
\rho \(\rho\) P \(P\) \sigma \(\sigma\) \Sigma \(\Sigma\)
\tau \(\tau\) T \(T\) \upsilon \(\upsilon\) \Upsilon \(\Upsilon\)
\phi \(\phi\) \Phi \(\Phi\) \chi \(\chi\) X \(X\)
\psi \(\psi\) \Psi \(\Psi\) \omega \(\omega\) \Omega \(\Omega\)

Some letters have variant forms, whose commands start with \var.

Lowercase Uppercase Variant Display
\epsilon E \varepsilon \(\epsilon \mid E \mid \varepsilon\)
\theta \Theta \vartheta \(\theta \mid \Theta \mid \vartheta\)
\rho P \varrho \(\rho \mid P \mid \varrho\)
\sigma \Sigma \varsigma \(\sigma \mid \Sigma \mid \varsigma\)
\phi \Phi \varphi \(\phi \mid \Phi \mid \varphi\)

12. Special symbols

You can draw a symbol in Detexify to find its LaTeX command.

To display a larger or smaller character, insert \large or \small before the symbol.

12.1 Relations and operators

Input Display Input Display Input Display Input Display
\pm \(\pm\) \times \(\times\) \div \(\div\) \mid \(\mid\)
\nmid \(\nmid\) \cdot \(\cdot\) \circ \(\circ\) \ast \(\ast\)
\bigodot \(\bigodot\) \bigotimes \(\bigotimes\) \bigoplus \(\bigoplus\) \leq \(\leq\)
\geq \(\geq\) \neq \(\neq\) \approx \(\approx\) \equiv \(\equiv\)
\sum \(\sum\) \prod \(\prod\) \coprod \(\coprod\) \backslash \(\backslash\)
\ngeq \(\ngeq\) \nleq \(\nleq\) \not\geq \(\not\geq\) \not\leq \(\not\leq\)

12.2 Set operations

Input Display Input Display Input Display
\emptyset \(\emptyset\) \in \(\in\) \notin \(\notin\)
\subset \(\subset\) \supset \(\supset\) \subseteq \(\subseteq\)
\supseteq \(\supseteq\) \bigcap \(\bigcap\) \bigcup \(\bigcup\)
\bigvee \(\bigvee\) \bigwedge \(\bigwedge\) \biguplus \(\biguplus\)
\subsetneq \(\subsetneq\) \supsetneq \(\supsetneq\) \setminus \(\setminus\)
\bigodot \(\bigodot\) \bigotimes \(\bigotimes\) \mathbb{R} \(\mathbb{R}\)
\mathbb{Z} \(\mathbb{Z}\)

12.3 Logarithms

Input Display Input Display Input Display
\log \(\log\) \lg \(\lg\) \ln \(\ln\)

12.4 Trigonometric functions

Input Display Input Display Input Display
30^\circ \(30^\circ\) \bot \(\bot\) \angle A \(\angle A\)
\sin \(\sin\) \cos \(\cos\) \tan \(\tan\)
\csc \(\csc\) \sec \(\sec\) \cot \(\cot\)

12.5 Calculus

Input Display Input Display Input Display
\int \(\int\) \iint \(\iint\) \iiint \(\iiint\)
\iiiint \(\iiiint\) \oint \(\oint\) \prime \(\prime\)
\lim \(\lim\) \infty \(\infty\) \nabla \(\nabla\)

12.6 Logic

Input Display Input Display Input Display
\because \(\because\) \therefore \(\therefore\)
\forall \(\forall\) \exists \(\exists\) \not\subset \(\not\subset\)
\not< \(\not<\) \not> \(\not>\) \not= \(\not=\)

12.7 hat

Input Display Input Display
\hat{xy} \(\hat{xy}\) \widehat{xyz} \(\widehat{xyz}\)
\tilde{xy} \(\tilde{xy}\) \widetilde{xyz} \(\widetilde{xyz}\)
\check{x} \(\check{x}\) \breve{y} \(\breve{y}\)
\grave{x} \(\grave{x}\) \acute{y} \(\acute{y}\)

12.8 Lines and braces

Input Display
\fbox{a+b+c+d} \(\fbox{a+b+c+d}\)
\overleftarrow{a+b+c+d} \(\overleftarrow{a+b+c+d}\)
\overrightarrow{a+b+c+d} \(\overrightarrow{a+b+c+d}\)
\overleftrightarrow{a+b+c+d} \(\overleftrightarrow{a+b+c+d}\)
\underleftarrow{a+b+c+d} \(\underleftarrow{a+b+c+d}\)
\underrightarrow{a+b+c+d} \(\underrightarrow{a+b+c+d}\)
\underleftrightarrow{a+b+c+d} \(\underleftrightarrow{a+b+c+d}\)
\overline{a+b+c+d} \(\overline{a+b+c+d}\)
\underline{a+b+c+d} \(\underline{a+b+c+d}\)
\overbrace{a+b+c+d}^{Sample} \(\overbrace{a+b+c+d}^{Sample}\)
\underbrace{a+b+c+d}_{Sample} \(\underbrace{a+b+c+d}_{Sample}\)
\overbrace{a+\underbrace{b+c}_{1.0}+d}^{2.0} \(\overbrace{a+\underbrace{b+c}_{1.0}+d}^{2.0}\)
\underbrace{a\cdot a\cdots a}_{b\text{ times}} \(\underbrace{a\cdot a\cdots a}_{b\text{ times}}\)

12.9 Arrows

Input Display Input Display Input Display
\to \(\to\) \mapsto \(\mapsto\)
\implies \(\implies\) \iff \(\iff\) \impliedby \(\impliedby\)

Other available symbols:

Input Display Input Display
\uparrow \(\uparrow\) \Uparrow \(\Uparrow\)
\downarrow \(\downarrow\) \Downarrow \(\Downarrow\)
\leftarrow \(\leftarrow\) \Leftarrow \(\Leftarrow\)
\rightarrow \(\rightarrow\) \Rightarrow \(\Rightarrow\)
\leftrightarrow \(\leftrightarrow\) \Leftrightarrow \(\Leftrightarrow\)
\longleftarrow \(\longleftarrow\) \Longleftarrow \(\Longleftarrow\)
\longrightarrow \(\longrightarrow\) \Longrightarrow \(\Longrightarrow\)
\longleftrightarrow \(\longleftrightarrow\) \Longleftrightarrow \(\Longleftrightarrow\)

12.10 Arithmetic

Operation Syntax Display
Addition x+y \(x+y\)
Subtraction x-y \(x-y\)
Plus-minus x \pm y \(x \pm y\)
Minus-plus x \mp y \(x \mp y\)
Multiplication x \times y \(x \times y\)
Asterisk multiplication x \ast y \(x \ast y\)
Dot multiplication x \cdot y \(x \cdot y\)
Division x \div y \(x \div y\)
Slash division x / y \(x / y\)
Fraction \frac{x}{y} \(\frac{x}{y}\)
Fraction {x}\over{y} \({x}\over{y}\)

12.11 Others

Name Syntax Display
Infinity \infty \(\infty\)
Imaginary unit (dotless i) \imath \(\imath\)
Imaginary unit (dotless j) \jmath \(\jmath\)
^{\circ} \(^{\circ}\)

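13. Fonts

To change the font of part of a formula, use {\font {characters to convert}}, choosing \font from the table below. The default is italic \(italic\).

Fonts whose sample is shown in all capitals are available for uppercase letters only.

Input Description Display Input Description Display
\rm Roman \(\rm{Sample}\) \cal Calligraphic \(\cal{SAMPLE}\)
\it Italic \(\it{Sample}\) \Bbb Blackboard bold \(\Bbb{SAMPLE}\)
\bf Bold \(\bf{Sample}\) \mit Math italic \(\mit{SAMPLE}\)
\sf Sans serif \(\sf{Sample}\) \scr Script \(\scr{SAMPLE}\)
\tt Typewriter \(\tt{Sample}\)
\frak Fraktur \(\frak{Sample}\)

Switching fonts is very common, for example in integrals: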

\begin{array}{cc}
\mathrm{Bad} & \mathrm{Better} \\
\hline \\
\int_0^1 x^2 dx & \int_0^1 x^2 \,{\rm d}x
\end{array}

\(\begin{array}{cc} \mathrm{Bad} & \mathrm{Better} \\ \hline \\ \int_0^1 x^2 dx & \int_0^1 x^2 \,{\rm d}x \end{array}\)

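14. Braces and equation tags

Use \left and \right to produce (parentheses), [square brackets], and {braces} that automatically match the height of their content. Put \tag{label} before the end of a formula to give it a label.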

$$
f\left(
   \left[
     \frac{
       1+\left\{x,y\right\}
     }{
       \left(
          \frac{x}{y}+\frac{y}{x}
       \right)
       \left(u+1\right)
     }+a
   \right]^{3/2}
\right)
\tag{label}
$$

\[ f\left( \left[ \frac{ 1+\left\{x,y\right\} }{ \left( \frac{x}{y}+\frac{y}{x} \right) \left(u+1\right) }+a \right]^{3/2} \right) \tag{label} \]

To show a matching pair of brackets across different lines, use \left. or \right. at the line break to place an invisible "shadow" bracket:

ex:

$$
\begin{aligned}
a=&\left(1+2+3+  \cdots \right. \\
& \cdots+ \left. \infty-2+\infty-1+\infty\right)
\end{aligned}
$$

\[ \begin{aligned} a=&\left(1+2+3+ \cdots \right. \\ & \cdots+ \left. \infty-2+\infty-1+\infty\right) \end{aligned} \]

To enlarge a delimiter in the middle of an expression as well, use \middle

$$
\left\langle
  q
\middle\|
  \frac{\frac{x}{y}}{\frac{u}{v}}
\middle|
   p
\right\rangle
$$

\[ \left\langle q \middle\| \frac{\frac{x}{y}}{\frac{u}{v}} \middle| p \right\rangle \]

15. Other commands

15.1 Defining a new operator name: \operatorname

For details, see the definition of this command and the discussion about it.

ex:

$$ \operatorname{Symbol} A $$

\[\operatorname{Symbol} A\]

15.2 Annotation text: \text

Inside \text{...} you can still use $formula$ to insert another formula.

ex:

$$ f(n)= \begin{cases} n/2, & \text {if $n$ is even} \\ 3n+1, & \text{if $n$ is odd} \end{cases} $$

\[ f(n)= \begin{cases} n/2, & \text {if $n$ is even} \\ 3n+1, & \text{if $n$ is odd} \end{cases} \]

15.3 Adding space between characters

Four widths of space are available: \, \; \quad \qquad

ex:

$$ a \, b \mid a \; b \mid a \quad b \mid a \qquad b $$

\[ a \, b \mid a \; b \mid a \quad b \mid a \qquad b \]

A \text{ } group containing n spaces achieves a similar effect.

15.4 Changing the text color

Use \color{color}{text} to change the color of specific text. Color support depends on the browser; if the browser does not recognize the color you ask for, the text is rendered in black.

Older browsers (HTML4 and CSS2) support the following colors:

Input Display Input Display
black \(\color{black}{text}\) gray \(\color{gray}{text}\)
silver \(\color{silver}{text}\) white \(\color{white}{text}\)
maroon \(\color{maroon}{text}\) red \(\color{red}{text}\)
yellow \(\color{yellow}{text}\) lime \(\color{lime}{text}\)
olive \(\color{olive}{text}\) green \(\color{green}{text}\)
teal \(\color{teal}{text}\) aqua \(\color{aqua}{text}\)
blue \(\color{blue}{text}\) navy \(\color{navy}{text}\)
purple \(\color{purple}{text}\) fuchsia \(\color{fuchsia}{text}\)

Newer browsers (HTML5 and CSS3) support an additional 124 color names.

Use \color{#rgb}{text} to define more colors, where each of r, g, and b in #rgb is one hexadecimal digit (0-9, a-f) giving the intensity of red, green, and blue.

ex:

\begin{array}{|rrrrrrrr|}\hline
\verb+#000+ & \color{#000}{text} & & &
\verb+#00F+ & \color{#00F}{text} & & \\
& & \verb+#0F0+ & \color{#0F0}{text} &
& & \verb+#0FF+ & \color{#0FF}{text}\\
\verb+#F00+ & \color{#F00}{text} & & &
\verb+#F0F+ & \color{#F0F}{text} & & \\
& & \verb+#FF0+ & \color{#FF0}{text} &
& & \verb+#FFF+ & \color{#FFF}{text}\\
\hline
\end{array}

\(\begin{array}{|rrrrrrrr|}\hline \verb+#000+ & \color{#000}{text} & & & \verb+#00F+ & \color{#00F}{text} & & \\ & & \verb+#0F0+ & \color{#0F0}{text} & & & \verb+#0FF+ & \color{#0FF}{text}\\ \verb+#F00+ & \color{#F00}{text} & & & \verb+#F0F+ & \color{#F0F}{text} & & \\ & & \verb+#FF0+ & \color{#FF0}{text} & & & \verb+#FFF+ & \color{#FFF}{text}\\ \hline \end{array}\)

ex:

\begin{array}{|rrrrrrrr|}
\hline
\verb+#000+ & \color{#000}{text} & \verb+#005+ & \color{#005}{text} & \verb+#00A+ & \color{#00A}{text} & \verb+#00F+ & \color{#00F}{text}  \\
\verb+#500+ & \color{#500}{text} & \verb+#505+ & \color{#505}{text} & \verb+#50A+ & \color{#50A}{text} & \verb+#50F+ & \color{#50F}{text}  \\
\verb+#A00+ & \color{#A00}{text} & \verb+#A05+ & \color{#A05}{text} & \verb+#A0A+ & \color{#A0A}{text} & \verb+#A0F+ & \color{#A0F}{text}  \\
\verb+#F00+ & \color{#F00}{text} & \verb+#F05+ & \color{#F05}{text} & \verb+#F0A+ & \color{#F0A}{text} & \verb+#F0F+ & \color{#F0F}{text}  \\
\hline
\verb+#080+ & \color{#080}{text} & \verb+#085+ & \color{#085}{text} & \verb+#08A+ & \color{#08A}{text} & \verb+#08F+ & \color{#08F}{text}  \\
\verb+#580+ & \color{#580}{text} & \verb+#585+ & \color{#585}{text} & \verb+#58A+ & \color{#58A}{text} & \verb+#58F+ & \color{#58F}{text}  \\
\verb+#A80+ & \color{#A80}{text} & \verb+#A85+ & \color{#A85}{text} & \verb+#A8A+ & \color{#A8A}{text} & \verb+#A8F+ & \color{#A8F}{text}  \\
\verb+#F80+ & \color{#F80}{text} & \verb+#F85+ & \color{#F85}{text} & \verb+#F8A+ & \color{#F8A}{text} & \verb+#F8F+ & \color{#F8F}{text}  \\
\hline
\verb+#0F0+ & \color{#0F0}{text} & \verb+#0F5+ & \color{#0F5}{text} & \verb+#0FA+ & \color{#0FA}{text} & \verb+#0FF+ & \color{#0FF}{text}  \\
\verb+#5F0+ & \color{#5F0}{text} & \verb+#5F5+ & \color{#5F5}{text} & \verb+#5FA+ & \color{#5FA}{text} & \verb+#5FF+ & \color{#5FF}{text}  \\
\verb+#AF0+ & \color{#AF0}{text} & \verb+#AF5+ & \color{#AF5}{text} & \verb+#AFA+ & \color{#AFA}{text} & \verb+#AFF+ & \color{#AFF}{text}  \\
\verb+#FF0+ & \color{#FF0}{text} & \verb+#FF5+ & \color{#FF5}{text} & \verb+#FFA+ & \color{#FFA}{text} & \verb+#FFF+ & \color{#FFF}{text}  \\
\hline
\end{array}

\[ \begin{array}{|rrrrrrrr|} \hline \verb+#000+ & \color{#000}{text} & \verb+#005+ & \color{#005}{text} & \verb+#00A+ & \color{#00A}{text} & \verb+#00F+ & \color{#00F}{text} \\ \verb+#500+ & \color{#500}{text} & \verb+#505+ & \color{#505}{text} & \verb+#50A+ & \color{#50A}{text} & \verb+#50F+ & \color{#50F}{text} \\ \verb+#A00+ & \color{#A00}{text} & \verb+#A05+ & \color{#A05}{text} & \verb+#A0A+ & \color{#A0A}{text} & \verb+#A0F+ & \color{#A0F}{text} \\ \verb+#F00+ & \color{#F00}{text} & \verb+#F05+ & \color{#F05}{text} & \verb+#F0A+ & \color{#F0A}{text} & \verb+#F0F+ & \color{#F0F}{text} \\ \hline \verb+#080+ & \color{#080}{text} & \verb+#085+ & \color{#085}{text} & \verb+#08A+ & \color{#08A}{text} & \verb+#08F+ & \color{#08F}{text} \\ \verb+#580+ & \color{#580}{text} & \verb+#585+ & \color{#585}{text} & \verb+#58A+ & \color{#58A}{text} & \verb+#58F+ & \color{#58F}{text} \\ \verb+#A80+ & \color{#A80}{text} & \verb+#A85+ & \color{#A85}{text} & \verb+#A8A+ & \color{#A8A}{text} & \verb+#A8F+ & \color{#A8F}{text} \\ \verb+#F80+ & \color{#F80}{text} & \verb+#F85+ & \color{#F85}{text} & \verb+#F8A+ & \color{#F8A}{text} & \verb+#F8F+ & \color{#F8F}{text} \\ \hline \verb+#0F0+ & \color{#0F0}{text} & \verb+#0F5+ & \color{#0F5}{text} & \verb+#0FA+ & \color{#0FA}{text} & \verb+#0FF+ & \color{#0FF}{text} \\ \verb+#5F0+ & \color{#5F0}{text} & \verb+#5F5+ & \color{#5F5}{text} & \verb+#5FA+ & \color{#5FA}{text} & \verb+#5FF+ & \color{#5FF}{text} \\ \verb+#AF0+ & \color{#AF0}{text} & \verb+#AF5+ & \color{#AF5}{text} & \verb+#AFA+ & \color{#AFA}{text} & \verb+#AFF+ & \color{#AFF}{text} \\ \verb+#FF0+ & \color{#FF0}{text} & \verb+#FF5+ & \color{#FF5}{text} & \verb+#FFA+ & \color{#FFA}{text} & \verb+#FFF+ & \color{#FFF}{text} \\ \hline \end{array} \]
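
To see which color a three-digit code such as #F80 stands for, it helps to expand it to 0-255 channel values. The sketch below assumes the usual CSS convention that each hex digit is doubled (F becomes FF, 8 becomes 88), which is the same as multiplying the digit by 17; the function name expand_short_hex is a hypothetical helper, not part of MathJax.

```python
def expand_short_hex(code):
    """Convert a '#rgb' short hex color into an (r, g, b) tuple of 0-255 ints."""
    digits = code[1:] if code.startswith("#") else code
    if len(digits) != 3:
        raise ValueError("expected three hex digits, got %r" % code)
    # Doubling a hex digit d gives 16*d + d = 17*d.
    red, green, blue = (int(ch, 16) * 17 for ch in digits)
    return (red, green, blue)


if __name__ == "__main__":
    for sample in ("#000", "#F80", "#0FA", "#FFF"):
        print(sample, expand_short_hex(sample))
```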

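15.5 Strikethrough

Strikethrough must be used inside a $$ block.

Put \require{cancel} inside the formula to enable strikethrough on part of an expression. After that, use \cancel{expr}, \bcancel{expr}, \xcancel{expr}, and \cancelto{value}{expr} for the different strikethrough styles.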

$$
\require{cancel}\begin{array}{rl}
\verb|y+\cancel{x}| & y+\cancel{x}\\
\verb|\cancel{y+x}| & \cancel{y+x}\\
\verb|y+\bcancel{x}| & y+\bcancel{x}\\
\verb|y+\xcancel{x}| & y+\xcancel{x}\\
\verb|y+\cancelto{0}{x}| & y+\cancelto{0}{x}\\
\verb+\frac{1\cancel9}{\cancel95} = \frac15+& \frac{1\cancel9}{\cancel95} = \frac15 \\
\end{array}
$$

\[ \require{cancel}\begin{array}{rl} \verb|y+\cancel{x}| & y+\cancel{x}\\ \verb|\cancel{y+x}| & \cancel{y+x}\\ \verb|y+\bcancel{x}| & y+\bcancel{x}\\ \verb|y+\xcancel{x}| & y+\xcancel{x}\\ \verb|y+\cancelto{0}{x}| & y+\cancelto{0}{x}\\ \verb+\frac{1\cancel9}{\cancel95} = \frac15+& \frac{1\cancel9}{\cancel95} = \frac15 \\ \end{array} \]

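Use \require{enclose} to enable strikethrough over a whole expression. After that, use \enclose{style}{expr}, where style is one of horizontalstrike, verticalstrike, updiagonalstrike, and downdiagonalstrike; styles can be combined by separating them with commas.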

$$
\require{enclose}\begin{array}{rl}
\verb|\enclose{horizontalstrike}{x+y}| & \enclose{horizontalstrike}{x+y}\\
\verb|\enclose{verticalstrike}{\frac xy}| & \enclose{verticalstrike}{\frac xy}\\
\verb|\enclose{updiagonalstrike}{x+y}| & \enclose{updiagonalstrike}{x+y}\\
\verb|\enclose{downdiagonalstrike}{x+y}| & \enclose{downdiagonalstrike}{x+y}\\
\verb|\enclose{horizontalstrike,updiagonalstrike}{x+y}| & \enclose{horizontalstrike,updiagonalstrike}{x+y}\\
\end{array}
$$

\[ \require{enclose}\begin{array}{rl} \verb|\enclose{horizontalstrike}{x+y}| & \enclose{horizontalstrike}{x+y}\\ \verb|\enclose{verticalstrike}{\frac xy}| & \enclose{verticalstrike}{\frac xy}\\ \verb|\enclose{updiagonalstrike}{x+y}| & \enclose{updiagonalstrike}{x+y}\\ \verb|\enclose{downdiagonalstrike}{x+y}| & \enclose{downdiagonalstrike}{x+y}\\ \verb|\enclose{horizontalstrike,updiagonalstrike}{x+y}| & \enclose{horizontalstrike,updiagonalstrike}{x+y}\\ \end{array} \]

Matrices

1. Plain matrix

Start with \begin{matrix} and end with \end{matrix}; put the matrix elements in between, separate the elements with &, and end each row with \\

$$
        \begin{matrix}
        1 & x & x^2 \\
        1 & y & y^2 \\
        1 & z & z^2 \\
        \end{matrix}
$$

\[ \begin{matrix} 1 & x & x^2 \\ 1 & y & y^2 \\ 1 & z & z^2 \\ \end{matrix} \]

2. Matrices with delimiters

Replace matrix with pmatrix, bmatrix, Bmatrix, vmatrix, or Vmatrix.

$ \begin{matrix} 1 & 2 \\ 3 & 4 \\ \end{matrix} $
$ \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ \end{pmatrix} $
$ \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} $
$ \begin{Bmatrix} 1 & 2 \\ 3 & 4 \\ \end{Bmatrix} $
$ \begin{vmatrix} 1 & 2 \\ 3 & 4 \\ \end{vmatrix} $
$ \begin{Vmatrix} 1 & 2 \\ 3 & 4 \\ \end{Vmatrix} $
matrix pmatrix bmatrix Bmatrix vmatrix Vmatrix
\( \begin{matrix} 1 & 2 \\ 3 & 4 \\ \end{matrix} \) \( \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ \end{pmatrix} \) \( \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \) \( \begin{Bmatrix} 1 & 2 \\ 3 & 4 \\ \end{Bmatrix} \) \( \begin{vmatrix} 1 & 2 \\ 3 & 4 \\ \end{vmatrix} \) \( \begin{Vmatrix} 1 & 2 \\ 3 & 4 \\ \end{Vmatrix} \)

3. Matrices with ellipses

Use \cdots \(\cdots\), \ddots \(\ddots\), and \vdots \(\vdots\) to enter ellipses.

ex:

$$
        \begin{pmatrix}
        1 & a_1 & a_1^2 & \cdots & a_1^n \\
        1 & a_2 & a_2^2 & \cdots & a_2^n \\
        \vdots & \vdots & \vdots & \ddots & \vdots \\
        1 & a_m & a_m^2 & \cdots & a_m^n \\
        \end{pmatrix}
$$

\[ \begin{pmatrix} 1 & a_1 & a_1^2 & \cdots & a_1^n \\ 1 & a_2 & a_2^2 & \cdots & a_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & a_m & a_m^2 & \cdots & a_m^n \\ \end{pmatrix} \]

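4. Matrices with a dividing line

The column specification cc|c inserts a vertical dividing line between the second and third columns of a three-column matrix.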

$$
\left[
    \begin{array}{cc|c}
      1&2&3\\
      4&5&6
    \end{array}
\right]
$$

\[ \left[ \begin{array}{cc|c} 1&2&3\\ 4&5&6 \end{array} \right] \]

5. Inline matrices

\bigl(\begin{smallmatrix} ... \end{smallmatrix}\bigr)

ex:

This is an inline matrix $\bigl( \begin{smallmatrix} a & b \\ c & d \end{smallmatrix} \bigr)$.

This is an inline matrix \(\bigl( \begin{smallmatrix} a & b \\ c & d \end{smallmatrix} \bigr)\).

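Equations

1. Equation sequences

Use \begin{align}…\end{align} to create a column of aligned equations, ending each line with \\

Note that the {align} environment numbers its equations automatically.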

\begin{align}
\sqrt{37} & = \sqrt{\frac{73^2-1}{12^2}} \\
 & = \sqrt{\frac{73^2}{12^2}\cdot\frac{73^2-1}{73^2}} \\
 & = \sqrt{\frac{73^2}{12^2}}\sqrt{\frac{73^2-1}{73^2}} \\
 & = \frac{73}{12}\sqrt{1 - \frac{1}{73^2}} \\
 & \approx \frac{73}{12}\left(1 - \frac{1}{2\cdot73^2}\right)
\end{align}

\[ \begin{align} \sqrt{37} & = \sqrt{\frac{73^2-1}{12^2}} \\ & = \sqrt{\frac{73^2}{12^2}\cdot\frac{73^2-1}{73^2}} \\ & = \sqrt{\frac{73^2}{12^2}}\sqrt{\frac{73^2-1}{73^2}} \\ & = \frac{73}{12}\sqrt{1 - \frac{1}{73^2}} \\ & \approx \frac{73}{12}\left(1 - \frac{1}{2\cdot73^2}\right) \end{align} \]
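
The derivation above can be checked numerically. The short Python sketch below (an added illustration, using only the standard library) confirms that 73^2 - 1 equals 37 times 12^2 and compares the final approximation with the exact square root; the variable names are arbitrary.

```python
import math

# First line of the derivation: 37 = (73^2 - 1) / 12^2
assert 73**2 - 1 == 37 * 12**2

exact = math.sqrt(37)
# Last line: sqrt(1 - x) is approximated by 1 - x/2 with x = 1/73^2
approx = 73 / 12 * (1 - 1 / (2 * 73**2))
error = abs(exact - approx)

print("exact  =", exact)
print("approx =", approx)
print("error  =", error)
```

The approximation works well because 1/73^2 is very small, so the neglected higher-order terms of the square-root expansion are tiny.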

2. Annotating each line of an equation sequence

Inside {align}, combine \text and \tag as needed. A \tag takes precedence over the automatic numbering.

\begin{align}
   v + w & = 0  &\text{Given} \tag 1\\
   -w & = -w + 0 & \text{additive identity} \tag 2\\
   -w + 0 & = -w + (v + w) & \text{equations $(1)$ and $(2)$}
\end{align}

\[ \begin{align} v + w & = 0 &\text{Given} \tag 1\\ -w & = -w + 0 & \text{additive identity} \tag 2\\ -w + 0 & = -w + (v + w) & \text{equations $(1)$ and $(2)$} \end{align} \]

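Conditional expressions

1. Conditional expressions

Use \begin{cases} to start a group of conditional expressions, insert & in each row to mark where the contents should align, end each row with \\, and close the group with \end{cases}.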

$$
        f(n) =
        \begin{cases}
        n/2,  & \text{if $n$ is even} \\
        3n+1, & \text{if $n$ is odd}
        \end{cases}
$$

\[ f(n) = \begin{cases} n/2, & \text{if $n$ is even} \\ 3n+1, & \text{if $n$ is odd} \end{cases} \]
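
The piecewise definition above maps directly onto an if/else in code. As an added illustration, the Python sketch below implements f(n) and repeatedly applies it until it reaches 1; the helper name trajectory is hypothetical, and the sketch assumes n is a positive integer.

```python
def f(n):
    """The piecewise function above: n/2 if n is even, 3n+1 if n is odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def trajectory(n):
    """Apply f repeatedly, starting from a positive integer n, until reaching 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    sequence = [n]
    while n != 1:
        n = f(n)
        sequence.append(n)
    return sequence


if __name__ == "__main__":
    print(trajectory(6))
```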

2. Left-aligned conditional expressions

$$
        \left.
        \begin{array}{ll}
        \text{if $n$ is even:}&n/2\\
        \text{if $n$ is odd:}&3n+1
        \end{array}
        \right\}
        =f(n)
$$

\[ \left. \begin{array}{ll} \text{if $n$ is even:}&n/2\\ \text{if $n$ is odd:}&3n+1 \end{array} \right\} =f(n) \]

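3. Adjusting row height in conditional expressions

When some rows of a conditional expression are taller than normal, replace the \\ at the end of that row with \\[2ex] to add extra vertical space after it.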

$$
f(n) =
\begin{cases}
\frac{n}{2},  & \text{if $n$ is even} \\
3n+1, & \text{if $n$ is odd}
\end{cases}
$$

\[ f(n) = \begin{cases} \frac{n}{2}, & \text{if $n$ is even} \\ 3n+1, & \text{if $n$ is odd} \end{cases} \]

Result after adjusting the row height

$$
f(n) =
\begin{cases}
\frac{n}{2},  & \text{if $n$ is even} \\[2ex]
3n+1, & \text{if $n$ is odd}
\end{cases}
$$

\[ f(n) = \begin{cases} \frac{n}{2}, & \text{if $n$ is even} \\[2ex] 3n+1, & \text{if $n$ is odd} \end{cases} \]

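Arrays and tables

1. Entering an array or table

A well-formatted table is usually more readable than plain or merely typeset text. Arrays and tables both start with \begin{array}, followed by a column specification that sets the number of columns and the alignment of each column: c, l, and r mean centered, left-aligned, and right-aligned. To insert a vertical dividing line, put | in the column specification; to insert a horizontal dividing line, put \hline before the next row. As with matrices, separate the elements of a row with &, end each row with \\, and close the array with \end{array}.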

\begin{array}{c|lcr}
n & \text{Left} & \text{Center} & \text{Right} \\
\hline
1 & 0.24 & 1 & 125 \\
2 & -1 & 189 & -8 \\
3 & -20 & 2000 & 1+10i
\end{array}

\[ \begin{array}{c|lcr} n & \text{Left} & \text{Center} & \text{Right} \\ \hline 1 & 0.24 & 1 & 125 \\ 2 & -1 & 189 & -8 \\ 3 & -20 & 2000 & 1+10i \end{array} \]

2. Nested arrays or tables

Arrays and tables can be nested inside one another to form a group of arrays or tables. Nesting must be done inside a $$ block.

$$
% outer vertical array of arrays
\begin{array}{c}
    % inner horizontal array of arrays
    \begin{array}{cc}
        % inner array of minimum values
        \begin{array}{c|cccc}
        \text{min} & 0 & 1 & 2 & 3\\
        \hline
        0 & 0 & 0 & 0 & 0\\
        1 & 0 & 1 & 1 & 1\\
        2 & 0 & 1 & 2 & 2\\
        3 & 0 & 1 & 2 & 3
        \end{array}
    &
        % inner array of maximum values 內層"最大值"數組
        \begin{array}{c|cccc}
        \text{max}&0&1&2&3\\
        \hline
        0 & 0 & 1 & 2 & 3\\
        1 & 1 & 1 & 2 & 3\\
        2 & 2 & 2 & 2 & 3\\
        3 & 3 & 3 & 3 & 3
        \end{array}
    \end{array}
    % 內層第一行表格組結束
    \\
    % inner array of delta values 內層第二行Delta值數組
        \begin{array}{c|cccc}
        \Delta&0&1&2&3\\
        \hline
        0 & 0 & 1 & 2 & 3\\
        1 & 1 & 0 & 1 & 2\\
        2 & 2 & 1 & 0 & 1\\
        3 & 3 & 2 & 1 & 0
        \end{array}
        % 內層第二行表格組結束
\end{array}
$$

\[ \begin{array}{c} \begin{array}{cc} \begin{array}{c|cccc} \text{min} & 0 & 1 & 2 & 3\\ \hline 0 & 0 & 0 & 0 & 0\\ 1 & 0 & 1 & 1 & 1\\ 2 & 0 & 1 & 2 & 2\\ 3 & 0 & 1 & 2 & 3 \end{array} & \begin{array}{c|cccc} \text{max}&0&1&2&3\\ \hline 0 & 0 & 1 & 2 & 3\\ 1 & 1 & 1 & 2 & 3\\ 2 & 2 & 2 & 2 & 3\\ 3 & 3 & 3 & 3 & 3 \end{array} \end{array} \\ \begin{array}{c|cccc} \Delta&0&1&2&3\\ \hline 0 & 0 & 1 & 2 & 3\\ 1 & 1 & 0 & 1 & 2\\ 2 & 2 & 1 & 0 & 1\\ 3 & 3 & 2 & 1 & 0 \end{array} \end{array} \]

3. 方程組

Use \begin{array}…\end{array} together with \left\{…\right. to typeset a system of equations.

$$
\left\{
\begin{array}{c}
a_1x+b_1y+c_1z=d_1 \\
a_2x+b_2y+c_2z=d_2 \\
a_3x+b_3y+c_3z=d_3
\end{array}
\right.
$$

\[ \left\{ \begin{array}{c} a_1x+b_1y+c_1z=d_1 \\ a_2x+b_2y+c_2z=d_2 \\ a_3x+b_3y+c_3z=d_3 \end{array} \right. \]

或者使用條件表達式組 \begin{cases}…\end{cases} 來實現相同效果

\begin{cases}
a_1x+b_1y+c_1z=d_1 \\
a_2x+b_2y+c_2z=d_2 \\
a_3x+b_3y+c_3z=d_3
\end{cases}

\[ \begin{cases} a_1x+b_1y+c_1z=d_1 \\ a_2x+b_2y+c_2z=d_2 \\ a_3x+b_3y+c_3z=d_3 \end{cases} \]

連分數

\cfrac

$$
x = a_0 + \cfrac{1^2}{a_1
          + \cfrac{2^2}{a_2
          + \cfrac{3^2}{a_3 + \cfrac{4^4}{a_4 + \cdots}}}}
$$

\[ x = a_0 + \cfrac{1^2}{a_1 + \cfrac{2^2}{a_2 + \cfrac{3^2}{a_3 + \cfrac{4^4}{a_4 + \cdots}}}} \]

可以使用 \frac 來表達連分數的 緊縮記法

$$
x = a_0 + \frac{1^2}{a_1+}
          \frac{2^2}{a_2+}
          \frac{3^2}{a_3 +} \frac{4^4}{a_4 +} \cdots
$$

\[ x = a_0 + \frac{1^2}{a_1+} \frac{2^2}{a_2+} \frac{3^2}{a_3 +} \frac{4^4}{a_4 +} \cdots \]

交換圖表

使用一行 $ \require{AMScd} $ 語句來允許交換圖表的顯示。 宣告交換圖表後,語法與矩陣相似,在開頭使用 begin{CD},在結尾使用 end{CD},在中間插入圖表元素,每個元素之間插入 & ,並在每行結尾處使用 \\

$\require{AMScd}$
\begin{CD}
    A @>a>> B\\
    @V b V V\# @VV c V\\
    C @>>d> D
\end{CD}

\[ \require{AMScd} \begin{CD} A @>a>> B\\ @V b V V\# @VV c V\\ C @>>d> D \end{CD} \]

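@>>> is a right arrow, @<<< a left arrow, @VVV a down arrow, @AAA an up arrow, @= a horizontal double line, @| a vertical double line, and @. an empty cell with no arrow. Text inserted between the > characters of @>>> becomes the label of that arrow, as in @>a>> or @>>d>.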

$\require{AMScd}$
\begin{CD}
    A @>>> B @>{\text{very long label}}>> C \\
    @. @AAA @| \\
    D @= E @<<< F
\end{CD}

\[ \require{AMScd} \begin{CD} A @>>> B @>{\text{very long label}}>> C \\ @. @AAA @| \\ D @= E @<<< F \end{CD} \]

注意事項

  • 在以e為底的指數函數、極限和積分中盡量不要使用 \frac 符號:它會使整段函數看起來很怪,而且可能產生歧義。也正是因此它在專業數學排版中幾乎從不出現。 橫著寫這些分式,中間使用斜線間隔 / (用斜線代替分數線)。
\begin{array}{cc}
\mathrm{Bad} & \mathrm{Better} \\
\hline \\
e^{i\frac{\pi}2} \quad e^{\frac{i\pi}2}& e^{i\pi/2} \\
\int_{-\frac\pi2}^\frac\pi2 \sin x\,dx & \int_{-\pi/2}^{\pi/2}\sin x\,dx \\
\end{array}

\(\begin{array}{cc} \mathrm{Bad} & \mathrm{Better} \\ \hline \\ e^{i\frac{\pi}2} \quad e^{\frac{i\pi}2}& e^{i\pi/2} \\ \int_{-\frac\pi2}^\frac\pi2 \sin x\,dx & \int_{-\pi/2}^{\pi/2}\sin x\,dx \\ \end{array}\)

  • The | symbol gets the wrong spacing when it is used as a separator, so use \mid instead wherever a separator is needed.
\begin{array}{cc}
\mathrm{Bad} & \mathrm{Better} \\
\hline \\
\{x|x^2\in\Bbb Z\} & \{x\mid x^2\in\Bbb Z\} \\
\end{array}

\(\begin{array}{cc} \mathrm{Bad} & \mathrm{Better} \\ \hline \\ \{x|x^2\in\Bbb Z\} & \{x\mid x^2\in\Bbb Z\} \\ \end{array}\)

  • 使用多重積分符號時,不要多次使用 \int ,直接使用 \iint 來表示 二重積分 ,使用 \iiint 來表示 三重積分 等。對於無限次積分,可以用 \int \cdots \int 表示。
\begin{array}{cc}
\mathrm{Bad} & \mathrm{Better} \\
\hline \\
\int\int_S f(x)\,dy\,dx & \iint_S f(x)\,dy\,dx \\
\int\int\int_V f(x)\,dz\,dy\,dx & \iiint_V f(x)\,dz\,dy\,dx
\end{array}

\(\begin{array}{cc} \mathrm{Bad} & \mathrm{Better} \\ \hline \\ \int\int_S f(x)\,dy\,dx & \iint_S f(x)\,dy\,dx \\ \int\int\int_V f(x)\,dz\,dy\,dx & \iiint_V f(x)\,dz\,dy\,dx \end{array}\)

  • 在微分符號前加入 \, 來插入一個小的間隔空隙;沒有 \, 符號的話,latex 將會把不同的微分符號堆在一起。
\begin{array}{cc}
\mathrm{Bad} & \mathrm{Better} \\
\hline \\
\iiint_V f(x){\rm d}z {\rm d}y {\rm d}x & \iiint_V f(x)\,{\rm d}z\,{\rm d}y\,{\rm d}x
\end{array}

\(\begin{array}{cc} \mathrm{Bad} & \mathrm{Better} \\ \hline \\ \iiint_V f(x){\rm d}z {\rm d}y {\rm d}x & \iiint_V f(x)\,{\rm d}z\,{\rm d}y\,{\rm d}x \end{array}\)

Reference

如何在markdown中插入公式

MarkDown公式輸入

Cmd Markdown 公式指導手冊

數學符號的意義與念法

2020/02/03

ASR Automatic_Speech_Recognition

語音辨識

ref: 微軟Edx語音識別課程

  1. 語音信號處理
  2. 聲學模型
  3. 語言模型
  4. 解碼器

語音活動偵測 VAD

  • 聲學特徵分析

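    A sound waveform only records how sound pressure changes over time, so it must be converted into acoustic feature vectors such as MFCC, LPCC or MRCG.

  • MRCG-based classifiers, implemented in Python with TensorFlow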

    1. ACAM 自我調整上下文關注模型
    2. bDNN 增強的深度神經網路
    3. DNN 深度神經網路
    4. LSTM-RNN 長短期記憶遞迴神經網路

標記語言

可用 IPA 或 SAMPA

SAMPA 是電腦讀取的語音字母表,包含將國際音標對應到 33~127 的 ASCII code

可用 audacity 標註音軌

可用 PRAAT 標記音節邊界

端點偵測 End-Point Detection

可自動檢測語音的起始及結束點。

可用雙門檻比較法進行端點偵測。以短時能量 E 和短時平均過零率 Z 作為特徵,進行端點偵測。
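
The double-threshold method above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the frame size and all three threshold values are assumptions chosen for the toy signal, and real systems tune them on data.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Short-time energy E: sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Short-time zero-crossing rate Z: fraction of adjacent pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_endpoints(samples, frame_len=256, hop=128,
                     energy_high=10.0, energy_low=2.0, zcr_threshold=0.3):
    """Return (start_frame, end_frame) of the speech region, or None."""
    frames = frame_signal(samples, frame_len, hop)
    energy = [short_time_energy(f) for f in frames]
    zcr = [zero_crossing_rate(f) for f in frames]
    # Step 1: frames above the high energy threshold are certainly speech.
    loud = [i for i, e in enumerate(energy) if e > energy_high]
    if not loud:
        return None
    start, end = loud[0], loud[-1]
    # Step 2: widen the region while energy stays above the low threshold.
    while start > 0 and energy[start - 1] > energy_low:
        start -= 1
    while end < len(frames) - 1 and energy[end + 1] > energy_low:
        end += 1
    # Step 3: widen further while the zero-crossing rate is high (unvoiced sounds).
    while start > 0 and zcr[start - 1] > zcr_threshold:
        start -= 1
    while end < len(frames) - 1 and zcr[end + 1] > zcr_threshold:
        end += 1
    return start, end

# Toy signal: silence, a sine burst, silence.
signal = ([0.0] * 1024
          + [math.sin(2 * math.pi * n / 32) for n in range(2048)]
          + [0.0] * 1024)
print(detect_endpoints(signal))
```

Energy does most of the work; the zero-crossing rate only extends the boundaries, which matches the "energy first, zero-crossing rate as a supplement" idea.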

6-1 Introduction to End-Point Detection (端點偵測介紹)

常見的端點偵測方法與相關的特徵參數,可以分成兩大類:

  1. 時域(Time Domain)的方法:計算量比較小,因此比較容易移植到計算能力較差的微電腦平台。
    1. 音量:只使用音量來進行端點偵測,是最簡單的方法,但是會對氣音造成誤判。不同的音量計算方式也會造成端點偵測結果的不同,至於是哪一種計算方式比較好,並無定論,需要靠大量的資料來測試得知。
    2. 音量和過零率:以音量為主,過零率為輔,可以對氣音進行較精密的檢測。
  2. 頻域(Frequency Domain)的方法:計算量比較大,因此比較難移植到計算能力較差的微電腦平台。
    1. 頻譜的變異數:有聲音的頻譜變化較規律,變異數較低,可作為判斷端點的基準。
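    2. Spectral entropy: the entropy of the spectrum can be used for a similar purpose.

Methods that apply only simple operations to the waveform are time-domain methods. Methods that need the Fourier transform to produce the spectrum of the sound are frequency-domain methods. This split is often used to classify audio-processing methods, although some methods fall into a gray area. The spectrum and the Fourier transform are covered in later chapters.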

Sphinx Voice Activity Detection

端點偵測介紹

錯誤的端點偵測,在語音辨識上會造成兩種效應:

  • False Rejection:將 Speech 誤認為 Silence/Noise,因而造成音訊辨識率下降
  • False Acceptance:將 Silence/Noise 誤認為 Speech,此時音訊辨識率也會下降,但是我們可以在設計辨識器時,前後加上可能的靜音聲學模型,此時辨識率的下降就會比前者來的和緩。

動態時間規劃

Dynamic Time Warping(DTW)動態時間規整演算法

Dynamic Time Warping(DTW)是一種衡量兩個時間序列之間的相似度的方法,主要應用在語音識別領域來識別兩段語音是否表示同一個單詞。

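The two time series being compared may have different lengths; in speech recognition this shows up as different speaking rates. Phonemes within the same word are also spoken at different speeds: one speaker may stretch the sound “A” and another may shorten “i”. In addition, two series may differ only by a shift along the time axis, so that they coincide once the shift is undone. In these cases the plain Euclidean distance cannot measure the distance (or similarity) between the two series effectively.

DTW stretches and compresses the time series to compute the similarity between them: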

SH專區 - Dynamic Time Warping(動態時間規劃) 1

最基礎的DTW計算其實相當直覺,就是建構一個「點對點」的distance matrix,然後由起點一步一步的向終點邁進。在前進的時候,只有一個條件:挑選最短路徑。
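
A minimal sketch of this basic DTW, assuming one-dimensional sequences and absolute difference as the point-to-point distance (both are assumptions; speech systems compare feature vectors). The function name dtw_distance is hypothetical.

```python
def dtw_distance(a, b):
    """Cost of the cheapest warping path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])        # point-to-point distance
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The same shape spoken "slower" still matches perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Filling the whole matrix costs O(nm) time; the min over three neighbors is the "always pick the shortest path" rule from the text.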

數位語音資料

ref: Audio

要讓電腦處理聲音,必須預先讓聲音變成數字,也就是讓聲音經過「取樣 sampling 」與「量化 quantization 」兩個步驟。取樣把時間變成離散,量化把振幅 amplitude 變成離散。

先取樣(得到數列),再量化(四捨五入),最後得到一串整數數列。每個數字稱作「樣本 sample 」或「訊號 signal 」。

duration 持續時間:聲音總共多少秒。數值越高,訊號越多。

sampling rate 取樣頻率:一秒鐘有多少個訊號。數值越高,音質越好。電腦的聲音檔案,通常採用 48000Hz 或 44100Hz 。手機與電話的聲音傳輸,公定為 8000Hz 。

bit depth 位元深度:一個訊號用多少個位元記錄。數值越高,音質越好。電腦的聲音檔案,通常採用 16-bit 或 24-bit 。 16-bit 的每個訊號是 [-32768,+32767] 的整數,符合 C 語言的 short 變數。

channel 聲道:同時播放的聲音訊號總共幾條。每一條聲音訊號都是一樣長。舉例來說,民眾所熟悉的雙聲道,其實就是同時播出兩條不同的聲音訊號。

取樣頻率、持續時間、聲道,相乘之後就是訊號數量。再乘以位元深度,就是容量大小。再除以 8 ,可將單位換成 byte 。
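
As a quick check of this arithmetic, here is a small sketch; the function name and the one-minute stereo example are illustrative assumptions.

```python
def pcm_size_bytes(sample_rate_hz, duration_s, channels, bit_depth):
    """Size of uncompressed PCM audio in bytes."""
    samples = sample_rate_hz * duration_s * channels  # number of samples
    return samples * bit_depth // 8                   # bits -> bytes

# One minute of 44100 Hz, stereo, 16-bit audio.
print(pcm_size_bytes(44100, 60, 2, 16))
```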

順帶一提,不管是聲音或者是其他信息,只要是經過取樣與量化得到的資料,總稱 PCM data 。「脈衝編碼調變 pulse-code modulation, PCM 」源自訊號學,所以名稱才會如此不直覺。

各種聲音資料的位元深度不盡相同。統合方式:採用 32-bit 浮點數,讀檔後將訊號數值縮放成 [-1,+1] ,才進行聲音處理;存檔前調回原本範圍。

Amplitude 振幅

聲音訊號的數值,代表空氣振動的幅度。基準訂為 0 ,範圍訂為 ±32767 (當位元深度是 16-bit )。

振幅高,聽起來大聲。振幅低,聽起來小聲。

Frequency 頻率

人類擅於感受的不是振動的幅度,而是振動的頻率。頻率高,聽起來尖銳。頻率低,聽起來低沉。

取樣定理: x Hz 的波,取樣頻率至少要是 2x Hz ,才能明確分辨上下次數,頻率保持相同(而振幅總是失真)。也就是說,取樣頻率 48000Hz ,頂多只能記錄 24000Hz 以下的聲音。但是別擔心,人類聽覺範圍是 20Hz 至 20000Hz 。

frame 訊框

訊號很長,變化很大,因此必須將訊號分成小段處理,使得小段之內變化很小。每個小段都稱作一個「框」或「幀」。

當取樣頻率是 48000Hz 、框是 512 個訊號,則此框佔有 512/48000 ≈ 0.01 秒,人耳無法分辨這麼短時間的變化,人聲也無法控制這麼短時間的變化,可以說是足夠細膩了。

為了讓變化更連續,於是讓框交疊。

Fourier Transform

振動十分複雜,難以測量頻率。計算學家的共識是:運用離散版本 Fourier Transform ,將訊號數值分解成簡諧波,解析頻率。因為簡諧波是最漂亮的振動方式,所以適合當作公定標準。

spectrum 頻譜:一個特定時間點的頻率分佈圖。

實務上的做法是:截取一小段時間範圍的訊號,實施快速傅立葉轉換,得到每一種頻率的波的強度、相位。

比如 48000Hz 的取樣頻率、 256 點訊號,則頻譜總共 256 種頻率。第一種頻率是 0Hz ,接著每一種頻率相差 48000Hz / 256 = 187.5Hz 。前 128 種、後 128 種左右對稱,後 128 種沒有實際作用。呼應取樣定理,資訊量只剩一半
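
The bin spacing can be computed directly. The sketch below lists only the first half of the bins, following the description above; the function name is an assumption.

```python
def bin_frequencies(sample_rate_hz, n_points):
    """Center frequency in Hz of the first n_points // 2 spectrum bins."""
    step = sample_rate_hz / n_points  # spacing between adjacent bins
    return [k * step for k in range(n_points // 2)]

freqs = bin_frequencies(48000, 256)
print(freqs[0], freqs[1], freqs[-1])
```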

spectrogram 頻譜圖:所有時間點的頻率分佈圖。

Window

原本完整的聲音波形,硬生生被框截斷,頻譜將產生誤差。解法:將框的兩端的訊號漸漸減弱,減少影響。也就是乘上一個中央高、兩側低的函數,數值皆介於零到一之間,稱作「窗函數 Window Function 」。

Filter 濾波器

頻譜是分析聲音的工具。濾波器則是修改聲音的工具。例如刪除聲音的高頻部分,稱做 lowpass filter 。

濾波器有時域與頻域兩大類。

時域濾波器,直接的修改訊號。計算速度飛快,但是需要「數位訊號處理」的數學知識。例如 shelving filter

頻域濾波器,間接的修改頻譜。原訊號 FFT 得到頻譜,修改頻譜(例如把低頻的強度和相位調成 0 ,形成 highpass filter ),再 IFFT 得到新訊號。

peak detection

找到波形的尖峰。

消除鋸齒:方法很多,諸如 * 時域的平滑效果(k點平均值)(k點中位數)。 * 頻域的刪除高頻(高頻形成鋸齒)。 * 時域的Linear Prediction(迴歸函數)。

frequency detection

找到波形的頻率。

  • 時域的方法 peak detection:找到兩個波峰,位置相減得到波長,波長倒數得到頻率。僅適合純音。 zero-crossing rate:波形穿越零的次數。僅適合純音。 ACF:位移、相乘、加總(內積)。各種位移量,找最大值。 AMDF:位移、相減再絕對值、加總(絕對值誤差)。各種位移量,找最小值。 YIN:位移、相減再平方、加總(平方誤差)。各種位移量,除以前綴和,找最小值。 複合:例如 ACF / (AMDF + 1.0),找最大值。

  • 頻域的方法 peak detection:找到強度最高的頻率。計算速度較慢。
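
A minimal sketch of the ACF method from the list above: shift, multiply, sum, then take the lag with the largest score. The lag search range and the test tone are assumptions; the function name is hypothetical.

```python
import math

def acf_pitch(samples, sample_rate_hz, min_lag, max_lag):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Inner product of the signal with a copy of itself shifted by lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The best lag is the period in samples; its inverse is the frequency.
    return sample_rate_hz / best_lag

# 200 Hz sine sampled at 8000 Hz: the period is 40 samples.
tone = [math.sin(2 * math.pi * n / 40) for n in range(800)]
print(acf_pitch(tone, 8000, 20, 60))
```

The search range matters: a lag of twice the true period scores almost as well, so max_lag is kept below 80 samples here.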

語音處理

gain :音量放大(縮小)。振幅乘上倍率。波形於垂直方向伸展或壓扁。

normalization :校準聲音波形,中央為 0 ,最高振幅為 1 。等同於聲音儘量調到最大聲。

pre-emphasis :有時候錄音環境不佳,錄製到的聲音濛濛霧霧。微分運算可使聲音清晰。副作用是音量下降。

smoothing :有時候錄音環境不佳,錄製到的聲音唧唧吱吱。平均值可抑制雜訊。副作用是聲音模糊不清、音量下降。

mixing :混音。混和好幾道聲音。其實就是加權平均值。

echo :回聲。相同聲音稍後再度出現。其實就是延遲與混音(位移與疊加)。

reverb :迴響。餘音繞樑。反覆回音,間隔極短。聽起來彷彿位於寬敞的密閉空間。

pitch shifting :移調。改變頻率。

chorus :合唱。混和好幾道聲音,每道聲音的頻率略有變動、延遲時間略有差異。

robot sound :簡易做法是混和兩道聲音,頻率略高、頻率略低的聲音。

harmonics :和聲。混和好幾道聲音,每道聲音的頻率皆不同。

distortion :失真。降低聲音品質,聽起來像破音。

equalization :調整每個頻帶的音量,使得聽起來均勻。套用許多個濾波器即可。

pitch bending :轉音。頻率平滑地增減(頻譜的強度平滑地位移)。

morphing :一種聲音,平滑柔順地轉化成另一種聲音。推測是每個頻帶各自轉音。

Fourier transform

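A sound signal can be decomposed into a superposition of waves with different amplitudes, frequencies and phases. The phase of a wave can be represented with complex numbers.

The discrete Fourier transform (DFT) is a function that maps a vector of N complex numbers to another vector of N complex numbers.

\(X(k)=\sum_{t=0}^{N-1} x(t) e^{-i2 \pi \frac{tk}{N}}, \quad k=0, \dots, N-1\)

The vector x holds the signal level at each time point.

The vector X holds the signal level at each frequency.

The formula says that the signal level at frequency k is the sum, over every time t, of the signal level at t multiplied by a complex exponential.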

Euler's formula: for any real x, \( e^{ix} = \cos x + i \sin x \)

cos is an even function, so \( \cos(-x) = \cos x \)

sin is an odd function, so \( \sin(-x) = -\sin x \)

\( e^{-i2 \pi \frac{tk}{N}} = e^{(-2 \pi \frac{tk}{N})i} = \cos(-2\pi \frac{tk}{N}) + i \sin(-2 \pi \frac{tk}{N}) = \cos(2\pi \frac{tk}{N}) - i \sin(2 \pi \frac{tk}{N}) \)

\( x = \operatorname{Re}(x) + i \operatorname{Im}(x) \), where Re is the real part and Im is the imaginary part

\( x(t) e^{-i2 \pi \frac{tk}{N}} = [\operatorname{Re}(x(t)) + i \operatorname{Im}(x(t))][\cos(2\pi \frac{tk}{N}) - i \sin(2 \pi \frac{tk}{N})] \)

Multiplying the two complex numbers gives

\( x(t) e^{-i2 \pi \frac{tk}{N}} = [\operatorname{Re}(x(t))\cos(2\pi \frac{tk}{N}) + \operatorname{Im}(x(t))\sin(2 \pi \frac{tk}{N})] + i[-\operatorname{Re}(x(t))\sin(2 \pi \frac{tk}{N}) + \operatorname{Im}(x(t)) \cos(2 \pi \frac{tk}{N})] \)
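
The DFT sum can be evaluated literally with Python's complex numbers. This is a direct, unoptimized sketch of the formula X(k) above; the function name dft is an assumption.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform: one complex sum per output bin."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

# A constant signal has all of its energy in bin 0 (frequency 0).
spectrum = dft([1, 1, 1, 1])
print([round(abs(v), 6) for v in spectrum])
```

Each of the N outputs needs N multiplications, which is where the quadratic cost of the direct method comes from.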

Fourier Cosine Transform

N 個波,頻率是 0 倍、 0.5 倍、 1 倍、 1.5 倍、 …… ,分別是 cos((2π/N)⋅0⋅t) 、 cos((2π/N)⋅0.5⋅t) 、 cos((2π/N)⋅1⋅t) 、 …… 。寫成代數是 cos((2π/N)⋅(f/2)⋅t) 。

輸入數列與一個波,置中對齊。 N 個對應位置,相乘後求和(點積),得到一個輸出數值。

輸入數列,分別投影至 N 個波,得到 N 個輸出數值,形成輸出數列。這就是餘弦轉換。

正向餘弦轉換:一個複雜的波,拆解成 N 個平穩的波,頻率是 0 倍開始漸增 0.5 倍,振幅是 N 個輸出數值,相位都是 0 。

逆向餘弦轉換: N 個平穩的波,頻率是 0 倍開始漸增 0.5 倍,分別乘上振幅,疊加成一個複雜的波。


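Computing the DFT directly has time complexity O(N^2).

In 1965 Cooley and Tukey published the Cooley-Tukey fast Fourier transform (FFT) algorithm. It uses divide and conquer: a DFT of length \(N=N_1N_2\) is recursively decomposed into shorter DFTs of lengths \(N_1\) and \(N_2\), plus O(N) complex multiplications by twiddle factors.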
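
A minimal sketch of the most common special case, radix-2 (the length is split into even-indexed and odd-indexed halves at every level). It assumes the input length is a power of two; the function name fft is hypothetical and this is not how production libraries implement it.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # DFT of the even-indexed samples
    odd = fft(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor e^{-i 2 pi k / n} applied to the odd half.
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

print(fft([1, 2, 3, 4]))
```

Each level does O(N) work and there are log2(N) levels, giving O(N log N) overall.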

MFCC 特徵

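  • Input: 16 kHz audio
  • Slide a 25 ms window over the signal in 10 ms steps; each window position yields one frame of samples
  • Multiply each frame by a window function (e.g. a Hamming window)
  • Run an FFT
  • Record the energy in each frequency bin, that is, compute the energy of each frequency band
  • Run a DCT (discrete cosine transform) on the logarithm of the band energies to obtain the cepstrum
  • Keep the first 13 cepstral coefficients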

深度學習可直接使用聲音的 wav file,不使用 MFCC

說話特徵

不同 speaker 在一個通用聲學簇內具有不同的子空間。

由通用簇偏移的子空間,描述了樣本的方向向量,取決於說話者的語言文字。

為取得相關的(含說話者) 的特徵,將這些向量分析為特徵因數。分析的因數特徵稱為身份向量 (identity vectors: I-vectors)。 i-vectors 就是說話者的特徵,可用來辨識說話者。(ex: 利用兩個語音向量的餘弦距離,作為兩段語音是否來自同一個說話者的衡量指標)

解碼

kaldi 工具中,沒有單一標準解碼器,或滿足固定介面的解碼器

目前有 SimpleDecoder 與 FasterDecoder 解碼器,及詞圖產生解碼器。有 command line tool 封裝這些解碼器,他們可解碼特定類型的模型 (ex: GMM) 或具有特定的特殊條件 (ex: 多類別 fMLLR)

The decoding commands are gmm-decode-simple, gmm-decode-faster, gmm-decode-kaldi and gmm-decode-faster-fmllr. There is currently no single command that can decode every possible model type.

語言模型

語音辨識是使用語言模型解碼識別結果。以下是 機率語言模型 & 語言模型套件 KenLM

機率語言模型

輸入語音序列 「已經還了」

兩種辨識結果,要如何評價?

\(S_1\) 已經/還/了

\(S_2\) 已經/黃/了

根據條件機率 \(P(S_1|C), P(S_2|C)\) 的值決定選擇哪一個

貝氏定理 \(P(C∩S) = P(S|C)P(C)=P(C|S)P(S)\) -> \(P(S|C) = \frac{P(C|S)P(S)}{P(C)}\)


\(P(C)\) 是字串在語料庫中出現的機率。 ex: 語料庫有一萬個句子,其中一句:「生命的意義是什麼」,\(P(C)=P("生命的意義是什麼")=0.0001\) 。

因字串恢復到中文字串的機率只有唯一一種,所以 \(P(C|S)=1\)

Comparing \(P(S_1|C)\) with \(P(S_2|C)\) therefore reduces to comparing \(P(S_1)\) with \(P(S_2)\), because \(\frac{P(S_1|C)}{P(S_2|C)} = \frac{P(S_1)}{P(S_2)}\)


\(P(S_1 已經/還/了) > P(S_2 已經/黃/了)\) 所以選擇 \(S_1\)


機率語言模型是用來評估指定詞序列 S 的機率 P(S),為簡化計算,假設每個詞之間的機率跟上下文無關

\(P(S)=P(w_1, w_2, .... w_r) = P(w_1)P(w_2)...P(w_r)\) 其中 \(P(w)\) 是詞 w 出現在語料庫中的機率

ex: \( P(S_1 已經/還/了) = P(已經) P(還) P(了)\)

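Because the vocabulary is large, each word gets a very small probability, so \(P(S)\) is a product of many very small numbers. Taking logarithms with \(y=\log(x)\) turns the product into a sum and represents positive numbers below 1 with better numerical precision.

\(P(S) = P(w_1)P(w_2) \cdots P(w_r) \Rightarrow \log P(S) = \log P(w_1) + \log P(w_2) + \cdots + \log P(w_r)\)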

因為機率小於 1 ,取對數後是負數

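The probability of a word is estimated as \(P(w_1) = \frac{n}{N}\), where n is the number of times \(w_1\) occurs in the corpus and N is the total number of words in the corpus, so \(\log P(w_1) = \log n - \log N\)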

如果已經先算出詞機率的對數值,結果可直接用加法,得到 \(logP(S)\)

這個 \(P(S)\) 公式,就是一元機率語言模型為基礎的計算公式

一元模型

假設語料庫有 10000 個詞,其中 「了」出現了 180 次,則機率為 0.018

Word   Count   Probability
了     180     0.0180
還     5       0.0005
已經   10      0.0010
黃     2       0.0002

\(P(S_1) = P(已經)P(還)P(了) = 0.001*0.0005*0.018 = 9*10^{-9}\)

\(P(S_2) = P(已經)P(黃)P(了) = 0.001*0.0002*0.018 = 3.6*10^{-9}\)

因為 \(P(S_1) > P(S_2)\) 故選擇 \(S_1\)

With natural logarithms:

\(\log P(S_1) = \log P(已經) + \log P(還) + \log P(了) = -18.52604\)

\(\log P(S_2) = \log P(已經) + \log P(黃) + \log P(了) = -19.44233199\)

結果一樣 \(logP(S_1) > logP(S_2)\)
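
The unigram comparison can be reproduced in a few lines. The counts come from the example (a 10000-word corpus); the function name and the use of natural logarithms are assumptions of this sketch.

```python
import math

# Word counts from the example corpus of 10000 words.
COUNTS = {"了": 180, "還": 5, "已經": 10, "黃": 2}
TOTAL_WORDS = 10000

def unigram_log_prob(words, counts=COUNTS, total=TOTAL_WORDS):
    """log P(S) under the unigram model: a sum of per-word log probabilities."""
    return sum(math.log(counts[w] / total) for w in words)

s1 = unigram_log_prob(["已經", "還", "了"])
s2 = unigram_log_prob(["已經", "黃", "了"])
print(s1, s2, "choose S1" if s1 > s2 else "choose S2")
```

Summing logs avoids the underflow that multiplying many tiny probabilities would eventually cause, and the ordering of the two candidates is unchanged because log is monotonic.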

資料基礎

A probabilistic language model needs to know which words are high-frequency and which are low-frequency, that is

\(P(w) = \frac {freq(w)}{所有詞的總次數}\)

詞語機率表是由語料庫統計出來的。為支援中文分詞方法,需要增加分詞資料庫

從分詞語料庫加工出人工可以編輯的一元詞典。一元的英文是 Unigram,一元詞典就稱為 UnigramDic

UnigramDic.txt sample:

有: 180
有意: 5
意見: 10
見: 2
分歧: 1
大學生: 139
生活: 1671

可根據 UnigramDic.txt 產生一元詞典樹

為快速產生詞典樹,可把結構儲存下來。

以 Depth First 方式,對每個節點編號,沒有孩子的分支節點編號為 -1。根據編號,以 節點編號#左邊孩子節點的編號#中間孩子節點的編號#右邊孩子節點的編號#節點本身的資料 的格式,儲存節點之間的關係資料。

0#1#2#3#有
1#4#5#6#基
2#7#8#9#道

改進一元模型

計算最佳前驅節點到目前節點的傳輸機率,考慮更前面的切分路徑。

Replace \(P(w_i)\) with \(P(w_i|w_{i-1})\)

如果用最大似然估計 \(P(w_i|w_{i-1})\) ,則 \(P(w_i|w_{i-1}) = \frac {freq(w_{i-1}, w_i)}{freq(w_{i-1})}\)

ex: \(freq(有,意見) = 4\),則 \(P(意見|有) = freq(有, 意見)/freq(有) = 4/4000 = 0.001\)

ex: 語料庫中存在 「北京/舉行/新年/音樂會」,就存在了一元連結。存在二元連結:北京@舉行, 舉行@新年, 新年@音樂會。

可從語料庫統計前後兩個詞一起出現的次數


如果 「意見, 分歧」 沒找到其他搭配的詞, \(P(S_1), P(S_2)\) 都是 0,無法透過比較計算結果,找到更好的切分方案,這是零機率問題。


使用 \(freq(w_{i-1}, w_i)/freq(w_{i-1})\) 估計 \(P(w_i|w_{i-1})\)

使用 \(freq(w_{i-2}, w_{i-1}, w_i)/freq(w_{i-2}, w_{i-1})\) 估計 \(P(w_i|w_{i-2}, w_{i-1})\)

Because maximum-likelihood estimation is used, \(freq(w_{i-1}, w_i)/freq(w_{i-1})\) is written \(P_{ML}(w_i|w_{i-1})\). The interpolated estimate \(P_{li}\) mixes maximum-likelihood estimates of different orders:

\(P_{li}(w_i|w_{i-1}) = \lambda_1 P_{ML}(w_i) + \lambda_2 P_{ML}(w_i|w_{i-1}) = \lambda_1(freq(w_i)/N) + \lambda_2(freq(w_{i-1}, w_i) / freq(w_{i-1}))\)

For \(P_{li}(w_i|w_{i-2}, w_{i-1})\) it is

\(P_{li}(w_i|w_{i-2}, w_{i-1}) = \lambda_1 P_{ML}(w_i) + \lambda_2 P_{ML}(w_i|w_{i-1}) + \lambda_3 P_{ML}(w_i|w_{i-2}, w_{i-1}) \)

where \(\lambda_1 + \lambda_2 + \lambda_3 = 1\)

根據平滑公式計算,由於 \(P'(w_i|w_{i-1}) = 0.3 P(w_i) + 0.7 P(w_i | w_{i-1})\)

所以

\( P(S_1) = P(有) P'(意見|有) P'(分歧|意見) \\ = P(有)*(0.3 P(意見)+0.7P(意見|有)) * (0.3P(分歧)+0.7 P(分歧|意見)) \\ = 0.0180*(0.3*0.001+0.7*0.001)*(0.3*0.0001) \\ = 5.4*10^{-10} \)

\( P(S_2) = P(有意) P'(見|有意) P'(分歧|見) \\ = P(有意)*(0.3 P(見)+0.7P(見|有意)) * (0.3P(分歧)+0.7 P(分歧|見)) \\ = 0.0005*(0.3*0.0002)*(0.3*0.0001) \\ = 9*10^{-13} \)

\(P(S_1) > P(S_2)\) 改進一元模型,結果的區分度比較好
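
The interpolated scores can be recomputed with a short sketch. The probabilities are the ones used in the worked example (unseen bigrams get probability 0); the helper name interpolate is an assumption.

```python
def interpolate(p_unigram, p_bigram, lam_uni=0.3, lam_bi=0.7):
    """Linear interpolation of a unigram and a bigram estimate."""
    return lam_uni * p_unigram + lam_bi * p_bigram

# S1 = 有 / 意見 / 分歧: P(意見|有) = 0.001, bigram (意見, 分歧) unseen.
p_s1 = 0.0180 * interpolate(0.001, 0.001) * interpolate(0.0001, 0.0)

# S2 = 有意 / 見 / 分歧: both bigrams unseen.
p_s2 = 0.0005 * interpolate(0.0002, 0.0) * interpolate(0.0001, 0.0)

print(p_s1, p_s2, p_s1 > p_s2)
```

Because the unigram term is never zero for a dictionary word, the interpolated product stays positive even when a bigram was never observed, which is what makes the two segmentations comparable.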

二元詞典

也就是 BigramDic,其中 "0START.0" "0END.0" 是虛擬的開始, 結束詞語

ex:

0START.0@歡迎     -> 「歡迎」 是開始詞
什麼@0END.0       -> 「什麼」是結束詞

二元詞表格式為 「前一個詞@後一個詞」組合出現的次數

中國@北京:100
中國@北海:1

二元詞表會有幾十萬筆,要考慮如何快速查詢

要先載入一元詞典,建置詞典樹結構,再載入二元詞典,也就是在詞典樹的結構,加掛上二元連結資訊。

N元模型

The unigram model assumes that words occur independently of each other, which does not hold in practice.

N元模型用 n 個單字組成的序列衡量切分方案的合理性

但當詞數有 20000,二元組合就有 20000^2 個,三元組合有 20000^3 個

利用馬可夫假設,解決參數空間過大的問題

馬可夫假設:一個詞的出現,僅依賴於前面出現的有限的幾個詞。

如果一個詞的出現,僅依賴於前面出現的一個詞,也就是二元模型 ex: \(P(S_1)=P(有) P(意見|有) P(分歧|意見)\)

如果一個詞的出現,僅依賴於前面出現的2個詞,就是三元模型


如果切分方案 S 由 n 個片語組成, \(P(w_1)P(w_2|w_1)P(w_3|w_2) ... P(w_n|w_{n-1})\),也就是 n 項連乘積

ex: Bigram \(P(S_1)=P(有) P(意見|有) P(分歧|意見)\)

Trigram \(P(S_1)=P(有) P(意見|有) P(分歧|有,意見)\)

因為 \(P(w_i|w_{i-1}) = freq(w_{i-1}, w_i)/freq(w_{i-1})\) 二元分詞也要使用到一元詞典


n 元機率 https://github.com/esbie/ngrams

評估語言模型

透過困惑度 perplexity 衡量語言模型。

困惑度 perplexity 是一個語言事件的不確定性的度量。

ex: 「行」後面可接 「走」「善」「不行」,所以「行」的困惑度高

有些詞的困惑度低

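The lower a language model's perplexity, the better: a low-perplexity model is better at resolving ambiguity. A corpus from a specialized domain tends to give lower perplexity.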

困惑度的定義:

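Suppose the test data consists of n sentences \(S_1, S_2, \dots, S_n\). The log probability of the whole test set T is

\(\log \prod_{i=1}^{n}P(S_i) = \sum_{i=1}^{n} \log P(S_i)\)

Perplexity is \(2^{-x}\), where \(x=\frac{1}{W} \sum_{i=1}^{n} \log_2 P(S_i)\) and W is the total number of words in the test set T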


假設詞表 V,其中有 N 個詞。 \(P(w)=1/N\) 困惑度為 \(2^{-x}\) ,其中 \(x=log(1/N)\) ,所以困惑度為 N
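
A small sketch of the perplexity definition, checked against the uniform case just described. The function name and the two-sentence test set are assumptions; sentence probabilities are passed as base-2 logarithms.

```python
import math

def perplexity(sentence_log2_probs, total_words):
    """Perplexity 2^(-x), with x the average log2 probability per word."""
    x = sum(sentence_log2_probs) / total_words
    return 2 ** (-x)

# Uniform model over N = 1000 words: every word has probability 1/N.
n_vocab = 1000
log2_p_word = math.log2(1 / n_vocab)
# Two test sentences of 5 and 3 words, so W = 8.
test_set = [5 * log2_p_word, 3 * log2_p_word]
print(perplexity(test_set, 8))
```

For the uniform model the result equals the vocabulary size, so perplexity can be read as the effective number of equally likely choices per word.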

ex:

The training set has 38 million words, the vocabulary has 19,979 words, and the test set has 1.5 million words. Unigram perplexity = 962

二元模型困惑度 = 170,三元模型困惑度 = 109

平滑演算法

語料有限,不可能覆蓋所有詞彙,當 N 元模型,N很大時,由於樣本數量有限,導致先驗機率值為 0,這就是 零機率問題。

當 N=1 一元模型,也存在零機率問題。例如有些詞在詞表中,但沒有出現在語料庫。

回歸分析可用來預測沒有測量過的值。

平滑演算法就是用觀測到的事件,預測未觀測事件的機率。

資料稀疏在統計自然語言處理中表現的就是零機率問題。有各種平滑演算法可解決此問題

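The simplest is additive smoothing: add \(\lambda\) (with \(0 \le \lambda \le 1\)) to the count of every item, then divide by the new total to get the item's probability. The mathematician Laplace first proposed adding 1 to estimate the probability of events that have not been observed, so additive smoothing is also called Laplace smoothing.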
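
A minimal sketch of additive smoothing. The function name, the toy counts and the three-word vocabulary are assumptions made for illustration.

```python
def additive_smoothing(counts, vocabulary, lam=1.0):
    """Add lam to every count, then normalize over the whole vocabulary."""
    total = sum(counts.get(w, 0) for w in vocabulary)
    denom = total + lam * len(vocabulary)  # new total after adding lam everywhere
    return {w: (counts.get(w, 0) + lam) / denom for w in vocabulary}

# "c" is in the vocabulary but never appears in the corpus.
probs = additive_smoothing({"a": 3, "b": 1}, ["a", "b", "c"], lam=1.0)
print(probs)
```

With lam = 1 this is Laplace (add-one) smoothing; the unseen word gets a small non-zero probability and the distribution still sums to 1.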


Choosing a value for \(\lambda\) is not easy. Good-Turing estimation is an alternative that needs no \(\lambda\).

假設詞典有 x 個詞,在語料庫出現 r 次的詞有 \(N_r\) 個,例如出現一次有 \(N_1\) 個

語料庫總詞數 \(N=0*N_0 + 1*N_1 + ... + r*N_r\) 其中 \(x=N_0+N_1+.. +N_r\)

使用觀察到的類別 r+1 的全部機率,估計類別 r 的全部機率。

沒有出現過的詞的總機率 \(p_0 = N_1/N\) 分攤到每個詞的機率為 \(N_1/(N*N_0)\)

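The second step estimates the total probability of the words that occur exactly once in the corpus as \(p_1 = 2N_2/N\), which gives each such word a probability of \(2N_2/(N*N_1)\)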

當 r 很大, \(N_r\) 可能為 0,此時不再求平滑
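
A minimal sketch of these Good-Turing estimates. It assumes counts maps every seen word to a count of at least 1 and vocabulary_size is the dictionary size x; both names are hypothetical. A word whose next class is empty (N_{r+1} = 0) keeps the unsmoothed estimate r/N, so the result is not renormalized.

```python
from collections import Counter

def good_turing(counts, vocabulary_size):
    """Return (per-word probabilities for seen words, probability of each unseen word)."""
    n_total = sum(counts.values())            # N: total words in the corpus
    freq_of_freq = Counter(counts.values())   # N_r for every r >= 1
    n0 = vocabulary_size - len(counts)        # N_0: dictionary words never seen
    probs = {}
    for word, r in counts.items():
        n_r, n_next = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        if n_next == 0:
            probs[word] = r / n_total                          # no smoothing
        else:
            probs[word] = (r + 1) * n_next / (n_r * n_total)   # r* / N
    unseen = freq_of_freq.get(1, 0) / (n_total * n0) if n0 else 0.0
    return probs, unseen

probs, unseen = good_turing({"a": 1, "b": 1, "c": 2}, vocabulary_size=5)
print(probs, unseen)
```

Practical implementations smooth the N_r values first and renormalize afterwards; both steps are left out here to keep the sketch close to the formulas.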

對條件機率的 N 元估計平滑

\(P_{GT}(w_i|w_1,...,w_{i-1}) = \frac{c^*(w_1,...,w_i)}{c^*(w_1,...,w_{i-1})}\)

其中 \(c^*\) 就是 GT 估計

估計三元條件機率為 \(P_{GT}(w_3|w_1,w_2) = \frac{c^*(w_1,w_2,w_3)}{c^*(w_1,w_2)}\)

對沒出現詞語的三元聯合機率為

\(P_{GT}(w_1,w_2,w_3) = C_0^*/N = N_1/(N_0*N)\)

KenLM 語言模型工具

github: https://github.com/kpu/kenlm.git

python module: pip install https://github.com/kpu/kenlm/archive/master.zip

測試語言模型檔案 test.arpa

使用 test.arpa 評估句子的機率

import kenlm
model = kenlm.Model('lm/test.arpa')
print(model.score('this is a sentence .', bos = True, eos = True))

#輸出
-49.57345703125

Once compiled, KenLM can estimate a language model from input text using modified Kneser-Ney smoothing and write it in ARPA format (text.arpa below):

bin/lmplz -o 5 <text >text.arpa

建立三元模型

./lmplz --order 3 --text input.txt.tok --arpa output.arpa

ARPA 檔案格式

n-gram 語言模型格式

<LM_definition> = [ { <comment> } ]
                  \data\
                  <header>
                  <body>
                  \end\
   <comment> = { <word> }
<header> = { ngram <int>=<int> }

Here the first <int> is the n-gram order and the second <int> is the number of n-gram entries of that order

<body>  = { <lmpart1> } <lmpart2>
<lmpart1> = \<int>-grams:
            { <ngramdef1> }
<lmpart2> = \<int>-grams:
            { <ngramdef2> }
<ngramdef1> = <float> { <word> }  <float>
<ngramdef2> = <float> { <word> }

bigram 語言模型由 unigram 與 bigram 兩個部分組成。

實際機率用對數取代,所以會看到負數

sample:

wood pittsburgh cindy jean
jean wood

共四個單字,加上 句子開始 <s>、結束 </s>、未知單字 <unk> ,共七個

語言模型:

\data\
ngram 1=7
ngram 2=7

\1-grams:
-1.0000     <unk>               -0.2553
-98.9366    <s>                 -0.3064
-1.0000     </s>                0.0000
-0.6990     wood                -0.2553
-0.6990     cindy               -0.2553
-0.6990     pittsburgh          -0.2553
-0.6990     jean                -0.1973

\2-grams:
-0.2553     <unk> wood
-0.2553     <s> <unk>
-0.2553     wood pittsburgh
-0.2553     cindy jean
-0.2553     pittsburgh cindy
-0.5563     jean </s>
-0.5563     jean wood

\end\

依存語言模型

可建立單字之間的句法相依關係模型。

詞之間的關係有方向性。大多數的語言都滿足投射性,也就是如果 詞 p 依存於 詞 q,那麼 p 和 q 之間的任意詞 r 就不能依存到 p 和 q 組成的跨度之外。

箭頭的起點是從屬詞(修飾詞),終點指向支配詞(核心詞),弧上標記是依存關係標記。

一:從屬詞,本:支配詞,"qc" 是依存關係標記

依存文法也可以表示成樹狀結構


機率計算公式: T: 依存樹拓樸關係

\( P(s|T) = P(the|boy) P(boy|find) P(will|find) P(find|<NONE>) P(it|find) P(interesting|find) \)

References

Tutorials

語音識別的技術原理是什麼?

初識語音識別及 Kaldi 的安裝使用

The CMU Pronouncing Dictionary.

  • 聲學模型:是將聲學和發音學(phonetics)的知識進行整合,以特徵提取部分生成的特徵作為輸入,並為可變長特徵序列生成聲學模型分數。
  • 語言模型:通過從訓練語料(通常是文本形式)學習詞之間的相互關係,來估計假設詞序列的可能性,又叫語言模型分數。
  • GMM:Gaussian Mixture Model, 高斯混合模型,描述基於傅里葉頻譜語音特徵的統計模型,用於傳統聲學模型的建模中。
  • HMM:Hidden Markov Model, 隱馬爾科夫模型,是一種用來描述含有隱含未知參數的馬爾科夫過程,其難點是從可觀察的參數中確定該過程的隱含參數。然後利用這些參數來作進一步的分析,例如模式識別。
  • MFCC:Mel-Frequency Cepstral Coefficients, 梅爾頻率倒譜係數,是組成梅爾頻率倒譜的係數。衍生自音訊片段的倒頻譜(cepstrum)。倒譜與梅爾頻率倒譜的區別在於,梅爾頻率倒譜的頻帶劃分是在梅爾刻度上等距劃分的,它比用於正常的對數倒頻譜中的線性間隔的頻帶更接近人類的聽覺系統。廣泛應用於語音識別中。
  • Fbank:Mel Frequency Filter Bank, 梅爾頻率濾波器組。
  • WER:Word Error Rate, 詞錯誤率,是最常見的衡量語音識別系統性能的指標。

  • 特徵提取:語音識別第一步就是特徵提取,去除掉語音信號中對於語音識別無用的冗餘信息(如背景噪音),保留能夠反映語音本質特徵的信息(為後面的聲學模型提取合適的特徵向量),並用一定的形式表示出來;較常用的特徵提取算法有 MFCC。
  • 聲學模型訓練:根據語音庫的特徵參數訓練出聲學模型參數,在識別的時候可以將待識別的語音的特徵參數同聲學模型進行匹配,從而得到識別結果。目前主流的語音識別系統多採用 HMM 進行聲學模型建模。
  • 語言模型訓練:就是用來計算一個句子出現的概率模型,主要用於決定哪個詞序列的可能性更大。語言模型分為三個層次:字典知識、語法知識、句法知識。對訓練文本庫進行語法、語義分析,經過基於統計模型訓練得到語言模型。
  • 語音解碼與搜索算法:其中解碼器就是針對輸入的語音信號,根據已經訓練好的聲學模型、語言模型以及字典建立一個識別網絡,再根據搜索算法在該網絡中尋找一條最佳路徑,使得能夠以最大概率輸出該語音信號的詞串,這樣就確定這個語音樣本的文字。

語音識別工具包Kaldi的學習和使用(三):數據集的訓練和使用

語音識別大牛 Daniel Povey 莫名被JHU開除後,怒拒Facebook,轉向中國公司與高校

頂級語音專家、MSR首席研究員俞棟:語音識別的四大前沿研究

中文是一個有音調的語言,音調對字和詞的識別是有影響的。音調信息如果用好的話,就有可能提升識別率。不過大家發現 deep learning 模型有很強的非線性映射功能,很多音調裡的信息可以被模型自動學到,不需要特別處理。

很多研究組都發現或證實使用小 Kernel 的 Deep CNN 比我們之前在書裡面提到的使用大 kernel 的 CNN 方法效果更好。Deep CNN 跟 LSTM 比有一個好處。用 LSTM 的話,一般你需要用雙向的 LSTM 效果才比較好。但是雙向 LSTM 會引入很長的時延,因為必須要在整個句子說完之後,識別才能開始。而 Deep CNN 的時延相對短很多,所以在實時系統裡面我們會更傾向於用 Deep CNN 而不是雙向 LSTM。

kaldi中文語音識別(1)

System architecture of the cloud-based HALEF spoken dialog system depicting the various modular open-source components

Kaldi 是一個非常強大的語音識別工具庫,最近有些業者用 kaldi nnet3 語音辨識工具進行 TDNN 訓練,用於機器人電話客服。

語音識別從入門到放棄

語音變成文字的大致流程為:將一段語音的聲波按幀切開用幀組成狀態用狀態組成音素再將音素合成單詞,語音就變成了文字 。

語音識別kaldi該如何學習?

學習kaldi的話,先從hmm-gmm入手比較好,像steps/traindelta.sh, steps/trainfmllr.sh, steps/decode.sh這些腳本都是基於hmm-gmm模型。

搞清楚hmm-gmm之後對語音識別就有了一個清晰的理解,接下來就可以上手神經網絡。kaldi支持很多神經網路,如MLP, RNN, CNN, LSTM,如果對神經網路瞭解不多還是從MLP入手較好,MLP是神經網路中最基礎的模型。

台灣言語工具 語音辨識

語音識別技術的前世今生

語音識別技術的前世今生之前世

u010384318的專欄 機器學習 語音識別 ing

語音識別——基於深度學習的中文語音識別系統實現(代碼詳解)

book

李理的博客

深度學習理論與實戰:提高篇

微軟Edx語音識別課程

基於WFST的語音識別解碼器

Kaldi的MFCC特徵提取代碼分析

GitHub projects

zzw922cn/AutomaticSpeechRecognition

​ End-to-end Automatic Speech Recognition for Madarian and English in Tensorflow

audier DeepSpeechRecognition

​ A Chinese Deep Speech Recognition System 包括基於深度學習的聲學模型和基於深度學習的語言模型

​ 基於深度學習的中文語音識別系統

libai2 masr

​ 中文語音識別,高識別率預訓練模型,支持docker快速安裝 Chinese Speech Recognition; Mandarin Automatic Speech Recognition;

deepxuexi/ARFASR

​ An very convenient Audio Recorder For ASR Projects. It can recording 16K 16Bit Wav files for ASR projects for the next recognizing. it use directsound for recording in Windows OS , Python3.6 . Press Enter to start recording , and press Enter again to stop recording as you think it's recording enough. Press 'q' to exit the process.