Microsoft’s Cursive Recognizer
February 1, 2005 Microsoft Tablet PC
Microsoft’s Cursive Recognizer
Jay Pittman and the entire
Microsoft Handwriting Recognition
Research and Development Team
Syllabus
- Neural Network Review
- Microsoft’s Own Cursive Recognizer
- Isolated Character Recognizer
- Paragraph’s Calligrapher
- Combined System
Neural Network Review
- Directed acyclic graph
- Nodes and arcs, each containing a simple value
  - Nodes contain activations; arcs contain weights
- At run time, we do a “forward pass” which computes activations from inputs to hiddens, and then to outputs
- From the outside, the application only sees the input nodes and output nodes
- Node values (in and out) range from 0.0 to 1.0
[Figure: a small feed-forward network; node activations stay in the 0.0 to 1.0 range (e.g. 1.0, 0.0, 0.6, 0.8, 0.1), while arc weights can be any real number (e.g. 1.4, -0.8, 0.7, -2.3, -0.1)]
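A minimal sketch of that forward pass, assuming the usual sigmoid squashing function to keep node values in the 0.0 to 1.0 range; the layer sizes and weight values here are illustrative, not Microsoft’s:

```python
import math

def sigmoid(x):
    # squashing keeps every node value in the 0.0..1.0 range
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights_ih, weights_ho):
    # forward pass: input activations -> hidden activations -> output activations
    hidden = [sigmoid(sum(w * a for w, a in zip(row, inputs)))
              for row in weights_ih]
    return [sigmoid(sum(w * a for w, a in zip(row, hidden)))
            for row in weights_ho]

# illustrative weights only; arc weights may be any real number
outputs = forward([1.0, 0.0, 0.6],
                  weights_ih=[[1.4, -0.8, 0.7], [-2.3, 0.0, -0.1]],
                  weights_ho=[[0.5, -1.0], [-0.4, 0.9]])
```

At run time the application sees only `inputs` and `outputs`, exactly as the slide describes.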
TDNN: Time Delayed Neural Network
[Figure: TDNN columns running over ink segments, items 1 through 6]
- This is still a normal back-propagation network; all the points in the previous slide still apply
- The difference is in the connections: connections are limited
- The input is segmented, and the same features are computed for each segment
- I decided I didn’t like this artwork, so I started over (next slide)
TDNN: Time Delayed Neural Network
[Figure: TDNN columns running over ink segments, items 1 through 6]
Edge Effects
For the first two and last two columns, the hidden nodes and input nodes that reach outside the range of our ink segments receive zero activations
TDNN: Weights Are Shared
[Figure: the same weight values (e.g. 0.1372, -0.006, 0.0655) repeated at every column position across the segments]
Since the weights are shared, this net is not really as big as it looks. When a net is stored (on disk or in memory), there is only one copy of each weight. On disk, we don’t store the activations, just the weights (and architecture).
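A sketch of why sharing works, assuming a convolution-style layer: one weight vector (the only copy ever stored) slides across the segment columns, and windows that reach past the ends use zero activations, matching the edge-effect handling above. All sizes and values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tdnn_layer(segments, kernel, window=3):
    # ONE shared weight vector (`kernel`) is applied at every column;
    # however many columns run, only one copy of the weights is stored
    feat_dim = len(segments[0])
    zero = [0.0] * feat_dim
    columns = []
    for t in range(len(segments)):
        feats = []
        for dt in range(-(window // 2), window // 2 + 1):
            # positions outside the ink range contribute zero activations
            feats.extend(segments[t + dt] if 0 <= t + dt < len(segments) else zero)
        columns.append(sigmoid(sum(w * f for w, f in zip(kernel, feats))))
    return columns

# illustrative: 4 segments of 2 features each, one shared 6-weight kernel
cols = tdnn_layer([[0.1, 0.9], [0.5, 0.5], [0.9, 0.1], [0.3, 0.7]],
                  kernel=[0.1372, -0.006, 0.0655, 0.1372, -0.006, 0.0655])
```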
Training
- We use back-propagation training
- We collect millions of words of ink data from thousands of writers
  - Young and old, male and female, left-handed and right-handed
  - Natural text, newspaper text, URLs, email addresses, street addresses
- We collect in over a dozen languages around the world
- Training on such large databases takes weeks
- We constantly worry about how well our data reflect our customers
  - Their writing styles
  - Their text content
- We can be no better than the quality of our training sets
  - And that goes for our test sets too
Languages
- We ship now in: English (US), English (UK), French, German, Spanish, Italian
- We have done some initial work in: Dutch, Portuguese, Swedish, Danish, Norwegian, Finnish
  - We cannot predict when we might ship these
- Using a completely different approach, we also ship now in: Japanese, Chinese (Simplified), Chinese (Traditional), Korean
Recognizer Architecture
[Figure: the recognition pipeline. Ink Segments feed the TDNN, which produces an Output Matrix of per-segment letter scores (e.g. d 92, c 86, o 77, l 76, g 68, o 65, g 57). A Beam Search walks the Lexicon against the output matrix to produce the Top 10 List: dog 68, clog 57, dug 51, doom 42, divvy 37, ooze 35, cloy 34, doxy 29, client 22, dozy 13]
Segmentation
[Figure: candidate segmentation points marked on sample ink: midpoints going up, tops, bottoms, and tops and bottoms]
TDNN Output Matrix
[Figure: a matrix of activations with one row per character (a, b, c, ...) and one column per ink segment (0 through 9)]
Language Model
- Now that we have a complete output matrix from the TDNN, what are we going to do with it?
- We get better recognition if we bias our interpretation of that output matrix with a language model
  - Better recognition means we can handle sloppier cursive
- The lexicon (system dictionary) is the main part
  - But there is also a user dictionary
  - And there are regular expressions for things like dates and currency amounts
- We want a generator
  - We ask it: “what characters could be next after this prefix?”
  - It answers with a set of characters
- We still output the top letter recognitions
  - In case you are writing a word out-of-dictionary
  - You will have to write more neatly
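The generator interface above can be sketched with a toy letter tree (trie); the word list and the dict-of-dicts representation are my own illustration, not the shipping data structure:

```python
def build_lexicon(words):
    # letter tree (trie); "$" marks a leaf node, i.e. the end of a valid word
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def next_chars(lexicon, prefix):
    # the generator question: "what characters could be next after this prefix?"
    node = lexicon
    for ch in prefix:
        if ch not in node:
            return set()          # prefix is out-of-dictionary
        node = node[ch]
    return {ch for ch in node if ch != "$"}

lex = build_lexicon(["dog", "dot", "dug", "clog"])
```

Asking `next_chars(lex, "do")` answers with the set `{"g", "t"}`, which is exactly what the beam search needs to bias its interpretation of the output matrix.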
Lexicon
[Figure: the lexicon as a letter tree. Legend: simple node; leaf node (end of valid word); region flags for U.S. only, U.K. only, Australian only, Canadian only; unigram score (log of probability). Paths through the tree spell words (e.g. w-a-l-k-i-n-g, r-u-n-n-...); branches carry region flags so that, e.g., “theater”/“theaters” are marked US and “theatre”/“theatres” UK, and leaves carry unigram scores such as 952, 3606, 4125, 4187]
Clumsy Lexicon Issue
- The lexicon includes all the words in the spellchecker
- The spellchecker includes obscenities
  - Otherwise they would get marked as misspelled
  - But people get upset if these words are offered as corrections for other misspellings
  - So the spellchecker marks them as “restricted”
- We live in an apparently stochastic world
  - We will throw up 6 theories about what you were trying to write
  - If your ink is near an obscene word, we might include that
- Dilemma:
  - We want to recognize your obscene word when you write it
    - Otherwise we are censoring, which is NOT our place
  - We DON’T want to offer these outputs when you don’t write them
- Solution (weak): we took these words out of the lexicon
  - You can still write them, because you can write out-of-dictionary
  - But you have to write very neat cursive, or nice handprint
Grammars

Regular expressions are compiled into state transition tables, run from Start to Stop:

    seconds  = digit | "12345" digit;
    MonthNum = "123456789" | "1" "012";

    seconds:   state 0: '1'-'5' -> state 1; '0','6'-'9' -> state 2
               state 1: '0'-'9' -> state 2
               state 2: (stop)

    MonthNum:  state 0: '1' -> state 1; '2'-'9' -> state 2
               state 1: '0'-'2' -> state 2
               state 2: (stop)
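The MonthNum table above can be run by a few lines of interpreter. The slide does not mark accepting states; since “1” alone is a valid month, I assume here that both state 1 and state 2 accept:

```python
# MonthNum = "123456789" | "1" "012";
# transition table: state -> [(character set, next state), ...]
MONTHNUM = {
    0: [("1", 1), ("23456789", 2)],
    1: [("012", 2)],
    2: [],
}
ACCEPTING = {1, 2}  # assumption: "1" alone is a valid month

def matches(table, accepting, text):
    state = 0
    for ch in text:
        for charset, nxt in table[state]:
            if ch in charset:
                state = nxt
                break
        else:
            return False  # no transition for this character: reject
    return state in accepting
```

So “1”, “9”, and “12” match, while “0” and “13” do not.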
Factoids and Input Scope
- IS_DEFAULT (see next slide)
- IS_PHRASELIST (user dictionary only)
- IS_DATE_FULLDATE, IS_TIME_FULLTIME
- IS_TIME_HOUR, IS_TIME_MINORSEC
- IS_DATE_MONTH, IS_DATE_DAY, IS_DATE_YEAR, IS_DATE_MONTHNAME, IS_DATE_DAYNAME
- IS_CURRENCY_AMOUNTANDSYMBOL, IS_CURRENCY_AMOUNT
- IS_TELEPHONE_FULLTELEPHONENUMBER, IS_TELEPHONE_COUNTRYCODE, IS_TELEPHONE_AREACODE, IS_TELEPHONE_LOCALNUMBER
- IS_ADDRESS_FULLPOSTALADDRESS, IS_ADDRESS_POSTALCODE, IS_ADDRESS_STREET, IS_ADDRESS_STATEORPROVINCE, IS_ADDRESS_CITY, IS_ADDRESS_COUNTRYNAME, IS_ADDRESS_COUNTRYSHORTNAME
- IS_URL, IS_EMAIL_USERNAME, IS_EMAIL_SMTPEMAILADDRESS
- IS_FILE_FULLFILEPATH, IS_FILE_FILENAME
- IS_DIGITS, IS_NUMBER
- IS_ONECHAR
- NONE (yields an out-of-dictionary-only system)

Setting the Factoid property merely enables and disables various grammars and lexica.
Default Factoid
- Used when no factoid is set
- Intended for natural text, such as the body of an email
- Includes system dictionary, user dictionary, hyphenation rule, number grammar, web address grammar
  - All wrapped by optional leading punctuation and trailing punctuation
  - Hyphenation rule allows a sequence of dictionary words with hyphens between
- Alternatively, can be a single character (any character supported by the system)

[Figure: Start -> LeadingPunc -> (SysDict | UserDict | Number | Web | Hyphenation) -> TrailingPunc -> Final, with Single Char as an alternative path from Start to Final]
Factoid Extensibility
- All the grammar-based factoids were specified in a regular expression grammar, and then “compiled” into the binary table using a simple compiler
- The compiler is available at run time
  - Software vendors can add their own regular expressions
  - The string is set as the value of the Factoid property
  - One could imagine the DMV adding automobile VINs
- This is in addition to the ability to load the user dictionary
  - One could load 500 color names for a color field in a form-based app
  - Or 8,000 drug names in a prescription app
  - Construct a WordList object, and set it to the WordList property
  - Set the Factoid property to “IS_PHRASELIST”
Recognizer Architecture
[Figure: the recognizer architecture diagram repeated: Ink Segments -> TDNN -> Output Matrix -> Beam Search over the Lexicon -> Top 10 List]
DTW: Dynamic Time Warping / Dynamic Programming / Elastic Matching

[Figure: elastic alignment of the letters of “elephant”: a letter sequence from the dictionary aligned against one from the user, and ink from prototypes aligned against ink from the user]
Brute Force Matching

         e  l  p  h  a  n  t    <- entry from user ("elphant")
    t |  1  1  1  1  1  1  0
    n |  1  1  1  1  1  0  1
    a |  1  1  1  1  0  1  1
    h |  1  1  1  0  1  1  1
    p |  1  1  0  1  1  1  1
    e |  0  1  1  1  1  1  1
    l |  1  0  1  1  1  1  1
    e |  0  1  1  1  1  1  1
    (rows, bottom to top, spell the entry from the dictionary: "elephant")

- User must provide a distance function
- 0 means match, 1 means no match
- Matrix of all possible matches
Cumulative Matching

    Match Scores:     Cumulative Scores:
    1 1 0             2 1 0
    1 0 1             1 0 1
    0 1 1             0 1 2

- Each cell adds its score to the minimum of the cumulative scores to the left, below, and below-left
- We start in the lower-left corner and work our way up to the upper-right corner
- The upper-right corner cell holds the total cost of aligning these two sequences
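The cell rule above is a few lines of code. This sketch uses a common boundary convention (the empty-prefix row and column cost 0, 1, 2, ...); the slides’ matrices use a slightly different boundary treatment, but the cell rule and the upper-right corner are the same idea:

```python
def cumulative_matrix(dictionary, user):
    # rows follow the dictionary entry, columns the user entry;
    # match scores are 0 (match) / 1 (mismatch), as in the slides
    rows, cols = len(dictionary), len(user)
    D = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        D[i][0] = i
    for j in range(cols + 1):
        D[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = 0 if dictionary[i - 1] == user[j - 1] else 1
            # each cell adds its match score to the minimum of the
            # cumulative scores to the left, below, and below-left
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D

total = cumulative_matrix("elephant", "elphant")[-1][-1]  # upper-right corner
```

For “elephant” against “elphant” the corner cost is 1: the single omitted “e”.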
Cumulative Matching

         e  l  p  h  a  n  t
    t |  6  6  5  4  3  2  1
    n |  5  5  4  3  2  1  2
    a |  4  4  3  2  1  2  3
    h |  3  3  2  1  2  3  4
    p |  2  2  1  2  3  4  5
    e |  1  1  1  2  3  4  5
    l |  1  0  1  2  3  4  5
    e |  0  1  2  3  4  5  6
Alignment

         e  l  p  h  a  n  t
    t |  5  5  5  4  3  2  1
    n |  4  4  4  3  2  1  2
    a |  3  3  3  2  1  2  3
    h |  2  2  2  1  2  3  4
    p |  1  1  1  2  3  4  5
    e |  0  1  1  2  3  4  5
    l |  1  0  1  2  3  4  5
    e |  0  1  2  3  4  5  6

- Each cell can remember which neighbor it used, and these can be used to follow a path back from the upper-right corner
- A vertical move indicates an omission in the entry from the user
- A horizontal move indicates an insertion in the entry from the user
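The backpointer walk can be sketched as follows (self-contained, rebuilding the matrix with the boundary convention of the earlier sketch rather than storing explicit pointers); vertical moves report omissions, horizontal moves report insertions:

```python
def align(dictionary, user):
    rows, cols = len(dictionary), len(user)
    D = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        D[i][0] = i
    for j in range(cols + 1):
        D[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = 0 if dictionary[i - 1] == user[j - 1] else 1
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # walk back from the upper-right corner, preferring diagonal moves
    edits, i, j = [], rows, cols
    while i > 0 or j > 0:
        cost = 1 if i == 0 or j == 0 or dictionary[i - 1] != user[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost:
            if cost:
                edits.append(("substituted", dictionary[i - 1], user[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            edits.append(("omitted", dictionary[i - 1]))    # vertical move
            i -= 1
        else:
            edits.append(("inserted", user[j - 1]))         # horizontal move
            j -= 1
    return edits[::-1]
```

Aligning dictionary “elephant” against user “elphant” reports exactly one edit: the user omitted an “e”.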
Ink Prototypes
[Figure: ink from prototypes and ink from the user, each letter reduced to short feature vectors (pairs of values such as 1.0/2.8, 0.2/1.8, 0.1/0.9, ...) so that segments can be compared with a distance function]
Searching the Prototypes

         e  l  p  h  a  n  t
    t |  4  4  4  4  4  3  2
    n |  3  3  3  3  3  2  3
    a |  2  2  2  3  2  3  4
    g |  1  1  2  2  3  4  5
    e |  0  1  1  2  3  4  5
    l |  1  0  1  2  3  4  5
    e |  0  1  2  3  4  5  6
    (here the dictionary word "elegant" is being scored)

- We can compute the score for every word in the dictionary, to find the closest set of words
- This is slow, due to the size of the dictionary
DTW as a Stack

[Figure: the same cumulative matrix, with the row letters now drawn from a path through the lexicon tree (e-l-e-p-h-a-n-t, with sibling branches such as e-l-e-g-a-n-t)]

- If we compute row-by-row (from the bottom), we can treat the matrix as a stack
- We can pop off a row when we back up a letter
- This allows us to walk the dictionary tree
Using Columns to Avoid Memory
- If we compute the scores column-by-column, we don’t need to store the entire matrix
- This isn’t a stack, so we don’t have to pop back to previous columns
- We don’t even need double buffering; we just need 2 local variables
- We don’t need to store the simple distance, just the cumulative distance

[Figure: the same cumulative scores held three ways: as a full matrix, as a double buffer of two columns, and as a single buffer plus two locals]
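The single-buffer trick can be sketched like this: one cumulative-score buffer the length of the word, plus two locals, with the same boundary convention as the earlier matrix sketches:

```python
def distance(word, user):
    # cumulative scores kept in ONE buffer plus two locals --
    # no full matrix, not even a double buffer
    buf = list(range(len(word) + 1))
    for j, u in enumerate(user, 1):            # one pass per column
        diag, buf[0] = buf[0], j
        for i, w in enumerate(word, 1):
            diag, buf[i] = buf[i], min(buf[i] + 1,        # omission
                                       buf[i - 1] + 1,    # insertion
                                       diag + (u != w))   # match/substitute
    return buf[-1]                             # the upper-right corner
```

Note that only the cumulative distances ever exist; the simple 0/1 match scores are consumed the moment they are computed.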
Beam Search

[Figure: the DTW columns advancing while the rows follow branches of the lexicon tree (e-l-e..., with sibling letters such as g, p, h branching off), each live branch carrying its own cumulative scores]

We can do column-by-column and row-by-row at the same time if we treat the rows as a tree, with each new row pointing backwards to its parent.
Why Is It Called a Beam Search?
- As we compute a column, we can remember the best score so far
- We add a constant to that score; any scores worse than that are culled
- Back in the original cumulative distance matrix, this keeps us from computing cells too far away from the best path (the beam)
- Since we are following a tree, culling a cell may allow us to avoid an entire subtree
  - This is the real savings
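A toy end-to-end sketch of the culling idea: hypotheses walk a lexicon trie against per-segment letter scores (standing in for the TDNN output matrix), and after each segment anything worse than best-plus-beam is dropped, pruning whole subtrees. The word list, probabilities, floor value, and beam constant are all illustrative:

```python
import math

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}  # leaf marker: end of valid word
    return root

def beam_search(output_matrix, lexicon, beam=3.0):
    # each hypothesis: (cumulative -log score, prefix walked so far, trie node)
    hyps = [(0.0, "", lexicon)]
    for segment in output_matrix:
        grown = []
        for score, prefix, node in hyps:
            for ch, child in node.items():
                if ch == "$":
                    continue
                p = segment.get(ch, 1e-6)   # tiny floor for unscored letters
                grown.append((score - math.log(p), prefix + ch, child))
        best = min(s for s, _, _ in grown)
        # cull anything outside the beam around the best score so far;
        # a culled prefix never grows again, so its whole subtree is skipped
        hyps = [h for h in grown if h[0] <= best + beam]
    # keep only hypotheses that ended on a valid word
    return sorted((s, w) for s, w, node in hyps if "$" in node)

matrix = [{"d": 0.92, "c": 0.86},
          {"o": 0.77, "l": 0.76, "a": 0.10},
          {"g": 0.68, "o": 0.65, "t": 0.08}]
lex = build_trie(["dog", "dot", "cog", "cat", "dug"])
results = beam_search(matrix, lex)
```

Here “dug” dies after the second segment and “cat” after the third, so their subtrees are never expanded further; “dog”, “cog”, and “dot” survive.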
Out of Dictionary
- This is the wrong name:
  - It should really be called Out of Language Model
  - Or simply Unsupported, since letter sequences in the language model are called “Supported”
- We simply want to walk across the output matrix and find the best characters
- This is needed for part numbers, and for words and abbreviations we don’t yet have in the user dictionary
- We bias the output (slightly) toward the language statistics by using bigram probabilities
  - For instance, the probability of the sequence “at”: P(at|ink) = P(a|ink) P(t|ink) P(at), where P(a|ink) and P(t|ink) come from the output matrix and P(at) comes from the bigram table
- We impose a penalty for OOD words, relative to supported words
  - Otherwise the entire language model accomplishes nothing
- The COERCE flag, if on, disables the OOD system
  - This forces us to output the nearest language-model character sequence, or nothing at all
- There is also a Factoid NONE, which yields an out-of-dictionary-only recognizer
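The bigram bias is a simple product, usually computed in log space. All probability values below are illustrative placeholders, not Microsoft’s tables:

```python
import math

# P(letter | ink) for two ink segments, read off the output matrix
p_ink = [{"a": 0.73, "o": 0.65},
         {"t": 0.12, "l": 0.76}]
# P(letter pair) from the bigram table
p_bigram = {"at": 0.015, "al": 0.011, "ot": 0.008, "ol": 0.009}

def ood_log_score(pair):
    # P(at|ink) = P(a|ink) * P(t|ink) * P(at), computed in log space
    return (math.log(p_ink[0][pair[0]])
            + math.log(p_ink[1][pair[1]])
            + math.log(p_bigram[pair]))

best = max(p_bigram, key=ood_log_score)
```

With these toy numbers the ink evidence for “l” outweighs the slightly higher bigram probability of “at”, so the OOD walk prefers “al”.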
Error Correction: SetTextContext()

Goal: better context usage for error correction scenarios

1. User writes “Dictionary”
2. Recognizer misrecognizes it as “Dictum”
3. User selects “um” and rewrites “ionary”
4. TIP notes the partial word selection, and puts the recognizer into correction mode with left context “Dict” and right context “”
5. Beam search artificially recognizes the left context
6. Beam search runs the ink as normal
7. Beam search artificially recognizes the right context
8. This produces “ionary” in the top 10 list; the TIP must insert it to the right of “Dict”

[Figure: the output matrix during correction, with the left-context letters d, i, c, t pinned to scores of 100 before the user’s new ink is scored]
Isolated Character Recognizer
- The input character is fed in via a variety of features
- A single neural network takes all inputs
- We have also experimented with an alternate version which has a separate neural network per stroke count

[Figure: input “a” -> features -> neural network -> output activations in the 0.0 to 1.0 range]
Calligrapher
- The Russian recognition company Paragraph sold itself to SGI (Silicon Graphics, Inc.), who then sold it to Vadem, who sold it to Microsoft
- In the purchase we obtained:
  - Calligrapher: the cursive recognizer that shipped on the first Apple Newton
  - Transcriber: a handwriting app for handheld computers
- We combined our system with Calligrapher
  - We use a voting system to combine each recognizer’s top 10 list
  - They are very different, and make different mistakes
  - We get the best of both worlds
  - If either recognizer outputs a single-character “word”, we forget these lists and run the isolated character recognizer
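The deck does not detail the voting scheme, so the sketch below is only an illustration of the idea: sum each recognizer’s score per word across both top-10 lists. Because the two engines make different mistakes, even a naive additive vote tends to promote words they agree on. The lists and scores are taken from the slides’ examples:

```python
def combine_top10(list_a, list_b, n=10):
    # naive vote: add the two recognizers' scores per word
    # (illustrative only -- the shipping combiner is more sophisticated)
    votes = {}
    for ranked in (list_a, list_b):
        for word, score in ranked:
            votes[word] = votes.get(word, 0) + score
    return sorted(votes.items(), key=lambda kv: -kv[1])[:n]

tdnn_list = [("dog", 68), ("clog", 57), ("dug", 51)]
calligrapher_list = [("dog", 59), ("clog", 54), ("dug", 44)]
combined = combine_top10(tdnn_list, calligrapher_list)
```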
HMMs (Hidden Markov Models)
[Figure: a DTW-style matrix in which each row on the left is a probability histogram over letters (e.g. 0.8/0.2/0.0/0.0/0.0) rather than a single ink sample]

Start with a DTW, but replace the sequence of ink segments on the left with a sequence of probability histograms; this represents a set of ink samples.
Calligrapher

[Figure: Calligrapher’s architecture: HMM letter models scored by a beam search over the lexicon, producing its own per-segment letter scores (e.g. d 92, c 86, o 67, l 76, g 59) and its own Top 10 List: dog 59, clog 54, dug 44, doom 37, dig 31, dag 29, cloy 23, clug 18, clag 14, clay 9]
Personalization
- Ink shape personalization
  - Simple concept: just do the same training on this customer’s ink
    - Start with components already trained on a massive database of ink samples
    - Train further on the specific user’s ink samples
    - Trains the TDNN, combiner nets, and isolated character network
  - Explicit training
    - User must go to a wizard and copy a short script
    - Does have labels from the customer
    - Limited in quantity, because of tediousness
  - Implicit training
    - Data is collected in the background during normal use
    - Doesn’t have labels from the customer
    - We must assume correctness of our recognition result, using our confidence measure
    - We get more data
  - Much of the work is in the GUI, the database support, management of different users’ trained networks, etc.
- Lexicon personalization: harvesting
  - Simple concept: just add the user’s new words to the lexicon
  - Examples: RTM, dev, SDET, dogfooding, KKOMO, featurization
  - Happens when correcting words in the TIP
  - Also scan Word docs and outgoing email (avoiding spam)