=== See Entry for April 1, 2006. ===
=== I found the problem with the computation. ===
Below is a table that shows the computed mutual information for noun-verb pairs. The table displays 5 different ways of numerically sorting the noun-verb pairs.
The first column gives the noun that is used for the pairs. The number listed with the noun is the number of noun-verb pairs that were detected by the parser and clustering algorithm. (See entry for Feb 11, 2006 – Streaming Clusters).
The rest of the columns show the verbs having high mutual information with the noun. These words are sorted, so the top verb in each list has the highest mutual information value with the noun. The numbers in columns 2 to 6 are the instance counts for the given noun-verb pairs.
The definition of mutual information was taken from Hindle (1990), Lin (1998) and Gamallo et al. (2005) (see the reference page). Hindle and Lin use the same definition. Gamallo et al. use a slightly different definition.
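As a rough illustration of the Hindle/Lin style of definition, the sketch below estimates pointwise mutual information from raw counts as log2 of (pair count × total) / (noun count × verb count). The function name `pointwise_mi`, the verb totals and the grand total are hypothetical values chosen for illustration; only the pair counts 526 and 2 and the figure 3090 appear in this entry, and treating 3090 as the noun's total is itself an assumption. Lin's formulation is conditioned on the grammatical relation, which this sketch collapses into a single noun-verb relation.

```python
import math

def pointwise_mi(pair_count, noun_count, verb_count, total_pairs):
    """Estimate log2( P(n, v) / (P(n) * P(v)) ) from raw counts."""
    if min(pair_count, noun_count, verb_count, total_pairs) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((pair_count * total_pairs) / (noun_count * verb_count))

# 526 and 2 are pair counts from the table; the verb totals (4000, 2)
# and the grand total (1,000,000) are made up for this example.
mi_opened = pointwise_mi(pair_count=526, noun_count=3090,
                         verb_count=4000, total_pairs=1_000_000)
mi_creaminess = pointwise_mi(pair_count=2, noun_count=3090,
                             verb_count=2, total_pairs=1_000_000)

print(f"door/opened:     {mi_opened:.2f}")
print(f"door/creaminess: {mi_creaminess:.2f}")
```

With these made-up totals, a verb that occurs only with 'door' scores higher than 'opened' even though it was seen just twice, which resembles the ordering in column 2.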
Columns 4 and 5 show a variation on the Hindle definition. In column 4, after computing the value according to Lin, the result is then multiplied by the number of instances of the given word pair (wrw). The term ‘wrw’ comes from Lin.
For example, ‘door opened’ was found as a pair 526 times, so the result is
MI_Modified = MI * 526.
In column 5, the Hindle definition is multiplied by the log of the number of instances,
MI_Modified = MI * log(526).
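The two modified scores can be sketched as follows. The helper names (`mi_times_count`, `mi_times_log_count`, `rank`) and the MI values in the sample list are hypothetical; only the instance counts come from the table. The base of the logarithm is not stated in this entry, so the natural log is assumed here.

```python
import math

def mi_times_count(mi, count):
    """Column 4 style weighting: MI multiplied by the instance count."""
    return mi * count

def mi_times_log_count(mi, count):
    """Column 5 style weighting: MI multiplied by log(instance count).
    Natural log is an assumption; the entry does not give the base."""
    return mi * math.log(count)

def rank(pairs, weight):
    """Return verbs sorted by weighted score, highest first.
    `pairs` is a list of (verb, mi, count) tuples."""
    ordered = sorted(pairs, key=lambda p: weight(p[1], p[2]), reverse=True)
    return [verb for verb, _mi, _count in ordered]

# Instance counts are from the 'door' row; the MI values are invented.
pairs = [
    ("creaminess", 8.3, 2),
    ("azeglio", 8.3, 1),
    ("slammed", 7.0, 26),
    ("opened", 5.4, 526),
]

by_count = rank(pairs, mi_times_count)
by_log = rank(pairs, mi_times_log_count)
print(by_count)
print(by_log)
```

One consequence of the log weighting is that any pair seen exactly once gets a score of zero, since log(1) = 0, so single-instance pairs drop to the bottom of the list regardless of their MI value.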
The final column 6 shows the instance count for the noun-verb pairs.
| Noun (pairs found) | Hindle or Lin | Gamallo | Lin * wrw | Lin * log(wrw) | Instance count |
| Door (3090) | 2 creaminess, 1 reflectors, 1 musquitoes, 1 azeglio, 1 atlantis’, 26 slammed, 5 flinched, 14 banged, 2 hughes, 2 thornie, 2 unbarred, 1 munich, 2 burby | musquitoes, azeglio, atlantis, slammed, flinched, banged, hughes, thornie, unbarred, munich, burby, opened | 526 opened, 207 open, 127 closed, 511 was, 80 locked, 87 shut, 54 leading, 39 swung, 26 slammed, 36 bell, 26 flung, 42 stood, 14 banged, 22 opening | 526 opened, 80 locked, 207 open, 127 closed, 26 slammed, 87 shut, 54 leading, 39 swung, 14 banged, 36 bell, 26 flung, 22 opening, 11 neighbor, 11 unlocked, 10 barred, 8 creaked | 526 opened, 511 was, 207 open, 127 closed, 87 shut, 80 locked, 77 had, 72 is, 54 leading, 42 stood, 39 swung, 36 bell, 26 slammed, 26 flung, 25 be, 22 opening, 21 being |
| Death (1736) | 2 deowe, 2 distilling, 2 furens, 1 supervenes, 1 titmouse, 1 fouras, 6 tristram, 1 copperplate, 2 usurp, 1 resplendently, 1 suffe, 1 bonnemains | deowe, distilling, furens, supervenes, titmouse, fouras, tristram, copperplate | 96 bed, 88 rate, 27 blow, 127 is, 159 was, 15 warrant, 15 wound, 10 rates, 13 struggle, 37 like, 96 had, 6 tristram, 9 sentence, 13 comes, 62 be, 6 beds, 8 occurred | 96 bed, 88 rate, 27 blow, 15 warrant, 10 rates, 15 wound, 6 tristram, 13 struggle, 9 sentence, 6 beds, 4 maximus, 6 trap, 4 overtake, 8 occurred, 13 comes, 6 brings, 3 overweight | 159 was, 127 is, 96 bed, 96 had, 88 rate, 62 be, 37 like, 35 will, 34 have, 27 blow, 27 are, 24 were, 22 has, 17 come, 15 warrant, 15 wound, 15 came |
| Child (2682) | 1 topknots, 1 humouring, 2 coughs, 1 develish, 1 stiddy, 2 sneezed, 1 cons, 1 antagonized, 5 stope, 1 playhouses, 1 ashiel, 1 riposte, 1 praisest | humouring, coughs, develish, stiddy, sneezed, cons, antagonized, stope, playhouses | 84 born, 269 is, 233 was, 98 said, 69 has, 103 be, 44 like, 11 bearing, 10 birth, 10 skillful, 7 learns, 5 stope, 12 sitting, 7 asks, 52 will | 84 born, 5 stope, 7 learns, 10 birth, 11 bearing, 10 skillful, 7 asks, 6 dies, 269 is, 12 sitting, 3 drogo, 6 feels, 9 learn, 69 has, 7 study | 269 is, 233 was, 103 be, 98 said, 93 had, 84 born, 69 has, 52 will, 44 like, 40 have, 27 can, 15 up, 15 do, 12 sitting, 12 looked |
There is something amiss with columns 2 and 3. Some of the verbs selected by the Hindle and Gamallo definitions are intuitively associated with the noun, but many do not seem to be particularly related to it. Compared with columns 4 and 5, columns 2 and 3 are well off target.
I suspect that this is a result of earlier steps in the process. The parsing algorithm that I use is substantially different from that described by Hindle, Lin or Gamallo. Similarly, the cluster gathering algorithm is different.
Another potential reason these results differ from the referenced works is the interpretation of the word pairs. Keeping track of which instances represent a cluster starting with a given noun and ending with a given verb is a difficult task. I have been optimizing the storage of the instances to speed up the computation. I need to do another pass on the code to see if I am correctly managing everything.
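One simple check for that pass might look like the sketch below: recompute each noun's total from the stored pair counts and compare it with whatever total the optimized storage reports. This is not the actual storage code; the `(noun, verb)` keyed `Counter`, the function names and the stored totals are all assumptions made for illustration. The pair counts 526, 26 and 84 are taken from the table.

```python
from collections import Counter

def marginals(pair_counts):
    """Sum pair counts into per-noun and per-verb totals."""
    noun_totals, verb_totals = Counter(), Counter()
    for (noun, verb), count in pair_counts.items():
        noun_totals[noun] += count
        verb_totals[verb] += count
    return noun_totals, verb_totals

def check_consistency(pair_counts, stored_noun_totals):
    """Return the nouns whose stored total disagrees with the sum of their pairs."""
    noun_totals, _ = marginals(pair_counts)
    nouns = set(noun_totals) | set(stored_noun_totals)
    return sorted(n for n in nouns
                  if noun_totals[n] != stored_noun_totals.get(n, 0))

# Pair counts from the table; the stored totals are hypothetical,
# with 'child' deliberately wrong to show a detected mismatch.
pair_counts = Counter({
    ("door", "opened"): 526,
    ("door", "slammed"): 26,
    ("child", "born"): 84,
})
stored = {"door": 552, "child": 80}

mismatched = check_consistency(pair_counts, stored)
print(mismatched)
```

A mismatch for a noun would suggest that some instances were dropped, double-counted or attributed to the wrong noun-verb pair before the mutual information was computed.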