Merge pull request #1 from glicerico/file_MI
Add file-based weights mode to MST parser
glicerico authored Mar 1, 2019
2 parents b63562f + 2ad1da7 commit 510a804
Showing 6 changed files with 326 additions and 34 deletions.
71 changes: 63 additions & 8 deletions run-poc/README.md
@@ -72,6 +72,13 @@ Thus, operating the system requires three basic steps:

Each of these is described in greater detail in separate sections below.

As an alternative to word-pair counting and MI calculation, the system
can also generate MST parses for sentences from instance-pair weights
provided in a file (instead of MI).
This allows the use of other algorithms for estimating the
relationship between words, e.g. neural networks.
See the MST section below for more details.


Setting up the AtomSpace
------------------------
@@ -420,7 +427,19 @@ will automatically pick up where they left off.
open `process-word-pairs.sh` and remove the -N option from the nc commands
(some old versions of netcat still support this option).

4) Wait some time, possibly a few days. When finished, stop the cogserver.
4) Wait some time, possibly a few days. When finished, you can export the
word-pair MI values to a file if you want. Start by loading the file in
the cogserver:
```
(load "export-mi.scm")
```
and running the export command (change "any" to the mode used when
pair counting):
```
(export-mi "any")
```
This will generate a file called `mi-pairs.txt` in your working directory.
Then stop the cogserver.

These scripts use commands from the scripts in the `scm` directory.
The code for computing word-pair MI is in `batch-word-pair.scm`.
@@ -483,13 +502,20 @@ Minimum Spanning Tree Parsing
-----------------------------

The MST parser discovers the minimum spanning tree that connects the
words together in a sentence. The link-cost used is (minus) the mutual
information between word-pairs (so we are maximizing MI). Thus, MST
parsing cannot be started before the above steps to compute word-pair
words together in a sentence, using the provided link weights.
The link-cost used can be (minus) the mutual
information between word-pairs (so we are maximizing MI). In this case,
MST parsing cannot be started before the above steps to compute word-pair
MI have been accomplished.
Alternatively, one can obtain the weights between word-instance-pairs
from a different source (e.g. a neural-network-generated language model)
and feed them to the MST algorithm.
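
The idea of maximizing the total link weight can be illustrated with a minimal Python sketch. It is not the code in `scm/mst-parser.scm`: the function name `max_spanning_tree` is hypothetical, weights are assumed to be undirected and keyed by word-index pairs, and any extra constraints the actual parser may apply are ignored. It grows the tree from node 0 (the left wall) using Prim's algorithm, always adding the highest-weight link that reaches a new word.

```python
from typing import Dict, List, Tuple


def max_spanning_tree(
    num_nodes: int, weights: Dict[Tuple[int, int], float]
) -> List[Tuple[int, int, float]]:
    """Return the links of a maximum spanning tree (Prim's algorithm).

    `weights` maps (i, j) with i < j to a link weight; missing pairs are
    treated as unlinkable. Node 0 is assumed to be the left wall.
    """
    def weight(i: int, j: int) -> float:
        return weights.get((min(i, j), max(i, j)), float("-inf"))

    in_tree = {0}
    links: List[Tuple[int, int, float]] = []
    while len(in_tree) < num_nodes:
        best = None
        # Scan every link that goes from the tree to a node outside it.
        for i in sorted(in_tree):
            for j in range(num_nodes):
                if j in in_tree:
                    continue
                w = weight(i, j)
                if best is None or w > best[2]:
                    best = (i, j, w)
        if best is None or best[2] == float("-inf"):
            break  # remaining nodes cannot be reached
        in_tree.add(best[1])
        links.append(best)
    return links
```

Maximizing the weight is equivalent to minimizing its negative, which is why the text above speaks of a minimum spanning tree with (minus) MI as the link cost.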

The minimum spanning tree code is in `scm/mst-parser.scm`. The current
version works well. To run it follow the next steps:
The minimum spanning tree code is called from `scm/mst-parser.scm` and
`run-poc/redefine-mst-parser.scm`. The current
version works well. To run it using MI calculated as explained in the previous
sections, follow the steps below (see the instructions after step 7 for using
other types of weights):

1) Set up the working directory by running the following commands from the
root of your opencog clone, if you haven't already.
@@ -570,9 +596,38 @@ version works well. To run it follow the next steps:
scripts; they will pick up where they left off. When finished,
remember to stop the cogserver.

Once this is done, you can move to the next step, which is explained in

To use link weights calculated in some other way (instead of MI), you
need to provide them in files with the following format:

```
First Sentence (prefixed with ###LEFT-WALL###)
0 ###LEFT-WALL### 1 First-word-in-sentence link-weight
0 ###LEFT-WALL### 2 Second-word-in-sentence link-weight
...
1 First-word-in-sentence 2 Second-word-in-sentence link-weight
1 First-word-in-sentence 3 Third-word-in-sentence link-weight
...
Second Sentence (prefixed with ###LEFT-WALL###)
0 ###LEFT-WALL### 1 First-word-in-sentence link-weight
0 ###LEFT-WALL### 2 Second-word-in-sentence link-weight
...
```
where each block of a sentence and all its word-instance-pair lines are
separated from the next sentence by an empty line.
The 7 steps above still apply, with the following modifications:
In step 2), place the special-format files in `gamma-pages`, instead
of the plain text files.
In step 4), you need to set `cnt_mode="file"`, to indicate you're using
file-based weights, and make sure `split_sents="#f"`.
All other parameters still apply.
Step 6) is not needed.
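
A file in this format could be produced with a short Python sketch like the one below. The function names and the `weight_fn` callback are hypothetical and stand in for whatever model supplies the weights. The line layout is an assumption taken from the test file `run-poc/alpha-pages/test-mi-pairs.ipw`: the wall is paired with every word, each word is paired with every other word in both directions, and every block is followed by an empty line.

```python
from typing import Callable, Iterable, List

LEFT_WALL = "###LEFT-WALL###"

# weight_fn(i, left_word, j, right_word) -> weight; a placeholder for
# any external estimator (e.g. a language model).
WeightFn = Callable[[int, str, int, str], float]


def format_sentence_block(words: List[str], weight_fn: WeightFn) -> str:
    """Build one sentence block: the sentence line plus its pair lines."""
    tokens = [LEFT_WALL] + list(words)
    lines = [" ".join(tokens)]
    for i, left in enumerate(tokens):
        for j, right in enumerate(tokens):
            # No pair points back at the wall, and no word pairs with itself.
            if j == 0 or i == j:
                continue
            lines.append(f"{i} {left} {j} {right} {weight_fn(i, left, j, right)}")
    return "\n".join(lines)


def format_weights_file(sentences: Iterable[List[str]], weight_fn: WeightFn) -> str:
    """Join sentence blocks, ending each block with an empty line."""
    return "".join(format_sentence_block(s, weight_fn) + "\n\n" for s in sentences)
```

Writing the returned string to a file placed in `gamma-pages` would then correspond to the modified step 2) above.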

Once this is done (either using MI or file-based weights), you can move
to the next step, which is explained in
the next section. If you enabled parse export (`exp_parses="#t"` in the parameters file), you can check out the
sentence parses in `mst-parses.txt`.
sentence parses generated in the folder `mst-parses/`.


Exploring Connector-Sets
76 changes: 76 additions & 0 deletions run-poc/alpha-pages/test-mi-pairs.ipw
@@ -0,0 +1,76 @@
###LEFT-WALL### A mom is a human .
0 ###LEFT-WALL### 1 A 0.1
0 ###LEFT-WALL### 2 mom 0.1
0 ###LEFT-WALL### 3 is 0.1
0 ###LEFT-WALL### 4 a 0.1
0 ###LEFT-WALL### 5 human 0.1
0 ###LEFT-WALL### 6 . 0.1
1 A 2 mom 0.1
1 A 3 is 0.2
1 A 4 a 0.3
1 A 5 human 0.4
1 A 6 . 0.5
2 mom 1 A 0.11
2 mom 3 is 0.21
2 mom 4 a 0.31
2 mom 5 human 0.41
2 mom 6 . 0.51
3 is 1 A 0.12
3 is 2 mom 0.22
3 is 4 a 0.32
3 is 5 human 0.42
3 is 6 . 0.52
4 a 1 A 0.13
4 a 2 mom 0.23
4 a 3 is 0.33
4 a 5 human 0.43
4 a 6 . 0.53
5 human 1 A 0.14
5 human 2 mom 0.24
5 human 3 is 0.34
5 human 4 a 0.44
5 human 6 . 0.54
6 . 1 A 0.15
6 . 2 mom 0.25
6 . 3 is 0.35
6 . 4 a 0.45
6 . 5 human 0.55

###LEFT-WALL### A dad is a human .
0 ###LEFT-WALL### 1 A 0.1
0 ###LEFT-WALL### 2 dad 0.1
0 ###LEFT-WALL### 3 is 0.1
0 ###LEFT-WALL### 4 a 0.1
0 ###LEFT-WALL### 5 human 0.1
0 ###LEFT-WALL### 6 . 0.1
1 A 2 dad 0.1
1 A 3 is 0.2
1 A 4 a 0.3
1 A 5 human 0.4
1 A 6 . 0.5
2 dad 1 A 0.11
2 dad 3 is 0.21
2 dad 4 a 0.31
2 dad 5 human 0.41
2 dad 6 . 0.51
3 is 1 A 0.12
3 is 2 dad 0.22
3 is 4 a 0.32
3 is 5 human 0.42
3 is 6 . 0.52
4 a 1 A 0.13
4 a 2 dad 0.23
4 a 3 is 0.33
4 a 5 human 0.43
4 a 6 . 0.53
5 human 1 A 0.14
5 human 2 dad 0.24
5 human 3 is 0.34
5 human 4 a 0.44
5 human 6 . 0.54
6 . 1 A 0.15
6 . 2 dad 0.25
6 . 3 is 0.35
6 . 4 a 0.45
6 . 5 human 0.55
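
A weights file such as the test file above can be sanity-checked before it is fed to the pipeline. The following Python sketch is not part of the repository; `parse_weights_text` is a hypothetical helper that assumes blocks are separated by a single empty line and that every pair line has exactly five whitespace-separated fields.

```python
from typing import Dict, List, Tuple

Weights = Dict[Tuple[int, int], float]


def parse_weights_text(text: str) -> List[Tuple[List[str], Weights]]:
    """Parse file-based weights into (sentence tokens, pair weights) tuples."""
    parsed = []
    for block in text.strip().split("\n\n"):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        sentence = lines[0].split()  # includes ###LEFT-WALL### at index 0
        weights: Weights = {}
        for ln in lines[1:]:
            i_str, left, j_str, right, w_str = ln.split()
            i, j = int(i_str), int(j_str)
            # Check that the indices really point at the named words.
            if sentence[i] != left or sentence[j] != right:
                raise ValueError(f"index/word mismatch in line: {ln!r}")
            weights[(i, j)] = float(w_str)
        parsed.append((sentence, weights))
    return parsed
```

Keeping the weights keyed by ordered index pairs preserves the fact that the test file gives different values for the two directions of the same pair (for example `1 A 2 mom 0.1` and `2 mom 1 A 0.11`).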

6 changes: 3 additions & 3 deletions run-poc/config/params.txt
@@ -1,9 +1,9 @@
# PARAMETERS for NLP pipeline
# variable_name=value #possibilities

export cnt_mode="clique-dist" # clique, clique-dist or any
export cnt_mode="file" # clique, clique-dist, any, or file
export cnt_reach=6 # num-parses for any, or win-size for cliques
export mst_dist="#t" # #t or #f; use distance weight during mst
export mst_dist="#f" # #t or #f; use distance weight during mst
export exp_parses="#t" # #t or #f; exports parses in folder mst-parses
export split_sents="#t" # #t or #f; calls sentence splitter before parser
export split_sents="#f" # #t or #f; calls sentence splitter before parser
#TODO export store_fmi="#t" # #t or #f
10 changes: 8 additions & 2 deletions run-poc/process-one.sh
@@ -49,8 +49,14 @@ case $1 in
;;
esac

# define opencog variables needed for cnt_mode=file
if [[ "$cnt_mode" == "file" ]]; then
echo "(define new-sent-flag #t)"| netcat $coghost $cogport;
echo "(define current-sentence \"\")"| netcat $coghost $cogport;
fi

# Punt if the cogserver has crashed: use netcat to ping it.
haveping=`echo foo | nc -N $coghost $cogport`
haveping=`echo foo | nc $coghost $cogport`
if [[ $? -ne 0 ]] ; then
exit 1
fi
@@ -72,7 +78,7 @@ fi
cat "$splitdir/$rest" | ./submit-one.pl $coghost $cogport $observe $params

# Punt if the cogserver has crashed (second test, before doing the mv and rm below)
haveping=`echo foo | nc -N $coghost $cogport`
haveping=`echo foo | nc $coghost $cogport`
if [[ $? -ne 0 ]] ; then
exit 1
fi
6 changes: 3 additions & 3 deletions run-poc/process-word-pairs.sh
@@ -36,11 +36,11 @@ case $1 in
esac

# Punt if the cogserver has crashed: use netcat to ping it.
haveping=`echo foo | nc -N localhost $PORT`
haveping=`echo foo | nc localhost $PORT`
if [[ $? -ne 0 ]] ; then
exit 1
fi

# Submit instruction to the cogserver
echo -e "(load \"$module\")" | nc -N localhost $PORT
echo -e "($func \"$cnt_mode\")" | nc -N localhost $PORT
echo -e "(load \"$module\")" | nc localhost $PORT
echo -e "($func \"$cnt_mode\")" | nc localhost $PORT