Notes on conversion to TSV
leod committed Mar 23, 2019
1 parent 48ef93b commit 76d3c9f
Showing 4 changed files with 46 additions and 5 deletions.
34 changes: 34 additions & 0 deletions data-wiki/README.md
@@ -66,8 +66,42 @@ parallel --verbose -j 8 ./convert-doc.sh '<' {} '>' {.}.md \
< wiki/docs.txt \
2>&1 | tee convert.log
```
This may take a few days.

Using Pandoc here has a downside: it does not handle Wikipedia template
references and appears to simply drop them from the output, leaving the
occasional sentence with words missing in the middle. Since this happens
relatively rarely, it should not be much of a problem.

Also, the conversion crashes sometimes:
```
/home/leod/src/hncynic/data-wiki/convert-doc.sh < docs/85/Munich%E2%80%93Augsburg_railway.txt > docs/85/Munich%E2%80%93Augsburg_railway.md
Traceback (most recent call last):
File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 114, in <module>
main()
File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 98, in main
return run_filter(action, prepare=prepare, doc=doc)
File "/home/leod/.local/lib/python3.6/site-packages/panflute/io.py", line 260, in run_filter
return run_filters([action], *args, **kwargs)
...
File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1061, in __init__
self.header = header
File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1097, in header
raise IndexError(msg)
IndexError: table header has an incorrect number of cols: 6 rows but expected 8
pandoc: Error running filter /home/leod/src/hncynic/data-wiki/filter_markdown.py
Filter returned error status 1
```
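One possible mitigation, sketched here as a hypothetical wrapper that is not part of this repo: run the filter through a fallback layer that passes the pandoc JSON through unchanged whenever the filter exits with an error, so a malformed table costs one unfiltered document instead of aborting the conversion.

```python
import subprocess
import sys

def filter_with_fallback(filter_cmd, doc_json):
    """Run a pandoc JSON filter as a subprocess; on a non-zero exit
    (e.g. panflute choking on a table with a malformed header), return
    the input JSON unchanged instead of failing the whole conversion."""
    try:
        result = subprocess.run(
            filter_cmd, input=doc_json,
            capture_output=True, text=True, check=True)
        return result.stdout
    except subprocess.CalledProcessError:
        # Identity fallback: keep the document as-is.
        return doc_json

# Usage (hypothetical): pandoc ... --filter would be replaced by piping the
# JSON through filter_with_fallback(['./filter_markdown.py'], json_str).
```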
## Convert to TSV
We use each section of an article as an individual training example.

```
find docs -name '*.md' > docs.md.txt
parallel --verbose -j 8 \
~/src/hncynic/data-wiki/clean_text.sh \
'<' {} \
'|' ~/src/hncynic/data-wiki/md_to_tsv.py {} \
'>' {.}.tsv \
< docs.md.txt \
> convert.tsv.log 2>&1
```
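For reference, the shape of the resulting TSV lines can be sketched as follows. This is an illustration, not the actual `md_to_tsv.py` logic: the `' : '` heading-path separator is an assumption, while the `' <NL> <NL> '` paragraph join mirrors `write_tsv` below.

```python
def section_to_tsv_line(header_path, paragraphs):
    """One training example per line: heading path, TAB, section text,
    with paragraph breaks encoded as <NL> tokens so the text stays on a
    single line."""
    title = ' : '.join(header_path)  # assumed separator, for illustration
    text = ' <NL> <NL> '.join(p.strip() for p in paragraphs)
    return title + '\t' + text + '\n'
```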
9 changes: 4 additions & 5 deletions data-wiki/md_to_tsv.py
@@ -41,7 +41,7 @@ def write_tsv(self, out, header_path=None):
 out.write('\t')
 for i in range(len(self.text)-1):
   out.write(self.text[i].strip())
-  out.write(' <P> ')
+  out.write(' <NL> <NL> ')
 out.write(self.text[-1])
 out.write('\n')

@@ -98,7 +98,6 @@ def parse_md_tree(stream, title):

 title = urllib.parse.unquote(basename.replace('_', ' '))
 
-with open(filename) as f:
-  tree = parse_md_tree(f, title)
-  #tree.write(sys.stdout)
-  tree.write_tsv(sys.stdout)
+tree = parse_md_tree(sys.stdin, title)
+#tree.write(sys.stdout)
+tree.write_tsv(sys.stdout)
1 change: 1 addition & 0 deletions data/moses_tokenizer_protected.txt
@@ -1,4 +1,5 @@
+<NL>
 <P>
 http://\S+
 https://\S+
 ftp://\S+
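The protected-patterns mechanism roughly works as sketched below: pattern matches are swapped for placeholders before tokenization and restored afterwards, so `<NL>`, `<P>`, and URLs survive as single tokens. This is a toy illustration of the idea, not the Moses tokenizer's actual implementation.

```python
import re

# Patterns taken from moses_tokenizer_protected.txt above.
PATTERNS = [r'<NL>', r'<P>', r'https?://\S+', r'ftp://\S+']

def tokenize_protected(text, patterns=PATTERNS):
    """Stash protected matches behind placeholders, run a toy tokenizer,
    then restore the original strings."""
    protected = []
    def stash(m):
        protected.append(m.group(0))
        return 'THISISPROTECTED%03d' % (len(protected) - 1)
    for p in patterns:
        text = re.sub(p, stash, text)
    tokens = text.replace('.', ' .').split()  # toy "tokenizer"
    return [re.sub(r'THISISPROTECTED(\d{3})',
                   lambda m: protected[int(m.group(1))], t)
            for t in tokens]
```

Without protection, the toy tokenizer would split the URL at its dots and mangle the `<NL>` markers inserted by `md_to_tsv.py`.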
7 changes: 7 additions & 0 deletions train/README.md
@@ -36,9 +36,16 @@ the power plug.
```
onmt-main train --config opennmt_config.yml --model_type Transformer --num_gpus 1
```
I trained the model for about 40K steps with `opennmt_config.yml`. The loss wasn't
improving much after that, so I increased the effective batch size (larger batches are
known to help when training Transformer models) by accumulating gradients over several
steps, as in `opennmt_config_larger_batch.yml`. As can be seen in the plot below, this
seems to have helped.

![training loss](train.svg)
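
Gradient accumulation simply sums gradients over k micro-batches before applying a single update; with a per-example mean loss, averaging the k gradients makes the step equivalent to one k-times-larger batch. A framework-free sketch (not the OpenNMT implementation):

```python
def accumulated_step(params, micro_batches, grad_fn, lr):
    """One SGD step over k micro-batches: average their gradients, then
    apply a single update, mimicking a k-times-larger batch."""
    accum = [0.0] * len(params)
    for batch in micro_batches:
        for i, g in enumerate(grad_fn(params, batch)):
            accum[i] += g
    k = len(micro_batches)
    return [p - lr * a / k for p, a in zip(params, accum)]
```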

Unfortunately, I don't have a plot for the dev loss, since I forgot to turn on dev evaluation.
What a bummer.

### Evaluate

### Export
