diff --git a/.travis.yml b/.travis.yml index 7841b0b7e..00fe35951 100644 --- a/.travis.yml +++ b/.travis.yml @@ -14,10 +14,11 @@ env: - T2T_DATA_DIR=/tmp/t2t-data - T2T_TRAIN_DIR=/tmp/t2t-train script: - - pytest --ignore=tensor2tensor/utils/registry_test.py --ignore=tensor2tensor/problems_test.py --ignore=tensor2tensor/tpu/tpu_trainer_lib_test.py --ignore=tensor2tensor/data_generators/algorithmic_math_test.py + - pytest --ignore=tensor2tensor/utils/registry_test.py --ignore=tensor2tensor/problems_test.py --ignore=tensor2tensor/utils/trainer_lib_test.py --ignore=tensor2tensor/data_generators/algorithmic_math_test.py - pytest tensor2tensor/utils/registry_test.py - - pytest tensor2tensor/tpu/tpu_trainer_lib_test.py + - pytest tensor2tensor/utils/trainer_lib_test.py - t2t-datagen 2>&1 | grep translate && echo passed + - t2t-trainer --registry_help --t2t_usr_dir=./tensor2tensor/test_data/example_usr_dir 2>&1 | grep my_very_own_hparams && echo passed - python -c "from tensor2tensor.models import transformer; print(transformer.Transformer.__name__)" - t2t-trainer --registry_help - mkdir $T2T_DATA_DIR diff --git a/README.md b/README.md index de2951c53..06a15d1c8 100644 --- a/README.md +++ b/README.md @@ -296,36 +296,8 @@ specifying the `--t2t_usr_dir` flag in `t2t-trainer`. You can do so for models, hyperparameter sets, modalities, and problems. Please do submit a pull request if your component might be useful to others. -Here's an example with a new hyperparameter set: - -```python -# In ~/usr/t2t_usr/my_registrations.py - -from tensor2tensor.models import transformer -from tensor2tensor.utils import registry - -@registry.register_hparams -def transformer_my_very_own_hparams_set(): - hparams = transformer.transformer_base() - hparams.hidden_size = 1024 - ... -``` - -```python -# In ~/usr/t2t_usr/__init__.py -from . import my_registrations -``` - -``` -t2t-trainer --t2t_usr_dir=~/usr/t2t_usr --registry_help -``` - -You'll see under the registered HParams your -`transformer_my_very_own_hparams_set`, which you can directly use on the command -line with the `--hparams_set` flag. - -`t2t-datagen` also supports the `--t2t_usr_dir` flag for `Problem` -registrations. +See the [`example_usr_dir`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/test_data/example_usr_dir) +for an example user directory. ## Adding a dataset diff --git a/docs/cloud_tpu.md b/docs/cloud_tpu.md index 56bad4093..55144e69c 100644 --- a/docs/cloud_tpu.md +++ b/docs/cloud_tpu.md @@ -5,8 +5,10 @@ for ML training. Models and hparams that are known to work on TPU: * `transformer` with `transformer_tpu` -* `xception` with `xception_base` +* `transformer_encoder` with `transformer_tpu` +* `transformer_decoder` with `transformer_tpu` * `resnet50` with `resnet_base` +* `revnet104` with `revnet_base` To run on TPUs, you need to be part of the alpha program; if you're not, these commands won't work for you currently, but access will expand soon, so get @@ -34,6 +36,8 @@ gcloud compute instances create $USER-vm \ Launch the TPU instance; the Python program will connect to this to train on the TPU device. ``` +gcloud alpha compute tpus list +# Make an IP with structure 10.240.X.2 that’s unique in the list TPU_IP=10.240.0.2 gcloud alpha compute tpus create \ $USER-tpu \ @@ -41,9 +45,6 @@ gcloud alpha compute tpus create \ --version=nightly ``` -To see all TPU instances running: `gcloud alpha compute tpus list`. The -`TPU_IP` should be unique amongst the list and follow the format `10.240.i.2`. 
- SSH in with port forwarding for TensorBoard ``` gcloud compute ssh $USER-vm -- -L 6006:localhost:6006 @@ -52,7 +53,7 @@ gcloud compute ssh $USER-vm -- -L 6006:localhost:6006 Now that you're on the cloud instance, install T2T: ``` pip install tensor2tensor --user -# If your python bin dir isn't already in your path +# Add the python bin dir to your path export PATH=$HOME/.local/bin:$PATH ``` @@ -67,9 +68,9 @@ t2t-datagen --problem=translate_ende_wmt8k --data_dir=$DATA_DIR Setup some vars used below. `TPU_IP` and `DATA_DIR` should be the same as what was used above. Note that the `DATA_DIR` and `OUT_DIR` must be GCS buckets. ``` -TPU_IP= +TPU_IP=10.240.0.2 DATA_DIR=$GCS_BUCKET/t2t/data/ -OUT_DIR=$GCS_BUCKET/t2t/training/ +OUT_DIR=$GCS_BUCKET/t2t/training/transformer_ende_1 TPU_MASTER=grpc://$TPU_IP:8470 ``` diff --git a/docs/new_problem.md b/docs/new_problem.md index fd5f9d625..342d7abb1 100644 --- a/docs/new_problem.md +++ b/docs/new_problem.md @@ -264,16 +264,22 @@ t2t-datagen \ ``` Where: -* `PROBLEM` is the name of the class that was registered with `@registry.register_problem()`, but converted from `CamelCase` to `snake_case`. -* `PATH_TO_YOUR_PROBLEM_DIR` is a path to the directory of your python problem file. +* `PROBLEM` is the name of the class that was registered with + `@registry.register_problem()`, but converted from `CamelCase` to + `snake_case`. +* `PATH_TO_YOUR_PROBLEM_DIR` is a path to the directory of your python problem + file. -If you plan to contribute to the tensor2tensor repository, you can install the local cloned version in developer mode with `pip install -e .` from the tensor2tensor directory. You can also add your new problem file to [`all_problems.py`](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/all_problems.py). +If you plan to contribute to the tensor2tensor repository, you can install the +local cloned version in developer mode with `pip install -e .` from the +tensor2tensor directory. You can also add your new problem file to +[`all_problems.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/all_problems.py). # Run the problem -Now that we've gotten our problem set up, let's train a model and generate definitions. +Now that we've gotten our problem set up, let's train a model and generate +definitions. To train, specify the problem name, the model, and hparams: - ```bash PROBLEM=word2def MODEL=transformer @@ -282,6 +288,7 @@ HPARAMS=word2def_hparams The rest of the steps are as given in the [walkthrough](walkthrough.md). -What if we wanted to train a model to generate words given definitions? In T2T, we can change the problem name to be `PROBLEM=word2def_rev`. +What if we wanted to train a model to generate words given definitions? In T2T, +we can change the problem name to be `PROBLEM=word2def_rev`. All done. Let us know what definitions your model generated. diff --git a/docs/overview.md b/docs/overview.md index fcc0aba5a..9ea87bc50 100644 --- a/docs/overview.md +++ b/docs/overview.md @@ -14,7 +14,7 @@ to training, evaluation, and decoding. 
Some key files and their functions: -* [`tpu_trainer.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/tpu/tpu_trainer.py) and [`tpu_trainer_lib.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/tpu/tpu_trainer_lib.py): +* [`t2t_trainer.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t_trainer.py) and [`trainer_lib.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/trainer_lib.py): Main entrypoint for training and evaluation. Constructs and runs all the main components of the system (the `Problem`, the `HParams`, the `Estimator`, the `Experiment`, the `input_fn`s and `model_fn`). @@ -134,7 +134,7 @@ The default implementations of `bottom`, `top`, and `loss` depend on the The actual training loop and related services (checkpointing, summaries, continuous evaluation, etc.) are all handled by `Estimator` and `Experiment` -objects. `tpu_trainer.py` is the main entrypoint and uses `tpu_trainer_lib.py` +objects. `t2t_trainer.py` is the main entrypoint and uses `trainer_lib.py` to construct the various components. ## Decoding @@ -144,7 +144,7 @@ to construct the various components. ## System Overview for Train/Eval -See `tpu_trainer.py`. +See `t2t_trainer.py` and `trainer_lib.py`. * Create HParams * Create `RunConfig`, including `Parallelism` object (i.e. `data_parallelism`) diff --git a/docs/walkthrough.md b/docs/walkthrough.md index de2951c53..06a15d1c8 100644 --- a/docs/walkthrough.md +++ b/docs/walkthrough.md @@ -296,36 +296,8 @@ specifying the `--t2t_usr_dir` flag in `t2t-trainer`. You can do so for models, hyperparameter sets, modalities, and problems. Please do submit a pull request if your component might be useful to others. -Here's an example with a new hyperparameter set: - -```python -# In ~/usr/t2t_usr/my_registrations.py - -from tensor2tensor.models import transformer -from tensor2tensor.utils import registry - -@registry.register_hparams -def transformer_my_very_own_hparams_set(): - hparams = transformer.transformer_base() - hparams.hidden_size = 1024 - ... -``` - -```python -# In ~/usr/t2t_usr/__init__.py -from . import my_registrations -``` - -``` -t2t-trainer --t2t_usr_dir=~/usr/t2t_usr --registry_help -``` - -You'll see under the registered HParams your -`transformer_my_very_own_hparams_set`, which you can directly use on the command -line with the `--hparams_set` flag. - -`t2t-datagen` also supports the `--t2t_usr_dir` flag for `Problem` -registrations. +See the [`example_usr_dir`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/test_data/example_usr_dir) +for an example user directory. 
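For reference, the linked `example_usr_dir` registers a hyperparameter set along the lines of the example removed above. A minimal sketch of such a user directory follows; the module name is illustrative and this is not necessarily the exact contents of `example_usr_dir`:

```python
# my_registrations.py -- any module importable from the --t2t_usr_dir directory
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_my_very_own_hparams_set():
  # Start from the base Transformer hyperparameters and override a few values.
  hparams = transformer.transformer_base()
  hparams.hidden_size = 1024
  return hparams
```

The directory's `__init__.py` imports the module (`from . import my_registrations`); running `t2t-trainer --t2t_usr_dir=<that directory> --registry_help` then lists `transformer_my_very_own_hparams_set`, which can be selected with `--hparams_set`. The new Travis check above greps the registry help for exactly this kind of registration.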
## Adding a dataset diff --git a/setup.py b/setup.py index fb2b6492d..ede08f6ae 100644 --- a/setup.py +++ b/setup.py @@ -5,7 +5,7 @@ setup( name='tensor2tensor', - version='1.4.1', + version='1.4.2', description='Tensor2Tensor', author='Google Inc.', author_email='no-reply@google.com', @@ -23,10 +23,19 @@ 'tensor2tensor/bin/t2t-datagen', 'tensor2tensor/bin/t2t-decoder', 'tensor2tensor/bin/t2t-make-tf-configs', + 'tensor2tensor/bin/t2t-exporter', + 'tensor2tensor/bin/t2t-query-server', + 'tensor2tensor/bin/t2t-insights-server', + 'tensor2tensor/bin/t2t-avg-all', + 'tensor2tensor/bin/t2t-bleu', + 'tensor2tensor/bin/t2t-translate-all', ], install_requires=[ 'bz2file', + 'flask', 'future', + 'gevent', + 'gunicorn', 'gym', 'numpy', 'requests', @@ -35,8 +44,8 @@ 'six', ], extras_require={ - 'tensorflow': ['tensorflow>=1.4.0'], - 'tensorflow_gpu': ['tensorflow-gpu>=1.4.0'], + 'tensorflow': ['tensorflow>=1.4.1'], + 'tensorflow_gpu': ['tensorflow-gpu>=1.4.1'], 'tests': ['pytest', 'h5py', 'mock'], }, classifiers=[ diff --git a/tensor2tensor/bin/t2t-avg-all b/tensor2tensor/bin/t2t-avg-all index 3b4d6211d..abef8b755 100755 --- a/tensor2tensor/bin/t2t-avg-all +++ b/tensor2tensor/bin/t2t-avg-all @@ -1,105 +1,15 @@ #!/usr/bin/env python -# coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Script to continously average last N checkpoints in a given directory.""" +"""t2t-avg-all.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import os -import logging - -# Dependency imports +from tensor2tensor.bin import t2t_avg_all -import numpy as np -import six -from six.moves import zip # pylint: disable=redefined-builtin -from collections import deque -import shutil import tensorflow as tf -from tensor2tensor.utils import bleu_hook - -flags = tf.flags -FLAGS = flags.FLAGS - -flags.DEFINE_string("model_dir", "", "Directory to load model checkpoints from.") -flags.DEFINE_string("output_dir", "avg/", "Directory to output the averaged checkpoints to.") -flags.DEFINE_integer("n", 8, "How many checkpoints should be averaged?") -flags.DEFINE_integer("min_steps", 0, "Ignore checkpoints with less steps.") -flags.DEFINE_integer("wait_minutes", 0, "Wait upto N minutes for a new checkpoint.") - - -def main(_): - tf.logging._handler.setFormatter(logging.Formatter("%(asctime)s:" + logging.BASIC_FORMAT, None)) - tf.logging.set_verbosity(tf.logging.INFO) - - model_dir = os.path.expanduser(FLAGS.model_dir) - output_dir = os.path.expanduser(FLAGS.output_dir) - out_base_file = os.path.join(output_dir, 'model.ckpt') - - # Copy flags.txt with the original time, so t2t-bleu can report correct relative time. 
- os.makedirs(FLAGS.output_dir, exist_ok=True) - if not os.path.exists(os.path.join(output_dir, 'flags.txt')): - shutil.copy2(os.path.join(model_dir, 'flags.txt'), os.path.join(output_dir, 'flags.txt')) - - models_processed = 0 - queue = deque() - for model in bleu_hook.stepfiles_iterator(model_dir, FLAGS.wait_minutes, FLAGS.min_steps): - if models_processed == 0: - var_list = tf.contrib.framework.list_variables(model.filename) - avg_values = {} - for (name, shape) in var_list: - if not name.startswith("global_step"): - avg_values[name] = np.zeros(shape) - models_processed += 1 - - tf.logging.info("Loading [%d]: %s" % (models_processed, model.filename)) - reader = tf.contrib.framework.load_checkpoint(model.filename) - for name in avg_values: - avg_values[name] += reader.get_tensor(name) / FLAGS.n - queue.append(model) - if len(queue) < FLAGS.n: - continue - - out_file = "%s-%d" % (out_base_file, model.steps) - tf_vars = [] - tf.logging.info("Averaging %s" % (out_file)) - for (name, value) in six.iteritems(avg_values): - tf_vars.append(tf.get_variable(name, shape=value.shape)) # TODO , dtype=var_dtypes[name] - placeholders = [tf.placeholder(v.dtype, shape=v.shape) for v in tf_vars] - assign_ops = [tf.assign(v, p) for (v, p) in zip(tf_vars, placeholders)] - - global_step = tf.Variable(model.steps, name="global_step", trainable=False, dtype=tf.int64) - saver = tf.train.Saver(tf.global_variables()) - - tf.logging.info("Running session for %s" % (out_file)) - with tf.Session() as sess: - sess.run(tf.global_variables_initializer()) - for p, assign_op, (name, value) in zip(placeholders, assign_ops, six.iteritems(avg_values)): - sess.run(assign_op, {p: value}) - tf.logging.info("Storing to %s" % out_file) - saver.save(sess, out_base_file, global_step=global_step) - os.utime(out_file + '.index', (model.mtime, model.mtime)) - - tf.reset_default_graph() - first_model = queue.popleft() - reader = tf.contrib.framework.load_checkpoint(first_model.filename) - for name in avg_values: - avg_values[name] -= reader.get_tensor(name) / FLAGS.n +def main(argv): + t2t_avg_all.main(argv) if __name__ == "__main__": diff --git a/tensor2tensor/bin/t2t-bleu b/tensor2tensor/bin/t2t-bleu index cac2b9fc3..966f50a81 100755 --- a/tensor2tensor/bin/t2t-bleu +++ b/tensor2tensor/bin/t2t-bleu @@ -1,136 +1,16 @@ #!/usr/bin/env python -# coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Evaluate BLEU score for all checkpoints/translations in a given directory. - -This script can be used in two ways. - -To evaluate one already translated file: -`t2t-bleu --translation=my-wmt13.de --reference=wmt13_deen.de` - -To evaluate all translations in a given directory (translated by t2t-translate-all): -`t2t-bleu - --translations_dir=my-translations - --reference=wmt13_deen.de - --event_dir=events` - -In addition to the above-mentioned compulsory parameters, -there are optional parameters: - - * bleu_variant: cased (case-sensitive), uncased, both (default). 
- * tag_suffix: Default="", so the tags will be BLEU_cased and BLEU_uncased. tag_suffix - can be used e.g. for different beam sizes if these should be plotted in different graphs. - * min_steps: Don't evaluate checkpoints with less steps. - Default=-1 means check the `last_evaluated_step.txt` file, which contains the number of steps - of the last successfully evaluated checkpoint. - * report_zero: Store BLEU=0 and guess its time based on the oldest file in the translations_dir. - Default=True. This is useful, so TensorBoard reports correct relative time for the remaining - checkpoints. This flag is set to False if min_steps is > 0. - * wait_minutes: Wait upto N minutes for a new translated file. Default=0. - This is useful for continuous evaluation of a running training, - in which case this should be equal to save_checkpoints_secs/60 plus time needed for translation - plus some reserve. -""" +"""t2t-bleu.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import os -from tensor2tensor.utils import bleu_hook -import tensorflow as tf +from tensor2tensor.bin import t2t_bleu -flags = tf.flags -FLAGS = flags.FLAGS - -flags.DEFINE_string("source", None, "Path to the source-language file to be translated") -flags.DEFINE_string("reference", None, "Path to the reference translation file") -flags.DEFINE_string("translation", None, "Path to the MT system translation file") -flags.DEFINE_string("translations_dir", None, "Directory with translated files to be evaulated.") -flags.DEFINE_string("event_dir", None, "Where to store the event file.") - -flags.DEFINE_string("bleu_variant", "both", - "Possible values: cased(case-sensitive), uncased, both(default).") -flags.DEFINE_string("tag_suffix", "", - "What to add to BLEU_cased and BLEU_uncased tags. Default=''.") -flags.DEFINE_integer("min_steps", -1, "Don't evaluate checkpoints with less steps.") -flags.DEFINE_integer("wait_minutes", 0, - "Wait upto N minutes for a new checkpoint, cf. save_checkpoints_secs.") -flags.DEFINE_bool("report_zero", None, "Store BLEU=0 and guess its time based on the oldest file.") - - -def main(_): - tf.logging.set_verbosity(tf.logging.INFO) - if FLAGS.translation: - if FLAGS.translations_dir: - raise ValueError('Cannot specify both --translation and --translations_dir.') - if FLAGS.bleu_variant in ('uncased', 'both'): - bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, FLAGS.translation, case_sensitive=False) - print("BLEU_uncased = %6.2f" % bleu) - if FLAGS.bleu_variant in ('cased', 'both'): - bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, FLAGS.translation, case_sensitive=True) - print("BLEU_cased = %6.2f" % bleu) - return - - if not FLAGS.translations_dir: - raise ValueError('Either --translation or --translations_dir must be specified.') - transl_dir = os.path.expanduser(FLAGS.translations_dir) - - last_step_file = os.path.join(FLAGS.event_dir, 'last_evaluated_step.txt') - if FLAGS.min_steps == -1: - try: - with open(last_step_file) as ls_file: - FLAGS.min_steps = int(ls_file.read()) - except FileNotFoundError: - FLAGS.min_steps = 0 - if FLAGS.report_zero is None: - FLAGS.report_zero = FLAGS.min_steps == 0 +import tensorflow as tf - writer = tf.summary.FileWriter(FLAGS.event_dir) - for transl_file in bleu_hook.stepfiles_iterator(transl_dir, FLAGS.wait_minutes, - FLAGS.min_steps, path_suffix=''): - # report_zero handling must be inside the for-loop, - # so we are sure the transl_dir is already created. 
- if FLAGS.report_zero: - all_files = (os.path.join(transl_dir, f) for f in os.listdir(transl_dir)) - start_time = min(os.path.getmtime(f) for f in all_files if os.path.isfile(f)) - values = [] - if FLAGS.bleu_variant in ('uncased', 'both'): - values.append(tf.Summary.Value(tag='BLEU_uncased' + FLAGS.tag_suffix, simple_value=0)) - if FLAGS.bleu_variant in ('cased', 'both'): - values.append(tf.Summary.Value(tag='BLEU_cased' + FLAGS.tag_suffix, simple_value=0)) - writer.add_event(tf.summary.Event(summary=tf.Summary(value=values), - wall_time=start_time, step=0)) - FLAGS.report_zero = False +def main(argv): + t2t_bleu.main(argv) - filename = transl_file.filename - tf.logging.info("Evaluating " + filename) - values = [] - if FLAGS.bleu_variant in ('uncased', 'both'): - bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, filename, case_sensitive=False) - values.append(tf.Summary.Value(tag='BLEU_uncased' + FLAGS.tag_suffix, simple_value=bleu)) - tf.logging.info("%s: BLEU_uncased = %6.2f" % (filename, bleu)) - if FLAGS.bleu_variant in ('cased', 'both'): - bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, filename, case_sensitive=True) - values.append(tf.Summary.Value(tag='BLEU_cased' + FLAGS.tag_suffix, simple_value=bleu)) - tf.logging.info("%s: BLEU_cased = %6.2f" % (transl_file.filename, bleu)) - writer.add_event(tf.summary.Event(summary=tf.Summary(value=values), - wall_time=transl_file.mtime, step=transl_file.steps)) - writer.flush() - with open(last_step_file, 'w') as ls_file: - ls_file.write(str(transl_file.steps) + '\n') if __name__ == "__main__": diff --git a/tensor2tensor/bin/t2t-datagen b/tensor2tensor/bin/t2t-datagen old mode 100644 new mode 100755 index 2ac0f0db2..4290365b6 --- a/tensor2tensor/bin/t2t-datagen +++ b/tensor2tensor/bin/t2t-datagen @@ -1,211 +1,15 @@ #!/usr/bin/env python -# coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Produces the training and dev data for --problem into --data_dir. - -Produces sharded and shuffled TFRecord files of tensorflow.Example protocol -buffers for a variety of registered datasets. - -All Problems are registered with @registry.register_problem or are in -_SUPPORTED_PROBLEM_GENERATORS in this file. Each entry maps a string name -(selectable on the command-line with --problem) to a function that takes 2 -arguments - input_directory and mode (one of "train" or "dev") - and yields for -each training example a dictionary mapping string feature names to lists of -{string, int, float}. The generator will be run once for each mode. 
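To make the generator interface described in the docstring above concrete, here is a toy generator with that shape; the name and feature values are invented for illustration and do not correspond to an actual `_SUPPORTED_PROBLEM_GENERATORS` entry:

```python
# Illustrative only: a legacy-style generator taking (input_directory, mode)
# and yielding, per training example, a dict of feature name -> list of ints.
def toy_identity_generator(input_directory, mode):
  del input_directory  # A real generator would read raw data from here.
  num_cases = 1000 if mode == "train" else 100
  for i in range(num_cases):
    digits = [int(d) for d in str(i)]
    yield {"inputs": digits, "targets": digits}
```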
-""" +"""t2t-datagen.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import os -import random -import tempfile - -# Dependency imports - -import numpy as np - -from tensor2tensor.data_generators import algorithmic_math -from tensor2tensor.data_generators import all_problems # pylint: disable=unused-import -from tensor2tensor.data_generators import audio -from tensor2tensor.data_generators import generator_utils -from tensor2tensor.data_generators import snli -from tensor2tensor.data_generators import wsj_parsing -from tensor2tensor.utils import registry -from tensor2tensor.utils import usr_dir +from tensor2tensor.bin import t2t_datagen import tensorflow as tf -flags = tf.flags -FLAGS = flags.FLAGS - -flags.DEFINE_string("data_dir", "", "Data directory.") -flags.DEFINE_string("tmp_dir", "/tmp/t2t_datagen", - "Temporary storage directory.") -flags.DEFINE_string("problem", "", - "The name of the problem to generate data for.") -flags.DEFINE_string("exclude_problems", "", - "Comma-separates list of problems to exclude.") -flags.DEFINE_integer("num_shards", 0, "How many shards to use. Ignored for " - "registered Problems.") -flags.DEFINE_integer("max_cases", 0, - "Maximum number of cases to generate (unbounded if 0).") -flags.DEFINE_bool("only_list", False, - "If true, we only list the problems that will be generated.") -flags.DEFINE_integer("random_seed", 429459, "Random seed to use.") -flags.DEFINE_integer("task_id", -1, "For distributed data generation.") -flags.DEFINE_string("t2t_usr_dir", "", - "Path to a Python module that will be imported. The " - "__init__.py file should include the necessary imports. " - "The imported files should contain registrations, " - "e.g. @registry.register_problem calls, that will then be " - "available to t2t-datagen.") - -# Mapping from problems that we can generate data for to their generators. 
-# pylint: disable=g-long-lambda -_SUPPORTED_PROBLEM_GENERATORS = { - "algorithmic_algebra_inverse": ( - lambda: algorithmic_math.algebra_inverse(26, 0, 2, 100000), - lambda: algorithmic_math.algebra_inverse(26, 3, 3, 10000)), - "parsing_english_ptb8k": ( - lambda: wsj_parsing.parsing_token_generator( - FLAGS.data_dir, FLAGS.tmp_dir, True, 2**13, 2**9), - lambda: wsj_parsing.parsing_token_generator( - FLAGS.data_dir, FLAGS.tmp_dir, False, 2**13, 2**9)), - "parsing_english_ptb16k": ( - lambda: wsj_parsing.parsing_token_generator( - FLAGS.data_dir, FLAGS.tmp_dir, True, 2**14, 2**9), - lambda: wsj_parsing.parsing_token_generator( - FLAGS.data_dir, FLAGS.tmp_dir, False, 2**14, 2**9)), - "inference_snli32k": ( - lambda: snli.snli_token_generator(FLAGS.tmp_dir, True, 2**15), - lambda: snli.snli_token_generator(FLAGS.tmp_dir, False, 2**15), - ), - "audio_timit_characters_test": ( - lambda: audio.timit_generator( - FLAGS.data_dir, FLAGS.tmp_dir, True, 1718), - lambda: audio.timit_generator( - FLAGS.data_dir, FLAGS.tmp_dir, False, 626)), - "audio_timit_tokens_8k_test": ( - lambda: audio.timit_generator( - FLAGS.data_dir, FLAGS.tmp_dir, True, 1718, - vocab_filename="vocab.endefr.%d" % 2**13, vocab_size=2**13), - lambda: audio.timit_generator( - FLAGS.data_dir, FLAGS.tmp_dir, False, 626, - vocab_filename="vocab.endefr.%d" % 2**13, vocab_size=2**13)), - "audio_timit_tokens_32k_test": ( - lambda: audio.timit_generator( - FLAGS.data_dir, FLAGS.tmp_dir, True, 1718, - vocab_filename="vocab.endefr.%d" % 2**15, vocab_size=2**15), - lambda: audio.timit_generator( - FLAGS.data_dir, FLAGS.tmp_dir, False, 626, - vocab_filename="vocab.endefr.%d" % 2**15, vocab_size=2**15)), -} - -# pylint: enable=g-long-lambda - - -def set_random_seed(): - """Set the random seed from flag everywhere.""" - tf.set_random_seed(FLAGS.random_seed) - random.seed(FLAGS.random_seed) - np.random.seed(FLAGS.random_seed) - - -def main(_): - tf.logging.set_verbosity(tf.logging.INFO) - usr_dir.import_usr_dir(FLAGS.t2t_usr_dir) - - # Calculate the list of problems to generate. - problems = sorted( - list(_SUPPORTED_PROBLEM_GENERATORS) + registry.list_problems()) - for exclude in FLAGS.exclude_problems.split(","): - if exclude: - problems = [p for p in problems if exclude not in p] - if FLAGS.problem and FLAGS.problem[-1] == "*": - problems = [p for p in problems if p.startswith(FLAGS.problem[:-1])] - elif FLAGS.problem: - problems = [p for p in problems if p == FLAGS.problem] - else: - problems = [] - - # Remove TIMIT if paths are not given. - if not FLAGS.timit_paths: - problems = [p for p in problems if "timit" not in p] - # Remove parsing if paths are not given. - if not FLAGS.parsing_path: - problems = [p for p in problems if "parsing" not in p] - - if not problems: - problems_str = "\n * ".join( - sorted(list(_SUPPORTED_PROBLEM_GENERATORS) + registry.list_problems())) - error_msg = ("You must specify one of the supported problems to " - "generate data for:\n * " + problems_str + "\n") - error_msg += ("TIMIT and parsing need data_sets specified with " - "--timit_paths and --parsing_path.") - raise ValueError(error_msg) - - if not FLAGS.data_dir: - FLAGS.data_dir = tempfile.gettempdir() - tf.logging.warning("It is strongly recommended to specify --data_dir. 
" - "Data will be written to default data_dir=%s.", - FLAGS.data_dir) - - tf.logging.info("Generating problems:\n%s" - % registry.display_list_by_prefix(problems, - starting_spaces=4)) - if FLAGS.only_list: - return - for problem in problems: - set_random_seed() - - if problem in _SUPPORTED_PROBLEM_GENERATORS: - generate_data_for_problem(problem) - else: - generate_data_for_registered_problem(problem) - - -def generate_data_for_problem(problem): - """Generate data for a problem in _SUPPORTED_PROBLEM_GENERATORS.""" - training_gen, dev_gen = _SUPPORTED_PROBLEM_GENERATORS[problem] - - num_shards = FLAGS.num_shards or 10 - tf.logging.info("Generating training data for %s.", problem) - train_output_files = generator_utils.train_data_filenames( - problem + generator_utils.UNSHUFFLED_SUFFIX, FLAGS.data_dir, num_shards) - generator_utils.generate_files(training_gen(), train_output_files, - FLAGS.max_cases) - tf.logging.info("Generating development data for %s.", problem) - dev_output_files = generator_utils.dev_data_filenames( - problem + generator_utils.UNSHUFFLED_SUFFIX, FLAGS.data_dir, 1) - generator_utils.generate_files(dev_gen(), dev_output_files) - all_output_files = train_output_files + dev_output_files - generator_utils.shuffle_dataset(all_output_files) - - -def generate_data_for_registered_problem(problem_name): - tf.logging.info("Generating data for %s.", problem_name) - if FLAGS.num_shards: - raise ValueError("--num_shards should not be set for registered Problem.") - problem = registry.problem(problem_name) - task_id = None if FLAGS.task_id < 0 else FLAGS.task_id - problem.generate_data( - os.path.expanduser(FLAGS.data_dir), - os.path.expanduser(FLAGS.tmp_dir), - task_id=task_id) +def main(argv): + t2t_datagen.main(argv) if __name__ == "__main__": diff --git a/tensor2tensor/bin/t2t-decoder b/tensor2tensor/bin/t2t-decoder old mode 100644 new mode 100755 index f453b01fd..612117c22 --- a/tensor2tensor/bin/t2t-decoder +++ b/tensor2tensor/bin/t2t-decoder @@ -1,109 +1,15 @@ #!/usr/bin/env python -# coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -r"""Decode from trained T2T models. - -This binary performs inference using the Estimator API. - -Example usage to decode from dataset: - - t2t-decoder \ - --data_dir ~/data \ - --problems=algorithmic_identity_binary40 \ - --model=transformer - --hparams_set=transformer_base - -Set FLAGS.decode_interactive or FLAGS.decode_from_file for alternative decode -sources. 
-""" +"""t2t-decoder.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import os - -# Dependency imports - -from tensor2tensor.tpu import tpu_trainer -from tensor2tensor.tpu import tpu_trainer_lib -from tensor2tensor.utils import decoding -from tensor2tensor.utils import usr_dir +from tensor2tensor.bin import t2t_decoder import tensorflow as tf -flags = tf.flags -FLAGS = flags.FLAGS - -# Additional flags in tpu/tpu_trainer.py and utils/flags.py -flags.DEFINE_string("decode_from_file", None, - "Path to the source file for decoding") -flags.DEFINE_string("decode_to_file", None, - "Path to the decoded (output) file") -flags.DEFINE_bool("decode_interactive", False, - "Interactive local inference mode.") -flags.DEFINE_integer("decode_shards", 1, "Number of decoding replicas.") - - -def create_hparams(): - return tpu_trainer_lib.create_hparams( - FLAGS.hparams_set, - FLAGS.hparams, - data_dir=os.path.expanduser(FLAGS.data_dir), - problem_name=FLAGS.problems) - - -def create_decode_hparams(): - decode_hp = decoding.decode_hparams(FLAGS.decode_hparams) - decode_hp.add_hparam("shards", FLAGS.decode_shards) - decode_hp.add_hparam("shard_id", FLAGS.worker_id) - return decode_hp - - -def decode(estimator, hparams, decode_hp): - if FLAGS.decode_interactive: - decoding.decode_interactively(estimator, hparams, decode_hp) - elif FLAGS.decode_from_file: - decoding.decode_from_file(estimator, FLAGS.decode_from_file, hparams, - decode_hp, FLAGS.decode_to_file) - else: - decoding.decode_from_dataset( - estimator, - FLAGS.problems.split("-"), - hparams, - decode_hp, - decode_to_file=FLAGS.decode_to_file, - dataset_split="test" if FLAGS.eval_use_test_set else None) - - -def main(_): - tf.logging.set_verbosity(tf.logging.INFO) - usr_dir.import_usr_dir(FLAGS.t2t_usr_dir) - FLAGS.use_tpu = False # decoding not supported on TPU - - hp = create_hparams() - decode_hp = create_decode_hparams() - - estimator = tpu_trainer_lib.create_estimator( - FLAGS.model, - hp, - tpu_trainer.create_run_config(hp), - decode_hparams=decode_hp, - use_tpu=False) - - decode(estimator, hp, decode_hp) +def main(argv): + t2t_decoder.main(argv) if __name__ == "__main__": diff --git a/tensor2tensor/bin/t2t-exporter b/tensor2tensor/bin/t2t-exporter new file mode 100755 index 000000000..cfd4f5ff8 --- /dev/null +++ b/tensor2tensor/bin/t2t-exporter @@ -0,0 +1,16 @@ +#!/usr/bin/env python +"""t2t-exporter.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from tensor2tensor.serving import export + +import tensorflow as tf + +def main(argv): + export.main(argv) + + +if __name__ == "__main__": + tf.app.run() diff --git a/tensor2tensor/bin/t2t-insights-server b/tensor2tensor/bin/t2t-insights-server new file mode 100755 index 000000000..102202c9b --- /dev/null +++ b/tensor2tensor/bin/t2t-insights-server @@ -0,0 +1,16 @@ +#!/usr/bin/env python +"""t2t-insights-server.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from tensor2tensor.insights import server + +import tensorflow as tf + +def main(argv): + server.main(argv) + + +if __name__ == "__main__": + tf.app.run() diff --git a/tensor2tensor/bin/t2t-make-tf-configs b/tensor2tensor/bin/t2t-make-tf-configs old mode 100644 new mode 100755 index 0b656aba6..b481ea910 --- a/tensor2tensor/bin/t2t-make-tf-configs +++ b/tensor2tensor/bin/t2t-make-tf-configs @@ -1,86 +1,15 @@ #!/usr/bin/env python -# 
coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Output command line arguments and json-encoded TF_CONFIGs. - -Usage: - -`t2t-make-tf-configs --masters="server1:1234" --ps="server3:2134,server4:2334"` - -Outputs 1 line per job to stdout, first the masters, then the parameter servers. -Each line has the TF_CONFIG, then a tab, then the command line flags for that -job. - -If there is a single master, it will have the `--sync` flag. -""" +"""t2t-make-tf-configs.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import json - -# Dependency imports +from tensor2tensor.bin import make_tf_configs import tensorflow as tf -flags = tf.flags -FLAGS = flags.FLAGS - -flags.DEFINE_string("masters", "", "Comma-separated list of master addresses") -flags.DEFINE_string("ps", "", "Comma-separated list of ps addresses") - - -def main(_): - if not (FLAGS.masters and FLAGS.ps): - raise ValueError("Must provide --masters and --ps") - - masters = FLAGS.masters.split(",") - ps = FLAGS.ps.split(",") - - cluster = {"ps": ps, "master": masters} - - for task_type, jobs in (("master", masters), ("ps", ps)): - for idx, job in enumerate(jobs): - if task_type == "master": - cmd_line_flags = " ".join([ - "--master=grpc://%s" % job, - "--ps_replicas=%d" % len(ps), - "--worker_replicas=%d" % len(masters), - "--worker_gpu=1", - "--worker_id=%d" % idx, - "--worker_job='/job:master'", - "--ps_gpu=1", - "--schedule=train", - "--sync" if len(masters) == 1 else "", - ]) - else: - cmd_line_flags = " ".join([ - "--master=grpc://%s" % job, - "--schedule=run_std_server", - ]) - - tf_config = json.dumps({ - "cluster": cluster, - "task": { - "type": task_type, - "index": idx - }, - "environment": "cloud", - }) - print("'%s'\t%s" % (tf_config, cmd_line_flags)) +def main(argv): + make_tf_configs.main(argv) if __name__ == "__main__": diff --git a/tensor2tensor/bin/t2t-query-server b/tensor2tensor/bin/t2t-query-server new file mode 100755 index 000000000..91ede7ce7 --- /dev/null +++ b/tensor2tensor/bin/t2t-query-server @@ -0,0 +1,16 @@ +#!/usr/bin/env python +"""t2t-query-server.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from tensor2tensor.serving import query + +import tensorflow as tf + +def main(argv): + query.main(argv) + + +if __name__ == "__main__": + tf.app.run() diff --git a/tensor2tensor/bin/t2t-trainer b/tensor2tensor/bin/t2t-trainer old mode 100644 new mode 100755 index 70435094a..77f1ec865 --- a/tensor2tensor/bin/t2t-trainer +++ b/tensor2tensor/bin/t2t-trainer @@ -1,190 +1,15 @@ #!/usr/bin/env python -# coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Train on TPU.""" +"""t2t-trainer.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import contextlib -import os -import sys - -# Dependency imports - -from tensor2tensor import models # pylint: disable=unused-import -from tensor2tensor import problems as problems_lib # pylint: disable=unused-import -from tensor2tensor.tpu import tpu_trainer_lib -from tensor2tensor.utils import decoding -from tensor2tensor.utils import flags as t2t_flags # pylint: disable=unused-import -from tensor2tensor.utils import registry -from tensor2tensor.utils import usr_dir +from tensor2tensor.bin import t2t_trainer import tensorflow as tf -flags = tf.flags -FLAGS = flags.FLAGS - -# See flags.py for additional command-line flags. -flags.DEFINE_string("t2t_usr_dir", "", - "Path to a Python module that will be imported. The " - "__init__.py file should include the necessary imports. " - "The imported files should contain registrations, " - "e.g. @registry.register_model calls, that will then be " - "available to the t2t-trainer.") -flags.DEFINE_integer("random_seed", 1234, "Random seed.") -flags.DEFINE_integer("tpu_num_shards", 8, "Number of tpu shards.") -flags.DEFINE_integer("iterations_per_loop", 1000, - "Number of iterations in a TPU training loop.") -flags.DEFINE_bool("use_tpu", False, "Whether to use TPU.") -flags.DEFINE_bool("generate_data", False, "Generate data before training?") -flags.DEFINE_string("tmp_dir", "/tmp/t2t_datagen", - "Temporary storage directory, used if --generate_data.") -flags.DEFINE_bool("profile", False, "Profile performance?") - -# To maintain compatibility with some internal libs, we guard against these flag -# definitions possibly erroring. Apologies for the ugliness. -try: - flags.DEFINE_string("master", "", "Address of TensorFlow master.") - flags.DEFINE_string("output_dir", "", "Base output directory for run.") - flags.DEFINE_string("schedule", "continuous_train_and_eval", - "Method of Experiment to run.") - flags.DEFINE_integer("eval_steps", 10000, - "Number of steps in evaluation. 
By default, eval will " - "stop after eval_steps or when it runs through the eval " - "dataset once in full, whichever comes first, so this " - "can be a very large number.") -except: # pylint: disable=bare-except - pass - - -def get_problem_name(): - problems = FLAGS.problems.split("-") - assert len(problems) == 1 - return problems[0] - - -def create_hparams(): - return tpu_trainer_lib.create_hparams(FLAGS.hparams_set, FLAGS.hparams) - - -def create_experiment_fn(): - return tpu_trainer_lib.create_experiment_fn( - model_name=FLAGS.model, - problem_name=get_problem_name(), - data_dir=os.path.expanduser(FLAGS.data_dir), - train_steps=FLAGS.train_steps, - eval_steps=FLAGS.eval_steps, - min_eval_frequency=FLAGS.local_eval_frequency, - schedule=FLAGS.schedule, - export=FLAGS.export_saved_model, - decode_hparams=decoding.decode_hparams(FLAGS.decode_hparams), - use_tfdbg=FLAGS.tfdbg, - use_dbgprofile=FLAGS.dbgprofile, - eval_early_stopping_steps=FLAGS.eval_early_stopping_steps, - eval_early_stopping_metric=FLAGS.eval_early_stopping_metric, - eval_early_stopping_metric_delta=FLAGS.eval_early_stopping_metric_delta, - eval_early_stopping_metric_minimize=FLAGS. - eval_early_stopping_metric_minimize, - use_tpu=FLAGS.use_tpu) - - -def create_run_config(hp): - return tpu_trainer_lib.create_run_config( - model_dir=os.path.expanduser(FLAGS.output_dir), - master=FLAGS.master, - iterations_per_loop=FLAGS.iterations_per_loop, - num_shards=FLAGS.tpu_num_shards, - log_device_placement=FLAGS.log_device_placement, - save_checkpoints_steps=max(FLAGS.iterations_per_loop, - FLAGS.local_eval_frequency), - keep_checkpoint_max=FLAGS.keep_checkpoint_max, - keep_checkpoint_every_n_hours=FLAGS.keep_checkpoint_every_n_hours, - num_gpus=FLAGS.worker_gpu, - gpu_order=FLAGS.gpu_order, - shard_to_cpu=FLAGS.locally_shard_to_cpu, - num_async_replicas=FLAGS.worker_replicas, - gpu_mem_fraction=FLAGS.worker_gpu_memory_fraction, - enable_graph_rewriter=FLAGS.experimental_optimize_placement, - use_tpu=FLAGS.use_tpu, - schedule=FLAGS.schedule, - no_data_parallelism=hp.no_data_parallelism, - daisy_chain_variables=hp.daisy_chain_variables, - ps_replicas=FLAGS.ps_replicas, - ps_job=FLAGS.ps_job, - ps_gpu=FLAGS.ps_gpu, - sync=FLAGS.sync, - worker_id=FLAGS.worker_id, - worker_job=FLAGS.worker_job) - - -def generate_data(): - # Generate data if requested. 
- data_dir = os.path.expanduser(FLAGS.data_dir) - tmp_dir = os.path.expanduser(FLAGS.tmp_dir) - tf.gfile.MakeDirs(data_dir) - tf.gfile.MakeDirs(tmp_dir) - - problem_name = get_problem_name() - tf.logging.info("Generating data for %s" % problem_name) - registry.problem(problem_name).generate_data(data_dir, tmp_dir) - - -@contextlib.contextmanager -def profile_context(): - if FLAGS.profile: - with tf.contrib.tfprof.ProfileContext("t2tprof", - trace_steps=range(100), - dump_steps=range(100)) as pctx: - opts = tf.profiler.ProfileOptionBuilder.time_and_memory() - pctx.add_auto_profiling("op", opts, range(100)) - yield - else: - yield - - -def log_registry(): - if FLAGS.registry_help: - tf.logging.info(registry.help_string()) - sys.exit(0) - - -def execute_schedule(exp): - if not hasattr(exp, FLAGS.schedule): - raise ValueError( - "Experiment has no method %s, from --schedule" % FLAGS.schedule) - with profile_context(): - getattr(exp, FLAGS.schedule)() - - -def main(_): - tf.logging.set_verbosity(tf.logging.INFO) - tpu_trainer_lib.set_random_seed(FLAGS.random_seed) - usr_dir.import_usr_dir(FLAGS.t2t_usr_dir) - log_registry() - - if FLAGS.generate_data: - generate_data() - - hparams = create_hparams() - run_config = create_run_config(hparams) - - exp_fn = create_experiment_fn() - exp = exp_fn(run_config, hparams) - execute_schedule(exp) +def main(argv): + t2t_trainer.main(argv) if __name__ == "__main__": diff --git a/tensor2tensor/bin/t2t-translate-all b/tensor2tensor/bin/t2t-translate-all index 1ee7e535f..fed5d3045 100755 --- a/tensor2tensor/bin/t2t-translate-all +++ b/tensor2tensor/bin/t2t-translate-all @@ -1,91 +1,17 @@ #!/usr/bin/env python -# coding=utf-8 -# Copyright 2017 The Tensor2Tensor Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Translate a file with all checkpoints in a given directory. - -t2t-decoder will be executed with these parameters: ---problems ---data_dir ---output_dir with the value of --model_dir ---decode_from_file with the value of --source ---decode_hparams with properly formated --beam_size and --alpha ---checkpoint_path automatically filled ---decode_to_file automatically filled -""" +"""t2t-translate-all.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import os -import shutil -import tensorflow as tf -from tensor2tensor.utils import bleu_hook - -flags = tf.flags - -# t2t-translate-all specific options -flags.DEFINE_string("decoder_command", "t2t-decoder {params}", - "Which command to execute instead t2t-decoder." - "{params} is replaced by the parameters. Useful e.g. 
for qsub wrapper.") -flags.DEFINE_string("model_dir", "", "Directory to load model checkpoints from.") -flags.DEFINE_string("source", None, "Path to the source-language file to be translated") -flags.DEFINE_string("translations_dir", "translations", "Where to store the translated files.") -flags.DEFINE_integer("min_steps", 0, "Ignore checkpoints with less steps.") -flags.DEFINE_integer("wait_minutes", 0, "Wait upto N minutes for a new checkpoint") - -# options derived from t2t-decoder -flags.DEFINE_integer("beam_size", 4, "Beam-search width.") -flags.DEFINE_float("alpha", 0.6, "Beam-search alpha.") -flags.DEFINE_string("model", "transformer", "see t2t-decoder") -flags.DEFINE_string("t2t_usr_dir", None, "see t2t-decoder") -flags.DEFINE_string("data_dir", None, "see t2t-decoder") -flags.DEFINE_string("problems", None, "see t2t-decoder") -flags.DEFINE_string("hparams_set", "transformer_big_single_gpu", "see t2t-decoder") +from tensor2tensor.bin import t2t_translate_all +import tensorflow as tf -def main(_): - FLAGS = flags.FLAGS - tf.logging.set_verbosity(tf.logging.INFO) - model_dir = os.path.expanduser(FLAGS.model_dir) - translations_dir = os.path.expanduser(FLAGS.translations_dir) - source = os.path.expanduser(FLAGS.source) - os.makedirs(translations_dir, exist_ok=True) - translated_base_file = os.path.join(translations_dir, FLAGS.problems) +def main(argv): + t2t_translate_all.main(argv) - # Copy flags.txt with the original time, so t2t-bleu can report correct relative time. - flags_path = os.path.join(translations_dir, FLAGS.problems + '-flags.txt') - if not os.path.exists(flags_path): - shutil.copy2(os.path.join(model_dir, 'flags.txt'), flags_path) - for model in bleu_hook.stepfiles_iterator(model_dir, FLAGS.wait_minutes, FLAGS.min_steps): - tf.logging.info("Translating " + model.filename) - out_file = translated_base_file + '-' + str(model.steps) - if os.path.exists(out_file): - tf.logging.info(out_file + " already exists, so skipping it.") - else: - tf.logging.info("Translating " + out_file) - params = ("--t2t_usr_dir={FLAGS.t2t_usr_dir} --output_dir={model_dir} " - "--data_dir={FLAGS.data_dir} --problems={FLAGS.problems} " - "--decode_hparams=beam_size={FLAGS.beam_size},alpha={FLAGS.alpha} " - "--model={FLAGS.model} --hparams_set={FLAGS.hparams_set} " - "--checkpoint_path={model.filename} --decode_from_file={source} " - "--decode_to_file={out_file}".format(**locals())) - command = FLAGS.decoder_command.format(**locals()) - tf.logging.info("Running:\n" + command) - os.system(command) if __name__ == "__main__": tf.app.run() diff --git a/tensor2tensor/bin/t2t_avg_all.py b/tensor2tensor/bin/t2t_avg_all.py new file mode 100644 index 000000000..6e2a3088e --- /dev/null +++ b/tensor2tensor/bin/t2t_avg_all.py @@ -0,0 +1,116 @@ +# coding=utf-8 +# Copyright 2017 The Tensor2Tensor Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Script to continously average last N checkpoints in a given directory.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from collections import deque +import logging +import os +import shutil + +# Dependency imports + +import numpy as np +import six +from six.moves import zip # pylint: disable=redefined-builtin +from tensor2tensor.utils import bleu_hook +import tensorflow as tf + +flags = tf.flags +FLAGS = flags.FLAGS + +flags.DEFINE_string("model_dir", "", + "Directory to load model checkpoints from.") +flags.DEFINE_string("output_dir", "avg/", + "Directory to output the averaged checkpoints to.") +flags.DEFINE_integer("n", 8, "How many checkpoints should be averaged?") +flags.DEFINE_integer("min_steps", 0, "Ignore checkpoints with less steps.") +flags.DEFINE_integer("wait_minutes", 0, + "Wait upto N minutes for a new checkpoint.") + + +def main(_): + tf.logging._handler.setFormatter( # pylint: disable=protected-access + logging.Formatter("%(asctime)s:" + logging.BASIC_FORMAT, None)) + tf.logging.set_verbosity(tf.logging.INFO) + + model_dir = os.path.expanduser(FLAGS.model_dir) + output_dir = os.path.expanduser(FLAGS.output_dir) + out_base_file = os.path.join(output_dir, "model.ckpt") + + # Copy flags.txt with the original time, so t2t-bleu can report correct + # relative time. + tf.gfile.MakeDirs(FLAGS.output_dir) + if not os.path.exists(os.path.join(output_dir, "flags.txt")): + shutil.copy2(os.path.join(model_dir, "flags.txt"), + os.path.join(output_dir, "flags.txt")) + + models_processed = 0 + queue = deque() + for model in bleu_hook.stepfiles_iterator(model_dir, FLAGS.wait_minutes, + FLAGS.min_steps): + if models_processed == 0: + var_list = tf.contrib.framework.list_variables(model.filename) + avg_values = {} + for (name, shape) in var_list: + if not name.startswith("global_step"): + avg_values[name] = np.zeros(shape) + models_processed += 1 + + tf.logging.info("Loading [%d]: %s" % (models_processed, model.filename)) + reader = tf.contrib.framework.load_checkpoint(model.filename) + for name in avg_values: + avg_values[name] += reader.get_tensor(name) / FLAGS.n + queue.append(model) + if len(queue) < FLAGS.n: + continue + + out_file = "%s-%d" % (out_base_file, model.steps) + tf_vars = [] + tf.logging.info("Averaging %s" % (out_file)) + for (name, value) in six.iteritems(avg_values): + # TODO(martinpopel): dtype=var_dtypes[name] + tf_vars.append(tf.get_variable(name, shape=value.shape)) + placeholders = [tf.placeholder(v.dtype, shape=v.shape) for v in tf_vars] + assign_ops = [tf.assign(v, p) for (v, p) in zip(tf_vars, placeholders)] + + global_step = tf.get_variable( + "global_step", + initializer=tf.constant(model.steps, dtype=tf.int64), + trainable=False) + saver = tf.train.Saver(tf.global_variables()) + + tf.logging.info("Running session for %s" % (out_file)) + with tf.Session() as sess: + sess.run(tf.global_variables_initializer()) + for p, assign_op, (name, value) in zip( + placeholders, assign_ops, six.iteritems(avg_values)): + sess.run(assign_op, {p: value}) + tf.logging.info("Storing to %s" % out_file) + saver.save(sess, out_base_file, global_step=global_step) + os.utime(out_file + ".index", (model.mtime, model.mtime)) + + tf.reset_default_graph() + first_model = queue.popleft() + + reader = tf.contrib.framework.load_checkpoint(first_model.filename) + for name in avg_values: + avg_values[name] -= reader.get_tensor(name) / FLAGS.n + +if __name__ == "__main__": + tf.app.run() diff --git 
a/tensor2tensor/bin/t2t_bleu.py b/tensor2tensor/bin/t2t_bleu.py new file mode 100644 index 000000000..5db83965d --- /dev/null +++ b/tensor2tensor/bin/t2t_bleu.py @@ -0,0 +1,168 @@ +# coding=utf-8 +# Copyright 2017 The Tensor2Tensor Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Evaluate BLEU score for all checkpoints/translations in a given directory. + +This script can be used in two ways. + + +To evaluate one already translated file: + +``` +t2t-bleu --translation=my-wmt13.de --reference=wmt13_deen.de +``` + +To evaluate all translations in a given directory (translated by +`t2t-translate-all`): + +``` +t2t-bleu + --translations_dir=my-translations + --reference=wmt13_deen.de + --event_dir=events +``` + +In addition to the above-mentioned required parameters, +there are optional parameters: + * bleu_variant: cased (case-sensitive), uncased, both (default). + * tag_suffix: Default="", so the tags will be BLEU_cased and BLEU_uncased. + tag_suffix can be used e.g. for different beam sizes if these should be + plotted in different graphs. + * min_steps: Don't evaluate checkpoints with less steps. + Default=-1 means check the `last_evaluated_step.txt` file, which contains + the number of steps of the last successfully evaluated checkpoint. + * report_zero: Store BLEU=0 and guess its time based on the oldest file in the + translations_dir. Default=True. This is useful, so TensorBoard reports + correct relative time for the remaining checkpoints. This flag is set to + False if min_steps is > 0. + * wait_minutes: Wait upto N minutes for a new translated file. Default=0. + This is useful for continuous evaluation of a running training, in which case + this should be equal to save_checkpoints_secs/60 plus time needed for + translation plus some reserve. +""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + +# Dependency imports + +from tensor2tensor.utils import bleu_hook +import tensorflow as tf + + +flags = tf.flags +FLAGS = flags.FLAGS + +flags.DEFINE_string("source", None, + "Path to the source-language file to be translated") +flags.DEFINE_string("reference", None, "Path to the reference translation file") +flags.DEFINE_string("translation", None, + "Path to the MT system translation file") +flags.DEFINE_string("translations_dir", None, + "Directory with translated files to be evaulated.") +flags.DEFINE_string("event_dir", None, "Where to store the event file.") + +flags.DEFINE_string("bleu_variant", "both", + "Possible values: cased(case-sensitive), uncased, " + "both(default).") +flags.DEFINE_string("tag_suffix", "", + "What to add to BLEU_cased and BLEU_uncased tags.") +flags.DEFINE_integer("min_steps", -1, + "Don't evaluate checkpoints with less steps.") +flags.DEFINE_integer("wait_minutes", 0, + "Wait upto N minutes for a new checkpoint, cf. 
" + "save_checkpoints_secs.") +flags.DEFINE_bool("report_zero", None, + "Store BLEU=0 and guess its time based on the oldest file.") + + +def main(_): + tf.logging.set_verbosity(tf.logging.INFO) + if FLAGS.translation: + if FLAGS.translations_dir: + raise ValueError( + "Cannot specify both --translation and --translations_dir.") + if FLAGS.bleu_variant in ("uncased", "both"): + bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, FLAGS.translation, + case_sensitive=False) + print("BLEU_uncased = %6.2f" % bleu) + if FLAGS.bleu_variant in ("cased", "both"): + bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, FLAGS.translation, + case_sensitive=True) + print("BLEU_cased = %6.2f" % bleu) + return + + if not FLAGS.translations_dir: + raise ValueError( + "Either --translation or --translations_dir must be specified.") + transl_dir = os.path.expanduser(FLAGS.translations_dir) + + last_step_file = os.path.join(FLAGS.event_dir, "last_evaluated_step.txt") + if FLAGS.min_steps == -1: + if tf.gfile.Exists(last_step_file): + with open(last_step_file) as ls_file: + FLAGS.min_steps = int(ls_file.read()) + else: + FLAGS.min_steps = 0 + if FLAGS.report_zero is None: + FLAGS.report_zero = FLAGS.min_steps == 0 + + writer = tf.summary.FileWriter(FLAGS.event_dir) + for transl_file in bleu_hook.stepfiles_iterator( + transl_dir, FLAGS.wait_minutes, FLAGS.min_steps, path_suffix=""): + # report_zero handling must be inside the for-loop, + # so we are sure the transl_dir is already created. + if FLAGS.report_zero: + all_files = (os.path.join(transl_dir, f) for f in os.listdir(transl_dir)) + start_time = min( + os.path.getmtime(f) for f in all_files if os.path.isfile(f)) + values = [] + if FLAGS.bleu_variant in ("uncased", "both"): + values.append(tf.Summary.Value( + tag="BLEU_uncased" + FLAGS.tag_suffix, simple_value=0)) + if FLAGS.bleu_variant in ("cased", "both"): + values.append(tf.Summary.Value( + tag="BLEU_cased" + FLAGS.tag_suffix, simple_value=0)) + writer.add_event(tf.summary.Event(summary=tf.Summary(value=values), + wall_time=start_time, step=0)) + FLAGS.report_zero = False + + filename = transl_file.filename + tf.logging.info("Evaluating " + filename) + values = [] + if FLAGS.bleu_variant in ("uncased", "both"): + bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, filename, + case_sensitive=False) + values.append(tf.Summary.Value(tag="BLEU_uncased" + FLAGS.tag_suffix, + simple_value=bleu)) + tf.logging.info("%s: BLEU_uncased = %6.2f" % (filename, bleu)) + if FLAGS.bleu_variant in ("cased", "both"): + bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, filename, + case_sensitive=True) + values.append(tf.Summary.Value(tag="BLEU_cased" + FLAGS.tag_suffix, + simple_value=bleu)) + tf.logging.info("%s: BLEU_cased = %6.2f" % (transl_file.filename, bleu)) + writer.add_event(tf.summary.Event( + summary=tf.Summary(value=values), + wall_time=transl_file.mtime, step=transl_file.steps)) + writer.flush() + with open(last_step_file, "w") as ls_file: + ls_file.write(str(transl_file.steps) + "\n") + + +if __name__ == "__main__": + tf.app.run() diff --git a/tensor2tensor/bin/t2t_datagen.py b/tensor2tensor/bin/t2t_datagen.py index c83428bc2..451b99a3a 100644 --- a/tensor2tensor/bin/t2t_datagen.py +++ b/tensor2tensor/bin/t2t_datagen.py @@ -29,6 +29,7 @@ from __future__ import division from __future__ import print_function +import multiprocessing import os import random import tempfile @@ -66,6 +67,11 @@ "If true, we only list the problems that will be generated.") flags.DEFINE_integer("random_seed", 429459, "Random 
seed to use.") flags.DEFINE_integer("task_id", -1, "For distributed data generation.") +flags.DEFINE_integer("task_id_start", -1, "For distributed data generation.") +flags.DEFINE_integer("task_id_end", -1, "For distributed data generation.") +flags.DEFINE_integer( + "num_concurrent_processes", 10, + "Applies only to problems for which multiprocess_generate=True.") flags.DEFINE_string("t2t_usr_dir", "", "Path to a Python module that will be imported. The " "__init__.py file should include the necessary imports. " @@ -195,17 +201,35 @@ def generate_data_for_problem(problem): generator_utils.shuffle_dataset(all_output_files) +def generate_data_in_process(arg): + problem_name, data_dir, tmp_dir, task_id = arg + problem = registry.problem(problem_name) + problem.generate_data(data_dir, tmp_dir, task_id) + + def generate_data_for_registered_problem(problem_name): tf.logging.info("Generating data for %s.", problem_name) if FLAGS.num_shards: raise ValueError("--num_shards should not be set for registered Problem.") problem = registry.problem(problem_name) task_id = None if FLAGS.task_id < 0 else FLAGS.task_id - problem.generate_data( - os.path.expanduser(FLAGS.data_dir), - os.path.expanduser(FLAGS.tmp_dir), - task_id=task_id) - + data_dir = os.path.expanduser(FLAGS.data_dir) + tmp_dir = os.path.expanduser(FLAGS.tmp_dir) + if task_id is None and problem.multiprocess_generate: + if FLAGS.task_id_start != -1: + assert FLAGS.task_id_end != -1 + task_id_start = FLAGS.task_id_start + task_id_end = FLAGS.task_id_end + else: + task_id_start = 0 + task_id_end = problem.num_generate_tasks + pool = multiprocessing.Pool(processes=FLAGS.num_concurrent_processes) + problem.prepare_to_generate(data_dir, tmp_dir) + args = [(problem_name, data_dir, tmp_dir, task_id) + for task_id in range(task_id_start, task_id_end)] + pool.map(generate_data_in_process, args) + else: + problem.generate_data(data_dir, tmp_dir, task_id) if __name__ == "__main__": tf.app.run() diff --git a/tensor2tensor/bin/t2t_decoder.py b/tensor2tensor/bin/t2t_decoder.py index 25358739a..132dac0e4 100644 --- a/tensor2tensor/bin/t2t_decoder.py +++ b/tensor2tensor/bin/t2t_decoder.py @@ -36,9 +36,9 @@ # Dependency imports -from tensor2tensor.tpu import tpu_trainer -from tensor2tensor.tpu import tpu_trainer_lib +from tensor2tensor.bin import t2t_trainer from tensor2tensor.utils import decoding +from tensor2tensor.utils import trainer_lib from tensor2tensor.utils import usr_dir import tensorflow as tf @@ -46,7 +46,7 @@ flags = tf.flags FLAGS = flags.FLAGS -# Additional flags in tpu/tpu_trainer.py and utils/flags.py +# Additional flags in bin/t2t_trainer.py and utils/flags.py flags.DEFINE_string("decode_from_file", None, "Path to the source file for decoding") flags.DEFINE_string("decode_to_file", None, @@ -57,7 +57,7 @@ def create_hparams(): - return tpu_trainer_lib.create_hparams( + return trainer_lib.create_hparams( FLAGS.hparams_set, FLAGS.hparams, data_dir=os.path.expanduser(FLAGS.data_dir), @@ -95,10 +95,10 @@ def main(_): hp = create_hparams() decode_hp = create_decode_hparams() - estimator = tpu_trainer_lib.create_estimator( + estimator = trainer_lib.create_estimator( FLAGS.model, hp, - tpu_trainer.create_run_config(hp), + t2t_trainer.create_run_config(hp), decode_hparams=decode_hp, use_tpu=False) diff --git a/tensor2tensor/bin/t2t_trainer.py b/tensor2tensor/bin/t2t_trainer.py index 571a21839..a984ca9db 100644 --- a/tensor2tensor/bin/t2t_trainer.py +++ b/tensor2tensor/bin/t2t_trainer.py @@ -13,7 +13,7 @@ # See the License for the specific 
language governing permissions and # limitations under the License. -"""Train on TPU.""" +"""Train and evaluate.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function @@ -26,10 +26,10 @@ from tensor2tensor import models # pylint: disable=unused-import from tensor2tensor import problems as problems_lib # pylint: disable=unused-import -from tensor2tensor.tpu import tpu_trainer_lib from tensor2tensor.utils import decoding from tensor2tensor.utils import flags as t2t_flags # pylint: disable=unused-import from tensor2tensor.utils import registry +from tensor2tensor.utils import trainer_lib from tensor2tensor.utils import usr_dir import tensorflow as tf @@ -38,7 +38,7 @@ FLAGS = flags.FLAGS # See flags.py for additional command-line flags. -flags.DEFINE_string("t2t_usr_dir", "", +flags.DEFINE_string("t2t_usr_dir", None, "Path to a Python module that will be imported. The " "__init__.py file should include the necessary imports. " "The imported files should contain registrations, " @@ -49,6 +49,8 @@ flags.DEFINE_integer("iterations_per_loop", 1000, "Number of iterations in a TPU training loop.") flags.DEFINE_bool("use_tpu", False, "Whether to use TPU.") +flags.DEFINE_integer("tpu_infeed_sleep_secs", None, + "How long to sleep the infeed thread.") flags.DEFINE_bool("generate_data", False, "Generate data before training?") flags.DEFINE_string("tmp_dir", "/tmp/t2t_datagen", "Temporary storage directory, used if --generate_data.") @@ -77,11 +79,15 @@ def get_problem_name(): def create_hparams(): - return tpu_trainer_lib.create_hparams(FLAGS.hparams_set, FLAGS.hparams) + if FLAGS.use_tpu and "tpu" not in FLAGS.hparams_set: + tf.logging.warn("Not all hyperparameter sets work on TPU. When available " + "for a given model, prefer hparams_sets with a '_tpu' " + "suffix, e.g. 
transformer_tpu.") + return trainer_lib.create_hparams(FLAGS.hparams_set, FLAGS.hparams) def create_experiment_fn(): - return tpu_trainer_lib.create_experiment_fn( + return trainer_lib.create_experiment_fn( model_name=FLAGS.model, problem_name=get_problem_name(), data_dir=os.path.expanduser(FLAGS.data_dir), @@ -102,7 +108,7 @@ def create_experiment_fn(): def create_run_config(hp): - return tpu_trainer_lib.create_run_config( + return trainer_lib.create_run_config( model_dir=os.path.expanduser(FLAGS.output_dir), master=FLAGS.master, iterations_per_loop=FLAGS.iterations_per_loop, @@ -127,7 +133,9 @@ def create_run_config(hp): ps_gpu=FLAGS.ps_gpu, sync=FLAGS.sync, worker_id=FLAGS.worker_id, - worker_job=FLAGS.worker_job) + worker_job=FLAGS.worker_job, + random_seed=FLAGS.random_seed, + tpu_infeed_sleep_secs=FLAGS.tpu_infeed_sleep_secs) def generate_data(): @@ -161,6 +169,46 @@ def log_registry(): sys.exit(0) +def is_chief(): + schedules = ["train", "train_and_evaluate", "continuous_train_and_eval"] + return FLAGS.worker_id == 0 and FLAGS.schedule in schedules + + +def save_metadata(hparams): + """Saves FLAGS and hparams to output_dir.""" + output_dir = os.path.expanduser(FLAGS.output_dir) + if not tf.gfile.Exists(output_dir): + tf.gfile.MakeDirs(output_dir) + + # Save FLAGS in txt file + if hasattr(FLAGS, "flags_into_string"): + flags_str = FLAGS.flags_into_string() + t2t_flags_str = "\n".join([ + "--%s=%s" % (f.name, f.value) + for f in FLAGS.flags_by_module_dict()[ + "tensor2tensor.utils.flags"] + ]) + else: + flags_dict = FLAGS.__dict__["__flags"] + flags_str = "\n".join( + ["--%s=%s" % (name, str(f)) for (name, f) in flags_dict.items()]) + t2t_flags_str = None + + flags_txt = os.path.join(output_dir, "flags.txt") + with tf.gfile.Open(flags_txt, "w") as f: + f.write(flags_str) + + if t2t_flags_str: + t2t_flags_txt = os.path.join(output_dir, "flags_t2t.txt") + with tf.gfile.Open(t2t_flags_txt, "w") as f: + f.write(t2t_flags_str) + + # Save hparams as hparams.json + hparams_fname = os.path.join(output_dir, "hparams.json") + with tf.gfile.Open(hparams_fname, "w") as f: + f.write(hparams.to_json()) + + def execute_schedule(exp): if not hasattr(exp, FLAGS.schedule): raise ValueError( @@ -171,7 +219,7 @@ def execute_schedule(exp): def main(_): tf.logging.set_verbosity(tf.logging.INFO) - tpu_trainer_lib.set_random_seed(FLAGS.random_seed) + trainer_lib.set_random_seed(FLAGS.random_seed) usr_dir.import_usr_dir(FLAGS.t2t_usr_dir) log_registry() @@ -181,6 +229,9 @@ def main(_): hparams = create_hparams() run_config = create_run_config(hparams) + if is_chief(): + save_metadata(hparams) + exp_fn = create_experiment_fn() exp = exp_fn(run_config, hparams) execute_schedule(exp) diff --git a/tensor2tensor/bin/t2t_translate_all.py b/tensor2tensor/bin/t2t_translate_all.py new file mode 100644 index 000000000..9827705c3 --- /dev/null +++ b/tensor2tensor/bin/t2t_translate_all.py @@ -0,0 +1,107 @@ +# coding=utf-8 +# Copyright 2017 The Tensor2Tensor Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Translate a file with all checkpoints in a given directory. + +t2t-decoder will be executed with these parameters: +--problems +--data_dir +--output_dir with the value of --model_dir +--decode_from_file with the value of --source +--decode_hparams with properly formatted --beam_size and --alpha +--checkpoint_path automatically filled +--decode_to_file automatically filled +""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import shutil + +# Dependency imports + +from tensor2tensor.utils import bleu_hook + +import tensorflow as tf + +flags = tf.flags +FLAGS = flags.FLAGS + +# t2t-translate-all specific options +flags.DEFINE_string("decoder_command", "t2t-decoder {params}", + "Which command to execute instead t2t-decoder. " + "{params} is replaced by the parameters. Useful e.g. for " + "qsub wrapper.") +flags.DEFINE_string("model_dir", "", + "Directory to load model checkpoints from.") +flags.DEFINE_string("source", None, + "Path to the source-language file to be translated") +flags.DEFINE_string("translations_dir", "translations", + "Where to store the translated files.") +flags.DEFINE_integer("min_steps", 0, "Ignore checkpoints with less steps.") +flags.DEFINE_integer("wait_minutes", 0, + "Wait upto N minutes for a new checkpoint") + +# options derived from t2t-decoder +flags.DEFINE_integer("beam_size", 4, "Beam-search width.") +flags.DEFINE_float("alpha", 0.6, "Beam-search alpha.") +flags.DEFINE_string("model", "transformer", "see t2t-decoder") +flags.DEFINE_string("t2t_usr_dir", None, "see t2t-decoder") +flags.DEFINE_string("data_dir", None, "see t2t-decoder") +flags.DEFINE_string("problems", None, "see t2t-decoder") +flags.DEFINE_string("hparams_set", "transformer_big_single_gpu", + "see t2t-decoder") + + +def main(_): + tf.logging.set_verbosity(tf.logging.INFO) + # pylint: disable=unused-variable + model_dir = os.path.expanduser(FLAGS.model_dir) + translations_dir = os.path.expanduser(FLAGS.translations_dir) + source = os.path.expanduser(FLAGS.source) + tf.gfile.MakeDirs(translations_dir) + translated_base_file = os.path.join(translations_dir, FLAGS.problems) + + # Copy flags.txt with the original time, so t2t-bleu can report correct + # relative time. 
+ flags_path = os.path.join(translations_dir, FLAGS.problems + "-flags.txt") + if not os.path.exists(flags_path): + shutil.copy2(os.path.join(model_dir, "flags.txt"), flags_path) + + for model in bleu_hook.stepfiles_iterator(model_dir, FLAGS.wait_minutes, + FLAGS.min_steps): + tf.logging.info("Translating " + model.filename) + out_file = translated_base_file + "-" + str(model.steps) + if os.path.exists(out_file): + tf.logging.info(out_file + " already exists, so skipping it.") + else: + tf.logging.info("Translating " + out_file) + params = ( + "--t2t_usr_dir={FLAGS.t2t_usr_dir} --output_dir={model_dir} " + "--data_dir={FLAGS.data_dir} --problems={FLAGS.problems} " + "--decode_hparams=beam_size={FLAGS.beam_size},alpha={FLAGS.alpha} " + "--model={FLAGS.model} --hparams_set={FLAGS.hparams_set} " + "--checkpoint_path={model.filename} --decode_from_file={source} " + "--decode_to_file={out_file}" + ).format(**locals()) + command = FLAGS.decoder_command.format(**locals()) + tf.logging.info("Running:\n" + command) + os.system(command) + # pylint: enable=unused-variable + + +if __name__ == "__main__": + tf.app.run() diff --git a/tensor2tensor/data_generators/generator_utils.py b/tensor2tensor/data_generators/generator_utils.py index c657a503f..b73b63d74 100644 --- a/tensor2tensor/data_generators/generator_utils.py +++ b/tensor2tensor/data_generators/generator_utils.py @@ -316,8 +316,7 @@ def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size, def get_or_generate_vocab(data_dir, tmp_dir, vocab_filename, vocab_size, - sources, - _file_byte_budget=1e6): + sources, file_byte_budget=1e6): """Generate a vocabulary from the datasets in sources.""" def generate(): @@ -350,17 +349,17 @@ def generate(): # Use Tokenizer to count the word occurrences. with tf.gfile.GFile(filepath, mode="r") as source_file: - file_byte_budget = _file_byte_budget + file_byte_budget_ = file_byte_budget counter = 0 - countermax = int(source_file.size() / file_byte_budget / 2) + countermax = int(source_file.size() / file_byte_budget_ / 2) for line in source_file: if counter < countermax: counter += 1 else: - if file_byte_budget <= 0: + if file_byte_budget_ <= 0: break line = line.strip() - file_byte_budget -= len(line) + file_byte_budget_ -= len(line) counter = 0 yield line diff --git a/tensor2tensor/data_generators/problem.py b/tensor2tensor/data_generators/problem.py index 52d7bdab2..53fa48740 100644 --- a/tensor2tensor/data_generators/problem.py +++ b/tensor2tensor/data_generators/problem.py @@ -102,6 +102,7 @@ def default_model_hparams(): max_input_seq_length=0, max_target_seq_length=0, prepend_mode="none", + split_to_length=0, data_dir=None) @@ -117,6 +118,12 @@ def preprocess_example_common(example, hparams, mode): else: example["targets"] = tf.concat( [example["inputs"], [0], example["targets"]], 0) + if hparams.split_to_length: + example["targets"] = tf.reshape( + example["targets"], [-1, hparams.split_to_length, 1, 1]) + if len(example) != 1: + raise ValueError("split_to_length only works for LM problems") + return tf.data.Dataset.from_tensor_slices(example) return example @@ -195,9 +202,67 @@ class Problem(object): def generate_data(self, data_dir, tmp_dir, task_id=-1): raise NotImplementedError() + @property + def multiprocess_generate(self): + """Whether to generate the data in multiple parallel processes.""" + return False + + @property + def num_generate_tasks(self): + """Needed if multiprocess_generate is True.""" + raise NotImplementedError() + + def prepare_to_generate(self, data_dir, tmp_dir): + 
"""Prepare to generate data in parallel on different processes. + + This function is called if multiprocess_generate is True. + + Some things that might need to be done once are downloading the data + if it is not yet downloaded, and building the vocabulary. + + Args: + data_dir: a string + tmp_dir: a string + """ + raise NotImplementedError() + def hparams(self, defaults, model_hparams): pass + def max_length(self, model_hparams): + """Maximum sequence length. + + Problems with fixed length should override. + + Args: + model_hparams: model hyperparameters + Returns: + an integer + """ + return ( + model_hparams.split_to_length or + model_hparams.max_length or + model_hparams.batch_size) + + @property + def batch_size_means_tokens(self): + """Do we specify hparams.batch_size in tokens per datashard per batch. + + This is generally done for text problems. + + If False, we assume that batch sizes are specified in examples per + datashard per batch. + + TODO(noam): we should be more explicit and replace the hyperparameter + batch size with two hyperparameters: + hparams.examples_per_batch_per_datashard + hparams.tokens_per_batch_per_datashard + + Returns: + a boolean + """ + return False + def dataset_filename(self): return self.name @@ -217,6 +282,19 @@ def example_reading_spec(self): return (data_fields, data_items_to_decoders) def preprocess_example(self, example, mode, hparams): + """Runtime preprocessing. + + Return a dict or a tf.Data.Datset.from_tensor_slices (if you want each + example to turn into multiple). + + Args: + example: dict, features + mode: tf.estimator.ModeKeys + hparams: HParams, model hyperparameters + + Returns: + dict or Dataset + """ return preprocess_example_common(example, hparams, mode) def eval_metrics(self): @@ -343,6 +421,7 @@ def dataset(self, num_threads=None, output_buffer_size=None, shuffle_files=None, + repeat=None, hparams=None, preprocess=True, dataset_split=None, @@ -358,6 +437,8 @@ def dataset(self, calls. shuffle_files: whether to shuffle input files. Default behavior (i.e. when shuffle_files=None) is to shuffle if mode == TRAIN. + repeat: whether to repeat the Dataset. Default behavior is to repeat if + mode == TRAIN. hparams: tf.contrib.training.HParams; hparams to be passed to Problem.preprocess_example and Problem.hparams. If None, will use a default set that is a no-op. @@ -370,6 +451,10 @@ def dataset(self, Returns: Dataset containing dict. """ + is_training = mode == tf.estimator.ModeKeys.TRAIN + repeat = repeat or repeat is None and is_training + shuffle_files = shuffle_files or shuffle_files is None and is_training + dataset_split = dataset_split or mode assert data_dir @@ -383,32 +468,56 @@ def dataset(self, # Construct the Problem's hparams so that items within it are accessible _ = self.get_hparams(hparams) - is_training = mode == tf.estimator.ModeKeys.TRAIN data_filepattern = self.filepattern(data_dir, dataset_split, shard=shard) tf.logging.info("Reading data files from %s", data_filepattern) - data_files = tf.contrib.slim.parallel_reader.get_data_files( - data_filepattern) - if shuffle_files or shuffle_files is None and is_training: - # In addition to shuffling the list of file names, we skip a random - # fraction of the first file. The skip is essential for synchronous - # highly-parallel training. Otherwise, we have multiple replicas - # reading the same shard in lock-step. 
- num_skip = random.randint(0, _file_num_records_cached(data_files[0])) - random.shuffle(data_files) - dataset = tf.data.TFRecordDataset(data_files).skip(num_skip) + dataset = tf.data.Dataset.list_files(data_filepattern) + + if shuffle_files: + dataset = dataset.shuffle(buffer_size=1024) + + def _load_records(filename): + return tf.data.TFRecordDataset(filename, buffer_size=16 * 1000 * 1000) + + if hasattr(tf.contrib.data, "parallel_interleave"): + interleave = lambda ds, fn: ds.apply( # pylint: disable=g-long-lambda + tf.contrib.data.parallel_interleave( + fn, sloppy=is_training, cycle_length=16)) else: - dataset = tf.data.TFRecordDataset(data_files) + interleave = lambda ds, fn: ds.interleave(fn, cycle_length=16) - def _preprocess(example): - example = self.preprocess_example(example, mode, hparams) + dataset = interleave(dataset, _load_records) + + if repeat: + dataset = dataset.repeat() + + if shuffle_files: + # Skip a random fraction at the beginning of the stream. The skip is + # essential for synchronous highly-parallel training to avoid multiple + # replicas reading the same data in lock-step. + data_files = tf.contrib.slim.parallel_reader.get_data_files( + data_filepattern) + num_skip = random.randint(0, _file_num_records_cached(data_files[0])) + dataset = dataset.skip(num_skip) + + def _maybe_reverse_and_copy(example): self.maybe_reverse_features(example) self.maybe_copy_features(example) return example + def _preprocess(example): + examples = self.preprocess_example(example, mode, hparams) + if not isinstance(examples, tf.data.Dataset): + examples = tf.data.Dataset.from_tensors(examples) + return examples + dataset = dataset.map(self.decode_example, num_parallel_calls=num_threads) if preprocess: - dataset = dataset.map(_preprocess, num_parallel_calls=num_threads) + dataset = interleave(dataset, _preprocess) + + dataset = dataset.map( + _maybe_reverse_and_copy, num_parallel_calls=num_threads) + if output_buffer_size: dataset = dataset.prefetch(output_buffer_size) @@ -507,18 +616,23 @@ def input_fn(self, mode, hparams, data_dir=None, params=None, config=None, (features_dict, Tensor targets) """ is_training = mode == tf.estimator.ModeKeys.TRAIN - num_threads = 4 if is_training else 1 + if config.use_tpu: + num_threads = 32 + else: + num_threads = 4 if is_training else 1 + + max_length = self.max_length(hparams) def tpu_valid_size(example): return data_reader.example_valid_size(example, hparams.min_length, - hparams.max_length) + max_length) def gpu_valid_size(example): drop_long_sequences = is_training or hparams.eval_drop_long_sequences return data_reader.example_valid_size( example, hparams.min_length, - hparams.max_length if drop_long_sequences else 10**9) + max_length if drop_long_sequences else 10**9) def define_shapes(example): batch_size = config and config.use_tpu and params["batch_size"] @@ -540,10 +654,24 @@ def define_shapes(example): if is_training: dataset = dataset.repeat(None) + if self.batch_size_means_tokens: + batch_size_means_tokens = True + else: + if _are_shapes_fully_defined(dataset.output_shapes): + batch_size_means_tokens = False + else: + tf.logging.warning( + "Shapes are not fully defined. Assuming batch_size means tokens. " + "You should probably override batch_size_means_tokens() " + "in your problem subclass") + batch_size_means_tokens = True + # Batching - if _are_shapes_fully_defined(dataset.output_shapes): - # Static shape features (e.g. images) + if not batch_size_means_tokens: + # Batch size means examples per datashard. 
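+      # This branch is reached when all feature shapes are fully defined
+      # (e.g. image problems) and the problem does not set
+      # batch_size_means_tokens, so whole examples are batched directly with
+      # no length bucketing or padding to max_length.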
if config and config.use_tpu: + # on TPU, we use params["batch_size"], which specifies the number of + # examples across all datashards tpu_batch_size = params["batch_size"] dataset = dataset.apply( tf.contrib.data.batch_and_drop_remainder(tpu_batch_size)) @@ -551,12 +679,14 @@ def define_shapes(example): num_shards = (config and config.data_parallelism.n) or 1 dataset = dataset.batch(hparams.batch_size * num_shards) else: - # Variable length features + # batch_size means tokens per datashard if config and config.use_tpu: - # On TPU, pad to hparams.max_length + # On TPU, pad to max_length dataset = dataset.filter(tpu_valid_size) padded_shapes = _fill_shape_nones( - dataset.output_shapes, none_filler=hparams.max_length) + dataset.output_shapes, none_filler=max_length) + # on TPU, we use params["batch_size"], which specifies the number of + # examples across all datashards dataset = dataset.apply( tf.contrib.data.padded_batch_and_drop_remainder( params["batch_size"], padded_shapes)) @@ -568,6 +698,7 @@ def define_shapes(example): shard_multiplier=(config and config.data_parallelism.n) or 1, length_multiplier=self.get_hparams().batch_size_multiplier) if hparams.use_fixed_batch_size: + # Here batch_size really means examples per datashard. batching_scheme["batch_sizes"] = [hparams.batch_size] batching_scheme["boundaries"] = [] dataset = data_reader.bucket_by_sequence_length( @@ -590,7 +721,7 @@ def _pad_batch(features): dataset = dataset.map(_pad_batch, num_parallel_calls=num_threads) dataset = dataset.map(define_shapes, num_parallel_calls=num_threads) - dataset = dataset.prefetch(1) + dataset = dataset.prefetch(2) features = dataset.make_one_shot_iterator().get_next() if not config or not config.use_tpu: _summarize_features(features, (config and config.data_parallelism.n) or 1) @@ -738,6 +869,10 @@ def is_character_level(self): def targeted_vocab_size(self): raise NotImplementedError() # Not needed if self.is_character_level. + @property + def batch_size_means_tokens(self): + return True + def generator(self, data_dir, tmp_dir, is_training): """Generator for the training and evaluation data. @@ -764,6 +899,12 @@ def packed_length(self): """ return None + def max_length(self, model_hparams): + """Maximum sequence length.""" + if self.packed_length: + return self.packed_length + return super(Text2TextProblem, self).max_length(model_hparams) + @property def use_train_shards_for_dev(self): """If true, we only generate training data and hold out shards for dev.""" @@ -891,6 +1032,261 @@ def eval_metrics(self): ] +class ChoppedTextProblem(Text2TextProblem): + """Tokenize and chop text files into fixed-length language-modeling examples. + + The input data is a set of text files, as specified by + self.train_text_filepaths() and self.dev_text_filepaths(). + + The text is tokenized using a SubwordTextEncoder, and + then split into examples, each of length self.sequence_length(). + """ + + def train_text_filepaths(self, tmp_dir): + """Local filepaths of text files containing training data. + + This function may want to download the files if they do not exist. + + Args: + tmp_dir: a string + Returns: + a list of strings. + """ + raise NotImplementedError() + + def dev_text_filepaths(self, tmp_dir): + """Local filepaths of text files containing dev data. + + This function may want to download the files if they do not exist. + + Args: + tmp_dir: a string + Returns: + a list of strings. 
+ """ + raise NotImplementedError() + + @property + def sequence_length(self): + """Length of each example (in tokens).""" + raise NotImplementedError() + + def max_length(self, model_hparams): + return model_hparams.split_to_length or self.sequence_length + + @property + def is_character_level(self): + return False + + def text_filepaths_for_task(self, tmp_dir, task_id): + """List of input filepaths for a particular training or dev shard. + + Args: + tmp_dir: a string + task_id: an integer less than self.num_shards + Returns: + a list of tuples (filepath, start_pos, num_bytes) + """ + assert task_id >= 0 + assert task_id < self.num_train_shards + self.num_dev_shards + if task_id < self.num_train_shards: + return [f for i, f in enumerate(self.train_text_filepaths(tmp_dir)) + if i % self.num_train_shards == task_id] + else: + return [f for i, f in enumerate(self.dev_text_filepaths(tmp_dir)) + if i % self.num_dev_shards == task_id - self.num_train_shards] + + def filepath_to_unicode_strings(self, filepath): + """Read text out of an input file. + + The default just reads the text, converts to unicode and yields one + unicode string. + + Subclasses can override this function in order to preprocess, and can + yield any number of strings. + + Args: + filepath: a string + Yields: + unicode strings. + """ + f = tf.gfile.Open(filepath) + b = f.read() + yield to_unicode_ignore_erros(b) + + def file_generator(self, + filepaths, + max_chars_per_file=None, + max_chars_total=None): + """Read complete text of input files and yield unicode strings. + + By default, one unicode string is produced per file, but this is + not guaranteed, since subclasses can override + filepath_to_unicode_strings(). + + max_chars_per_file and max_chars_total can also be specified, in which + case some strings may be truncated or dropped to limit the total + amount of output. + + Args: + filepaths: a list of strings + max_chars_per_file: an optional integer + max_chars_total: an optional integer + Yields: + unicode strings + """ + chars_total = 0 + for fname in filepaths: + chars_this_file = 0 + tf.logging.info("reading file %s" % fname) + for text in self.filepath_to_unicode_strings(fname): + if (max_chars_per_file and chars_this_file + len(text) + > max_chars_per_file): + text = text[:max_chars_per_file - chars_this_file] + if max_chars_total and chars_total + len(text) > max_chars_total: + text = text[:max_chars_total - chars_total] + chars_total += len(text) + chars_this_file += len(text) + if text: + yield text + if max_chars_total and chars_total >= max_chars_total: + return + if max_chars_per_file and chars_this_file >= max_chars_per_file: + break + + def example_generator(self, encoder, tmp_dir, task_id): + """Generator for examples. + + Args: + encoder: a TextEncoder + tmp_dir: a string + task_id: an integer + Yields: + feature dictionaries + """ + filepaths = self.text_filepaths_for_task(tmp_dir, task_id) + if task_id >= self.num_train_shards: + # this is dev data - limit the total length. 
+ max_chars_per_file = self.max_dev_chars // ( + self.num_dev_shards * len(filepaths)) + else: + max_chars_per_file = None + tokens = [] + for ftext in self.file_generator( + filepaths, max_chars_per_file=max_chars_per_file): + tokens.extend(encoder.encode(ftext)) + pos = 0 + while pos + self.sequence_length <= len(tokens): + yield {"inputs": [0], "targets": tokens[pos:pos + self.sequence_length]} + pos += self.sequence_length + if pos > 0: + tokens = tokens[pos:] + if self.remainder_policy == "pad": + if tokens: + targets = tokens + [0] * (self.sequence_length - len(tokens)) + yield {"inputs": [0], "targets": targets} + else: + assert self.remainder_policy == "drop" + + @property + def remainder_policy(self): + """What to do with leftover tokens. + + Returns: + a string - either "pad" or "drop". + """ + return "pad" + + def prepare_to_generate(self, data_dir, tmp_dir): + """Make sure that the data is prepared and the vocab is generated.""" + self.get_or_generate_vocab(data_dir, tmp_dir) + self.train_text_filepaths(tmp_dir) + self.dev_text_filepaths(tmp_dir) + + def get_or_generate_vocab(self, data_dir, tmp_dir): + return generator_utils.get_or_generate_vocab_inner( + data_dir, self.vocab_file, self.targeted_vocab_size, + self.file_generator( + self.train_text_filepaths(tmp_dir), + max_chars_total=self.max_chars_for_vocab)) + + def generate_data(self, data_dir, tmp_dir, task_id=-1): + """Generates training/dev data. + + Args: + data_dir: a string + tmp_dir: a string + task_id: an optional integer + Returns: + shard or shards for which data was generated. + """ + tf.logging.info("generate_data task_id=%s" % task_id) + encoder = self.get_or_generate_vocab(data_dir, tmp_dir) + assert task_id >= 0 and task_id < self.num_generate_tasks + if task_id < self.num_train_shards: + out_file = self.training_filepaths( + data_dir, self.num_train_shards, shuffled=False)[task_id] + else: + out_file = self.dev_filepaths( + data_dir, self.num_dev_shards, + shuffled=False)[task_id - self.num_train_shards] + generator_utils.generate_files( + self.example_generator(encoder, tmp_dir, task_id), [out_file]) + generator_utils.shuffle_dataset([out_file]) + + @property + def max_chars_for_vocab(self): + """Number of characters of training data to use for generating vocab.""" + return 10 ** 7 + + @property + def target_space_id(self): + return SpaceID.EN_TOK + + @property + def num_train_shards(self): + return 100 + + @property + def num_dev_shards(self): + return 1 + + @property + def max_dev_chars(self): + """Limit dev set to at most this many characters (default 10M).""" + return 10 ** 7 + + @property + def multiprocess_generate(self): + return True + + @property + def num_generate_tasks(self): + return self.num_train_shards + self.num_dev_shards + + @property + def vocab_name(self): + raise NotImplementedError() + + @property + def use_subword_tokenizer(self): + return True + + @property + def has_inputs(self): + return False + + def eval_metrics(self): + return [ + metrics.Metrics.ACC, metrics.Metrics.NEG_LOG_PERPLEXITY + ] + + +def to_unicode_ignore_erros(s): + return (unicode(s, "utf-8", errors="ignore") if six.PY2 else + s.decode("utf-8", "ignore")) + + def _are_shapes_fully_defined(shapes_dict): for shape in shapes_dict.values(): if not shape.is_fully_defined(): diff --git a/tensor2tensor/data_generators/text_encoder.py b/tensor2tensor/data_generators/text_encoder.py index 6930b205e..d43236945 100644 --- a/tensor2tensor/data_generators/text_encoder.py +++ b/tensor2tensor/data_generators/text_encoder.py @@ 
-436,6 +436,23 @@ def encode(self, raw_text): return self._tokens_to_subtoken_ids( tokenizer.encode(native_to_unicode(raw_text))) + def encode_without_tokenizing(self, token_text): + """Converts string to list of subtoken ids without calling tokenizer. + + This treats `token_text` as a single token and directly converts it + to subtoken ids. This may be useful when the default tokenizer doesn't + do what we want (e.g., when encoding text with tokens composed of lots of + nonalphanumeric characters). It is then up to the caller to make sure that + raw text is consistently converted into tokens. Only use this if you are + sure that `encode` doesn't suit your needs. + + Args: + token_text: A native string representation of a single token. + Returns: + A list of subword token ids; i.e., integers in the range [0, vocab_size). + """ + return self._tokens_to_subtoken_ids([native_to_unicode(token_text)]) + def decode(self, subtokens): """Converts a sequence of subtoken ids to a native string. @@ -559,6 +576,8 @@ def build_to_target_size(cls, token_counts, min_val, max_val, + max_subtoken_length=None, + reserved_tokens=None, num_iterations=4): """Builds a SubwordTextEncoder that has `vocab_size` near `target_size`. @@ -570,6 +589,13 @@ def build_to_target_size(cls, token_counts: A dictionary of token counts, mapping string to int. min_val: An integer; lower bound for the minimum token count. max_val: An integer; upper bound for the minimum token count. + max_subtoken_length: Maximum length of a subtoken. If this is not set, + then the runtime and memory use of creating the vocab is quadratic in + the length of the longest token. If this is set, then it is instead + O(max_subtoken_length * length of longest token). + reserved_tokens: List of reserved tokens. The global variable + `RESERVED_TOKENS` must be a prefix of `reserved_tokens`. If this + argument is `None`, it will use `RESERVED_TOKENS`. num_iterations: An integer; how many iterations of refinement. Returns: @@ -584,13 +610,18 @@ def build_to_target_size(cls, if target_size < 1: raise ValueError("Target size must be positive.") + if reserved_tokens is None: + reserved_tokens = RESERVED_TOKENS + def bisect(min_val, max_val): """Bisection to find the right size.""" present_count = (max_val + min_val) // 2 tf.logging.info("Trying min_count %d" % present_count) subtokenizer = cls() - subtokenizer.build_from_token_counts(token_counts, present_count, - num_iterations) + subtokenizer.build_from_token_counts( + token_counts, present_count, num_iterations, + max_subtoken_length=max_subtoken_length, + reserved_tokens=reserved_tokens) # Being within 1% of the target size is ok. is_ok = abs(subtokenizer.vocab_size - target_size) * 100 < target_size @@ -617,36 +648,47 @@ def build_from_token_counts(self, token_counts, min_count, num_iterations=4, - num_reserved_ids=NUM_RESERVED_TOKENS): + reserved_tokens=None, + max_subtoken_length=None): """Train a SubwordTextEncoder based on a dictionary of word counts. Args: token_counts: a dictionary of Unicode strings to int. min_count: an integer - discard subtokens with lower counts. num_iterations: an integer. how many iterations of refinement. - num_reserved_ids: an integer. how many ids to reserve for special tokens. + reserved_tokens: List of reserved tokens. The global variable + `RESERVED_TOKENS` must be a prefix of `reserved_tokens`. If this + argument is `None`, it will use `RESERVED_TOKENS`. + max_subtoken_length: Maximum length of a subtoken. 
If this is not set, + then the runtime and memory use of creating the vocab is quadratic in + the length of the longest token. If this is set, then it is instead + O(max_subtoken_length * length of longest token). Raises: ValueError: if reserved is not 0 or len(RESERVED_TOKENS). In this case, it is not clear what the space is being reserved for, or when it will be filled in. """ + if reserved_tokens is None: + reserved_tokens = RESERVED_TOKENS + else: + # There is not complete freedom in replacing RESERVED_TOKENS. + for default, proposed in zip(RESERVED_TOKENS, reserved_tokens): + if default != proposed: + raise ValueError("RESERVED_TOKENS must be a prefix of " + "reserved_tokens.") + # Initialize the alphabet. Note, this must include reserved tokens or it can # result in encoding failures. - if num_reserved_ids == NUM_RESERVED_TOKENS: - alphabet_tokens = chain(six.iterkeys(token_counts), - [native_to_unicode(t) for t in RESERVED_TOKENS]) - elif num_reserved_ids == 0: - alphabet_tokens = six.iterkeys(token_counts) - else: - raise ValueError("Unexpected value for reserved. What is being reserved?") + alphabet_tokens = chain(six.iterkeys(token_counts), + [native_to_unicode(t) for t in reserved_tokens]) self._init_alphabet_from_tokens(alphabet_tokens) # Bootstrap the initial list of subtokens with the characters from the # alphabet plus the escaping characters. - self._init_subtokens_from_list( - list(self._alphabet), reserved=num_reserved_ids) + self._init_subtokens_from_list(list(self._alphabet), + reserved_tokens=reserved_tokens) # We build iteratively. On each iteration, we segment all the words, # then count the resulting potential subtokens, keeping the ones @@ -664,7 +706,11 @@ def build_from_token_counts(self, subtokens = self._escaped_token_to_subtoken_strings(escaped_token) start = 0 for subtoken in subtokens: - for end in xrange(start + 1, len(escaped_token) + 1): + last_position = len(escaped_token) + 1 + if max_subtoken_length is not None: + last_position = min(last_position, start + max_subtoken_length) + + for end in xrange(start + 1, last_position): new_subtoken = escaped_token[start:end] subtoken_counts[new_subtoken] += count start += len(subtoken) @@ -700,13 +746,9 @@ def build_from_token_counts(self, # Reinitialize to the candidate vocabulary. new_subtoken_strings = [subtoken for _, subtoken in new_subtoken_strings] - if num_reserved_ids == len(RESERVED_TOKENS): - new_subtoken_strings = RESERVED_TOKENS + new_subtoken_strings - elif num_reserved_ids == 0: - pass - else: - raise ValueError("num_reserved_ids must be 0 or %d but was %d" % - NUM_RESERVED_TOKENS, num_reserved_ids) + if reserved_tokens: + new_subtoken_strings = reserved_tokens + new_subtoken_strings + self._init_subtokens_from_list(new_subtoken_strings) tf.logging.info("vocab_size = %d" % self.vocab_size) @@ -721,32 +763,33 @@ def dump(self): print(u", ".join(u"{0} : '{1}'".format(i, s) for i, s in sorted(subtoken_strings))) - def _init_subtokens_from_list(self, subtoken_strings, reserved=0): + def _init_subtokens_from_list(self, subtoken_strings, reserved_tokens=None): """Initialize token information from a list of subtoken strings. Args: subtoken_strings: a list of subtokens - reserved: number of spaces to save at the beginning for reserved tokens + reserved_tokens: List of reserved tokens. We must have `reserved_tokens` + as None or the empty list, or else the global variable `RESERVED_TOKENS` + must be a prefix of `reserved_tokens`. Raises: ValueError: if reserved is not 0 or len(RESERVED_TOKENS). 
In this case, it is not clear what the space is being reserved for, or when it will be filled in. """ - if reserved == 0: - self._all_subtoken_strings = subtoken_strings - elif reserved == len(RESERVED_TOKENS): - self._all_subtoken_strings = RESERVED_TOKENS + subtoken_strings + if reserved_tokens is None: + reserved_tokens = [] + + if reserved_tokens: + self._all_subtoken_strings = reserved_tokens + subtoken_strings else: - # TODO(dtarlow): or should we fall back to the previous behavior and - # insert copies of "" for each reserved count? - raise ValueError("Unexpected value for reserved. What is being reserved?") + self._all_subtoken_strings = subtoken_strings # we remember the maximum length of any subtoken to avoid having to # check arbitrarily long strings. self._max_subtoken_len = max([len(s) for s in subtoken_strings]) self._subtoken_string_to_id = { - s: i + reserved + s: i + len(reserved_tokens) for i, s in enumerate(subtoken_strings) if s } # Initialize the cache to empty. @@ -817,8 +860,13 @@ def encode(self, s): Returns: ids: list of integers """ - # TODO(lukaszkaiser): implement this. - raise NotImplementedError + try: + import matplotlib.image as im # pylint: disable=g-import-not-at-top + except ImportError as e: + tf.logging.warning( + "Reading an image requires matplotlib to be installed: %s", e) + raise NotImplementedError("Image reading not implemented.") + return im.imread(s) def decode(self, ids): """Transform a sequence of int ids into an image file. diff --git a/tensor2tensor/data_generators/text_encoder_test.py b/tensor2tensor/data_generators/text_encoder_test.py index 8364afafd..273810684 100644 --- a/tensor2tensor/data_generators/text_encoder_test.py +++ b/tensor2tensor/data_generators/text_encoder_test.py @@ -23,11 +23,14 @@ import collections import io import os +import random import shutil +import string # Dependency imports import mock import six +from six.moves import xrange # pylint: disable=redefined-builtin from tensor2tensor.data_generators import text_encoder import tensorflow as tf @@ -120,7 +123,7 @@ def test_encode_decode(self): "to build a vocabulary. It will be used when strings are encoded " "with a TextEncoder subclass. The encoder was coded by a coder.") token_counts = collections.Counter(corpus.split(" ")) - alphabet = set(corpus) ^ {" "} + alphabet = set(corpus) - {" "} original = "This is a coded sentence encoded by the SubwordTextEncoder." token_counts.update(original.split(" ")) @@ -161,7 +164,7 @@ def test_unicode(self): def test_small_vocab(self): corpus = "The quick brown fox jumps over the lazy dog" token_counts = collections.Counter(corpus.split(" ")) - alphabet = set(corpus) ^ {" "} + alphabet = set(corpus) - {" "} encoder = text_encoder.SubwordTextEncoder.build_to_target_size( 10, token_counts, 2, 10) @@ -173,6 +176,63 @@ def test_small_vocab(self): for a in alphabet: self.assertIn(a, encoder.all_subtoken_strings) + def test_long_tokens(self): + """Subword tokenization should still run efficiently with long tokens. + + To make it run efficiently, we need to use the `max_subtoken_length` + argument when calling SubwordTextEncoder.build_to_target_size. + """ + token_length = 4000 + num_tokens = 50 + target_vocab_size = 600 + max_subtoken_length = 10 # Set this to `None` to get problems. + max_count = 500 + + # Generate some long random strings. 
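+    # Each token is token_length (4000) random uppercase characters. Without
+    # max_subtoken_length, build_to_target_size enumerates candidate subtokens
+    # quadratically in the length of the longest token, which is the blow-up
+    # this test guards against.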
+ random.seed(0) + long_tokens = [] + for _ in range(num_tokens): + long_token = "".join([random.choice(string.ascii_uppercase) + for _ in xrange(token_length)]) + long_tokens.append(long_token) + + corpus = " ".join(long_tokens) + token_counts = collections.Counter(corpus.split(" ")) + alphabet = set(corpus) - {" "} + + encoder = text_encoder.SubwordTextEncoder.build_to_target_size( + target_vocab_size, token_counts, 1, max_count, num_iterations=1, + max_subtoken_length=max_subtoken_length) + + # All vocabulary elements are in the alphabet and subtoken strings even + # if we requested a smaller vocabulary to assure all expected strings + # are encodable. + self.assertTrue(alphabet.issubset(encoder._alphabet)) + for a in alphabet: + self.assertIn(a, encoder.all_subtoken_strings) + + def test_custom_reserved_tokens(self): + """Test that we can pass custom reserved tokens to SubwordTextEncoder.""" + corpus = "The quick brown fox jumps over the lazy dog" + token_counts = collections.Counter(corpus.split(" ")) + + start_symbol = "" + end_symbol = "" + reserved_tokens = text_encoder.RESERVED_TOKENS + [start_symbol, + end_symbol] + encoder = text_encoder.SubwordTextEncoder.build_to_target_size( + 10, token_counts, 2, 10, reserved_tokens=reserved_tokens) + + # Make sure that reserved tokens appear in the right places. + start_id = encoder._subtoken_string_to_id[start_symbol] + end_id = encoder._subtoken_string_to_id[end_symbol] + self.assertEqual(start_id, 2) + self.assertEqual(end_id, 3) + + # Make sure that we haven't messed up the ability to reconstruct. + reconstructed_corpus = encoder.decode(encoder.encode(corpus)) + self.assertEqual(corpus, reconstructed_corpus) + def test_encodable_when_not_in_alphabet(self): corpus = "the quick brown fox jumps over the lazy dog" token_counts = collections.Counter(corpus.split(" ")) diff --git a/tensor2tensor/data_generators/translate_enzh.py b/tensor2tensor/data_generators/translate_enzh.py index d3ddd8d98..0a645b3bb 100644 --- a/tensor2tensor/data_generators/translate_enzh.py +++ b/tensor2tensor/data_generators/translate_enzh.py @@ -46,9 +46,10 @@ # News Commentary, around 220k lines # This dataset is only a small fraction of full WMT17 task _NC_TRAIN_DATASETS = [[ - "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz", - ["training/news-commentary-v12.zh-en.en", - "training/news-commentary-v12.zh-en.zh"]]] + "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12" + ".tgz", + ["training/news-commentary-v12.zh-en.en", + "training/news-commentary-v12.zh-en.zh"]]] # Test set from News Commentary. 2000 lines _NC_TEST_DATASETS = [[ @@ -63,117 +64,96 @@ # NOTE: You need to register to download dataset from official source # place into tmp directory e.g. /tmp/t2t_datagen/dataset.tgz _UN_TRAIN_DATASETS = [[ - "https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/UNv1.0.en-zh.tar.gz", - ["en-zh/UNv1.0.en-zh.en", - "en-zh/UNv1.0.en-zh.zh"]]] + "https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/UNv1.0.en-zh.tar" + ".gz", + ["en-zh/UNv1.0.en-zh.en", "en-zh/UNv1.0.en-zh.zh"]]] # CWMT corpus # Visit source website to download manually: -# http://nlp.nju.edu.cn/cwmt-wmt/ +# http://nlp.nju.edu.cn/cwmt-wmt/ # # casia2015: 1,050,000 lines # casict2015: 2,036,833 lines # datum2015: 1,000,003 lines # datum2017: 1,999,968 lines -# NEU2017: 2,000,000 lines +# NEU2017: 2,000,000 lines # # NOTE: You need to register to download dataset from official source # place into tmp directory e.g. 
/tmp/t2t_datagen/dataset.tgz _CWMT_TRAIN_DATASETS = [ ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/casia2015/casia2015_en.txt", - "cwmt/casia2015/casia2015_ch.txt"]], + ["cwmt/casia2015/casia2015_en.txt", "cwmt/casia2015/casia2015_ch.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/casict2015/casict2015_en.txt", - "cwmt/casict2015/casict2015_ch.txt"]], + ["cwmt/casict2015/casict2015_en.txt", + "cwmt/casict2015/casict2015_ch.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/neu2017/NEU_en.txt", - "cwmt/neu2017/NEU_cn.txt"]], + ["cwmt/neu2017/NEU_en.txt", "cwmt/neu2017/NEU_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2015/datum_en.txt", - "cwmt/datum2015/datum_ch.txt"]], + ["cwmt/datum2015/datum_en.txt", "cwmt/datum2015/datum_ch.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book1_en.txt", - "cwmt/datum2017/Book1_cn.txt"]], + ["cwmt/datum2017/Book1_en.txt", "cwmt/datum2017/Book1_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book2_en.txt", - "cwmt/datum2017/Book2_cn.txt"]], + ["cwmt/datum2017/Book2_en.txt", "cwmt/datum2017/Book2_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book3_en.txt", - "cwmt/datum2017/Book3_cn.txt"]], + ["cwmt/datum2017/Book3_en.txt", "cwmt/datum2017/Book3_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book4_en.txt", - "cwmt/datum2017/Book4_cn.txt"]], + ["cwmt/datum2017/Book4_en.txt", "cwmt/datum2017/Book4_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book5_en.txt", - "cwmt/datum2017/Book5_cn.txt"]], + ["cwmt/datum2017/Book5_en.txt", "cwmt/datum2017/Book5_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book6_en.txt", - "cwmt/datum2017/Book6_cn.txt"]], + ["cwmt/datum2017/Book6_en.txt", "cwmt/datum2017/Book6_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book7_en.txt", - "cwmt/datum2017/Book7_cn.txt"]], + ["cwmt/datum2017/Book7_en.txt", "cwmt/datum2017/Book7_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book8_en.txt", - "cwmt/datum2017/Book8_cn.txt"]], + ["cwmt/datum2017/Book8_en.txt", "cwmt/datum2017/Book8_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book9_en.txt", - "cwmt/datum2017/Book9_cn.txt"]], + ["cwmt/datum2017/Book9_en.txt", "cwmt/datum2017/Book9_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book10_en.txt", - "cwmt/datum2017/Book10_cn.txt"]], + ["cwmt/datum2017/Book10_en.txt", "cwmt/datum2017/Book10_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book11_en.txt", - "cwmt/datum2017/Book11_cn.txt"]], + ["cwmt/datum2017/Book11_en.txt", "cwmt/datum2017/Book11_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book12_en.txt", - "cwmt/datum2017/Book12_cn.txt"]], + ["cwmt/datum2017/Book12_en.txt", "cwmt/datum2017/Book12_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book13_en.txt", - "cwmt/datum2017/Book13_cn.txt"]], + 
["cwmt/datum2017/Book13_en.txt", "cwmt/datum2017/Book13_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book14_en.txt", - "cwmt/datum2017/Book14_cn.txt"]], + ["cwmt/datum2017/Book14_en.txt", "cwmt/datum2017/Book14_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book15_en.txt", - "cwmt/datum2017/Book15_cn.txt"]], + ["cwmt/datum2017/Book15_en.txt", "cwmt/datum2017/Book15_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book16_en.txt", - "cwmt/datum2017/Book16_cn.txt"]], + ["cwmt/datum2017/Book16_en.txt", "cwmt/datum2017/Book16_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book17_en.txt", - "cwmt/datum2017/Book17_cn.txt"]], + ["cwmt/datum2017/Book17_en.txt", "cwmt/datum2017/Book17_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book18_en.txt", - "cwmt/datum2017/Book18_cn.txt"]], + ["cwmt/datum2017/Book18_en.txt", "cwmt/datum2017/Book18_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book19_en.txt", - "cwmt/datum2017/Book19_cn.txt"]], + ["cwmt/datum2017/Book19_en.txt", "cwmt/datum2017/Book19_cn.txt"]], ["https://s3-us-west-2.amazonaws.com/twairball.wmt17.zh-en/cwmt.tgz", - ["cwmt/datum2017/Book20_en.txt", - "cwmt/datum2017/Book20_cn.txt"]] + ["cwmt/datum2017/Book20_en.txt", "cwmt/datum2017/Book20_cn.txt"]] ] def get_filename(dataset): - return dataset[0][0].split('/')[-1] + return dataset[0][0].split("/")[-1] + @registry.register_problem class TranslateEnzhWmt32k(translate.TranslateProblem): """Problem spec for WMT En-Zh translation. - Attempts to use full training dataset, which needs website + + Attempts to use full training dataset, which needs website registration and downloaded manually from official sources: - CWMT: + CWMT: - http://nlp.nju.edu.cn/cwmt-wmt/ - - Website contrains instructions for FTP server access. - - You'll need to download CASIA, CASICT, DATUM2015, DATUM2017, + - Website contrains instructions for FTP server access. + - You'll need to download CASIA, CASICT, DATUM2015, DATUM2017, NEU datasets - UN Parallel Corpus: + UN Parallel Corpus: - https://conferences.unite.un.org/UNCorpus - - You'll need to register your to download the dataset. + - You'll need to register your to download the dataset. NOTE: place into tmp directory e.g. /tmp/t2t_datagen/dataset.tgz """ @@ -189,32 +169,40 @@ def source_vocab_name(self): @property def target_vocab_name(self): return "vocab.enzh-zh.%d" % self.targeted_vocab_size - + def get_training_dataset(self, tmp_dir): """UN Parallel Corpus and CWMT Corpus need to be downloaded manually. + Append to training dataset if available + + Args: + tmp_dir: path to temporary dir with the data in it. 
+ + Returns: + paths """ full_dataset = _NC_TRAIN_DATASETS for dataset in [_CWMT_TRAIN_DATASETS, _UN_TRAIN_DATASETS]: filename = get_filename(dataset) tmp_filepath = os.path.join(tmp_dir, filename) if tf.gfile.Exists(tmp_filepath): - full_dataset = full_dataset + dataset + full_dataset += dataset else: - tf.logging.info("[TranslateEzhWmt] dataset incomplete, you need to manually download %s" % filename) + tf.logging.info("[TranslateEzhWmt] dataset incomplete, you need to " + "manually download %s" % filename) return full_dataset def generator(self, data_dir, tmp_dir, train): - TRAIN_DATASET = self.get_training_dataset(tmp_dir) - datasets = TRAIN_DATASET if train else _NC_TEST_DATASETS - source_datasets = [[item[0], [item[1][0]]] for item in TRAIN_DATASET] - target_datasets = [[item[0], [item[1][1]]] for item in TRAIN_DATASET] + train_dataset = self.get_training_dataset(tmp_dir) + datasets = train_dataset if train else _NC_TEST_DATASETS + source_datasets = [[item[0], [item[1][0]]] for item in train_dataset] + target_datasets = [[item[0], [item[1][1]]] for item in train_dataset] source_vocab = generator_utils.get_or_generate_vocab( data_dir, tmp_dir, self.source_vocab_name, self.targeted_vocab_size, - source_datasets, _file_byte_budget=1e8) + source_datasets, file_byte_budget=1e8) target_vocab = generator_utils.get_or_generate_vocab( data_dir, tmp_dir, self.target_vocab_name, self.targeted_vocab_size, - target_datasets, _file_byte_budget=1e8) + target_datasets, file_byte_budget=1e8) tag = "train" if train else "dev" filename_base = "wmt_enzh_%sk_tok_%s" % (self.targeted_vocab_size, tag) data_path = translate.compile_data(tmp_dir, datasets, filename_base) @@ -244,6 +232,7 @@ def feature_encoders(self, data_dir): @registry.register_problem class TranslateEnzhWmt8k(TranslateEnzhWmt32k): """Problem spec for WMT En-Zh translation. + This is far from being the real WMT17 task - only toyset here """ @@ -254,7 +243,7 @@ def targeted_vocab_size(self): @property def num_shards(self): return 10 # This is a small dataset. - + def get_training_dataset(self, tmp_dir): - """Uses only News Commentary Dataset for training""" + """Uses only News Commentary Dataset for training.""" return _NC_TRAIN_DATASETS diff --git a/tensor2tensor/data_generators/wiki.py b/tensor2tensor/data_generators/wiki.py index a1380c27f..828ef2246 100644 --- a/tensor2tensor/data_generators/wiki.py +++ b/tensor2tensor/data_generators/wiki.py @@ -20,122 +20,108 @@ from __future__ import print_function import os +import subprocess # Dependency imports -import bz2file - import numpy as np -import six from tensor2tensor.data_generators import generator_utils from tensor2tensor.data_generators import problem -from tensor2tensor.data_generators import text_encoder -from tensor2tensor.utils import metrics from tensor2tensor.utils import registry import tensorflow as tf -# End-of-sentence marker. -EOS = text_encoder.EOS_ID - -def _maybe_download_corpus(tmp_dir): - """Download corpus if necessary. - - Args: - tmp_dir: directory containing dataset. +@registry.register_problem +class LanguagemodelWikiXmlV8kL1k(problem.ChoppedTextProblem): + """A language model on English Wikipedia. - Returns: - filepath of the downloaded corpus file. + XML dump is chopped arbitrarily into sequences of length 1024 tokens, + without regard to article boundaries. 
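+
+  The subword vocabulary has 2**13 = 8192 tokens (see targeted_vocab_size).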
""" - corpus_url = ("https://dumps.wikimedia.org/enwiki/20170620/" - "enwiki-20170620-pages-articles-multistream.xml.bz2") - corpus_filename = os.path.basename(corpus_url) - corpus_filepath = os.path.join(tmp_dir, corpus_filename) - if not tf.gfile.Exists(corpus_filepath): - generator_utils.maybe_download(tmp_dir, corpus_filename, corpus_url) - return corpus_filepath - - -def page_generator(tmp_dir, max_docs=None): - doc = u"" - count = 0 - corpus_filepath = _maybe_download_corpus(tmp_dir) - for line in bz2file.BZ2File(corpus_filepath, "r", buffering=1000000): - line = unicode(line, "utf-8") if six.PY2 else line.decode("utf-8") - if not doc and line != u" \n": - continue - doc += line - if line == u" \n": - yield doc - doc = u"" - count += 1 - if max_docs and count >= max_docs: - break - - -def _page_title(page): - start_pos = page.find(u"") - end_pos = page.find(u"") - assert start_pos != -1 - assert end_pos != -1 - start_pos += len(u"") - return page[start_pos:end_pos] - -@registry.register_problem -class LanguagemodelWikiFull32k(problem.Text2TextProblem): - """A language model on full English Wikipedia.""" + def maybe_prepare_text(self, tmp_dir): + """Download corpus if necessary, decompress, split into multiple text files. + + Args: + tmp_dir: directory containing dataset. + + Returns: + list of filepaths for local text files. + """ + compressed_filename = os.path.basename(self.corpus_url) + compressed_filepath = os.path.join(tmp_dir, compressed_filename) + decompressed_filepath = compressed_filepath[:-4] + split_file_prefix = decompressed_filepath + "-part-" + split_filepattern = split_file_prefix + "?????" + split_files = sorted(tf.gfile.Glob(split_filepattern)) + if not split_files: + if not tf.gfile.Exists(decompressed_filepath): + if not tf.gfile.Exists(compressed_filepath): + generator_utils.maybe_download( + tmp_dir, compressed_filepath, self.corpus_url) + assert not subprocess.call(["bunzip2", compressed_filepath]) + assert tf.gfile.Exists(decompressed_filepath) + assert not subprocess.call([ + "split", "--line-bytes=4M", "--suffix-length=5", + "--numeric-suffixes", decompressed_filepath, split_file_prefix]) + split_files = sorted(tf.gfile.Glob(split_filepattern)) + assert split_files + return split_files + + def train_text_filepaths(self, tmp_dir): + all_files = self.maybe_prepare_text(tmp_dir) + return [f for i, f in enumerate(all_files) if i % self.dev_fraction != 0] + + def dev_text_filepaths(self, tmp_dir): + all_files = self.maybe_prepare_text(tmp_dir) + return [f for i, f in enumerate(all_files) if i % self.dev_fraction == 0] @property - def is_character_level(self): - return False + def dev_fraction(self): + return 5000 @property - def has_inputs(self): - return True + def corpus_url(self): + return ("https://archive.org/download/enwiki-20171201/" + "enwiki-20171201-pages-articles.xml.bz2") @property - def input_space_id(self): - return problem.SpaceID.EN_TOK + def vocab_name(self): + return "vocab.wiki_xml" @property - def target_space_id(self): - return problem.SpaceID.EN_TOK + def targeted_vocab_size(self): + return 2**13 # 8192 @property - def num_shards(self): - return 1000 + def sequence_length(self): + """Length of each example (in tokens).""" + return 1024 @property - def vocab_name(self): - return "vocab.wiki" + def max_chars_for_vocab(self): + """Number of characters of training data to use for generating vocab.""" + # magic number for backwards compatibility + return 41800829 - @property - def use_subword_tokenizer(self): - return True - @property - def 
targeted_vocab_size(self): - return 2**15 # 32768 +@registry.register_problem +class LanguagemodelWikiXmlV8kL4k(LanguagemodelWikiXmlV8kL1k): + """A language model on English Wikipedia. - @property - def use_train_shards_for_dev(self): - return True + XML dump is chopped arbitrarily into sequences of length 4096 tokens, + without regard to article boundaries. + """ - def generator(self, data_dir, tmp_dir, _): - encoder = generator_utils.get_or_generate_vocab_inner( - data_dir, self.vocab_file, self.targeted_vocab_size, - page_generator(tmp_dir, max_docs=10000)) - for page in page_generator(tmp_dir): - title = _page_title(page) - encoded = encoder.encode(page) + [EOS] - encoded_title = encoder.encode(title) + [EOS] - yield {"inputs": encoded_title, "targets": encoded} + @property + def sequence_length(self): + """Length of each example (in tokens).""" + return 4096 -class LanguagemodelWikiScramble(problem.Text2TextProblem): +class LanguagemodelWikiScramble(LanguagemodelWikiXmlV8kL1k): """Language modeling on English wikipedia. "targets" is a sequence of sequence_length tokens - a fragment of an article. @@ -146,18 +132,16 @@ class LanguagemodelWikiScramble(problem.Text2TextProblem): of the target sequence given the input sequence. """ - @property - def sequence_length(self): - raise NotImplementedError() + def example_generator(self, encoder, tmp_dir, task_id): + for x in super(LanguagemodelWikiScramble, self).example_generator( + encoder, tmp_dir, task_id): + x["inputs"] = self.scramble(x["targets"]) + yield x @property def scramble_fraction(self): raise NotImplementedError() - @property - def is_character_level(self): - return False - @property def has_inputs(self): return True @@ -166,33 +150,14 @@ def has_inputs(self): def input_space_id(self): return problem.SpaceID.EN_TOK - @property - def target_space_id(self): - return problem.SpaceID.EN_TOK - - @property - def num_shards(self): - return 1000 - - @property - def vocab_name(self): - return "vocab.wiki" - - @property - def use_subword_tokenizer(self): - return True - @property def targeted_vocab_size(self): return 2**13 # 8192 @property - def use_train_shards_for_dev(self): - return True - - @property - def max_cases(self): - return (2 ** 30) / self.sequence_length + def remainder_policy(self): + """What to do with leftover tokens.""" + return "drop" def scramble(self, seq): seq = np.array(seq) @@ -207,30 +172,9 @@ def scramble(self, seq): seq = list(seq) return seq - def generator(self, data_dir, tmp_dir, _): - encoder = generator_utils.get_or_generate_vocab_inner( - data_dir, self.vocab_file, self.targeted_vocab_size, - page_generator(tmp_dir, max_docs=1000)) - case_num = 0 - for page in page_generator(tmp_dir): - encoded = encoder.encode(page) - for i in xrange(len(encoded) // self.sequence_length): - case_num += 1 - if self.max_cases and case_num > self.max_cases: - return - targets = encoded[ - i * self.sequence_length:(i + 1) * self.sequence_length] - inputs = self.scramble(targets) - yield {"inputs": inputs, "targets": targets} - - def eval_metrics(self): - return [ - metrics.Metrics.ACC, metrics.Metrics.NEG_LOG_PERPLEXITY - ] - @registry.register_problem -class LanguagemodelWikiScramble128(LanguagemodelWikiScramble): +class LanguagemodelWikiScrambleL128(LanguagemodelWikiScramble): """Sequence length 128, 50% scrambed.""" @property @@ -243,7 +187,7 @@ def scramble_fraction(self): @registry.register_problem -class LanguagemodelWikiScramble1k50(LanguagemodelWikiScramble): +class 
LanguagemodelWikiScrambleL1k(LanguagemodelWikiScramble): """Sequence length 1024, 50% scrambed.""" @property @@ -256,13 +200,192 @@ def scramble_fraction(self): @registry.register_problem -class LanguagemodelWikiScramble8k50(LanguagemodelWikiScramble): - """Sequence length 8192, 50% scrambed.""" +class LanguagemodelWikiNorefV8kL1k(LanguagemodelWikiXmlV8kL1k): + """A language model on English Wikipedia. + + References and internal links are removed from the raw XML. + + Special pages (non-articles) are dropped. + + This more closely resemples plain text, though there are still some xml + elements, like tables. + + Each article is prefixed by a line containing the title and length in + characters - e.g. + title: "Price of Tea in China" length: 12345 + During inference time, you can forward generate starting with such a header + in order to obtain a randomly generated article with a given title and + (approximate) length. + + Result is chopped arbitrarily into sequences of length 1024 tokens, + without regard to article boundaries. + """ @property - def sequence_length(self): - return 8192 + def vocab_name(self): + return "vocab.wiki_noref" + + def filepath_to_unicode_text(self, filepath): + """Overriddes the base class to clean up the xml dump before tokenizing.""" + dump = problem.to_unicode_ignore_erros(tf.gfile.Open(filepath).read()) + pages = _dump_to_pages(dump) + ret = u"" + for p in pages: + title = _page_to_title(p) + text = _page_to_text(p) + text = _remove_triple_quotes( + _remove_double_brackets(_remove_references(text))) + if u":" in title: + # not a regular article + continue + if len(text) <= 140: + # Probably a redirect or something like that. Skip it. + continue + ret += u"title: \"%s\" length: %d\n%s\n" % (title, len(text), text) + return ret @property - def scramble_fraction(self): - return 0.5 + def max_chars_for_vocab(self): + """Number of characters of training data to use for generating vocab.""" + # magic number for backwards compatibility + return 21240483 + + +def _dump_to_pages(dump): + """Extract pages from an xml dump. + + Args: + dump: a unicode string + Returns: + a list of unicode strings + """ + pos = 0 + ret = [] + start_tag = u"<page>\n" + end_tag = u"</page>\n" + while True: + start_pos = dump.find(start_tag, pos) + if start_pos == -1: + break + start_pos += len(start_tag) + end_pos = dump.find(end_tag, start_pos) + if end_pos == -1: + break + ret.append(dump[start_pos:end_pos]) + pos = end_pos + len(end_tag) + return ret + + +def _page_to_title(page): + """Extract the title from a page. + + Args: + page: a unicode string + Returns: + a unicode string + """ + # print("page=%s" % page) + start_tag = u"<title>" + end_tag = u"" + start_pos = page.find(start_tag) + end_pos = page.find(end_tag) + assert start_pos != -1 + assert end_pos != -1 + start_pos += len(start_tag) + return page[start_pos:end_pos] + + +def _page_to_text(page): + """Extract the text from a page. + + Args: + page: a unicode string + Returns: + a unicode string + """ + # text start tag looks like "" + start_pos = page.find(u"", start_pos) + assert end_tag_pos != -1 + end_tag_pos += len(u">") + end_pos = page.find(u"") + if end_pos == -1: + return u"" + return page[end_tag_pos:end_pos] + + +def _find_and_replace(text, start_string, end_string, replace_fn): + """Remove everything found between instances of start_string and end_string. + + Replace each such instance with replace_fn(removed_text) + + e.g. 
_find_and_replace(u"the [[fat]] cat [[sat]]", u"[[", u"]]", lambda x: x) + = u"the fat cat sat" + + Args: + text: a unicode string + start_string: a unicode string + end_string: a unicode string + replace_fn: a unary function from unicode string to unicode string + + Returns: + a string + """ + ret = u"" + current_pos = 0 + while True: + start_pos = text.find(start_string, current_pos) + if start_pos == -1: + ret += text[current_pos:] + break + ret += text[current_pos:start_pos] + end_pos = text.find(end_string, start_pos + len(start_string)) + if end_pos == -1: + break + ret += replace_fn(text[start_pos + len(start_string):end_pos]) + current_pos = end_pos + len(end_string) + return ret + + +def _remove_references(text): + """Strip out references from wikipedia xml.""" + return _find_and_replace(text, u"<ref", u"</ref>", lambda s: "") + + +def _remove_triple_quotes(text): + """Strip out triple quotes from wikipedia xml.""" + return _find_and_replace(text, u"'''", u"'''", lambda s: s) + + +def _remove_double_brackets(text): + """Remove double brackets (internal links) but leave the viewable text. + + Args: + text: a unicode string + Returns: + a unicode string + """ + def replacement_fn(s): + if u":" in s: + # this is probably a category or something like that. + return "" + # keep the part after the bar. + bar_pos = s.find(u"|") + if bar_pos == -1: + return s + return s[bar_pos + 1:] + return _find_and_replace(text, u"[[", u"]]", replacement_fn) + + +@registry.register_problem +class LanguagemodelWikiNorefV8kL16k(LanguagemodelWikiNorefV8kL1k): + """A language model on English Wikipedia. + + References removed. Chopped into segments of 16k tokens. + """ + + @property + def sequence_length(self): + """Length of each example (in tokens).""" + return 2**14 diff --git a/tensor2tensor/insights/README.md b/tensor2tensor/insights/README.md new file mode 100644 index 000000000..ebed255e1 --- /dev/null +++ b/tensor2tensor/insights/README.md @@ -0,0 +1,76 @@ +# Tensor2Tensor Insights + +The Insights packages provides an interactive webservice for understanding the +inner workings of a Tensor2Tensor model. It will provide a series of +visualizations extracted from a requested T2T model that informs model developers +and model users on how to improve or best utilize a model. + +## Dependencies + +Before using the Insights server, you must install [Bower](https://bower.io/) +which we use to manage our web component dependencies. You can easily install +this with the [Node Package Manager](https://www.npmjs.com/). + +## Setup Instructions + +After training a model, such as according to the Quick Start guide, you can run +the `t2t-insights-server` binary and begin querying it. + +First, prepare the bower dependencies by navigating into the +`tensor2tensor/insights/polymer` directory and running `bower install`: + +``` +pushd tensor2tensor/insights/polymer +bower install +popd +``` + +The models run by server is then configured by a JSON version of the +InsightsConfiguration protocol buffer. 
Using the model trained in the Quick +Start guide, a sample configuration would be: + +``` + { + "configuration": [{ + "source_language": "en", + "target_language": "de", + "label": "transformers_wmt32k", + "transformer": { + "model": "transformer", + "model_dir": "/tmp/t2t/train", + "data_dir": "/tmp/t2t/data", + "hparams": "", + "hparams_set": "transformer_base_single_gpu", + "problems": "translate_ende_wmt32k" + }, + }] + "language": [{ + "code": "en", + "name": "English", + },{ + "code": "de", + "name": "German", + }] + } +``` + +With that saved to `configuration.json`, run the following: + +``` +t2t-insights-server \ + --configuration=configuration.json \ + --static_path=`pwd`/tensor2tensor/insights/polymer +``` + +This will bring up a minimal [Flask](http://flask.pocoo.org/) REST service +served by a [GUnicorn](http://gunicorn.org/) HTTP Server. + +## Features to be developed + +This is a minimal web server. We are in the process of adding additional +exciting features that give insight into a model's behavior: + + * Integrating a multi-head attention visualization. + * Registering multiple models to compare their behavior. + * Indexing training data to find examples related to a current query. + * Tracking interesting query + translation pairs for deeper analysis. diff --git a/tensor2tensor/insights/insight_configuration.proto b/tensor2tensor/insights/insight_configuration.proto new file mode 100644 index 000000000..6a1656eac --- /dev/null +++ b/tensor2tensor/insights/insight_configuration.proto @@ -0,0 +1,55 @@ +syntax = "proto3"; + +package tensor2tensor; + +// Configures the Neural Machine Translation Insight Frontend with a set of +// supported query processors and languages. +message InsightConfiguration { + // Specifies zero or more models to inspect. + repeated QueryProcessorConfiguration configuration = 1; + + // Specifies language codes and display names. + repeated Language language = 2; +} + +// A displayable language name. +message Language { + // The BCP-47 Language code. + string code = 1; + // The language's display name. + string name = 2; +} + +// Configures a QueryProcessor and registers it with the Insight Frontend when +// responding to analysis queries. +message QueryProcessorConfiguration { + // The model's BCP-47 source language code. + string source_language = 1; + // The model's BCP-47 target language code. + string target_language = 2; + // A short label for the model. + string label = 3; + // The QueryProcessor to use. By default we just use the TransformerModel. + string query_processor = 4; + + // Configuration for the TransformerModel. + TransformerConfiguration transformer = 5; +} + +// Specifies the parameters for a trained Transformer model to inspect. These +// parameters match those in t2t-trainer and t2t-decoder. +message TransformerConfiguration { + // The model type. + string model = 1; + // The trained model directory. + string model_dir = 2; + // The data directory for the model. + string data_dir = 3; + + // The hyperparameter set for running the model. + string hparams_set = 4; + // Overriding hyperparameters. + string hparams = 5; + // The problem sets over which this model was trained and configured. + string problems = 6; +} diff --git a/tensor2tensor/insights/polymer/.bowerrc b/tensor2tensor/insights/polymer/.bowerrc new file mode 100644 index 000000000..b316080f0 --- /dev/null +++ b/tensor2tensor/insights/polymer/.bowerrc @@ -0,0 +1,3 @@ +{ + "directory": "." 
+} diff --git a/tensor2tensor/insights/polymer/attention_visualization/attention-visualization.js b/tensor2tensor/insights/polymer/attention_visualization/attention-visualization.js index b58d90905..e738c2629 100644 --- a/tensor2tensor/insights/polymer/attention_visualization/attention-visualization.js +++ b/tensor2tensor/insights/polymer/attention_visualization/attention-visualization.js @@ -15,8 +15,6 @@ * limitations under the License. */ -goog.module('t2t.AttentionVisualization'); - /** * `` presents a heatmap of input-output associations. * @@ -62,10 +60,16 @@ class AttentionVisualization extends Polymer.Element { this.zoom_ = undefined; } + /** + * @return {string} The component name. + */ static get is() { return 'attention-visualization'; } + /** + * @return {!Object} The component properties. + */ static get properties() { return { /** @@ -84,6 +88,9 @@ class AttentionVisualization extends Polymer.Element { }; } + /** + * @return {!Array} The component observers. + */ static get observers() { return [ 'zoomDepthChanged_(zoomDepth_)', @@ -308,5 +315,3 @@ class AttentionVisualization extends Polymer.Element { } customElements.define(AttentionVisualization.is, AttentionVisualization); - -exports = {AttentionVisualization}; diff --git a/tensor2tensor/insights/polymer/bower.json b/tensor2tensor/insights/polymer/bower.json new file mode 100644 index 000000000..da1f4aaed --- /dev/null +++ b/tensor2tensor/insights/polymer/bower.json @@ -0,0 +1,80 @@ +{ + "name": "tensor2tensor-insights", + "homepage": "https://github.com/tensorflow/tensor2tensor", + "description": "Components for analyzing tensor2tensor neural machine translation models.", + "main": "index.html", + "keywords": [ + "neural", + "machine", + "translation" + ], + "authors": [ + "kstevens@google.com" + ], + "license": "Apache 2.0", + "private": true, + "ignore": [ + "**/.*", + "node_modules", + "bower_components", + "test", + "tests" + ], + "dependencies": { + "app-layout": "PolymerElements/app-layout#2.0.4", + "app-route": "PolymerElements/app-route#2.0.3", + "d3": "d3#4.12.2", + "iron-a11y-keys": "PolymerElements/iron-a11y-keys#2.0.0", + "iron-ajax": "PolymerElements/iron-ajax#2.0.0", + "iron-flex-layout": "PolymerElements/iron-flex-layout#2.0.0", + "iron-icon": "PolymerElements/iron-icon#2.0.0", + "iron-icons": "PolymerElements/iron-icons#2.0.0", + "iron-list": "PolymerElements/iron-list#2.0.0", + "iron-pages": "PolymerElements/iron-pages#2.0.0", + "iron-selector": "PolymerElements/iron-selector#2.0.0", + "neon-animation": "PolymerElements/neon-animation#2.0.0", + "paper-button": "PolymerElements/paper-button#2.0.0", + "paper-card": "PolymerElements/paper-card#2.0.0", + "paper-dialog": "PolymerElements/paper-dialog#2.0.0", + "paper-dropdown-menu": "PolymerElements/paper-dropdown-menu#2.0.0", + "paper-icon-button": "PolymerElements/paper-icon-button#2.0.0", + "paper-input": "PolymerElements/paper-input#2.0.0", + "paper-item": "PolymerElements/paper-item#2.0.0", + "paper-listbox": "PolymerElements/paper-listbox#2.0.0", + "paper-slider": "PolymerElements/paper-slider#2.0.0", + "paper-tabs": "PolymerElements/paper-tabs#2.0.0", + "paper-toggle-button": "PolymerElements/paper-toggle-button#2.0.0", + "paper-tooltip": "PolymerElements/paper-tooltip#2.0.0", + "paper-progress": "PolymerElements/paper-progress#2.0.0", + "polymer": "polymer/polymer#v2.3.1" + }, + "resolutions": { + "webcomponentsjs": "^v1.0.19", + "polymer": "^v2.3.1", + "app-route": "^2.0.3", + "app-layout": "^2.0.4", + "iron-location": "1 - 2", + "iron-selector": 
"^2.0.0", + "neon-animation": "^2.0.0", + "iron-icon": "^2.0.0", + "iron-pages": "^2.0.0", + "iron-icons": "^2.0.0", + "paper-icon-button": "^2.0.0", + "paper-item": "^2.0.0", + "iron-flex-layout": "^2.0.0", + "paper-listbox": "^2.0.0", + "iron-a11y-keys": "^2.0.0", + "paper-dialog": "^2.0.0", + "iron-ajax": "^2.0.0", + "paper-progress": "^2.0.0", + "paper-dropdown-menu": "^2.0.0", + "paper-tabs": "^2.0.0", + "paper-input": "^2.0.0", + "paper-toggle-button": "^2.0.0", + "paper-slider": "^2.0.0", + "iron-list": "^2.0.0", + "paper-card": "^2.0.0", + "paper-tooltip": "^2.0.0", + "iron-overlay-behavior": "^2.2.0" + } +} diff --git a/tensor2tensor/insights/polymer/explore_view/explore-view.html b/tensor2tensor/insights/polymer/explore_view/explore-view.html index d0456211f..97fce423c 100644 --- a/tensor2tensor/insights/polymer/explore_view/explore-view.html +++ b/tensor2tensor/insights/polymer/explore_view/explore-view.html @@ -31,8 +31,8 @@ - - + +