ReviewId,SentenceId,TurkerId,Text,Label1,Label2
r102,s0,t31,"1. The paper misses some more recent reference, e.g. [a,b].",actionable,suggestion
r102,s1,t31,"2. Indeed, AlexNet is a good seedbed to test binary methods.",actionable,suggestion
r102,s2,t31,"So, I wish to see a section on testing with Resnet and GoogleNet.",actionable,suggestion
r102,s3,t31,"Indeed, the authors have commented: ""AlexNet with batch-normalization (AlexNet-BN) is the standard model ... acceptance that improvements made to accuracy transfer well to more modern architectures.""",non_actionable,fact
r102,s4,t31,3. The paper wants to find a good trade-off on speed and accuracy.,non_actionable,fact
r102,s5,t31,"The authors have plotted such trade-off on space v.s. accuracy in Figure 3(b), then how about speed v.s. accuracy?",actionable,suggestion
r102,s6,t31,My concern is that one-bit system is already complicated to implement.,actionable,shortcoming
r102,s7,t31,"Indeed, the authors have discussed their implementation in Section 3.3, so, how their method works in practice?",non_actionable,question
r102,s8,t31,4. Is trade-off between 1 to 2 bits really important?,non_actionable,question
r102,s9,t31,"Compared with 2bits or ternary network, the proposed method at most achieving (1.4/2) compression ratio and (2/1.4) speedup (based on their Table 1).",non_actionable,fact
r102,s0,t23,"1. The paper misses some more recent reference, e.g. [a,b].",actionable,suggestion
r102,s1,t23,"2. Indeed, AlexNet is a good seedbed to test binary methods.",non_actionable,agreement
r102,s2,t23,"So, I wish to see a section on testing with Resnet and GoogleNet.",actionable,suggestion
r102,s3,t23,"Indeed, the authors have commented: ""AlexNet with batch-normalization (AlexNet-BN) is the standard model ... acceptance that improvements made to accuracy transfer well to more modern architectures.""",non_actionable,agreement
r102,s4,t23,3. The paper wants to find a good trade-off on speed and accuracy.,actionable,suggestion
r102,s5,t23,"The authors have plotted such trade-off on space v.s. accuracy in Figure 3(b), then how about speed v.s. accuracy?",actionable,question
r102,s6,t23,My concern is that one-bit system is already complicated to implement.,non_actionable,shortcoming
r102,s7,t23,"Indeed, the authors have discussed their implementation in Section 3.3, so, how their method works in practice?",actionable,question
r102,s8,t23,4. Is trade-off between 1 to 2 bits really important?,actionable,question
r102,s9,t23,"Compared with 2bits or ternary network, the proposed method at most achieving (1.4/2) compression ratio and (2/1.4) speedup (based on their Table 1).",non_actionable,fact
r102,s0,t20,"1. The paper misses some more recent reference, e.g. [a,b].",actionable,shortcoming
r102,s1,t20,"2. Indeed, AlexNet is a good seedbed to test binary methods.",non_actionable,fact
r102,s2,t20,"So, I wish to see a section on testing with Resnet and GoogleNet.",actionable,suggestion
r102,s3,t20,"Indeed, the authors have commented: ""AlexNet with batch-normalization (AlexNet-BN) is the standard model ... acceptance that improvements made to accuracy transfer well to more modern architectures.""",non_actionable,fact
r102,s4,t20,3. The paper wants to find a good trade-off on speed and accuracy.,non_actionable,fact
r102,s5,t20,"The authors have plotted such trade-off on space v.s. accuracy in Figure 3(b), then how about speed v.s. accuracy?",actionable,question
r102,s6,t20,My concern is that one-bit system is already complicated to implement.,non_actionable,shortcoming
r102,s7,t20,"Indeed, the authors have discussed their implementation in Section 3.3, so, how their method works in practice?",actionable,question
r102,s8,t20,4. Is trade-off between 1 to 2 bits really important?,actionable,question
r102,s9,t20,"Compared with 2bits or ternary network, the proposed method at most achieving (1.4/2) compression ratio and (2/1.4) speedup (based on their Table 1).",non_actionable,fact
r102,s0,t9,"1. The paper misses some more recent reference, e.g. [a,b].",actionable,shortcoming
r102,s1,t9,"2. Indeed, AlexNet is a good seedbed to test binary methods.",non_actionable,agreement
r102,s2,t9,"So, I wish to see a section on testing with Resnet and GoogleNet.",actionable,suggestion
r102,s3,t9,"Indeed, the authors have commented: ""AlexNet with batch-normalization (AlexNet-BN) is the standard model ... acceptance that improvements made to accuracy transfer well to more modern architectures.""",non_actionable,agreement
r102,s4,t9,3. The paper wants to find a good trade-off on speed and accuracy.,non_actionable,fact
r102,s5,t9,"The authors have plotted such trade-off on space v.s. accuracy in Figure 3(b), then how about speed v.s. accuracy?",non_actionable,question
r102,s6,t9,My concern is that one-bit system is already complicated to implement.,non_actionable,shortcoming
r102,s7,t9,"Indeed, the authors have discussed their implementation in Section 3.3, so, how their method works in practice?",actionable,question
r102,s8,t9,4. Is trade-off between 1 to 2 bits really important?,actionable,question
r102,s9,t9,"Compared with 2bits or ternary network, the proposed method at most achieving (1.4/2) compression ratio and (2/1.4) speedup (based on their Table 1).",non_actionable,fact
r102,s0,t10,"1. The paper misses some more recent reference, e.g. [a,b].",non_actionable,disagreement
r102,s1,t10,"2. Indeed, AlexNet is a good seedbed to test binary methods.",actionable,agreement
r102,s2,t10,"So, I wish to see a section on testing with Resnet and GoogleNet.",actionable,suggestion
r102,s3,t10,"Indeed, the authors have commented: ""AlexNet with batch-normalization (AlexNet-BN) is the standard model ... acceptance that improvements made to accuracy transfer well to more modern architectures.""",non_actionable,other
r102,s4,t10,3. The paper wants to find a good trade-off on speed and accuracy.,non_actionable,other
r102,s5,t10,"The authors have plotted such trade-off on space v.s. accuracy in Figure 3(b), then how about speed v.s. accuracy?",actionable,suggestion
r102,s6,t10,My concern is that one-bit system is already complicated to implement.,actionable,fact
r102,s7,t10,"Indeed, the authors have discussed their implementation in Section 3.3, so, how their method works in practice?",actionable,question
r102,s8,t10,4. Is trade-off between 1 to 2 bits really important?,actionable,question
r102,s9,t10,"Compared with 2bits or ternary network, the proposed method at most achieving (1.4/2) compression ratio and (2/1.4) speedup (based on their Table 1).",non_actionable,other
r1,s0,t20,"A decent paper with some issues This paper proposes a new output layer in neural networks, which allows them to use logged contextual bandit feedback for training.",non_actionable,agreement
r1,s1,t20,"Others: - The baseline in REINFORCE (Williams'92), which is equivalent to introduced Lagrange multiplier, is well known and well defined as control variate in Monte Carlo simulation, certainly not an ""ad-hoc heuristic"" as claimed in the paper [see Greensmith et al. (2004). Variance Reduction for Gradient Estimates in Reinforcement Learning, JMLR 5.] - Bandit to supervised conversion: please add a supervised baseline system trained just on instances with top feedbacks -- this should be a much more interesting and relevant strong baseline.",actionable,suggestion
r1,s2,t20,There are multiple indications that this bandit-to-supervised baseline is hard to outperform in a number of important applications.,non_actionable,fact
r1,s3,t20,"The control variate of the SNIPS objective can be seen as defining a probability distribution over the log, thus ensuring that for each sample that sample’s delta is multiplied by a value in [0,1] and not by a large importance sampling ratio.",non_actionable,fact
r1,s4,t20,1. IPS for losses<0 and risk minimization: raise the probability of every sample in the log irrespective of the loss itself,actionable,suggestion
r1,s5,t20,- What is the feedback in the CIFAR-10 experiments?,actionable,question
r1,s6,t20,- The claim of Theorem 2 in appendix B does not follow from its proof: what is proven is that the value of S(w) lies in an interval [1-e..1+e] with a certain probability for all w.,non_actionable,fact
r1,s7,t20,"Actually, the proof never makes any connection to optimization.",actionable,shortcoming
r1,s8,t20,"This would contradict some previously established convergence results for this type of problems: Reddi et al. (2016) Stochastic Variance Reduction for Nonconvex Optimization, ICML and Wang et al. 2013.",actionable,shortcoming
r1,s9,t20,"Variance Reduction for Stochastic Gradient Optimization, NIPS.",non_actionable,fact
r1,s0,t31,"A decent paper with some issues This paper proposes a new output layer in neural networks, which allows them to use logged contextual bandit feedback for training.",non_actionable,agreement
r1,s1,t31,"Others: - The baseline in REINFORCE (Williams'92), which is equivalent to introduced Lagrange multiplier, is well known and well defined as control variate in Monte Carlo simulation, certainly not an ""ad-hoc heuristic"" as claimed in the paper [see Greensmith et al. (2004). Variance Reduction for Gradient Estimates in Reinforcement Learning, JMLR 5.] - Bandit to supervised conversion: please add a supervised baseline system trained just on instances with top feedbacks -- this should be a much more interesting and relevant strong baseline.",actionable,shortcoming
r1,s2,t31,There are multiple indications that this bandit-to-supervised baseline is hard to outperform in a number of important applications.,non_actionable,fact
r1,s3,t31,"The control variate of the SNIPS objective can be seen as defining a probability distribution over the log, thus ensuring that for each sample that sample’s delta is multiplied by a value in [0,1] and not by a large importance sampling ratio.",non_actionable,fact
r1,s4,t31,1. IPS for losses<0 and risk minimization: raise the probability of every sample in the log irrespective of the loss itself,non_actionable,fact
r1,s5,t31,- What is the feedback in the CIFAR-10 experiments?,non_actionable,question
r1,s6,t31,- The claim of Theorem 2 in appendix B does not follow from its proof: what is proven is that the value of S(w) lies in an interval [1-e..1+e] with a certain probability for all w.,actionable,shortcoming
r1,s7,t31,"Actually, the proof never makes any connection to optimization.",actionable,shortcoming
r1,s8,t31,"This would contradict some previously established convergence results for this type of problems: Reddi et al. (2016) Stochastic Variance Reduction for Nonconvex Optimization, ICML and Wang et al. 2013.",non_actionable,fact
r1,s9,t31,"Variance Reduction for Stochastic Gradient Optimization, NIPS.",non_actionable,fact
r1,s0,t8,"A decent paper with some issues This paper proposes a new output layer in neural networks, which allows them to use logged contextual bandit feedback for training.",actionable,shortcoming
r1,s1,t8,"Others: - The baseline in REINFORCE (Williams'92), which is equivalent to introduced Lagrange multiplier, is well known and well defined as control variate in Monte Carlo simulation, certainly not an ""ad-hoc heuristic"" as claimed in the paper [see Greensmith et al. (2004). Variance Reduction for Gradient Estimates in Reinforcement Learning, JMLR 5.] - Bandit to supervised conversion: please add a supervised baseline system trained just on instances with top feedbacks -- this should be a much more interesting and relevant strong baseline.",actionable,suggestion
r1,s2,t8,There are multiple indications that this bandit-to-supervised baseline is hard to outperform in a number of important applications.,non_actionable,agreement
r1,s3,t8,"The control variate of the SNIPS objective can be seen as defining a probability distribution over the log, thus ensuring that for each sample that sample’s delta is multiplied by a value in [0,1] and not by a large importance sampling ratio.",non_actionable,fact
r1,s4,t8,1. IPS for losses<0 and risk minimization: raise the probability of every sample in the log irrespective of the loss itself,non_actionable,fact
r1,s5,t8,- What is the feedback in the CIFAR-10 experiments?,non_actionable,question
r1,s6,t8,- The claim of Theorem 2 in appendix B does not follow from its proof: what is proven is that the value of S(w) lies in an interval [1-e..1+e] with a certain probability for all w.,actionable,shortcoming
r1,s7,t8,"Actually, the proof never makes any connection to optimization.",actionable,shortcoming
r1,s8,t8,"This would contradict some previously established convergence results for this type of problems: Reddi et al. (2016) Stochastic Variance Reduction for Nonconvex Optimization, ICML and Wang et al. 2013.",non_actionable,fact
r1,s9,t8,"Variance Reduction for Stochastic Gradient Optimization, NIPS.",non_actionable,other
r1,s0,t16,"A decent paper with some issues This paper proposes a new output layer in neural networks, which allows them to use logged contextual bandit feedback for training.",non_actionable,agreement
r1,s1,t16,"Others: - The baseline in REINFORCE (Williams'92), which is equivalent to introduced Lagrange multiplier, is well known and well defined as control variate in Monte Carlo simulation, certainly not an ""ad-hoc heuristic"" as claimed in the paper [see Greensmith et al. (2004). Variance Reduction for Gradient Estimates in Reinforcement Learning, JMLR 5.] - Bandit to supervised conversion: please add a supervised baseline system trained just on instances with top feedbacks -- this should be a much more interesting and relevant strong baseline.",actionable,suggestion
r1,s2,t16,There are multiple indications that this bandit-to-supervised baseline is hard to outperform in a number of important applications.,non_actionable,fact
r1,s3,t16,"The control variate of the SNIPS objective can be seen as defining a probability distribution over the log, thus ensuring that for each sample that sample’s delta is multiplied by a value in [0,1] and not by a large importance sampling ratio.",non_actionable,fact
r1,s4,t16,1. IPS for losses<0 and risk minimization: raise the probability of every sample in the log irrespective of the loss itself,actionable,suggestion
r1,s5,t16,- What is the feedback in the CIFAR-10 experiments?,actionable,question
r1,s6,t16,- The claim of Theorem 2 in appendix B does not follow from its proof: what is proven is that the value of S(w) lies in an interval [1-e..1+e] with a certain probability for all w.,actionable,shortcoming
r1,s7,t16,"Actually, the proof never makes any connection to optimization.",actionable,shortcoming
r1,s8,t16,"This would contradict some previously established convergence results for this type of problems: Reddi et al. (2016) Stochastic Variance Reduction for Nonconvex Optimization, ICML and Wang et al. 2013.",actionable,disagreement
r1,s9,t16,"Variance Reduction for Stochastic Gradient Optimization, NIPS.",non_actionable,fact
r1,s0,t2,"A decent paper with some issues This paper proposes a new output layer in neural networks, which allows them to use logged contextual bandit feedback for training.",non_actionable,fact
r1,s1,t2,"Others: - The baseline in REINFORCE (Williams'92), which is equivalent to introduced Lagrange multiplier, is well known and well defined as control variate in Monte Carlo simulation, certainly not an ""ad-hoc heuristic"" as claimed in the paper [see Greensmith et al. (2004). Variance Reduction for Gradient Estimates in Reinforcement Learning, JMLR 5.] - Bandit to supervised conversion: please add a supervised baseline system trained just on instances with top feedbacks -- this should be a much more interesting and relevant strong baseline.",actionable,suggestion
r1,s2,t2,There are multiple indications that this bandit-to-supervised baseline is hard to outperform in a number of important applications.,non_actionable,fact
r1,s3,t2,"The control variate of the SNIPS objective can be seen as defining a probability distribution over the log, thus ensuring that for each sample that sample’s delta is multiplied by a value in [0,1] and not by a large importance sampling ratio.",non_actionable,fact
r1,s4,t2,1. IPS for losses<0 and risk minimization: raise the probability of every sample in the log irrespective of the loss itself,non_actionable,other
r1,s5,t2,- What is the feedback in the CIFAR-10 experiments?,actionable,question
r1,s6,t2,- The claim of Theorem 2 in appendix B does not follow from its proof: what is proven is that the value of S(w) lies in an interval [1-e..1+e] with a certain probability for all w.,non_actionable,disagreement
r1,s7,t2,"Actually, the proof never makes any connection to optimization.",non_actionable,shortcoming
r1,s8,t2,"This would contradict some previously established convergence results for this type of problems: Reddi et al. (2016) Stochastic Variance Reduction for Nonconvex Optimization, ICML and Wang et al. 2013.",non_actionable,shortcoming
r1,s9,t2,"Variance Reduction for Stochastic Gradient Optimization, NIPS.",non_actionable,other
r89,s0,t10,A full implementation of binary CNN with code This paper builds on Binary-NET [Hubara et al. 2016] and expands it to CNN architectures.,non_actionable,other
r89,s1,t10,"It also provides optimizations that substantially improve the speed of the forward pass: packing layer bits along the channel dimension, pre-allocation of CUDA resources and binary-optimized CUDA kernels for matrix multiplications.",actionable,agreement
r89,s2,t10,The authors compare their framework to BinaryNET and Nervana/Neon and show a 8x speedup for 8092 matrix-matrix multiplication and a 68x speedup for MLP networks.,non_actionable,other
r89,s3,t10,"For CNN, they a speedup of 5x is obtained from the GPU to binary-optimizimed-GPU.",non_actionable,other
r89,s4,t10,A gain in memory size of 32x is also achieved by using binary weight and activation during the forward pass.,non_actionable,other
r89,s5,t10,The main contribution of this paper is an optimized code for Binary CNN.,actionable,agreement
r89,s6,t10,The authors provide the code with permissive licensing.,non_actionable,other
r89,s7,t10,"As is often the case with such comparisons, it is hard to disentangle from where exactly come the speedups.",actionable,shortcoming
r89,s8,t10,"Overall, i think it makes a good contribution to a field that is gaining importance for mobile and embedded applications of deep convnets.",actionable,agreement
r89,s9,t10,I think it is a good fit for a poster.,actionable,agreement
r89,s0,t9,A full implementation of binary CNN with code This paper builds on Binary-NET [Hubara et al. 2016] and expands it to CNN architectures.,non_actionable,agreement
r89,s1,t9,"It also provides optimizations that substantially improve the speed of the forward pass: packing layer bits along the channel dimension, pre-allocation of CUDA resources and binary-optimized CUDA kernels for matrix multiplications.",non_actionable,agreement
r89,s2,t9,The authors compare their framework to BinaryNET and Nervana/Neon and show a 8x speedup for 8092 matrix-matrix multiplication and a 68x speedup for MLP networks.,non_actionable,fact
r89,s3,t9,"For CNN, they a speedup of 5x is obtained from the GPU to binary-optimizimed-GPU.",non_actionable,fact
r89,s4,t9,A gain in memory size of 32x is also achieved by using binary weight and activation during the forward pass.,non_actionable,fact
r89,s5,t9,The main contribution of this paper is an optimized code for Binary CNN.,non_actionable,agreement
r89,s6,t9,The authors provide the code with permissive licensing.,non_actionable,fact
r89,s7,t9,"As is often the case with such comparisons, it is hard to disentangle from where exactly come the speedups.",non_actionable,fact
r89,s8,t9,"Overall, i think it makes a good contribution to a field that is gaining importance for mobile and embedded applications of deep convnets.",non_actionable,agreement
r89,s9,t9,I think it is a good fit for a poster.,non_actionable,agreement
r89,s0,t31,A full implementation of binary CNN with code This paper builds on Binary-NET [Hubara et al. 2016] and expands it to CNN architectures.,non_actionable,fact
r89,s1,t31,"It also provides optimizations that substantially improve the speed of the forward pass: packing layer bits along the channel dimension, pre-allocation of CUDA resources and binary-optimized CUDA kernels for matrix multiplications.",non_actionable,fact
r89,s2,t31,The authors compare their framework to BinaryNET and Nervana/Neon and show a 8x speedup for 8092 matrix-matrix multiplication and a 68x speedup for MLP networks.,non_actionable,fact
r89,s3,t31,"For CNN, they a speedup of 5x is obtained from the GPU to binary-optimizimed-GPU.",non_actionable,fact
r89,s4,t31,A gain in memory size of 32x is also achieved by using binary weight and activation during the forward pass.,non_actionable,fact
r89,s5,t31,The main contribution of this paper is an optimized code for Binary CNN.,non_actionable,fact
r89,s6,t31,The authors provide the code with permissive licensing.,non_actionable,fact
r89,s7,t31,"As is often the case with such comparisons, it is hard to disentangle from where exactly come the speedups.",non_actionable,fact
r89,s8,t31,"Overall, i think it makes a good contribution to a field that is gaining importance for mobile and embedded applications of deep convnets.",non_actionable,agreement
r89,s9,t31,I think it is a good fit for a poster.,non_actionable,agreement
r89,s0,t16,A full implementation of binary CNN with code This paper builds on Binary-NET [Hubara et al. 2016] and expands it to CNN architectures.,non_actionable,fact
r89,s1,t16,"It also provides optimizations that substantially improve the speed of the forward pass: packing layer bits along the channel dimension, pre-allocation of CUDA resources and binary-optimized CUDA kernels for matrix multiplications.",non_actionable,fact
r89,s2,t16,The authors compare their framework to BinaryNET and Nervana/Neon and show a 8x speedup for 8092 matrix-matrix multiplication and a 68x speedup for MLP networks.,non_actionable,fact
r89,s3,t16,"For CNN, they a speedup of 5x is obtained from the GPU to binary-optimizimed-GPU.",non_actionable,fact
r89,s4,t16,A gain in memory size of 32x is also achieved by using binary weight and activation during the forward pass.,non_actionable,fact
r89,s5,t16,The main contribution of this paper is an optimized code for Binary CNN.,non_actionable,fact
r89,s6,t16,The authors provide the code with permissive licensing.,non_actionable,fact
r89,s7,t16,"As is often the case with such comparisons, it is hard to disentangle from where exactly come the speedups.",actionable,shortcoming
r89,s8,t16,"Overall, i think it makes a good contribution to a field that is gaining importance for mobile and embedded applications of deep convnets.",non_actionable,agreement
r89,s9,t16,I think it is a good fit for a poster.,non_actionable,agreement
r89,s0,t20,A full implementation of binary CNN with code This paper builds on Binary-NET [Hubara et al. 2016] and expands it to CNN architectures.,non_actionable,fact
r89,s1,t20,"It also provides optimizations that substantially improve the speed of the forward pass: packing layer bits along the channel dimension, pre-allocation of CUDA resources and binary-optimized CUDA kernels for matrix multiplications.",non_actionable,fact
r89,s2,t20,The authors compare their framework to BinaryNET and Nervana/Neon and show a 8x speedup for 8092 matrix-matrix multiplication and a 68x speedup for MLP networks.,non_actionable,fact
r89,s3,t20,"For CNN, they a speedup of 5x is obtained from the GPU to binary-optimizimed-GPU.",non_actionable,fact
r89,s4,t20,A gain in memory size of 32x is also achieved by using binary weight and activation during the forward pass.,non_actionable,fact
r89,s5,t20,The main contribution of this paper is an optimized code for Binary CNN.,non_actionable,fact
r89,s6,t20,The authors provide the code with permissive licensing.,non_actionable,fact
r89,s7,t20,"As is often the case with such comparisons, it is hard to disentangle from where exactly come the speedups.",non_actionable,fact
r89,s8,t20,"Overall, i think it makes a good contribution to a field that is gaining importance for mobile and embedded applications of deep convnets.",non_actionable,agreement
r89,s9,t20,I think it is a good fit for a poster.,non_actionable,agreement
r39,s0,t10,"A good approach with some open questions on related work, scalability, and robustness The authors propose an approach for zero-shot visual learning.",actionable,agreement
r39,s1,t10,The robot then uses the learned parametric skill functions to reach goal states (images) provided by the demonstrator.,non_actionable,other
r39,s2,t10,"These topics seem sufficiently related to the proposed approach that the authors should include them in their related work section, and explain the similarities and differences.",actionable,agreement
r39,s3,t10,How much can change between the goal images and the environment before the system fails?,actionable,question
r39,s4,t10,"In the videos, it seems that the people and chairs are always in the same place.",non_actionable,other
r39,s5,t10,I could imagine a network learning to ignore features of objects that tend to wander over time.,actionable,suggestion
r39,s6,t10,The authors should consider exploring and discussing the effects of adding/moving/removing objects on the performance.,actionable,suggestion
r39,s7,t10,The evaluation with the sequence of checkpoints was created by using every fifth image.,non_actionable,other
r39,s8,t10,"In the videos, it seems like the robot could get a slightly better view if it took another couple of steps.",actionable,suggestion
r39,s9,t10,I assume this is an artifact of the way the goal recognizer is trained.,actionable,fact
r39,s0,t31,"A good approach with some open questions on related work, scalability, and robustness The authors propose an approach for zero-shot visual learning.",non_actionable,agreement
r39,s1,t31,The robot then uses the learned parametric skill functions to reach goal states (images) provided by the demonstrator.,non_actionable,fact
r39,s2,t31,"These topics seem sufficiently related to the proposed approach that the authors should include them in their related work section, and explain the similarities and differences.",actionable,suggestion
r39,s3,t31,How much can change between the goal images and the environment before the system fails?,non_actionable,question
r39,s4,t31,"In the videos, it seems that the people and chairs are always in the same place.",non_actionable,fact
r39,s5,t31,I could imagine a network learning to ignore features of objects that tend to wander over time.,non_actionable,fact
r39,s6,t31,The authors should consider exploring and discussing the effects of adding/moving/removing objects on the performance.,actionable,suggestion
r39,s7,t31,The evaluation with the sequence of checkpoints was created by using every fifth image.,non_actionable,fact
r39,s8,t31,"In the videos, it seems like the robot could get a slightly better view if it took another couple of steps.",actionable,suggestion
r39,s9,t31,I assume this is an artifact of the way the goal recognizer is trained.,non_actionable,fact
r39,s0,t2,"A good approach with some open questions on related work, scalability, and robustness The authors propose an approach for zero-shot visual learning.",non_actionable,fact
r39,s1,t2,The robot then uses the learned parametric skill functions to reach goal states (images) provided by the demonstrator.,non_actionable,fact
r39,s2,t2,"These topics seem sufficiently related to the proposed approach that the authors should include them in their related work section, and explain the similarities and differences.",actionable,suggestion
r39,s3,t2,How much can change between the goal images and the environment before the system fails?,actionable,question
r39,s4,t2,"In the videos, it seems that the people and chairs are always in the same place.",non_actionable,fact
r39,s5,t2,I could imagine a network learning to ignore features of objects that tend to wander over time.,non_actionable,fact
r39,s6,t2,The authors should consider exploring and discussing the effects of adding/moving/removing objects on the performance.,actionable,suggestion
r39,s7,t2,The evaluation with the sequence of checkpoints was created by using every fifth image.,non_actionable,fact
r39,s8,t2,"In the videos, it seems like the robot could get a slightly better view if it took another couple of steps.",non_actionable,fact
r39,s9,t2,I assume this is an artifact of the way the goal recognizer is trained.,actionable,shortcoming
r39,s0,t16,"A good approach with some open questions on related work, scalability, and robustness The authors propose an approach for zero-shot visual learning.",non_actionable,agreement
r39,s1,t16,The robot then uses the learned parametric skill functions to reach goal states (images) provided by the demonstrator.,non_actionable,fact
r39,s2,t16,"These topics seem sufficiently related to the proposed approach that the authors should include them in their related work section, and explain the similarities and differences.",non_actionable,agreement
r39,s3,t16,How much can change between the goal images and the environment before the system fails?,non_actionable,question
r39,s4,t16,"In the videos, it seems that the people and chairs are always in the same place.",non_actionable,fact
r39,s5,t16,I could imagine a network learning to ignore features of objects that tend to wander over time.,non_actionable,fact
r39,s6,t16,The authors should consider exploring and discussing the effects of adding/moving/removing objects on the performance.,actionable,suggestion
r39,s7,t16,The evaluation with the sequence of checkpoints was created by using every fifth image.,non_actionable,fact
r39,s8,t16,"In the videos, it seems like the robot could get a slightly better view if it took another couple of steps.",actionable,suggestion
r39,s9,t16,I assume this is an artifact of the way the goal recognizer is trained.,non_actionable,fact
r39,s0,t20,"A good approach with some open questions on related work, scalability, and robustness The authors propose an approach for zero-shot visual learning.",non_actionable,fact
r39,s1,t20,The robot then uses the learned parametric skill functions to reach goal states (images) provided by the demonstrator.,non_actionable,fact
r39,s2,t20,"These topics seem sufficiently related to the proposed approach that the authors should include them in their related work section, and explain the similarities and differences.",non_actionable,agreement
r39,s3,t20,How much can change between the goal images and the environment before the system fails?,actionable,question
r39,s4,t20,"In the videos, it seems that the people and chairs are always in the same place.",non_actionable,shortcoming
r39,s5,t20,I could imagine a network learning to ignore features of objects that tend to wander over time.,actionable,suggestion
r39,s6,t20,The authors should consider exploring and discussing the effects of adding/moving/removing objects on the performance.,actionable,suggestion
r39,s7,t20,The evaluation with the sequence of checkpoints was created by using every fifth image.,non_actionable,fact
r39,s8,t20,"In the videos, it seems like the robot could get a slightly better view if it took another couple of steps.",actionable,suggestion
r39,s9,t20,I assume this is an artifact of the way the goal recognizer is trained.,non_actionable,fact
r90,s0,t2,"A good paper, but it could be better for writing and baseline comparisons This paper studies the problem of text-to-speech synthesis (TTS) ""in the wild"" and proposes to use the shifting buffer memory.",non_actionable,fact
r90,s1,t2,"Specifically, an input text is transformed to phoneme encoding and then context vector is created with attention mechanism.",non_actionable,fact
r90,s2,t2,A novel speaker can be adapted just by fitting it with SGD while fixing all other components.,non_actionable,fact
r90,s3,t2,"In experiments, authors try single-speaker TTS and multi-speaker TTS along with speaker identification (ID), and show that the proposed approach outperforms baselines, namely, Tacotron and Char2wav.",non_actionable,fact
r90,s4,t2,"Finally, they use the challenging Youtube data to train the model and show promising results.",non_actionable,agreement
r90,s5,t2,"3. The proposed approach outperforms baselines in several tasks, and the ability to fit to a novel speaker is nice.",non_actionable,agreement
r90,s6,t2,But there are some issues as well (see Cons.) Cons:,non_actionable,shortcoming
r90,s7,t2,Some notations were not clearly described in the text even though it was in the table.,actionable,shortcoming
r90,s8,t2,"The paper says Deep Voice 2 (Arik et al., 2017a) is only prior work for multi-speaker TTS.",non_actionable,fact
r90,s9,t2,"3. Why do you think your model is better than VCTK test split, and even VCTK85 is better than VCTK101?",actionable,question
r90,s0,t20,"A good paper, but it could be better for writing and baseline comparisons This paper studies the problem of text-to-speech synthesis (TTS) ""in the wild"" and proposes to use the shifting buffer memory.",non_actionable,agreement
r90,s1,t20,"Specifically, an input text is transformed to phoneme encoding and then context vector is created with attention mechanism.",non_actionable,fact
r90,s2,t20,A novel speaker can be adapted just by fitting it with SGD while fixing all other components.,actionable,suggestion
r90,s3,t20,"In experiments, authors try single-speaker TTS and multi-speaker TTS along with speaker identification (ID), and show that the proposed approach outperforms baselines, namely, Tacotron and Char2wav.",non_actionable,fact
r90,s4,t20,"Finally, they use the challenging Youtube data to train the model and show promising results.",non_actionable,fact
r90,s5,t20,"3. The proposed approach outperforms baselines in several tasks, and the ability to fit to a novel speaker is nice.",non_actionable,agreement
r90,s6,t20,But there are some issues as well (see Cons.) Cons:,non_actionable,shortcoming
r90,s7,t20,Some notations were not clearly described in the text even though it was in the table.,actionable,shortcoming
r90,s8,t20,"The paper says Deep Voice 2 (Arik et al., 2017a) is only prior work for multi-speaker TTS.",non_actionable,fact
r90,s9,t20,"3. Why do you think your model is better than VCTK test split, and even VCTK85 is better than VCTK101?",actionable,question
r90,s0,t31,"A good paper, but it could be better for writing and baseline comparisons This paper studies the problem of text-to-speech synthesis (TTS) ""in the wild"" and proposes to use the shifting buffer memory.",actionable,shortcoming
r90,s1,t31,"Specifically, an input text is transformed to phoneme encoding and then context vector is created with attention mechanism.",non_actionable,fact
r90,s2,t31,A novel speaker can be adapted just by fitting it with SGD while fixing all other components.,non_actionable,fact
r90,s3,t31,"In experiments, authors try single-speaker TTS and multi-speaker TTS along with speaker identification (ID), and show that the proposed approach outperforms baselines, namely, Tacotron and Char2wav.",non_actionable,fact
r90,s4,t31,"Finally, they use the challenging Youtube data to train the model and show promising results.",non_actionable,fact
r90,s5,t31,"3. The proposed approach outperforms baselines in several tasks, and the ability to fit to a novel speaker is nice.",non_actionable,agreement
r90,s6,t31,But there are some issues as well (see Cons.) Cons:,actionable,shortcoming
r90,s7,t31,Some notations were not clearly described in the text even though it was in the table.,actionable,shortcoming
r90,s8,t31,"The paper says Deep Voice 2 (Arik et al., 2017a) is only prior work for multi-speaker TTS.",non_actionable,fact
r90,s9,t31,"3. Why do you think your model is better than VCTK test split, and even VCTK85 is better than VCTK101?",non_actionable,question
r90,s0,t10,"A good paper, but it could be better for writing and baseline comparisons This paper studies the problem of text-to-speech synthesis (TTS) ""in the wild"" and proposes to use the shifting buffer memory.",actionable,suggestion
r90,s1,t10,"Specifically, an input text is transformed to phoneme encoding and then context vector is created with attention mechanism.",non_actionable,other
r90,s2,t10,A novel speaker can be adapted just by fitting it with SGD while fixing all other components.,non_actionable,other
r90,s3,t10,"In experiments, authors try single-speaker TTS and multi-speaker TTS along with speaker identification (ID), and show that the proposed approach outperforms baselines, namely, Tacotron and Char2wav.",non_actionable,other
r90,s4,t10,"Finally, they use the challenging Youtube data to train the model and show promising results.",non_actionable,other
r90,s5,t10,"3. The proposed approach outperforms baselines in several tasks, and the ability to fit to a novel speaker is nice.",non_actionable,other
r90,s6,t10,But there are some issues as well (see Cons.) Cons:,actionable,disagreement
r90,s7,t10,Some notations were not clearly described in the text even though it was in the table.,actionable,shortcoming
r90,s8,t10,"The paper says Deep Voice 2 (Arik et al., 2017a) is only prior work for multi-speaker TTS.",non_actionable,other
r90,s9,t10,"3. Why do you think your model is better than VCTK test split, and even VCTK85 is better than VCTK101?",actionable,question
r90,s0,t8,"A good paper, but it could be better for writing and baseline comparisons This paper studies the problem of text-to-speech synthesis (TTS) ""in the wild"" and proposes to use the shifting buffer memory.",actionable,suggestion
r90,s1,t8,"Specifically, an input text is transformed to phoneme encoding and then context vector is created with attention mechanism.",non_actionable,fact
r90,s2,t8,A novel speaker can be adapted just by fitting it with SGD while fixing all other components.,non_actionable,fact
r90,s3,t8,"In experiments, authors try single-speaker TTS and multi-speaker TTS along with speaker identification (ID), and show that the proposed approach outperforms baselines, namely, Tacotron and Char2wav.",non_actionable,fact
r90,s4,t8,"Finally, they use the challenging Youtube data to train the model and show promising results.",non_actionable,agreement
r90,s5,t8,"3. The proposed approach outperforms baselines in several tasks, and the ability to fit to a novel speaker is nice.",non_actionable,agreement
r90,s6,t8,But there are some issues as well (see Cons.) Cons:,actionable,shortcoming
r90,s7,t8,Some notations were not clearly described in the text even though it was in the table.,actionable,shortcoming
r90,s8,t8,"The paper says Deep Voice 2 (Arik et al., 2017a) is only prior work for multi-speaker TTS.",non_actionable,fact
r90,s9,t8,"3. Why do you think your model is better than VCTK test split, and even VCTK85 is better than VCTK101?",non_actionable,question
r97,s0,t2,A new method for weight quantization.,non_actionable,fact
r97,s1,t2,"A step in the right direction, with interesting results, but not a huge level of novelty.",non_actionable,fact
r97,s2,t2,"This paper proposes a new method to train DNNs with quantized weights, by including the quantization as a constraint in a proximal quasi-Newton algorithm, which simultaneously learns a scaling for the quantized values (possibly different for positive and negative weights).",non_actionable,fact
r97,s3,t2,"The paper is very clearly written, and the proposal is very well placed in the context of previous methods for the same purpose.",non_actionable,agreement
r97,s4,t2,The experiments are very clearly presented and solidly designed.,non_actionable,agreement
r97,s5,t2,"In fact, the paper is a somewhat simple extension of the method proposed by Hou, Yao, and Kwok (2017), which is where the novelty resides.",non_actionable,fact
r97,s6,t2,"Consequently, there is not a great degree of novelty in terms of the proposed method, and the results are only slightly better than those of previous methods.",non_actionable,shortcoming
r97,s7,t2,"Finally, in terms of analysis of the algorithm, the authors simply invoke a theorem from Hou, Yao, and Kwok (2017), which claims convergence of the proposed algorithm.",non_actionable,shortcoming
r97,s8,t2,"However, what is shown in that paper is that the sequence of loss function values converges, which does not imply that the sequence of weight estimates also converges, because of the presence of a non-convex constraint ($b_j^t \in Q^{n_l}$).",non_actionable,shortcoming
r97,s9,t2,"This may not be relevant for the practical results, but to be accurate, it can't be simply stated that the algorithm converges, without a more careful analysis.",actionable,shortcoming
r97,s0,t8,A new method for weight quantization.,non_actionable,fact
r97,s1,t8,"A step in the right direction, with interesting results, but not a huge level of novelty.",non_actionable,agreement
r97,s2,t8,"This paper proposes a new method to train DNNs with quantized weights, by including the quantization as a constraint in a proximal quasi-Newton algorithm, which simultaneously learns a scaling for the quantized values (possibly different for positive and negative weights).",non_actionable,fact
r97,s3,t8,"The paper is very clearly written, and the proposal is very well placed in the context of previous methods for the same purpose.",non_actionable,agreement
r97,s4,t8,The experiments are very clearly presented and solidly designed.,non_actionable,agreement
r97,s5,t8,"In fact, the paper is a somewhat simple extension of the method proposed by Hou, Yao, and Kwok (2017), which is where the novelty resides.",non_actionable,fact
r97,s6,t8,"Consequently, there is not a great degree of novelty in terms of the proposed method, and the results are only slightly better than those of previous methods.",actionable,shortcoming
r97,s7,t8,"Finally, in terms of analysis of the algorithm, the authors simply invoke a theorem from Hou, Yao, and Kwok (2017), which claims convergence of the proposed algorithm.",non_actionable,fact
r97,s8,t8,"However, what is shown in that paper is that the sequence of loss function values converges, which does not imply that the sequence of weight estimates also converges, because of the presence of a non-convex constraint ($b_j^t \in Q^{n_l}$).",non_actionable,fact
r97,s9,t8,"This may not be relevant for the practical results, but to be accurate, it can't be simply stated that the algorithm converges, without a more careful analysis.",actionable,shortcoming
r97,s0,t16,A new method for weight quantization.,non_actionable,fact
r97,s1,t16,"A step in the right direction, with interesting results, but not a huge level of novelty.",non_actionable,fact
r97,s2,t16,"This paper proposes a new method to train DNNs with quantized weights, by including the quantization as a constraint in a proximal quasi-Newton algorithm, which simultaneously learns a scaling for the quantized values (possibly different for positive and negative weights).",non_actionable,fact
r97,s3,t16,"The paper is very clearly written, and the proposal is very well placed in the context of previous methods for the same purpose.",non_actionable,agreement
r97,s4,t16,The experiments are very clearly presented and solidly designed.,non_actionable,agreement
r97,s5,t16,"In fact, the paper is a somewhat simple extension of the method proposed by Hou, Yao, and Kwok (2017), which is where the novelty resides.",non_actionable,fact
r97,s6,t16,"Consequently, there is not a great degree of novelty in terms of the proposed method, and the results are only slightly better than those of previous methods.",actionable,shortcoming
r97,s7,t16,"Finally, in terms of analysis of the algorithm, the authors simply invoke a theorem from Hou, Yao, and Kwok (2017), which claims convergence of the proposed algorithm.",actionable,shortcoming
r97,s8,t16,"However, what is shown in that paper is that the sequence of loss function values converges, which does not imply that the sequence of weight estimates also converges, because of the presence of a non-convex constraint ($b_j^t \in Q^{n_l}$).",actionable,shortcoming
r97,s9,t16,"This may not be relevant for the practical results, but to be accurate, it can't be simply stated that the algorithm converges, without a more careful analysis.",actionable,suggestion
r97,s0,t23,A new method for weight quantization.,non_actionable,fact
r97,s1,t23,"A step in the right direction, with interesting results, but not a huge level of novelty.",non_actionable,agreement
r97,s2,t23,"This paper proposes a new method to train DNNs with quantized weights, by including the quantization as a constraint in a proximal quasi-Newton algorithm, which simultaneously learns a scaling for the quantized values (possibly different for positive and negative weights).",non_actionable,fact
r97,s3,t23,"The paper is very clearly written, and the proposal is very well placed in the context of previous methods for the same purpose.",non_actionable,agreement
r97,s4,t23,The experiments are very clearly presented and solidly designed.,non_actionable,agreement
r97,s5,t23,"In fact, the paper is a somewhat simple extension of the method proposed by Hou, Yao, and Kwok (2017), which is where the novelty resides.",non_actionable,fact
r97,s6,t23,"Consequently, there is not a great degree of novelty in terms of the proposed method, and the results are only slightly better than those of previous methods.",non_actionable,shortcoming
r97,s7,t23,"Finally, in terms of analysis of the algorithm, the authors simply invoke a theorem from Hou, Yao, and Kwok (2017), which claims convergence of the proposed algorithm.",actionable,fact
r97,s8,t23,"However, what is shown in that paper is that the sequence of loss function values converges, which does not imply that the sequence of weight estimates also converges, because of the presence of a non-convex constraint ($b_j^t \in Q^{n_l}$).",non_actionable,shortcoming
r97,s9,t23,"This may not be relevant for the practical results, but to be accurate, it can't be simply stated that the algorithm converges, without a more careful analysis.",actionable,suggestion
r97,s0,t20,A new method for weight quantization.,non_actionable,fact
r97,s1,t20,"A step in the right direction, with interesting results, but not a huge level of novelty.",non_actionable,fact
r97,s2,t20,"This paper proposes a new method to train DNNs with quantized weights, by including the quantization as a constraint in a proximal quasi-Newton algorithm, which simultaneously learns a scaling for the quantized values (possibly different for positive and negative weights).",non_actionable,fact
r97,s3,t20,"The paper is very clearly written, and the proposal is very well placed in the context of previous methods for the same purpose.",non_actionable,agreement
r97,s4,t20,The experiments are very clearly presented and solidly designed.,non_actionable,agreement
r97,s5,t20,"In fact, the paper is a somewhat simple extension of the method proposed by Hou, Yao, and Kwok (2017), which is where the novelty resides.",non_actionable,fact
r97,s6,t20,"Consequently, there is not a great degree of novelty in terms of the proposed method, and the results are only slightly better than those of previous methods.",non_actionable,shortcoming
r97,s7,t20,"Finally, in terms of analysis of the algorithm, the authors simply invoke a theorem from Hou, Yao, and Kwok (2017), which claims convergence of the proposed algorithm.",non_actionable,fact
r97,s8,t20,"However, what is shown in that paper is that the sequence of loss function values converges, which does not imply that the sequence of weight estimates also converges, because of the presence of a non-convex constraint ($b_j^t \in Q^{n_l}$).",non_actionable,shortcoming
r97,s9,t20,"This may not be relevant for the practical results, but to be accurate, it can't be simply stated that the algorithm converges, without a more careful analysis.",non_actionable,shortcoming
r14,s0,t31,A paper with interesting ideas but lacking convincing evidence Summary: The authors proposed an unsupervised time series clustering methods built with deep neural networks.,actionable,shortcoming
r14,s1,t31,"First, the encoder employs CNN to shorten the time series and extract local temporal features, and the CNN is followed by bidirectional LSTMs to get the encoded representations.",non_actionable,fact
r14,s2,t31,A temporal clustering model and a DCNN decoder are applied on the encoded representations and jointly trained.,non_actionable,fact
r14,s3,t31,An additional heatmap generator component can be further included in the clustering model.,actionable,suggestion
r14,s4,t31,Detailed comments: The problem of unsupervised time series clustering is important and challenging.,non_actionable,fact
r14,s5,t31,The idea of utilizing deep learning models to learn encoded representations for clustering is interesting and could be a promising solution.,actionable,agreement
r14,s6,t31,"For example, what is the size of each layer and the dimension of the encoded space?",actionable,shortcoming
r14,s7,t31,How does the model combine the heatmap output (which is a sequence of the same length as the time series) and the clustering output (which is a vector of size K) in Figure 1?,actionable,shortcoming
r14,s8,t31,How do we interpret the generated heatmap?,actionable,shortcoming
r14,s9,t31,"For example, in Figure 4, all 4 DTC-methods achieved the best performance on one or two datasets.",actionable,shortcoming
r14,s0,t10,A paper with interesting ideas but lacking convincing evidence Summary: The authors proposed an unsupervised time series clustering methods built with deep neural networks.,actionable,shortcoming
r14,s1,t10,"First, the encoder employs CNN to shorten the time series and extract local temporal features, and the CNN is followed by bidirectional LSTMs to get the encoded representations.",non_actionable,other
r14,s2,t10,A temporal clustering model and a DCNN decoder are applied on the encoded representations and jointly trained.,non_actionable,other
r14,s3,t10,An additional heatmap generator component can be further included in the clustering model.,non_actionable,other
r14,s4,t10,Detailed comments: The problem of unsupervised time series clustering is important and challenging.,actionable,shortcoming
r14,s5,t10,The idea of utilizing deep learning models to learn encoded representations for clustering is interesting and could be a promising solution.,actionable,agreement
r14,s6,t10,"For example, what is the size of each layer and the dimension of the encoded space?",actionable,question
r14,s7,t10,How does the model combine the heatmap output (which is a sequence of the same length as the time series) and the clustering output (which is a vector of size K) in Figure 1?,actionable,question
r14,s8,t10,How do we interpret the generated heatmap?,actionable,question
r14,s9,t10,"For example, in Figure 4, all 4 DTC-methods achieved the best performance on one or two datasets.",non_actionable,other
r14,s0,t2,A paper with interesting ideas but lacking convincing evidence Summary: The authors proposed an unsupervised time series clustering methods built with deep neural networks.,non_actionable,fact
r14,s1,t2,"First, the encoder employs CNN to shorten the time series and extract local temporal features, and the CNN is followed by bidirectional LSTMs to get the encoded representations.",non_actionable,fact
r14,s2,t2,A temporal clustering model and a DCNN decoder are applied on the encoded representations and jointly trained.,non_actionable,fact
r14,s3,t2,An additional heatmap generator component can be further included in the clustering model.,actionable,suggestion
r14,s4,t2,Detailed comments: The problem of unsupervised time series clustering is important and challenging.,non_actionable,fact
r14,s5,t2,The idea of utilizing deep learning models to learn encoded representations for clustering is interesting and could be a promising solution.,non_actionable,agreement
r14,s6,t2,"For example, what is the size of each layer and the dimension of the encoded space?",actionable,question
r14,s7,t2,How does the model combine the heatmap output (which is a sequence of the same length as the time series) and the clustering output (which is a vector of size K) in Figure 1?,actionable,question
r14,s8,t2,How do we interpret the generated heatmap?,actionable,question
r14,s9,t2,"For example, in Figure 4, all 4 DTC-methods achieved the best performance on one or two datasets.",non_actionable,fact
r14,s0,t16,A paper with interesting ideas but lacking convincing evidence Summary: The authors proposed an unsupervised time series clustering methods built with deep neural networks.,actionable,shortcoming
r14,s1,t16,"First, the encoder employs CNN to shorten the time series and extract local temporal features, and the CNN is followed by bidirectional LSTMs to get the encoded representations.",non_actionable,fact
r14,s2,t16,A temporal clustering model and a DCNN decoder are applied on the encoded representations and jointly trained.,non_actionable,fact
r14,s3,t16,An additional heatmap generator component can be further included in the clustering model.,actionable,suggestion
r14,s4,t16,Detailed comments: The problem of unsupervised time series clustering is important and challenging.,non_actionable,fact
r14,s5,t16,The idea of utilizing deep learning models to learn encoded representations for clustering is interesting and could be a promising solution.,non_actionable,agreement
r14,s6,t16,"For example, what is the size of each layer and the dimension of the encoded space?",actionable,question
r14,s7,t16,How does the model combine the heatmap output (which is a sequence of the same length as the time series) and the clustering output (which is a vector of size K) in Figure 1?,actionable,question
r14,s8,t16,How do we interpret the generated heatmap?,actionable,question
r14,s9,t16,"For example, in Figure 4, all 4 DTC-methods achieved the best performance on one or two datasets.",non_actionable,fact
r14,s0,t20,A paper with interesting ideas but lacking convincing evidence Summary: The authors proposed an unsupervised time series clustering methods built with deep neural networks.,non_actionable,fact
r14,s1,t20,"First, the encoder employs CNN to shorten the time series and extract local temporal features, and the CNN is followed by bidirectional LSTMs to get the encoded representations.",non_actionable,fact
r14,s2,t20,A temporal clustering model and a DCNN decoder are applied on the encoded representations and jointly trained.,non_actionable,fact
r14,s3,t20,An additional heatmap generator component can be further included in the clustering model.,actionable,suggestion
r14,s4,t20,Detailed comments: The problem of unsupervised time series clustering is important and challenging.,non_actionable,fact
r14,s5,t20,The idea of utilizing deep learning models to learn encoded representations for clustering is interesting and could be a promising solution.,non_actionable,fact
r14,s6,t20,"For example, what is the size of each layer and the dimension of the encoded space?",actionable,question
r14,s7,t20,How does the model combine the heatmap output (which is a sequence of the same length as the time series) and the clustering output (which is a vector of size K) in Figure 1?,actionable,question
r14,s8,t20,How do we interpret the generated heatmap?,actionable,question
r14,s9,t20,"For example, in Figure 4, all 4 DTC-methods achieved the best performance on one or two datasets.",non_actionable,fact
r112,s0,t8,A promising approach on nonparametric modelling of partial differential equations with deep architectures that requires more details.,actionable,shortcoming
r112,s1,t8,This paper addresses complex dynamical systems modelling through nonparametric Partial Differential Equations using neural architectures.,non_actionable,fact
r112,s2,t8,The most important idea of the papier (PDE-net) is to learn both differential operators and the function that governs the PDE.,non_actionable,fact
r112,s3,t8,"To achieve this goal, the approach relies on the approximation of differential operators by convolution of filters of appropriate order.",non_actionable,fact
r112,s4,t8,This is really the strongest point of the paper.,non_actionable,agreement
r112,s5,t8,"Moreover, a basic system called delta t block implements one level of full approximation and is stoked several times.",non_actionable,fact
r112,s6,t8,"Comments: The paper is badly structured and is sometimes hard to read because it does not present in a linear way the classic ingredients of Machine Learning, expression of the full function to be estimated, equations of each layer, description of the set of parameters to be learned and the loss function.",actionable,shortcoming
r112,s7,t8,"About the loss function, I was surprised not to see a sparsity constraint on the different filters in order to select the order of the differential operators themselves.",actionable,shortcoming
r112,s8,t8,I also found difficult to measure the degree of novelty of the approach considering the recent works and the related work section should have been much more precise in terms of comparison.,actionable,shortcoming
r112,s9,t8,"Finally, I’ve found the paper very interesting and promising but regarding the standard of scientific publication, it requires additional attention to provide a better description the model and discuss the learning scheme to get a strongest and reproducible approach.",actionable,suggestion
r112,s0,t2,A promising approach on nonparametric modelling of partial differential equations with deep architectures that requires more details.,non_actionable,fact
r112,s1,t2,This paper addresses complex dynamical systems modelling through nonparametric Partial Differential Equations using neural architectures.,non_actionable,fact
r112,s2,t2,The most important idea of the papier (PDE-net) is to learn both differential operators and the function that governs the PDE.,non_actionable,fact
r112,s3,t2,"To achieve this goal, the approach relies on the approximation of differential operators by convolution of filters of appropriate order.",non_actionable,fact
r112,s4,t2,This is really the strongest point of the paper.,non_actionable,fact
r112,s5,t2,"Moreover, a basic system called delta t block implements one level of full approximation and is stoked several times.",non_actionable,fact
r112,s6,t2,"Comments: The paper is badly structured and is sometimes hard to read because it does not present in a linear way the classic ingredients of Machine Learning, expression of the full function to be estimated, equations of each layer, description of the set of parameters to be learned and the loss function.",actionable,shortcoming
r112,s7,t2,"About the loss function, I was surprised not to see a sparsity constraint on the different filters in order to select the order of the differential operators themselves.",actionable,shortcoming
r112,s8,t2,I also found difficult to measure the degree of novelty of the approach considering the recent works and the related work section should have been much more precise in terms of comparison.,actionable,shortcoming
r112,s9,t2,"Finally, I’ve found the paper very interesting and promising but regarding the standard of scientific publication, it requires additional attention to provide a better description the model and discuss the learning scheme to get a strongest and reproducible approach.",actionable,suggestion
r112,s0,t10,A promising approach on nonparametric modelling of partial differential equations with deep architectures that requires more details.,actionable,shortcoming
r112,s1,t10,This paper addresses complex dynamical systems modelling through nonparametric Partial Differential Equations using neural architectures.,non_actionable,other
r112,s2,t10,The most important idea of the papier (PDE-net) is to learn both differential operators and the function that governs the PDE.,non_actionable,other
r112,s3,t10,"To achieve this goal, the approach relies on the approximation of differential operators by convolution of filters of appropriate order.",non_actionable,other
r112,s4,t10,This is really the strongest point of the paper.,actionable,agreement
r112,s5,t10,"Moreover, a basic system called delta t block implements one level of full approximation and is stoked several times.",non_actionable,other
r112,s6,t10,"Comments: The paper is badly structured and is sometimes hard to read because it does not present in a linear way the classic ingredients of Machine Learning, expression of the full function to be estimated, equations of each layer, description of the set of parameters to be learned and the loss function.",actionable,shortcoming
r112,s7,t10,"About the loss function, I was surprised not to see a sparsity constraint on the different filters in order to select the order of the differential operators themselves.",actionable,fact
r112,s8,t10,I also found difficult to measure the degree of novelty of the approach considering the recent works and the related work section should have been much more precise in terms of comparison.,actionable,fact
r112,s9,t10,"Finally, I’ve found the paper very interesting and promising but regarding the standard of scientific publication, it requires additional attention to provide a better description the model and discuss the learning scheme to get a strongest and reproducible approach.",actionable,suggestion
r112,s0,t20,A promising approach on nonparametric modelling of partial differential equations with deep architectures that requires more details.,non_actionable,agreement
r112,s1,t20,This paper addresses complex dynamical systems modelling through nonparametric Partial Differential Equations using neural architectures.,non_actionable,fact
r112,s2,t20,The most important idea of the papier (PDE-net) is to learn both differential operators and the function that governs the PDE.,non_actionable,fact
r112,s3,t20,"To achieve this goal, the approach relies on the approximation of differential operators by convolution of filters of appropriate order.",non_actionable,fact
r112,s4,t20,This is really the strongest point of the paper.,non_actionable,agreement
r112,s5,t20,"Moreover, a basic system called delta t block implements one level of full approximation and is stoked several times.",non_actionable,fact
r112,s6,t20,"Comments: The paper is badly structured and is sometimes hard to read because it does not present in a linear way the classic ingredients of Machine Learning, expression of the full function to be estimated, equations of each layer, description of the set of parameters to be learned and the loss function.",actionable,shortcoming
r112,s7,t20,"About the loss function, I was surprised not to see a sparsity constraint on the different filters in order to select the order of the differential operators themselves.",actionable,shortcoming
r112,s8,t20,I also found difficult to measure the degree of novelty of the approach considering the recent works and the related work section should have been much more precise in terms of comparison.,actionable,suggestion
r112,s9,t20,"Finally, I’ve found the paper very interesting and promising but regarding the standard of scientific publication, it requires additional attention to provide a better description the model and discuss the learning scheme to get a strongest and reproducible approach.",actionable,suggestion
r112,s0,t16,A promising approach on nonparametric modelling of partial differential equations with deep architectures that requires more details.,non_actionable,agreement
r112,s1,t16,This paper addresses complex dynamical systems modelling through nonparametric Partial Differential Equations using neural architectures.,non_actionable,fact
r112,s2,t16,The most important idea of the papier (PDE-net) is to learn both differential operators and the function that governs the PDE.,non_actionable,fact
r112,s3,t16,"To achieve this goal, the approach relies on the approximation of differential operators by convolution of filters of appropriate order.",non_actionable,fact
r112,s4,t16,This is really the strongest point of the paper.,non_actionable,agreement
r112,s5,t16,"Moreover, a basic system called delta t block implements one level of full approximation and is stoked several times.",non_actionable,fact
r112,s6,t16,"Comments: The paper is badly structured and is sometimes hard to read because it does not present in a linear way the classic ingredients of Machine Learning, expression of the full function to be estimated, equations of each layer, description of the set of parameters to be learned and the loss function.",actionable,shortcoming
r112,s7,t16,"About the loss function, I was surprised not to see a sparsity constraint on the different filters in order to select the order of the differential operators themselves.",actionable,shortcoming
r112,s8,t16,I also found difficult to measure the degree of novelty of the approach considering the recent works and the related work section should have been much more precise in terms of comparison.,actionable,shortcoming
r112,s9,t16,"Finally, I’ve found the paper very interesting and promising but regarding the standard of scientific publication, it requires additional attention to provide a better description the model and discuss the learning scheme to get a strongest and reproducible approach.",actionable,shortcoming
r65,s0,t10,"A thorough exploration of techniques for unsupervised translation, a very strong start for this problem This paper describes an approach to train a neural machine translation system without parallel data.",actionable,agreement
r65,s1,t10,"Starting from a word-to-word translation lexicon, which was also learned with unsupervised methods, this approach combines a denoising auto-encoder objective with a back-translation objective, both in two translation directions, with an adversarial objective that attempts to fool a discriminator that detects the source language of an encoded sentence.",non_actionable,other
r65,s2,t10,"These five objectives together are sufficient to achieve impressive English <-> German and Engish <-> French results in Multi30k, a bilingual image caption scenario with short simple sentences, and to achieve a strong start for a standard WMT scenario.",actionable,agreement
r65,s3,t10,And it is genuinely impressive to see all these pieces come together into something that translates substantially better than a word-to-word baseline.,actionable,agreement
r65,s4,t10,But the aspect I like most about this paper is the experimental analysis.,actionable,fact
r65,s5,t10,"Considering that this is a big, complicated system, it is crucial that the authors included both an ablation experiment to see which pieces were most important, and an experiment that indicates the amount of labeled data that would be required to achieve the same results with a supervised system.",actionable,fact
r65,s6,t10,"I am glad you take the time to give your model selection criterion it's own section in 3.2, as it does seem to be an important part of this puzzle.",actionable,fact
r65,s7,t10,"In the first paragraph of Section 4.5, I disagree with the sentence, ""Similar observations can be made for the other language pairs we considered.""",actionable,disagreement
r65,s8,t10,"In fact, I would go so far as to say that the English to French scenario described in that paragraph is a notable outlier, in that it is the other language pair where you beat the oracle re-ordering baseline in both Multi30k and WMT.",actionable,disagreement
r65,s9,t10,"When citing Shen et al., 2017, consider also mentioning the following: Controllable Invariance through Adversarial Feature Learning; Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig; NIPS 2017; https://arxiv.org/abs/1705.11122 Response read -- thanks.",actionable,suggestion
r65,s0,t20,"A thorough exploration of techniques for unsupervised translation, a very strong start for this problem This paper describes an approach to train a neural machine translation system without parallel data.",non_actionable,agreement
r65,s1,t20,"Starting from a word-to-word translation lexicon, which was also learned with unsupervised methods, this approach combines a denoising auto-encoder objective with a back-translation objective, both in two translation directions, with an adversarial objective that attempts to fool a discriminator that detects the source language of an encoded sentence.",non_actionable,fact
r65,s2,t20,"These five objectives together are sufficient to achieve impressive English <-> German and Engish <-> French results in Multi30k, a bilingual image caption scenario with short simple sentences, and to achieve a strong start for a standard WMT scenario.",non_actionable,fact
r65,s3,t20,And it is genuinely impressive to see all these pieces come together into something that translates substantially better than a word-to-word baseline.,non_actionable,agreement
r65,s4,t20,But the aspect I like most about this paper is the experimental analysis.,non_actionable,agreement
r65,s5,t20,"Considering that this is a big, complicated system, it is crucial that the authors included both an ablation experiment to see which pieces were most important, and an experiment that indicates the amount of labeled data that would be required to achieve the same results with a supervised system.",actionable,suggestion
r65,s6,t20,"I am glad you take the time to give your model selection criterion it's own section in 3.2, as it does seem to be an important part of this puzzle.",non_actionable,agreement
r65,s7,t20,"In the first paragraph of Section 4.5, I disagree with the sentence, ""Similar observations can be made for the other language pairs we considered.""",actionable,disagreement
r65,s8,t20,"In fact, I would go so far as to say that the English to French scenario described in that paragraph is a notable outlier, in that it is the other language pair where you beat the oracle re-ordering baseline in both Multi30k and WMT.",actionable,shortcoming
r65,s9,t20,"When citing Shen et al., 2017, consider also mentioning the following: Controllable Invariance through Adversarial Feature Learning; Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig; NIPS 2017; https://arxiv.org/abs/1705.11122 Response read -- thanks.",actionable,suggestion
r65,s0,t31,"A thorough exploration of techniques for unsupervised translation, a very strong start for this problem This paper describes an approach to train a neural machine translation system without parallel data.",non_actionable,fact
r65,s1,t31,"Starting from a word-to-word translation lexicon, which was also learned with unsupervised methods, this approach combines a denoising auto-encoder objective with a back-translation objective, both in two translation directions, with an adversarial objective that attempts to fool a discriminator that detects the source language of an encoded sentence.",non_actionable,fact
r65,s2,t31,"These five objectives together are sufficient to achieve impressive English <-> German and Engish <-> French results in Multi30k, a bilingual image caption scenario with short simple sentences, and to achieve a strong start for a standard WMT scenario.",non_actionable,fact
r65,s3,t31,And it is genuinely impressive to see all these pieces come together into something that translates substantially better than a word-to-word baseline.,non_actionable,agreement
r65,s4,t31,But the aspect I like most about this paper is the experimental analysis.,non_actionable,agreement
r65,s5,t31,"Considering that this is a big, complicated system, it is crucial that the authors included both an ablation experiment to see which pieces were most important, and an experiment that indicates the amount of labeled data that would be required to achieve the same results with a supervised system.",non_actionable,agreement
r65,s6,t31,"I am glad you take the time to give your model selection criterion it's own section in 3.2, as it does seem to be an important part of this puzzle.",non_actionable,agreement
r65,s7,t31,"In the first paragraph of Section 4.5, I disagree with the sentence, ""Similar observations can be made for the other language pairs we considered.""",non_actionable,fact
r65,s8,t31,"In fact, I would go so far as to say that the English to French scenario described in that paragraph is a notable outlier, in that it is the other language pair where you beat the oracle re-ordering baseline in both Multi30k and WMT.",non_actionable,fact
r65,s9,t31,"When citing Shen et al., 2017, consider also mentioning the following: Controllable Invariance through Adversarial Feature Learning; Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig; NIPS 2017; https://arxiv.org/abs/1705.11122 Response read -- thanks.",actionable,suggestion
r65,s0,t30,"A thorough exploration of techniques for unsupervised translation, a very strong start for this problem This paper describes an approach to train a neural machine translation system without parallel data.",non_actionable,agreement
r65,s1,t30,"Starting from a word-to-word translation lexicon, which was also learned with unsupervised methods, this approach combines a denoising auto-encoder objective with a back-translation objective, both in two translation directions, with an adversarial objective that attempts to fool a discriminator that detects the source language of an encoded sentence.",non_actionable,fact
r65,s2,t30,"These five objectives together are sufficient to achieve impressive English <-> German and Engish <-> French results in Multi30k, a bilingual image caption scenario with short simple sentences, and to achieve a strong start for a standard WMT scenario.",non_actionable,fact
r65,s3,t30,And it is genuinely impressive to see all these pieces come together into something that translates substantially better than a word-to-word baseline.,non_actionable,agreement
r65,s4,t30,But the aspect I like most about this paper is the experimental analysis.,non_actionable,agreement
r65,s5,t30,"Considering that this is a big, complicated system, it is crucial that the authors included both an ablation experiment to see which pieces were most important, and an experiment that indicates the amount of labeled data that would be required to achieve the same results with a supervised system.",actionable,suggestion
r65,s6,t30,"I am glad you take the time to give your model selection criterion it's own section in 3.2, as it does seem to be an important part of this puzzle.",non_actionable,agreement
r65,s7,t30,"In the first paragraph of Section 4.5, I disagree with the sentence, ""Similar observations can be made for the other language pairs we considered.""",actionable,disagreement
r65,s8,t30,"In fact, I would go so far as to say that the English to French scenario described in that paragraph is a notable outlier, in that it is the other language pair where you beat the oracle re-ordering baseline in both Multi30k and WMT.",actionable,suggestion
r65,s9,t30,"When citing Shen et al., 2017, consider also mentioning the following: Controllable Invariance through Adversarial Feature Learning; Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig; NIPS 2017; https://arxiv.org/abs/1705.11122 Response read -- thanks.",actionable,suggestion
r65,s0,t2,"A thorough exploration of techniques for unsupervised translation, a very strong start for this problem This paper describes an approach to train a neural machine translation system without parallel data.",non_actionable,agreement
r65,s1,t2,"Starting from a word-to-word translation lexicon, which was also learned with unsupervised methods, this approach combines a denoising auto-encoder objective with a back-translation objective, both in two translation directions, with an adversarial objective that attempts to fool a discriminator that detects the source language of an encoded sentence.",non_actionable,fact
r65,s2,t2,"These five objectives together are sufficient to achieve impressive English <-> German and Engish <-> French results in Multi30k, a bilingual image caption scenario with short simple sentences, and to achieve a strong start for a standard WMT scenario.",non_actionable,agreement
r65,s3,t2,And it is genuinely impressive to see all these pieces come together into something that translates substantially better than a word-to-word baseline.,non_actionable,agreement
r65,s4,t2,But the aspect I like most about this paper is the experimental analysis.,non_actionable,agreement
r65,s5,t2,"Considering that this is a big, complicated system, it is crucial that the authors included both an ablation experiment to see which pieces were most important, and an experiment that indicates the amount of labeled data that would be required to achieve the same results with a supervised system.",non_actionable,agreement
r65,s6,t2,"I am glad you take the time to give your model selection criterion it's own section in 3.2, as it does seem to be an important part of this puzzle.",non_actionable,agreement
r65,s7,t2,"In the first paragraph of Section 4.5, I disagree with the sentence, ""Similar observations can be made for the other language pairs we considered.""",non_actionable,disagreement
r65,s8,t2,"In fact, I would go so far as to say that the English to French scenario described in that paragraph is a notable outlier, in that it is the other language pair where you beat the oracle re-ordering baseline in both Multi30k and WMT.",non_actionable,disagreement
r65,s9,t2,"When citing Shen et al., 2017, consider also mentioning the following: Controllable Invariance through Adversarial Feature Learning; Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig; NIPS 2017; https://arxiv.org/abs/1705.11122 Response read -- thanks.",actionable,suggestion
r73,s0,t20,"A well written, clear paper presenting a novel representation of graphs as multi-channel image-like structures from their node embeddings.",non_actionable,agreement
r73,s1,t20,The paper presents a novel representation of graphs as multi-channel image-like structures.,non_actionable,agreement
r73,s2,t20,These structures are extrapolated by,non_actionable,fact
r73,s3,t20,1) mapping the graph nodes into an embedding using an algorithm like node2vec,non_actionable,fact
r73,s4,t20,2) compressing the embedding space using pca,non_actionable,fact
r73,s5,t20,3) and extracting 2D slices from the compressed space and computing 2D histograms per slice.,non_actionable,fact
r73,s6,t20,he resulting multi-channel image-like structures are then feed into vanilla 2D CNN.,non_actionable,fact
r73,s7,t20,"The papers is well written and clear, and proposes an interesting idea of representing graphs as multi-channel image-like structures.",non_actionable,agreement
r73,s8,t20,"Furthermore, the authors perform experiments with real graph datasets from the social science domain and a comparison with the SoA method both graph kernels and deep learning architectures.",non_actionable,agreement
r73,s9,t20,"The proposed algorithm in 3 out of 5 datasets, two of theme with statistical significant.",non_actionable,fact
r73,s0,t31,"A well written, clear paper presenting a novel representation of graphs as multi-channel image-like structures from their node embeddings.",non_actionable,fact
r73,s1,t31,The paper presents a novel representation of graphs as multi-channel image-like structures.,non_actionable,fact
r73,s2,t31,These structures are extrapolated by,non_actionable,fact
r73,s3,t31,1) mapping the graph nodes into an embedding using an algorithm like node2vec,non_actionable,fact
r73,s4,t31,2) compressing the embedding space using pca,non_actionable,fact
r73,s5,t31,3) and extracting 2D slices from the compressed space and computing 2D histograms per slice.,non_actionable,fact
r73,s6,t31,he resulting multi-channel image-like structures are then feed into vanilla 2D CNN.,non_actionable,fact
r73,s7,t31,"The papers is well written and clear, and proposes an interesting idea of representing graphs as multi-channel image-like structures.",non_actionable,agreement
r73,s8,t31,"Furthermore, the authors perform experiments with real graph datasets from the social science domain and a comparison with the SoA method both graph kernels and deep learning architectures.",non_actionable,fact
r73,s9,t31,"The proposed algorithm in 3 out of 5 datasets, two of theme with statistical significant.",non_actionable,fact
r73,s0,t16,"A well written, clear paper presenting a novel representation of graphs as multi-channel image-like structures from their node embeddings.",non_actionable,agreement
r73,s1,t16,The paper presents a novel representation of graphs as multi-channel image-like structures.,non_actionable,fact
r73,s2,t16,These structures are extrapolated by,non_actionable,fact
r73,s3,t16,1) mapping the graph nodes into an embedding using an algorithm like node2vec,non_actionable,fact
r73,s4,t16,2) compressing the embedding space using pca,non_actionable,fact
r73,s5,t16,3) and extracting 2D slices from the compressed space and computing 2D histograms per slice.,non_actionable,fact
r73,s6,t16,he resulting multi-channel image-like structures are then feed into vanilla 2D CNN.,non_actionable,fact
r73,s7,t16,"The papers is well written and clear, and proposes an interesting idea of representing graphs as multi-channel image-like structures.",non_actionable,agreement
r73,s8,t16,"Furthermore, the authors perform experiments with real graph datasets from the social science domain and a comparison with the SoA method both graph kernels and deep learning architectures.",non_actionable,fact
r73,s9,t16,"The proposed algorithm in 3 out of 5 datasets, two of theme with statistical significant.",non_actionable,fact
r73,s0,t10,"A well written, clear paper presenting a novel representation of graphs as multi-channel image-like structures from their node embeddings.",actionable,agreement
r73,s1,t10,The paper presents a novel representation of graphs as multi-channel image-like structures.,non_actionable,other
r73,s2,t10,These structures are extrapolated by,non_actionable,other
r73,s3,t10,1) mapping the graph nodes into an embedding using an algorithm like node2vec,non_actionable,other
r73,s4,t10,2) compressing the embedding space using pca,non_actionable,other
r73,s5,t10,3) and extracting 2D slices from the compressed space and computing 2D histograms per slice.,non_actionable,other
r73,s6,t10,he resulting multi-channel image-like structures are then feed into vanilla 2D CNN.,non_actionable,other
r73,s7,t10,"The papers is well written and clear, and proposes an interesting idea of representing graphs as multi-channel image-like structures.",actionable,agreement
r73,s8,t10,"Furthermore, the authors perform experiments with real graph datasets from the social science domain and a comparison with the SoA method both graph kernels and deep learning architectures.",non_actionable,other
r73,s9,t10,"The proposed algorithm in 3 out of 5 datasets, two of theme with statistical significant.",non_actionable,other
r73,s0,t2,"A well written, clear paper presenting a novel representation of graphs as multi-channel image-like structures from their node embeddings.",non_actionable,agreement
r73,s1,t2,The paper presents a novel representation of graphs as multi-channel image-like structures.,non_actionable,agreement
r73,s2,t2,These structures are extrapolated by,non_actionable,fact
r73,s3,t2,1) mapping the graph nodes into an embedding using an algorithm like node2vec,non_actionable,fact
r73,s4,t2,2) compressing the embedding space using pca,non_actionable,shortcoming
r73,s5,t2,3) and extracting 2D slices from the compressed space and computing 2D histograms per slice.,non_actionable,fact
r73,s6,t2,he resulting multi-channel image-like structures are then feed into vanilla 2D CNN.,non_actionable,fact
r73,s7,t2,"The papers is well written and clear, and proposes an interesting idea of representing graphs as multi-channel image-like structures.",non_actionable,agreement
r73,s8,t2,"Furthermore, the authors perform experiments with real graph datasets from the social science domain and a comparison with the SoA method both graph kernels and deep learning architectures.",non_actionable,fact
r73,s9,t2,"The proposed algorithm in 3 out of 5 datasets, two of theme with statistical significant.",non_actionable,fact
r76,s0,t31,"Also, BF is an odd language to target for program synthesis.",actionable,shortcoming
r76,s1,t31,This paper introduces a method for regularizing the REINFORCE algorithm by keeping around a small set of known high-quality samples as part of the sample set when performing stochastic gradient estimation.,non_actionable,fact
r76,s2,t31,I question the value of program synthesis in a language which is not human-readable.,actionable,shortcoming
r76,s3,t31,"There are no program synthesis examples demonstrating types of functions which perform complex tasks involving e.g. recursion, such as sorting operations.",actionable,shortcoming
r76,s4,t31,"All this said, the priority queue training presented here for reinforcement learning with sparse rewards is interesting, and appears to significantly improve the quality of results from a naive policy gradient approach.",non_actionable,agreement
r76,s5,t31,"It would be nice to provide some sort of analysis of it, even an empirical one.",actionable,suggestion
r76,s6,t31,"For example, how frequently are the entries in the queue updated?",actionable,suggestion
r76,s7,t31,Is this consistent over training time?,actionable,suggestion
r76,s8,t31,"While the paper does demonstrate that PQT is helpful on this very particular task, it makes very little effort to investigate *why* it is helpful, or whether it will usefully generalize to other domains.",actionable,shortcoming
r76,s9,t31,It would also help clarify under what situations one should or should not use this.,actionable,suggestion
r76,s0,t20,"Also, BF is an odd language to target for program synthesis.",non_actionable,shortcoming
r76,s1,t20,This paper introduces a method for regularizing the REINFORCE algorithm by keeping around a small set of known high-quality samples as part of the sample set when performing stochastic gradient estimation.,non_actionable,fact
r76,s2,t20,I question the value of program synthesis in a language which is not human-readable.,non_actionable,shortcoming
r76,s3,t20,"There are no program synthesis examples demonstrating types of functions which perform complex tasks involving e.g. recursion, such as sorting operations.",actionable,shortcoming
r76,s4,t20,"All this said, the priority queue training presented here for reinforcement learning with sparse rewards is interesting, and appears to significantly improve the quality of results from a naive policy gradient approach.",non_actionable,agreement
r76,s5,t20,"It would be nice to provide some sort of analysis of it, even an empirical one.",actionable,suggestion
r76,s6,t20,"For example, how frequently are the entries in the queue updated?",actionable,question
r76,s7,t20,Is this consistent over training time?,actionable,question
r76,s8,t20,"While the paper does demonstrate that PQT is helpful on this very particular task, it makes very little effort to investigate *why* it is helpful, or whether it will usefully generalize to other domains.",actionable,shortcoming
r76,s9,t20,It would also help clarify under what situations one should or should not use this.,actionable,suggestion
r76,s0,t16,"Also, BF is an odd language to target for program synthesis.",actionable,disagreement
r76,s1,t16,This paper introduces a method for regularizing the REINFORCE algorithm by keeping around a small set of known high-quality samples as part of the sample set when performing stochastic gradient estimation.,non_actionable,fact
r76,s2,t16,I question the value of program synthesis in a language which is not human-readable.,actionable,disagreement
r76,s3,t16,"There are no program synthesis examples demonstrating types of functions which perform complex tasks involving e.g. recursion, such as sorting operations.",non_actionable,fact
r76,s4,t16,"All this said, the priority queue training presented here for reinforcement learning with sparse rewards is interesting, and appears to significantly improve the quality of results from a naive policy gradient approach.",non_actionable,agreement
r76,s5,t16,"It would be nice to provide some sort of analysis of it, even an empirical one.",actionable,suggestion
r76,s6,t16,"For example, how frequently are the entries in the queue updated?",actionable,question
r76,s7,t16,Is this consistent over training time?,actionable,question
r76,s8,t16,"While the paper does demonstrate that PQT is helpful on this very particular task, it makes very little effort to investigate *why* it is helpful, or whether it will usefully generalize to other domains.",actionable,shortcoming
r76,s9,t16,It would also help clarify under what situations one should or should not use this.,actionable,suggestion
r76,s0,t2,"Also, BF is an odd language to target for program synthesis.",non_actionable,fact
r76,s1,t2,This paper introduces a method for regularizing the REINFORCE algorithm by keeping around a small set of known high-quality samples as part of the sample set when performing stochastic gradient estimation.,non_actionable,fact
r76,s2,t2,I question the value of program synthesis in a language which is not human-readable.,non_actionable,fact
r76,s3,t2,"There are no program synthesis examples demonstrating types of functions which perform complex tasks involving e.g. recursion, such as sorting operations.",actionable,shortcoming
r76,s4,t2,"All this said, the priority queue training presented here for reinforcement learning with sparse rewards is interesting, and appears to significantly improve the quality of results from a naive policy gradient approach.",non_actionable,agreement
r76,s5,t2,"It would be nice to provide some sort of analysis of it, even an empirical one.",actionable,suggestion
r76,s6,t2,"For example, how frequently are the entries in the queue updated?",actionable,question
r76,s7,t2,Is this consistent over training time?,actionable,question
r76,s8,t2,"While the paper does demonstrate that PQT is helpful on this very particular task, it makes very little effort to investigate *why* it is helpful, or whether it will usefully generalize to other domains.",actionable,shortcoming
r76,s9,t2,It would also help clarify under what situations one should or should not use this.,actionable,shortcoming
r76,s0,t8,"Also, BF is an odd language to target for program synthesis.",actionable,shortcoming
r76,s1,t8,This paper introduces a method for regularizing the REINFORCE algorithm by keeping around a small set of known high-quality samples as part of the sample set when performing stochastic gradient estimation.,non_actionable,fact
r76,s2,t8,I question the value of program synthesis in a language which is not human-readable.,actionable,disagreement
r76,s3,t8,"There are no program synthesis examples demonstrating types of functions which perform complex tasks involving e.g. recursion, such as sorting operations.",actionable,shortcoming
r76,s4,t8,"All this said, the priority queue training presented here for reinforcement learning with sparse rewards is interesting, and appears to significantly improve the quality of results from a naive policy gradient approach.",non_actionable,agreement
r76,s5,t8,"It would be nice to provide some sort of analysis of it, even an empirical one.",actionable,suggestion
r76,s6,t8,"For example, how frequently are the entries in the queue updated?",actionable,question
r76,s7,t8,Is this consistent over training time?,actionable,question
r76,s8,t8,"While the paper does demonstrate that PQT is helpful on this very particular task, it makes very little effort to investigate *why* it is helpful, or whether it will usefully generalize to other domains.",actionable,shortcoming
r76,s9,t8,It would also help clarify under what situations one should or should not use this.,non_actionable,fact
r63,s0,t8,An encoder reads in the document to condition the generator which outputs a summary.,non_actionable,fact
r63,s1,t8,An additional GAN loss is used on the generator output to encourage the output to look like summaries -- this procedure only requires unpaired summaries.,non_actionable,fact
r63,s2,t8,The results are that this procedure improves upon the trivial baseline but still significantly underperforms supervised training.,non_actionable,fact
r63,s3,t8,The idea is a simple but useful extension of these previous works.,non_actionable,agreement
r63,s4,t8,"The problem set-up of unpaired summarization is not particularly compelling, since summaries are typically found paired with their original documents.",actionable,shortcoming
r63,s5,t8,"It would be more interesting to see how well it can be used for other textual domains such as translation, where a lot of unpaired data exists (some other submissions to ICLR tackle this problem).",actionable,suggestion
r63,s6,t8,A key baseline that is missing is pretraining the generator as a language model over summaries.,actionable,shortcoming
r63,s7,t8,"Without this baseline, it is hard to tell whether GAN training is even useful.",actionable,shortcoming
r63,s8,t8,Another experiment missing is seeing whether joint supervised-GAN-reconstruction training can outperform purely supervised training.,actionable,shortcoming
r63,s9,t8,"This paper has numerous grammatical and spelling errors throughout the paper (worse, the same errors are copy-pasted everywhere).",actionable,shortcoming
r63,s0,t20,An encoder reads in the document to condition the generator which outputs a summary.,non_actionable,fact
r63,s1,t20,An additional GAN loss is used on the generator output to encourage the output to look like summaries -- this procedure only requires unpaired summaries.,non_actionable,fact
r63,s2,t20,The results are that this procedure improves upon the trivial baseline but still significantly underperforms supervised training.,non_actionable,fact
r63,s3,t20,The idea is a simple but useful extension of these previous works.,non_actionable,fact
r63,s4,t20,"The problem set-up of unpaired summarization is not particularly compelling, since summaries are typically found paired with their original documents.",actionable,shortcoming
r63,s5,t20,"It would be more interesting to see how well it can be used for other textual domains such as translation, where a lot of unpaired data exists (some other submissions to ICLR tackle this problem).",actionable,suggestion
r63,s6,t20,A key baseline that is missing is pretraining the generator as a language model over summaries.,actionable,shortcoming
r63,s7,t20,"Without this baseline, it is hard to tell whether GAN training is even useful.",actionable,shortcoming
r63,s8,t20,Another experiment missing is seeing whether joint supervised-GAN-reconstruction training can outperform purely supervised training.,actionable,shortcoming
r63,s9,t20,"This paper has numerous grammatical and spelling errors throughout the paper (worse, the same errors are copy-pasted everywhere).",actionable,shortcoming
r63,s0,t10,An encoder reads in the document to condition the generator which outputs a summary.,non_actionable,other
r63,s1,t10,An additional GAN loss is used on the generator output to encourage the output to look like summaries -- this procedure only requires unpaired summaries.,non_actionable,other
r63,s2,t10,The results are that this procedure improves upon the trivial baseline but still significantly underperforms supervised training.,non_actionable,other
r63,s3,t10,The idea is a simple but useful extension of these previous works.,actionable,agreement
r63,s4,t10,"The problem set-up of unpaired summarization is not particularly compelling, since summaries are typically found paired with their original documents.",actionable,shortcoming
r63,s5,t10,"It would be more interesting to see how well it can be used for other textual domains such as translation, where a lot of unpaired data exists (some other submissions to ICLR tackle this problem).",actionable,suggestion
r63,s6,t10,A key baseline that is missing is pretraining the generator as a language model over summaries.,actionable,shortcoming
r63,s7,t10,"Without this baseline, it is hard to tell whether GAN training is even useful.",actionable,shortcoming
r63,s8,t10,Another experiment missing is seeing whether joint supervised-GAN-reconstruction training can outperform purely supervised training.,actionable,shortcoming
r63,s9,t10,"This paper has numerous grammatical and spelling errors throughout the paper (worse, the same errors are copy-pasted everywhere).",actionable,shortcoming
r63,s0,t16,An encoder reads in the document to condition the generator which outputs a summary.,non_actionable,fact
r63,s1,t16,An additional GAN loss is used on the generator output to encourage the output to look like summaries -- this procedure only requires unpaired summaries.,non_actionable,fact
r63,s2,t16,The results are that this procedure improves upon the trivial baseline but still significantly underperforms supervised training.,actionable,shortcoming
r63,s3,t16,The idea is a simple but useful extension of these previous works.,non_actionable,agreement
r63,s4,t16,"The problem set-up of unpaired summarization is not particularly compelling, since summaries are typically found paired with their original documents.",non_actionable,fact
r63,s5,t16,"It would be more interesting to see how well it can be used for other textual domains such as translation, where a lot of unpaired data exists (some other submissions to ICLR tackle this problem).",actionable,suggestion
r63,s6,t16,A key baseline that is missing is pretraining the generator as a language model over summaries.,actionable,shortcoming
r63,s7,t16,"Without this baseline, it is hard to tell whether GAN training is even useful.",actionable,shortcoming
r63,s8,t16,Another experiment missing is seeing whether joint supervised-GAN-reconstruction training can outperform purely supervised training.,actionable,shortcoming
r63,s9,t16,"This paper has numerous grammatical and spelling errors throughout the paper (worse, the same errors are copy-pasted everywhere).",actionable,shortcoming
r63,s0,t2,An encoder reads in the document to condition the generator which outputs a summary.,non_actionable,fact
r63,s1,t2,An additional GAN loss is used on the generator output to encourage the output to look like summaries -- this procedure only requires unpaired summaries.,non_actionable,fact
r63,s2,t2,The results are that this procedure improves upon the trivial baseline but still significantly underperforms supervised training.,non_actionable,fact
r63,s3,t2,The idea is a simple but useful extension of these previous works.,non_actionable,agreement
r63,s4,t2,"The problem set-up of unpaired summarization is not particularly compelling, since summaries are typically found paired with their original documents.",non_actionable,shortcoming
r63,s5,t2,"It would be more interesting to see how well it can be used for other textual domains such as translation, where a lot of unpaired data exists (some other submissions to ICLR tackle this problem).",actionable,shortcoming
r63,s6,t2,A key baseline that is missing is pretraining the generator as a language model over summaries.,actionable,shortcoming
r63,s7,t2,"Without this baseline, it is hard to tell whether GAN training is even useful.",actionable,shortcoming
r63,s8,t2,Another experiment missing is seeing whether joint supervised-GAN-reconstruction training can outperform purely supervised training.,actionable,shortcoming
r63,s9,t2,"This paper has numerous grammatical and spelling errors throughout the paper (worse, the same errors are copy-pasted everywhere).",actionable,shortcoming
r100,s0,t20,An interesting paper which is marginally above acceptance threshold The authors of the paper propose a framework to generate natural adversarial examples by searching adversaries in a latent space of dense and continuous data representation (instead of in the original input data space).,actionable,agreement
r100,s1,t20,"The details of their proposed method are covered in Algorithm 1 on Page 12, where an additional GAN (generative adversarial network) I_{\gamma}, which can be regarded as the inverse function of the original GAN G_{\theta}, is trained to learn a map from the original input data space to the latent z-space.",non_actionable,fact
r100,s2,t20,The intuition of the proposed approach is clearly explained and it seems very reasonable to me.,non_actionable,agreement
r100,s3,t20,"My main concern, however, is in the current sampling-based search algorithm in the latent z-space, which the authors have already admitted in the paper.",non_actionable,shortcoming
r100,s4,t20,The efficiency of such a search method decreases very fast when the dimensions of the z-space increases.,non_actionable,fact
r100,s5,t20,Another concern is that the authors have not provided sufficient number of examples to show the advantages of their proposed method over the other method (such as FGSM) in generating the adversaries.,actionable,shortcoming
r100,s6,t20,Could you explicitly specify the dimension of the latent z-space in each example in image and text domain in Section 3?,actionable,question
r100,s7,t20,"In Tables 7 and 8, the human beings agree with the LeNet in >= 58% of cases.",non_actionable,fact
r100,s8,t20,Could you still say that your generated “adversaries” leading to the wrong decision from LeNet?,actionable,question
r100,s9,t20,How do you choose the parameter \lambda in Equation (2)?,actionable,question
r100,s0,t25,An interesting paper which is marginally above acceptance threshold The authors of the paper propose a framework to generate natural adversarial examples by searching adversaries in a latent space of dense and continuous data representation (instead of in the original input data space).,actionable,shortcoming
r100,s1,t25,"The details of their proposed method are covered in Algorithm 1 on Page 12, where an additional GAN (generative adversarial network) I_{\gamma}, which can be regarded as the inverse function of the original GAN G_{\theta}, is trained to learn a map from the original input data space to the latent z-space.",non_actionable,shortcoming
r100,s2,t25,The intuition of the proposed approach is clearly explained and it seems very reasonable to me.,non_actionable,agreement
r100,s3,t25,"My main concern, however, is in the current sampling-based search algorithm in the latent z-space, which the authors have already admitted in the paper.",non_actionable,shortcoming
r100,s4,t25,The efficiency of such a search method decreases very fast when the dimensions of the z-space increases.,non_actionable,disagreement
r100,s5,t25,Another concern is that the authors have not provided sufficient number of examples to show the advantages of their proposed method over the other method (such as FGSM) in generating the adversaries.,actionable,shortcoming
r100,s6,t25,Could you explicitly specify the dimension of the latent z-space in each example in image and text domain in Section 3?,actionable,question
r100,s7,t25,"In Tables 7 and 8, the human beings agree with the LeNet in >= 58% of cases.",non_actionable,fact
r100,s8,t25,Could you still say that your generated “adversaries” leading to the wrong decision from LeNet?,actionable,question
r100,s9,t25,How do you choose the parameter \lambda in Equation (2)?,actionable,question
r100,s0,t31,An interesting paper which is marginally above acceptance threshold The authors of the paper propose a framework to generate natural adversarial examples by searching adversaries in a latent space of dense and continuous data representation (instead of in the original input data space).,non_actionable,fact
r100,s1,t31,"The details of their proposed method are covered in Algorithm 1 on Page 12, where an additional GAN (generative adversarial network) I_{\gamma}, which can be regarded as the inverse function of the original GAN G_{\theta}, is trained to learn a map from the original input data space to the latent z-space.",non_actionable,fact
r100,s2,t31,The intuition of the proposed approach is clearly explained and it seems very reasonable to me.,non_actionable,agreement
r100,s3,t31,"My main concern, however, is in the current sampling-based search algorithm in the latent z-space, which the authors have already admitted in the paper.",actionable,shortcoming
r100,s4,t31,The efficiency of such a search method decreases very fast when the dimensions of the z-space increases.,actionable,shortcoming
r100,s5,t31,Another concern is that the authors have not provided sufficient number of examples to show the advantages of their proposed method over the other method (such as FGSM) in generating the adversaries.,actionable,shortcoming
r100,s6,t31,Could you explicitly specify the dimension of the latent z-space in each example in image and text domain in Section 3?,actionable,shortcoming
r100,s7,t31,"In Tables 7 and 8, the human beings agree with the LeNet in >= 58% of cases.",non_actionable,fact
r100,s8,t31,Could you still say that your generated “adversaries” leading to the wrong decision from LeNet?,non_actionable,question
r100,s9,t31,How do you choose the parameter \lambda in Equation (2)?,non_actionable,question
r100,s0,t10,An interesting paper which is marginally above acceptance threshold The authors of the paper propose a framework to generate natural adversarial examples by searching adversaries in a latent space of dense and continuous data representation (instead of in the original input data space).,non_actionable,other
r100,s1,t10,"The details of their proposed method are covered in Algorithm 1 on Page 12, where an additional GAN (generative adversarial network) I_{\gamma}, which can be regarded as the inverse function of the original GAN G_{\theta}, is trained to learn a map from the original input data space to the latent z-space.",non_actionable,other
r100,s2,t10,The intuition of the proposed approach is clearly explained and it seems very reasonable to me.,actionable,agreement
r100,s3,t10,"My main concern, however, is in the current sampling-based search algorithm in the latent z-space, which the authors have already admitted in the paper.",actionable,shortcoming
r100,s4,t10,The efficiency of such a search method decreases very fast when the dimensions of the z-space increases.,actionable,shortcoming
r100,s5,t10,Another concern is that the authors have not provided sufficient number of examples to show the advantages of their proposed method over the other method (such as FGSM) in generating the adversaries.,actionable,shortcoming
r100,s6,t10,Could you explicitly specify the dimension of the latent z-space in each example in image and text domain in Section 3?,actionable,suggestion
r100,s7,t10,"In Tables 7 and 8, the human beings agree with the LeNet in >= 58% of cases.",non_actionable,other
r100,s8,t10,Could you still say that your generated “adversaries” leading to the wrong decision from LeNet?,actionable,suggestion
r100,s9,t10,How do you choose the parameter \lambda in Equation (2)?,actionable,suggestion
r100,s0,t2,An interesting paper which is marginally above acceptance threshold The authors of the paper propose a framework to generate natural adversarial examples by searching adversaries in a latent space of dense and continuous data representation (instead of in the original input data space).,non_actionable,agreement
r100,s1,t2,"The details of their proposed method are covered in Algorithm 1 on Page 12, where an additional GAN (generative adversarial network) I_{\gamma}, which can be regarded as the inverse function of the original GAN G_{\theta}, is trained to learn a map from the original input data space to the latent z-space.",non_actionable,fact
r100,s2,t2,The intuition of the proposed approach is clearly explained and it seems very reasonable to me.,non_actionable,agreement
r100,s3,t2,"My main concern, however, is in the current sampling-based search algorithm in the latent z-space, which the authors have already admitted in the paper.",non_actionable,shortcoming
r100,s4,t2,The efficiency of such a search method decreases very fast when the dimensions of the z-space increases.,non_actionable,fact
r100,s5,t2,Another concern is that the authors have not provided sufficient number of examples to show the advantages of their proposed method over the other method (such as FGSM) in generating the adversaries.,actionable,shortcoming
r100,s6,t2,Could you explicitly specify the dimension of the latent z-space in each example in image and text domain in Section 3?,actionable,suggestion
r100,s7,t2,"In Tables 7 and 8, the human beings agree with the LeNet in >= 58% of cases.",non_actionable,fact
r100,s8,t2,Could you still say that your generated “adversaries” leading to the wrong decision from LeNet?,actionable,question
r100,s9,t2,How do you choose the parameter \lambda in Equation (2)?,actionable,question
r118,s0,t31,"An interesting paper, but not the clearest presentation.",actionable,shortcoming
r118,s1,t31,Additional regularization terms are also added to encourage the model to encode longer term dependencies in its latent distributions.,actionable,suggestion
r118,s2,t31,My first concern with this paper is that the derivation in Eq.,actionable,shortcoming
r118,s3,t31,There is a p(z_1:T) term that should appear in the integrand.,actionable,suggestion
r118,s4,t31,It is not clear to me why h_t should depend on \tilde{b}_t.,actionable,suggestion
r118,s5,t31,"It may add capacity to the decoder in the form of extra weights, but the same could be achieved by making z_t larger.",non_actionable,fact
r118,s6,t31,In the no reconstruction loss experiments do you still sample \tilde{b}_t in the generative part?,non_actionable,question
r118,s7,t31,It seems the Blizzard results in Figure 2 are missing no reconstruction loss + full backprop.,actionable,shortcoming
r118,s8,t31,Exactly which gradients are you skipping at random?,non_actionable,question
r118,s9,t31,Do you have any intuition for why it is sometimes necessary to set beta=0?,non_actionable,question
r118,s0,t10,"An interesting paper, but not the clearest presentation.",actionable,disagreement
r118,s1,t10,Additional regularization terms are also added to encourage the model to encode longer term dependencies in its latent distributions.,non_actionable,other
r118,s2,t10,My first concern with this paper is that the derivation in Eq.,actionable,disagreement
r118,s3,t10,There is a p(z_1:T) term that should appear in the integrand.,actionable,shortcoming
r118,s4,t10,It is not clear to me why h_t should depend on \tilde{b}_t.,actionable,fact
r118,s5,t10,"It may add capacity to the decoder in the form of extra weights, but the same could be achieved by making z_t larger.",actionable,suggestion
r118,s6,t10,In the no reconstruction loss experiments do you still sample \tilde{b}_t in the generative part?,actionable,question
r118,s7,t10,It seems the Blizzard results in Figure 2 are missing no reconstruction loss + full backprop.,actionable,shortcoming
r118,s8,t10,Exactly which gradients are you skipping at random?,actionable,question
r118,s9,t10,Do you have any intuition for why it is sometimes necessary to set beta=0?,actionable,question
r118,s0,t8,"An interesting paper, but not the clearest presentation.",actionable,shortcoming
r118,s1,t8,Additional regularization terms are also added to encourage the model to encode longer term dependencies in its latent distributions.,non_actionable,fact
r118,s2,t8,My first concern with this paper is that the derivation in Eq.,actionable,shortcoming
r118,s3,t8,There is a p(z_1:T) term that should appear in the integrand.,actionable,shortcoming
r118,s4,t8,It is not clear to me why h_t should depend on \tilde{b}_t.,actionable,disagreement
r118,s5,t8,"It may add capacity to the decoder in the form of extra weights, but the same could be achieved by making z_t larger.",actionable,disagreement
r118,s6,t8,In the no reconstruction loss experiments do you still sample \tilde{b}_t in the generative part?,non_actionable,question
r118,s7,t8,It seems the Blizzard results in Figure 2 are missing no reconstruction loss + full backprop.,non_actionable,agreement
r118,s8,t8,Exactly which gradients are you skipping at random?,non_actionable,question
r118,s9,t8,Do you have any intuition for why it is sometimes necessary to set beta=0?,non_actionable,question
r118,s0,t2,"An interesting paper, but not the clearest presentation.",non_actionable,shortcoming
r118,s1,t2,Additional regularization terms are also added to encourage the model to encode longer term dependencies in its latent distributions.,non_actionable,fact
r118,s2,t2,My first concern with this paper is that the derivation in Eq.,non_actionable,shortcoming
r118,s3,t2,There is a p(z_1:T) term that should appear in the integrand.,actionable,suggestion
r118,s4,t2,It is not clear to me why h_t should depend on \tilde{b}_t.,actionable,shortcoming
r118,s5,t2,"It may add capacity to the decoder in the form of extra weights, but the same could be achieved by making z_t larger.",actionable,suggestion
r118,s6,t2,In the no reconstruction loss experiments do you still sample \tilde{b}_t in the generative part?,actionable,question
r118,s7,t2,It seems the Blizzard results in Figure 2 are missing no reconstruction loss + full backprop.,actionable,shortcoming
r118,s8,t2,Exactly which gradients are you skipping at random?,actionable,question
r118,s9,t2,Do you have any intuition for why it is sometimes necessary to set beta=0?,actionable,question
r118,s0,t20,"An interesting paper, but not the clearest presentation.",non_actionable,fact
r118,s1,t20,Additional regularization terms are also added to encourage the model to encode longer term dependencies in its latent distributions.,non_actionable,fact
r118,s2,t20,My first concern with this paper is that the derivation in Eq.,actionable,shortcoming
r118,s3,t20,There is a p(z_1:T) term that should appear in the integrand.,actionable,suggestion
r118,s4,t20,It is not clear to me why h_t should depend on \tilde{b}_t.,actionable,shortcoming
r118,s5,t20,"It may add capacity to the decoder in the form of extra weights, but the same could be achieved by making z_t larger.",actionable,shortcoming
r118,s6,t20,In the no reconstruction loss experiments do you still sample \tilde{b}_t in the generative part?,actionable,question
r118,s7,t20,It seems the Blizzard results in Figure 2 are missing no reconstruction loss + full backprop.,actionable,shortcoming
r118,s8,t20,Exactly which gradients are you skipping at random?,actionable,question
r118,s9,t20,Do you have any intuition for why it is sometimes necessary to set beta=0?,actionable,question
r85,s0,t18,"But, apparently, this is not the problem the authors actually solve, according to eq.",non_actionable,fact
r85,s1,t18,I am not sure how this greedy action should result in maximizing the total discounted reward along a trajectory.,non_actionable,shortcoming
r85,s2,t18,Equation 3 seems to be a cost function penalizing differences between predicted and observed states.,non_actionable,agreement
r85,s3,t18,"Similarly, equation 4 penalizes differences between predicted and observed state transitions.",actionable,suggestion
r85,s4,t18,"Essentially, the current manuscript does not learn the reward function of an MDP in the RL setting, but it learns some sort of a shaping reward function to do policy imitation, i.e. copy the behavior of the demonstrator as closely as possible.",non_actionable,agreement
r85,s5,t18,"So, in my view, the manuscript does a nice job at policy fitting, but this is not reward estimation.",non_actionable,other
r85,s6,t18,The manuscript has to be rewritten that way.,non_actionable,suggestion
r85,s7,t18,"One could also argue that the manuscript would profit from a better theoretical analysis of the IRL problem, say: C. A. Rothkopf, C. Dimitrakakis.",non_actionable,suggestion
r85,s8,t18,Preference elicitation and inverse reinforcement learning.,actionable,other
r85,s9,t18,"ECML 2011 Overall the manuscript leverages on deep learning’s power of function approximation and the simulation results are nice, but in terms of the soundness of the underlying RL and IRL theory there is some work to do.",actionable,fact
r85,s0,t2,"But, apparently, this is not the problem the authors actually solve, according to eq.",non_actionable,shortcoming
r85,s1,t2,I am not sure how this greedy action should result in maximizing the total discounted reward along a trajectory.,non_actionable,shortcoming
r85,s2,t2,Equation 3 seems to be a cost function penalizing differences between predicted and observed states.,non_actionable,fact
r85,s3,t2,"Similarly, equation 4 penalizes differences between predicted and observed state transitions.",non_actionable,fact
r85,s4,t2,"Essentially, the current manuscript does not learn the reward function of an MDP in the RL setting, but it learns some sort of a shaping reward function to do policy imitation, i.e. copy the behavior of the demonstrator as closely as possible.",non_actionable,fact
r85,s5,t2,"So, in my view, the manuscript does a nice job at policy fitting, but this is not reward estimation.",non_actionable,disagreement
r85,s6,t2,The manuscript has to be rewritten that way.,actionable,suggestion
r85,s7,t2,"One could also argue that the manuscript would profit from a better theoretical analysis of the IRL problem, say: C. A. Rothkopf, C. Dimitrakakis.",actionable,suggestion
r85,s8,t2,Preference elicitation and inverse reinforcement learning.,non_actionable,other
r85,s9,t2,"ECML 2011 Overall the manuscript leverages on deep learning’s power of function approximation and the simulation results are nice, but in terms of the soundness of the underlying RL and IRL theory there is some work to do.",actionable,shortcoming
r85,s0,t31,"But, apparently, this is not the problem the authors actually solve, according to eq.",actionable,disagreement
r85,s1,t31,I am not sure how this greedy action should result in maximizing the total discounted reward along a trajectory.,actionable,shortcoming
r85,s2,t31,Equation 3 seems to be a cost function penalizing differences between predicted and observed states.,non_actionable,fact
r85,s3,t31,"Similarly, equation 4 penalizes differences between predicted and observed state transitions.",non_actionable,fact
r85,s4,t31,"Essentially, the current manuscript does not learn the reward function of an MDP in the RL setting, but it learns some sort of a shaping reward function to do policy imitation, i.e. copy the behavior of the demonstrator as closely as possible.",non_actionable,fact
r85,s5,t31,"So, in my view, the manuscript does a nice job at policy fitting, but this is not reward estimation.",actionable,shortcoming
r85,s6,t31,The manuscript has to be rewritten that way.,actionable,suggestion
r85,s7,t31,"One could also argue that the manuscript would profit from a better theoretical analysis of the IRL problem, say: C. A. Rothkopf, C. Dimitrakakis.",actionable,suggestion
r85,s8,t31,Preference elicitation and inverse reinforcement learning.,non_actionable,fact
r85,s9,t31,"ECML 2011 Overall the manuscript leverages on deep learning’s power of function approximation and the simulation results are nice, but in terms of the soundness of the underlying RL and IRL theory there is some work to do.",actionable,shortcoming
r85,s0,t20,"But, apparently, this is not the problem the authors actually solve, according to eq.",non_actionable,shortcoming
r85,s1,t20,I am not sure how this greedy action should result in maximizing the total discounted reward along a trajectory.,non_actionable,shortcoming
r85,s2,t20,Equation 3 seems to be a cost function penalizing differences between predicted and observed states.,non_actionable,fact
r85,s3,t20,"Similarly, equation 4 penalizes differences between predicted and observed state transitions.",non_actionable,fact
r85,s4,t20,"Essentially, the current manuscript does not learn the reward function of an MDP in the RL setting, but it learns some sort of a shaping reward function to do policy imitation, i.e. copy the behavior of the demonstrator as closely as possible.",non_actionable,fact
r85,s5,t20,"So, in my view, the manuscript does a nice job at policy fitting, but this is not reward estimation.",actionable,shortcoming
r85,s6,t20,The manuscript has to be rewritten that way.,actionable,suggestion
r85,s7,t20,"One could also argue that the manuscript would profit from a better theoretical analysis of the IRL problem, say: C. A. Rothkopf, C. Dimitrakakis.",actionable,suggestion
r85,s8,t20,Preference elicitation and inverse reinforcement learning.,non_actionable,fact
r85,s9,t20,"ECML 2011 Overall the manuscript leverages on deep learning’s power of function approximation and the simulation results are nice, but in terms of the soundness of the underlying RL and IRL theory there is some work to do.",actionable,shortcoming
r85,s0,t10,"But, apparently, this is not the problem the authors actually solve, according to eq.",actionable,shortcoming
r85,s1,t10,I am not sure how this greedy action should result in maximizing the total discounted reward along a trajectory.,actionable,disagreement
r85,s2,t10,Equation 3 seems to be a cost function penalizing differences between predicted and observed states.,non_actionable,other
r85,s3,t10,"Similarly, equation 4 penalizes differences between predicted and observed state transitions.",non_actionable,other
r85,s4,t10,"Essentially, the current manuscript does not learn the reward function of an MDP in the RL setting, but it learns some sort of a shaping reward function to do policy imitation, i.e. copy the behavior of the demonstrator as closely as possible.",non_actionable,other
r85,s5,t10,"So, in my view, the manuscript does a nice job at policy fitting, but this is not reward estimation.",actionable,shortcoming
r85,s6,t10,The manuscript has to be rewritten that way.,actionable,suggestion
r85,s7,t10,"One could also argue that the manuscript would profit from a better theoretical analysis of the IRL problem, say: C. A. Rothkopf, C. Dimitrakakis.",actionable,disagreement
r85,s8,t10,Preference elicitation and inverse reinforcement learning.,non_actionable,other
r85,s9,t10,"ECML 2011 Overall the manuscript leverages on deep learning’s power of function approximation and the simulation results are nice, but in terms of the soundness of the underlying RL and IRL theory there is some work to do.",actionable,suggestion
r25,s0,t8,Claiming much of common intuition around tricks for avoiding gradient issues are incorrect.,actionable,disagreement
r25,s1,t8,The paper makes some bold claims.,non_actionable,fact
r25,s2,t8,"It's possible some of the issues arise from the particular architectures they choose to investigate and demonstrate on (eg I have mostly seen ResNets in the context of CNNs but they analyze on FC topologies, the form of the loss, etc) but that's a guess and there are some further analysis in the supp material for these networks which I haven't looked at in detail.",actionable,fact
r25,s3,t8,"Regardless - an important note to the authors is that it's a particularly long and verbose paper, coming in at 16 pages of the main paper(!) with nearly 50 (!) pages of supplementary material where the heart and meat of the proofs and experiments reside.",actionable,shortcoming
r25,s4,t8,As such it's not even clear if this is proper for a conference.,actionable,shortcoming
r25,s5,t8,The authors have already provided several pages worth of additional comments on the website on further related work.,non_actionable,fact
r25,s6,t8,I view this as an issue in and of itself.,actionable,fact
r25,s7,t8,I've seen many papers that need to go through much more complicated derivations and theory and remain within a 8-10 page limit by being precise and strictly to the point.,actionable,suggestion
r25,s8,t8,"Perhaps Godel could be a good inspiration here, with a 21 page PhD thesis that fundamentally changed mathematics.",actionable,suggestion
r25,s9,t8,"So, while I cannot vouch for the correctness, I think it can and should go through a serious revision to make it succinct and that will likely considerably help in making it accessible to a wider readership and aligned to the expectations from a conference paper in the field.",actionable,suggestion
r25,s0,t20,Claiming much of common intuition around tricks for avoiding gradient issues are incorrect.,actionable,shortcoming
r25,s1,t20,The paper makes some bold claims.,non_actionable,fact
r25,s2,t20,"It's possible some of the issues arise from the particular architectures they choose to investigate and demonstrate on (eg I have mostly seen ResNets in the context of CNNs but they analyze on FC topologies, the form of the loss, etc) but that's a guess and there are some further analysis in the supp material for these networks which I haven't looked at in detail.",non_actionable,fact
r25,s3,t20,"Regardless - an important note to the authors is that it's a particularly long and verbose paper, coming in at 16 pages of the main paper(!) with nearly 50 (!) pages of supplementary material where the heart and meat of the proofs and experiments reside.",non_actionable,fact
r25,s4,t20,As such it's not even clear if this is proper for a conference.,non_actionable,shortcoming
r25,s5,t20,The authors have already provided several pages worth of additional comments on the website on further related work.,non_actionable,fact
r25,s6,t20,I view this as an issue in and of itself.,non_actionable,shortcoming
r25,s7,t20,I've seen many papers that need to go through much more complicated derivations and theory and remain within a 8-10 page limit by being precise and strictly to the point.,non_actionable,fact
r25,s8,t20,"Perhaps Godel could be a good inspiration here, with a 21 page PhD thesis that fundamentally changed mathematics.",non_actionable,fact
r25,s9,t20,"So, while I cannot vouch for the correctness, I think it can and should go through a serious revision to make it succinct and that will likely considerably help in making it accessible to a wider readership and aligned to the expectations from a conference paper in the field.",actionable,suggestion
r25,s0,t10,Claiming much of common intuition around tricks for avoiding gradient issues are incorrect.,actionable,shortcoming
r25,s1,t10,The paper makes some bold claims.,actionable,other
r25,s2,t10,"It's possible some of the issues arise from the particular architectures they choose to investigate and demonstrate on (eg I have mostly seen ResNets in the context of CNNs but they analyze on FC topologies, the form of the loss, etc) but that's a guess and there are some further analysis in the supp material for these networks which I haven't looked at in detail.",actionable,fact
r25,s3,t10,"Regardless - an important note to the authors is that it's a particularly long and verbose paper, coming in at 16 pages of the main paper(!) with nearly 50 (!) pages of supplementary material where the heart and meat of the proofs and experiments reside.",actionable,shortcoming
r25,s4,t10,As such it's not even clear if this is proper for a conference.,actionable,shortcoming
r25,s5,t10,The authors have already provided several pages worth of additional comments on the website on further related work.,actionable,shortcoming
r25,s6,t10,I view this as an issue in and of itself.,actionable,shortcoming
r25,s7,t10,I've seen many papers that need to go through much more complicated derivations and theory and remain within a 8-10 page limit by being precise and strictly to the point.,actionable,fact
r25,s8,t10,"Perhaps Godel could be a good inspiration here, with a 21 page PhD thesis that fundamentally changed mathematics.",actionable,suggestion
r25,s9,t10,"So, while I cannot vouch for the correctness, I think it can and should go through a serious revision to make it succinct and that will likely considerably help in making it accessible to a wider readership and aligned to the expectations from a conference paper in the field.",actionable,suggestion
r25,s0,t31,Claiming much of common intuition around tricks for avoiding gradient issues are incorrect.,actionable,shortcoming
r25,s1,t31,The paper makes some bold claims.,non_actionable,fact
r25,s2,t31,"It's possible some of the issues arise from the particular architectures they choose to investigate and demonstrate on (eg I have mostly seen ResNets in the context of CNNs but they analyze on FC topologies, the form of the loss, etc) but that's a guess and there are some further analysis in the supp material for these networks which I haven't looked at in detail.",non_actionable,fact
r25,s3,t31,"Regardless - an important note to the authors is that it's a particularly long and verbose paper, coming in at 16 pages of the main paper(!) with nearly 50 (!) pages of supplementary material where the heart and meat of the proofs and experiments reside.",actionable,shortcoming
r25,s4,t31,As such it's not even clear if this is proper for a conference.,actionable,shortcoming
r25,s5,t31,The authors have already provided several pages worth of additional comments on the website on further related work.,non_actionable,fact
r25,s6,t31,I view this as an issue in and of itself.,actionable,shortcoming
r25,s7,t31,I've seen many papers that need to go through much more complicated derivations and theory and remain within a 8-10 page limit by being precise and strictly to the point.,actionable,shortcoming
r25,s8,t31,"Perhaps Godel could be a good inspiration here, with a 21 page PhD thesis that fundamentally changed mathematics.",actionable,suggestion
r25,s9,t31,"So, while I cannot vouch for the correctness, I think it can and should go through a serious revision to make it succinct and that will likely considerably help in making it accessible to a wider readership and aligned to the expectations from a conference paper in the field.",actionable,shortcoming
r25,s0,t2,Claiming much of common intuition around tricks for avoiding gradient issues are incorrect.,non_actionable,disagreement
r25,s1,t2,The paper makes some bold claims.,non_actionable,fact
r25,s2,t2,"It's possible some of the issues arise from the particular architectures they choose to investigate and demonstrate on (eg I have mostly seen ResNets in the context of CNNs but they analyze on FC topologies, the form of the loss, etc) but that's a guess and there are some further analysis in the supp material for these networks which I haven't looked at in detail.",non_actionable,fact
r25,s3,t2,"Regardless - an important note to the authors is that it's a particularly long and verbose paper, coming in at 16 pages of the main paper(!) with nearly 50 (!) pages of supplementary material where the heart and meat of the proofs and experiments reside.",actionable,shortcoming
r25,s4,t2,As such it's not even clear if this is proper for a conference.,non_actionable,shortcoming
r25,s5,t2,The authors have already provided several pages worth of additional comments on the website on further related work.,non_actionable,fact
r25,s6,t2,I view this as an issue in and of itself.,non_actionable,shortcoming
r25,s7,t2,I've seen many papers that need to go through much more complicated derivations and theory and remain within a 8-10 page limit by being precise and strictly to the point.,actionable,shortcoming
r25,s8,t2,"Perhaps Godel could be a good inspiration here, with a 21 page PhD thesis that fundamentally changed mathematics.",actionable,suggestion
r25,s9,t2,"So, while I cannot vouch for the correctness, I think it can and should go through a serious revision to make it succinct and that will likely considerably help in making it accessible to a wider readership and aligned to the expectations from a conference paper in the field.",actionable,suggestion
r122,s0,t20,"Creative and interesting The paper introduces an application of Graph Neural Networks (Li's Gated Graph Neural Nets, GGNNs, specifically) for reasoning about programs and programming.",non_actionable,agreement
r122,s1,t20,"The core idea is to represent a program as a graph that a GGNN can take as input, and train the GGNN to make token-level predictions that depend on the semantic context.",non_actionable,fact
r122,s2,t20,"identifying bugs in programs where the wrong variable is used, and",non_actionable,fact
r122,s3,t20,2) predicting a variable's name by consider its semantic context.,non_actionable,fact
r122,s4,t20,"The paper is generally well written, easy to read and understand, and the results are compelling.",non_actionable,agreement
r122,s5,t20,The proposed GGNN approach outperforms (bi-)LSTMs on both tasks.,non_actionable,agreement
r122,s6,t20,"Because the tasks are not widely explored in the literature, it could be difficult to know how crucial exploiting graphically structured information is, so the authors performed several ablation studies to analyze this out.",non_actionable,fact
r122,s7,t20,"Those results show that as structural information is removed, the GGNN's performance diminishes, as expected.",non_actionable,fact
r122,s8,t20,"As a demonstration of the usefulness of their approach, the authors ran their model on an unnamed open-source project and claimed to find several bugs, at least one of which potentially reduced memory performance.",non_actionable,fact
r122,s9,t20,"Overall the work is important, original, well-executed, and should open new directions for deep learning in program analysis.",non_actionable,agreement
r122,s0,t10,"Creative and interesting The paper introduces an application of Graph Neural Networks (Li's Gated Graph Neural Nets, GGNNs, specifically) for reasoning about programs and programming.",non_actionable,other
r122,s1,t10,"The core idea is to represent a program as a graph that a GGNN can take as input, and train the GGNN to make token-level predictions that depend on the semantic context.",non_actionable,other
r122,s2,t10,"identifying bugs in programs where the wrong variable is used, and",non_actionable,other
r122,s3,t10,2) predicting a variable's name by consider its semantic context.,non_actionable,other
r122,s4,t10,"The paper is generally well written, easy to read and understand, and the results are compelling.",actionable,agreement
r122,s5,t10,The proposed GGNN approach outperforms (bi-)LSTMs on both tasks.,actionable,agreement
r122,s6,t10,"Because the tasks are not widely explored in the literature, it could be difficult to know how crucial exploiting graphically structured information is, so the authors performed several ablation studies to analyze this out.",non_actionable,other
r122,s7,t10,"Those results show that as structural information is removed, the GGNN's performance diminishes, as expected.",non_actionable,other
r122,s8,t10,"As a demonstration of the usefulness of their approach, the authors ran their model on an unnamed open-source project and claimed to find several bugs, at least one of which potentially reduced memory performance.",non_actionable,other
r122,s9,t10,"Overall the work is important, original, well-executed, and should open new directions for deep learning in program analysis.",actionable,agreement
r122,s0,t25,"Creative and interesting The paper introduces an application of Graph Neural Networks (Li's Gated Graph Neural Nets, GGNNs, specifically) for reasoning about programs and programming.",non_actionable,agreement
r122,s1,t25,"The core idea is to represent a program as a graph that a GGNN can take as input, and train the GGNN to make token-level predictions that depend on the semantic context.",non_actionable,fact
r122,s2,t25,"identifying bugs in programs where the wrong variable is used, and",actionable,shortcoming
r122,s3,t25,2) predicting a variable's name by consider its semantic context.,non_actionable,fact
r122,s4,t25,"The paper is generally well written, easy to read and understand, and the results are compelling.",non_actionable,agreement
r122,s5,t25,The proposed GGNN approach outperforms (bi-)LSTMs on both tasks.,actionable,shortcoming
r122,s6,t25,"Because the tasks are not widely explored in the literature, it could be difficult to know how crucial exploiting graphically structured information is, so the authors performed several ablation studies to analyze this out.",non_actionable,agreement
r122,s7,t25,"Those results show that as structural information is removed, the GGNN's performance diminishes, as expected.",non_actionable,agreement
r122,s8,t25,"As a demonstration of the usefulness of their approach, the authors ran their model on an unnamed open-source project and claimed to find several bugs, at least one of which potentially reduced memory performance.",non_actionable,agreement
r122,s9,t25,"Overall the work is important, original, well-executed, and should open new directions for deep learning in program analysis.",non_actionable,agreement
r122,s0,t31,"Creative and interesting The paper introduces an application of Graph Neural Networks (Li's Gated Graph Neural Nets, GGNNs, specifically) for reasoning about programs and programming.",non_actionable,agreement
r122,s1,t31,"The core idea is to represent a program as a graph that a GGNN can take as input, and train the GGNN to make token-level predictions that depend on the semantic context.",non_actionable,fact
r122,s2,t31,"identifying bugs in programs where the wrong variable is used, and",non_actionable,fact
r122,s3,t31,2) predicting a variable's name by consider its semantic context.,non_actionable,fact
r122,s4,t31,"The paper is generally well written, easy to read and understand, and the results are compelling.",non_actionable,agreement
r122,s5,t31,The proposed GGNN approach outperforms (bi-)LSTMs on both tasks.,non_actionable,fact
r122,s6,t31,"Because the tasks are not widely explored in the literature, it could be difficult to know how crucial exploiting graphically structured information is, so the authors performed several ablation studies to analyze this out.",non_actionable,fact
r122,s7,t31,"Those results show that as structural information is removed, the GGNN's performance diminishes, as expected.",non_actionable,fact
r122,s8,t31,"As a demonstration of the usefulness of their approach, the authors ran their model on an unnamed open-source project and claimed to find several bugs, at least one of which potentially reduced memory performance.",non_actionable,fact
r122,s9,t31,"Overall the work is important, original, well-executed, and should open new directions for deep learning in program analysis.",non_actionable,agreement
r122,s0,t12,"Creative and interesting The paper introduces an application of Graph Neural Networks (Li's Gated Graph Neural Nets, GGNNs, specifically) for reasoning about programs and programming.",non_actionable,fact
r122,s1,t12,"The core idea is to represent a program as a graph that a GGNN can take as input, and train the GGNN to make token-level predictions that depend on the semantic context.",non_actionable,fact
r122,s2,t12,"identifying bugs in programs where the wrong variable is used, and",non_actionable,other
r122,s3,t12,2) predicting a variable's name by consider its semantic context.,non_actionable,fact
r122,s4,t12,"The paper is generally well written, easy to read and understand, and the results are compelling.",non_actionable,agreement
r122,s5,t12,The proposed GGNN approach outperforms (bi-)LSTMs on both tasks.,non_actionable,fact
r122,s6,t12,"Because the tasks are not widely explored in the literature, it could be difficult to know how crucial exploiting graphically structured information is, so the authors performed several ablation studies to analyze this out.",non_actionable,agreement
r122,s7,t12,"Those results show that as structural information is removed, the GGNN's performance diminishes, as expected.",non_actionable,fact
r122,s8,t12,"As a demonstration of the usefulness of their approach, the authors ran their model on an unnamed open-source project and claimed to find several bugs, at least one of which potentially reduced memory performance.",non_actionable,fact
r122,s9,t12,"Overall the work is important, original, well-executed, and should open new directions for deep learning in program analysis.",non_actionable,agreement
r43,s0,t31,deep learning with the boosting trick This paper applies the boosting trick to deep learning.,non_actionable,fact
r43,s1,t31,The proposed algorithm is validated on several image classification datasets.,non_actionable,fact
r43,s2,t31,The paper is its current form has the following issues:,actionable,shortcoming
r43,s3,t31,1. There is hardly any baseline compared in the paper.,actionable,shortcoming
r43,s4,t31,"The proposed algorithm is essentially an ensemble algorithm, there exist several works on deep model ensemble (e.g., Boosted convolutional neural networks, and Snapshot Ensemble) should be compared against.",actionable,shortcoming
r43,s5,t31,"2. I did not carefully check all the proofs, but seems most of the proof can be moved to supplementary to keep the paper more concise.",actionable,suggestion
r43,s6,t31,Why the error rate reported here is higher than that in the original paper?,non_actionable,question
r43,s7,t31,"Typo: In Session 3 Line 7, there is a missing reference.",actionable,shortcoming
r43,s8,t31,"In Session 3 Line 10, “1,00 object classes” should be “100 object classes”.",actionable,shortcoming
r43,s9,t31,"In Line 3 of the paragraph below Equation 5, “classe” should be “class”.",actionable,shortcoming
r43,s0,t10,deep learning with the boosting trick This paper applies the boosting trick to deep learning.,non_actionable,other
r43,s1,t10,The proposed algorithm is validated on several image classification datasets.,non_actionable,other
r43,s2,t10,The paper is its current form has the following issues:,actionable,shortcoming
r43,s3,t10,1. There is hardly any baseline compared in the paper.,actionable,shortcoming
r43,s4,t10,"The proposed algorithm is essentially an ensemble algorithm, there exist several works on deep model ensemble (e.g., Boosted convolutional neural networks, and Snapshot Ensemble) should be compared against.",actionable,shortcoming
r43,s5,t10,"2. I did not carefully check all the proofs, but seems most of the proof can be moved to supplementary to keep the paper more concise.",actionable,fact
r43,s6,t10,Why the error rate reported here is higher than that in the original paper?,actionable,question
r43,s7,t10,"Typo: In Session 3 Line 7, there is a missing reference.",actionable,shortcoming
r43,s8,t10,"In Session 3 Line 10, “1,00 object classes” should be “100 object classes”.",actionable,shortcoming
r43,s9,t10,"In Line 3 of the paragraph below Equation 5, “classe” should be “class”.",actionable,shortcoming
r43,s0,t20,deep learning with the boosting trick This paper applies the boosting trick to deep learning.,non_actionable,fact
r43,s1,t20,The proposed algorithm is validated on several image classification datasets.,non_actionable,fact
r43,s2,t20,The paper is its current form has the following issues:,non_actionable,shortcoming
r43,s3,t20,1. There is hardly any baseline compared in the paper.,non_actionable,shortcoming
r43,s4,t20,"The proposed algorithm is essentially an ensemble algorithm, there exist several works on deep model ensemble (e.g., Boosted convolutional neural networks, and Snapshot Ensemble) should be compared against.",actionable,suggestion
r43,s5,t20,"2. I did not carefully check all the proofs, but seems most of the proof can be moved to supplementary to keep the paper more concise.",actionable,suggestion
r43,s6,t20,Why the error rate reported here is higher than that in the original paper?,actionable,question
r43,s7,t20,"Typo: In Session 3 Line 7, there is a missing reference.",actionable,shortcoming
r43,s8,t20,"In Session 3 Line 10, “1,00 object classes” should be “100 object classes”.",actionable,shortcoming
r43,s9,t20,"In Line 3 of the paragraph below Equation 5, “classe” should be “class”.",actionable,shortcoming
r43,s0,t8,deep learning with the boosting trick This paper applies the boosting trick to deep learning.,non_actionable,fact
r43,s1,t8,The proposed algorithm is validated on several image classification datasets.,non_actionable,fact
r43,s2,t8,The paper is its current form has the following issues:,actionable,shortcoming
r43,s3,t8,1. There is hardly any baseline compared in the paper.,actionable,shortcoming
r43,s4,t8,"The proposed algorithm is essentially an ensemble algorithm, there exist several works on deep model ensemble (e.g., Boosted convolutional neural networks, and Snapshot Ensemble) should be compared against.",actionable,suggestion
r43,s5,t8,"2. I did not carefully check all the proofs, but seems most of the proof can be moved to supplementary to keep the paper more concise.",actionable,suggestion
r43,s6,t8,Why the error rate reported here is higher than that in the original paper?,non_actionable,question
r43,s7,t8,"Typo: In Session 3 Line 7, there is a missing reference.",actionable,shortcoming
r43,s8,t8,"In Session 3 Line 10, “1,00 object classes” should be “100 object classes”.",actionable,suggestion
r43,s9,t8,"In Line 3 of the paragraph below Equation 5, “classe” should be “class”.",actionable,suggestion
r43,s0,t16,deep learning with the boosting trick This paper applies the boosting trick to deep learning.,non_actionable,fact
r43,s1,t16,The proposed algorithm is validated on several image classification datasets.,non_actionable,fact
r43,s2,t16,The paper is its current form has the following issues:,actionable,shortcoming
r43,s3,t16,1. There is hardly any baseline compared in the paper.,actionable,shortcoming
r43,s4,t16,"The proposed algorithm is essentially an ensemble algorithm, there exist several works on deep model ensemble (e.g., Boosted convolutional neural networks, and Snapshot Ensemble) should be compared against.",non_actionable,fact
r43,s5,t16,"2. I did not carefully check all the proofs, but seems most of the proof can be moved to supplementary to keep the paper more concise.",actionable,suggestion
r43,s6,t16,Why the error rate reported here is higher than that in the original paper?,actionable,question
r43,s7,t16,"Typo: In Session 3 Line 7, there is a missing reference.",actionable,shortcoming
r43,s8,t16,"In Session 3 Line 10, “1,00 object classes” should be “100 object classes”.",actionable,shortcoming
r43,s9,t16,"In Line 3 of the paragraph below Equation 5, “classe” should be “class”.",actionable,shortcoming
r49,s0,t2,Deep Temporal Clustering This paper proposes an algorithm for jointly performing dimensionality reduction and temporal clustering in a deep learning context.,non_actionable,fact
r49,s1,t2,"An autoencoder is utilized for dimensionality reduction alongside a clustering objective - that is the autoencoder optimizes the mse (using LSTM layers are utilized in the autoencoder for modelling temporal information), while the latent space is fed into the temporal clustering layer.",non_actionable,fact
r49,s2,t2,The clustering/autoencoder objectives are optimized in an alternating optimization fashion.,non_actionable,fact
r49,s3,t2,"The main con lies in this work being very closely related to t-sne, i.e. compare the the temporal clustering loss based on kl-div (eq 6) to t-sne.",non_actionable,shortcoming
r49,s4,t2,"If we consider e.g., a linear 1-layer autoencoder to be equivalent to PCA (without the rnn layers), in essence this formulation is closely related to applying pca to reduce the initial dimensionality and then t-sne.",non_actionable,fact
r49,s5,t2,"Also, do the cluster centroids appear to be roughly stable over many runs of the algorithm?",actionable,question
r49,s6,t2,"As the averaged results over 5 runs are shown, the standard deviation would be helpful towards showing this empirically.",actionable,suggestion
r49,s7,t2,"On the positive side, it is likely that richer representations can be obtained via this architecture, and results appear to be good with comparison to other metrics The section of the paper that discusses heat-maps should be written more clearly.",actionable,suggestion
r49,s8,t2,Figure 3 is commented with respect to detecting an event - non-event but the process itself is not clearly described as far as I can see.,actionable,shortcoming
r49,s9,t2,minor note: the dynamic time warping is formally not a metric,actionable,disagreement
r49,s0,t10,Deep Temporal Clustering This paper proposes an algorithm for jointly performing dimensionality reduction and temporal clustering in a deep learning context.,non_actionable,other
r49,s1,t10,"An autoencoder is utilized for dimensionality reduction alongside a clustering objective - that is the autoencoder optimizes the mse (using LSTM layers are utilized in the autoencoder for modelling temporal information), while the latent space is fed into the temporal clustering layer.",non_actionable,other
r49,s2,t10,The clustering/autoencoder objectives are optimized in an alternating optimization fashion.,non_actionable,other
r49,s3,t10,"The main con lies in this work being very closely related to t-sne, i.e. compare the the temporal clustering loss based on kl-div (eq 6) to t-sne.",actionable,shortcoming
r49,s4,t10,"If we consider e.g., a linear 1-layer autoencoder to be equivalent to PCA (without the rnn layers), in essence this formulation is closely related to applying pca to reduce the initial dimensionality and then t-sne.",actionable,shortcoming
r49,s5,t10,"Also, do the cluster centroids appear to be roughly stable over many runs of the algorithm?",actionable,question
r49,s6,t10,"As the averaged results over 5 runs are shown, the standard deviation would be helpful towards showing this empirically.",actionable,suggestion
r49,s7,t10,"On the positive side, it is likely that richer representations can be obtained via this architecture, and results appear to be good with comparison to other metrics The section of the paper that discusses heat-maps should be written more clearly.",actionable,suggestion
r49,s8,t10,Figure 3 is commented with respect to detecting an event - non-event but the process itself is not clearly described as far as I can see.,actionable,fact
r49,s9,t10,minor note: the dynamic time warping is formally not a metric,actionable,shortcoming
r49,s0,t16,Deep Temporal Clustering This paper proposes an algorithm for jointly performing dimensionality reduction and temporal clustering in a deep learning context.,non_actionable,fact
r49,s1,t16,"An autoencoder is utilized for dimensionality reduction alongside a clustering objective - that is the autoencoder optimizes the mse (using LSTM layers are utilized in the autoencoder for modelling temporal information), while the latent space is fed into the temporal clustering layer.",non_actionable,fact
r49,s2,t16,The clustering/autoencoder objectives are optimized in an alternating optimization fashion.,non_actionable,fact
r49,s3,t16,"The main con lies in this work being very closely related to t-sne, i.e. compare the the temporal clustering loss based on kl-div (eq 6) to t-sne.",actionable,fact
r49,s4,t16,"If we consider e.g., a linear 1-layer autoencoder to be equivalent to PCA (without the rnn layers), in essence this formulation is closely related to applying pca to reduce the initial dimensionality and then t-sne.",non_actionable,fact
r49,s5,t16,"Also, do the cluster centroids appear to be roughly stable over many runs of the algorithm?",non_actionable,question
r49,s6,t16,"As the averaged results over 5 runs are shown, the standard deviation would be helpful towards showing this empirically.",actionable,suggestion
r49,s7,t16,"On the positive side, it is likely that richer representations can be obtained via this architecture, and results appear to be good with comparison to other metrics The section of the paper that discusses heat-maps should be written more clearly.",non_actionable,agreement
r49,s8,t16,Figure 3 is commented with respect to detecting an event - non-event but the process itself is not clearly described as far as I can see.,actionable,shortcoming
r49,s9,t16,minor note: the dynamic time warping is formally not a metric,actionable,fact
r49,s0,t31,Deep Temporal Clustering This paper proposes an algorithm for jointly performing dimensionality reduction and temporal clustering in a deep learning context.,non_actionable,fact
r49,s1,t31,"An autoencoder is utilized for dimensionality reduction alongside a clustering objective - that is the autoencoder optimizes the mse (using LSTM layers are utilized in the autoencoder for modelling temporal information), while the latent space is fed into the temporal clustering layer.",non_actionable,fact
r49,s2,t31,The clustering/autoencoder objectives are optimized in an alternating optimization fashion.,non_actionable,agreement
r49,s3,t31,"The main con lies in this work being very closely related to t-sne, i.e. compare the the temporal clustering loss based on kl-div (eq 6) to t-sne.",actionable,shortcoming
r49,s4,t31,"If we consider e.g., a linear 1-layer autoencoder to be equivalent to PCA (without the rnn layers), in essence this formulation is closely related to applying pca to reduce the initial dimensionality and then t-sne.",non_actionable,fact
r49,s5,t31,"Also, do the cluster centroids appear to be roughly stable over many runs of the algorithm?",non_actionable,question
r49,s6,t31,"As the averaged results over 5 runs are shown, the standard deviation would be helpful towards showing this empirically.",actionable,suggestion
r49,s7,t31,"On the positive side, it is likely that richer representations can be obtained via this architecture, and results appear to be good with comparison to other metrics The section of the paper that discusses heat-maps should be written more clearly.",actionable,shortcoming
r49,s8,t31,Figure 3 is commented with respect to detecting an event - non-event but the process itself is not clearly described as far as I can see.,actionable,shortcoming
r49,s9,t31,minor note: the dynamic time warping is formally not a metric,actionable,shortcoming
r49,s0,t20,Deep Temporal Clustering This paper proposes an algorithm for jointly performing dimensionality reduction and temporal clustering in a deep learning context.,non_actionable,fact
r49,s1,t20,"An autoencoder is utilized for dimensionality reduction alongside a clustering objective - that is the autoencoder optimizes the mse (using LSTM layers are utilized in the autoencoder for modelling temporal information), while the latent space is fed into the temporal clustering layer.",non_actionable,fact
r49,s2,t20,The clustering/autoencoder objectives are optimized in an alternating optimization fashion.,non_actionable,fact
r49,s3,t20,"The main con lies in this work being very closely related to t-sne, i.e. compare the the temporal clustering loss based on kl-div (eq 6) to t-sne.",non_actionable,fact
r49,s4,t20,"If we consider e.g., a linear 1-layer autoencoder to be equivalent to PCA (without the rnn layers), in essence this formulation is closely related to applying pca to reduce the initial dimensionality and then t-sne.",non_actionable,fact
r49,s5,t20,"Also, do the cluster centroids appear to be roughly stable over many runs of the algorithm?",actionable,question
r49,s6,t20,"As the averaged results over 5 runs are shown, the standard deviation would be helpful towards showing this empirically.",actionable,suggestion
r49,s7,t20,"On the positive side, it is likely that richer representations can be obtained via this architecture, and results appear to be good with comparison to other metrics The section of the paper that discusses heat-maps should be written more clearly.",actionable,suggestion
r49,s8,t20,Figure 3 is commented with respect to detecting an event - non-event but the process itself is not clearly described as far as I can see.,actionable,shortcoming
r49,s9,t20,minor note: the dynamic time warping is formally not a metric,actionable,disagreement
r47,s0,t31,Each subtask execution is represented by a (non-learned) option.,non_actionable,fact
r47,s1,t31,"Alternatively, if the subtask graphs were learned instead of given, that would open the door to scaling an general learning.",non_actionable,fact
r47,s2,t31,"Yet, this is not discussed in the paper.",actionable,shortcoming
r47,s3,t31,"The proposed algorithm relies on fairly involved reward shaping, in that it is a very strong signal of supervision on what the next action should be.",non_actionable,fact
r47,s4,t31,"Additionaly, it's not clear why learning seems to completely ""fail"" without the pre-trained policy.",actionable,shortcoming
r47,s5,t31,"The justification given is that it is ""to address the difficulty of training due to the complex nature of the problem"" but this is not really satisfying as the problems are not that hard.",actionable,shortcoming
r47,s6,t31,It it thus hard to properly evaluate your method against other proposed methods.,actionable,shortcoming
r47,s7,t31,- It seems weird that the smoothed logical AND/OR functions do not depend on the number of inputs; that is unless there are always 3 inputs (but it is not explained why; logical functions are usually formalised as functions of 2 inputs) as suggested by Fig 3.,actionable,shortcoming
r47,s8,t31,Is the time budget different for each new generated environment?,non_actionable,question
r47,s9,t31,- why wait until exactly 120 epochs for NTS-RProp before fine-tuning with actor-critic?,non_actionable,question
r47,s0,t10,Each subtask execution is represented by a (non-learned) option.,non_actionable,other
r47,s1,t10,"Alternatively, if the subtask graphs were learned instead of given, that would open the door to scaling an general learning.",actionable,suggestion
r47,s2,t10,"Yet, this is not discussed in the paper.",actionable,shortcoming
r47,s3,t10,"The proposed algorithm relies on fairly involved reward shaping, in that it is a very strong signal of supervision on what the next action should be.",actionable,fact
r47,s4,t10,"Additionaly, it's not clear why learning seems to completely ""fail"" without the pre-trained policy.",actionable,shortcoming
r47,s5,t10,"The justification given is that it is ""to address the difficulty of training due to the complex nature of the problem"" but this is not really satisfying as the problems are not that hard.",actionable,disagreement
r47,s6,t10,It it thus hard to properly evaluate your method against other proposed methods.,actionable,shortcoming
r47,s7,t10,- It seems weird that the smoothed logical AND/OR functions do not depend on the number of inputs; that is unless there are always 3 inputs (but it is not explained why; logical functions are usually formalised as functions of 2 inputs) as suggested by Fig 3.,actionable,fact
r47,s8,t10,Is the time budget different for each new generated environment?,actionable,question
r47,s9,t10,- why wait until exactly 120 epochs for NTS-RProp before fine-tuning with actor-critic?,actionable,question
r47,s0,t20,Each subtask execution is represented by a (non-learned) option.,non_actionable,fact
r47,s1,t20,"Alternatively, if the subtask graphs were learned instead of given, that would open the door to scaling an general learning.",non_actionable,fact
r47,s2,t20,"Yet, this is not discussed in the paper.",actionable,shortcoming
r47,s3,t20,"The proposed algorithm relies on fairly involved reward shaping, in that it is a very strong signal of supervision on what the next action should be.",non_actionable,fact
r47,s4,t20,"Additionaly, it's not clear why learning seems to completely ""fail"" without the pre-trained policy.",actionable,shortcoming
r47,s5,t20,"The justification given is that it is ""to address the difficulty of training due to the complex nature of the problem"" but this is not really satisfying as the problems are not that hard.",actionable,shortcoming
r47,s6,t20,It it thus hard to properly evaluate your method against other proposed methods.,non_actionable,fact
r47,s7,t20,- It seems weird that the smoothed logical AND/OR functions do not depend on the number of inputs; that is unless there are always 3 inputs (but it is not explained why; logical functions are usually formalised as functions of 2 inputs) as suggested by Fig 3.,actionable,shortcoming
r47,s8,t20,Is the time budget different for each new generated environment?,actionable,question
r47,s9,t20,- why wait until exactly 120 epochs for NTS-RProp before fine-tuning with actor-critic?,actionable,question
r47,s0,t8,Each subtask execution is represented by a (non-learned) option.,non_actionable,fact
r47,s1,t8,"Alternatively, if the subtask graphs were learned instead of given, that would open the door to scaling an general learning.",actionable,suggestion
r47,s2,t8,"Yet, this is not discussed in the paper.",actionable,shortcoming
r47,s3,t8,"The proposed algorithm relies on fairly involved reward shaping, in that it is a very strong signal of supervision on what the next action should be.",non_actionable,fact
r47,s4,t8,"Additionaly, it's not clear why learning seems to completely ""fail"" without the pre-trained policy.",actionable,shortcoming
r47,s5,t8,"The justification given is that it is ""to address the difficulty of training due to the complex nature of the problem"" but this is not really satisfying as the problems are not that hard.",actionable,shortcoming
r47,s6,t8,It it thus hard to properly evaluate your method against other proposed methods.,actionable,shortcoming
r47,s7,t8,- It seems weird that the smoothed logical AND/OR functions do not depend on the number of inputs; that is unless there are always 3 inputs (but it is not explained why; logical functions are usually formalised as functions of 2 inputs) as suggested by Fig 3.,actionable,shortcoming
r47,s8,t8,Is the time budget different for each new generated environment?,non_actionable,question
r47,s9,t8,- why wait until exactly 120 epochs for NTS-RProp before fine-tuning with actor-critic?,non_actionable,question
r47,s0,t2,Each subtask execution is represented by a (non-learned) option.,non_actionable,fact
r47,s1,t2,"Alternatively, if the subtask graphs were learned instead of given, that would open the door to scaling an general learning.",non_actionable,fact
r47,s2,t2,"Yet, this is not discussed in the paper.",actionable,shortcoming
r47,s3,t2,"The proposed algorithm relies on fairly involved reward shaping, in that it is a very strong signal of supervision on what the next action should be.",non_actionable,fact
r47,s4,t2,"Additionaly, it's not clear why learning seems to completely ""fail"" without the pre-trained policy.",non_actionable,shortcoming
r47,s5,t2,"The justification given is that it is ""to address the difficulty of training due to the complex nature of the problem"" but this is not really satisfying as the problems are not that hard.",non_actionable,disagreement
r47,s6,t2,It it thus hard to properly evaluate your method against other proposed methods.,actionable,shortcoming
r47,s7,t2,- It seems weird that the smoothed logical AND/OR functions do not depend on the number of inputs; that is unless there are always 3 inputs (but it is not explained why; logical functions are usually formalised as functions of 2 inputs) as suggested by Fig 3.,actionable,shortcoming
r47,s8,t2,Is the time budget different for each new generated environment?,actionable,question