Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added functions for reading feature chunks. #37

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

gulamungon
Copy link

I added some functions for reading chunks of features.

@KarelVesely84
Copy link
Owner

Hi Johan,
i am not going to accept it as it is now. I think it would be good to extend the existing functions in current 'master' to accept the tuple of slice objects that is provided from: _strip_mat_range(), which i just added in...

Also, I think, there should be support of slicing matrices without compression.
And the code could be a bit simplified.

The current 'master' supports slices of matrices in 'scp' files, but the implementation loads the full matrix and applies slices afterwards. In the ideal case it should read only the data that are necessary, as you have it in your PR...

Would you like to continue working on it, or should I take over and finish it?
And, i guess you rely a lot on reading the lengths from the 'headers'?
(i have no problem with adding the support of that...)

K.

@KarelVesely84
Copy link
Owner

See the last PR: #38

@gulamungon
Copy link
Author

I have made the changes. Please see Comment 1 and Comment 2 in _read_range_slice for some potential issues.

@KarelVesely84
Copy link
Owner

FWDing info from Johann:

I fixed it. In summary:

  • Binary and compressed binary will be sliced efficiently. Ascii will still be sliced the way you did it.
  • I removed the code I previously added for checking duration of files. I don't need it anymore because I prepare this info in advance in my training scripts. Of course if you think this feature would be useful, I could add it.
  • I added an option for reading data using scp in list format like this:
    [ "AMI_ES2011a_H00_FEE041_0003714_0003915_slice2 tests/data/feats.ark:14913[:,7:13]",
    "AMI_ES2011a_H00_FEE041_0003714_0003915_slice3 tests/data/feats.ark:14913[20:30,7:13]" ]
    this is because in my training scripts, I select the chunks to put into minibatches during training randomly and I don't want to store a temporary scp in order to load the data. Is it OK?
  • Limitations
    • The "slice step must be 1 at the moment" That is, python slice like slice(20,30,2) can't be used. I could add this possibility in the future.
    • slicing can only be done on data provided by scp, i.e, there is no mechanism to provide it to "read_mat_ark". In fact if we do want to modify "read_mat_ark" so that one could provide e.g. a list of start and end times, the new code would have to be modified slightly. I will comment on that in the code to make you aware of the issue.
    • One issue I run into is that I use seek and tell functions and if the input is a pipe this would fail. On the other hand for piped input we don't expect slices to be provided so I simple do a check like fd.seekable() and if this is not the case, seek and tell will not be used and it is asserted that there is no slice information provided. I will comment in the code about this and I think it would be good if you could check it so that I don't overlook anything scenario in this regard.

So how do I do to make the pull request? I have the code as before in my fork in a branch called "read_feature_chunk" and I have commited it locally. I guess I should somehow continue on the previous pull request?

@KarelVesely84
Copy link
Owner

KarelVesely84 commented Apr 6, 2020

Karels reaction:

  • it is fine to drop durations, this can be done by reading a pipeline producing lengths (with wav-to-len in it).
  • ideally, the slicing should be generic, supporting the 5 variants: none, [1:10], [1:10,:], [:,3:7], [1:10,3:7],

Ad limitations:

  • slice step 1, this is okay. kaldi has this limitation too https://kaldi-asr.org/doc/io_tut.html
  • yes, there is no way to slice while reading the ark.
  • yes, the assert is well deserved there ;)

The pull request is still sitting there from the last time :)
You don't need to open it again...
K.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants