Hi Folks,
This is not a bug as such - I am just not sure of the capabilities of the --regex and --replacement features.
Ideally, what I want is to convert directories from "one directory per hour", e.g.
...somedirectory/2015/05/10/21/...lots of files...
...somedirectory/2015/05/10/22/...lots of files...
...somedirectory/2015/05/10/23/...lots of files...
...somedirectory/2015/05/11/00/...lots of files...
...somedirectory/2015/05/11/01/...lots of files...
...somedirectory/2015/05/11/02/...lots of files...
into "one directory per day"
...somedirectory/2015/05/10/oneBigFile
...somedirectory/2015/05/11/oneBigFile
or, if necessary
...somedirectory/2015/05/10/00/oneBigFile
...somedirectory/2015/05/11/00/oneBigFile
(And ideally I'd love it to tell Hive HCatalog at the same time, but that might be asking too much)
I am trying to use the --regex and --replacement features to do this. Should it work?
This just adds in a new directory:
--regex=".*/\d\d/(.+)"
--replacement=00/$1-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
Should I be trying something like this instead?
--regex=".*/(\d\d)/(.+)"
--replacement=00/$2-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
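Just to show what I expect the two capture groups to pick up - this is a quick java.util.regex check outside of Crush, with a made-up file name ("part-00000"), and it assumes Crush matches the regex against the full per-hour file path, which may not be how it actually works:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexCheck {
        public static void main(String[] args) {
            // Hypothetical input path; the real prefix is elided above.
            String path = "somedirectory/2015/05/10/21/part-00000";

            Pattern p = Pattern.compile(".*/(\\d\\d)/(.+)");
            Matcher m = p.matcher(path);
            if (m.matches()) {
                System.out.println("group 1 (hour dir): " + m.group(1)); // "21"
                System.out.println("group 2 (file):     " + m.group(2)); // "part-00000"
            }
        }
    }

If that is how the matching works, then $2 in the replacement should be the file name and $1 the hour directory.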
I suppose my fallback solution would be to move everything from the low-level directories up one directory before running the file crush. That would be a bit of a pain - I suppose I could write a Perl or shell script to do that, which ran "hadoop fs -mv " commands.
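Something along these lines is what I had in mind for the fallback - a rough, untested sketch using the Hadoop FileSystem API instead of shelling out to "hadoop fs -mv" (the day directory path is made up, and I prefix the hour onto each file name to avoid collisions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlattenHourDirs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical day directory; the real prefix is elided above.
            Path day = new Path("/somedirectory/2015/05/10");

            // For each hour subdirectory (00..23), move its files up into the
            // day directory, prefixing the hour so names do not collide.
            for (FileStatus hour : fs.listStatus(day)) {
                if (!hour.isDirectory()) {
                    continue;
                }
                String hourName = hour.getPath().getName();
                for (FileStatus file : fs.listStatus(hour.getPath())) {
                    Path dest = new Path(day, hourName + "-" + file.getPath().getName());
                    if (!fs.rename(file.getPath(), dest)) {
                        System.err.println("Failed to move " + file.getPath());
                    }
                }
            }
        }
    }

I would rather avoid that extra step if the --regex/--replacement route can do it directly.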
Alex