How to merge multiple directories into one (Documentation/Help Request) #10

Open
alexmc6 opened this issue May 11, 2015 · 0 comments

Hi Folks,

This is not a bug as such; I am just not sure of the capabilities of the --regex and --replacement features.

Ideally, I want to convert a "one directory per hour" layout, e.g.

...somedirectory/2015/05/10/21/...lots of files...
...somedirectory/2015/05/10/22/...lots of files...
...somedirectory/2015/05/10/23/...lots of files...
...somedirectory/2015/05/11/00/...lots of files...
...somedirectory/2015/05/11/01/...lots of files...
...somedirectory/2015/05/11/02/...lots of files...

into "one directory per day"

...somedirectory/2015/05/10/oneBigFile
...somedirectory/2015/05/11/oneBigFile

or, if necessary

...somedirectory/2015/05/10/00/oneBigFile
...somedirectory/2015/05/11/00/oneBigFile

(And ideally I'd love it to tell Hive HCatalog at the same time, but that might be asking too much)

I am trying to use the --regex and --replacement features to do this. Should it work?

This attempt just adds in a new directory:

--regex=".*/\d\d/(.+)"
--replacement=00/$1-${crush.timestamp}-${crush.task.num}-${crush.file.num} \

Should I be trying something like this instead?

--regex=".*/(\d\d)/(.+)"
--replacement=00/$2-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
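
The difference between the two patterns can be checked outside the tool. The sketch below uses Python's `re` module only to show which text each capture group picks up. The sample path and the helper `captured_groups` are made up for illustration, and whether filecrush applies the pattern to a directory or a file path, or requires a full match, is an assumption that this sketch does not confirm.

```python
import re

# Python spelling of the two patterns from the post.
FIRST = re.compile(r".*/\d\d/(.+)")
SECOND = re.compile(r".*/(\d\d)/(.+)")


def captured_groups(pattern, path):
    """Return the capture groups when the whole path matches, else None."""
    match = pattern.fullmatch(path)
    return match.groups() if match else None


# Made-up sample path; the real prefix is elided in the post.
sample = "somedirectory/2015/05/10/21/part-00000"

print(captured_groups(FIRST, sample))   # ('part-00000',)
print(captured_groups(SECOND, sample))  # ('21', 'part-00000')
```

In both patterns the greedy, uncaptured `.*` consumes the year/month/day prefix, so a replacement built only from `$1` or `$2` has no group that refers to the day directory. The second pattern merely adds the hour as `$1`.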

I suppose my fallback solution would be to move everything from the low-level (hourly) directories one directory up before running the file crush. That would be a bit of a pain, but I could write a Perl or shell script that runs "hadoop fs -mv" commands to do it.
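
A minimal sketch of that fallback, written in Python rather than Perl or shell, might look like the following. It only builds the `hadoop fs -mv` command strings and does not run them. The function names `plan_moves` and `to_commands`, the `/data` prefix in the usage note, and the choice to prefix each file name with its hour (to avoid clashes between files that share a name across hour directories) are all assumptions for illustration.

```python
import posixpath
import re
import shlex

# Matches ".../YYYY/MM/DD/HH/filename" and captures the day directory,
# the hour, and the file name (layout taken from the example in the post).
HOURLY = re.compile(r"^(.*/\d{4}/\d{2}/\d{2})/(\d{2})/([^/]+)$")


def plan_moves(paths):
    """Return (source, destination) pairs that lift files one level up."""
    moves = []
    for path in paths:
        match = HOURLY.match(path)
        if match is None:
            continue  # not a file inside an hourly directory; skip it
        day_dir, hour, name = match.groups()
        # Prefix the hour so equally named files from different hours
        # do not overwrite each other (an assumption about file naming).
        moves.append((path, posixpath.join(day_dir, f"{hour}-{name}")))
    return moves


def to_commands(moves):
    """Render each move as a shell-quoted 'hadoop fs -mv' command string."""
    return [
        f"hadoop fs -mv {shlex.quote(src)} {shlex.quote(dst)}"
        for src, dst in moves
    ]
```

For example, `to_commands(plan_moves(["/data/2015/05/10/21/part-00000"]))` yields a single command that moves the file to `/data/2015/05/10/21-part-00000`. The list of input paths would have to come from somewhere else, for instance a recursive listing of the source directory.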

Alex
