richylyq/hadoop2-csv

 
 

hadoop2-csv

Input format for hadoop able to read multiline CSVs

Run BasicTest.java to see it working. See src/test/resource/test.csv for a multiline demo file.

The key returned is the file position where the record (logical line) starts, and the value is a List containing the column values.

Zip files are supported.

Ideas for improving this are welcome.

Example:

If we read this CSV (note that the first and third records each span multiple lines):

Joe Demo,"2 Demo Street,
Demoville,
Australia. 2615",[email protected]
Jim Sample,"3 Sample Street, Sampleville, Australia. 2615",[email protected]
Jack Example,"1 Example Street, Exampleville, Australia.
2615",[email protected]

The output is as follows:

==> TestMapper
==> key=0
==> val[0] = Joe Demo
==> val[1] = 2 Demo Street, 
Demoville, 
Australia. 261
==> val[2] = [email protected]

==> TestMapper
==> key=73
==> val[0] = Jim Sample
==> val[1] = 
==> val[2] = [email protected]

==> TestMapper
==> key=10
==> val[0] = Jack Example
==> val[1] = 1 Example Street, Exampleville, Australia. 261
==> val[2] = [email protected]
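
The key/value shape described above (the key is the position where a record starts, the value is a list of column values) can be illustrated with a small standalone sketch. This is not the parser used by hadoop2-csv: the class `MultilineCsvSketch`, the type `CsvRecord`, and the method `parse` are hypothetical names introduced here, the sketch works on an in-memory String rather than a Hadoop input split, and its keys are character offsets, which match byte offsets only for single-byte encodings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the implementation used by hadoop2-csv.
final class MultilineCsvSketch {

    /** One parsed record: the offset where it starts plus its column values. */
    static final class CsvRecord {
        final long key;            // character offset of the record's first character
        final List<String> values; // column values with the enclosing quotes removed

        CsvRecord(long key, List<String> values) {
            this.key = key;
            this.values = values;
        }
    }

    /** Splits CSV text into records, letting quoted fields span line breaks. */
    static List<CsvRecord> parse(String csv) {
        List<CsvRecord> records = new ArrayList<>();
        List<String> columns = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        long recordStart = 0;

        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < csv.length() && csv.charAt(i + 1) == '"') {
                    field.append('"'); // a doubled quote is an escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // commas and newlines are literal inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;       // opening quote
            } else if (c == ',') {
                columns.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                // An unquoted newline ends the current record.
                columns.add(field.toString());
                field.setLength(0);
                records.add(new CsvRecord(recordStart, columns));
                columns = new ArrayList<>();
                recordStart = i + 1;
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Emit a final record when the input does not end with a newline.
        if (field.length() > 0 || !columns.isEmpty()) {
            columns.add(field.toString());
            records.add(new CsvRecord(recordStart, columns));
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "Joe Demo,\"2 Demo Street,\nDemoville\",joe\n"
                   + "Jim Sample,\"3 Sample Street\",jim\n";
        for (CsvRecord r : parse(csv)) {
            System.out.println("key=" + r.key + " values=" + r.values);
        }
    }
}
```

The point of the sketch is that a newline only terminates a record when it appears outside a quoted field, so each key marks the start of a logical record rather than the start of a physical line.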

License

https://www.apache.org/licenses/LICENSE-2.0.html

Credits

This is a personal fork of CSVInputFormat, built against Hadoop 2. Please report issues to the original project.
