
Parquet Hadoop input plugin for Embulk

  • Reads Parquet files via the Hadoop FileSystem API (built for Hadoop 3.0).
  • Outputs each row as a single column named record (type: json).
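
For illustration (the column names here are hypothetical), a Parquet row with columns id and name would reach downstream plugins as one json value:

{"id": 1, "name": "foo"}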

Overview

  • Plugin type: input
  • Resume supported: no
  • Cleanup supported: no
  • Guess supported: no

Configuration

  • config_files: list of paths to Hadoop's configuration files (array of strings, default: [])
  • config: overwrites Hadoop configuration parameters (hash, default: {})
  • path: file path on HDFS; glob patterns are supported (string, required)
  • parquet_log_level: log level of the Parquet reader module (string, default: "INFO")
    • value is one of java.util.logging.Level (ALL, SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, OFF)

Hadoop Configuration

The following Parquet-reader-specific parameters can be set through config (or config_files).

  • parquet.read.bad.record.threshold: tolerated fraction of bad records per file (float, 0.0 to 1.0, default: 0)

Example

in:
  type: parquet_hadoop
  config_files:
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/hdfs-site.xml
  config:
    parquet.read.bad.record.threshold: 0.01
  path: /user/hadoop/example/data/*.parquet
  parquet_log_level: WARNING
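
Assuming the configuration above is saved as config.yml (the file name is arbitrary), it can be previewed and run with the standard Embulk commands:

$ embulk preview config.yml
$ embulk run config.yml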

Build

$ ./gradlew gem  # add -t to watch for file changes and rebuild continuously

Install gem

$ embulk gem install pkg/embulk-input-parquet_hadoop-<version>.gem

Notes

Why is this implemented as an input plugin rather than a parser plugin?

Because parsing the Parquet format requires a seekable file stream: the file's metadata lives in a footer at the end of the file, while a parser plugin only receives a sequential read stream.
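
A minimal sketch (not the plugin's actual code) of why a seek is unavoidable: Parquet keeps its file metadata (schema and row-group offsets) in a footer at the end of the file, terminated by a 4-byte little-endian footer length and the magic bytes "PAR1". The example below uses a local java.io.RandomAccessFile purely for illustration; the plugin itself reads through the Hadoop FileSystem API.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class ParquetFooterSketch {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            long len = file.length();
            // The last 8 bytes of a Parquet file are:
            // 4-byte footer length (little-endian) + 4-byte magic "PAR1".
            file.seek(len - 8);
            int footerLen = Integer.reverseBytes(file.readInt()); // readInt() is big-endian
            byte[] magic = new byte[4];
            file.readFully(magic);
            System.out.println("magic: " + new String(magic, StandardCharsets.US_ASCII));
            // The Thrift-encoded FileMetaData starts here; a real reader parses
            // it first, then seeks back to each row group's data pages.
            file.seek(len - 8 - footerLen);
        }
    }
}

Because the footer must be read before any row data, a purely sequential stream (which is all a parser plugin gets) would force buffering the entire file.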

How is the Parquet schema mapped to JSON?

TBD
