Spark 2.4.x DataSourceV2 implementation for WARC

mydpy/WARCDataSource

WARCDataSource

Work in progress.

Spark DataSourceV2 for reading WARC.

Unit Tests

The unit test is not self-contained: it relies on one of the April 2019 WARC files being present in $HOME/Downloads. The Gradle runSpark task makes the same assumption. The reason is line-ending peculiarities between Unix and Windows. I wanted to be certain that extracting just a few records into a smaller file was not itself causing the problems I experienced while parsing records.
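
To illustrate the line-ending concern, here is a minimal sketch of header parsing that tolerates both CRLF (which the WARC format specifies) and bare LF (which can appear if a file has been converted by a tool or editor). The class `WarcHeaderSketch` and the method `parseHeaders` are hypothetical names used only for illustration. They are not taken from this repository and do not describe how its reader is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this repository.
final class WarcHeaderSketch {

    // Parses the header block of a single WARC record into name/value pairs.
    // Accepts both CRLF and bare LF as line terminators.
    static Map<String, String> parseHeaders(String headerBlock) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : headerBlock.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // the version line (e.g. "WARC/1.0") has no colon
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        String crlf = "WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 42\r\n\r\n";
        String lf = crlf.replace("\r\n", "\n");
        // Both variants should yield the same header map.
        System.out.println(parseHeaders(crlf));
        System.out.println(parseHeaders(lf).equals(parseHeaders(crlf)));
    }
}
```

A check like this only covers the header block. The record payload length still comes from Content-Length, which counts bytes, so a file whose line endings were rewritten may no longer match its declared lengths.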
