EOFException on Janelia file system #50

Closed
hanslovsky opened this issue Jan 31, 2019 · 5 comments

Comments

@hanslovsky
Contributor

When running Spark jobs on the Janelia cluster, I get these EOFExceptions:

19/01/31 09:42:22 ERROR Executor: Exception in task 1259.0 in stage 1.0 (TID 2609)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
	at net.imglib2.cache.util.CacheAsUncheckedCacheAdapter.get(CacheAsUncheckedCacheAdapter.java:32)
	at net.imglib2.img.cell.LazyCellImg$LazyCells.get(LazyCellImg.java:104)
	at net.imglib2.img.list.AbstractLongListImg$LongListRandomAccess.get(AbstractLongListImg.java:274)
	at net.imglib2.img.cell.CellRandomAccess.getCell(CellRandomAccess.java:136)
	at net.imglib2.img.cell.CellRandomAccess.updatePosition(CellRandomAccess.java:474)
	at net.imglib2.img.cell.CellRandomAccess.fwd(CellRandomAccess.java:164)
	at net.imglib2.view.RandomAccessibleIntervalCursor.nextLine(RandomAccessibleIntervalCursor.java:124)
	at net.imglib2.view.RandomAccessibleIntervalCursor.fwd(RandomAccessibleIntervalCursor.java:113)
	at net.imglib2.view.RandomAccessibleIntervalCursor.next(RandomAccessibleIntervalCursor.java:150)
	at org.janelia.saalfeldlab.label.spark.SparkWatersheds.relabel(SparkWatersheds.java:542)
	at org.janelia.saalfeldlab.label.spark.SparkWatersheds.relabel(SparkWatersheds.java:523)
	at org.janelia.saalfeldlab.label.spark.SparkWatersheds.lambda$run$b9e1d326$1(SparkWatersheds.java:434)
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1838)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
	at net.imglib2.cache.ref.SoftRefLoaderCache.get(SoftRefLoaderCache.java:111)
	at net.imglib2.cache.util.LoaderCacheAsCacheAdapter.get(LoaderCacheAsCacheAdapter.java:30)
	at net.imglib2.cache.util.CacheAsUncheckedCacheAdapter.get(CacheAsUncheckedCacheAdapter.java:28)
	... 24 more
Caused by: java.lang.RuntimeException: java.io.EOFException
	at org.janelia.saalfeldlab.n5.imglib2.N5CellLoader.load(N5CellLoader.java:132)
	at net.imglib2.cache.img.LoadedCellCacheLoader.get(LoadedCellCacheLoader.java:91)
	at net.imglib2.cache.img.LoadedCellCacheLoader.get(LoadedCellCacheLoader.java:51)
	at net.imglib2.cache.ref.SoftRefLoaderCache.get(SoftRefLoaderCache.java:101)
	... 26 more
Caused by: java.io.EOFException
	at java.io.DataInputStream.readShort(DataInputStream.java:315)
	at org.janelia.saalfeldlab.n5.DefaultBlockReader.readBlock(DefaultBlockReader.java:71)
	at org.janelia.saalfeldlab.n5.N5FSReader.readBlock(N5FSReader.java:169)
	at org.janelia.saalfeldlab.n5.imglib2.N5CellLoader.load(N5CellLoader.java:130)
	... 29 more

The jobs recover and finish successfully. The N5 data is written and read with the Java N5FS implementation.

@igorpisarev
Contributor

Looks like the exception is happening here reading the very first bytes: DefaultBlockReader.java#L71
Can you try to trace the block index when the exception occurs, and see if the corresponding file is indeed empty (or does not exist)?

Though, this is suspicious that jobs reattempt and succeed... Might it be that the blocks are written concurrently as they are being read?
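
As a rough aid for the check suggested above, here is a minimal sketch that reports whether a block file is missing, empty, or too short for the first `readShort` seen in the trace. It assumes the N5 filesystem layout `<base>/<dataset>/<x>/<y>/<z>`; the class and method names (`BlockFileCheck`, `blockPath`, `describe`, `headerReadable`) are hypothetical and not part of N5.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class BlockFileCheck {

    // Assumed layout: one directory level per grid coordinate below the dataset.
    // The dataset name must be relative (no leading slash), or resolve() would
    // discard the base path.
    static Path blockPath(Path base, String dataset, long... gridPosition) {
        Path p = base.resolve(dataset);
        for (long g : gridPosition) {
            p = p.resolve(Long.toString(g));
        }
        return p;
    }

    // Returns "missing", "empty", or "<n> bytes".
    static String describe(Path blockFile) throws IOException {
        if (!Files.isRegularFile(blockFile)) {
            return "missing";
        }
        long size = Files.size(blockFile);
        return size == 0 ? "empty" : size + " bytes";
    }

    // Mimics the first read in the trace: readShort needs two bytes and
    // throws EOFException if the stream ends before that.
    static boolean headerReadable(Path blockFile) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(blockFile))) {
            in.readShort();
            return true;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: BlockFileCheck <base> <dataset> <x> [<y> ...]");
            return;
        }
        long[] grid = new long[args.length - 2];
        for (int i = 0; i < grid.length; i++) {
            grid[i] = Long.parseLong(args[i + 2]);
        }
        Path block = blockPath(Paths.get(args[0]), args[1], grid);
        String state = describe(block);
        System.out.println(block + ": " + state);
        if (!state.equals("missing")) {
            System.out.println("header readable: " + headerReadable(block));
        }
    }
}
```

Running it against the grid position logged at the time of the exception would show whether the file was really empty, or whether it only looked empty at the moment it was read.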

@hanslovsky
Contributor Author

Though, this is suspicious that jobs reattempt and succeed

I agree, maybe it is an issue with the file system, actually, and not N5.

Might it be that the blocks are written concurrently as they are being read?

They should not, unless I have a bug in my code, which is a realistic possibility, of course.
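
Since the Spark tasks succeed on a later attempt, one possible workaround (not a fix for the underlying cause) is to retry the read locally instead of failing the whole task. The following is only a sketch under that assumption: `RetryOnEof`, `IoSupplier`, and `readWithRetry` are hypothetical names, and the supplier would wrap whatever call performs the actual block read. Note that in the trace above the `EOFException` is wrapped in a `RuntimeException` by `N5CellLoader`, so this only applies where the `EOFException` is still visible directly.

```java
import java.io.EOFException;
import java.io.IOException;

class RetryOnEof {

    // A supplier that is allowed to throw IOException.
    @FunctionalInterface
    interface IoSupplier<T> {
        T get() throws IOException;
    }

    // Calls read up to maxAttempts times, retrying only on EOFException and
    // sleeping a little longer after each failed attempt. Other exceptions
    // propagate immediately.
    static <T> T readWithRetry(IoSupplier<T> read, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        EOFException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.get();
            } catch (EOFException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        // All attempts failed with EOFException: rethrow the last one.
        throw last;
    }
}
```

If the retries always succeed on the second attempt, that would point towards the file system or a concurrent writer rather than a permanently truncated block.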

@hanslovsky
Contributor Author

I just ran into EOF issues again in a different project, and increasing the block size seems to help there. This might be unrelated, but if other people run into EOF issues, I recommend trying a larger block size as well.
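
For a sense of scale, here is a small sketch of how the number of block files changes with the block size. It does not explain why a larger block size helps; it only shows that fewer, larger files are read and written. The dataset dimensions and block sizes are made-up example values, and `BlockCount` / `numBlocks` are hypothetical names.

```java
class BlockCount {

    // Number of blocks needed to cover the dataset: the product over all
    // dimensions of ceil(dimension / blockSize).
    static long numBlocks(long[] dimensions, int[] blockSize) {
        long n = 1;
        for (int d = 0; d < dimensions.length; d++) {
            n *= (dimensions[d] + blockSize[d] - 1) / blockSize[d];
        }
        return n;
    }

    public static void main(String[] args) {
        long[] dims = {1024, 1024, 1024}; // example dataset size, not from the issue
        System.out.println(numBlocks(dims, new int[] {64, 64, 64}));    // prints 4096
        System.out.println(numBlocks(dims, new int[] {128, 128, 128})); // prints 512
    }
}
```

Doubling the block size along each of three dimensions reduces the number of block files by a factor of eight.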

@igorpisarev
Contributor

Most likely this is the cause: imglib/imglib2#252

@bogovicj
Contributor

imglib/imglib2#252 has been fixed by imglib/imglib2#329
