Add fixed-width column support #220
Changes from 1 commit
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -117,6 +117,34 @@ public interface Builder { | |
*/ | ||
Builder headerValidator(Predicate<String> headerValidator); | ||
|
||
/** | ||
* True if the input is organized into fixed width columns rather than delimited by a delimiter. | ||
*/ | ||
Builder hasFixedWidthColumns(boolean hasFixedWidthColumns); | ||
|
||
/** | ||
* When {@link #hasFixedWidthColumns} is set, the library either determines the column widths from the header | ||
* row (provided {@link #hasHeaderRow} is set), or the column widths can be specified explicitly by the caller. | ||
* If the caller wants to specify them explicitly, they can use this method. | ||
* | ||
* @param fixedColumnWidths The caller-specified widths of the columns. | ||
*/ | ||
Builder fixedColumnWidths(Iterable<Integer> fixedColumnWidths); | ||
Review comment: If specified, do all of the column lengths need to be set? All but the last?
Reply: All but the last need to be set. The last is a placeholder. Made explicit in comments for
||
|
||
/** | ||
* This setting controls what units fixed width columns are measured in. When true, fixed width columns are | ||
* measured in Unicode code points. When false, fixed width columns are measured in UTF-16 units (aka Java | ||
* chars). The difference arises when encountering characters outside the Unicode Basic Multilingual Plane. For | ||
* example, the Unicode code point 💔 (U+1F494) is one Unicode code point, but takes two Java chars to | ||
* represent. Along these lines, the string 💔💔💔 would fit in a column of width 3 when utf32CountingMode is | ||
* true, but would require a column width of at least 6 when utf32CountingMode is false. | ||
* | ||
* The default setting of true is arguably more natural for users (the number of characters they see matches the | ||
* visual width of the column). But some programs may want the value of false because they are counting Java | ||
* chars. | ||
*/ | ||
Builder useUtf32CountingConvention(boolean useUtf32CountingConvention); | ||
|
||
/** | ||
* Number of data rows to skip before processing data. This is useful when you want to parse data in chunks. | ||
* Typically used together with {@link Builder#numRows}. Defaults to 0. | ||
|
@@ -340,6 +368,30 @@ public Predicate<String> headerValidator() { | |
return c -> true; | ||
} | ||
|
||
/** | ||
* See {@link Builder#hasFixedWidthColumns}. | ||
*/ | ||
@Default | ||
public boolean hasFixedWidthColumns() { | ||
return false; | ||
} | ||
|
||
/** | ||
* See {@link Builder#fixedColumnWidths}. | ||
*/ | ||
@Default | ||
public List<Integer> fixedColumnWidths() { | ||
Review comment: We should be able to add a check that if fixedColumnWidths is specified, hasFixedWidthColumns must be true.
Reply: Check added, and triggered in the test
||
return Collections.emptyList(); | ||
} | ||
|
||
/** | ||
* See {@link Builder#useUtf32CountingConvention}. | ||
*/ | ||
@Default | ||
public boolean useUtf32CountingConvention() { | ||
return true; | ||
} | ||
|
||
/** | ||
* See {@link Builder#skipRows}. | ||
*/ | ||
|
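The two counting conventions described in the useUtf32CountingConvention Javadoc can be illustrated with plain JDK string methods. This is a minimal standalone sketch and not code from this PR; the class and method names (CountingConventionDemo, widthInCodePoints, widthInUtf16Units) are hypothetical.

```java
final class CountingConventionDemo {
    /** Width of s measured in Unicode code points (the "UTF-32" convention). */
    static int widthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Width of s measured in UTF-16 units, i.e. Java chars. */
    static int widthInUtf16Units(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        // Build U+1F494 from its code point so the source file can stay ASCII.
        final String heart = new String(Character.toChars(0x1F494));
        final String hearts = heart + heart + heart;
        System.out.println(widthInCodePoints(hearts)); // prints 3
        System.out.println(widthInUtf16Units(hearts)); // prints 6
    }
}
```

The same three-character string therefore fits a column of width 3 under one convention and needs width 6 under the other, which is the difference the setting controls.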
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
package io.deephaven.csv.reading.cells; | ||
|
||
import io.deephaven.csv.containers.ByteSlice; | ||
import io.deephaven.csv.reading.ReaderUtil; | ||
import io.deephaven.csv.util.CsvReaderException; | ||
import io.deephaven.csv.util.MutableBoolean; | ||
import io.deephaven.csv.util.MutableInt; | ||
|
||
import java.io.InputStream; | ||
|
||
/** | ||
* This class uses an underlying DelimitedCellGrabber to grab whole lines at a time from the input stream, and then it | ||
* breaks them into fixed-sized cells to return to the caller. | ||
*/ | ||
public class FixedCellGrabber implements CellGrabber { | ||
/** | ||
* Makes a degenerate CellGrabber that has no delimiters or quotes and therefore returns whole lines. This is a | ||
* somewhat quick-and-dirty way to reuse the buffering and newline logic in DelimitedCellGrabber without rewriting | ||
* it. | ||
* | ||
* @param stream The underlying stream. | ||
* @return The "line grabber" | ||
*/ | ||
public static CellGrabber makeLineGrabber(InputStream stream) { | ||
final byte IllegalUtf8 = (byte) 0xff; | ||
return new DelimitedCellGrabber(stream, IllegalUtf8, IllegalUtf8, true, false); | ||
} | ||
|
||
private final CellGrabber lineGrabber; | ||
private final int[] columnWidths; | ||
private final boolean ignoreSurroundingSpaces; | ||
private final boolean utf32CountingMode; | ||
private final ByteSlice rowText; | ||
private boolean needsUnderlyingRefresh; | ||
private int colIndex; | ||
private final MutableBoolean dummy1; | ||
private final MutableInt dummy2; | ||
|
||
/** Constructor. */ | ||
public FixedCellGrabber(final CellGrabber lineGrabber, final int[] columnWidths, boolean ignoreSurroundingSpaces, | ||
boolean utf32CountingMode) { | ||
this.lineGrabber = lineGrabber; | ||
this.columnWidths = columnWidths; | ||
this.ignoreSurroundingSpaces = ignoreSurroundingSpaces; | ||
this.utf32CountingMode = utf32CountingMode; | ||
this.rowText = new ByteSlice(); | ||
this.needsUnderlyingRefresh = true; | ||
this.colIndex = 0; | ||
this.dummy1 = new MutableBoolean(); | ||
this.dummy2 = new MutableInt(); | ||
} | ||
|
||
@Override | ||
public void grabNext(ByteSlice dest, MutableBoolean lastInRow, MutableBoolean endOfInput) | ||
throws CsvReaderException { | ||
if (needsUnderlyingRefresh) { | ||
// Underlying row used up, and all columns provided. Ask underlying CellGrabber for the next line. | ||
lineGrabber.grabNext(rowText, dummy1, endOfInput); | ||
|
||
if (endOfInput.booleanValue()) { | ||
// Set dest to the empty string, and leave 'endOfInput' set to true. | ||
dest.reset(rowText.data(), rowText.end(), rowText.end()); | ||
return; | ||
} | ||
|
||
needsUnderlyingRefresh = false; | ||
colIndex = 0; | ||
} | ||
|
||
// There is data to return. Count off N characters. The final column gets all remaining characters. | ||
final boolean lastCol = colIndex == columnWidths.length - 1; | ||
final int numCharsToTake = lastCol ? Integer.MAX_VALUE : columnWidths[colIndex]; | ||
takeNCharactersInCharset(rowText, dest, numCharsToTake, utf32CountingMode, dummy2); | ||
++colIndex; | ||
needsUnderlyingRefresh = lastCol || dest.size() == 0; | ||
lastInRow.setValue(needsUnderlyingRefresh); | ||
endOfInput.setValue(false); | ||
|
||
if (ignoreSurroundingSpaces) { | ||
ReaderUtil.trimSpacesAndTabs(dest); | ||
} | ||
} | ||
|
||
private static void takeNCharactersInCharset(ByteSlice src, ByteSlice dest, int numCharsToTake, | ||
boolean utf32CountingMode, MutableInt tempInt) { | ||
final byte[] data = src.data(); | ||
final int cellBegin = src.begin(); | ||
int current = cellBegin; | ||
while (numCharsToTake > 0) { | ||
if (current == src.end()) { | ||
break; | ||
} | ||
final int utf8Length = ReaderUtil.getUtf8LengthAndCharLength(data[current], src.end() - current, | ||
utf32CountingMode, tempInt); | ||
if (numCharsToTake < tempInt.intValue()) { | ||
// There is not enough space left in the field to store this character. | ||
// This can happen if CsvSpecs is set for the UTF16 counting convention, | ||
// there is one unit left in the field, and we encounter a character outside | ||
// the Basic Multilingual Plane, which would require two units. | ||
break; | ||
} | ||
numCharsToTake -= tempInt.intValue(); | ||
current += utf8Length; | ||
} | ||
dest.reset(src.data(), cellBegin, current); | ||
src.reset(src.data(), current, src.end()); | ||
} | ||
|
||
@Override | ||
public int physicalRowNum() { | ||
return lineGrabber.physicalRowNum(); | ||
} | ||
} |
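For readers who want to try the column-slicing logic in isolation, the following is a simplified sketch of the approach taken by grabNext and takeNCharactersInCharset above. It is an approximation under stated assumptions, not the PR's code: it works on a whole line held in a byte array instead of a ByteSlice, it decodes each cell to a String, it does no trimming, and the names FixedWidthSketch, utf8Length, and splitLine are hypothetical. As in the diff, the final column takes all remaining characters, and a row ends early once an empty cell is produced.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FixedWidthSketch {
    /** Number of bytes in the UTF-8 sequence that starts with the given lead byte. */
    static int utf8Length(byte lead) {
        if ((lead & 0x80) == 0) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        throw new IllegalArgumentException("Not a UTF-8 lead byte");
    }

    /** Splits one UTF-8 encoded line into cells; the last entry of widths is a placeholder. */
    static List<String> splitLine(byte[] line, int[] widths, boolean utf32CountingMode) {
        final List<String> cells = new ArrayList<>();
        int begin = 0;
        for (int col = 0; col < widths.length; ++col) {
            final boolean lastCol = col == widths.length - 1;
            int remaining = lastCol ? Integer.MAX_VALUE : widths[col];
            int current = begin;
            while (remaining > 0 && current < line.length) {
                final int byteLen = utf8Length(line[current]);
                if (current + byteLen > line.length) {
                    throw new IllegalArgumentException("Truncated UTF-8 sequence");
                }
                // A 4-byte sequence is outside the BMP: one code point, but two UTF-16 units.
                final int units = (byteLen == 4 && !utf32CountingMode) ? 2 : 1;
                if (remaining < units) {
                    break; // Not enough room left in this column for the character.
                }
                remaining -= units;
                current += byteLen;
            }
            cells.add(new String(line, begin, current - begin, StandardCharsets.UTF_8));
            final boolean emptyCell = current == begin;
            begin = current;
            if (emptyCell) {
                break; // Row text is used up; this empty cell is the last one in the row.
            }
        }
        return cells;
    }
}
```

With these assumptions, calling splitLine on the UTF-8 bytes of "abcdefgh" with widths {3, 2, 1} yields the cells abc, de, and fgh; the final width is ignored because the last column absorbs the rest of the line.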
Review comment: What do we expect will happen if ignoreSurroundingSpaces() == false? Is that a legit configuration? Should we disallow it?

What about quote / trim behavior? I'm assuming that " (or quote char) inside the fixed width are included in the resulting string. In which case, maybe we want to assert that if we are in fixed width mode, quote() should not be set (or, at least, not set differently than the default).

Mainly, I wonder if we need to consider the scope of CsvSpecs wrt hasFixedWidthColumns; I'm happy to err on the side of being more strict.
Reply: ignoreSurroundingSpaces() == false is legit, though perhaps not very useful. Now tested in CsvReaderTest#fixedColumnsMayIncludeOrExcludeSurroundingSpaces.

quote / trim / etc now validated for only being set in the right mode. See CsvReaderTest#checkParametersIncompatibleWithFixedWidthMode and its partner CsvReaderTest#checkParametersIncompatibleWithDelimitedMode.
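The kind of mode validation discussed in this thread could look roughly like the sketch below. This is an illustration only and makes several assumptions: the class FixedWidthSpecCheck, the method findProblems, and its parameters are hypothetical, the default quote character is passed in by the caller because this page does not state it, and the actual checks added in the PR may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedWidthSpecCheck {
    /**
     * Returns a description of each setting that conflicts with the selected mode.
     * All parameter names are hypothetical stand-ins for the corresponding CsvSpecs settings.
     */
    static List<String> findProblems(boolean hasFixedWidthColumns, List<Integer> fixedColumnWidths,
            char quote, char defaultQuote) {
        final List<String> problems = new ArrayList<>();
        if (!hasFixedWidthColumns && !fixedColumnWidths.isEmpty()) {
            // Explicit widths only make sense when fixed-width mode is on.
            problems.add("fixedColumnWidths is set but hasFixedWidthColumns is false");
        }
        if (hasFixedWidthColumns && quote != defaultQuote) {
            // Quotes are not interpreted in fixed-width mode, so a custom quote is likely a mistake.
            problems.add("quote is customized but has no effect in fixed-width mode");
        }
        return problems;
    }
}
```

Collecting every conflict into a list, instead of failing on the first one, lets a caller report all misconfigured settings at once; that is a design choice of this sketch and not something the thread specifies.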