Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fixed-width column support #220

Merged
merged 3 commits into from
Nov 8, 2024

Conversation

kosak
Copy link
Contributor

@kosak kosak commented Nov 4, 2024

This PR depends on #219 and should not be merged until that one is.

Until then the changes in this PR will look like they also include the changes from #219, but that will go away once the other PR has been merged.

This PR resolves #212

Perhaps the most controversial idea here is the flag that controls how characters are counted. Should they be UTF-32 code points or the number of characters Java would see as String.length? There's no difference until you go outside the Unicode Basic Multilingual Plane. But once outside that plane, Unicode code points require two Java char rather than one.

The question is, what does the user intend? Should a string like 🥰😻🧡💓💕💖 represent a width of six (because it is six Unicode code points) or a width of twelve (because the Java string representing it would have length 12).

I think in some sense the question has no right answer, so we have a flag CsvSpecs.useUtf32CountingConvention to make it go one way or the other.

Also I need to add a couple more tests that I forgot to do (e.g. for explicitly specifying column widths rather than inferring them, and a file that has no column headers, etc).

@kosak kosak self-assigned this Nov 4, 2024
@kosak kosak requested a review from devinrsmith November 4, 2024 01:51
@kosak
Copy link
Contributor Author

kosak commented Nov 4, 2024

Note I will fix the "DESIRED UNITS: BLAH" comments I left in FixedHeaderFinder.java

@kosak kosak force-pushed the kosak_fixed-width-columns branch from 5c1e76a to 6aee4ae Compare November 5, 2024 20:25
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your level of test coverage is impressive. My main concern is testing (or testing they are invalid) other ways CsvSpecs might be configured in relationship to hasFixedWidthColumns, along w/ documentation in those regards.

* See {@link Builder#fixedColumnWidths}.
*/
@Default
public List<Integer> fixedColumnWidths() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to add a check that if fixedColumnWidths is specified, hasFixedWidthColunms must be true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check added, and triggered in the test checkParametersIncompatibleWithDelimitedMode

*
* @param fixedColumnWidths The caller-specified widths of the columns.
*/
Builder fixedColumnWidths(Iterable<Integer> fixedColumnWidths);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If specified, do all of the column lengths need to be set? All but the last?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All but the last need to be set. The last is a placeholder. Made explicit in comments for CsvSpecs#fixedColumnWidths and also tested in CsvReaderTest#finalFixedColumnWidthEntryIsPlaceholder

Comment on lines +120 to +123
/**
* True if the input is organized into fixed width columns rather than delimited by a delimiter.
*/
Builder hasFixedWidthColumns(boolean hasFixedWidthColumns);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we expect will happen if ignoreSurroundingSpaces() == false? Is that a legit configuration? Should we disallow it?

What about quote / trim behavior? I'm assuming that " (or quote char) inside the fixed width are included in the resulting string. In which case, maybe we want to assert that if we are in fixed width mode, quote() should not be set (or, at least, not set differently than the default).

Mainly, I wonder if we need to consider the scope of CsvSpecs wrt hasFixedWidthColumns; I'm happy to err on the side of being more strict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoreSurroundingSpaces() == false is legit, though perhaps not very useful. Now tested in CsvReaderTest#fixedColumnsMayIncludeOrExcludeSurroundingSpaces

quote / trim / etc now validated for only being set in the right mode. See CsvReaderTest#checkParametersIncompatibleWithFixedWidthMode and its partner CsvReaderTest#checkParametersIncompatibleWithDelimitedMode

+ "beta Not Active 4y235d kubernetes.io/metadata.name=beta\n";

final CsvSpecs specs = defaultCsvBuilder().hasFixedWidthColumns(true)
.ignoreSurroundingSpaces(true).build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoreSurroundingSpaces defaults to true; is false a legit value? If not, we should err if it is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed a bunch of places where I was redundantly setting it to true. Setting it to false does work but is not IMO very useful. Unit test in CsvReaderTest#fixedColumnsMayIncludeOrExcludeSurroundingSpaces

@kosak kosak force-pushed the kosak_fixed-width-columns branch from 6e762db to 1ff6927 Compare November 8, 2024 03:14
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Will approve once spotless is corrected and tests pass.

@devinrsmith devinrsmith merged commit f9293f1 into deephaven:main Nov 8, 2024
3 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 8, 2024
@kosak kosak deleted the kosak_fixed-width-columns branch November 8, 2024 23:14
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot handle aligned multi-space-delimited files
2 participants