Always make the last separators mandatory #2
Yes, your writeup is excellent. In practice, I see two additional issues that are related to your points. I'd like to keep this issue open and use it for discussion, because I'm 100% aiming to standardize and write a BNF and similar, and this repo is helping to find corner cases and gather advice.
What are your thoughts about these?
Your point (1) (about newlines) I think is more related to issue #3. (I'll reply to it there.)
I also believe that any "tabular format" will deal in 99.999% of the cases only with one table per file. Thus, in terms of specifications, there are two major choices to be made (which are most of the time conflicting):
That being said, I think in the case of USV these are the most sensible choices:
I'll try to tackle the third choice a bit (i.e. going back to the drawing board with groups / files). My assumption is that groups (and files) were meant to support multiple tables in the same spreadsheet, and multiple spreadsheets, respectively. However, USV currently misses one important feature of these, namely how to identify which group / file is which, i.e. table / spreadsheet titles. So perhaps one could rework how groups / files work by introducing some missing features, and perhaps by dropping the symmetry with units / records. For example (and this is not something I've thoroughly thought about), how about this new syntax:
Namely, files and groups are introduced by FS / GS, meanwhile records / units are joined (or in my #2 proposal terminated) by RS / US. Moreover a USV can contain either multiple files, multiple groups, or just records in an unnamed file; then a file can contain multiple groups, or just records in an unnamed group. The US and RS are reused by files and groups to denote the name and description. It's not as nice as the initial specification, but it does support (without ambiguity) the case of just records, just groups, files with just records, files with groups. Also this second proposal does suffer from the same truncation issue as described in #2, thus perhaps a group terminator and file terminator might be useful, as in:
I.e. two adjacent files would be joined by |
Lots of info below... I'm hoping I'm responding to each of your points because I very much appreciate your insights.
100% agree.
This must be a hard error i.e. the entire parse must be invalid. TODO: add this to the docs.
This must have a spec. The complement also must have a spec e.g. given a blank spreadsheet, what must the USV export be? TODO: spec this.
You're correct, this is an issue. How do these issues interact with similar data exchange formats?
I believe you're homing in on a tension between these options:
How about delegating this to a checksum that's out of scope of USV? Detecting unexpected file truncation, or other kinds of unexpected corruption, is a big scope increase (IMHO) for a simple format.
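A minimal sketch of what "delegating to a checksum" could look like, assuming the writer ships a SHA-256 digest next to the USV payload (for example in a sidecar file or a transport header). The function names `usv_digest` and `verify_usv` are hypothetical and not part of USV:

```python
import hashlib

def usv_digest(payload: bytes) -> str:
    """Hex SHA-256 digest of the raw USV bytes, computed by the writer."""
    return hashlib.sha256(payload).hexdigest()

def verify_usv(payload: bytes, expected_hex: str) -> bool:
    """True when the received bytes match the digest the writer published."""
    return usv_digest(payload) == expected_hex

# Hypothetical usage: the reader receives both the payload and the digest.
data = "value-1\x1fvalue-2".encode("utf-8")  # \x1f is an assumed unit separator
digest = usv_digest(data)
intact = verify_usv(data, digest)
truncated = verify_usv(data[:-1], digest)
```

With this arrangement the format itself stays unchanged, and truncation anywhere in the stream (including at a separator boundary) shows up as a digest mismatch.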
Yes, and real-world cases have come up somewhat often where the content is solely units, never records. In practice, the big ones so far have involved logging:
Worth mentioning: the real-world cases somewhat often use different dimensions, meaning each record has a different number of units. In other words, the data isn't an X,Y grid. A typical example is walking file systems, where directories (which are treated as USV records) can have different numbers of entries (which are treated as USV units).
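The file system case can be sketched roughly as follows. This is an illustration only: the function name `walk_to_usv` is hypothetical, the separator code points (U+001F and U+001E) are an assumption, and it uses in-between separators as in the current interpretation of USV:

```python
import os

US = "\x1f"  # assumed unit separator code point
RS = "\x1e"  # assumed record separator code point

def walk_to_usv(root: str) -> str:
    """One record per directory, one unit per directory entry."""
    records = []
    for _dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        entries = sorted(dirnames + filenames)
        records.append(US.join(entries))  # record length varies per directory
    return RS.join(records)
```

Note that an empty directory produces an empty record here, which is one of the corner cases where in-between separators and trailing separators become hard to tell apart.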
I agree with your choices.
1 is not viable because groups are a must-have in practice, in order to be able to export a typical database set of schemas, or a typical Excel spreadsheet set of folios. The real-world use case is importing/exporting all the data, which is then slurped into another system that knows enough about the data structure. For import/export where the other system doesn't know enough about the data, we use a typical Postgres database dump (including metadata, table layouts, etc.), or a zip file of Excel .xls files (including metadata, macros, etc.).
2: I want to think more about this.
3: Likewise.
I would describe that style of loop as using content "terminators" or "trailing separators", rather than content "splitters" a.k.a. "in-between separators". This feels akin to C-style null-terminated strings. My intuition is that there are large advantages to this approach, such as for streaming data: a stream source can output a unit and its terminator without needing to be aware of whether there's a next unit coming. What would you do to trigger the start-of-file or start-of-group or start-of-record or start-of-unit? OTOH, it's a totally different approach from CSV, TSV, and ASV, all of which use in-between separators.
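To make the terminator vs. in-between distinction concrete, here is a small sketch of two streaming writers for a flat sequence of units. The names `emit_terminated` and `emit_separated` are hypothetical, and the unit separator code point (U+001F) is an assumption:

```python
from typing import Iterable, Iterator

US = "\x1f"  # assumed unit separator code point

def emit_terminated(units: Iterable[str]) -> Iterator[str]:
    """Terminator style: every unit is followed by US, with no extra state."""
    for unit in units:
        yield unit + US

def emit_separated(units: Iterable[str]) -> Iterator[str]:
    """In-between style: the writer must track whether a unit was already emitted."""
    first = True
    for unit in units:
        if not first:
            yield US
        yield unit
        first = False

terminated = "".join(emit_terminated(["a", "b"]))
separated = "".join(emit_separated(["a", "b"]))
```

The terminated writer is stateless per item; the separated writer needs a flag (or a lookahead) at every nesting level, which is where the streaming advantage comes from.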
Yes. In practice this hasn't been an issue because the reader and writer both pre-agree on the overall data structure. In other words, USV hasn't yet aimed to reconstitute table names, nor even table column headers. For example, USV doesn't specify that a record's first row is the column names. Whenever we've needed to reconstitute the data structure, we've switched from USV to more-powerful formats (e.g. Postgres dump, Excel zips, etc. as above).
First of all, given there is no clear specification, I interpret the current USV as described in issue #1.
Thus, my suggestion is to make the unit / record / group / file separators mandatory at the end of each such block.
The reasons are:
value-1<US>value-2
is a valid USV; however, it might also be the prefix of a longer file that contained more records but was truncated. Making the last separators mandatory makes the truncation detectable. (Granted, the stream might get truncated at <FS> boundaries and go undetected, but given that most USV files would contain only one file, that would be an acceptable trade-off.)
And, if those are not convincing enough, here is a practical reason: it's simpler to write the formatter, because one can just print the last separator without checking whether this was indeed the last item in its block:
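A sketch of such a formatter in Python, assuming the data is given as nested lists (files of groups of records of units) and assuming the ASCII code points U+001C to U+001F for the four separators. The function name `format_usv` is hypothetical:

```python
# Assumed separator code points; the actual USV characters may differ.
US, RS, GS, FS = "\x1f", "\x1e", "\x1d", "\x1c"

def format_usv(files):
    """Serialize files -> groups -> records -> units with mandatory trailing separators."""
    out = []
    for file in files:
        for group in file:
            for record in group:
                for unit in record:
                    out.append(unit)
                    out.append(US)  # always emitted, even after the last unit
                out.append(RS)      # always emitted, even after the last record
            out.append(GS)          # always emitted, even after the last group
        out.append(FS)              # always emitted, even after the last file
    return "".join(out)

# One file, one group, two records of different lengths.
example = format_usv([[[["a", "b"], ["c"]]]])
```

No loop needs to know whether the current item is the last one in its block.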
(I'll leave to others to think about the implementation where the last separator is not mandatory.) :)
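The truncation argument can be illustrated with an equally small check. This is a sketch under two assumptions that the thread does not settle: that a complete stream always ends with the file separator (taken here to be U+001C), and that an empty stream counts as complete. The name `looks_truncated` is hypothetical:

```python
FS = "\x1c"  # assumed file separator code point

def looks_truncated(data: str) -> bool:
    """True when the stream cannot be complete under the mandatory-terminator rule."""
    # Assumption: an empty stream is treated as complete.
    return bool(data) and not data.endswith(FS)

one_file = "a\x1fb\x1f\x1e\x1d\x1c"
two_files = "a\x1f\x1e\x1d\x1c" + "b\x1f\x1e\x1d\x1c"

cut_mid_file = looks_truncated(one_file[:3])  # cut inside the first record
cut_at_fs = looks_truncated(two_files[:5])    # cut exactly after the first FS
```

The second case is the caveat mentioned above: a cut that lands exactly on an FS boundary still looks like a complete stream.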