Skip to content
Mikhail Mustakimov edited this page Jun 23, 2020 · 14 revisions

The basic architecture consists of four layers, which correspond with the package structure:

  • api; Facade for bean and row, used to have a single class for interacting the underlying layers
  • bean; layer responsible for translating rows to Beans.
  • row; layer responsible for translating tokens to lines and lines to headers and rows.
  • token; layer responsible for parsing the CSV file one symbol at a time and translating those to tokens and lines.

Furthermore, there are two auxiliary packages:

  • annotations; various annotations for instructing the bean, row and token classes.
  • report; structure for reporting back errors with references to the original CSV row

api

Design goal: a single Facade for interacting with the underlying layers, intuitive in usage and operation

CsvReader joins both BeanReader and LineReader, and their respective instructions classes in a single class. Developers making use of the Facade now only have to deal with one class to drive the process.

bean

Design goal: user-friendly (ie, low-maintenance for user) way of converting CSV rows into Beans

This is where Rows are converted to Beans. This process makes use of the original Mapping strategy, which is either column index or column name based:

  • column index mapping; used by default, executed by the ColumnIndexMapper. Maps found cells to Bean properties by their index of the column in the CSV file.
  • column name mapping; executed by the ColumnNameMapper. Maps found cells to Bean properties by the header name of the column in the CSV file.

All mappable cells are converted to the Bean property using PropertyEditors. Custom PropertyEditors can be used as well.

row

Design goal: collecting headers and rows, while maintaining links to the original line text for later reference

Lines are read in their bare form. When headers are used, the first read line will be converted to a Header, whereas the others are all Row instances. The layer takes into account the needs for jumping to the start line, and skipping empty or comment lines. Only valid lines are offered for conversion to a Header/Row.

Whenever a line is read in the row layer, it is stored together with information on its original CSV line. This information contains two important aspects:

  • start and end of columns
  • non-printable characters made printable

Users can directly read Rows, with their Header and extra information.

token

Design goal: most versatile and effective way of tokenizing CSV files, without forward reading, one symbol at a time

The tokenizer is made on a state machine. The state machine gets one symbol at the time and on the basis of its current state, it can determine its next state. The state machine is capable of saying whether tokens must be returned (ie, a cell has been read), or lines are finished. The states are fixes and cannot be configured. Most meaningful symbols (except SPACE and END-OF-FILE) can be configured.

report

Design goal: ability to trace back errors on a property to the original cell in the CSV line, even after the conversion process has taken place

When an error occurs, the exception can be fed with exact details on the position of the cell in the CSV line, meaning that lines can be printed showing precisely the location of the error.

annotations

Design goal: an intuitive way of defining parsing and mapping instructions directly associated with the Bean class and properties

Annotations have a big advantage of the declarative model, which is that the elements they describe are hard-coupled with the instruction. The declarative model on the other hand, is a soft-coupling, more prone to failure, such as name changes that go unnoticed.

Annotations and the declarative model are mirrored, so that everything that can be declared with annotations can also be declared programmatically and vice versa.

Design Choices

  • Tokenizer; why build the tokenizer the way you did?
  • v1.0.0; why not make a v1.0.0 immediately?
  • Spring; why use Spring?
  • Writer; why is there not a CsvWriter?
  • Architecture; why have the layering of CSV and not roll all into one layer?
  • Error feedback; what is your thing with error feedback?
  • Facade; why the Facade?

Tokenizer

v1.0.0

I believe version 1 must have seen some time in the trenches before it can be elevated to that position. That means people must have used it, it must have run stable in production software, and in general, it must have existed for some time. Software that is immediately released as version 1 is not to be trusted.

I am aware that this may hamper the attraction of CSVeed, but I think the community and the prestige of the software are better served ...

Spring

Writer

Architecture

Error feedback

Facade