-
Notifications
You must be signed in to change notification settings - Fork 24
/
NeoCSV.pillar
302 lines (224 loc) · 10 KB
/
NeoCSV.pillar
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
{
"metadata" : {
"title": "NeoCSV",
"attribution": "Sven Van Caekenberghe with Damien Cassou and Stéphane Ducasse"
},
"headingLevelOffset":2
}
@cha:neoCSV
CSV (Comma-Separated Values) is a popular data-interchange format. This chapter presents NeoCSV, a library to parse and export CSV files.
This chapter has originally been written by Sven Van Caekenberghe the author of NeoCSV and many other nicely designed Pharo libraries.
!NeoCSV
NeoCSV is an elegant and efficient standalone Pharo library to read (resp. write) CSV files converting to (resp. from) Pharo objects.
!!An Introduction to CSV
CSV is a lightweight text-based ''de facto'' standard for human-readable tabular data interchange. Essentially, the key characteristics are that CSV (or more generally, delimiter-separated text data):
-is text-based (ASCII, Latin1, Unicode);
-consists of records, 1 per line (any line ending convention);
-where ''records'' consist of ''fields'' separated by a ''delimiter'' (comma, tab, semicolon);
-where every record has the same number of fields; and
-where fields can be quoted should they contain separators or line endings.
@@note References: *http://en.wikipedia.org/wiki/Comma-separated_values*, *http://tools.ietf.org/html/rfc4180*
Note that there is not one single official standard specification.
!!Hands On NeoCSV
NeoCSV contains a reader (==NeoCSVReader==) and a writer (==NeoCSVWriter==) to parse and generate delimiter-separated text data to and
from Smalltalk objects. The goals of NeoCSV are:
-to be standalone (have no dependencies and little requirements);
-to be small, elegant and understandable;
-to be efficient (both in time and space); and
-to be flexible and non-intrusive.
To load NeoCSV, evaluate the following or use the Configuration Browser:
[[[
Gofer it
smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
configurationOf: 'NeoCSV';
loadStable.
]]]
To use either the reader or the writer, you instantiate them on a character stream and use standard stream access messages.
The first example reads a sequence of data separated by ==,== and containing line breaks. The reader produces arrays corresponding to the lines with the data (the ==withCRs== method converts backslashes to new lines).
[[[
(NeoCSVReader on: '1,2,3\4,5,6\7,8,9' withCRs readStream) upToEnd.
--> #(#('1' '2' '3') #('4' '5' '6') #('7' '8' '9'))
]]]
The second proceeds from the inverse: given a set of data as arrays it produces comma separated lines.
[[[
String streamContents: [ :stream |
(NeoCSVWriter on: stream)
nextPutAll: #( (x y z) (10 20 30) (40 50 60) (70 80 90) ) ].
-->
'"x","y","z"
"10","20","30"
"40","50","60"
"70","80","90"
'
]]]
!! More features
You can also configure the reader to give you the data but with the name of their column using the message ==namedColumnsConfigurations== as shown in the example below:
[[[
(NeoCSVReader on: 'foo bar baz
1 2 3
11 22 33' readStream)
separator: $ ;
namedColumnsConfiguration;
upToEnd.
-> an Array(a Dictionary(#bar->'2' #baz->'3' #foo->'1' )
a Dictionary(#bar->'22' #baz->'33' #foo->'11' ))
]]]
You obtain an array of dictionary.
Note here that the input what not separated by ==$,== but space so we set using ==separator:== that
the reader should split input at space. ==namedColumnsConfiguration== uses the header as keys and reads dictionaries instead of arrays.
Note also that instead of ==upToEnd== you can also iterate with ==do:== to get over the treated input.
!Generic Mode
NeoCSV can operate in generic mode without any further customization. While writing,
- record objects should respond to the ==do:== protocol,
- fields are always sent ==asString== and quoted, and
- CRLF line ending is used.
While reading,
- records become arrays,
- fields remain strings,
- any line ending is accepted, and
- both quoted and unquoted fields are allowed.
The standard delimiter is a comma character. Quoting is always done using a double quote character. A double quote character inside a field will be escaped by
repeating it. Field separators and line endings are allowed inside a quoted field. Any whitespace is significant.
!Customizing NeoCSVWriter
Any character can be used as field separator, for example:
[[[
neoCSVWriter separator: Character tab
]]]
or
[[[
neoCSVWriter separator: $;
]]]
Likewise, any of the three common line end conventions can be set. In the following example we set carriage return:
[[[
neoCSVWriter lineEndConvention: #cr
]]]
There are 3 mechanisms that the writer may use to write a field (in increasing order of efficiency):
;quoted
:converting it with ==asString== and quoting it (the default);
;raw
:converting it with ==asString== but not quoting it; and
;object
:not quoting it and using ==printOn:== directly on the output stream.
When disabling quoting, you have to be sure your values do not contain embedded separators or line endings. If you are writing arrays of numbers for example, this would be the fastest way to do it:
[[[
neoCSVWriter
fieldWriter: #object;
nextPutAll: #( (100 200 300) (400 500 600) (700 800 900) )
]]]
The ==fieldWriter== option applies to all fields.
!!Writing Objects
If your data is in the form of regular domain-level objects it would be wasteful to convert them to arrays just for writing them as CSV. NeoCSV has a non-intrusive option to map your domain object's fields: You add field specifications based on accessors. This is how you would write an array of ==Point==s.
[[[
String streamContents: [ :stream |
(NeoCSVWriter on: stream)
nextPut: #('x field' 'y field');
addFields: #(x y);
nextPutAll: { 1@2. 3@4. 5@6 } ].
-->
'"x field","y field"
"1","2"
"3","4"
"5","6"
'
]]]
Note how ==nextPut:== is used to first write the header (''i.e.'', the first line). After printing the header, the writer is customized: the messages ==addField:== and ==addFields:== arrange for the specified selectors (here ==x== and ==y==) to be sent on each incoming object to produce fields that will be written.
To change the writing behavior for a specific field, you have to use ==addQuotedField:==, ==addRawField:== and ==addObjectField:==.
To specify different field writers for an array (actually any subclass of ==SequenceableCollection==), you can use the ==first==, ==second==, ==third==, etc. methods as selectors:
[[[
String streamContents: [ :stream |
(NeoCSVWriter on: stream)
addFields: #(first third second);
nextPutAll: { 'acb' . 'dfe' . 'gih' }].
-->
'"a","b","c"
"d","e","f"
"g","h","i"
'
]]]
!Customizing NeoCSVReader
The parser is flexible and forgiving. Any line ending will do, quoted and non-quoted fields are allowed.
Any character can be used as field separator, for example:
[[[
neoCSVReader separator: Character tab
]]]
or
[[[
neoCSVReader separator: $;
]]]
NeoCSVReader will produce records that are instances of its ==recordClass==, which defaults to ==Array==. All fields are always read as Strings. If you want,
you can specify converters for each field, to convert them to integers, floats or any other object. Here is an example:
[[[
(NeoCSVReader on: '1,2.3,abc,2015/07/07' readStream)
separator: $,;
addIntegerField;
addFloatField;
addField;
addFieldConverter: [ :string | Date fromString: string ];
upToEnd.
--> an Array(an Array(1 2.3 'abc' 7 July 2015))
]]]
Here we specify 4 fields: an integer, a float, a string and a date field. Field conversions specified this way only work on indexable record classes, like
==Array==.
!!Ignoring Fields
While reading from a CSV file, you can ignore fields using ==addIgnoredField==. In the following example, the third field of each record is ignored:
[[[
| input |
(NeoCSVReader on: '1,2,a,3\1,2,b,3\1,2,c,3\' withCRs readStream)
addIntegerField;
addIntegerField;
addIgnoredField;
addIntegerField;
upToEnd
--> #(#(1 2 3) #(1 2 3) #(1 2 3))
]]]
Adding ignored field(s) requires adding field types on all other fields.
!!Creating Objects
In many cases you will probably want your data to be returned as one of your domain objects. It would be wasteful to first create arrays and then convert all those. NeoCSV has non-intrusive options to create instances of your own classes and to convert and set fields on them directly. This is done by specifying accessors and converters. Here is an example for reading ==Association==s of ==Float==s.
[[[
(NeoCSVReader on: '1.5,2.2\4.5,6\7.8,9.1' withCRs readStream)
recordClass: Association;
addFloatField: #key: ;
addFloatField: #value: ;
upToEnd.
--> {1.5->2.2. 4.5->6. 7.8->9.1}
]]]
For each field you have to give the mutating accessor to use. You might also want to pass a conversion block using ==addField:converter:==.
!!Reading many Objects
Handling large CSV files is possible with NeoCVS. In the following, we first create a large CSV file then read it partly (this takes a bit of time, be patient).
[[[
'paul.csv' asFileReference writeStreamDo: [ :file |
ZnBufferedWriteStream on: file do: [ :out | | writer |
writer := (NeoCSVWriter on: out).
writer writeHeader: { #Number. #Color. #Integer. #Boolean}.
1 to: 1e7 do: [ :each |
writer nextPut: {
each.
#(Red Green Blue) atRandom.
1e6 atRandom.
#(true false) atRandom } ] ] ].
]]]
Above code results in a 300Mb file:
[[[language=bash
$ ls -lah paul.csv
-rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
$ wc paul.csv
10000001 10000001 342781577 paul.csv
]]]
The following code selectively collects every record with a third field lower than 1000 (this takes a bit of time, be patient):
[[[
Array streamContents: [ :out |
'paul.csv' asFileReference readStreamDo: [ :in |
(NeoCSVReader on: (ZnBufferedReadStream on: in))
skipHeader;
addIntegerField;
addSymbolField;
addIntegerField;
addFieldConverter: [ :x | x = #true ];
do: [ :each |
each third < 1000
ifTrue: [ out nextPut: each ] ] ] ].
]]]
% LocalWords: accessor
% Local Variables:
% compile-command: "cd .. && ./pillar export --to=\"LaTeX by chapter\" NeoCSV/NeoCSV.pillar && bash pillarPostExport.sh"
% End: