We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations:
Raw data has yet to be processed, meaning it has not yet been manipulated by a human or a computer. Received or collected data could be in any number of formats, locations, etc. It could be in any of the forms listed above.
But "raw data" is a relative term: when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is "raw data".
For example, is "big data" "raw data"? How do we understand data that we have "scraped"?
Processing data puts it into a state more readily available for analysis and makes the data legible. For instance, it could be rendered as structured data. Structured data can also take many forms, e.g., a table.
Here are a few you're likely to come across, all representing the same data:
<Cats>
  <Cat>
    <firstName>Smally</firstName> <lastName>McTiny</lastName>
  </Cat>
  <Cat>
    <firstName>Kitty</firstName> <lastName>Kitty</lastName>
  </Cat>
  <Cat>
    <firstName>Foots</firstName> <lastName>Smith</lastName>
  </Cat>
  <Cat>
    <firstName>Tiger</firstName> <lastName>Jaws</lastName>
  </Cat>
</Cats>
{"Cats":[
{ "firstName":"Smally", "lastName":"McTiny" },
{ "firstName":"Kitty", "lastName":"Kitty" },
{ "firstName":"Foots", "lastName":"Smith" },
{ "firstName":"Tiger", "lastName":"Jaws" }
]}
First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
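One way to check that the three snippets above really carry the same records is to parse each with Python's standard library and compare the results. This is a minimal sketch, not a definitive method: the variable names (`xml_text`, `json_text`, `csv_text`), the helper functions, and the choice to normalize each record to a `(first, last)` tuple are assumptions made for illustration, and the CSV is assumed to use ordinary line breaks between records.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same four cats, written in the three formats shown above.
xml_text = """<Cats>
  <Cat><firstName>Smally</firstName><lastName>McTiny</lastName></Cat>
  <Cat><firstName>Kitty</firstName><lastName>Kitty</lastName></Cat>
  <Cat><firstName>Foots</firstName><lastName>Smith</lastName></Cat>
  <Cat><firstName>Tiger</firstName><lastName>Jaws</lastName></Cat>
</Cats>"""

json_text = """{"Cats":[
  { "firstName":"Smally", "lastName":"McTiny" },
  { "firstName":"Kitty", "lastName":"Kitty" },
  { "firstName":"Foots", "lastName":"Smith" },
  { "firstName":"Tiger", "lastName":"Jaws" }
]}"""

csv_text = "First Name,Last Name\nSmally,McTiny\nKitty,Kitty\nFoots,Smith\nTiger,Jaws\n"


def cats_from_xml(text):
    """Read each <Cat> element into a (first, last) tuple."""
    root = ET.fromstring(text)
    return [(cat.findtext("firstName"), cat.findtext("lastName"))
            for cat in root.findall("Cat")]


def cats_from_json(text):
    """Read each object in the "Cats" array into a (first, last) tuple."""
    return [(cat["firstName"], cat["lastName"])
            for cat in json.loads(text)["Cats"]]


def cats_from_csv(text):
    """Read each CSV row (after the header) into a (first, last) tuple."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["First Name"], row["Last Name"]) for row in reader]


# All three parsers should yield the same list of records.
same = cats_from_xml(xml_text) == cats_from_json(json_text) == cats_from_csv(csv_text)
print(same)
```

Note that the field names differ between formats ("firstName" versus "First Name"), so "the same data" here means the same values in the same order, not identical labels.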
A small detour to discuss (the ethics of?) data formats. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:
- Open this file in a text editor, and then in an app like Excel. This is a CSV, an open, text-only file format.
- Now do the same with this one. This is a proprietary format!
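The text-editor test above can be roughly approximated in code by asking whether a file's bytes decode cleanly as text. The sketch below is only an illustration under stated assumptions: `is_plain_text` is a hypothetical helper, UTF-8 is assumed as the encoding, and the zlib-compressed bytes are merely a stand-in for a binary file, not an actual proprietary format.

```python
import zlib


def is_plain_text(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 text."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A tiny CSV, stored as plain text bytes.
csv_bytes = "First Name,Last Name\nSmally,McTiny\n".encode("utf-8")

# The same content compressed: a stand-in for a binary file that a
# text editor would show as gibberish.
binary_bytes = zlib.compress(csv_bytes)

print(is_plain_text(csv_bytes))
print(is_plain_text(binary_bytes))
```

A clean decode does not by itself make a format open or sustainable; it only shows that the content is readable without special software.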
Sustainable formats are generally unencrypted and uncompressed, and they follow an open standard. A small list:
- ASCII
- PDF
- .csv
- FLAC
- TIFF
- JPEG2000
- MPEG-4
- XML
- RDF
- .txt
- .r
How do you decide which formats to store your data in when you move from 'raw' to 'processed/transformed' data? What are some of your considerations?
There are guidelines for processing data, sometimes referred to as Tidy Data [1]. One formulation of these rules:
- Each variable is in a column.
- Each observation is a row.
- Each value is a cell.
Look back at our example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data! How might that complicate the concept of Tidy Data?
{"Cats":[
  {"Calico":[
      { "firstName":"Smally", "lastName":"McTiny" },
      { "firstName":"Kitty", "lastName":"Kitty" }],
   "Tortoiseshell":[
      { "firstName":"Foots", "lastName":"Smith" },
      { "firstName":"Tiger", "lastName":"Jaws" }]}
]}
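One possible way to reconcile the nested example above with the Tidy Data guidelines is to flatten it, turning the grouping key into a column of its own so that each cat becomes one row. The sketch below is an assumption-laden illustration: the function name `flatten_cats` and the column name `coat` are hypothetical (the source does not name the dimension that "Calico" and "Tortoiseshell" belong to).

```python
import json

# The nested example from above, unchanged.
nested_text = """{"Cats":[
{"Calico":[
{ "firstName":"Smally", "lastName":"McTiny" },
{ "firstName":"Kitty", "lastName":"Kitty" }],
"Tortoiseshell":[
{ "firstName":"Foots", "lastName":"Smith" },
{ "firstName":"Tiger", "lastName":"Jaws" }]}]}"""


def flatten_cats(text):
    """Turn the nested groups into one row (dict) per cat.

    The group key ("Calico", "Tortoiseshell") becomes a value in a new
    column, here called "coat" (a name assumed for illustration).
    """
    rows = []
    for group in json.loads(text)["Cats"]:
        for coat, cats in group.items():
            for cat in cats:
                rows.append({
                    "coat": coat,
                    "firstName": cat["firstName"],
                    "lastName": cat["lastName"],
                })
    return rows


tidy_rows = flatten_cats(nested_text)
for row in tidy_rows:
    print(row)
```

After flattening, each variable (coat, first name, last name) is a column and each cat is a row, at the cost of repeating the coat value on every row.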
[1] Wickham, Hadley. "Tidy Data." Journal of Statistical Software 59, no. 10 (2014).