Stages of Data

We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations:

Raw

Raw data has yet to be processed: no human or computer has manipulated it. Received or collected data could be in any number of formats, locations, etc. It could take any of the forms listed above.

But "raw data" is a relative term: when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further. For that second person, the product is "raw data".

Discussion: Raw Data

For example, is "big data" "raw data"? How do we understand data that we have "scraped"?

Processed/Transformed

Processing data puts it into a state more readily available for analysis and makes it legible. For instance, it could be rendered as structured data, which can itself take many forms, e.g., a table.

Here are a few forms you're likely to come across, all representing the same data:

XML

<Cats> 
    <Cat> 
        <firstName>Smally</firstName> <lastName>McTiny</lastName> 
    </Cat> 
    <Cat> 
        <firstName>Kitty</firstName> <lastName>Kitty</lastName> 
    </Cat> 
    <Cat> 
        <firstName>Foots</firstName> <lastName>Smith</lastName> 
    </Cat> 
    <Cat> 
        <firstName>Tiger</firstName> <lastName>Jaws</lastName> 
    </Cat> 
</Cats>

JSON

{"Cats":[ 
    { "firstName":"Smally", "lastName":"McTiny" }, 
    { "firstName":"Kitty", "lastName":"Kitty" }, 
    { "firstName":"Foots", "lastName":"Smith" }, 
    { "firstName":"Tiger", "lastName":"Jaws" } 
]}

CSV

First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
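To check the claim that the three examples carry the same data, here is a minimal Python sketch that parses each one with the standard library and compares the results. The function names (`cats_from_xml`, `cats_from_json`, `cats_from_csv`) are made up for this illustration, and the CSV text ends each record with an ordinary line break.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

XML_TEXT = """<Cats>
    <Cat><firstName>Smally</firstName><lastName>McTiny</lastName></Cat>
    <Cat><firstName>Kitty</firstName><lastName>Kitty</lastName></Cat>
    <Cat><firstName>Foots</firstName><lastName>Smith</lastName></Cat>
    <Cat><firstName>Tiger</firstName><lastName>Jaws</lastName></Cat>
</Cats>"""

JSON_TEXT = """{"Cats":[
    { "firstName":"Smally", "lastName":"McTiny" },
    { "firstName":"Kitty", "lastName":"Kitty" },
    { "firstName":"Foots", "lastName":"Smith" },
    { "firstName":"Tiger", "lastName":"Jaws" }
]}"""

CSV_TEXT = """First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
"""


def cats_from_xml(text):
    # Each <Cat> element holds one cat; read its two child elements.
    root = ET.fromstring(text)
    return [(cat.findtext("firstName"), cat.findtext("lastName"))
            for cat in root.findall("Cat")]


def cats_from_json(text):
    # The top-level "Cats" key holds a list of objects, one per cat.
    return [(cat["firstName"], cat["lastName"])
            for cat in json.loads(text)["Cats"]]


def cats_from_csv(text):
    # The first CSV line is the header; every later line is one cat.
    reader = csv.DictReader(io.StringIO(text))
    return [(row["First Name"], row["Last Name"]) for row in reader]


if __name__ == "__main__":
    records = cats_from_xml(XML_TEXT)
    # All three parsers should yield the same list of (first, last) pairs.
    print(records == cats_from_json(JSON_TEXT) == cats_from_csv(CSV_TEXT))
```

Each parser reduces its format to the same list of name pairs, so the comparison depends only on the data, not on how each format spells its field names.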

The importance of using open data formats

A small detour to discuss (the ethics of?) data formats. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:

Open this file in a text editor, and then in an app like Excel. This is a CSV, an open, text-only file format.
Now do the same with this one. This is a proprietary format!
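The same experiment can be roughly approximated in code. The sketch below is only a heuristic, not a format detector: the hypothetical helper `looks_like_text` reports whether a run of bytes decodes as UTF-8 text, which is close to what a text editor needs in order to show something readable. The zlib-compressed bytes are an assumption standing in for an opaque binary file; they are not the file referred to above.

```python
import zlib


def looks_like_text(data: bytes) -> bool:
    """Heuristic: True if the bytes decode as UTF-8 and contain no NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A tiny CSV, exactly as it would be stored on disk.
csv_bytes = "First Name,Last Name\nSmally,McTiny\n".encode("utf-8")

# Stand-in for an opaque binary file: the same content, zlib-compressed.
opaque_bytes = zlib.compress(csv_bytes)

print(looks_like_text(csv_bytes))     # True
print(looks_like_text(opaque_bytes))  # False
```

The information in both byte strings is identical, but only the first can be read without knowing which tool produced it.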

Sustainable formats are generally unencrypted and uncompressed, and they follow an open standard. A small list:

ASCII
PDF
.csv
FLAC
TIFF
JPEG2000
MPEG-4
XML
RDF
.txt
.r

Discussion: Processed/Transformed

How do you decide the formats to store your data when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations?

Tidy Data

There are guidelines for processing data, sometimes referred to as Tidy Data.¹ One formulation of these rules:

Each variable is in a column.
Each observation is a row.
Each value is a cell.

Look back at our example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data! How might that complicate the concept of Tidy Data?

{"Cats":[
    {"Calico":[
        { "firstName":"Smally", "lastName":"McTiny" },
        { "firstName":"Kitty", "lastName":"Kitty" }],
    "Tortoiseshell":[
        { "firstName":"Foots", "lastName":"Smith" },
        { "firstName":"Tiger", "lastName":"Jaws" }]}]}
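One possible way to reconcile nested data like this with the Tidy Data guidelines is to flatten it, so that the grouping key becomes a variable of its own. The sketch below assumes the keys "Calico" and "Tortoiseshell" describe each cat's coat; the column name `coat` and the function name `tidy_rows` are invented for this illustration.

```python
import json

NESTED = """{"Cats":[
    {"Calico":[
        { "firstName":"Smally", "lastName":"McTiny" },
        { "firstName":"Kitty", "lastName":"Kitty" }],
     "Tortoiseshell":[
        { "firstName":"Foots", "lastName":"Smith" },
        { "firstName":"Tiger", "lastName":"Jaws" }]}]}"""


def tidy_rows(text):
    """Flatten the nested structure into one dict per cat (one row per observation)."""
    rows = []
    for group in json.loads(text)["Cats"]:
        # Each key in a group is a coat name; its value is a list of cats.
        for coat, cats in group.items():
            for cat in cats:
                rows.append({
                    "coat": coat,  # assumed column name for the grouping key
                    "firstName": cat["firstName"],
                    "lastName": cat["lastName"],
                })
    return rows


for row in tidy_rows(NESTED):
    print(row)
```

After flattening, every cat is one row and `coat`, `firstName`, and `lastName` are the three columns, at the cost of repeating the coat value on each row.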

¹Wickham, Hadley. "Tidy Data". Journal of Statistical Software.
