What is a Data Frame?
The concept of a data frame comes from the world of statistical software used in empirical research; it generally refers to "tabular" data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns). Alternatively, each row may be treated as a single observation of multiple "variables". Either way, all rows share one data type and each column has a single data type of its own: the row ("record") datatype may be heterogeneous (a tuple of different types), while each column's datatype must be homogeneous. Data frames usually contain some metadata in addition to data, for example column and row names.
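The shape described above can be sketched in a few lines of plain Python. This is only an illustration, not any library's implementation: the toy data and the names `columns`, `row_names`, `check_frame`, and `get_row` are all invented for the example. Each column is a list of one type, and a row is read off as a tuple of mixed types.

```python
# Toy data (hypothetical): one list per column, all of equal length.
columns = {
    "name": ["Ann", "Bob", "Cy"],         # column of str
    "age": [34, 51, 28],                  # column of int
    "height_cm": [170.5, 182.0, 165.2],   # column of float
}
# Metadata: row names, kept alongside the data.
row_names = ["r1", "r2", "r3"]


def check_frame(cols):
    """Raise if columns differ in length or a column mixes types."""
    lengths = {len(values) for values in cols.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    for name, values in cols.items():
        if len({type(v) for v in values}) > 1:
            raise TypeError(f"column {name!r} is not homogeneous")


def get_row(cols, names, label):
    """Return the row with the given row name as a (heterogeneous) tuple."""
    i = names.index(label)
    return tuple(values[i] for values in cols.values())


check_frame(columns)
record = get_row(columns, row_names, "r2")
print(record)                              # ('Bob', 51, 182.0)
print([type(v).__name__ for v in record])  # ['str', 'int', 'float']
```

Storing the data column by column makes the homogeneity constraint easy to check, while the row tuple shows where the heterogeneity lives.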
Data frame APIs usually support more or less elaborate methods for slicing and dicing the data, such as "selecting" rows, columns, and cells by name or by number; filtering out rows; "recoding" column and row names; normalizing data (e.g. converting units of measure); adding new columns (e.g. summing some fields); and much, much more!
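To make these operations concrete, here is a rough sketch over a dict-of-lists frame in plain Python. The function names (`select`, `filter_rows`, `rename`, `add_column`) and the sample data are assumptions made up for this example; real data frame libraries offer far richer versions of each.

```python
def n_rows(frame):
    """Number of rows, taken from the first column (0 for an empty frame)."""
    return len(next(iter(frame.values()), []))


def row_at(frame, i):
    """View row i as a dict keyed by column name."""
    return {name: values[i] for name, values in frame.items()}


def select(frame, names):
    """Keep only the named columns, in the given order."""
    return {name: list(frame[name]) for name in names}


def filter_rows(frame, predicate):
    """Keep only the rows for which predicate(row_dict) is true."""
    keep = [i for i in range(n_rows(frame)) if predicate(row_at(frame, i))]
    return {name: [values[i] for i in keep] for name, values in frame.items()}


def rename(frame, mapping):
    """Recode column names; names missing from mapping are left alone."""
    return {mapping.get(name, name): list(values) for name, values in frame.items()}


def add_column(frame, name, fn):
    """Return a copy with a new column computed from each row."""
    out = {k: list(v) for k, v in frame.items()}
    out[name] = [fn(row_at(frame, i)) for i in range(n_rows(frame))]
    return out


# Hypothetical sample data.
frame = {"item": ["a", "b", "c"], "price_cents": [250, 1200, 75], "qty": [2, 1, 10]}

with_total = add_column(frame, "total_cents", lambda r: r["price_cents"] * r["qty"])
large = filter_rows(with_total, lambda r: r["total_cents"] >= 750)
result = select(rename(large, {"item": "sku"}), ["sku", "total_cents"])
print(result)  # {'sku': ['b', 'c'], 'total_cents': [1200, 750]}
```

Each function returns a new frame rather than modifying its input, which keeps the steps easy to chain; that is a design choice of this sketch, not a requirement of data frames in general.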
Statistical data is often - usually, in fact - messy. To be useful, a data frame API must provide means for dealing with incoming data that violates the (usually implicit) integrity constraints of the row and column types. Obvious examples include typos ("Maale" instead of "Male") and range violations (e.g. an Age outside the permitted range 17 < Age < 65). Missing data is also common, and may be represented in a variety of ways, such as using "NA" or some value that would not normally occur, such as 9999 for a missing Age datum. So a critically important feature of data frames is explicit management of missing data. For example, R supports "NA" as a special missing-value indicator, and many of its statistical functions support an "na.rm" parameter, which tells the function to ignore missing values when computing its result.
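A small plain-Python sketch of these ideas follows. It uses `None` as a stand-in for "NA", and the helper names (`recode_sentinel`, `out_of_range`, `mean`) and the `na_rm` keyword are assumptions modeled loosely on the R behaviour described above, not R's actual implementation. In particular, returning NA for an empty input is a simplification chosen for this sketch.

```python
NA = None  # stand-in for a missing value in this sketch


def recode_sentinel(values, sentinel):
    """Replace a sentinel code such as 9999 with an explicit NA."""
    return [NA if v == sentinel else v for v in values]


def out_of_range(values, low, high):
    """Indices of non-missing values that violate low < value < high."""
    return [i for i, v in enumerate(values)
            if v is not NA and not (low < v < high)]


def mean(values, na_rm=False):
    """Mean that propagates NA unless na_rm is true."""
    if na_rm:
        values = [v for v in values if v is not NA]
    elif any(v is NA for v in values):
        return NA
    if not values:
        return NA  # simplification: nothing left to average
    return sum(values) / len(values)


# Hypothetical Age column: 9999 encodes "missing", 70 breaks 17 < Age < 65.
ages = [25, 9999, 40, 70, NA]

clean = recode_sentinel(ages, 9999)
print(clean)                        # [25, None, 40, 70, None]
print(out_of_range(clean, 17, 65))  # [3]
print(mean(clean))                  # None
print(mean(clean, na_rm=True))      # 45.0
```

Recoding the sentinel first matters: left in place, 9999 would be reported as a range violation and would badly skew the mean.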
(TODO: datasets vs dataframes. Some statistical software uses the term "dataset" where R (and others, presumably on R's example) use "dataframe". Sometimes "dataset" refers to data stored in a vendor's proprietary data format (Stata). So it seems better to reserve "dataset" to refer to actual files or collections of data, and "dataframe" for the datatype used to represent such data. Example: Definition of a SAS Data Set)
Dataframes in other languages:
- R: dataframe is a quasi-builtin data type in R. Technically, it is not a primitive; the language definition mentions "Data frame objects" only in passing: "A data frame is a list of vectors, factors, and/or matrices all having the same length (number of rows in the case of matrices). In addition, a data frame generally has a names attribute labeling the variables and a row.names attribute for labeling the cases." Vectors, lists, and "factors" are primitive; matrices and dataframes are not. But in practice the dataframe is central to R. A web search will produce many tutorials on working with dataframes, e.g.
  - Data Frames
  - R Programming/Working with data frames
  - R Tutorial: Data Frame
- Julia: DataFrames.jl
- Python
  - Pandas: data analysis toolkit
  - pydataframe
- Go: package dataframe
- Octave: Dataframe "Data manipulation toolbox similar to R data.frame"
- F#: Deedle "for Data Frame and Time Series programming with F# and C#...Data Frame programming is fast becoming the "standard" way of doing multi-dimensional and statistical data processing in systems such as R, Python and now .NET."
- ArcGIS: DataFrame