title: Data Literacies

cover title: Data Literacies

description: What is data? What counts as data? What are the ethical implications when working with data? How can we manage our data? These are questions we will explore throughout the workshop. Data is foundational to nearly all digital projects and often helps us to understand and express our ideas and narratives. Hence, in order to do digital work, we should know how data is captured, constructed, and manipulated. We will also engage with the ethical dimensions of what it means to work with data, from collection to visualization to management.
In this workshop we will discuss the basics of research data in terms of material, transformation, and presentation. We will also discuss the ethical issues that arise in data collection, cleaning, and representation. Because everyone approaches and understands data and ethics differently, this workshop also includes multiple sites for discussion to help us think through what data literacies mean within our projects and their broader applications.
The quotes below offer a variety of perspectives on research data from different stakeholders. These different approaches are included to suggest that there is no singular, definitive definition of research data: what counts depends on multiple factors, including the considerations of your project.
University
Material or information on which an argument, theory, test or hypothesis, or another research output is based.
— Queensland University of Technology. Manual of Procedures and Policies. Section 2.8.3.
Digital Project Management
What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models.
Government Institution
Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.
Data Science
The short answer is that we can’t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions.
Broadly, research data can be understood as the materials or information necessary to come to your conclusion; what those materials and information are depends on your project.
There are many ways to represent data, just as there are many sources of data. What do you, or can you, count as data? Here's a small list of possible digital objects acquired and generated during research.
Form | Format | Common file extensions
---|---|---
Images | TIFF (Tagged Image File Format) | `.tiff`, `.tif`
 | JPEG2000 | `.jp2`, `.jpf`, `.jpx`
 | PNG (Portable Network Graphics) | `.png`
Text | ASCII (American Standard Code for Information Interchange) | `.ascii`, `.dat`, `.txt`
 | PDF (Portable Document Format) | `.pdf`
 | CSV (Comma-Separated Values) | `.csv`
Audio | FLAC (Free Lossless Audio Codec) | `.flac`
 | Ogg | `.ogg`
Video | MPEG-4 | `.mp4`
Others | XML (Extensible Markup Language) | `.xml`
 | JSON (JavaScript Object Notation) | `.json`
 | STL (STereoLithography file format, used in 3D modeling) | `.stl`

For a list of file formats, consider the Library of Congress' list of Sustainability of Digital Formats.
Open data formats are usually available to anyone free of charge and allow for easy reusability. Proprietary formats often hold copyrights, patents, or have other restrictions placed on them, and are dependent on (expensive) licensed software. If the licensed software ceases to support its proprietary format or the format becomes obsolete, you may be stuck with a file format that cannot be easily opened or (re)used (e.g. .mac). For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats.
An illustration:
- A screenshot of the moSmall dataset as a CSV file.
- A screenshot of the moSmall dataset as an Excel file (.xlsx). Unlike the previous image, this is a proprietary format.
Sustainable formats are generally unencrypted, uncompressed, and follow an open standard.
Research data can be defined as: (select all that apply)
- materials or information necessary to come to my conclusion.*
- the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.*
- method of collection and analysis.
- objective and error-free.

Below you will find the front matter pages of two distinct digital projects. As you inspect the information present in each image, consider these questions:
- What are some forms of data used in the project?
- What are some forms of data outputted by the project?
- Where was the data retrieved from to complete the project?
Human Computers at NASA is an archival project that "seeks to shed light on the buried stories of African American women with math and science degrees who began working at NACA (now NASA) in 1943 in secret, segregated facilities."
From the image, we can deduce that newspaper articles (digital copies of text) and photographs (digital copies of images) were used to compile this archive. Noticing the highlighted name in the news article, the data may be outputted as searchable text, a searchable database, and/or searchable images. The data was most likely retrieved from a database and/or non-digital field notes. This is the [data source page](https://omeka.macalester.edu/humancomputerproject/items/browse) for Human Computers At NASA.

Listen for the Iraqis in NYC! is an audio community mapping project that seeks to locate the Iraqi population in NYC using their own voices.
From the image, we can deduce that audio recordings of participants and a map (geospatial data) were used to compile this project. Given the details in the text on the right of the screen, we learn that the researcher will provide a map (geospatial data) and testaments (audio files) for us to peruse. The researcher has gathered digital field notes in the form of audio files from participants through a survey. The Call for Participants for Listen for the Iraqis in NYC! can be found [here](https://docs.google.com/document/d/1G8RxmEILImlW4O5LgRQ5I3e3JOlg0Sg6b7dM2CmrvEQ/edit).

The Institutional Review Board (IRB) is a floor for ethical responsibility at your university, instituted after public outrage over horrific, unethical research studies performed on people. A prime example of these grotesque studies is the Tuskegee Syphilis Study (1932-1972).
Born from concerns about the ethical choices made in biomedical and behavioral research, IRB compliance is not broadly applicable. This leaves holes in institutional ethical regulations and requires researchers in other fields, such as the social sciences, to find other ethical frameworks or devise field-specific ethical considerations.
Usually, IRB review is required when ALL of the criteria below are met:
- The investigator is conducting research or clinical investigation,
- The proposed research or clinical investigation involves human subjects, and
- Your university or research institution is engaged in the research or clinical investigation involving human subjects.
An IRB is an institutional compliance mechanism that may not consider other ethical impacts. As we move forward in this workshop we will consider data and digital project ethics beyond compliance.
Your oral history project does the following; do you need a CUNY HRPP/IRB?

- Open-ended interviews that ONLY document a specific historical event or the experiences of individuals, without intent to draw conclusions or generalize findings.
  HRPP/IRB Required? No
- Systematic investigations involving open-ended interviews that are designed to develop or contribute to generalizable knowledge (e.g., designed to draw conclusions, inform policy, or generalize findings).
  HRPP/IRB Required? Yes
- Creation of archives for the purpose of providing a resource for others to do research. The intent of the archive is to create a repository of information for other investigators to conduct research.
  HRPP/IRB Required? Yes

For guidance and more examples, see The CUNY Human Research Protection Program (HRPP), "CUNY HRPP Guidance: When is CUNY HRPP or IRB Review Required?"
We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations. Stages of data typically consist of:
collection of "raw" data
We start with formulating a research question(s) or hypotheses and set up a project to answer our question(s).
- E.g. What proportion of the artwork collected and/or hosted in the Met are by non cis-gender men artists and also in public domain?
processing and/or transforming data
In the process of setting up the project, we make decisions on what kind of data we think can help us to answer the question.
- E.g. We may retrieve the data from the Met's open access dataset. We will need to look at what variables exist in the dataset to find out whether we can filter by gender and which variables correspond to copyrights. (Note: if the file opens as a web page, you will need to use your machine's "Save As" option to save it as a .csv file to view it in tabular form.)
cleaning
After collecting our data we then consider and make decisions in the processes of cleaning.
- E.g. We have to transform some of the gender values and decide what to do with the missing fields.
analysis
We then run our preliminary analysis of the data.
- E.g. We can run an analysis of the subset of non cis-gender men and public domain media objects against the total number of media objects to find out the proportion.
visualization
At the end of our analysis, a decision is then made about how we would present the data and its analysis.
- E.g. We can present the result in a pie chart or a bar graph.
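To make these stages concrete, here is a minimal sketch in Python with pandas (assuming pandas and matplotlib are installed), using the `moSmall.csv` sample and its `Artist Gender` and `Is Public Domain` columns, which we explore later in this workshop:

```python
import pandas as pd

# Collection: load the 1% sample of the Met's open access dataset.
df = pd.read_csv("moSmall.csv")

# Cleaning: keep missing gender values visible as their own category
# rather than silently dropping those rows.
df["Artist Gender"] = df["Artist Gender"].fillna("Unrecorded")

# Analysis: the proportion of public-domain objects whose artists are
# explicitly marked "Female" -- the only gender label this dataset records.
# (Assumes the "Is Public Domain" column loads as booleans.)
subset = df[(df["Artist Gender"] == "Female") & df["Is Public Domain"]]
print(f"{len(subset) / len(df):.1%} of sampled objects match")

# Visualization: a quick bar chart of gender counts.
df["Artist Gender"].value_counts().plot(kind="bar")
```

Each line above encodes a decision (what counts as missing, what proxies for gender, which chart to use), which is exactly the point of the next paragraph.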
There is no one way to go through the stages. For example, we could do a preliminary analysis first, such as running a correlation of variables, to explore what is missing before we begin the process of cleaning. Often, we also end up doing multiple iterations of cleaning and analysis, making decisions and choices to collapse particular variables or remove them entirely at each iteration. Keeping clear documentation of our process ensures that we are accountable to the data we have collected/are using and that our results can be replicated and reproduced if others choose to work on our "raw" data. While making these decisions seems innocuous, there are ethical considerations, beyond the institution, and impacts we must evaluate in the process.
As we learn to manipulate data, we will consider our ethical obligations beyond institutional compliance such as an IRB. We will think of ethics as the moral principles that an individual aims to follow in practice to the best of their ability, research, and foresight. Using this definition of ethics, we then consider ethics as situated.
Situated ethics refers to the notion that a person's understandings of and commitments to ethics or morality are greatly linked to their own experiences, positionalities, and political orientations, as well as the particular context in which that person is putting such ethics into practice (Helen Simons and Robin Usher, Situated Ethics in Educational Research, 2000).
Thinking through how ethical ideas and practices, or lack thereof, are situated may prompt questions such as: How is data retrieved? By whom? For whom? From where? Why?
In the Command Line workshop you learned about the history of the computer and considered the questions: How were computers developed? By whom? Where? Why?

Annette Markham, in "OKCupid data release fiasco: It’s time to rethink ethics education" (2016), asserts that ethical digital research is a methodology dependent on reflection, awareness of the debates and concerns in our respective fields, and accountability for the choices we make at each stage of our research. Thus, given the precarious nature of digital research and data, we need to use a "what if" approach that will help us evaluate "the possible or probable impact, rather than the prevention of the impact." This "impact approach" helps us expand our ethical imagination and consider ethics beyond prescriptions and compliance.
Drawing from Markham (2016), we will focus on three levels of impact:
- Direct impacts on people
- Ramifications of (re)producing categories
- Social, political and economic effects.
Additionally, this workshop will address the range of impact, or the range of accessibility to our work:
- to people with disabilities,
- to people in different countries or who speak different languages, and
- in terms of cost and proprietary accessibility.
Throughout the workshop we will refer to the impacts by number for quick reference.
Think of the following scenario and the possible considerations and impacts for working with data.
A graduate student decides to analyze a data set they collected through surveying fellow graduate students. The survey asks students to denote their graduate level, current job, gross income, and housing status. The graduate student hopes to analyze the survey and present their findings at a student council meeting as part of the council's attempt to persuade the administration to provide more funding to the graduate students. The student learns that they can analyze the data and create a visual using ChatGPT. For the sake of time, they decide to use ChatGPT to analyze and visualize the data set. What ethical considerations should the student evaluate? How might using ChatGPT impact the students surveyed?
Using a large language model application such as ChatGPT to analyze personal information collected by the survey can directly impact the "human subjects" and reproduce categories and information that can be harmful to the participants. The data inputted into ChatGPT is stored and used to produce outputs for other prompts, unbeknownst to the graduate student researcher or the participants. Considering these impacts, it is better ethical practice for the graduate student to use a tool that is private and secure for their analysis and visualization. The reveal above is only one of the possible considerations; there might be many others to evaluate.

"Raw" data is data that has yet to be processed, meaning it has yet to be manipulated by a human or computer. Received or collected data could be in any number of formats, locations, etc. It could be in any of the forms listed in the Forms of Data section.
But "raw" data is a relative term, inasmuch as when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is "raw" data. For example, we may consider the General Social Survey data to be "raw" as it will require us to filter out missing entries and collapse variables or fields before we can run our analysis. A researcher who participated in the creation of this survey may not consider the version on the site as "raw" because the "raw" version is the physical paper copies of the file. As you can see, this consideration of what is "raw" is non-definitive and is dependent on the project you are working on and the narrative you want to tell with the results.
If you are interested in further exploration and discussion of the ethics of "raw" data, please consider reading Drucker's article which has made useful distinctions between "data" (understood as given) and "capta" (taken or "captured") that also troubles the distinction between "raw" and "processed" data.
"At the most basic level of an impact approach, we might ask how our methods of data collection impact humans, directly. If one is interviewing, or the data is visibly connected to a person, this is easy to see. But a distance principle might help us recognize that when the data is very distant from where it originated, it can seem disconnected from persons, or what some regulators call ‘human subjects." (Annette Markham, 2016,emphasis added)
This brings us to several questions:
- What counts as "human"? What data should be off limits?
- How do we account for personhood?
- What is the distance principle? How does it impact the data we decide to collect?
- What is "public" data?
The definitions of terms such as "human" and "public" are ambiguous. They are dependent on the forms of data, the context of the data, and the relationship of the data to the source and the researcher.
We return to our data set Research_Data_DRI24.csv, which is now on a public site designed to be widely accessible. Would we consider this data set "public" data, given our understanding of the general term "public"?
For guidelines and working definitions of "human", "public", and "personhood" see the 2012 Ethical Decision-Making and Internet Research report by the AoIR Ethics Working Committee.
As we think about data collection, we should also consider the labor involved in the process. Many researchers rely on Amazon Mechanical Turk (sometimes also referred to as MTurk) for data collection, often paying less than minimum wage for the task. These workers are often assumed to be retired, bored, and participating in online gig work for fun or to kill time. While this may be true for some, more than half of those surveyed in a Pew Research study cite that the income from this work is essential or important. Those who do view the income from this work as essential or important are also mostly from underserved communities.
In addition to being mindful of paying a fair wage to the workers on such platforms, this kind of working environment also raises further considerations about the data that is collected. For instance, to get close to minimum wage, workers cannot afford to spend much time on each task. Thinking through these circumstances, how do you think they impact the data we collect?
For a deeper discussion on data and labor, consider Catherine D'Ignazio and Lauren Klein's chapter Show Your Work in Data Feminism.
- The stages of data follow a single-iteration process, i.e. there is a fixed progression from data collection to visualization.
- Which of the following statements are true for "raw" data: (select all that apply)
If you have not done so, please download `moSmall.csv` and open it from your local computer/laptop. As the original file has about 500,000 entries, we've taken a random sample of 1% of the original dataset. In this case, would you consider this file to be a "raw" dataset?
Processing data puts it into a state more readily available for analysis and makes it legible. For instance, it could be rendered as structured data. Structured data consists of clearly defined data types with patterns that make them easily searchable, while unstructured data is "everything else." Unstructured data could be an open-source book from Project Gutenberg, while structured data could be a CSV file such as our dataset.
Here is a reminder of the list of structured forms and formats we reviewed earlier in the workshop:
- Spreadsheets (e.g. `.xlsx`, `.numbers`, `.csv`)
- Audio (e.g. `.mp3`, `.wav`, `.aac`)
- Video (e.g. `.mov`, `.mp4`)
- Computer Aided Design/CAD (`.cad`)
- Databases (e.g. `.sql`)
- Geographic Information Systems (GIS) and spatial data (e.g. `.shp`, `.dbf`, `.shx`)
- Digital copies of images (e.g. `.png`, `.jpeg`, `.tiff`)
- Web files (e.g. `.html`, `.asp`, `.php`)
- Matlab files & 3D Models (e.g. `.stl`, `.dae`, `.3ds`)
- Metadata & Paradata (e.g. `.xml`, `.json`)
There are different guidelines to the processing of data, one of which is the Tidy Data format, which follows these rules in structuring data:
- Each variable is in a column.
- Each observation is a row.
- Each value is a cell.
Look at this example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data (like the `JSON` structure below). How might that complicate the concept of Tidy Data?
{
"Cats": [
{
"Calico": [
{
"firstName": "Smally",
"lastName":"McTiny"
},
{
"firstName": "Kitty",
"lastName": "Kitty"
}
],
"Tortoiseshell": [
{
"firstName": "Foots",
"lastName":"Smith"
},
{
"firstName": "Tiger",
"lastName":"Jaws"
}
]
}
]
}
How would you convert this nested data set into a tidy data structure?
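One possible answer, sketched in Python with pandas: flatten the nesting so that each cat is one row, and the coat pattern (currently a structural key) becomes an ordinary variable. The column name `coat` is our own choice, not part of the original data:

```python
import pandas as pd

# The nested cats structure from above, as a Python literal.
nested = {
    "Cats": [
        {
            "Calico": [
                {"firstName": "Smally", "lastName": "McTiny"},
                {"firstName": "Kitty", "lastName": "Kitty"},
            ],
            "Tortoiseshell": [
                {"firstName": "Foots", "lastName": "Smith"},
                {"firstName": "Tiger", "lastName": "Jaws"},
            ],
        }
    ]
}

# Tidy version: one row per cat, with the coat pattern recorded as a
# column of its own rather than as a key in the structure.
rows = [
    {"coat": coat, **cat}
    for group in nested["Cats"]
    for coat, cats in group.items()
    for cat in cats
]
print(pd.DataFrame(rows))
```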
While tidy data is a really popular method of structuring and organizing data, it is not the only way to do so. Depending on the type of data you have, it is also not always the best way to structure data.
But not all rectangular data is tidy. Here is an example of how tabular (or rectangular) data can be transformed into more tidy data.
What new variables were created?
What does each of the three tables represent?
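The three-table image is not reproduced here, but as a minimal illustration of the same idea, here is a hypothetical wide table reshaped with pandas' `melt`, which moves a variable hidden in the column headers into its own column:

```python
import pandas as pd

# A hypothetical untidy table: each year's weight lives in its own
# column, so the variable "year" is hidden in the column headers.
wide = pd.DataFrame({
    "cat": ["Smally", "Foots"],
    "weight_2019": [3.1, 4.5],
    "weight_2020": [3.4, 4.8],
})

# melt() turns the headers into values of a new "year" variable,
# giving one observation (cat, year, weight) per row.
tidy = wide.melt(id_vars="cat", var_name="year", value_name="weight")
tidy["year"] = tidy["year"].str.replace("weight_", "", regex=False).astype(int)
print(tidy)
```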
Before beginning your data collection, manipulation, and transformation, a good practice is to determine your file naming conventions. How many times have you named something `XXX_FinalFINALFINAL.pdf`, or had difficulty searching for the version of the file that contained all those good ideas that were edited out of the `XXX_FinalFINALFINALFINAL.pdf` version? While tools like version control with git can be helpful, we can also begin by setting up file naming conventions that help us succeed! Here's an example from Stanford that demonstrates the problems of badly named files in our projects.

For example, The Graduate Center's Data Management guide suggests that top-level folders (such as your main project folder) should include your project title, a unique identifier, and the date (year) of your project (e.g. `dataliteracies_XYZ_2020`). Your sub-folders and individual files should follow a similar system, with an identifiable activity or project in the file name (e.g. a sub-folder of the project: `sections_xyz_2020`, a file in the project: `lessons_XYZ_2020.doc`).
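As a toy illustration of that convention, a hypothetical Python helper (the function name and fields are our own invention) might assemble such names automatically:

```python
from datetime import date

def conventional_name(project: str, identifier: str, ext: str) -> str:
    """Build a file name following the project_identifier_year convention."""
    return f"{project}_{identifier}_{date.today().year}.{ext}"

# e.g. "dataliteracies_XYZ_2020.doc" (the year depends on today's date).
print(conventional_name("dataliteracies", "XYZ", "doc"))
```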
For a thorough understanding of the impact of data management on our research and the ethical quandaries that may arise from poor data management, please head over to Stephen Zweibel's interactive website Data Management.
"At another level, we can ask how our methods of organizing data, analytical interpretations, or findings as shared datasets are being used—or might be used—to build definitional categories or to profile particular groups in ways that could impact livelihoods or lives. Are we contributing positive or negative categorizations?" (Annette Markham, 2016,emphasis added)
For ethical considerations of the impact of our knowledge production, it can be helpful to think through the concepts of Gramsci's hegemony, Foucault's discourse, and Hall's "policing the crisis". An example of the politics of knowledge production can be found in Julia Angwin and Jeff Larson's 2016 article, "Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say."
Decisions on the categories and boundaries scholars use shape our:
- Datasets
- Catalogs
- Maps
- Algorithms
A small detour to revisit data formats. Recall that open data formats are usually available to anyone free of charge, while proprietary formats are dependent on (expensive) licensed software and may become unreadable if that software ceases to support them. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:
- Open this file in a text editor (e.g. Visual Studio Code, TextEdit (macOS), NotePad (Windows)), and then in an app like Excel. This is a CSV, an open, text-only file format. To save the file onto your local computer, right click on `Research_Data_DRI24.csv` and click `Save Link As` to download the file (it's the same Research_Data_DRI24.csv from above!)
- Now do the same with this Excel file. Unlike the previous, this is a proprietary format!
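You can also verify the difference programmatically. A small Python sketch, assuming both files sit in your working directory (the `.xlsx` file name here is hypothetical): the CSV reads as plain text, while the Excel file begins with a ZIP signature rather than readable characters:

```python
# The CSV is plain text: its first lines are readable as-is.
with open("Research_Data_DRI24.csv", encoding="utf-8") as f:
    for _ in range(3):
        print(f.readline().rstrip())

# The Excel version (hypothetical file name) is a binary container;
# its raw bytes start with a ZIP signature, not readable text.
with open("Research_Data_DRI24.xlsx", "rb") as f:
    print(f.read(4))  # b'PK\x03\x04'
```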
- Structured data can be: (select all that apply)
- We may choose to store our data in open data formats because they: (select all that apply)
- Tidy data format only allows one value per cell.
- Explore the `moSmall.csv` dataset. What questions might you ask with this dataset? What columns (variables) will you keep?
- If you are saving the file `moSmall.csv` in a proprietary spreadsheet application like Microsoft Excel (Windows/macOS) or Numbers (macOS), you may be prompted to save the file as `.xlsx` or `.numbers`. What format would you choose to save it in? Why would you choose to do so?
High-quality data is measured by its validity, accuracy, completeness, consistency, and uniformity.
Processed data, even in a table, is going to be full of errors:
- Empty fields
- Multiple formats, such as "yes" or "y" or "1" for a positive response.
- Suspect answers, like a date of birth of 00/11/1234
- Impossible negative numbers, like an age of "-37"
- Dubious outliers
- Duplicated rows

And many more!
Cleaning data is the work of correcting the errors listed above, and moving towards high quality. This work can be done manually or programmatically.
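As a programmatic illustration, here is a minimal pandas sketch on a hypothetical survey table containing several of the errors above; each step mirrors one correction:

```python
import pandas as pd

# A hypothetical survey table exhibiting the errors listed above.
df = pd.DataFrame({
    "respondent": [1, 2, 2, 3, 4],
    "attended": ["yes", "y", "y", "1", None],
    "age": [34, -37, -37, 29, 41],
})

# Validity: collapse "y"/"1" variants into a single positive response.
df["attended"] = df["attended"].replace({"y": "yes", "1": "yes"})

# Remove duplicated rows (respondent 2 appears twice).
df = df.drop_duplicates()

# Impossible negative ages become missing rather than silently deleted.
df.loc[df["age"] < 0, "age"] = pd.NA

# Drop observations where a required field is empty.
df = df.dropna(subset=["attended"])
print(df)
```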
Measurements must be valid, in that they must conform to set constraints:
- The aforementioned "yes" or "y" or "1" should all be changed to one response.
- Certain fields cannot be empty, or the whole observation must be thrown out.
- Uniqueness, for instance no two people should have the same social security number.
Measurements must be accurate, in that they must represent the correct values. While an observation may be valid, it might at the same time be inaccurate. 123 Fake street is a valid, inaccurate street address.
Unfortunately, accuracy is mostly achieved in the observation process. To be achieved in the cleaning process, an outside trusted source would have to be cross-referenced.
Measurements must be complete, in that they must represent everything that might be known. This also is nearly impossible to achieve in the cleaning process! For instance in a survey, it would be necessary to re-interview someone whose previous answer to a question was left blank.
Measurements must be consistent, in that different observations must not contradict each other. For instance, one person cannot be represented as both dead and still alive in different observations.
Measurements must be uniform, in that the same unit of measure must be used in all relevant measurements. If one person's height is listed in meters and another in feet, one measurement must be converted.
Measurements are accurate when: (select one)
- observations do not contradict each other.
- they represent the correct values.*
- they are unique responses (e.g. no duplication).
- the same unit of measure is used in all relevant measurements.

- How do we know when our data is cleaned enough?
- What happens to the data that is removed?
- Explore the `moSmall.csv` dataset.
  - Are all the measurements valid? Try checking the `Object ID` column for duplicates.
  - How might you check whether `Is Public Domain` accurately represents the copyrights of the media objects?
  - Is the data collected complete? How might you deal with the NA or empty fields?
    - What assumptions do you have to make when you clean NA or empty fields?
  - Is the collected data consistent? Does the column `Is Public Domain` correspond with the data in `Rights and Reproduction`? If it does not, which would you follow? Why?
  - As the dataset is not one that we personally collected, how do we make sense of the fact that only `Female` or `|` is collected as responses in the `Artist Gender` column (with the exception of NA and empty fields)? What do we have to do to the data to make sure it is uniform? What decisions do we make in this process?
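If you want to probe these questions programmatically, a minimal pandas sketch (assuming the column names listed above) might look like:

```python
import pandas as pd

df = pd.read_csv("moSmall.csv")

# Validity: duplicated Object IDs would violate uniqueness.
print("Duplicate Object IDs:", df["Object ID"].duplicated().sum())

# Consistency: how many distinct rights statements fall under each
# public-domain flag? Disagreements show up as unexpected combinations.
print(df.groupby("Is Public Domain")["Rights and Reproduction"].nunique())

# Uniformity: inspect every distinct gender response, including NA.
print(df["Artist Gender"].value_counts(dropna=False))
```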
Each choice we make while manipulating our data ripples to other areas of our scholarship, institution, and communities (locally and globally).
"At a third level of impact, we can consider the social, economic, or political changes caused by one’s research processes or products, in both the short and long term." (Annette Markham, 2016,emphasis added)
We can consider this level of impact by asking questions about labor, surveillance, social and political discourse, etc.
- Whose labor and what materials are used to make the digital tools you use? How should we (those who benefit from the labor of other people) attribute others' labor? How can we (users of these tools) be held accountable?
- Could your research or project be used to justify or facilitate potentially harmful control or surveillance?
- Could it influence social or political discourse? Modes of profit?
Analysis can take many forms (just like the rest of this stuff!), but many techniques fall within a couple of categories:
Descriptive analysis: techniques geared towards summarizing a data set, such as:
- Mean (average)
- Median
- Mode
- Standard deviation
Inferential analysis: techniques geared towards testing a hypothesis about a population, based on your data set (a short sketch of both categories follows these lists), such as:
- Extrapolation
- P-Value calculation/Regression
Qualitative analysis: techniques geared towards understanding a phenomenon, rather than predicting and testing hypotheses, such as:
- Grounded Theory/Computational Grounded Theory
- Content Analysis
- Text Analysis
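To make the first two categories concrete, here is a minimal sketch using pandas and SciPy on two hypothetical samples of survey scores; the numbers are invented for illustration:

```python
import pandas as pd
from scipy import stats

# Two hypothetical samples of survey scores.
a = pd.Series([3, 4, 4, 5, 2, 4])
b = pd.Series([2, 3, 3, 4, 2, 3])

# Descriptive: summarize each sample.
print(a.mean(), a.median(), a.mode().tolist(), a.std())

# Inferential: a t-test of whether the two samples plausibly share a mean.
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```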
As we have discussed thus far, data are not neutral or objective. They are guided by and produced through our interests and assumptions, often shaped by our socio-political contexts. Hence, we must also understand that the forms of analyses we take to our data further shapes how we are choosing to tell the story. We are crafting a narrative through each of the stages of data that helps us communicate our projects to a wider audience. This is not to say that our analyses are not "empirical" or "scientific" but a suggestion to make transparent the theoretical foundations and perspectives that are guiding our interpretations. For a more nuanced perspective, consider The Numbers Don't Speak for Themselves in Data Feminism.
Descriptive analysis helps us summarize a data set.
- True*
- False

- As we consider the types of analysis that we choose to apply to our data set, what are we representing and leaving out?
- How do we guide our decisions of interpretation with our choices of analyses?
- Are we comfortable with the intended use of our research? Are we comfortable with the unintended use of our research? What are potential misuses of our outputs?
- What can happen when we are just going for the next big thing (tool/methods/algorithms), or have run out of time and/or budget for our project?
Visualizing your data helps you tell a story and construct a narrative that guides your audience in understanding your interpretation of a collected, cleaned, and analyzed dataset. Depending on the type of analysis you ran, different kinds of visualization can be more effective than others. In the table below are some examples of data visualization that can help you convey the message of your data and the ethical considerations you have been evaluating throughout your project.
Types of Analysis | Types of Visualization | When to Use | Example of Visualization
---|---|---|---
Comparisons | Bar charts | Comparison across distinct categories | From The Data for Public Good at the Graduate Center.
 | Histograms | Comparison across a continuous variable | From Policy Viz.
 | Scatter plots | Useful to check for correlation (not causation!) | From FiveThirtyEight.
Time | Stacked area charts | Evolution of a value across different groups | From Data to Viz.
 | Sankey diagrams | Displaying flows of changes | From Data to Viz.
 | Line graphs | Tracking changes over time | From The Data for Public Good at the Graduate Center.
Small numbers/percentages | Pie charts | Demonstrate proportions between categories | From The Library of Congress.
 | Tree maps | Demonstrate hierarchy and proportion | From The Data Visualization Catalogue.
Survey responses | Stacked bar charts | Compare total amounts across each group (e.g. plotting a Likert scale) | From The Library of Congress.
 | Nested area graphs | Visualize branching/nested questions | From Evergreen Data.
Place | Choropleth maps | Visualize values over a geographic area to demonstrate a pattern | From The Library of Congress.
 | Hex(bin) or Tile maps | Similar to choropleth maps, with hexbins/tiles representing regions equally rather than by geographic size | From R Graph Gallery.

Adapted from Stephanie D. Evergreen (2019), Effective Data Visualization: The Right Chart for the Right Data; The Data Visualization Catalogue; and From Data to Viz.
An example of effective data visualization can be seen in W.E.B. Du Bois' data portraits at the Paris Exposition in 1900, part of the Exhibit of American Negroes. Using engaging hand-drawn visualizations, he tells the narrative of what it meant to be Black in post-Emancipation America, translating sociological research and census data to reach beyond the academy. Head here to read more about Du Bois' project.
"[The]impact approach is targeted toward the possible or probable impact, rather than the prevention of impact in the first place. It acknowledges that we change the world as we conduct even the smallest of scientific studies, and therefore, we must take some personal responsibility for our methods." (Annette Markham, 2016)
Some areas to consider as we expand our range of ethical imagination beyond the three levels of impact are:
- Accessibility to people with disabilities
- International accessibility and language access
- Openness and accessibility
- What or when not to make things accessible
- When might we decide not to record data or to delete it
As we transform our results into visuals, we are also trying to tell a narrative about the data we collected. Data visualization can help us to decode information and share quickly and simply. Consider the range of impacts as you think through these questions.
- What are we assuming when we choose to visually represent data in particular ways?
- As you may have realized, many of the visualization examples work with quantitative data, as such, how do you think we can visualize qualitative data? (e.g. Word Clouds, Heat Map)
- How can data visualization mislead us? (e.g. Nathan Yau discusses how data visualization can lie)
- How can data visualization help us tell a story? (e.g. Data Feminism's On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints)
- Can you try to plot the `moSmall.csv` dataset based on the `Artist Gender` variable? What would you have to do before you can plot this graph? How might you explain what your visualization represents? (A sketch follows this list.)
- If you were collecting and/or analyzing data on folx in power, such as looking at the data from Tweets of Congress' project, would that change the way you consider your answers to the previous questions?
- Current ethical guidelines from the SAFE Lab at the University of Pennsylvania call for altering the text of collected social media posts to render them unsearchable. Why and when would you consider (or not consider) altering the collected tweets for publication?
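For the plotting challenge above, a minimal sketch with pandas and matplotlib might look like the following; note that relabeling `|` and missing values is itself an interpretive decision:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("moSmall.csv")

# Before plotting, the gender column needs work: its raw values are only
# "Female", "|", or empty, so we relabel them into readable categories.
# (Treating "|" as "not marked Female" is itself an interpretive choice.)
gender = (
    df["Artist Gender"]
    .fillna("Unrecorded")
    .replace({"|": "Not marked Female"})
)

gender.value_counts().plot(kind="bar", title="Artist Gender in moSmall.csv")
plt.tight_layout()
plt.show()
```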
Data and ethics are contextually driven. As such, there isn’t always a risk-free approach. We often have to work through ethical dilemmas while thinking through information that we may not have (what are the risks of doing/not doing this work?). We have reached a moment where the question is no longer what we could do but what we should do. Given the saturated data-driven world we currently live in, there is value in pausing and considering why and what we are collecting, researching, analyzing, and understanding. Starting a new project, especially one dealing with "big" data, can be exciting, but we now also have to first consider whom the collected data benefits and why the collection matters. The Institutional Review Board's (IRB) regulations may form the starting point of our considerations, but they should not be the end point of how we consider contextually driven ethics and data projects.
In addition, open access is not always the answer to concerns of reproducibility and/or ethical considerations. There are moments when the decision not to make a dataset or analysis openly accessible is valid. For example, when you are working with marginalized or vulnerable populations, the concern of causing more harm justifies restricting access. We may choose to control who has access to decrease the chances of misrepresentation (intentional or otherwise) or of having results taken out of context.
For a set of great questions to help you think through your data exploration and project planning, please check out Kristen Hackett's Tagging the Tower post, What to Consider when Planning a Digital Project.
1. Structured data can be: (Select all that apply)
- an XML list.*
- an Excel table.*
- an email chain.
- a collection of text files.

Revisit Lesson Stages of Data: Processed/Transformed to learn more.
2. Descriptive analysis helps us summarize a data set. (Select one of the following)

- True*
- False

Revisit lesson Stages of Data: Analyzed to learn more.
3. Measurements are accurate when: (Select one of the following)
- they represent the correct values.*
- observations do not contradict each other.
- they are unique responses (e.g. no duplication).
- the same unit of measure is used in all relevant measurements.

Revisit Lesson Stages of Data: Cleaned to learn more.
4. Research data can be defined as: (Select all that apply)
- materials or information necessary to come to my conclusion.*
- the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.*
- method of collection and analysis.
- objective and error-free.

Review lesson Data is Foundational to learn more.
5. The stages of data follow a single-iteration process, i.e. there is a fixed progression from data collection to visualization. (Select one of the following)

- False*
- True

Revisit Lesson Stages of Data: Raw to learn more.
6. Tidy data format only allows one value per cell. (Select one of the following)

- True*
- False

Revisit Lesson Data Structures: Tidy Data to learn more.
Now that you've gained an understanding of some of the considerations around data and ethics, let's think a bit further about how you may apply some of what we have discussed in your work.
We invite you to consider the ethical implications of the dataset of Refugee Arrivals that you will learn to manipulate in the Pandas workshop. This dataset is adapted from the one compiled by Jeremy Singer-Vine for his 2015 BuzzFeed article “Where U.S. Refugees Come From — And Go — In Charts,” which includes information on refugee arrivals to the United States between 2005 and 2015 from the Department of State’s Refugee Processing Center.
Some questions to ask in preparation are:
- Who collected this data?
- How and why is this data being collected?
- What assumptions are baked into this data?
- What consequences does this data have in the world?
- What does this data tell us about our world?
- Marieke Guy's data management presentation discusses some ideas around planning for data management before, during, and after a project.
- Queensland University of Technology's Management of Research Data provides some ideas around ownership, roles and responsibilities of data-driven projects. While this is specific to Queensland University of Technology, it is useful for understanding some of the different roles in a research project.
- The Graduate Center, CUNY's Data Management research guide provides resources and specific steps for CUNY faculty, staff, and students.
- The Council for Big Data, Ethics, and Society's Perspectives on Big Data, Ethics, and Society is a white paper that consolidates the council's discussions on big data, ethics, and society.
- Catherine D'Ignazio & Lauren F. Klein's Data Feminism (scroll down the page to access the book chapters for free). It looks at "big" data from a feminist perspective, and discusses the importance of understanding long histories and socio-political contexts in research, as well as providing an overview of the field.
- Feminist Data's Manifest-No discusses the realities of "big" data and the fallacies of unequal harm and risk distribution, particularly towards marginalized communities.
- Mimi Onuoha's Missing Data Sets looks at "blank spots that exist in spaces that are otherwise data-saturated," that usually affect those who are the most vulnerable.
- Digital Rhetorical Privacy Collective "is a coalition approach to studying privacy and surveillance in rhetoric, composition, and technical communication." The collective hosts the DRPC Privacy Week, provides teaching resources, and publishes a blog on all things privacy and writing.
- Computational social science with R is a 2-week summer institute program that follows the Bit By Bit: Social Research in the Digital Age format. The current repository (updated: Jul 2020) contains the institute's workshops and materials.
- The European Data Portal's tutorial on Open Data offers a guided insight to the importance of choosing the right format for open datasets.
- The Data Visualization Catalogue by Severino Ribecca provides a guide to data visualizations for different types of data and narratives.
- From Data to Viz also provides a guide to data visualization for different types of data and narratives.
- Consider a project where you are interested in the trend of Euro-American political views. You've decided to look at the 2018 European Social Survey and the U.S.-based 2018 General Social Survey. How would you approach the data? If you're interested in reporting on the trend of global political views, what do you have to consider when you join these data sets? What assumptions do you have to make? How would you collapse responses?
- How does increased data literacy add to your project planning?
- How do you address your use of data and your ethics? For example, how might ethics play a part in the way you think about (a) data collection? (b) anonymity and confidentiality? (c) data and its relation to the communities it emerges from?
- Consider your next project, what are some considerations from this workshop that you might bring into your project?