title

cover title

description

learning objectives

estimated time

prerequisites

instructors

authors

editors

before getting started

readings

projects

ethical considerations

additional datasets

Data Literacies

What is data? What counts as data? What are the ethical implications when working with data? How can we manage our data? These are questions we will explore throughout the workshop. Data is foundational to nearly all digital projects and often helps us to understand and express our ideas and narratives. Hence, in order to do digital work, we should know how data is captured, constructed, and manipulated. We will also engage with the ethical dimensions of what it means to work with data, from collection to visualization to management.

Define data in the context of digital research

Distinguish forms of data and data formats

Review the stages of data from collection to analysis

Differentiate institutional ethical requirements and situated ethical responsibilities

Use the “impact approach” to evaluate the possible effects of decisions researchers make at the different stages of data

Apply data management practices to store and manage data

2 - 3 hours

Command Line

description
(required) This workshop refers to concepts from the Command Line workshop. Some knowledge of the command line is central for anyone who wants to learn how to handle, process, and analyze data.

Tuka Al-Sahlani

Leanne Fan

Di Yoong

Stephen Zweibel

Kelsey Chatlosh

Patrick Sweeney

Patrick Smyth

Lisa Rhody

[Download the workshop dataset](https://raw.githubusercontent.com/DHRI-Curriculum/data-literacies/v2.0/files/moSmall.csv) (required) The dataset, `moSmall.csv`, will be used throughout the challenges in the workshop. To save the file to your local computer, right click on the _Download the workshop dataset_ link and choose `Save Link As...`. Note: It is important to make sure your file is saved as a `.csv` file. Original dataset taken from [The Metropolitan Museum of Art's Creative Commons Zero](https://github.com/metmuseum/openaccess).

In [Big? Smart? Clean? Messy? Data in the Humanities](http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/), Christof Schöch discusses what data means in the humanities and the necessity of 'smart big data.'

The book, [Bit By Bit: Social Research in the Digital Age](https://www.bitbybitbook.com/en/1st-ed/preface/), written by Matthew Salganik, approaches data and social research from a computational social science perspective. He also discusses the idea of 'readymade' and 'custommade' data alongside ethics.

[Ten Simple Rules for Responsible Big Data Research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373508/) explores some guidelines for addressing complex ethical issues that arise in any research project.

In [The Challenges and Possibilities of Social Media Data: New Directions in Literary Studies and the Digital Humanities](https://dhdebates.gc.cuny.edu/read/debates-in-the-digital-humanities-2023/section/a57b98ab-0f10-45d0-b205-3e563aab7ea8), Melanie Walsh advises researchers to think beyond the IRB and consider 'community engagement, citation, and data sharing' for ethically responsible digital research.

In the article ['Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject'](https://journals.sagepub.com/doi/abs/10.1177/1527476418796632?journalCode=tvna) Nick Couldry and Ulises A. Mejias argue that our relationship with data is a new form of colonialism where the 'exploitation of human beings through data' is akin to historical colonialism. You may access the full version through your university's library system.

[Data for Public Good](https://dataforgood.commons.gc.cuny.edu/) is a semester-long collaborative project led by CUNY graduate students. Each semester, a different public-interest dataset is explored to present information that is useful and informative to a public audience.

[SAFElab](https://www.asc.upenn.edu/research/centers/safe-lab), led by Dr. Desmond U. Patton, uses computational and social work approaches to understand the mechanisms of violence and to work on prevention of and intervention in violence that occurs in neighborhoods and on social media.

Data and data analysis are [not free from bias](https://medium.com/@angebassa/data-alone-isnt-ground-truth-9e733079dfd4). Data does not emerge from a magic black box; it is contextually driven. As we think about automating the analysis of "big" data, we have to be aware of [the "hidden" biases that get reproduced](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing).

De-identified information can be [reconstructed from piecemeal data](https://techscience.org/a/2015092903/) found across different sources. When we consider what we are doing with the data we have collected, we also need to think about the possible re-identification of our participants.

Consider how you may use [differential privacy](https://theconversation.com/explainer-what-is-differential-privacy-and-how-can-it-protect-your-data-90686) as a strategy against re-identification. Consider the [US Census 2020 example](https://www.ncsl.org/research/redistricting/differential-privacy-for-census-data-explained.aspx) on utilizing this strategy to address privacy concerns.

Big data projects often require sharing data sets across different individuals and teams. In addition, because we need to ensure that our work is reproducible and accountable, we may feel inclined to share the data we collect. As such, figuring out [how to share such data](https://techscience.org/a/2015101601/) is crucial in the project planning stage.

[National Science Foundation's open datasets](https://catalog.data.gov/organization/nsf-gov)

[Resources to Find the Data You Need (2016)](https://flowingdata.com/2016/11/10/find-the-data-you-need-2016-edition/)

[Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)

Data is Foundational

In this workshop we will be discussing the basics of research data in terms of material, transformation, and presentation. We will also be discussing the ethical issues that arise in data collection, cleaning, and representation. Because everyone has a different approach to and understanding of data and ethics, this workshop will also include multiple sites for discussion to help us think through what data literacies mean within our projects and broader applications.

What Constitutes Research Data?

The quotes below offer a variety of perspectives on research data from different stakeholders. We include these different approaches to suggest that there is no singular, definitive definition of research data; the right one depends on multiple factors, including your project considerations.

University

Material or information on which an argument, theory, test or hypothesis, or another research output is based.

— Queensland University of Technology. Manual of Procedures and Policies. Section 2.8.3.

Digital Project Management

What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models.

— Marieke Guy

Government Institution

Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.

— OMB-110, Subpart C, section 36, (d) (i)

Data Science

The short answer is that we can’t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions

— Angela Bassa

Broadly, research data can be understood as the materials or information necessary to come to your conclusion, but what these materials and information are depends on your project.

Forms of Data

There are many ways to represent data, just as there are many sources of data. What can you/do you count as data? Here's a small list of possibilities of collections of digital objects acquired and generated during research.

A small list of open multimedia formats:

| Form | Format | Common file extensions |
| --- | --- | --- |
| Images | TIFF (Tagged Image File Format) | `.tiff`, `.tif` |
| | JPEG2000 | `.jp2`, `.jpf`, `.jpx` |
| | PNG (Portable Network Graphics) | `.png` |
| Text | ASCII (American Standard Code for Information Interchange) | `.ascii`, `.dat`, `.txt` |
| | PDF (Portable Document Format) | `.pdf` |
| | CSV (Comma-Separated Values) | `.csv` |
| Audio | FLAC (Free Lossless Audio Codec) | `.flac` |
| | Ogg | `.ogg` |
| Video | MPEG-4 | `.mp4` |
| Others | XML (Extensible Markup Language) | `.xml` |
| | JSON (JavaScript Object Notation) | `.json` |
| | STL (STereoLithography file format, used in 3D modeling) | `.stl` |

For a fuller list of file formats, see the Library of Congress's Sustainability of Digital Formats.

The Importance of Using Open Data Formats

Open data formats are usually available to anyone free of charge and allow for easy reusability. Proprietary formats are often subject to copyrights, patents, or other restrictions, and depend on (expensive) licensed software. If the licensed software ceases to support its proprietary format or becomes obsolete, you may be stuck with a file format that cannot be easily opened or (re)used (e.g. .mac). For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats.

An illustration:

A screenshot of the moSmall dataset as a CSV file.

A screenshot of the moSmall dataset as an Excel file (.xlsx). Unlike the previous image, this is a proprietary format.

Sustainable formats are generally unencrypted, uncompressed, and follow an open standard.
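To make the point about open, text-only formats concrete, here is a minimal Python sketch showing that CSV text can be read with nothing beyond the language's standard library. The sample rows and values are invented for illustration and are not taken from `moSmall.csv`; the function name `read_csv_text` is also hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a CSV file on disk.
SAMPLE_CSV = "Object ID,Title,Is Public Domain\n1,Vase,True\n2,Portrait,False\n"


def read_csv_text(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_csv_text(SAMPLE_CSV)
for row in rows:
    # Every value arrives as plain text; nothing is hidden in the file.
    print(row["Object ID"], row["Title"], row["Is Public Domain"])
```

To read a real file instead, you could pass `open("moSmall.csv", newline="", encoding="utf-8")` to `csv.DictReader`. A proprietary spreadsheet file, by contrast, generally needs the original application or an additional library before its contents become readable.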

What forms of data do you use?

Evaluate

Research data can be defined as: (select all that apply)

- materials or information necessary to come to my conclusion.*
- the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.*
- method of collection and analysis.
- objective and error-free.

Challenge: Forms of Data

Below you will find two front matter pages of two distinct digital projects. As you inspect the information present in each image, consider these questions:

What are some forms of data used in the project?
What are some forms of data outputted by the project?
Where was the data retrieved from to complete the project?

Human Computers at NASA is an archival project that "seeks to shed light on the buried stories of African American women with math and science degrees who began working at NACA (now NASA) in 1943 in secret, segregated facilities."

From the image, we can deduce that newspaper articles (digital copies of text) and photographs (digital copies of images) were used to compile this archive. Noticing the highlighted name in the news article, the data may be outputted as searchable text, searchable database, and/or searchable images. The data most likely was retrieved from a database and/or non-digital field notes. This is the [data source page](https://omeka.macalester.edu/humancomputerproject/items/browse) for Human Computers At NASA.

Listen for the Iraqis in NYC! is an audio community mapping project that seeks to locate the Iraqi population in NYC using their own voices.

From the image, we can deduce that audio recordings of participants and a map (geospatial data) were used to compile this project. Given the details in the text on the right of the screen, we learn that the researcher will provide a map (geospatial data) and testaments (audio files) for us to peruse. The researcher has gathered digital field notes in the form of audio files from participants through a survey. The Call for Participants for Listen for the Iraqis in NYC! can be found [here](https://docs.google.com/document/d/1G8RxmEILImlW4O5LgRQ5I3e3JOlg0Sg6b7dM2CmrvEQ/edit).

Institutional Compliance for Data and Research

Institutional Review Board (IRB)

The Institutional Review Board (IRB) is a floor for ethical responsibility at your university that came to pass after outrage about horrific unethical research studies done on people. A prime example of these grotesque studies is the Tuskegee Syphilis Study (1932-1972).

Born from concerns about the ethical choices made in biomedical and behavioral research, IRB compliance is not broadly applicable. This leaves holes in institutional ethical regulations and requires researchers in other fields, such as the social sciences, to find other ethical regulations or devise field-specific ethical considerations.

When is an IRB required?

Usually, IRB review is required when ALL of the criteria below are met:

The investigator is conducting research or clinical investigation,
The proposed research or clinical investigation involves human subjects, and
Your university or research institution is engaged in the research or clinical investigation involving human subjects.

IRB review is an institutional compliance process that may not consider other ethical impacts. As we move forward in this workshop we will consider data and digital project ethics beyond compliance.

Example: Oral History Projects

Your oral history project does the following; do you need a CUNY HRPP/IRB?

Open-ended interviews that ONLY document a specific historical event or the experiences of individuals, without intent to draw conclusions or generalize findings.

HRPP/IRB Required?

No

Systematic investigations involving open-ended interviews that are designed to develop or contribute to generalizable knowledge (e.g., designed to draw conclusions, inform policy, or generalize findings).

HRPP/IRB Required?

Yes

Creation of archives for the purpose of providing a resource for others to do research. The intent of the archive is to create a repository of information for other investigators to conduct research.

HRPP/IRB Required?

Yes

For guidance and more examples see The CUNY Human Research Protection Program (HRPP), "CUNY HRPP Guidance: When is CUNY HRPP or IRB Review Required?"

Stages of Data

We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations. Stages of data typically consist of:

collection of "raw" data

We start with formulating a research question(s) or hypotheses and set up a project to answer our question(s).

E.g. What proportion of the artwork collected and/or hosted in the Met is by artists who are not cisgender men and is also in the public domain?

processing and/or transforming data

In the process of setting up the project, we make decisions on what kind of data we think can help us to answer the question.

E.g. We may retrieve the data from the Met's open access data set. We will need to look at what variables exist in the dataset to find out whether we can filter by gender and which variables correspond to copyright. (Note: if the file opens as a web page, use your machine's 'save as' option to save it as a CSV file so you can view it in tabular form.)

cleaning

After collecting our data we then consider and make decisions in the processes of cleaning.

E.g. We have to transform some of the gender values and decide what to do with the missing fields.

analysis

We then run our preliminary analysis of the data.

E.g. We can compare the subset of media objects that are in the public domain and by artists who are not cisgender men against the total number of media objects to find the proportion.

visualization

At the end of our analysis, a decision is then made about how we would present the data and its analysis.

E.g. We can present the result in a pie chart or a bar graph.
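The stages above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the workshop's actual analysis: the sample rows are invented, the function names are hypothetical, and the rule for deciding which rows count toward the subset is exactly the kind of cleaning decision you would need to make and document yourself.

```python
import csv
import io

# Hypothetical rows, invented for illustration; not taken from moSmall.csv.
SAMPLE_CSV = """Object ID,Artist Gender,Is Public Domain
1,Female,True
2,,True
3,Female,False
4,Male,True
"""


def collect(text):
    """Collection: read the "raw" rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: trim whitespace, mark missing gender, parse the flag."""
    cleaned = []
    for row in rows:
        gender = row["Artist Gender"].strip()
        cleaned.append({
            "gender": gender if gender else None,  # None means unknown
            "public_domain": row["Is Public Domain"].strip().lower() == "true",
        })
    return cleaned


def analyze(rows):
    """Analysis: share of all rows that fall in the subset of interest."""
    subset = [
        r for r in rows
        # Assumption: rows with unknown gender are excluded from the subset.
        if r["gender"] not in (None, "Male") and r["public_domain"]
    ]
    return len(subset) / len(rows)


share = analyze(clean(collect(SAMPLE_CSV)))
# Visualization: here just a formatted line of text.
print(f"{share:.0%} of sample rows are in the subset")
```

Changing the assumption about unknown values (for example, dropping those rows from the total as well) changes the result, which is why each decision should be recorded.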

Stages of Data: Non-linear

There is no one way to go through the stages. For example, we could do a preliminary analysis first, such as running a correlation of variables, to explore what is missing before we begin the process of cleaning. Often, we also end up doing multiple iterations of cleaning and analysis, making decisions and choices to collapse particular variables or remove them entirely at each iteration. Keeping clear documentation of our process keeps us accountable to the data we have collected or are using, and helps others replicate and reproduce our results if they choose to work on our "raw" data. While these decisions may seem innocuous, they carry ethical considerations and impacts, beyond the institution, that we must evaluate in the process.

Ethics Beyond Compliance

As we learn to manipulate data, we will consider our ethical obligations beyond institutional compliance such as an IRB. We will think of ethics as the moral principles that an individual aims to follow in practice to the best of their ability, research, and foresight. Using this definition of ethics, we then consider ethics as situated.

Situated ethics refers to the notion that a person's understandings of and commitments to ethics or morality are greatly linked to their own experiences, positionalities, and political orientations, as well as the particular context in which that person is putting such ethics into practice (Helen Simons and Robin Usher, Situated Ethics in Educational Research, 2000).

Thinking through how ethical ideas and practices, or lack thereof, are situated may prompt questions such as: How is data retrieved? By whom? For whom? From where? Why?

In the Command Line workshop you learned about the history of the computer and considered the questions: How were computers developed? By whom? Where? Why?

Levels of Impact

In "OKCupid data release fiasco: It’s time to rethink ethics education" (2016), Annette Markham asserts that ethical digital research is a methodology dependent on reflection, awareness of the debates and concerns in our respective fields, and accountability for the choices we make at each stage of our research. Thus, given the precarious nature of digital research and data, we need to use a "what if" approach that will help us evaluate "the possible or probable impact, rather than the prevention of the impact." This "impact approach" helps us expand our ethical imagination and consider ethics beyond prescriptions and compliance.

Drawing from Markham (2016), we will focus on three levels of impact:

1. Direct impacts on people
2. Ramifications of (re)producing categories
3. Social, political, and economic effects

Additionally, this workshop will address the range of impact, or the range of accessibility to our work:

to people with disabilities,
to people in different countries or who speak different languages, and
in terms of cost and proprietary accessibility.

Throughout the workshop we will refer to the impacts by number for quick reference.

Challenge: Ethics Beyond Compliance

Think of the following scenario and the possible considerations and impacts for working with data.

A graduate student decides to analyze a data set they collected by surveying fellow graduate students. The survey asks students to indicate their graduate level, current job, gross income, and housing status. The graduate student hopes to analyze the survey and present their findings at a student council meeting as part of the council's attempt to persuade the administration to provide more funding to graduate students. The student learns that they can analyze the data and create a visual using ChatGPT. For the sake of time, they decide to use ChatGPT to analyze and visualize the data set. What ethical considerations should the student evaluate? How might using ChatGPT impact the students surveyed?

Using a large language model application such as ChatGPT to analyze personal information collected by the survey can directly impact the "human subjects" and reproduce categories and information that can be harmful to the participants. The data entered into ChatGPT may be stored and used to produce outputs for other prompts, unbeknownst to the graduate student researcher or the participants. Considering these impacts, it is better ethical practice for the graduate student to use a tool that is private and secure for their analysis and visualization. This is only one of the possible considerations; there may be many others to evaluate.

Stages of Data: Raw

"Raw" data has yet to be processed, meaning it has yet to be manipulated by a human or computer. Received or collected data could be in any number of formats, locations, etc. It could be in any of the forms listed in the Forms of Data section.

But "raw" data is a relative term, inasmuch as when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is "raw" data. For example, we may consider the General Social Survey data to be "raw" as it will require us to filter out missing entries and collapse variables or fields before we can run our analysis. A researcher who participated in the creation of this survey may not consider the version on the site as "raw" because the "raw" version is the physical paper copies of the file. As you can see, this consideration of what is "raw" is non-definitive and is dependent on the project you are working on and the narrative you want to tell with the results.

If you are interested in further exploration and discussion of the ethics of "raw" data, please consider reading Drucker's article, which makes a useful distinction between "data" (understood as given) and "capta" (taken or "captured") and also troubles the distinction between "raw" and "processed" data.

Level of Impact I: Direct Impacts on People through Data Collection

Direct effects on people

"At the most basic level of an impact approach, we might ask how our methods of data collection impact humans, directly. If one is interviewing, or the data is visibly connected to a person, this is easy to see. But a distance principle might help us recognize that when the data is very distant from where it originated, it can seem disconnected from persons, or what some regulators call ‘human subjects.’" (Annette Markham, 2016, emphasis added)

This brings us to several questions:

What counts as "human"? What data should be off limits?
How do we account for personhood?
What is the distance principle? How does it impact the data we decide to collect?
What is "public" data?

The definitions of terms such as "human" and "public" are ambiguous. They are dependent on the forms of data, the context of the data, and the relationship of the data to the source and the researcher.

We return to our data set Research_Data_DRI24.csv, which is now on a public site designed to be widely accessible. Would we consider this data set "public" data, based on our understanding of the general term "public"?

For guidelines and working definitions of "human", "public", and "personhood" see the 2012 Ethical Decision-Making and Internet Research report by the AoIR Ethics Working Committee.

Data and Labor

As we think about data collection, we should also consider the labor involved in the process. Many researchers rely on Amazon Mechanical Turk (sometimes also referred to as MTurk) for data collection, often paying less than minimum wage for the task. These workers are often assumed to be retired, bored, and participating in online gig work for fun or to kill time. While this may be true for some, more than half of those surveyed in a Pew Research study said that the income from this work is essential or important. Those who view the income from this work as essential or important are also mostly from underserved communities.

In addition to being mindful of paying a fair wage to the workers on such platforms, this kind of working environment also raises further considerations about the data that is collected. For instance, to get close to minimum wage, workers cannot afford to spend much time on each task. Thinking through these circumstances, how do you think they impact the data we collect?

For a deeper discussion on data and labor, consider Catherine D'Ignazio and Lauren Klein's chapter Show Your Work in Data Feminism.

Evaluation

The stages of data form a single-iteration process, i.e. there is a fixed progression from data collection to visualization.

- True
- False*

Which of the following statements are true for "raw" data: (select all that apply)

- is data that is yet to be processed.*
- is data that is received and/or collected.*
- is the same to every researcher/research team.
- can only be collected from participants.

Challenge: Raw Data

If you have not done so, please download moSmall.csv and open it from your local computer/laptop. As the original file has about 500,000 entries, we've taken a random sample of 1% of the original dataset. In this case, would you consider this file to be a "raw" dataset?

The dataset would be a "raw" dataset for you because you would most likely need to remove certain variables or entries to work towards your question, or in this case our question: “What proportion of the artwork collected and/or hosted in the Met is by artists who are not cisgender men?”

Keywords

Data
"Raw" Data

Stages of Data: Processed/Transformed

Processing data puts it into a state more readily available for analysis and makes the data legible. For instance, it could be rendered as structured data. Structured data consists of clearly defined data types with patterns that make them easily searchable, while unstructured data is “everything else.” An example of unstructured data is an open-source book from Project Gutenberg; an example of structured data is a CSV file, such as our dataset.

Here is a reminder of the list of structured forms and formats we reviewed earlier in the workshop:

Spreadsheets (e.g. .xlsx, .numbers, .csv)
Audio (e.g. .mp3, .wav, .aac)
Video (e.g. .mov, .mp4)
Computer Aided Design/CAD (.cad)
Databases (e.g. .sql)
Geographic Information Systems (GIS) and spatial data (e.g. .shp, .dbf, .shx)
Digital copies of images (e.g. .png, .jpeg, .tiff)
Web files (e.g. .html, .asp, .php)
Matlab files & 3D Models (e.g. .stl, .dae, .3ds)
Metadata & Paradata (e.g. .xml, .json)

Data Structures: Tidy Data

There are different guidelines for processing data. One of them is the Tidy Data format, which follows these rules in structuring data:

Each variable is in a column.
Each observation is a row.
Each value is a cell.

Look at this example of cats to see how they may or may not follow those guidelines. Important note: Some data formats allow for more than one dimension of data (like the JSON structure below). How might that complicate the concept of Tidy Data?

{
    "Cats": [
            {
                "Calico": [
                    {
                        "firstName": "Smally",
                        "lastName":"McTiny"
                    },
                    {
                        "firstName": "Kitty",
                        "lastName": "Kitty"
                    }
                ],
                "Tortoiseshell": [
                    {
                        "firstName": "Foots",
                        "lastName":"Smith"
                    },
                    {
                        "firstName": "Tiger",
                        "lastName":"Jaws"
                    }
                ]
            }
        ]
}

How would you convert this nested data set into a tidy data structure?
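One possible answer, sketched in Python: walk the nested structure and emit one row per cat, promoting the coat pattern from a key to its own column. The column name `coat` and the function name `tidy_cats` are assumptions made for this illustration; the source data does not name that variable.

```python
import json

# The nested example from above, as a JSON string.
NESTED = """
{
    "Cats": [
        {
            "Calico": [
                {"firstName": "Smally", "lastName": "McTiny"},
                {"firstName": "Kitty", "lastName": "Kitty"}
            ],
            "Tortoiseshell": [
                {"firstName": "Foots", "lastName": "Smith"},
                {"firstName": "Tiger", "lastName": "Jaws"}
            ]
        }
    ]
}
"""


def tidy_cats(data):
    """Flatten the nesting into one observation (cat) per row."""
    rows = []
    for group in data["Cats"]:
        for coat, cats in group.items():
            for cat in cats:
                rows.append({
                    "coat": coat,  # hypothetical column name
                    "firstName": cat["firstName"],
                    "lastName": cat["lastName"],
                })
    return rows


tidy = tidy_cats(json.loads(NESTED))
for row in tidy:
    print(row["coat"], row["firstName"], row["lastName"])
```

Each variable (coat, first name, last name) now has its own column and each cat its own row, at the cost of repeating the coat value on every row.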

While tidy data is a really popular method of structuring and organizing data, it is not the only way to do so. Depending on the type of data you have, it is also not always the best way to structure data.

Table form does not equal Tidy

But not all rectangular data is tidy. Here is an example of how tabular (or rectangular) data can be transformed into more tidy data.

What new variables were created?

What does each of the three tables represent?
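As a separate illustration of the same idea (the table here is hypothetical and is not the one pictured), consider a rectangular table whose column headers hide a variable. The sketch below reshapes it so that the hidden variable becomes a column of its own. The names `melt`, `museum`, `year`, and `acquisitions` are all invented for this example.

```python
# A "wide" table: the years are values of a variable, but they appear as headers.
WIDE = [
    {"museum": "A", "2019": 10, "2020": 12},
    {"museum": "B", "2019": 7, "2020": 9},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-identifier column into its own row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                variable_name: key,
                value_name: value,
            })
    return long_rows


tidy = melt(WIDE, "museum", "year", "acquisitions")
for row in tidy:
    print(row)
```

The wide table had two rows and three columns; the reshaped one has four rows, each holding a single observation with `year` as an explicit variable.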

Data Management

Before beginning your data collection, manipulation, and transformation, it is good practice to determine your file naming conventions. How many times have you named something XXX_FinalFINALFINAL.pdf, or had difficulty finding the version of a file that contained the good idea that was edited out of the XXX_FinalFINALFINALFINAL.pdf version? While tools like version control with git can be helpful, we can also begin by setting up file naming conventions that help us succeed! Here's an example from Stanford that demonstrates the problems of badly named files in our projects.

For example, The Graduate Center's Data Management guide suggests that top-level folders (such as your main project folder) should include your project title, a unique identifier, and the date (year) of your project (e.g. dataliteracies_XYZ_2020). Your subfolders and individual files should follow a similar system, with an identifiable activity or project in the file name (e.g. a subfolder of the project: sections_xyz_2020, a file in the project: lessons_XYZ_2020.doc).

For a thorough understanding of the impact of data management on our research and the ethical quandaries that may arise from poor data management, please head over to Stephen Zweibel's interactive website Data Management.
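A naming convention is easier to follow when a small helper applies it for you. The sketch below is one hypothetical way to generate names in the title, identifier, year pattern described above; the function `build_name` is an invention for this example and is not part of the guide.

```python
import re


def build_name(label, identifier, year, extension=""):
    """Build a name such as dataliteracies_XYZ_2020 or lessons_XYZ_2020.doc."""
    # Lowercase the label and drop spaces and punctuation.
    slug = re.sub(r"[^a-z0-9]+", "", label.lower())
    name = f"{slug}_{identifier}_{year}"
    return f"{name}.{extension}" if extension else name


print(build_name("Data Literacies", "XYZ", 2020))
print(build_name("lessons", "XYZ", 2020, "doc"))
```

Generating names in one place keeps capitalization and ordering consistent, which makes files easier to sort and search later.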

Level of Impact II: Politics of Knowledge Production and Categorization

Politics of Knowledge Production

"At another level, we can ask how our methods of organizing data, analytical interpretations, or findings as shared datasets are being used—or might be used—to build definitional categories or to profile particular groups in ways that could impact livelihoods or lives. Are we contributing positive or negative categorizations?" (Annette Markham, 2016, emphasis added)

For ethical considerations of the impact of our knowledge production, it can be helpful to think through the concepts of Gramsci's hegemony, Foucault's discourse, and Hall's "policing the crisis". An example of the politics of power production can be found in Julia Angwin and Jeff Larson's 2016 article, "Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say."

Decisions on the categories and boundaries scholars use shape our:

Datasets
Catalogs
Maps
Algorithms

The Importance of Using Open Data Formats

A small detour to discuss data formats. Open data formats are usually available to anyone free of charge and allow for easy reusability. Proprietary formats are often subject to copyrights, patents, or other restrictions, and depend on (expensive) licensed software. If the licensed software ceases to support its proprietary format or becomes obsolete, you may be stuck with a file format that cannot be easily opened or (re)used (e.g. .mac). For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:

Open this file in a text editor (e.g. Visual Studio Code, TextEdit (macOS), Notepad (Windows)), and then in an app like Excel. This is a CSV, an open, text-only file format. To save the file to your local computer, right click on Research_Data_DRI24.csv and choose Save Link As (it's the same Research_Data_DRI24.csv from above!).
Now do the same with this Excel file. Unlike the previous, this is a proprietary format!

Sustainable formats are generally unencrypted, uncompressed, and follow an open standard.


Evaluation

Structured data can be: (select all that apply)

- an XML list.*
- an Excel table.*
- an email chain.
- a collection of text files.

We may choose to store our data in open data formats because they: (select all that apply)

- are sustainable.*
- allow for easy reusability.*
- are free-of-charge to use.*

Tidy data format only allows one value per cell.

- True*
- False

Challenge: Processed/Transformed

Explore the moSmall.csv dataset. What questions might you ask with this dataset? What columns (variables) will you keep?

You would keep the columns (variables) relevant to your question, such as the `Artist Gender`, `Is Public Domain`, and `Rights and Reproduction` columns. You would also keep some of the descriptive columns, such as `Object ID` and `Artist Role`, to help contextualize the results (e.g. what kind of roles do female artists tend to take on?).

If you are saving the file moSmall.csv in a proprietary spreadsheet application like Microsoft Excel (Windows/macOS) or Numbers (macOS), you may be prompted to save the file as .xlsx or .numbers. What format would you choose to save it in? Why would you choose to do so?

It is recommended to keep it as a `.csv` file, since more programs can open it, and if Microsoft stops supporting `.xlsx` files you may no longer be able to open the dataset. **Or** you may choose to switch to `.xlsx` because it is easier to use in a graphical user interface like Microsoft Excel. Any stylistic changes you make to the file will also be kept, such as alternating row highlighting for readability or bold column headings.

Keywords

CSV (file format)
Open Data Formats
Tidy Data

Stages of Data: Cleaned

High-quality data is measured by its validity, accuracy, completeness, consistency, and uniformity.

Processed data, even in a table, is going to be full of errors:

Empty fields
Multiple formats, such as "yes" or "y" or "1" for a positive response.
Suspect answers, like a date of birth of 00/11/1234
Impossible negative numbers, like an age of "-37"
Dubious outliers
Duplicated rows
And many more!

Cleaning data is the work of correcting the errors listed above, and moving towards high quality. This work can be done manually or programmatically.
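
As a rough illustration of doing this work programmatically, the sketch below looks for several of the errors listed above in a tiny table. The rows, the column order, and the variable names are all invented for this example, and the checks are only one possible way to write them.

```python
from collections import Counter

# Invented survey rows: (name, consent, age). Not real data.
rows = [
    ("Ana", "yes", "34"),
    ("Ben", "y", "-37"),
    ("Cy", "1", ""),
    ("Ana", "yes", "34"),  # exact duplicate of the first row
]

# Empty fields: rows where any cell is blank.
empty_fields = [r for r in rows if any(cell.strip() == "" for cell in r)]

# Multiple formats: distinct spellings used in the consent column.
consent_spellings = sorted({r[1] for r in rows})

# Impossible negative numbers in the age column.
negative_ages = [r for r in rows if r[2].lstrip("-").isdigit() and int(r[2]) < 0]

# Duplicated rows: any row that appears more than once.
duplicates = [r for r, n in Counter(rows).items() if n > 1]

print(len(empty_fields), consent_spellings, len(negative_ages), len(duplicates))
```

Finding a problem is not the same as fixing it. What to do with each flagged row remains a decision for the researcher.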

Validity

Measurements must be valid, in that they must conform to set constraints:

The aforementioned "yes" or "y" or "1" should all be changed to one response.
Certain fields cannot be empty, or the whole observation must be thrown out.
Some fields must be unique: for instance, no two people should have the same social security number.
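
The three constraints above could be enforced along the following lines. This is a minimal sketch: the accepted spellings, the `normalize_response` helper, and the sample observations are assumptions made for illustration, and keeping the first record when an identifier repeats is just one possible choice.

```python
# Hypothetical mapping from the spellings seen in the data to one response.
POSITIVE = {"yes", "y", "1"}
NEGATIVE = {"no", "n", "0"}

def normalize_response(value):
    """Return "yes" or "no", or None when the value is empty or unrecognized."""
    cleaned = value.strip().lower()
    if cleaned in POSITIVE:
        return "yes"
    if cleaned in NEGATIVE:
        return "no"
    return None

# Invented observations: (id_number, response).
observations = [("001", "Y"), ("002", "1"), ("003", ""), ("002", "no")]

valid = []
rejected = []
seen_ids = set()
for id_number, response in observations:
    normalized = normalize_response(response)
    if normalized is None or id_number in seen_ids:
        # The required field is empty, or the identifier is not unique.
        rejected.append((id_number, response))
        continue
    seen_ids.add(id_number)
    valid.append((id_number, normalized))

print(valid)
print(rejected)
```

Keeping the rejected rows in their own list, rather than discarding them silently, makes it easier to review what the constraints threw out.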

Accuracy

Measurements must be accurate, in that they must represent the correct values. An observation may be valid and still be inaccurate: 123 Fake Street is a valid but inaccurate street address.

Unfortunately, accuracy is mostly achieved in the observation process. To improve accuracy in the cleaning process, you would have to cross-reference a trusted outside source.

Completeness

Measurements must be complete, in that they must represent everything that might be known. This is also nearly impossible to achieve in the cleaning process! For instance, in a survey, it would be necessary to re-interview someone whose previous answer to a question was left blank.

Consistency

Measurements must be consistent, in that different observations must not contradict each other. For instance, one person cannot be represented as both dead and still alive in different observations.

Uniformity

Measurements must be uniform, in that the same unit of measure must be used in all relevant measurements. If one person's height is listed in meters and another in feet, one measurement must be converted.
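
The height example can be worked through in a few lines. The records and the `to_meters` helper below are invented for illustration. The conversion factor is the standard definition of the international foot.

```python
FEET_TO_METERS = 0.3048  # one international foot is defined as exactly 0.3048 m

# Invented height records: (value, unit).
heights = [(1.70, "m"), (5.5, "ft"), (1.82, "m")]

def to_meters(value, unit):
    """Convert a height to meters; raise on units this sketch does not know."""
    if unit == "m":
        return value
    if unit == "ft":
        return value * FEET_TO_METERS
    raise ValueError(f"unknown unit: {unit}")

heights_m = [round(to_meters(v, u), 2) for v, u in heights]
print(heights_m)
```

Raising an error on an unrecognized unit, instead of guessing, keeps a uniformity problem from turning into an accuracy problem.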

Evaluation

Measurements are accurate when: (select one)

- observations do not contradict each other.
- they represent the correct values.*
- they are unique responses (e.g. no duplication).
- the same unit of measure is used in all relevant measurements.

Challenge: When Do We Stop Cleaning?

How do we know when our data is cleaned enough?

Often this is decided before the cleaning process begins, perhaps after some quick visualization or analysis of the "raw" data. Generally, empty entries are removed from the data set. If you are working with social media data, you may also remove URLs, as these influence topic modeling algorithms (e.g. "http" may end up being the most prominent topic of the corpus). You may decide that this is where to stop cleaning.

Some might suggest removing stop words like "the," "a," and "an," but others may consider impact II, the politics of knowledge production, and feel uncertain about removing these words. This is especially because stop word dictionaries were generated from canonical Western texts that are not representative of the many variations of English. For example, if you were looking at the tweets of Singaporean youths, the stop word dictionary may not be appropriate.
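
To make the effect of these choices concrete, the sketch below counts words in two invented posts with and without URL and stop word removal. The posts, the very short stop word list, and the `tokenize` helper are assumptions for illustration. Real stop word lists are much longer and carry the biases discussed above.

```python
import re
from collections import Counter

# Invented posts, not real social media data.
posts = [
    "The hawker centre is the best lah http://example.com/a",
    "Wah the queue so long http://example.com/b",
]

# A tiny, hypothetical stop word list.
STOP_WORDS = {"the", "a", "an", "is"}

def tokenize(text, remove_urls=True, remove_stop_words=False):
    """Lowercase a post and split it into word tokens."""
    if remove_urls:
        text = re.sub(r"https?://\S+", "", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

with_urls = Counter(t for p in posts for t in tokenize(p, remove_urls=False))
without_urls = Counter(t for p in posts for t in tokenize(p))
without_stops = Counter(t for p in posts for t in tokenize(p, remove_stop_words=True))

print(with_urls["http"], without_urls["http"])
print(without_urls["the"], without_stops["the"])
```

Notice that "lah" survives only because it is missing from this particular list. A different list would make a different decision about which words count as meaningful.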

What happens to the data that is removed?

Usually the IRB desires that the data is destroyed. Removed data can remain in the original "raw" file. The file that is cleaned is usually a duplicate file to allow for recovery in case a researcher made a poor decision in the process of cleaning.

Explore the moSmall.csv dataset.
- Are all the measurements valid? Try checking the Object ID column for duplicates.
- How might you check if the Is Public Domain column accurately represents the copyright status of the media objects?
- Is the collected data complete? How might you deal with the NA or empty fields?
  - What assumptions do you have to make when you clean NA or empty fields?
- Is the collected data consistent? Does the column Is Public Domain correspond with the data in Rights and Reproduction? If it does not, which would you follow? Why?
- As the dataset is not one that we personally collected, how do we make sense of the fact that only Female or | appears as a response in the Artist Gender column (with the exception of NA and empty fields)? What do we have to do to the data to make sure it is uniform? What decisions do we make in this process?

Exploring the dataset, here are possible responses to the questions:

- Checking `Object ID` indicates that there are no duplicates in the dataset. Every entry is unique.
- You may choose to compare it to another trusted source, like a database from [The Getty Research Institute](https://www.getty.edu/research/tools/).
- The collected data is not complete. There are missing fields. Depending on where the missing field is, you may choose to code it as `0` for ease of analysis. For example, the column `Dynasty` contains only 1 meaningful entry within this sample data set, so you may choose to drop it and not run any analysis that relies on this column. The column `Accession Year` has only 1 NA, and you may choose to drop that row if this becomes a useful variable for your analysis.
- While `Rights and Reproduction` contains a lot of NA and inappropriate responses (e.g. "Ceramics"), for the most part, for the items labeled as `YES` in the column `Is Public Domain`, the corresponding entry in `Rights and Reproduction` does not record a copyright holder. You may assume that the NA can stand in for the object being in the public domain.
- Taking only `Female` as a valid gender response, everything else will be converted to a `0` for ease of analysis. You may assume `|` is equivalent to an NA or an empty field rather than an alternative gender. Hence in this analysis, the proportion will only record female artists' objects against the rest of the collected items. You may not necessarily be able to answer the larger question of all non-cisgender men against the total in this case.
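
Some of these checks can be sketched in code. The rows below are invented and only reuse column names mentioned in this challenge; they are not taken from moSmall.csv. To try this on the real file, you would read it from disk instead of from the `sample` string.

```python
import csv
import io
from collections import Counter

# Invented rows that reuse column names from this challenge.
sample = """Object ID,Artist Gender,Accession Year
101,Female,1979
102,|,NA
103,,1980
104,NA,1981
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Validity: does any Object ID appear more than once?
id_counts = Counter(row["Object ID"] for row in rows)
duplicate_ids = [object_id for object_id, n in id_counts.items() if n > 1]

# Completeness: count NA or empty cells in each column.
missing = Counter()
for row in rows:
    for column, value in row.items():
        if value.strip() in ("", "NA"):
            missing[column] += 1

# Uniformity: one possible recoding, Female -> 1 and everything else -> 0.
gender_coded = [1 if row["Artist Gender"] == "Female" else 0 for row in rows]

print(duplicate_ids, dict(missing), gender_coded)
```

The recoding step is where the assumption about `|` becomes permanent: once every other response is a `0`, the difference between "unknown," "empty," and "not recorded" can no longer be recovered from the cleaned file.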

Level of Impact III: Social, Political, and Economic Impacts of Projects or Research

Each choice we make while manipulating our data ripples to other areas of our scholarship, institution, and communities (locally and globally).

"At a third level of impact, we can consider the social, economic, or political changes caused by one’s research processes or products, in both the short and long term." (Annette Markham, 2016, emphasis added)

We can consider this level of impact by asking questions about labor, surveillance, social and political discourse, etc.

Whose labor and what materials are used to make the digital tools you use? How should we (those who benefit from the labor of other people) attribute others' labor? How can we (users of these tools) be held accountable?
Could your research or project be used to justify or facilitate potentially harmful control or surveillance?
Could it influence social or political discourse? Modes of profit?

Stages of Data: Analyzed

Analysis can take many forms (just like the rest of this stuff!), but many techniques fall within a few categories:

Descriptive Analysis

Techniques geared towards summarizing a data set, such as:

Mean (average)
Median
Mode
Standard deviation
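
As a quick illustration, Python's standard `statistics` module can compute each of these summaries. The ages below are invented for the example.

```python
import statistics

# Invented ages for illustration.
ages = [22, 25, 25, 31, 47]

mean = statistics.mean(ages)      # arithmetic mean, often just called the average
median = statistics.median(ages)  # middle value when the data is sorted
mode = statistics.mode(ages)      # most frequent value
spread = statistics.stdev(ages)   # sample standard deviation

print(mean, median, mode, round(spread, 2))
```

Here the mean (30) sits above the median (25) because one older participant pulls it upward, a small reminder that the choice of summary already shapes the story.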

Inferential Analysis

Techniques geared towards testing a hypothesis about a population, based on your data set, such as:

Extrapolation
P-value calculation/Regression
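
A minimal sketch of regression and extrapolation follows, using ordinary least squares with a single predictor written out by hand. The observations are invented, and the sketch does not compute a p-value; in practice you would likely use a statistics library for that.

```python
# Invented observations: year of a project and number of items digitized.
years = [1, 2, 3, 4]
items = [12, 19, 31, 38]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(items) / n

# Ordinary least squares for a single predictor.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, items)) / sum(
    (x - mean_x) ** 2 for x in years
)
intercept = mean_y - slope * mean_x

def predict(x):
    """Predicted value on the fitted line."""
    return intercept + slope * x

# Extrapolation: a prediction outside the observed range of years.
print(slope, intercept, predict(6))
```

The prediction for year 6 assumes the trend continues unchanged beyond the data, which is exactly the kind of assumption worth making transparent.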

Qualitative Analysis

Techniques geared towards understanding a phenomenon, rather than predicting and testing hypotheses, such as:

Grounded Theory/Computational Grounded Theory
Content Analysis
Text Analysis

As we have discussed thus far, data are not neutral or objective. They are guided by and produced through our interests and assumptions, often shaped by our socio-political contexts. Hence, we must also understand that the forms of analysis we apply to our data further shape how we choose to tell the story. We are crafting a narrative through each of the stages of data that helps us communicate our projects to a wider audience. This is not to say that our analyses are not "empirical" or "scientific," but to suggest that we make transparent the theoretical foundations and perspectives that guide our interpretations. For a more nuanced perspective, consider The Numbers Don't Speak for Themselves in Data Feminism.

Evaluation

Descriptive analysis helps us summarize a data set.

- True* - False

Challenge: Analysis

As we consider the types of analysis that we choose to apply onto our data set, what are we representing and leaving out?

You may choose to leave out data that are perceived to be outliers, especially if they differ too much from the "normal" curve. You may end up representing only those who fall within the "normal" curve which may not actually be an equitable representation. This would require considering ethical impact two: the ramifications of (re)producing categories.

How do we guide our decisions of interpretation with our choices of analyses?

The interpretation of the results should align itself with the type of analyses that you ran. In addition, it should be guided in some capacity by previous work in this area to inform your understanding and the ethical implications you have evaluated.

Are we comfortable with the intended use of our research? Are we comfortable with the unintended use of our research? What are potential misuses of our outputs?

One potential misuse we should concern ourselves with is the weaponization of marginalized participants' words and thoughts. We need to be wary of unintended uses of our research because we cannot anticipate every circumstance in which the analysis could be misused or misquoted. When working on an oral history project, we may set up layers of boundaries that make audio files harder to access, as a way of negotiating between access and the protection of our narrators. Walsh, in [The Challenges and Possibilities of Social Media Data: New Directions in Literary Studies and the Digital Humanities](https://dhdebates.gc.cuny.edu/read/debates-in-the-digital-humanities-2023/section/a57b98ab-0f10-45d0-b205-3e563aab7ea8#ch18), gives the example of the Inter-University Consortium for Political and Social Research (ICPSR), which requires different "levels of restriction and access" to social media data.

What can happen when we are trying to just go for the next big thing (tool/methods/algorithms) or just ran out of time and/or budget for our project?

In chasing the next big thing, the original intentions for beginning the project might be lost. When working with communities, our priority is that our work is meaningful to them, and the excitement of exploring a new tool can sometimes distract us from this intention. Running out of time and/or budget can also mean that the project ends abruptly, and relationships that were built could be strained in a haphazard wrap-up. This brings us back to spending a significant amount of time on project planning before the project begins, to reduce the chances of this happening.

Keywords

Descriptive Analysis
Inferential Analysis
Qualitative Analysis

Stages of Data: Visualized

Visualizing your data helps you tell a story and construct a narrative that guides your audience in understanding your interpretation of a collected, cleaned, and analyzed dataset. Depending on the type of analysis you ran, different kinds of visualization can be more effective than others. In the table below are some examples of data visualization that can help you convey the message of your data and the ethical considerations you have been evaluating throughout your project.

Examples of Data Visualization

| Types of Analysis | Types of Visualization | When to Use | Example of Visualization |
| --- | --- | --- | --- |
| Comparisons | Bar charts | Comparison across distinct categories | From The Data for Public Good at the Graduate Center. |
| Comparisons | Histograms | Comparison across a continuous variable | From Policy Viz. |
| Comparisons | Scatter plots | Useful to check for correlation (not causation!) | From FiveThirtyEight. |
| Time | Stacked area charts | Evolution of value across different groups | From Data to Viz. |
| Time | Sankey Diagrams | Displaying flows of changes | From Data to Viz. |
| Time | Line graphs | Tracking changes over time | From The Data for Public Good at the Graduate Center. |
| Small numbers/percentages | Pie charts | Demonstrate proportions between categories | From The Library of Congress. |
| Small numbers/percentages | Tree maps | Demonstrate hierarchy and proportion | From The Data Visualization Catalogue. |
| Survey responses | Stacked bar charts | Compares total amount across each group (e.g. plotting Likert scale) | From The Library of Congress. |
| Survey responses | Nested area graphs | Visualize branching/nested questions | From Evergreen Data. |
| Place | Choropleth maps | Visualize values over a geographic area to demonstrate pattern | From The Library of Congress. |
| Place | Hex(bin) or Tile maps | Similar to Choropleth with the hexbin/tile representing regions equally rather than by geographic size | From R Graph Gallery. |

Adapted from Stephanie D. Evergreen (2019), Effective Data Visualization: The Right Chart for the Right Data; The Data Visualization Catalogue; and From Data to Viz.

An example of effective data visualization can be seen in W.E.B. Du Bois's data portraits at the Paris Exposition in 1900, as part of the Exhibit of American Negroes. Using engaging hand-drawn visualizations, he tells the narrative of what it meant to be Black in post-Emancipation America, translating sociological research and census data to reach beyond the academy. Head here to read more about Du Bois's project.

Range of impact, or the range of accessibility

"[The] impact approach is targeted toward the possible or probable impact, rather than the prevention of impact in the first place. It acknowledges that we change the world as we conduct even the smallest of scientific studies, and therefore, we must take some personal responsibility for our methods." (Annette Markham, 2016)

Some areas to consider as we expand our range of ethical imagination beyond the three levels of impact are:

Accessibility to people with disabilities
International accessibility and language access
Openness and accessibility
What or when not to make things accessible
When might we decide not to record data or delete it

Challenge: Visualizations

As we transform our results into visuals, we are also trying to tell a narrative about the data we collected. Data visualization can help us to decode information and share it quickly and simply. Consider the range of impacts as you think through these questions.

What are we assuming when we choose to visually represent data in particular ways?

An underlying assumption we make is that the conventions of top-down, left-right are universal, or at least universal enough for most folx to understand. This neglects potential right-to-left readers. Conventions that use color to represent good and bad (e.g. green as good and red as bad) also assume that this differentiation works for everyone, which excludes those with visual impairments who cannot decipher the data in the same way.

As you may have realized, many of the visualization examples work with quantitative data. How do you think we can visualize qualitative data? (e.g. Word Clouds, Heat Map)

Exploring [Voyant-Tools](https://voyant-tools.org/) can be a good place to start to see what visualization of qualitative data can look like.

How can data visualization mislead us? (e.g. Nathan Yau discusses how data visualization can lie)

Differences exaggerated through the choice of scales on the x- and y-axes can mislead a casual viewer into thinking that the data represents a larger difference than it actually reports.
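
A small worked calculation shows how much a truncated axis can exaggerate a difference. The values and the `bar_height_ratio` helper are invented for illustration.

```python
def bar_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b when the axis begins at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Invented values: two groups that differ by 4 percent.
group_a, group_b = 50.0, 52.0

honest = bar_height_ratio(group_a, group_b)                    # axis starts at 0
truncated = bar_height_ratio(group_a, group_b, axis_start=48)  # axis starts at 48

print(honest, truncated)
```

With the axis starting at 0, the second bar is drawn 1.04 times as tall as the first. With the axis starting at 48, it is drawn twice as tall, even though the underlying values have not changed.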

How can data visualization help us tell a story? (e.g. Data Feminism's On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints)

Data visualization can help us convey dense information quickly. The casual viewer can glance at the visualization and understand what we are trying to communicate with our data. Data visualization can also be an affective device, as in the Du Bois examples, which help to convey the urgency of the narrative/story.

Can you try to plot the moSmall.csv dataset based on the Artist Gender variable? What would you have to do before you can plot this graph? How might you explain what your visualization represents?

The difficulty of representing this dataset is that, at first glance, the chart suggests that gender is binary, since only 2 bars represent the dataset. Even though the other bar is labeled `Unknown` to signal that this is not a comprehensive breakdown, it is worth asking how effective that label is. ![Plot of media objects in public domain by gender of artist](/images/data-ethics/genderPD.png)
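
Before plotting, the responses have to be collapsed into the categories the chart will show. The sketch below does this for a handful of invented records and prints a text-only bar chart. The records, the `YES`/`NO` spellings, and the `label` helper are assumptions for illustration and are not taken from moSmall.csv.

```python
from collections import Counter

# Invented (Artist Gender, Is Public Domain) pairs.
records = [("Female", "YES"), ("|", "YES"), ("", "NO"), ("Female", "NO"), ("NA", "YES")]

def label(gender):
    """Collapse every response other than Female into Unknown."""
    return "Female" if gender == "Female" else "Unknown"

# Count public domain objects per label: these are the bar heights.
public_domain = Counter(label(g) for g, status in records if status == "YES")

# A text-only bar chart, one # per object.
for name in ("Female", "Unknown"):
    print(f"{name:8}{'#' * public_domain[name]} {public_domain[name]}")
```

The `label` function is where the two-bar chart gets its shape, so it is worth explaining that decision alongside the visualization.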

Additional Exploration

If you were collecting and/or analyzing data on folx in power, such as looking at the data from Tweets of Congress' project, would that change the way you consider your answers to the previous questions?
Current ethical guidelines from SAFE Lab at University of Pennsylvania have decided to alter the text of social media posts to render it unsearchable. Why and when would you consider (or not) altering the collected tweets for publication?

Concluding Thoughts

Data and ethics are contextually driven. As such, there isn’t always a risk-free approach. We often have to work through ethical dilemmas while thinking through information that we may not have (what are the risks of doing/not doing this work?). We have approached a moment where the question is no longer what we could do but what we should do. Given the saturated, data-driven world we currently live in, there is value in pausing and considering why and what we are collecting, researching, analyzing, and understanding. Starting on a new project, especially one dealing with "big" data, can be exciting, but we now also have to first consider whom the collected data benefits and why it is important. The IRB (Institutional Review Board)'s regulations may form the starting point of our considerations but should not be the ending point of how we consider contextually-driven ethics and data projects.

In addition, open access is not always the answer to concerns of reproducibility and/or ethical considerations. There are moments when the decision not to make a dataset or analysis openly accessible is valid. For example, when you are working with marginalized or vulnerable populations, concerns about causing more harm justify restricting access. We may choose to control who has access to decrease the chances of misrepresentation (intentional or otherwise) or of having results taken out of context.

For a set of great questions to help you think through your data exploration and project planning, please check out Kristen Hackett's Tagging the Tower post, What to Consider when Planning a Digital Project.

Review your knowledge: questions from the lessons

1. Structured data can be: (Select all that apply)

- an XML list.*
- an Excel table.*
- an email chain.
- a collection of text files.

Revisit Lesson Stages of Data: Processed/Transformed to learn more.

2. Descriptive analysis helps us summarize a data set. (Select one of the following)

- True*
- False

Revisit lesson Stages of Data: Analyzed to learn more.

3. Measurements are accurate when: (Select one of the following)

- they represent the correct values.*
- observations do not contradict each other.
- they are unique responses (e.g. no duplication).
- the same unit of measure is used in all relevant measurements.

Revisit Lesson Stages of Data: Cleaned to learn more.

4. Research data can be defined as: (Select all that apply)

- materials or information necessary to come to my conclusion.*
- the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.*
- method of collection and analysis.
- objective and error-free.

Review lesson Data is Foundational to learn more.

5. The stages of data are a single-iteration process, i.e. there is a fixed stage progression from data collection to visualization. (Select one of the following)

- False*
- True

Revisit Lesson Stages of Data: Raw to learn more.

6. Tidy data format only allows one value per cell. (Select one of the following)

- True*
- False

Revisit Lesson Data Structures: Tidy Data to learn more.

Theory to Practice

Now that you've gained an understanding of some of the considerations around data and ethics, let's think a bit further about how you may apply some of what we have discussed in your work.

We invite you to consider the ethical implications of the dataset of Refugee Arrivals that you will learn to manipulate in the Pandas workshop. This dataset is adapted from the one compiled by Jeremy Singer-Vine for his 2015 BuzzFeed article “Where U.S. Refugees Come From — And Go — In Charts,” which includes information on refugee arrivals to the United States between 2005 and 2015 from the Department of State’s Refugee Processing Center.

Some questions to ask in preparation are:

Who collected this data?
How and why is this data being collected?
What assumptions are baked into this data?
What consequences does this data have in the world?
What does this data tell us about our world?

Suggested Further Resources

Data management

Marieke Guy's data management presentation discusses some ideas around planning for data management before, during, and after a project.
Queensland University of Technology's Management of Research Data provides some ideas around ownership, roles and responsibilities of data-driven projects. While this is specific to Queensland University of Technology, it is useful for understanding some of the different roles in a research project.
The Graduate Center, CUNY's Data Management research guide provides resources and specific steps for CUNY faculty, staff, and students.

Ethics and ("big" data) research

The Council for Big Data, Ethics, and Society's Perspectives on Big Data, Ethics, and Society is a white paper that consolidates the council's discussions on big data, ethics, and society.
Catherine D'Ignazio & Lauren F. Klein's Data Feminism (scroll down the page to access the book chapters for free). It looks at "big" data from a feminist perspective, and discusses the importance of understanding long histories and socio-political contexts in research, as well as providing an overview of the field.
Feminist Data's Manifest-No discusses the realities of "big" data and the fallacies of unequal harm and risk distribution, particularly towards marginalized communities.
Mimi Onuoha's Missing Data Sets looks at "blank spots that exist in spaces that are otherwise data-saturated," that usually affect those who are the most vulnerable.
Digital Rhetorical Privacy Collective "is a coalition approach to studying privacy and surveillance in rhetoric, composition, and technical communication." The collective hosts the DRPC Privacy Week, provides teaching resources, and publishes a blog on all things privacy and writing.

Projects or Challenges to Try

Consider a project where you are interested in the trend of Euro-American political views. You've decided to look at the 2018 European Social Survey and the U.S.-based 2018 General Social Survey. How would you approach the data? If you're interested in reporting on the trend of global political views, what do you have to consider when you join these data sets? What assumptions do you have to make? How would you collapse responses?

Discussion Questions

How does increased data literacy add to your project planning?
How do you address your use of data and your ethics? For example, how might ethics play a part in the way you think about (a) data collection? (b) anonymity and confidentiality? (c) data and its relation to the communities it emerges from?
Consider your next project. What are some considerations from this workshop that you might bring into your project?
