I was inspired to create this list after taking many project-based CS and AI classes at Stanford, where I often spent more time finding data for a problem I actually cared about than writing the baseline algorithm.
The list is divided by sector, and each link has a (D), (T), or (C) next to it. (D) represents a dataset; (T) represents a tutorial; (C) represents an online challenge you can download data from and contribute knowledge to.
I am sure there are many great datasets I have missed. If you have datasets to add, please create a pull request!
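
The (D), (T), and (C) markers described above are regular enough to be read by a script, which may be handy for checking a new entry before opening a pull request. The following is only a minimal sketch: the names `ENTRY`, `KINDS`, `parse_entries`, and `count_by_kind` are hypothetical, and it assumes each entry is a single line of the form `- Title (X)`.

```python
import re
from collections import Counter

# Assumed entry shape: "- Title (X)", where X is D, T, or C.
ENTRY = re.compile(r"^-\s+(?P<title>.+?)\s+\((?P<kind>[DTC])\)\s*$")

# Meaning of each marker, as stated in the introduction.
KINDS = {"D": "dataset", "T": "tutorial", "C": "challenge"}


def parse_entries(lines):
    """Return (title, kind) pairs for lines that follow the marker convention."""
    entries = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:  # lines without a marker (e.g. bare URLs) are skipped
            entries.append((match.group("title"), KINDS[match.group("kind")]))
    return entries


def count_by_kind(lines):
    """Count how many entries carry each kind of marker."""
    return Counter(kind for _, kind in parse_entries(lines))


sample = [
    "- Predicting Blood Donations (D)",
    "- Classifying Breast Cancer Tumors (T)",
    "- Mental Health in Tech (C)",
    "- https://www.kaggle.com/datasets",
]
print(count_by_kind(sample))
```

Entries that do not end in one of the three markers are ignored rather than reported, so a missing marker on a new entry would show up as a lower count than expected.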
- Lung Cancer Early Detection Challenge (C)
- Predicting Blood Donations (D)
- Modeling Women's Health Care Decisions (C)
- New York Health Data Portal (D)
- Medicaid Adult Health: Diabetes Information (D)
- US Health Data Portal (D)
- State Medicaid Data (D)
- Youth Tobacco Legislation Data (D)
- US Chronic Disease Indicators (D)
- Broad Institute Cancer Programs Datasets (D)
- Medicare Data (D)
- Mental Health in Tech (C)
- UCI Student Alcohol Consumption Dataset (D)
- NIH Chest X-Ray Dataset (D)
- California Kindergarten Vaccinations (D)
- Classifying Breast Cancer Tumors (T)
- Third Grade Reading Scores for San Mateo County (D)
- Wall Street Journal: Where it Pays to Attend College (D)
- Popular Online edX Courses from Harvard and MIT (D)
- World Bank Education Status Indicators (D)
- Cost of Higher Education in the US (D)
- Brazilian High School National Exam Scores (D)
- Indian Primary and Secondary Education Data (D)
- Visualize the State of Public Education in Colorado (C)
- National Student Loan Data System (D)
- 2010 Federal STEM Education Inventory Dataset (D)
- National School Lunch Assistance Program Data (D)
- Predicting Faulty Water Pumps in Tanzania (D)
- Air Quality and Pollution (D)
- Lead Testing in School Drinking Water (D)
- US Climate Data (D)
- Commercial Building Energy Dataset (D)
- ETH Zurich Electricity Consumption and Occupancy Dataset (D)
- US Energy Information Administration Electric Power and Fossil Fuel Data (D)
- UN Greenhouse Gas Inventory Data (D)
- UN World Meteorological Organization Standard Normals (D)
- Predicting US Presidential Election Outcomes (T)
- New York City Open Data (D)
- San Francisco Open Data (D)
- Austin Open Data (D)
- Seattle Open Data (D)
- Los Angeles Open Data (D)
- Denver Open Data (D)
- Bureau of Labor Statistics Employment Data (D)
- U.S. Census Bureau’s Small Area Income and Poverty Estimates (D)
- CIA World Factbook (D)
- USDA Food and Nutrition Service: SNAP Vendor Data (D)
- US Open Gov (D)
- American FactFinder (D)
- City of Chicago Crime Data (D)
- US Traffic Data (D)
- East Palo Alto Homelessness Data (D)
- Global Terrorism Database (C)
- World Bank World Development Indicators (D)
- Fake News Dataset (D)
- Credit Card Fraud Detection (D)
- Crime in India Dataset (D)
- Fatal Police Shootings in the US (D)
- Crimes Committed in France (D)
- Homelessness in USA (D)
- Modeling Bias in Age, Race, and Gender (T)
- Classifying Anti-Refugee Tweets (T)
- https://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-data-sets
- https://ibmhadoop.devpost.com/details/data
- http://kevinchai.net/datasets
- https://www.kaggle.com/datasets
- http://archive.ics.uci.edu/ml/datasets.html?sort=nameUp&view=list
- https://github.com/rafalab/dslabs/tree/master/data