Publicly Available Data Sets
The DCR maintains a list of sites that house publicly available data sets. Some of these data sets are governed by data use agreements, while others are free to use without such agreements. Proper ethical considerations of data use are expected to be followed whether or not a data use agreement is required.
Data Sets in Public Health and Medicine
May require use agreements
These data sets are publicly available but may be subject to specific terms of use and data access requirements. For example, you may need to submit a data request form to obtain access to the data.
Data | Description |
---|---|
Adolescent Behaviors and Experiences Survey (ABES) | Adolescent Behaviors and Experiences Survey (ABES) Source: Centers for Disease Control & Prevention |
Behavioral Risk Factor Surveillance System (BRFSS) | BRFSS collects data on health risk behaviors, chronic health conditions, and use of preventive services in the United States. |
CCG Data | Center for Cancer Genomics (CCG) data resources Source: National Cancer Institute |
CDC WONDER | The Centers for Disease Control and Prevention (CDC) provides access to the Wide-ranging Online Data for Epidemiologic Research (WONDER) system, which offers a wide range of public health data, including mortality, morbidity, and population data. |
CGCI Data | Cancer Genome Characterization Initiative (CGCI) Source: National Cancer Institute |
COVID-19 Datasets | Several datasets related to the COVID-19 pandemic are publicly available, including data on cases, testing, and vaccine distribution. |
CTD2 Data | Cancer Target Discovery and Development (CTD²) Network / C-T-D-Squared Source: National Cancer Institute |
HCMI Data | Human Cancer Models Initiative (HCMI) Source: National Cancer Institute |
HCUP-US Databases | Healthcare Cost and Utilization Project (HCUP) Source: Agency for Healthcare Research and Quality |
HCUP-US NIS Overview | Healthcare Cost and Utilization Project (HCUP) National (Nationwide) Inpatient Sample (NIS) Source: Agency for Healthcare Research and Quality |
Health Information National Trends Survey (HINTS) | Health Information National Trends Survey (HINTS) Source: National Cancer Institute |
KID Database | Kids’ Inpatient Database (KID) Source: Agency for Healthcare Research and Quality |
Medical Image Datasets | Various medical image datasets are available for tasks like image analysis and machine learning. Examples include the National Library of Medicine's Chest X-ray and The Cancer Imaging Archive (TCIA). |
Medicare Data | The Centers for Medicare & Medicaid Services (CMS; https://data.cms.gov/) offers datasets related to healthcare utilization, provider performance, and payment information through the Medicare program. |
Monitoring the Future (MTF) | Monitoring the Future (MTF) Public-Use Cross-Sectional Datasets Source: National Addiction & HIV Data Archive Program |
NASS Database | Nationwide Ambulatory Surgery Sample (NASS) Source: Agency for Healthcare Research and Quality |
National Ambulatory Healthcare Survey (NAMCS/NHAMCS) | National Ambulatory Medical Care Survey (NAMCS) |
National Cancer Institute (NCI) SEER Datasets | The Surveillance, Epidemiology, and End Results (SEER) program provides cancer statistics, including data on cancer incidence, survival, and mortality in the United States. |
National Center for Health Statistics (NCHS) | NCHS, part of the CDC, offers various datasets on health and vital statistics in the United States. |
National Center for Health Statistics (NCHS) | National Center for Health Statistics (NCHS) Source: Centers for Disease Control & Prevention National Center for Health Statistics |
National Electronic Health Records Survey (NEHRS) | National Electronic Health Records Survey (NEHRS) Source: Centers for Disease Control & Prevention |
National Health Care Survey Registry | National Health Care Surveys Source: Centers for Disease Control & Prevention |
National Hospital Care Survey (NHCS) | National Hospital Care Survey (NHCS) |
National Institutes of Health (NIH) Data Sharing Repositories | NIH offers a variety of datasets related to medical research, clinical trials, genomics, and more through its data sharing repositories, such as the National Library of Medicine's National Center for Biotechnology Information (NCBI) and the National Institute of Mental Health (NIMH) Data Archive. |
National Post-acute and Long-term Care Study Homepage (NPALS) | National Post-acute and Long-term Care Study (NPALS) Source: Centers for Disease Control & Prevention |
National Survey of Family Growth (NSFG) | National Survey of Family Growth (NSFG) Source: Centers for Disease Control & Prevention |
National Survey on Drug Use and Health (SAMHSA.gov) | National Survey on Drug Use and Health (NSDUH) Source: Substance Abuse and Mental Health Services Administration |
National Health Interview Survey (NHIS) | National Health Interview Survey (NHIS) Source: Centers for Disease Control & Prevention |
NEDS Database | Nationwide Emergency Department Sample (NEDS) Source: Agency for Healthcare Research and Quality |
NHANES Questionnaires and Datasets (NHANES) | National Health and Nutrition Examination Survey (NHANES) Source: Centers for Disease Control & Prevention |
NRD Database | Nationwide Readmissions Database (NRD) Source: Agency for Healthcare Research and Quality |
SASD Database | State Ambulatory Surgery and Services Databases (SASD) Source: Agency for Healthcare Research and Quality |
SEDD Database | State Emergency Department Databases (SEDD) Source: Agency for Healthcare Research and Quality |
SID Database | State Inpatient Databases (SID) Source: Agency for Healthcare Research and Quality |
TCGA Program | The Cancer Genome Atlas (TCGA) Source: National Cancer Institute |
Vital Statistics Online | Vital Statistics (Birth and Death datasets) Source: Centers for Disease Control & Prevention National Center for Health Statistics |
World Health Organization (WHO) Data | The WHO provides access to a wealth of global health data, including disease statistics, health systems performance, and demographic information. |
Youth Risk Behavior Surveillance System (YRBSS) | Youth Risk Behavior Surveillance System (YRBSS) Source: Centers for Disease Control & Prevention |
Do NOT require use agreements
Although these data sets may not have strict use requirements, you should always check the specific terms of use and licensing associated with each dataset you intend to use to ensure compliance.
Data | Description |
---|---|
Data.gov | The U.S. government's open data portal, Data.gov, offers a variety of health-related datasets. Many of these datasets are available for public use without strict access requirements. |
Gapminder | Gapminder offers a wide range of publicly accessible data related to global health and development. Their datasets cover various health indicators and socio-economic factors. |
Global Burden of Disease Study (GBD) | The Institute for Health Metrics and Evaluation (IHME) offers publicly accessible data related to the global burden of diseases, injuries, and risk factors. |
Google Trends Data | Google Trends provides access to search query data related to health topics. While it doesn't have strict access requirements, it's essential to review Google's terms of service for data usage guidelines. |
Open Data on Kaggle | Kaggle hosts various datasets, including health-related data. Many of these datasets are open for public use, but you should check individual dataset terms for any restrictions. |
UNICEF Data | UNICEF provides data related to child health, nutrition, and well-being. Their datasets are often accessible to the public. |
US Census Bureau Data | The US Census Bureau provides data related to population demographics, including health insurance coverage, disability status, and more. Most of their data is publicly accessible. |
World Bank Data | The World Bank offers various health-related datasets, including those related to healthcare access, health financing, and health expenditure. These datasets are generally publicly accessible without specific restrictions. |
Data Sets for Machine Learning
Data | Description |
---|---|
Machine Learning Repository at UC Irvine | This repository has over 600 data sets with a variety of sample sizes and variables. Topics for these datasets are varied and include agriculture, biology, computer science, medicine, and public health. Users can contribute datasets, so this repository will grow over time. |
Data Sets in R packages
Data | Description |
---|---|
Airline Delays | The United States Bureau of Transportation Statistics has collected data on more than 169 million domestic flights dating back to October 1987. These data were used for the 2009 ASA Data Expo (H. Wickham 2011). A subset is available in the MySQL database provided through the mdsr package. The nycflights13 package contains a proper subset of these data (flights leaving the three most prominent New York City airports in 2013). |
Baby Names | The babynames package for R provides data about the popularity of individual baby names from the United States Social Security Administration (Hadley Wickham 2019). These data can be used, for example, to track the popularity of certain names over time. |
Baseball | The Lahman database (in the Lahman package) is maintained by Sean Lahman, a database journalist. Compiled by a team of volunteers, it contains complete seasonal records going back to 1871 and is usually updated yearly. It is available for download both as a pre-packaged SQL file and as an R package (Friendly et al. 2023). |
Federal Election Commission | The fec16 package (Benjamin S. Baumer and Gjekmarkaj 2017) provides access to campaign spending data for recent federal elections maintained by the Federal Election Commission. These data include contributions by individuals to committees, spending by those committees on behalf of, or against, individual candidates for president, the Senate, and the House of Representatives, as well as information about those committees and candidates. The fec12 and fec16 packages provide that information for single election cycles in a simplified form (Tapal, Gahwagy, and Ryan 2023). |
MacLeish | The Ada and Archibald MacLeish field station is a 260-acre plot of land owned and operated by Smith College. It is used by faculty, students, and members of the local community for environmental research, outdoor activities, and recreation. The macleish package allows you to download and process weather data as a time series from the MacLeish Field Station using the etl framework (Benjamin S. Baumer et al. 2022). It also contains shapefiles for contextualizing spatial information. |
Restaurant Violations | The mdsr package contains data on restaurant health inspections made by the New York City Health Department. |
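
The Baby Names entry above notes that the data can be used to track the popularity of a name over time. The following is a minimal sketch of that idea in Python, assuming the babynames table has been exported to CSV with its usual columns (year, sex, name, n, prop). The sample rows, the placeholder names "NameA" and "NameB", and the helper `popularity_by_year` are hypothetical and are used only for illustration; they are not real Social Security Administration figures.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the babynames column layout (year, sex, name, n, prop).
# These are NOT real Social Security Administration figures.
SAMPLE_CSV = """year,sex,name,n,prop
1990,F,NameA,120,0.0006
1990,M,NameA,30,0.00015
2000,F,NameA,200,0.0010
2000,F,NameB,50,0.00025
2010,F,NameA,80,0.0004
"""


def popularity_by_year(rows, name):
    """Sum the count column `n` per year for one name, across both sexes."""
    totals = defaultdict(int)
    for row in rows:
        if row["name"] == name:
            totals[int(row["year"])] += int(row["n"])
    # Return a plain dict ordered by year so the trend reads chronologically.
    return dict(sorted(totals.items()))


# Parse the sample text as if it were a CSV file exported from the package.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
trend = popularity_by_year(rows, "NameA")
for year, count in trend.items():
    print(year, count)
```

With a real export, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an open file handle; the rest of the sketch would stay the same.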
Data Sets in Python Modules
Data | Description |
---|---|
Movies | The Internet Movie Database (IMDb) is a massive repository of information about movies (IMDB.com 2013). The easiest way to get the IMDb data into SQL is by using the open-source IMDbPY Python package (Alberani 2014). |