Publicly Available Data Sets

The DCR maintains a list of sites that house publicly available data sets. Some of these data sets are governed by data use agreements, while others are free to use without such agreements. Proper ethical considerations of data use are expected to be followed regardless of whether a data use agreement is required.

Data Sets in Public Health and Medicine

May require use agreements. These data sets are publicly available but may be subject to specific terms of use and data access requirements. For example, you may need to submit a data request form to obtain access to the data.

Data | Description
Adolescent Behaviors and Experiences Survey (ABES) | Adolescent Behaviors and Experiences Survey (ABES). Source: Centers for Disease Control & Prevention
Behavioral Risk Factor Surveillance System (BRFSS) | BRFSS collects data on health risk behaviors, chronic health conditions, and use of preventive services in the United States.
CCG Data | Center for Cancer Genomics (CCG) data resources. Source: National Cancer Institute
CDC WONDER | The Centers for Disease Control and Prevention (CDC) provides access to the Wide-ranging Online Data for Epidemiologic Research (WONDER) system, which offers a wide range of public health data, including mortality, morbidity, and population data.
CGCI Data | Cancer Genome Characterization Initiative (CGCI). Source: National Cancer Institute
COVID-19 Datasets | Several datasets related to the COVID-19 pandemic are publicly available, including data on cases, testing, and vaccine distribution.
CTD2 Data | Cancer Target Discovery and Development (CTD²) Network / C-T-D-Squared. Source: National Cancer Institute
HCMI Data | Human Cancer Models Initiative (HCMI). Source: National Cancer Institute
HCUP-US Databases | Healthcare Cost and Utilization Project (HCUP). Source: Agency for Healthcare Research and Quality
HCUP-US NIS Overview | Healthcare Cost and Utilization Project (HCUP) National (Nationwide) Inpatient Sample (NIS). Source: Agency for Healthcare Research and Quality
Health Information National Trends Survey (HINTS) | Health Information National Trends Survey (HINTS). Source: National Cancer Institute
KID Database | Kids’ Inpatient Database (KID). Source: Agency for Healthcare Research and Quality
Medical Image Datasets | Various medical image datasets are available for tasks like image analysis and machine learning. Examples include the National Library of Medicine's Chest X-ray and The Cancer Imaging Archive (TCIA).
Medicare Data | The Centers for Medicare & Medicaid Services (CMS, https://data.cms.gov/) offers datasets related to healthcare utilization, provider performance, and payment information through the Medicare program.
Monitoring the Future (MTF) | Monitoring the Future (MTF) Public-Use Cross-Sectional Datasets. Source: National Addiction & HIV Data Archive Program
NASS Database | Nationwide Ambulatory Surgery Sample (NASS). Source: Agency for Healthcare Research and Quality
National Ambulatory Healthcare Survey (NAMCS/NHAMCS) | National Ambulatory Medical Care Survey (NAMCS)
National Cancer Institute (NCI) SEER Datasets | The Surveillance, Epidemiology, and End Results (SEER) program provides cancer statistics, including data on cancer incidence, survival, and mortality in the United States.
National Center for Health Statistics (NCHS) | NCHS, part of the CDC, offers various datasets on health and vital statistics in the United States.
National Center for Health Statistics (NCHS) | National Center for Health Statistics (NCHS). Source: Centers for Disease Control & Prevention, National Center for Health Statistics
National Electronic Health Records Survey (NEHRS) | National Electronic Health Records Survey (NEHRS). Source: Centers for Disease Control & Prevention
National Health Care Survey Registry | National Health Care Surveys. Source: Centers for Disease Control & Prevention
National Hospital Care Survey (NHCS) | National Hospital Care Survey (NHCS)
National Institutes of Health (NIH) Data Sharing Repositories | NIH offers a variety of datasets related to medical research, clinical trials, genomics, and more through its data sharing repositories, such as the National Library of Medicine's National Center for Biotechnology Information (NCBI) and the National Institute of Mental Health (NIMH) Data Archive.
National Post-acute and Long-term Care Study Homepage (NPALS) | National Post-acute and Long-term Care Study (NPALS). Source: Centers for Disease Control & Prevention
National Survey of Family Growth (NSFG) | National Survey of Family Growth (NSFG). Source: Centers for Disease Control & Prevention
National Survey on Drug Use and Health (SAMHSA.gov) | National Survey on Drug Use and Health (NSDUH). Source: Substance Abuse and Mental Health Services Administration
National Health Interview Survey (NHIS) | National Health Interview Survey (NHIS). Source: Centers for Disease Control & Prevention
NEDS Database | Nationwide Emergency Department Sample (NEDS). Source: Agency for Healthcare Research and Quality
NHANES Questionnaires and Datasets (NHANES) | National Health and Nutrition Examination Survey (NHANES). Source: Centers for Disease Control & Prevention
NRD Database | Nationwide Readmissions Database (NRD). Source: Agency for Healthcare Research and Quality
SASD Database | State Ambulatory Surgery and Services Databases (SASD). Source: Agency for Healthcare Research and Quality
SEDD Database | State Emergency Department Databases (SEDD). Source: Agency for Healthcare Research and Quality
SID Database | State Inpatient Databases (SID). Source: Agency for Healthcare Research and Quality
TCGA Program | The Cancer Genome Atlas (TCGA). Source: National Cancer Institute
Vital Statistics Online | Vital Statistics (Birth and Death datasets). Source: Centers for Disease Control & Prevention, National Center for Health Statistics
World Health Organization (WHO) Data | The WHO provides access to a wealth of global health data, including disease statistics, health systems performance, and demographic information.
Youth Risk Behavior Surveillance System (YRBSS) | Youth Risk Behavior Surveillance System (YRBSS). Source: Centers for Disease Control & Prevention

Do NOT require use agreements. Although these data sets may not have strict use requirements, you should always check the specific terms of use and licensing associated with each dataset you intend to use to ensure compliance.

Data | Description
Data.gov | The U.S. government's open data portal, Data.gov, offers a variety of health-related datasets. Many of these datasets are available for public use without strict access requirements.
Gapminder | Gapminder offers a wide range of publicly accessible data related to global health and development. Their datasets cover various health indicators and socio-economic factors.
Global Burden of Disease Study (GBD) | The Institute for Health Metrics and Evaluation (IHME) offers publicly accessible data related to the global burden of diseases, injuries, and risk factors.
Google Trends Data | Google Trends provides access to search query data related to health topics. While it doesn't have strict access requirements, it's essential to review Google's terms of service for data usage guidelines.
Open Data on Kaggle | Kaggle hosts various datasets, including health-related data. Many of these datasets are open for public use, but you should check individual dataset terms for any restrictions.
UNICEF Data | UNICEF provides data related to child health, nutrition, and well-being. Their datasets are often accessible to the public.
US Census Bureau Data | The US Census Bureau provides data related to population demographics, including health insurance coverage, disability status, and more. Most of their data is publicly accessible.
World Bank Data | The World Bank offers various health-related datasets, including those related to healthcare access, health financing, and health expenditure. These datasets are generally publicly accessible without specific restrictions.

Data Sets for Machine Learning

Data | Description
Machine Learning Repository at UC Irvine | This repository has over 600 data sets with a variety of sample sizes and variables. Topics for these datasets are varied and include agriculture, biology, computer science, medicine, and public health. Users can contribute datasets, so this repository will grow over time.

Data Sets in R packages

Data | Description
Airline Delays | The United States Bureau of Transportation Statistics has collected data on more than 169 million domestic flights dating back to October 1987. These data were used for the 2009 ASA Data Expo (H. Wickham 2011); a subset is available in the MySQL database provided through the mdsr package. The nycflights13 package contains a proper subset of these data (flights leaving the three most prominent New York City airports in 2013).
Baby Names | The babynames package for R provides data about the popularity of individual baby names from the United States Social Security Administration (Hadley Wickham 2019). These data can be used, for example, to track the popularity of certain names over time.
Baseball | The Lahman database (in the Lahman package) is maintained by Sean Lahman, a database journalist. Compiled by a team of volunteers, it contains complete seasonal records going back to 1871 and is usually updated yearly. It is available for download both as a pre-packaged SQL file and as an R package (Friendly et al. 2023).
Federal Election Commission | The fec16 package (Benjamin S. Baumer and Gjekmarkaj 2017) provides access to campaign spending data for recent federal elections maintained by the Federal Election Commission. These data include contributions by individuals to committees; spending by those committees on behalf of, or against, individual candidates for president, the Senate, and the House of Representatives; and information about those committees and candidates. The fec12 and fec16 packages provide that information for single election cycles in a simplified form (Tapal, Gahwagy, and Ryan 2023).
MacLeish | The Ada and Archibald MacLeish field station is a 260-acre plot of land owned and operated by Smith College. It is used by faculty, students, and members of the local community for environmental research, outdoor activities, and recreation. The macleish package allows you to download and process weather data as a time series from the MacLeish Field Station using the etl framework (Benjamin S. Baumer et al. 2022). It also contains shapefiles for contextualizing spatial information.
Restaurant Violations | The mdsr package contains data on restaurant health inspections made by the New York City Health Department.
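
As a rough illustration of the Baby Names use case above (tracking the popularity of a name over time), the following Python sketch tallies yearly counts for one name. It assumes the data have been exported from R into rows with the columns year, sex, name, and n, which is a subset of the babynames layout. The rows shown are invented placeholders, not real Social Security Administration figures, and the function names yearly_counts and peak_year are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict

# Toy rows in a (year, sex, name, n) layout.
# ASSUMPTION: these counts are invented placeholders, NOT real SSA data.
ROWS = [
    (2000, "F", "Ada", 120),
    (2000, "M", "Ada", 5),
    (2000, "F", "Grace", 300),
    (2010, "F", "Ada", 450),
    (2010, "F", "Grace", 280),
    (2020, "F", "Ada", 900),
]


def yearly_counts(rows, name):
    """Sum the count column per year for one name, across both sexes."""
    totals = defaultdict(int)
    for year, _sex, row_name, n in rows:
        if row_name == name:
            totals[year] += n
    # Return a plain dict ordered by year for easy printing or plotting.
    return dict(sorted(totals.items()))


def peak_year(rows, name):
    """Return the year with the highest total for a name, or None if absent."""
    counts = yearly_counts(rows, name)
    if not counts:
        return None
    return max(counts, key=counts.get)


if __name__ == "__main__":
    for year, n in yearly_counts(ROWS, "Ada").items():
        print(year, n)
    print("peak year:", peak_year(ROWS, "Ada"))
```

With real data, the same functions could be applied to rows read from a CSV export of the package's data frame; the placeholder rows only serve to keep the sketch self-contained.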

Data Sets in Python Modules

Data | Description
Movies | The Internet Movie Database (IMDb) is a massive repository of information about movies (IMDB.com 2013). The easiest way to get the IMDb data into SQL is by using the open-source IMDbPY Python package (Alberani 2014).