Data and Analytics Support

Health Services Research Data Center (HSRDC)

Established in 2005, Penn LDI’s Health Services Research Data Center (HSRDC) provides data services to LDI-affiliated investigators who use highly sensitive patient information in their research. The HSRDC comprises secure high-performance servers within the University of Pennsylvania’s Perelman School of Medicine, with the security protections needed for LDI-affiliated investigators and research staff to store and analyze data containing Protected Health Information. Please send any questions or comments about the HSRDC to HSRDC@pennmedicine.upenn.edu.

Access to HSRDC

Access to the HSRDC is available to Penn LDI-affiliated investigators. University of Pennsylvania graduate students, clinical fellows, and post-docs who are not affiliated with LDI will be considered for HSRDC server privileges if their mentor or collaborator on the research project is an LDI-affiliated investigator. LDI Associate Fellows are encouraged to first discuss their proposals with their faculty mentors.

In addition, faculty affiliated with Penn’s Center for Clinical Epidemiology and Biostatistics (CCEB) can access the HSRDC to use the Optum datasets.

Please contact Jibby Kurichi for further information, including costs associated with the use of the HSRDC and access for non-Penn collaborators.

Data and Documentation Requirements

Data Storage on HSRDC

HSRDC is maintained at a high security level in accordance with federal regulations governing secure computer systems (e.g., the Federal Information Security Management Act, or FISMA). Data storage on HSRDC is limited to research using data that require high security, such as data containing individually identifiable protected health information. Lower-security datasets, including anonymous patient surveys, de-identified data, and publicly available data, should be stored and analyzed on other University resources, such as Penn+Box.

The HSRDC server cluster is designed for the analysis of “complete” databases that are uploaded by our IT administrators. Since users are not allowed to upload their own data, the HSRDC is not an appropriate storage infrastructure for data from clinical trials, surveys, or other data sources that require more frequent updates.

Documentation Requirements

Data that includes protected health information obtained outside of the University of Pennsylvania requires a Data Use Agreement (DUA) specifically permitting storage of the data on the HSRDC. Please provide HSRDC staff an executed DUA prior to requesting a data upload and provide current DUA documentation to HSRDC staff annually.

Technical Details

HSRDC servers run Red Hat Linux operating systems, and data can be stored and analyzed using SAS, Stata, and R. Please note that Windows-based programs are not available in the HSRDC environment. Users wishing to use other software are responsible for the licensing costs, and HSRDC staff will evaluate installation of such specialized software on a case-by-case basis.

Cost

The HSRDC is a service center within the Perelman School of Medicine that provides services for a fee. The total operating costs of the HSRDC, including support for IT, administrative personnel, software licenses, hardware maintenance and depreciation, storage space, CPU time, and database management, exceed $150,000 per year, and Penn LDI receives no core funds from the University or its schools to support this resource.

HSRDC staff can prepare a formal cost estimate for investigators submitting grant applications to be included with the budget justification. Cost estimate requests should be sent to Jibby Kurichi. Please allow five business days for a response.

For funded projects, PIs and/or their staff should contact Jibby Kurichi in the earliest stages of the project to plan the timing, scope, and logistics of HSRDC resource use. Invoices are generally sent to PIs in May for the use of the HSRDC during the current fiscal year (July–June) unless prior arrangements have been made. If payment is not received, access to the HSRDC will be disabled, and project folders will be archived and then deleted.

Available Data Resources

LDI houses a variety of data sources on HSRDC that can be shared under certain circumstances and at varying costs. LDI also facilitates access to data on vendor platforms via academic subscriptions and partnerships. A table summarizing and comparing some of the currently available LDI data sources is linked below:

UPenn Users (SharePoint)
UPHS Users (PennBox)

Descriptions of these data resources are also below, with links to more detailed information.

Data Stored on HSRDC

Centers for Medicare & Medicaid Services (CMS)

LDI houses a variety of Centers for Medicare & Medicaid Services (CMS) data from 1998-2023 on HSRDC. To reuse these data, investigators are required to obtain a data use agreement (DUA) from CMS.

Access

Applications to reuse these data may be submitted directly to the Research Data Assistance Center (ResDAC) by the investigator. Reuse of CMS data under the investigator’s new DUA will be subject to fees paid directly to CMS. Use of CMS data on HSRDC also requires a cost estimate for HSRDC use, which will be invoiced annually. Once an executed DUA and a cost estimate have been obtained and provided to HSRDC staff, access to the HSRDC and the data can be granted. Current DUA documentation must be sent to HSRDC staff annually.

Cost

Please contact Jibby Kurichi for information on the cost of HSRDC use and to find out what CMS data is currently available for reuse on the HSRDC. For more information on CMS data, including data dictionaries and details of the DUA request process, please visit the ResDAC website.

Merative MarketScan Research Databases

MarketScan Research Databases covering the most recent 10-15 years are housed on HSRDC. These are a family of statistically de-identified research datasets that fully integrate patient-level healthcare claims (medical, drug, and dental) and health plan enrollment data, productivity data (workplace absence, short- and long-term disability, and workers’ compensation), laboratory results, and health risk assessments (HRAs) into datasets available for healthcare research. Data are contributed by large employers, health plans, and hospitals in the United States. The oldest year of data for each database is rolled off as new years are released.

MarketScan Commercial Database

The Commercial Database for the most recent 15 years is stored on HSRDC. This claims database includes enrollment and demographic information, inpatient and outpatient medical claims data, and outpatient pharmacy claims data collected from large employers and a variety of fee-for-service and managed care health plans in the United States. The Commercial Database provides longitudinal views across the under-65 working population and their dependents.

MarketScan Medicare Database

The Medicare Database for the most recent 15 years is stored on HSRDC. This database integrates enrollment information, demographic information, and inpatient medical, outpatient medical, and outpatient pharmacy claims data collected from large employers and health plans in the United States. The Medicare Database consists of both the Medicare-paid and supplemental-paid components of reimbursed administrative claims. Starting in 2019, the database contains information for Medicare Advantage patients for those former employers who have chosen this option for their former employees.

MarketScan Multi-State Medicaid Database

The Medicaid Database for the most recent 15 years is stored on HSRDC. The Medicaid Database reflects the healthcare service use of individuals covered by Medicaid programs in multiple geographically dispersed states. It includes enrollment and demographic information, inpatient and outpatient medical claims data, and outpatient pharmacy claims data, as well as information on long-term care, collected from Medicaid enrollees covered under fee-for-service and managed care plans. In addition to standard demographic variables such as age and gender, the Medicaid Database includes federal aid category.

MarketScan Linked Specialty Databases

The following specialty databases are stored on HSRDC, and cover the most recent 8-15 years: Lab Database, Commercial and Medicare Mortality Edition databases, Health and Productivity Management Database, Benefit Plan Design Database, Health Risk Assessment Database, Dental Database, and National Weights.

Merative’s website provides more information on each of the MarketScan research databases. For information on accessing the MarketScan data on HSRDC, contact Jibby Kurichi.

Healthcare Cost and Utilization Project (HCUP)

National (Nationwide) Inpatient Sample (NIS)

The NIS from 1988-2020 is stored on the HSRDC. NIS is the largest publicly available all-payer inpatient care database in the United States, containing data on more than seven million hospital stays. Its large sample size is ideal for developing national and regional estimates and enables analyses of rare conditions, uncommon treatments, and special populations. For a description of the data elements in NIS, visit HCUP’s website.

Nationwide Emergency Department Sample (NEDS)

NEDS data from 2007-2020 is stored on HSRDC. NEDS produces national estimates about emergency department (ED) visits across the country. The NEDS describes ED visits, regardless of whether they result in admission. Its large sample size allows for analysis across hospital types and the study of relatively uncommon disorders and procedures. HCUP’s website provides a description of the data elements in NEDS.

Nationwide Readmission Database (NRD)

NRD data from 2011-2017 is stored on the HSRDC. NRD is a unique and powerful database designed to support various types of analyses of national readmission rates for all patients regardless of the expected payer for the hospital stay. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. This database addresses a large gap in health care data – the lack of nationally representative information on hospital readmissions for all ages. Visit HCUP’s website for a description of the data elements in the NRD.

Kids’ Inpatient Database (KID)

KID is the largest publicly available all-payer pediatric inpatient care database in the U.S., containing data from two to three million hospital stays. Its large sample size is ideal for calculating national and regional estimates and enables analyses of rare conditions and uncommon treatments. Data from 2003, 2006, 2009, 2012, and 2016 are stored on the HSRDC. HCUP’s website provides a description of the data elements in KID.

Contact Jibby Kurichi for additional information on HCUP data.

Data Available Through External Research Partnerships

Elevance Health (formerly Anthem) Research Opportunity

Penn LDI and the Wharton School Health Care Management Department are partnering with Elevance Health (formerly Anthem), one of the largest health insurance entities in the United States, in a research collaboration aimed at advancing knowledge about how to improve the nation’s health care. This collaboration provides access to Elevance’s claims data and seeks to answer questions focused on using provider payment and insurance design to improve health care access, cost, and quality.

Research projects using existing data and focusing on the following areas will be considered:

Spending, financial incentives, and health care financing
Health care markets, including the effects on access, cost, and quality
Medicare Advantage
Quality and value of health care
Mental health

There are a few ways to collaborate with Elevance Health. LDI Senior Fellows, or Associate Fellows working in collaboration with a Senior Fellow mentor, who are interested in learning more should contact Caleb Hearn.

Health Care Data on Wharton Research Data Services (WRDS)

WRDS provides clients with a broad collection of health care data, analytics, and the most robust computing infrastructure available, making it the global gold standard for integrated research systems. Corporate, academic, and government clients turn to WRDS for seamless data storage, management, and access, all backed by the credibility and leadership of The Wharton School. WRDS has assembled, with the expertise from LDI, a collection of health care data that are available to Penn investigators at no charge. Penn staff and faculty can sign up for an account here. Contact Matthew Cohen for additional information.

Federal Statistical Research Data Center (FSRDC)

The University of Pennsylvania is a founding member of a consortium that established, in partnership with the Census Bureau, an FSRDC. The FSRDC provides approved researchers with a secure environment to access restricted-use microdata from the Census Bureau, Agency for Healthcare Research and Quality, National Center for Health Statistics, and Bureau of Labor Statistics. These data represent a vital resource for researchers in economics, business, demography, sociology, medicine, statistics, criminology, and many other disciplines. Data can only be accessed on-site at the Federal Reserve Bank of Philadelphia, and there is an application process to obtain access.

The Philadelphia FSRDC’s website lists the available data and examples of research. For additional information, contact Joseph Ballegeer.

Data Available on Vendor Platforms

Truveta

The Truveta database includes research-ready complete EHR data on over 100 million people, from 30 of the largest health systems in the U.S., linked with claims, social determinants of health (SDOH), and mortality data. Records come from all 50 states, date back to 2016, and are updated daily. Truveta EHR data also includes inpatient/outpatient care settings, pharmacy, labs, procedures, medications, implanted/explanted devices, and longitudinal data on over 1 million mother-child pairs. The Truveta data dictionary provides the fullest overview of the data contained in the Truveta database.

*Please use your email address ending in upenn.edu to access the Truveta data dictionary and Truveta Learn page*

Truveta data is stored and analyzed in the Truveta cloud environment, Truveta Studio.

Access

LDI has established an Academic Subscription for the University of Pennsylvania to allow access to Truveta Studio for approved Penn researchers. Researchers wishing to use Truveta Data for funded research should contact grants@truveta.com for cost estimates and grant proposal resources. For all other questions, contact customersupport@truveta.com.

In order to access Truveta Studio, researchers must provide LDI Administrative Director of Research, Jibby Kurichi (jkurichi@pennmedicine.upenn.edu), with their Truveta Cost Estimate, proof of funds, and University of Pennsylvania IRB approval. Researchers will then review and sign the User Agreement, and LDI will provide approval for SSO access to the Truveta platform and data.

Costs

Researchers pay compute and storage fees for cloud analytics usage and can expect to be invoiced monthly. Truveta charges additional fees for externally funded research projects. Please contact grants@truveta.com for budgeting related to grant proposals.

FAQs

Questions? Read the frequently asked questions about Truveta to learn more.

All of Us Research Hub (NIH)

The All of Us Research Program is an NIH-developed biomedical data resource, with a secure data repository, research tools, and projects. Registered researchers can create research projects using the Researcher Workbench and access collaborative workspaces, cohort-building tools, and interactive analysis tools such as Jupyter Notebook, RStudio, and SAS Studio. The All of Us dataset includes surveys (633k+ participants), EHR and demographic data (393k+ participants), physical measurements (509k+ participants), and wearables like Fitbit devices (59k+), plus genomic data (447k+ participants).

To access All of Us, researchers must register, create an account, and complete the required trainings. The University of Pennsylvania has a Data Use and Registration Agreement (DURA) covering all data access tiers for University of Pennsylvania researchers. All of Us runs on the Google Cloud Platform (GCP) and requires a self-managed GCP billing account. There is no cost to register for a Researcher Workbench. Researchers pay for environment compute resources, persistent disks, and running applications in the workspace. More information is available in the Penn Medicine All of Us Tutorial and the NIH All of Us Research Hub website.

Epic Cosmos

Epic Cosmos is a tool made up of structured patient data from participating Epic health care systems. Penn Medicine researchers can access fully de-identified datasets in the Cosmos environment, which includes de-identified EMR data from 296M patients, 15.3B encounters, 39.9K clinics, and 1,704 hospitals. Cosmos has a representative sample of patients across race/sex/rural-urban location/types of insurance, and employs real-time data refreshes. Cosmos Slicer Dicer (a limited dataset that can be used to conduct basic statistical research) is available at no cost. Row-level data and the use of built-in analytical tools such as R and Python require additional training and certification through Epic.

To access Cosmos, researchers need to be PennChart account holders, have an active Epic Userweb account, and submit a Cosmos Access Form. Penn Medicine researchers without PennChart accounts can request access via a sponsored PennChart account. Please visit the Penn Medicine Clinical Research page on Cosmos for access information, links to forms, and required trainings. More detailed information on Cosmos data is available on the Epic Cosmos website.

TriNetX

TriNetX is a self-service research cohort exploration and data analytics tool that includes a HIPAA-limited dataset of patient data from Penn Medicine, plus fully de-identified datasets from 170+ healthcare organizations via the TriNetX Research Network. TriNetX data includes data from EHRs (diagnoses, procedures, medications, labs, and vitals), which can be enriched with additional data such as claims and mortality. Queries can be limited to Penn Medicine or run across the entire TriNetX Research Network, and de-identified datasets can be made available for download with appropriate IRB approval and agreements.

TriNetX access is available to Penn Medicine employees with a PMACS AD account. Individuals who would like to access TriNetX must complete the required trainings and request a TriNetX account by submitting the TriNetX Access Form through an IS Self-Service Ticket.

For more information on TriNetX data, access, and trainings, please visit the Penn Medicine Clinical Research TriNetX page, or contact the Office of Clinical Research.

Health Economics Data Analyst Pool (HEDAP)

The Health Economics Data Analyst Pool (HEDAP) is a Penn service center supported and managed by Penn LDI to provide LDI-affiliated investigators access to high-quality, skilled data analysts. HEDAP administrative staff recruits, trains, and manages a group of Master’s-level and PhD-level statistical analysts. These analysts work with LDI-affiliated investigators across funded projects using statistical software packages such as SAS, Stata, and R to manipulate and analyze health care data under the guidance of the investigators and other collaborators.

HEDAP administration supports the professional development of the analysts and provides state-of-the-art computer equipment and programming software required to conduct health services research.

For more information about HEDAP or to view the analyst request queue, please email Jibby Kurichi.

HEDAP Co-Directors

Norma Coe, PhD

Professor, Medical Ethics and Health Policy, Perelman School of Medicine

Harsha Thirumurthy, PhD

Professor, Medical Ethics and Health Policy, Perelman School of Medicine

Statistical Analysts

Kano Amagai, MPH

Lizzie Bair, MS

Xinwei Chen, MS

Zhi Geng, MPH

Seiyoun Kim, PhD

Sue Kim, MS

Junning Liang, MS, MS

Yang Li, MS

Eliza Macneal, MS

Angira Mondal, MS

Selina Pan, MSed

Sae-Hwan Park, PhD

Charles Rareshide, MS

Dominic Ruggiero, MPH

Kaitlyn Shultz, MS

Jinming Tao, MS

Jingyi Wu, MS

Ruiying (Aria) Xiong, MS

Lin Xu, MS

Ruiqi Yan, MS

Lin Yang, MS

Yueming Zhao, MS

Yu Zhao, MS

Song Zhong, MSSPDA

Apply to an Open Position

If you are a statistical analyst looking for a new and exciting place to work supporting health policy-related research projects conducted by investigators at Penn LDI, please email your resume to Abby Kearns.

We are looking for analysts who have programming skills using SAS, Stata, or R to create analytical data sets from clinical trials, surveys, and health care claims data, construct and standardize outcome measures and other analytical variables, provide descriptive and analytical reports, and perform specialized statistical analyses.

Qualifications include a minimum of a Bachelor’s degree in Mathematics/Statistics, Health Care Management, Economics, or Public Health and three (3) years of related experience, or an equivalent combination of education and experience.
