Health Services Research Data Center (HSRDC)
Established in 2005, Penn LDI’s Health Services Research Data Center (HSRDC) provides data services to LDI-affiliated investigators who use highly sensitive patient information in their research. The HSRDC consists of secure high-performance servers within the University of Pennsylvania’s Perelman School of Medicine. These servers have the security protections needed for LDI-affiliated investigators and research staff to store and analyze data containing Protected Health Information. Please send any questions or comments about the HSRDC to HSRDC@pennmedicine.upenn.edu.
Access to HSRDC
Access to the HSRDC is available to Penn LDI-affiliated investigators. University of Pennsylvania graduate students, clinical fellows, and post-docs who are not affiliated with LDI will be considered for HSRDC server privileges if their mentor or collaborator on the research project is an LDI-affiliated investigator. LDI Associate Fellows are encouraged to first discuss their proposals with their faculty mentors.
In addition, faculty affiliated with Penn’s Center for Clinical Epidemiology and Biostatistics (CCEB) can access the HSRDC to use the Optum datasets.
Please contact Jibby Kurichi for further information, including costs associated with the use of the HSRDC and access for non-Penn collaborators.
Data and Documentation Requirements
Data Storage on HSRDC
HSRDC is maintained at a high security level in accordance with federal regulations governing secure computer systems (e.g., the Federal Information Security Management Act, FISMA). Data storage on HSRDC is limited to research using data that require high security, such as data containing individually identifiable protected health information. Lower-security datasets, such as anonymous patient surveys, de-identified data, and publicly available data, should be stored and analyzed on other University resources, such as Penn+Box.
The HSRDC server cluster is designed for the analysis of “complete” databases that are uploaded by our IT administrators. Because users are not allowed to upload their own data, the HSRDC is not an appropriate storage infrastructure for data from clinical trials, surveys, or other sources that require frequent updates.
Documentation Requirements
Data that includes protected health information obtained outside of the University of Pennsylvania requires a Data Use Agreement (DUA) specifically permitting storage of the data on the HSRDC. Please provide HSRDC staff with an executed DUA before requesting a data upload, and provide current DUA documentation to HSRDC staff annually.
Technical Details
HSRDC servers run Red Hat Linux operating systems, and data can be stored and analyzed using SAS, Stata, and R. Please note that “Windows-based” programs are not available in the HSRDC environment. Users wishing to use other software are responsible for the licensing costs, and HSRDC staff will evaluate requests to install such software on a case-by-case basis.
Cost
The HSRDC is a service center within the Perelman School of Medicine that provides services for a fee. The total operating costs of the HSRDC, including support for IT, administrative personnel, software licenses, hardware maintenance and depreciation, storage space, CPU time, and database management, exceed $150,000 per year, and Penn LDI receives no core funds from the University or its schools to support this resource.
For investigators submitting grant applications, HSRDC staff can prepare a formal cost estimate to include with the budget justification. Cost estimate requests should be sent to Jibby Kurichi. Please allow five business days for a response.
For funded projects, PIs and/or their staff should contact Jibby Kurichi in the earliest stages of the project to plan the timing, scope, and logistics of HSRDC resource use. Invoices are generally sent to PIs in May for the use of the HSRDC during the current fiscal year (July–June) unless prior arrangements have been made. If payment is not received, access to the HSRDC will be disabled, and project folders will be archived and then deleted.
Available Data Resources
HSRDC houses a variety of data resources that can be shared with a range of restrictions and at varying costs. For additional information on data accessible through HSRDC, please refer to each individual dataset.
Centers for Medicare & Medicaid Services (CMS)
The HSRDC houses a variety of Centers for Medicare & Medicaid Services (CMS) data from 1998–2020. These data require individual data use agreements (DUAs).
CMS data stored on the HSRDC are available for reuse. LDI-affiliated investigators may apply to reuse CMS data stored on the HSRDC under their own DUA; investigators may submit applications directly to the CMS Data Request Center. Reuse of CMS data under the investigator’s new DUA is subject to fees paid directly to CMS. Once an executed DUA is obtained and provided to HSRDC staff, access to the HSRDC and the data can be granted. Current DUA documentation must be sent to HSRDC staff annually.
There are costs associated with using the HSRDC, where these data are housed. Please contact Jibby Kurichi for information on costs and to find out which CMS data are currently available for reuse on the HSRDC. For more information on CMS data, including data dictionaries and the DUA request process, please visit the ResDAC website.
Optum
Optum is a clinically rich U.S. health care claims database that can be used to conduct research studies. Optum draws on a large proprietary health care database maintained by its parent company. The Optum database contains health care claims from 2000 to 2022 for more than 100 million people, including inpatient and outpatient claims, pharmacy claims, and laboratory results.
Access
Penn faculty wishing to use Optum data need to be either LDI-affiliated investigators or affiliated with CCEB. Access to Penn’s Optum data is project-based, and all projects require LDI review and approval. Faculty members or students should submit to Jibby Kurichi a brief (approximately one-page) research proposal that clearly demonstrates why Optum data is appropriate to the research aims and that includes a timeline, motivation, and brief research design specifying who will conduct the analyses.
Investigators are required to email Optum with their grant proposal, regardless of funding source, at least 10 business days prior to grant submission. Any work using Optum data must be emailed to Optum for review before it is submitted for publication. Since the University of Pennsylvania is the sole contracting entity with Optum, it is essential that all faculty (including those at CHOP, VA, or any other Penn-affiliated hospital) use and emphasize their University of Pennsylvania affiliation when publishing work using Optum data.
Cost
There are costs associated with using the HSRDC, where the Optum data is housed (as outlined in the HSRDC Cost section above).
Please Note: Externally funded (by a government or nonprofit agency) research using Optum data must follow the requirements of the Penn-Optum agreement. Research using Optum data funded by for-profit/corporate entities is prohibited by terms of Penn’s contract with Optum. Investigators are required to alert Optum within 10 days of receiving notification of the award. Fees must be paid to Optum accordingly.
Questions? Please Contact Us.
Contact Jibby Kurichi for the Optum codebook and other related documents, or for further information.
HCUP National (Nationwide) Inpatient Sample (NIS)
The NIS from 1988–2020 is stored on the HSRDC. NIS is the largest publicly available all-payer inpatient care database in the United States, containing data on more than seven million hospital stays. Its large sample size is ideal for developing national and regional estimates and enables analyses of rare conditions, uncommon treatments, and special populations. For a description of the data elements, visit HCUP’s website. Contact Jibby Kurichi for user guides and additional documentation.
HCUP Nationwide Emergency Department Sample (NEDS)
NEDS data from 2007–2020 is stored on HSRDC. NEDS produces national estimates about emergency department (ED) visits across the country. The NEDS describes ED visits, regardless of whether they result in admission. Its large sample size allows for analysis across hospital types and the study of relatively uncommon disorders and procedures. HCUP’s website provides a description of the data elements. Contact Jibby Kurichi for user guides and additional documentation.
HCUP Nationwide Readmissions Database (NRD)
NRD data from 2011–2017 is stored on the HSRDC. NRD is a unique and powerful database designed to support various types of analyses of national readmission rates for all patients regardless of the expected payer for the hospital stay. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. This database addresses a large gap in health care data – the lack of nationally representative information on hospital readmissions for all ages. Visit HCUP’s website for a description of the data elements. Contact Jibby Kurichi for user guides and additional documentation.
HCUP Kids’ Inpatient Database (KID)
KID is the largest publicly available all-payer pediatric inpatient care database in the U.S., containing data from two to three million hospital stays. Its large sample size is ideal for calculating national and regional estimates and enables analyses of rare conditions and uncommon treatments. Data from 2003, 2006, 2009, 2012, and 2016 are stored on the HSRDC. HCUP’s website provides a description of the data elements. Contact Jibby Kurichi for additional information.
Health Care Data on Wharton Research Data Services (WRDS)
WRDS provides clients with a broad collection of health care data, analytics, and the most robust computing infrastructure available, making it the global gold standard for integrated research systems. Corporate, academic, and government clients turn to WRDS for seamless data storage, management, and access, all backed by the credibility and leadership of The Wharton School. With expertise from LDI, WRDS has assembled a collection of health care data that is available to Penn investigators at no charge. Penn staff and faculty can sign up for an account here. Contact Matthew Cohen for additional information.
Truveta
The Truveta database includes research-ready complete EHR data on over 100 million people, from 30 of the largest health systems in the U.S., linked with claims, social determinants of health (SDOH) and mortality data. Records come from all 50 states, date back to 2016 and are updated daily. Truveta EHR data also includes inpatient/outpatient care settings, pharmacy, labs, procedures, medications, implanted/explanted devices, and longitudinal data on over 1 million mother-child pairs. The Truveta data dictionary provides the fullest overview of the data contained in the Truveta database.
Truveta data is stored and analyzed in the Truveta cloud environment, Truveta Studio.
*Please use your email address ending in upenn.edu to access the Truveta data dictionary and Truveta Learn page*
Access
LDI has established an Academic Subscription for the University of Pennsylvania to allow access to Truveta Studio for approved Penn researchers. For externally funded projects, researchers wishing to use Truveta Studio should contact grants@truveta.com for support. Anu Ray at Truveta will provide cost estimates, discuss study feasibility, and provide resources for grant proposals. For internally funded projects, researchers should contact customersupport@truveta.com. Agnes Pastwa at Truveta will provide cost estimates, discuss study feasibility, and provide onboarding resources.
To access Truveta Studio, researchers must provide LDI’s Administrative Director of Research, Jibby Kurichi (jkurichi@pennmedicine.upenn.edu), with their Truveta Cost Estimate, proof of funds, and University of Pennsylvania IRB approval. Researchers will then review and sign the User Agreement, and LDI will provide approval for SSO access to the Truveta platform and data.
Costs
Researchers pay compute and storage fees for cloud analytics usage and can expect to be invoiced monthly. Truveta charges additional fees for externally funded research projects. Please contact grants@truveta.com for budgeting related to grant proposals.
FAQs
How can Truveta data be analyzed?
Truveta data can be analyzed using Python, SQL or R. SAS and Stata are not supported in Truveta Studio. Analysis occurs in the Truveta Notebooks environment, which is described in more detail on the Truveta Learn page here.
What payment information is available?
Truveta exposes payer type (Commercial, Medicare, Medicaid, etc.) through LexisNexis open claims, but not the specific payer. Payment information should be evaluated on a case-by-case basis to see whether it meets the needs of your project. More information can be found in the Truveta data dictionary here, under the “Claim” table.
What provider information is included?
Truveta does not expose NPIs. Instead, it provides de-identified provider IDs that show provider type and specialty and allow care by a particular provider to be tracked. Information on setting of care is also available. Encounters are associated with a provider ID, so a change in provider ID over time indicates that a patient switched providers. More information can be found in the Truveta data dictionary here, under the “PractitionerQualification” table.
What race and ethnicity information is included in the data?
Truveta exposes the 5 major race categories (American Indian/Alaska Native, Asian, Black or African American, Native Hawaiian/Other Pacific Islander, and White) as well as the 2 major ethnicities (Hispanic or Latino and Not Hispanic or Latino). Race and ethnicity data are captured by the health systems and shared directly with Truveta.
What level of geographical representation is in Truveta data?
Truveta exposes location information at the state and 3-digit ZIP code level. Truveta is also in the process of incorporating geographic census detail and the area deprivation index (ADI).
Can external, publicly available data sources be merged in Truveta Studio?
This depends on the file size and what is being linked. Please reach out to customersupport@truveta.com with specific use cases so they can determine the feasibility.
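As an illustration of one preparation step that such a linkage might involve: because Truveta exposes location only at the state and 3-digit ZIP code level, an external file keyed on 5-digit ZIP codes would presumably need to be aggregated to the 3-digit prefix first. The Python sketch below shows that aggregation in general terms. The function names and sample rows are hypothetical and do not reflect Truveta Studio's actual tables or APIs; feasibility for a specific project should still be confirmed with Truveta.

```python
from collections import defaultdict


def zip3(zip_code):
    """Return the 3-digit ZIP prefix of a 5-digit or ZIP+4 code.

    Leading zeros lost to numeric storage (e.g. 2138 for 02138) are restored.
    """
    five_digit = str(zip_code).strip().split("-")[0].zfill(5)
    return five_digit[:3]


def aggregate_to_zip3(rows):
    """Sum a count by 3-digit ZIP prefix.

    rows: iterable of (zip_code, count) pairs from a hypothetical external file.
    """
    totals = defaultdict(int)
    for zip_code, count in rows:
        totals[zip3(zip_code)] += count
    return dict(totals)


# Hypothetical external data keyed on 5-digit ZIP codes.
external = [("19104", 1000), ("19103", 500), ("02139", 200), ("2138", 300)]
print(aggregate_to_zip3(external))
```

Counts can be summed this way, but rates or indices in an external file would need a weighted average instead of a simple sum.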
For how long post-partum is the mother-child data linked?
Truveta’s dataset is consistent from 2016 to the present, and mother-child linkages are maintained on an ongoing basis for patients with birth events during this period. For more mother-child information, please see: https://www.truveta.com/truveta-data/mother-child
Is there documentation or transparency regarding how Truveta processes raw data?
Information on how Truveta normalizes and standardizes data across the 30 partner health systems: https://resources.truveta.com/hubfs/Whitepapers/The%20Truveta%20Language%20Model.pdf
Information on Truveta data quality: https://resources.truveta.com/hubfs/Whitepapers/Truvetas%20Approach%20to%20Data%20Quality.pdf
Health Economics Data Analyst Pool (HEDAP)
The Health Economics Data Analyst Pool (HEDAP) is a Penn service center supported and managed by Penn LDI to provide LDI-affiliated investigators access to high-quality, skilled data analysts. HEDAP administrative staff recruits, trains, and manages a group of Master’s-level and PhD-level statistical analysts. These analysts work with LDI-affiliated investigators across funded projects using statistical software packages such as SAS, Stata, and R to manipulate and analyze health care data under the guidance of the investigators and other collaborators.
HEDAP administration supports the professional development of the analysts and provides state-of-the-art computer equipment and programming software required to conduct health services research.
For more information about HEDAP or to view the analyst request queue, please email Jibby Kurichi.
HEDAP Co-Directors
Norma Coe, PhD
Professor, Medical Ethics and Health Policy, Perelman School of Medicine
Harsha Thirumurthy, PhD
Professor, Medical Ethics and Health Policy, Perelman School of Medicine
Statistical Analysts
Kano Amagai, MPH
Lizzie Bair, MS
Xinwei Chen, MS
Zhi Geng, MPH
Qian (Erin) Huang, MPH
Seiyoun Kim, PhD
Sue Kim, MS
Junning Liang, MS, MS
Yang Li, MS
Eliza Macneal, MS
Angira Mondal, MS
Selina Pan, MSed
Sae-Hwan Park, PhD
Charles Rareshide, MS
Dominic Ruggiero, MPH
Kaitlyn Shultz, MS
Chuxuan Sun, MPA
Jinming Tao, MS
Erkuan Wang, MA
Jingyi Wu, MS
Ruiying (Aria) Xiong, MS
Lin Xu, MS
Ruiqi Yan, MS
Lin Yang, MS
Yueming Zhao, MS
Yu Zhao, MS
Song Zhong, MSSPDA
Apply to an Open Position
If you are a statistical analyst looking for a new and exciting place to work supporting health policy-related research projects conducted by investigators at Penn LDI, please email your resume to Abby Kearns.
We are looking for analysts who have programming skills using SAS, Stata, or R to create analytical data sets from clinical trials, surveys, and health care claims data, construct and standardize outcome measures and other analytical variables, provide descriptive and analytical reports, and perform specialized statistical analyses.
Qualifications include a minimum of a Bachelor’s degree in Mathematics/Statistics, Health Care Management, Economics, or Public Health and three (3) years of related experience, or an equivalent combination of education and experience.