The Community-Acquired Pneumonia Organization (CAPO) Cloud-Based Research Platform (the CAPO-Cloud): Facilitating Data Sharing in Clinical Research

Background: Pneumonia is a costly and deadly respiratory disease that afflicts millions every year. Advances in pneumonia care require significant research investment and collaboration among pneumonia investigators. Despite the importance of data sharing for clinical research, it remains difficult to share datasets with old and new investigators. We present CAPO-Cloud, a web-based pneumonia research platform intended to facilitate data sharing and make data more accessible to new investigators.
Methods: We establish the first two use cases for CAPO-Cloud to be the automatic subsetting and constraining of the CAPO database and the automatic summarization of the database in aggregate. We use the REDCap data capture software and the R programming language to facilitate these use cases.
Results: CAPO-Cloud allows CAPO investigators to access the CAPO clinical database and explore subsets of the data including demographics, comorbidities, and geographic regions. It also allows them to summarize these subsets or the entire CAPO database in aggregate while preserving privacy restrictions.
Discussion: CAPO-Cloud demonstrates the viability of a research platform combining data capture, data quality, hypothesis generation, data exploration, and data sharing in one interface. Future use cases for the software include automated univariate hypothesis testing, automated bivariate hypothesis testing, and principal component analysis.
DOI: 10.18297/jri/vol1/iss3/9/
Received Date: April 13, 2017
Accepted Date: May 8, 2017
Website: https://www.louisville.edu/jri


Introduction
Of all infectious diseases, community-acquired pneumonia (CAP) continues to be one of the most common causes of morbidity for adults in the world. 1 Improving patient outcomes is facilitated by the generation of new knowledge through clinical research and quality improvement activities. One of the largest CAP studies is the Community-Acquired Pneumonia Organization (CAPO), originally developed in 1999 to facilitate collaborative international research on CAP. 2 CAPO currently includes investigators from over 30 countries and 130 healthcare facilities, and the CAPO database houses data for nearly 20,000 hospitalized patients with CAP.
The CAPO database uses the Research Electronic Data Capture software, or REDCap. REDCap is clinical data management software that is accessible online through a secure portal. CAPO members access the database with a unique username and password in order to enter data. The database and CAPO web portal are maintained by the CAPO support group, and any pneumonia investigator may contact the group for the creation of a free account. 3 The primary benefit of this type of shared collaborative database is the large number of case records it makes available to researchers. Since 2003 the NIH has maintained an official policy regarding data sharing, which highlights the importance of collaboration to research. 4 This policy emphasizes the importance of data sharing to the translation of research into practice and requires that NIH-funded studies include a data sharing plan that provides for the timely release and sharing of data. The increasing availability of the internet has provided a greater capability to share data among researchers, but successful dissemination is still challenging for many reasons, including differences in data formats, institutional intellectual property restrictions, and the cost of de-identifying data. 5,6 Since CAPO's creation, its investigators have published many studies based on data contributed by CAPO members, and each of these studies required careful analysis of data, as well as interpretation of results.

The Problem
Despite the utility that the CAPO system makes available to investigators, there are several challenges that need to be overcome. One of the primary problems is that investigators need to contact the CAPO support group to determine if a study population or study variable of interest is available in the CAPO database. Once the population and variables are identified, the data must be analyzed appropriately. Currently, analyses are conducted through two means: 1) the CAPO support group at the University of Louisville provides a raw dataset to investigators for analysis, or 2) the CAPO support group analyzes data and reports back to the investigator. Both of these approaches are cumbersome and time-consuming, typically necessitating several meetings to explain the data, methods, and results.
These issues are not unique to the CAPO database. Presently the only way to determine if a large pneumonia database might be appropriate for secondary analysis is to contact the original investigators and request the data. Even if the original investigators are willing to collaborate, there may be added costs of data de-identification and intellectual property agreements between institutions. These can add weeks or months to the time needed to form the first hypothesis for a secondary analysis.

Objective
The objective of this project was to develop and implement a cloud-based, interactive analytic and data visualization research platform with the following aims: 1) promote new research, 2) facilitate the education of new researchers, 3) enable the exploration of research ideas not envisioned by the initial protocol, 4) make possible the testing of new or alternative hypotheses, and 5) facilitate exploratory data analysis of the original study database. We believe that using data analytics and visualization techniques can improve secondary data analysis not only for the field of pneumonia but also for clinical research in general. Table 1 summarizes many of the terms used in the design and development of software for assisting with data analysis.

Use Cases
When designing software, use cases describe the purpose that the software serves for its users. Use cases are described from the perspective of a typical user of the software. Once use cases are established, software features are created to support all of the necessary use cases. In the case of CAPO-Cloud, we focus on two primary use cases: 1) a CAPO investigator wants to constrain, or subset, the CAPO database, and 2) a CAPO investigator wants to summarize the CAPO database, or a subset of it, in aggregate. Every feature of the app will be made to address a primary use case. Addressing the first use case, subsetting the database, requires a feature that allows selecting characteristics of the database such as demographics. These selections effectively exclude portions of the CAPO database that do not meet the inclusion criteria of a research question, and they are usually made in software by means of checkboxes. After subsetting the database, an additional feature is needed to provide summarization of the subset. This feature can be added by performing calculations and presenting results in a tabular or graphical format.
To satisfy these use cases, we integrate three open source software projects: 1) Research Electronic Data Capture (REDCap) 36, 2) the R programming language 37, and 3) RStudio/Shiny 38.

REDCap
REDCap is electronic data capture software that is developed and maintained by Vanderbilt University. It is web-based software that eligible organizations can obtain at no cost. The CAPO REDCap database is housed on a secure web server and comprises multiple data collection forms. REDCap has application programming interface (API) functionality, allowing REDCap to communicate with a variety of other software and programming languages.

R
R is a statistical programming language based on S 39. R is open source and free, allowing contributors to write software packages and contribute to software repositories such as CRAN 40, GitHub 41, and Bioconductor 42.

Shiny
Shiny is an R software package created by RStudio 43 that lets programmers build interactive applications for visualizing and sharing data. Like other web applications, a Shiny application is a web page that accepts inputs from users via on-screen controls, passes those inputs to R running on a server or local computer, and then returns the results to the web page, updating or generating the information displayed. Shiny applications are usually composed of two key units of code: a script for the functions run by the server and a script for the user interface.

Integration
Although REDCap does not use R, the REDCap API allows it to work in conjunction with many modern programming languages including R, Python 43, PHP 44, and Perl. Through the API, a software developer can create a REDCap "plug-in", a feature that is added directly to the REDCap software to be used through its native interface. The REDCap plug-in runs on the same web server as the REDCap instance and connects to the Shiny server, which can be running on the same hardware or on a remote system, depending on performance and memory requirements. Figure 1 depicts the CAPO-Cloud plug-in for REDCap.
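To illustrate what talking to REDCap through its API can look like, the following is a minimal sketch in Python (one of the languages named above) rather than the R code used by CAPO-Cloud itself. It only builds a record-export request and does not send it. The server URL, the token, and the field names `age` and `sex` are hypothetical placeholders; the `token`, `content`, `format`, and `type` parameters follow the REDCap API's record export method.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_record_export(api_url, token, fields=()):
    """Build (but do not send) a REDCap 'export records' POST request."""
    payload = {
        "token": token,       # per-user, per-project API token issued by REDCap
        "content": "record",  # ask for study records
        "format": "json",     # response format
        "type": "flat",       # one row per record
    }
    # Optional: restrict the export to specific fields (names are hypothetical).
    for i, name in enumerate(fields):
        payload[f"fields[{i}]"] = name
    body = urlencode(payload).encode("utf-8")
    return Request(api_url, data=body, method="POST")


# Hypothetical endpoint and token, used only for illustration.
req = build_record_export(
    "https://redcap.example.org/api/",
    "HYPOTHETICAL_TOKEN",
    fields=["age", "sex"],
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the export against a real REDCap instance; an analysis layer such as the Shiny server would then work on the returned records.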

CAPO-Cloud
After logging in to the CAPO REDCap instance, investigators can navigate to the CAPO-Cloud plug-in by clicking the CAPO-Cloud link on the sidebar. This will take them to the CAPO-Cloud About page, which offers a brief description of the CAPO study and a patient characteristics table for the whole dataset (see Figure 2).
The next tab, Subset Data, provides the subset selection feature for the CAPO dataset. Currently, investigators can subset based on date of admission, sex of patient, age of patient, patient history of comorbidities, and region. The subsets made on this page will carry over to all subsequent pages, allowing investigators to get a unique snapshot of the database from a perspective that meets their research needs. Figure 3 shows the total number of male patients aged 85 and over admitted between 2000-04-17 and 2016-11-03 with a history of COPD in the United States and Canada.
The third tab, Descriptive Statistics, shown in Figure 4, allows investigators to look at summary statistics in more detail. Statistics are accompanied by tables and graphs such as histograms and bar charts. This tab is separated into multiple sections, allowing investigators to look at patient history and demographics, physical exam values, laboratory values, severity of disease markers, and outcomes such as length of stay or in-hospital mortality.
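The subsetting logic described above can be sketched as a simple filter over records. This is an illustrative Python sketch, not the platform's R implementation: the field names (`admitted`, `sex`, `age`, `copd`, `country`) and the five records are made up for the example and do not come from the CAPO database. The criteria mirror the selection shown in Figure 3.

```python
from datetime import date

# Made-up records with hypothetical field names, for illustration only.
records = [
    {"admitted": date(2005, 3, 1), "sex": "M", "age": 87, "copd": True, "country": "Canada"},
    {"admitted": date(2010, 7, 9), "sex": "F", "age": 90, "copd": True, "country": "United States"},
    {"admitted": date(2012, 1, 20), "sex": "M", "age": 66, "copd": True, "country": "United States"},
    {"admitted": date(2015, 11, 2), "sex": "M", "age": 91, "copd": False, "country": "Canada"},
    {"admitted": date(2016, 5, 5), "sex": "M", "age": 85, "copd": True, "country": "United States"},
]


def subset(rows, start, end, sex=None, min_age=None, comorbidity=None, countries=None):
    """Keep rows that satisfy every criterion that was supplied."""
    kept = []
    for row in rows:
        if not (start <= row["admitted"] <= end):
            continue  # outside the admission date window
        if sex is not None and row["sex"] != sex:
            continue
        if min_age is not None and row["age"] < min_age:
            continue
        if comorbidity is not None and not row[comorbidity]:
            continue  # no history of the selected comorbidity
        if countries is not None and row["country"] not in countries:
            continue
        kept.append(row)
    return kept


selected = subset(
    records,
    start=date(2000, 4, 17),
    end=date(2016, 11, 3),
    sex="M",
    min_age=85,
    comorbidity="copd",
    countries={"United States", "Canada"},
)
print(len(selected))  # 2
```

Each criterion left as `None` is simply skipped, which corresponds to leaving a control untouched on the Subset Data tab.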
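As a rough illustration of the kind of calculation behind such a tab, the sketch below computes summary statistics, histogram bin counts, and an in-hospital mortality proportion. It is written in Python with made-up values; CAPO-Cloud itself uses base R functions, and the variable names and bin width here are assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean, median

# Made-up values for six hypothetical patients.
ages = [67, 72, 85, 91, 58, 79]
length_of_stay = [4, 7, 12, 5, 3, 9]  # days
died_in_hospital = [False, False, True, False, False, True]


def summarize(values):
    """Summary statistics for one numeric variable."""
    return {
        "n": len(values),
        "mean": round(mean(values), 1),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def histogram(values, width=10):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


age_summary = summarize(ages)
age_bins = histogram(ages)
los_summary = summarize(length_of_stay)
mortality = sum(died_in_hospital) / len(died_in_hospital)

print(age_summary)  # {'n': 6, 'mean': 75.3, 'median': 75.5, 'min': 58, 'max': 91}
print(age_bins)     # {50: 1, 60: 1, 70: 2, 80: 1, 90: 1}
```

Only aggregates such as these leave the function, which is in keeping with presenting summaries rather than individual records to the investigator.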

Security and Privacy
The CAPO database is not completely de-identified, but the summary statistics provided by CAPO-Cloud effectively mask protected health information. Access is only available upon requesting a CAPO REDCap account, and all access is tracked using REDCap's audit trail. Using the REDCap API to connect to R, CAPO-Cloud applies statistical functions from base R (the core R programming language) to data stored in REDCap. The results are then displayed to the user via the Shiny Server web interface.

Discussion
In this study, we developed and implemented a new cloud-based interactive analytic and data visualization research platform, CAPO-Cloud, to facilitate hypothesis generation and secondary data analysis of the CAPO international database. The platform allows investigators to start secondary data analysis in an interactive way that is less time-consuming and more acces [...] and available means to visualize data with their products Many Eyes 45 and Fusion Tables 46. For investigators who are skilled with these tools, they are often sufficient to provide a general overview of a clinical dataset.
Excel provides a convenient way to interact with and edit data, and although it is proprietary software, it is not prohibitively expensive and is available to most students and employees at a reduced price or for free. However, Excel has limits on the size of dataset it can manage, and performing any specialized statistics requires familiarity with Excel's formula language.
Tableau supports a point-and-click interface, allowing investigators to drag variable names to create charts and other data visualizations based on their selections. To get meaningful visualizations, a Tableau user must know what type of data is associated with each variable name. Most investigators, and especially investigators new to clinical research and statistics, could not discern this from the raw data alone, and so many datasets will have an accompanying data specification or data dictionary describing this information. This dictionary provides the data type, units, validity ranges, and any other relevant information about a variable. Even having access to this dictionary does not make initial data exploration steps approachable for a new investigator, as the number of options available to new users when first opening the dataset in Tableau can be overwhelming.
None of these tools is integrated with the data collection and quality framework in which the data were collected, and this integration is a key advantage of CAPO-Cloud. Having the analysis and visualization tools within that framework ensures that investigators will only see capabilities that apply to the CAPO pneumonia dataset. This is crucial in making investigation approachable and reproducible.

Strengths
All software used for this project is available at no cost to eligible organizations, and anyone with access to REDCap can create a similar platform tailored to their studies. Any investigator interested in pneumonia data can become a member of CAPO and be granted access to the CAPO clinical database and CAPO-Cloud.
The platform provides visualization that aids clinicians and researchers without a strong statistical background in the generation of new ideas for research. This allows researchers who were not part of the original study to initiate secondary data analyses of the clinical database on their own, allowing the CAPO support unit to focus on sophisticated, high-level analysis developed from investigators' hypotheses.

Limitations
Currently, this platform can only be applied to one database (the CAPO database). Applying it to other REDCap projects would require some modification by a programmer experienced in R and Shiny.
REDCap is not open source and is only available upon request from the REDCap consortium. REDCap can only be licensed for use that is considered not-for-profit. This includes academic, government and non-profit organizations, but not industry or private enterprise.

Future research
CAPO-Cloud supports use cases for the exploratory browsing and descriptive statistics of the CAPO database for CAPO investigators.
In Table 2 we list several use cases, and the features to support them, that we anticipate will be useful to CAPO investigators in the future. Many of these functions are in use by the CAPO coordinating center as standalone R programs, and only need adjustment for presentation on a web page through Shiny.
In order to understand the impact that CAPO-Cloud has on investigators, it will be important to track CAPO-Cloud usage and follow up regularly with investigators using it, to determine which features are the most useful. Features that facilitate studies leading to publication will have the highest priority.

Conclusions
In conclusion, we have presented our project CAPO-Cloud, a REDCap plug-in that aims to create a basic structure for a cloud-based research program facilitating collaborative research and data sharing at the international level. We believe that tools like these will greatly enhance the ability of non-statisticians to perform preliminary investigations into important new clinical research questions.

Author contributions:
WAM, SPF, and TLW wrote the first draft of the manuscript. All authors critically reviewed the manuscript for important intellectual content. All authors agree with the manuscript results and conclusions.
Funding source: This project was unfunded.