Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Background Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. 
The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data. Supplementary Information The online version contains supplementary material available at 10.1186/s40246-021-00336-1.


GVViZ
GVViZ is a user-friendly desktop bioinformatics application, developed by Ahmed Lab to support RNA-seq-driven gene expression, regulation, and disease annotation analysis with dynamic heat map visualization. It is based on a set of simple instructions that a user without a strong bioinformatics background or programming skills can follow to perform gene expression, regulation, and disease annotation analysis and to produce dynamic heat map visualizations.
GVViZ is a multi-platform software package programmed in Java, designed following software engineering principles and the "Butterfly" paradigm [1,2]. It can run on Microsoft Windows, Linux, Unix, and macOS operating systems. The overall graphical user interface (GUI) of GVViZ consists of five main components:

GVViZ: Main
Main is the primary user interface of GVViZ, as presented in S. Figure 1 and S. Table 1.

File
• Navigates to the Connect to the database panel.
• Provides option to Exit from GVViZ.

Export
• Provides options to export results:
▪ Text in CSV format file.
▪ Visualization in TIFF and PNG format files.

Help
• Provides the authors' Contact and GVViZ's About information.

S.1.B Data Settings Panel
The data settings panel offers three data selection options:
1. Select the database: It allows the user to choose between "Protein Coding", "Non-Coding", and all genes available in the integrated annotation database. It also allows the user to select and analyze all available genes in the samples used for the analysis.
2. Select expression data: It allows the user to select the expression value for the analysis. Current options include:
• Length

The basic GVViZ workflow starts with establishing a connection to the database: from the top menu, the user selects connect to the database and provides a valid username, password, and address of the host database. Next, the data need to be searched and selected. The heat map can then be customized: the user can set a new title, the number of y-axes, and the gradient for the heat map. The last step is to visualize and export the rendered heat map.

GVViZ GUI: Menu
The main menu is available at the top of the GUI. It consists of three components (S. Figure 2 and S. Table 2):

File:
The File component allows the user to connect to the local database and to exit the program. GVViZ relies on a MySQL database management system (server), where gene-disease annotation and expression data are stored to support data analysis and heat map visualization. The user connects to the database by clicking the connect to database button. A popup window then appears, in which the user is required to input the database connection information (hostname, username, password, and port).

Export:
The export option allows the user to save the generated heat maps in two different image formats (PNG and TIFF). In addition, the user has the option to save the matrix produced by GVViZ to a CSV file.
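As a rough illustration of the CSV export, the sketch below writes a labeled matrix with Python's standard `csv` module. The layout (genes as rows, samples as columns), the function name `write_matrix_csv`, and the placeholder gene and sample labels are all assumptions made for this example; the actual column layout of the file written by GVViZ is not specified here.

```python
import csv
import io


def write_matrix_csv(handle, row_labels, column_labels, matrix):
    """Write a labeled matrix as CSV: one header row, then one row per gene."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["gene"] + list(column_labels))
    for label, values in zip(row_labels, matrix):
        writer.writerow([label] + list(values))


# Placeholder labels and values, invented for illustration only.
buffer = io.StringIO()
write_matrix_csv(
    buffer,
    ["GENE_A", "GENE_B"],
    ["S1", "S2"],
    [[12.5, 80.0], [3.0, 0.0]],
)
print(buffer.getvalue())
```

Writing to a file instead of an in-memory buffer only requires passing a handle opened with `open(path, "w", newline="")`.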

Help:
Help is the last component in the main menu. It consists of two options: About and Contact Us. About displays information about the current version of GVViZ, and Contact Us displays information for contacting the authors.

S.2.A File
• Connect to database: To connect to the local database.
• Exit: To terminate the program.

S.2.B Export
• Export to PNG: Save the heat map as PNG.
• Export to TIFF: Save the heat map as TIFF.
• Save data as CSV: Save the heat map data (matrix) as CSV.

S.2.C Help
• Contact us: Popup that displays information for contacting the authors of GVViZ.
• About: Popup that displays information about the current version of GVViZ.

GVViZ GUI: Data Settings Panel
The data settings panel is on the WEST side of the screen and is divided into three sections (S. Figure 3 and S. Table 3):

Type:
In this section, the user selects the type of data to use. Four options are available: 1) select protein coding genes, 2) select non-protein coding genes, 3) select all (protein and non-protein coding genes), and 4) select all the samples stored in the database.

Expression selection:
There are five gene expression metrics to choose from: 1) length, 2) effective length, 3) expected count, 4) TPM, and 5) FPKM. The user can impose up to two constraints on the selected metric to filter the gene expression data. The following operators can be used in a constraint: ">", "<", "=", "<=", ">=", and "!=". Lastly, the user is required to input a number in the text box to complete the constraint. For example, to plot a heat map that contains only samples with TPM greater than 50, the user first selects TPM, then selects the operator ">" in the drop-down menu and inputs 50 in the text box. This selects only the samples stored in the database whose TPM values are higher than 50.
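The constraint mechanism can be pictured as building a parameterized SQL filter. The following is a minimal sketch under stated assumptions, not the GVViZ implementation: the table name `gene_expression`, the column names, the function `build_expression_query`, and the choice to combine two constraints with AND are all hypothetical. The `%s` placeholder follows the style used by common Python MySQL drivers.

```python
# Hypothetical mapping from GUI metric names to column names (assumed schema).
ALLOWED_METRICS = {
    "length": "length",
    "effective length": "effective_length",
    "expected count": "expected_count",
    "TPM": "tpm",
    "FPKM": "fpkm",
}
ALLOWED_OPERATORS = {">", "<", "=", "<=", ">=", "!="}


def build_expression_query(metric, constraints):
    """Return (sql, params) for a metric filtered by up to two constraints."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    if len(constraints) > 2:
        raise ValueError("at most two constraints are supported")
    column = ALLOWED_METRICS[metric]
    clauses, params = [], []
    for operator, value in constraints:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unknown operator: {operator!r}")
        # Only whitelisted names are interpolated; values stay as parameters.
        clauses.append(f"{column} {operator} %s")
        params.append(float(value))
    sql = f"SELECT gene_id, sid, {column} FROM gene_expression"
    if clauses:
        # Assumption: two constraints are combined with AND.
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# The example from the text: TPM greater than 50.
sql, params = build_expression_query("TPM", [(">", 50)])
print(sql)
print(params)
```

Restricting metric and operator to fixed whitelists keeps user input out of the SQL text itself, which is the main reason for building the query this way.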

Sample IDs:
After selecting the expression data, the next step is to select the sample ids to be visualized in the heat map. To do so, the user inputs the sample ids in the text box provided, separated by "," or joined by "-" to specify a range. Combinations of "," and "-" are also valid inputs. For example, to plot a heat map with samples ranging from 10 to 100, the expression is: 10-100.
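The sample id syntax described above can be expanded with a few lines of parsing. This is a sketch only: it assumes that sample ids are integers and that ranges are inclusive at both ends, and the function name `parse_sample_ids` is hypothetical.

```python
def parse_sample_ids(text):
    """Expand a string such as "1,3,10-12" into a sorted list of unique ids."""
    ids = set()
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {token!r}")
            # Assumption: both ends of the range are included.
            ids.update(range(start, end + 1))
        else:
            ids.add(int(token))
    return sorted(ids)


print(parse_sample_ids("1,3,10-12"))  # [1, 3, 10, 11, 12]
```

With this reading, the input 10-100 from the example selects 91 samples.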
• Non-Coding genes
• All genes from database
• Text box to input the constraint.
• Text box to input the second constraint.

S.3.C Sample ids
• Text box to input the sample ids to select.

GVViZ GUI: Data Visualization Panel
Once the user has selected the data, the next step is to select the genes to plot in the heat map using the data visualization panel. The data visualization panel is divided into three tabs/modules: Search, Visualization, and SQL.
S. Figure 4. GVViZ search screen located in the center of the screen.

S.4.A Disease Search Box
• Text box for typing disease keywords.
• Search button to submit query.

S.4.B Search Table
• Results from the search box will appear here.

S.4.C Sample Selection
• Select all samples.

S.4.D Submit Query Button
• Submits the overall query to render the heat map.

Search:
The search tab allows the user to look for genes that are associated with diseases. The user inputs a disease keyword (full or partial) in the text bar and clicks the search button. A collection of genes related to that disease will then appear in the search table. This information is drawn from the connected backend gene-disease annotation database. Once the automatically generated SQL query has executed successfully and the list of genes is available, the user can select all genes or customize the selection, and then submit the query to start rendering the heat map (S. Figure 4 and S. Table 4).
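A full or partial keyword search of this kind is commonly expressed as a SQL `LIKE` query. The sketch below illustrates the idea with Python's built-in `sqlite3` module standing in for the MySQL server; the table name, column names, function name, and the three rows are placeholders invented for the example and are not taken from the GVViZ database.

```python
import sqlite3

# sqlite3 stands in for the MySQL server; schema names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene_disease_annotation ("
    "gene_id TEXT, ensembl_id TEXT, category TEXT, disease TEXT)"
)
# Placeholder rows, invented for illustration only.
conn.executemany(
    "INSERT INTO gene_disease_annotation VALUES (?, ?, ?, ?)",
    [
        ("GENE_A", "ENSG_A", "non-oncology", "Alzheimer's disease"),
        ("GENE_B", "ENSG_B", "non-oncology", "Asthma"),
        ("GENE_C", "ENSG_C", "oncology", "Breast cancer"),
    ],
)


def search_genes_by_disease(connection, keyword):
    """Return (gene_id, disease) rows whose disease contains the keyword."""
    cursor = connection.execute(
        "SELECT gene_id, disease FROM gene_disease_annotation "
        "WHERE disease LIKE ? ORDER BY gene_id",
        (f"%{keyword}%",),  # wildcards on both sides allow partial keywords
    )
    return cursor.fetchall()


print(search_genes_by_disease(conn, "alzheimer"))
```

The keyword is passed as a bound parameter rather than concatenated into the query string, so special characters in the search box cannot alter the SQL.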
S. Figure 5. GVViZ visualization screen located in the center of the screen.

S.5.A Rendering Canvas
• The heat map will be rendered here.

Visualization:
This module displays the heat map. Every time a change is made to the heat map, the render heat map button needs to be clicked to draw or refresh it (S. Figure 5 and S. Table 5).
S. Figure 6. GVViZ SQL screen located in the center of the screen.

S.6.A SQL Query Box
• SQL query to send to the database.

S.6.B SQL Submit Button
• Submits the query entered in the text box.

S.6.C SQL Results Table
• Results of the SQL query will appear in this table.
S. Table 6. GVViZ SQL screen features

SQL:
The last tab is SQL, which allows the user to run SQL queries directly against the database. The text box at the top of the interface takes the SQL query to be executed. In the middle of the interface is the button to submit the query, and at the bottom is a table where the output of the SQL query is displayed (S. Figure 6 and S. Table 6).

GVViZ GUI: Heat Map Settings Panel
The last panel is the heat map settings panel, located on the EAST side of the screen, which allows the user to render the heat map and customize its appearance (S. Figure 7 and S. Table 7). This panel is divided into five sections: status, record information, heat map plotting options, heat map naming, and color selector and rendering.
S. Figure 7. GVViZ heat map screen located in the EAST of the screen.

S.7.A GVViZ Status
• Database connection status: displays whether GVViZ is currently connected to the local database.
• Processing status: displays whether GVViZ is performing computations in the background or is idle.

S.7.B Record Information
• Min Value: displays the smallest value that is currently being rendered in the heat map.
• Max Value: displays the largest value that is currently being rendered in the heat map.
• Number of records found: displays the total number of records that were returned for the keyword entered in the disease search box.

S.7.C Heat Maps Plotting Options
• Draw Title: checkbox to display or not display the heat map title.
• Draw Legend: checkbox to render the legend for the heat map.
• Draw X-Axis Title: checkbox to display or not display the x-axis title of the heat map.
• Draw Y-Axis Left Title: checkbox for rendering or not rendering the left y-axis of the heat map.
• Draw Y-Axis Right Title: checkbox for rendering or not rendering the right y-axis of the heat map.

S.7.D Heat Map Naming
• Title: text box for setting the title name of the heat map.
• X-Axis Title: text box for setting the x-axis title name.
• Y-Axis Left Title: text box for setting the left y-axis title name.
• Y-Axis Right Title: text box for setting the right y-axis title name.
• Y-Axis Left Title Source
• Render Heat Map: needs to be clicked to render the heat map.
S. Table 7. GVViZ heat map screen features

Status:
The first section of the panel consists of two labels that report the status of the connection to the local database and the processing status.

Record information:
The second section consists of two informative labels that display the minimum and the maximum values of the plotted heat map.

Heat map plotting options:
The third section is a collection of five check boxes that allow the user to decide whether to render the title, the legend, and the x-axis title, and whether to render a single or a double y-axis. Selecting or deselecting one of these options will be reflected in the visualization tab after clicking the render heat map button, which is located at the bottom of this panel.

Heat map naming:
The fourth section allows the user to set the title, the x-axis title, and the y-axis titles, and to select the names to be displayed on both y-axes. Note that if none is selected for the right y-axis, only a single y-axis will be plotted.

Color selector and rendering:
The last section consists of two components: the color scheme selector for the heat map, where the user can select from twenty-eight different color combinations, and the render heat map button, which needs to be clicked every time a change has been made to the heat map. S. Table 8 lists all the gradient combinations available, along with the high and the low color of each.
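A two-color gradient of this kind is typically applied by interpolating between the low and the high color according to where a value falls between the minimum and maximum being rendered. The sketch below assumes simple linear interpolation in RGB space; the interpolation actually used by GVViZ is not stated here, and the function name and example colors are hypothetical.

```python
def interpolate_color(value, min_value, max_value, low_rgb, high_rgb):
    """Map a value in [min_value, max_value] to an RGB color on the gradient."""
    if max_value == min_value:
        fraction = 0.0  # degenerate range: use the low color
    else:
        fraction = (value - min_value) / (max_value - min_value)
    # Clamp so out-of-range values take the nearest end color.
    fraction = min(1.0, max(0.0, fraction))
    return tuple(
        round(low + fraction * (high - low))
        for low, high in zip(low_rgb, high_rgb)
    )


# Example colors chosen for illustration only.
BLACK, ORANGE = (0, 0, 0), (200, 100, 0)
print(interpolate_color(25.0, 0.0, 100.0, BLACK, ORANGE))  # (50, 25, 0)
```

The Min Value and Max Value labels in the record information section correspond to the `min_value` and `max_value` arguments in this sketch.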

Gene Expression Data:
The Gene Expression Data table contains all the data produced by the RNA-seq data pipeline and uploaded to this table. The table consists of seven columns (S. Table 9 and S. Figure 8): gene id, length, effective length, expected count, TPM, FPKM, and SID (sample id).

Gene Disease Annotation:
The Gene Disease Annotation table is used for gene-disease annotation. The table contains four columns: gene id, Ensembl id, category, and disease (S. Table 10 and S. Figure 8).
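To show how the two tables relate, the sketch below pivots long-format expression rows (one row per gene and sample) into a genes-by-samples matrix restricted to a set of annotated genes. The dictionary keys, the function name `build_heat_map_matrix`, the matrix orientation, and all values are assumptions made for illustration; they are not the GVViZ internals.

```python
def build_heat_map_matrix(expression_rows, annotated_gene_ids, metric="tpm"):
    """Pivot long-format expression rows into a genes x samples matrix."""
    wanted = set(annotated_gene_ids)
    values = {}
    for row in expression_rows:
        if row["gene_id"] in wanted:
            values[(row["gene_id"], row["sid"])] = row[metric]
    genes = sorted({gene for gene, _ in values})
    samples = sorted({sid for _, sid in values})
    # Missing gene/sample combinations are left as None.
    matrix = [[values.get((gene, sid)) for sid in samples] for gene in genes]
    return genes, samples, matrix


# Placeholder expression rows, invented for illustration only.
rows = [
    {"gene_id": "GENE_A", "sid": 1, "tpm": 12.5},
    {"gene_id": "GENE_A", "sid": 2, "tpm": 80.0},
    {"gene_id": "GENE_B", "sid": 1, "tpm": 3.0},
    {"gene_id": "GENE_C", "sid": 2, "tpm": 40.0},
]
# Gene ids as they might be returned by a disease search.
genes, samples, matrix = build_heat_map_matrix(rows, ["GENE_A", "GENE_C"])
print(genes, samples, matrix)
```

Here the gene id acts as the key shared by both tables: the annotation table supplies the gene ids for a disease, and the expression table supplies the values per sample.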

RNA-seq pipeline and annotated gene-disease data
A typical workflow for RNA-seq analysis using GVViZ (S. Figure 9) is as follows. First, the RNA-seq pipeline is deployed, in which quality control of the raw reads is conducted using FastQC [3]. The reads are then trimmed using Trimmomatic [4], and the sequence data are sorted using SAMtools [5]. MarkDuplicates is then used to remove duplicates [6], and CollectInsertSizeMetrics is used to compute the size distribution and read orientation of paired-end libraries. The paired-end raw reads are then aligned to the human reference genome (hg38) using HISAT [7] with Bowtie2 [8].
RNA-Seq by Expectation-Maximization (RSEM) [9] is then applied for quantification and identification of differentially expressed genes by aligning reads to reference de novo transcriptome assemblies, based on TPM (transcripts per million). Lastly, the results of the RNA-seq pipeline are parsed and automatically loaded into the GVViZ gene expression database, where they can be queried and visualized using the GVViZ GUI.
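The parsing step can be sketched as follows, assuming the input is the tab-separated gene-level results file written by RSEM, whose header is `gene_id`, `transcript_id(s)`, `length`, `effective_length`, `expected_count`, `TPM`, `FPKM`. The function name, the output keys, and the single data row are hypothetical; the loader used by GVViZ itself is not described here.

```python
import csv
import io

METRIC_COLUMNS = ("length", "effective_length", "expected_count", "TPM", "FPKM")


def parse_rsem_gene_results(handle, sample_id):
    """Yield one dict per gene with the seven expression table fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    for record in reader:
        row = {"gene_id": record["gene_id"], "SID": sample_id}
        for column in METRIC_COLUMNS:
            row[column] = float(record[column])
        yield row


# One placeholder record, invented for illustration only.
example = io.StringIO(
    "gene_id\ttranscript_id(s)\tlength\teffective_length\t"
    "expected_count\tTPM\tFPKM\n"
    "GENE_A\tTX_A1,TX_A2\t1500.00\t1350.25\t42.00\t10.50\t7.25\n"
)
parsed_rows = list(parse_rsem_gene_results(example, sample_id=1))
print(parsed_rows[0])
```

Each yielded dictionary carries the seven fields listed for the Gene Expression Data table, with the sample id supplied by the caller because it is not part of the RSEM output.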

Declarations
Ethical Approval and Consent to participate: Not applicable.

Consent for publication: Not applicable.
Availability of data and material: The data that support the findings of this study are openly available in the following GitHub repository: <https://github.com/drzeeshanahmed/GVViZ-Public>