Overcoming Challenges in Omics Data Analytics

Sep 01, 2021 | 8 min read

One of the challenges in analyzing and interpreting Omics data is extracting meaningful insights from large-scale, high-dimensional data sets that come from multiple sources. The ability to treat the data sets as a whole matters at every step of data handling, from pre-processing to statistical analysis to interpretation and deduction. However, because the number of variables in Omics studies can be very large compared with the number in conventional multivariate analyses, Omics analysis is much more challenging than conventional studies.

This article is posted on our Science Snippets Blog.

In addition, multi-omics datasets often suffer from missing values due to technical limitations and various experimental constraints. These missing values can hinder integrated analysis of the Omics data.
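As a rough illustration of what handling missing values can look like in practice, here is a minimal sketch that drops variables with too many gaps and fills the rest with column means. The function name, the 50% threshold, and the choice of mean imputation are all assumptions made for this example; they are not taken from any particular software, and projection-based MVDA methods typically treat missing values differently.

```python
import numpy as np

def filter_and_impute(X, max_missing_fraction=0.5):
    """Drop variables with too many missing values, then mean-impute the rest.

    X is an (N samples x K variables) array where NaN marks a missing entry.
    Returns the imputed matrix and a boolean mask of the variables kept.
    The threshold and the imputation rule are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    missing_fraction = np.isnan(X).mean(axis=0)      # share of NaN per variable
    keep = missing_fraction <= max_missing_fraction  # variables worth keeping
    X_kept = X[:, keep]
    col_means = np.nanmean(X_kept, axis=0)           # column means, ignoring NaN
    X_imputed = np.where(np.isnan(X_kept), col_means, X_kept)
    return X_imputed, keep
```

Mean imputation is only a placeholder here. It shrinks variance and can distort correlations, so it is worth checking how sensitive the downstream results are to the choice.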

Making sense of Omics data requires advanced data reduction and visualization techniques, which in turn call for capable data analytics tools, methods, and software. One of the most essential is multivariate data analysis (MVDA), which distills the critical information in Omics data into relevant insights by finding the correlations that exist among the variables.

Omics Studies Generate Huge Amounts of Data

“Omics” refers to fields of study such as genomics, proteomics, metabolomics, lipidomics, cytomics, and phenomics, which are the cornerstone of many pharmaceutical advancements that rely on biologics. Omics studies involve large-scale analyses of biological samples using high-throughput analytical approaches and bioinformatics, which produce vast amounts of data that are complicated and time-consuming to interpret.

For example, in Omics studies, a particular type of molecule in the samples may be measured as a whole in terms of character and quantity, and the patterns in those measurements and/or their relationship to the sample attributes may need to be investigated. Genomic studies measure DNA molecules, whereas epigenomic, transcriptomic, proteomic, and metabolomic studies measure the chemical states of DNA and its binding proteins, RNA, proteins, and metabolites, respectively. ^[1] This creates a range of differences in the types of data produced.

The concept of -omics, or collective measurement, is applied not only to molecules, but also to various measurable targets, such as a set of traits (phenome), states of brain neural networks (connectome), and bacterial florae (metagenome)^[2].

Furthermore, in Omics studies, each experimental instance generates a huge amount of information simultaneously. For example, a next-generation sequencing (NGS) experiment produces billions of short reads for a genomic, epigenomic, or transcriptomic study^[3]. Another example is gas chromatography-mass spectrometry for a metabolomic study, which produces bands of data that contain information on all of the various metabolites^[4].

Objectives in Omics

Omics studies are designed to study genes, proteins, metabolic pathways, and cellular events. The data created and extracted must be distilled and interpreted to provide practical information that can answer questions around cause and effect with respect to genes, disease, physiological state, diet, biological age, and nutrition. Two important areas for data analytics are:

Metabolomics
“comprehensive analysis of the whole metabolome under a given set of conditions” ^[5]

Metabonomics
“the quantitative measurement of the dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” ^[6]

Interpreting the Data

Pre-processing. The first step in analyzing Omics data is pre-processing. This helps get all of the data into a similar format for analysis. The need to pre-process data manually can be eliminated using certain advanced data analytics tools. SIMCA^® multivariate data analysis software includes essential functions such as scaling, mean-centering, transformation, advanced scaling, data correction, and compression to help make Omics data analysis easier.
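To make the idea of scaling and mean-centering concrete, here is a small sketch of three common column-wise options. The function name and the labels "ctr", "uv", and "pareto" are chosen for this example only; Pareto scaling is included as an assumption about what "advanced scaling" might cover, and none of this is meant to describe how SIMCA implements these steps.

```python
import numpy as np

def scale_columns(X, scaling="uv", log_transform=False):
    """Center and scale each variable (column) of an N x K data table.

    scaling: "ctr"    -> mean-centering only
             "uv"     -> unit variance (divide by the standard deviation)
             "pareto" -> divide by the square root of the standard deviation
    Assumes no constant columns, and strictly positive values if
    log_transform is True.
    """
    X = np.asarray(X, dtype=float)
    if log_transform:
        X = np.log10(X)                    # compress skewed intensity data
    centered = X - X.mean(axis=0)          # mean-centering
    std = X.std(axis=0, ddof=1)            # sample standard deviation
    if scaling == "ctr":
        return centered
    if scaling == "uv":
        return centered / std
    if scaling == "pareto":
        return centered / np.sqrt(std)
    raise ValueError(f"unknown scaling: {scaling!r}")
```

Unit-variance scaling gives every variable the same weight, while Pareto scaling sits between no scaling and unit variance, so large peaks keep somewhat more influence than small ones.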

Manually interpreting data. In Omics studies, long sequential molecules, many genes, genetic variants, molecules, biomarkers, cells, and individuals are often part of the data evaluated. One way to manage these items is to investigate each element one by one and to gather multiple levels of data about each one. But evaluating the results of many items individually, in the context of multiple experiments and combinations of variables, will be time-consuming and produce many p-values that must be statistically evaluated for relevance.
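One way to see why many p-values are a burden: each has to be adjusted for the number of tests before it can be trusted. The sketch below uses the Benjamini-Hochberg procedure, which is one widely used adjustment; the article does not name a specific method, so treat this choice (and the function name) as an assumption for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                            # indices, smallest p first
    ranked = p[order] * m / np.arange(1, m + 1)      # p * m / rank
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)      # map back to input order
    return adjusted

# Four univariate tests; the adjusted values are larger than the raw ones.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With thousands of variables, the adjustment becomes severe, which is part of the motivation for analyzing the variables jointly instead.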

Multivariate analysis. Advanced data analytics software like SIMCA^® uses a number of essential multivariate statistical analysis tools, such as PCA, PLS, OPLS^®, and O2PLS^® to align data, creating summary indices (latent variables) in order to observe trends and outliers. One of the key issues is the determination of causality versus correlation. Biological studies involve a myriad of changes, but only some will be related to the cause. Using statistical methods, it’s possible to find the population of changes that are impactful vs those which are irrelevant or even counter-directional.
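The notion of "summary indices (latent variables)" can be sketched with a plain PCA computed through a singular value decomposition. This is a generic textbook formulation with names chosen for the example; it is not SIMCA's implementation, and it assumes a complete data table with no missing values.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD on mean-centered data.

    Returns scores (one row per sample), loadings (one row per variable),
    and the fraction of variance explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                            # mean-center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # summary indices per sample
    loadings = Vt[:n_components].T                     # variable contributions
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]

# A wide table: 5 samples, 50 variables, summarized by 2 latent variables.
rng = np.random.default_rng(0)
scores, loadings, explained = pca_scores(rng.normal(size=(5, 50)))
print(scores.shape, loadings.shape)
```

Plotting the first two score columns against each other gives the familiar score plot in which trends, groupings, and outliers become visible.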

Data visualization. One of the essential issues in the interpretation of Omics data is visualization. Some of the issues to be handled with Omics data include dealing with “noisiness”, intra-experimental quality variation, and inter-experimental quality variation. The right software can help eliminate these issues and provide advanced, interactive visualization that makes results easy to understand.

Although the two variables are correlated, this does not imply that one causes the other.

Experiments are not always perfect, whether they are conducted manually on a small scale or performed with automated high-throughput equipment. The main difference is that a manual experiment can be redone, which is not always possible with an automated process because a selected portion of the process can’t be repeated separately. Data analytics tools must therefore be able to manage outlier or “bad” data effectively.
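As one possible way to flag outlying samples, the sketch below computes Hotelling's T² from PCA scores. This diagnostic is an assumption chosen for illustration (the article does not say which outlier measures are used), and the function name is hypothetical.

```python
import numpy as np

def hotelling_t2(X, n_components=2):
    """Hotelling's T2 per sample, computed from the leading PCA scores.

    Large values mark samples that sit far from the bulk of the data
    within the model plane.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    score_var = scores.var(axis=0, ddof=1)           # variance of each component
    return np.sum(scores ** 2 / score_var, axis=1)   # sum of squared, scaled scores

# 20 ordinary samples plus one deliberately extreme sample in the last row.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 5)), np.full((1, 5), 10.0)])
t2 = hotelling_t2(X, n_components=1)
print(int(np.argmax(t2)))
```

Because the suspect sample is included when the model is fitted, it also pulls the model toward itself, so a flagged sample is a prompt for investigation and not an automatic reason to discard it.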

The degree to which data quality varies among the records within a single experiment is referred to as intra-experimental quality heterogeneity.
The degree to which data quality differs from one experiment, or one procedure, to another is known as inter-experimental heterogeneity.

Data analytics solutions must be able to account for both of these issues, determine which of them matter, and make sense of the data without tossing out important experiments. Using Omics-supportive software for data analytics helps with these important procedures.

In addition, the effect that various factors have on each other must be weighed.

Read more about pre-processing and scaling of data and How Data Analytics Supports Commercialization of Omics Research

Finding Differences or Grouping Data

Finding meaning in Omics datasets often requires being able to group data and understand differences. Two common techniques employed by MVDA include:

principal component analysis (PCA) for data summary and overview
partial least squares (PLS) and orthogonal PLS (OPLS^®) for regression analysis, or O2PLS^® for data fusion
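To show what PLS regression does mechanically, here is a compact sketch of classical single-response PLS (often called PLS1) fitted with the NIPALS algorithm. It is a generic textbook version with illustrative names; OPLS and O2PLS are not shown, and nothing here is meant to represent SIMCA's own implementation.

```python
import numpy as np

def pls1_fit(X, y, n_components=2):
    """Fit single-response PLS regression with NIPALS.

    Returns regression coefficients and an intercept so that
    predictions are X @ coef + intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # centered X and y residuals
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                          # weights: covariance of X with y
        w /= np.linalg.norm(w)
        t = E @ w                            # scores (latent variable)
        tt = t @ t
        p = E.T @ t / tt                     # X loadings
        c = f @ t / tt                       # y loading
        E = E - np.outer(t, p)               # deflate X
        f = f - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in original X space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 7.0], [6.0, 5.0]])
y = 2.0 * X[:, 0] - X[:, 1] + 3.0
coef, intercept = pls1_fit(X, y, n_components=2)
print(coef, intercept)
```

Because each component is built from the covariance between X and y, the same procedure still works when there are far more variables than samples, which is the usual situation in Omics.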

The SIMCA^® (soft independent modeling by class analogy) method, based on disjoint principal component analysis (PCA), or more recently OPLS-DA, lets you pursue either classification or discriminant analysis objectives.

SIMCA^® is suitable when you have many groups (‘classes’), some of which may overlap, and where the number of samples in the groups (‘classes’) may vary considerably. OPLS-DA is better suited when comparing two classes to find differences and to understand the sometimes very complex and hidden patterns. The Sartorius SIMCA^® MVDA software uses these advanced data analytics techniques and others to help you effectively handle Omics data.

How MVDA Supports Omics

Because the molecular complexity of disease etiology exists at many different levels, multivariate analysis offers an effective way to incorporate multi-level Omics data and can be more powerful than single-level analysis. One of the primary ways that MVDA supports Omics is by dealing with the dimensionality problem. MVDA is able to cope with multi-dimensional data and separate causality from correlation. It allows you to extract information from all data simultaneously. It is also able to handle all types of data tables including:

Short and wide, N << K (few observations, many variables)
Almost square, N ≈ K
Long and lean, N >> K (many observations, few variables)

MVDA copes with missing data and can separate the regularities from the “noise”, and it is robust to noise in both X and Y. MVDA:

Models X and models Y
Models the relationship between X and Y
Expresses the noise
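The three points above can be written compactly using the usual PLS notation. The symbols are an assumption based on the standard textbook formulation, not something stated in this article: T holds the scores shared by both blocks, P and C hold the X and Y loadings, and E and F are the residuals.

```latex
\begin{aligned}
X &= \mathbf{1}\,\bar{x}^{\top} + T P^{\top} + E \\
Y &= \mathbf{1}\,\bar{y}^{\top} + T C^{\top} + F
\end{aligned}
```

Read this way, the terms T P' and T C' model X and Y, the shared scores T carry the relationship between X and Y, and the residual matrices E and F express the noise.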

Critically, MVDA allows for results to be displayed graphically. A visual representation of the data makes understanding and insight much clearer.

SIMCA^® Supports Omics Research

SIMCA^® multivariate data analysis software provides:

Multivariate projection methods to simplify complex Omics data
Powerful visualization techniques for spotting outliers and patterns (using PCA)
Tools to recognize classes and identify the genes, proteins or metabolites responsible for differences (OPLS^®, OPLS-DA^®)
Automatic pre-processing and scaling of data

Omics Skin in SIMCA^®

An Omics skin is a customized view within SIMCA^® software designed to help people who typically work in various biological fields such as proteomics, genomics, metabolomics, or transcriptomics. The Omics skin graphical user interface (GUI) is specifically designed to help with the complex analysis of biological or gene data obtained through methods such as mass spectrometry.

It’s especially effective as a solution for:

Toxicology studies – finding differences between groups
Managing variation between plates, chips, and gels
Quality control and biological variation

References

[1] Yamada R, Okada D, Wang J, et al. Interpretation of Omics data analyses. J Hum Genet. 2021;66:93–102. https://doi.org/10.1038/s10038-020-0763-5
[2] Perez-Riverol Y, Zorin A, Dass G, Vu MT, Xu P, Glont M, et al. Quantifying the impact of public Omics data. Nat Commun. 2019;10:3512.
[3] Albrecht D, Kniemeyer O, Brakhage AA, Guthke R. Missing values in gel-based proteomics. Proteomics. 2010;10(6):1202–11.
[4] Beale DJ, Pinu FR, Kouremenos KA, Poojary MM, Narayana VK, Boughton BA, et al. Review of recent developments in GC-MS approaches to metabolomics-based research. Metabolomics. 2018;14:152.
[5] Fiehn O, et al. Metabolite profiling for plant functional genomics. Nature Biotechnology. 2000;18:1157–1161.
[6] Nicholson JK, et al. 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica. 1999;29:1181–1189.
