Plexiform Identity’s Kindle Notes & Highlights for Network Medicine: Complex Systems in Human Disease and Therapeutics

Using different -omics data types has the potential to provide mechanistic insights regarding how the impact of genetic variants is biologically transduced to cause disease.

40%

Our knowledge of the molecular networks influenced by disease-related genetic variants is typically fragmentary.

40%

Reconstruction of context-specific gene regulatory networks may be necessary to understand genetic predisposition to complex diseases.

40%

Chen and colleagues performed multiple -omics analysis in blood samples from a single individual repeatedly over more than 1 year of observation. In addition to performing whole-genome DNA sequencing, they obtained repeated assessments of transcriptomics (using RNA-seq), proteomics (including autoantibody profiles), and metabolomics, thus creating an integrative personal -omics profile (iPOP) (Chen, Mias, et al. 2012).

40%

Future complex disease studies will need to consider the value of longitudinal multiple -omics assessments in uncovering the etiologies of those diseases.

40%

One of the key goals of network medicine is to create more meaningful classification systems for complex diseases based on etiology.

40%

Key ongoing challenges in applying such genetic networks include addressing the impact of linkage disequilibrium between SNPs as well as population stratification on network structure and subtype identification.

40%

Pharmacogenetics may assist in identifying individuals likely to benefit from specific pharmacological treatment and in avoiding treatment of individuals at high risk for adverse events.

41%

Early hopes that identifying alterations in the DNA sequence (genetic variants) would lead us quickly to the root cause of human disease or that simply looking at patterns of gene expression could inform us about the functional underpinnings of the phenotypes we observe were quickly dashed.

41%

Encoded within the human genome are approximately 20,000 protein-coding genes, something on the order of fivefold more isoforms, more than 1000 microRNAs, and multiple noncoding epigenomic states, all of which can affect the functioning of the cell.

41%

Much of the progress we have made in understanding disease phenotypes has come from analyzing gene transcriptional data—making static measurements of the abundance of RNA levels for different cellular states and using these data to develop network models representing the dynamical processes driving biological systems.

42%

Fortunately, new DNA-sequencing technologies are allowing the generation of increasingly large and complex datasets comprising multiple -omic assays from individual samples, including genome-sequence data, transcriptomic data, and genome-wide data on patterns of epigenetic modification.

42%

Microarrays quantify the amount of mRNA that is captured, or bound, to a set of complementary sequences (probes) that are themselves attached to a solid substrate.

42%

As new sequencing technologies have become more robust and cost-efficient, the sequencing of RNA (or RNA-seq) has begun to replace microarrays as a means of assessing gene-transcript levels.

42%

While there are advantages and disadvantages to both microarrays and RNA-seq, and the analysis of data from each requires careful preprocessing to eliminate artifacts in order to estimate gene expression levels accurately, both have been widely used in transcriptomic network modeling.

42%

For gene regulatory networks, transcription is the output of an underlying process wherein the concentration of mRNA in a cell or population of cells is mediated by the context-specific behavior of a variety of controlling factors.

42%

A common assumption in this analysis is that genes whose expression was highly correlated across samples were under common regulatory control and hence “coregulated.”

42%

Since these correlation-based similarity matrixes are symmetric across the diagonal, networks generated using these measures are generally undirected. They also include information relating every pair of genes, instead of just relationships between TFs and target genes; thus, regulatory relationships are confounded with coregulatory correlations.

43%
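
As a rough illustration of the correlation-based approach described above, the following sketch builds an undirected co-expression network by thresholding a Pearson correlation matrix. The function name `coexpression_network`, the 0.8 cutoff, and the simulated genes are assumptions made for this example, not details from the book.

```python
import numpy as np

def coexpression_network(expr, threshold=0.8):
    """expr: genes x samples array. Returns (correlation matrix, boolean adjacency)."""
    corr = np.corrcoef(expr)           # Pearson correlation between gene rows
    adj = np.abs(corr) >= threshold    # keep only strongly correlated pairs
    np.fill_diagonal(adj, False)       # drop self-edges
    return corr, adj

# Simulated data (hypothetical): one TF drives two targets; a fourth gene is unrelated.
rng = np.random.default_rng(0)
tf = rng.normal(size=50)
target_a = 2.0 * tf + 0.1 * rng.normal(size=50)
target_b = -1.5 * tf + 0.1 * rng.normal(size=50)
unrelated = rng.normal(size=50)
expr = np.vstack([tf, target_a, target_b, unrelated])

corr, adj = coexpression_network(expr)
```

In this toy setup the two targets end up connected to each other even though neither regulates the other, and the adjacency matrix is symmetric, so no edge carries a direction.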

It soon became obvious that networks created in this way did not accurately represent the underlying regulatory processes. Highly correlated pairs of genes, which were the most common associations in the networks, were likely to be commonly targeted by an upstream TF rather than to regulate each other.

43%

Furthermore, co-expression networks estimated from the Pearson correlation did not retain many of the properties that were already beginning to be associated with biological networks, such as a scale-free degree distribution.

43%

It is important to note that WGCNA is specifically tuned for finding sets of co-expressed genes with greater accuracy rather than modeling the regulatory network connecting those genes.

43%

While linear correlation is a useful measure of relatedness, some scientists recognized that biological interactions may be nonlinear and that these would be missed by simple linear measures such as Pearson correlation.

43%

Linear measures, such as the Pearson correlation, can easily capture the relationship between A and B in the top plot, but for nonlinear relationships, such as the one shown in the bottom plot, a measure such as mutual information (MI) is more informative.

43%
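
The contrast between the two measures can be sketched numerically. The example below uses a simple histogram (plug-in) estimate of mutual information, which is an assumption of this sketch and only one of several possible estimators. The bin count and the simulated parabola are also illustrative choices.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Plug-in estimate of MI (in nats) from a 2-D histogram of x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y, shape (1, bins)
    nz = pxy > 0                               # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
a = rng.uniform(-1, 1, size=2000)
b_parabola = a ** 2 + 0.05 * rng.normal(size=2000)   # nonlinear dependence on a
noise = rng.normal(size=2000)                        # independent of a

r_parabola = np.corrcoef(a, b_parabola)[0, 1]        # close to zero by symmetry
mi_parabola = mutual_information(a, b_parabola)      # clearly above the noise baseline
mi_noise = mutual_information(a, noise)              # small positive estimation bias
```

The histogram estimator is biased upward for independent variables, so the MI of the parabola should be judged against the `mi_noise` baseline rather than against zero.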

Two methods that use mutual information as a starting point to infer gene regulatory networks are Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) and Context Likelihood of Relatedness (CLR).

43%

ARACNe seeks to address this issue by evaluating all such “triads” of nodes in a network, and removes the edge in this triad for which there is the least evidence of direct regulation.

43%
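
A minimal sketch of the triad-pruning idea follows, assuming a precomputed symmetric mutual information matrix in which zero means "no edge". The function name `dpi_prune` and the `tolerance` parameter are illustrative and are not taken from the ARACNe software.

```python
import numpy as np
from itertools import combinations

def dpi_prune(mi, tolerance=0.0):
    """In every closed triplet, drop the edge with the lowest mutual information."""
    n = mi.shape[0]
    remove = np.zeros_like(mi, dtype=bool)
    for i, j, k in combinations(range(n), 3):
        edges = sorted([(mi[i, j], i, j), (mi[i, k], i, k), (mi[j, k], j, k)])
        if edges[0][0] <= 0:
            continue                        # triplet is not fully connected
        weakest, second = edges[0], edges[1]
        if weakest[0] < second[0] * (1 - tolerance):
            _, a, b = weakest
            remove[a, b] = remove[b, a] = True
    pruned = mi.copy()
    pruned[remove] = 0.0                    # apply all removals at the end
    return pruned

# Hypothetical MI values: gene 0 is a TF, genes 1 and 2 are its targets.
mi = np.array([[0.0, 0.9, 0.8],
               [0.9, 0.0, 0.5],
               [0.8, 0.5, 0.0]])
pruned = dpi_prune(mi)
```

Here the weaker target-to-target edge is removed and the two TF-to-target edges survive. Removals are decided against the original matrix and applied together, so the result does not depend on the order in which triplets are visited.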

Unfortunately, ARACNe’s ability to reconstruct useful networks in other contexts has been limited. One reason for this limitation may be a consequence of the algorithm removing all triads in the network, a structure that is important in feedback and feed-forward loops.

43%

Rather than pruning specific edges by comparing triads, CLR instead prunes edges based on local structure in the mutual information by normalizing the mutual information matrix by recasting it into z-score units.

43%
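
The z-score normalization can be sketched as follows. Each MI value is compared with the background distribution of MI values for both genes involved, and the two non-negative z-scores are combined. The zeroed diagonal and the simulated matrix are assumptions of this example.

```python
import numpy as np

def clr(mi):
    """Recast a symmetric MI matrix into combined z-score units."""
    mu = mi.mean(axis=1, keepdims=True)       # background mean for each gene
    sigma = mi.std(axis=1, keepdims=True)     # background spread for each gene
    z = np.maximum(0.0, (mi - mu) / sigma)    # z[i, j]: MI_ij against gene i's background
    return np.sqrt(z ** 2 + z.T ** 2)         # combine gene i's and gene j's view

# Simulated MI matrix (hypothetical): weak background plus one strong pair (0, 1).
rng = np.random.default_rng(2)
base = rng.uniform(0.1, 0.2, size=(6, 6))
mi = (base + base.T) / 2
np.fill_diagonal(mi, 0.0)
mi[0, 1] = mi[1, 0] = 0.9

score = clr(mi)
top_pair = np.unravel_index(score.argmax(), score.shape)
```

An edge scores highly only when its MI stands out against the typical MI of both of its endpoints, which is the sense in which pruning depends on local structure.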

In their paper, Faith and colleagues demonstrated that CLR outperformed ARACNe in a benchmark Escherichia coli gene expression dataset.

43%

Despite their limitations, both CLR and ARACNe have been applied to the reconstruction of networks in many varied systems and remain well cited in the field of transcriptomics.

43%

Co-expression networks may well capture direct regulatory relationships, but these cannot be distinguished from indirect associations based on similarity of expression patterns. The result is often a series of many-to-many associations between genes with correlated expression in which the strongest associations are not necessarily those that are most relevant to understanding regulatory processes.

43%

Statistical methods have also been adapted for use in the reconstruction of gene regulatory network models. One of the main motivations for using these is that the score predicted for each edge in the network has a probabilistic interpretation with weights and errors.

43%

Statistical approaches for modeling gene regulatory networks generally fall into two main classes.

43%

The first frames network inference as a series of regression problems wherein the expression level of each target gene is predicted by a combination of the expression levels across a set of potential upstream regulators. The second casts the problem of finding regulators as a classification problem in which new targets of a TF are predicted by comparing each potential target gene’s expression profile to the profiles of known “true” and “false” targets.

43%

Regression approaches generally employ a resampling scheme, such as bootstrapping, to determine a score for each regulatory interaction that assesses the probability that the coefficients in the regression equation (w_t) are nonzero.

43%
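
The resampling idea can be illustrated with a small sketch. Published methods typically use sparse or regularized regression. As a stand-in, this example uses ordinary least squares and counts how often each coefficient exceeds a cutoff across bootstrap resamples. The function name, the cutoff, and the simulated data are all assumptions.

```python
import numpy as np

def bootstrap_edge_scores(regulators, target, n_boot=200, min_abs_coef=0.2, seed=0):
    """regulators: samples x regulators; target: samples.
    Returns, per regulator, the fraction of resamples with |coefficient| > cutoff."""
    rng = np.random.default_rng(seed)
    n = len(target)
    hits = np.zeros(regulators.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                     # resample with replacement
        X = np.column_stack([np.ones(n), regulators[idx]])   # intercept + regulators
        coef, *_ = np.linalg.lstsq(X, target[idx], rcond=None)
        hits += np.abs(coef[1:]) > min_abs_coef              # ignore the intercept
    return hits / n_boot

# Simulated data (hypothetical): only regulator 0 influences the target gene.
rng = np.random.default_rng(5)
regulators = rng.normal(size=(60, 3))
target = 1.5 * regulators[:, 0] + 0.3 * rng.normal(size=60)

scores = bootstrap_edge_scores(regulators, target)
```

Repeating this for every target gene gives one score per candidate regulator-target pair. The scores can then be read as rough confidence values for each edge.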

In contrast to regression-based approaches that try to predict the regulators of each gene, classification approaches look at the problem from the opposite direction and try to predict the targets of each TF by conceptualizing regulatory network reconstruction as a feature-selection model.

43%
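
To make the classification framing concrete, the sketch below scores candidate genes by comparing their expression profiles with the average profiles of known true and known false targets of a single TF. This nearest-centroid rule is a deliberately simple stand-in chosen for illustration. The highlighted text does not specify a classifier, and all names and data here are hypothetical.

```python
import numpy as np

def predict_targets(candidates, known_true, known_false):
    """Rows are expression profiles across samples.
    Returns one score per candidate; higher means more like the known true targets."""
    pos = known_true.mean(axis=0)      # centroid of known targets
    neg = known_false.mean(axis=0)     # centroid of known non-targets

    def corr_to(profile, centroid):
        return np.corrcoef(profile, centroid)[0, 1]

    return np.array([corr_to(c, pos) - corr_to(c, neg) for c in candidates])

# Simulated data (hypothetical): true targets track the TF, false targets do not.
rng = np.random.default_rng(3)
tf = rng.normal(size=100)
known_true = tf + 0.3 * rng.normal(size=(5, 100))
known_false = rng.normal(size=(5, 100))
candidates = np.vstack([tf + 0.3 * rng.normal(size=100),   # behaves like a target
                        rng.normal(size=100)])             # behaves like a non-target

scores = predict_targets(candidates, known_true, known_false)
```

The quality of such predictions depends directly on the labeled examples, which is why these methods lean so heavily on prior information.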

Classification methods rely heavily on the “prior” information used to build regulatory network predictions.

44%

Further, as explicated below, despite the large increase in genomic information over the past decade, only a subset of all known TF regulators have high-quality, condition-specific, validated regulatory interactions.

44%

One limitation of both regression-based and classification-based approaches is that after predicting each gene’s regulators, or each TF’s targets, it is necessary to perform a postprocessing step to stitch together these sets of predictions into a global network.

44%

Bayesian networks represent an alternative approach to network modeling that requires edges to be directed. Formally, a Bayesian network is a directed acyclic graph (DAG) whose vertices are random variables X1, …, Xn that are probabilistic, can be discrete or continuous, and describe variation across conditions. In this context, each variable has a conditional distribution given its parents P(Xi|Parents(Xi)) and is independent of its nondescendants given its parents. Consequently, Bayesian networks allow only dependencies between a node and its parents, and conditional independence statements ...

44%
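
The definition above implies that the joint distribution factorizes into one conditional term per node, P(Xi|Parents(Xi)). The following sketch checks this on a toy three-node DAG with binary variables. The node names and probability values are invented for illustration.

```python
from itertools import product

# Toy DAG (hypothetical): TF -> G1 and TF -> G2, all variables binary (0 = off, 1 = on).
parents = {"TF": (), "G1": ("TF",), "G2": ("TF",)}

# P(node = 1 | parent values); the numbers are made up.
p_on = {
    "TF": {(): 0.3},
    "G1": {(0,): 0.1, (1,): 0.9},
    "G2": {(0,): 0.2, (1,): 0.7},
}

def joint(assign):
    """Joint probability as the product of P(Xi | Parents(Xi)) over all nodes."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_on[node][tuple(assign[p] for p in pa)]
        prob *= p1 if assign[node] == 1 else 1.0 - p1
    return prob

nodes = list(parents)
table = {vals: joint(dict(zip(nodes, vals)))
         for vals in product((0, 1), repeat=len(nodes))}
total = sum(table.values())

# G1 and G2 are independent once TF is known: P(G1=1, G2=1 | TF=1) = 0.9 * 0.7
p_both_given_tf = table[(1, 1, 1)] / p_on["TF"][()]
```

Because G1 and G2 share only the parent TF, their joint probability given TF is the product of their individual conditionals. This is the conditional independence property stated in the definition.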

Part of the attraction of these models is that the edges do not necessarily represent direct interactions but can represent the influence of a number of undetected genes, proteins, or metabolites that, in many ways, allow us to overcome the imperfect knowledge of the relationships that exist in the systems we study and incompleteness in the experimental data.

44%

However, application of Bayesian network analysis to more “realistic” datasets (such as tumor vs. normal, treated vs. control) failed to provide similarly useful insights and, as a result, is rarely used in analysis of expression profiling data.

44%

The most significant reason for this is the computational complexity of learning the structure of the networks, a problem that has been shown to be NP-hard (nondeterministic polynomial time hard) (Chickering 1996), implying that no efficient exact algorithm is known and that exact solutions are impractical for networks of realistic size.

44%
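
One way to get a feel for the difficulty is to count how many candidate structures exist. The sketch below uses Robinson's recurrence for the number of labeled DAGs on n nodes. The size of the search space is not itself the NP-hardness result, but it shows why exhaustive search over structures is hopeless beyond a handful of genes.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

small_counts = [count_dags(n) for n in range(6)]   # [1, 1, 3, 25, 543, 29281]
digits_for_20_genes = len(str(count_dags(20)))     # number of decimal digits
```

Five nodes already admit 29,281 structures, and the count grows super-exponentially from there, which is why practical structure learning relies on heuristics and prior knowledge.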

For example, Wolpert and Macready (1997) noted that the use of domain-specific knowledge can provide a useful bias that leads to near-optimal solutions in exploring the state space of a particular problem.

44%

Although Bayesian networks allow high resolution of correlation structure in large datasets, they are fundamentally acyclic graphs and therefore cannot include feedback loops that are important for many biological processes, including the cell-cycle processes that Friedman and colleagues first studied.

44%

It has become increasingly clear that inferring regulatory networks from gene expression data alone results in, at best, an incomplete model.

44%

Other methods include gene expression information when doing enhancer mapping in an attempt to incorporate even more distal enhancers that may be regulating a target gene; these complex methods are much more computationally intensive and do not lead to a significant improvement in functional predictions based on validation experiments.

44%

PANDA (Passing Attributes between Networks for Data Assimilation; Figure 8–7) (Glass, Huttenhower, et al. 2013) is a promising new method that borrows an idea called message passing (or affinity propagation) from communication theory (Frey and Dueck 2007) to integrate diverse sources of genomic data and to model the flow of information in complex regulatory networks.

44%

One key feature of PANDA is its emphasis on agreement between data elements in a network neighborhood.

45%

Because PANDA considers multiple types of relationships between both regulators and their targets, the method can incorporate multiple independent data sources.

45%

Although a wealth of gene expression data has been generated over the past decade, most biological inference has been based on statistical tests at the level of individual genes (with high rates of spurious associations) followed by functional meta-analysis using gene set enrichment techniques.

Network Medicine by Joseph Loscalzo