Assessment criteria include paper format, content, literature cited, and article choice. In addition to the three articles used for analysis, supporting references should also be used. A minimum of one unique supporting article is required in each of the Introduction and Conclusion sections. A student example paper has been uploaded to Blackboard. This paper is well done and hits almost all the “Meets expectations” criteria in the rubric.

Summary
sums up the strengths and weaknesses of each article; compares and contrasts articles to one another
summary lacks some details, misses a strength/weakness of one of the articles; occasionally compares and contrasts but mainly lists each article's strengths and weaknesses separately
summary lacks detail, includes either strengths or weaknesses; fails to compare and contrast articles to one another

Significance
establishes practical and theoretical significance of body of work; has your chosen article been cited by others; did your articles spark other researchers' hypotheses or questions; are there any practical applications; implication (social, political, technological, medical) to the research; cites at least one other supporting reference (unique from introduction)
logic not clear to the theoretical significance of the body of work; not thorough in establishing its significance; cites at least one other supporting reference (unique from introduction)
no connection to theoretical significance of body of work; fails to cite at least one supporting reference

Literature cited

Format
one journal format chosen and used throughout in bibliography and in-text citations
some in-text citations were not in the same format; 1-2 errors in bibliography
consistency lacking for in-text citations; bibliography with 3+ formatting errors

Subject
chosen articles were all on the same topic; topic was specific enough so that an analysis was possible
topics were not consistent or were too broad/general
each article was on a separate topic and the topics were without reasonable similarities

Citation
each reference was used and cited correctly within the body of the paper; three focal references were analyzed; at least 5 references used
references were occasionally cited incorrectly; three focal references were analyzed; 4 total references used
two or fewer references were analyzed; no supporting references used

Quantity
minimum of 5, 1 unique to intro, 1 unique to discussion and 3 critically reviewed
missing 1 unique
missing 2 unique and/or 1 of critically reviewed

Host-Microbe Coevolution: Applying Evidence from Model
Systems to Complex Marine Invertebrate Holobionts

Paul A. O’Brien,a,b,c Nicole S. Webster,b,c,d David J. Miller,e,f David G. Bourne,a,b,c

aCollege of Science and Engineering, James Cook University, Townsville, QLD, Australia
bAustralian Institute of Marine Science, Townsville, QLD, Australia
[email protected], Townsville, QLD, Australia
dAustralian Centre for Ecogenomics, University of Queensland, Brisbane, QLD, Australia
eARC Centre of Excellence for Coral Reef Studies, James Cook University, Townsville, QLD, Australia
fCentre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, QLD, Australia

ABSTRACT Marine invertebrates often host diverse microbial communities, making
it difficult to identify important symbionts and to understand how these communi-
ties are structured. This complexity has also made it challenging to assign microbial
functions and to unravel the myriad of interactions among the microbiota. Here we
propose to address these issues by applying evidence from model systems of host-
microbe coevolution to complex marine invertebrate microbiomes. Coevolution is
the reciprocal adaptation of one lineage in response to another and can occur
through the interaction of a host and its beneficial symbiont. A classic indicator of
coevolution is codivergence of host and microbe, and evidence of this is found in
both corals and sponges. Metabolic collaboration between host and microbe is of-
ten linked to codivergence and appears likely in complex holobionts, where micro-
bial symbionts can interact with host cells through production and degradation of
metabolic compounds. Neutral models are also useful to distinguish selected mi-
crobes against a background population consisting predominately of random associ-
ates. Enhanced understanding of the interactions between marine invertebrates and
their microbial communities is urgently required as coral reefs face unprecedented
local and global pressures and as active restoration approaches, including manipula-
tion of the microbiome, are proposed to improve the health and tolerance of reef
species. On the basis of a detailed review of the literature, we propose three re-
search criteria for examining coevolution in marine invertebrates: (i) identifying sto-
chastic and deterministic components of the microbiome, (ii) assessing codivergence
of host and microbe, and (iii) confirming the intimate association based on shared
metabolic function.

KEYWORDS codivergence, coevolution, marine invertebrates, microbiome,
phylosymbiosis

Coevolution theory dates back to the 19th century (box 1), and coevolution is currently
referred to as the reciprocal evolution of one lineage in response to
another (1). This definition encompasses a broad range of interactions such as predator-prey,
host-symbiont, and host-parasite interactions or interactions among the members
of a community of organisms such as a host and its associated microbiome (1, 2). In the
case of host-microbe associations, this has produced some of the most remarkable
evolutionary outcomes that have shaped life on Earth, such as the eukaryotic cell,
multicellularity, and the development of organ systems (3, 4). It is now recognized that
microbial associations with a multicellular host represent the rule rather than the
exception (4), but in complex associations of that kind, the extent to which coevolution
operates is often unclear.

Citation O’Brien PA, Webster NS, Miller DJ, Bourne DG. 2019. Host-microbe coevolution:
applying evidence from model systems to complex marine invertebrate holobionts. mBio
10:e02241-18. https://doi.org/10.1128/mBio.02241-18.

Editor Danielle A. Garsin, University of Texas Health Science Center at Houston

Copyright © 2019 O’Brien et al. This is an open-access article distributed under the terms
of the Creative Commons Attribution 4.0 International license.

Address correspondence to David G. Bourne,
[email protected]

Published 5 February 2019

MINIREVIEW
Host-Microbe Biology

January/February 2019 Volume 10 Issue 1 e02241-18 ® mbio.asm.org 1

Downloaded from https://journals.asm.org/journal/mbio on 02 February 2022 by 24.116.251.222.

BOX 1: A BRIEF HISTORY OF COEVOLUTION
Charles Darwin once described the sudden and rapid diversification of flowering
plants as an “abominable mystery,” since it could not be explained by traditional
views of evolution alone (5). While his correspondent Gaston de Saporta speculated
that a biological interaction between flowering plants and insects might be the
cause of the phenomenon, it was not until nearly 100 years later that the concept of
coevolution developed. In a pioneering study, Ehrlich and Raven (6) observed that
related groups of butterflies were feeding on related groups of plants and specu-
lated this was due to a process for which they coined the name “coevolution.” Using
butterflies, they argued that plants had evolved mechanisms to overcome predation
from herbivores, which in turn had evolved new ways to prey on plants. Decades on,
the introduction of phylogenetics has shown that plants evolved in the absence of
butterflies, which colonized the diverse group of plants after their chemical defenses
were already in place (7). Nevertheless, the theory of coevolution was endorsed, and
two important points came to light. First, care must be taken when inferring
coevolution from seemingly parallel lines of evolution, and where possible, diver-
gence times and common ancestry should be included. Second, coevolution can
occur between communities of organisms (“guild” coevolution), as observed in the
case of flowering plants, where predation and pollination from a wide variety of
insects likely influenced the diversification of angiosperms (8).

Since coevolution can occur across multiple levels of interactions, multiple theories
have also developed. The Red Queen theory is based on the concept of antagonistic
coevolution and assumes that an adaptation that increases the fitness of one species
will come at the cost to the fitness of another (9). This type of coevolution has been
most pronounced in host-parasite interactions, where the antagonistic interactions are
closely coupled (10). However, coevolutionary patterns may also arise in the case of
mutualistic symbioses, which require reciprocal adaptations to the benefit of each
partner (11). Mutualistic coevolution is associated with a number of key traits that are
discussed further in this review, such as obligate symbiosis, vertical inheritance, and
metabolic collaboration. Third, coevolution has also recently been placed in context of
the hologenome theory (12), which suggests that the holobiont can act as a unit of
selection (but not necessarily as the primary unit) since the combined genomes
influence the host phenotype on which selection may operate (13, 14). However,
hologenome theory also acknowledges that selection acts on each component of the
holobiont individually as well as in combination with other components (including the
host). Thus, the entity that is the hologenome may be formed, in part, through
coevolution of interacting holobiont compartments, in addition to neutral processes
(12).

Given the ubiquitous nature of host-microbe associations and the huge metabolic
potential that microorganisms represent, it is not surprising that evidence of host-
microbe coevolution is emerging. Model representatives of both simple and complex
associations are being used to study coevolution, allowing researchers to look for
specific traits, signals, and patterns (1, 15). A well-known model system is the pea aphid
and its endosymbiotic bacteria in the genus Buchnera. This insect has evolved specialized
cells known as bacteriocytes to host its endosymbionts, which in turn synthesize
and translocate amino acids that are missing from the diet of the pea aphids (16).
Amino acid synthesis occurs through intimate cooperation between host and symbiont,
with some pathways missing from the host and some from the symbiont, such that the
relationship is obligate to the extent that one organism cannot survive without the
other (17). Among complex systems, the human gut microbiome has been extensively studied
and has been shown to be intimately associated with human health. Gut
microbes have been shown to be linked with human behavior and development
through metabolic processes, such as microbial regulation of the essential amino acid
tryptophan (18, 19). The human microbiome contains around 150-fold more nonre-
dundant genes than the human genome (20), and the metabolic capacity of microbes
residing in the intestine is believed to have been a driving evolutionary force in the
host-microbe coevolution of humans (2). In these examples, as well as many others
(21–23), both host and symbiont evolved to maintain and facilitate the symbiosis.
Furthermore, phylogenies of host and symbiont in these systems are often mirrored,
indicating that host and symbiont are diverging in parallel (16, 24, 25), a phenomenon
known as codivergence (26).
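
The complementarity described above for the pea aphid, where some pathway steps are missing from the host and others from the symbiont, can be expressed as a simple set comparison. The sketch below is illustrative only. The two-step cysteine pathway and its enzyme names are borrowed from the hypothetical scenario in Fig. 2d, and the enzyme inventories assigned to each partner, as well as the function names, are assumptions made for the example rather than data from any genome.

```python
# Hypothetical two-step pathway; enzyme names follow the Fig. 2d scenario.
PATHWAY = ["cystathionine beta-synthase", "cystathionine gamma-lyase"]


def missing_steps(pathway, enzymes):
    """Return the pathway steps that are absent from an enzyme inventory."""
    return [step for step in pathway if step not in enzymes]


def is_collaborative(pathway, host, symbiont):
    """True when neither partner completes the pathway alone but both together do."""
    host_incomplete = bool(missing_steps(pathway, host))
    symbiont_incomplete = bool(missing_steps(pathway, symbiont))
    together_complete = not missing_steps(pathway, host | symbiont)
    return host_incomplete and symbiont_incomplete and together_complete


# Assumed inventories (illustrative, not measured).
host_enzymes = {"cystathionine gamma-lyase"}
symbiont_enzymes = {"cystathionine beta-synthase"}

print("Host lacks:", missing_steps(PATHWAY, host_enzymes))
print("Collaborative:", is_collaborative(PATHWAY, host_enzymes, symbiont_enzymes))
```

A pathway that the host can complete on its own would not be flagged, which mirrors the point that shared metabolism is only suggestive of an intimate association when each partner depends on the other.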

In the marine environment, invertebrates can host microbial communities as simple
and stable as that of the pea aphid or as complex and dynamic as that of the human
gut (Fig. 1). The Hawaiian bobtail squid, for example, maintains an exclusive symbiosis
with a single bacterial symbiont which it hosts within a specialized light organ (27). On
the other hand, corals host enormously diverse microbial communities, comprising
thousands of species-level operational taxonomic units (OTUs), which are often influ-
enced by season, location, host health, and host genotype (28–31). Marine sponges also
host complex microbial communities with diversity comparable to that of corals (32)
but with associations that are generally far more stable in space and time (33).
Less-diverse microbial communities are found in the sea anemone Aiptasia, where the
number of OTUs is generally in the low hundreds (34). Due to the close taxonomic
relationship of Aiptasia with coral and its comparatively simple microbial community, it
has been proposed as a model organism for studying coral microbiology and symbiosis
(34). Some marine invertebrates also include species along a continuum of microbial
diversities. Ascidians, for example, have been shown to host fewer than 10 (Polycarpa
aurata) or close to 500 (Didemnum sp.) microbial OTUs within their inner tunic (35).
Furthermore, species with low microbial diversity such as P. aurata can exhibit high
intraspecific variation, with as few as 8% of OTUs shared among individuals of the same
species (35). Taken together, the data from those studies highlight the vast spectrum
of associations that marine invertebrates form with microbial communities in terms of
diversity, composition, and stability (Fig. 1).
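
Intraspecific variation of the kind quoted above (for example, a small percentage of OTUs shared among individuals) can be summarized in several ways. The sketch below uses one possible definition, the number of OTUs found in every individual divided by the number of OTUs found in any individual. This definition is an assumption for illustration; the text does not state how the cited study (35) calculated its figure, and the OTU labels are invented.

```python
def shared_fraction(samples):
    """Fraction of all observed OTUs that occur in every sample."""
    otu_sets = [set(sample) for sample in samples]
    if not otu_sets:
        raise ValueError("at least one sample is required")
    union = set.union(*otu_sets)
    if not union:
        return 0.0
    core = set.intersection(*otu_sets)
    return len(core) / len(union)


# Hypothetical OTU lists for three individuals of one host species.
individuals = [
    {"otu1", "otu2", "otu3"},
    {"otu1", "otu4", "otu5"},
    {"otu1", "otu2", "otu6"},
]
print(f"Shared fraction: {shared_fraction(individuals):.3f}")
```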

While previous research has provided a good understanding of the composition of
marine invertebrate microbiomes, our understanding of how the microbiome interacts
with the host, and of the potential to coevolve, is far more limited. Moreover, the
increasing number of studies generating tremendous volumes of host-associated
microbiome sequence data requires theoretical development to interpret these
relationships. Coevolved microbial symbionts are presumed to be intimately linked with
host fitness and metabolism (36); therefore, understanding these relationships in
marine invertebrates will have direct implications for health and disease processes in
these animals. Three research criteria arise for examining coevolution in marine
invertebrates: (i) identifying stochastic and deterministic microbial components of the
microbiome, (ii) assessing codivergence of host and microbe, and (iii) confirming an
intimate association between host and microbe related to shared metabolic function
(metabolic collaboration). While each of these criteria may be fulfilled without the
involvement of coevolution (26, 37, 38), evidence of their existence in combination
provides a strong basis for establishing coevolution patterns (Fig. 2). This review
positions these three criteria in coevolution as representing a complementary approach
to the study of complex marine invertebrate microbiomes by drawing from examples
of model systems. Focussing on keystone coral reef invertebrates, this review also
evaluates the current evidence for each criterion. Finally, while parasites and pathogens
also contribute to host coevolution, the focus of this review is mutualistic symbionts;
thus, pathogens and parasitism are not discussed.

FIG 1 Spectrum of microbial diversity associated with different compartments of marine invertebrates. Microbial associations may involve a single symbiont
in a specialized organ or over 1,000 operational taxonomic units (OTUs) associated with tissues. The levels of OTUs reported in the figure represent the highest
recorded in the referenced study for that species. Reported levels of diversity may differ significantly within the same species across different studies.

BOX 2: GLOSSARY
(i) Codivergence. Two organisms which speciate or diverge in parallel, as illustrated
by topological congruency of phylogenetic trees.
(ii) Coevolution. Reciprocal adaptation of one (or more) lineage(s) in response to
another (or others).
(iii) Holobiont. A host organism and its associated microbial community.
(iv) Hologenome. The collective genomes of a host and its associated microbial
community, which may act as a unit of selection or at discrete levels.
(v) Metabolic collaboration. Two or more organisms that are linked through
metabolic interactions, generally to the benefit of one another.
(vi) Metagenome. The collective microbial genes recovered from an environmental
sample, usually predominantly prokaryotic.
(vii) Metatranscriptomics. Quantification of the total microbial mRNA in a sample
as an indication of gene expression and active microbial functions.
(viii) Microbiome. The total genetic make-up of a microbial community associated
with a habitat.
(ix) Microbiota. The community of microorganisms residing in a particular habitat,
usually a host organism.
(x) Phylosymbiosis. The retention of a host phylogenetic signal within its
associated microbial community.
(xi) Virome. The total viral genetic content recovered from an environmental
sample.

UNTANGLING PATTERNS OF HOST-MICROBE COEVOLUTION IN A WEB OF
MICROBES

(i) Phylosymbiosis and neutral theory—identifying stochastic and determinis-
tic components of the microbiome. Host-microbe coevolution may occur to some
degree at the level of the hologenome, i.e., reciprocal evolution of the host genome
and microbiome (12). Therefore, it is necessary to understand microbial community
structure and population dynamics within the host environment. This may illustrate (i)
that the microbiome associated with a host is structured through phylogenetically
related host traits and may therefore retain a host phylogenetic signal (phylosymbiosis)
and (ii) that certain microbes deviate from the expected patterns of neutral population
dynamics, i.e., stochastic births and deaths and immigration. It is likely that phylosym-
biosis and neutral population dynamics are linked; therefore, their potential to con-
tribute to coevolution is discussed together.

[Figure 2. Panels: (a) host phylogeny alongside microbial dendrogram for host species A to D; (b) frequency of microbes in host (0 to 1) plotted against relative abundance of microbes in host sample (low to high), with Bacteria spp. 1 and Bacteria spp. 2 marked; (c) host phylogeny alongside microbial phylogeny of Bacteria spp. 1 for host species A to D; (d) pathway shared by Host spp. A and Bacteria spp. 1: homocysteine + serine (host diet and metabolism), cystathionine β-synthase (symbiont enzyme), cystathionine, cystathionine γ-lyase (host enzyme), cysteine.]

FIG 2 Hypothetical scenario addressing three criteria for host-microbe coevolution in species A to D. (a) Phylosymbiosis
shown through hierarchical clustering of the microbial community, resulting in a microbial dendrogram which mirrors host
phylogeny. (b) Neutral model showing the expected occurrence of microbes based on neutral population dynamics (blue
line). As the relative abundance increases, so too does the occurrence in host samples. The members of bacterial species
group 1 (Bacteria spp. 1) are therefore more abundant than would be expected by chance and may indicate active selection,
while the members of Bacteria spp. 2 are less abundant. (c) Codivergence of the members of Bacteria spp. 1 with their hosts.
The members of Bacteria spp. 1 are found within the microbial community of each host species and appear to be actively
selected for. Their phylogeny indicates a host split at the strain level followed by diversification within each host species.
Congruence between host and microbial lineages suggests important host-microbe interactions and warrants further
investigation. (d) Metabolic collaboration between the members of Host spp. A and those of Bacteria spp. 1. Fluorescence
in-situ hybridization (FISH) confirms that the members of Bacteria spp. 1 are located within bacteriocyte cells in the tissues
of Host spp. A. Genome and transcriptome data for each species suggest that the amino acid cysteine is produced by the
activity of a metabolic pathway shared between host and microbe. In corals of the genus Acropora, for example, the
genome is incomplete with respect to biosynthesis of cysteine and represents a potential pathway for collaborations of host
and microbe (101). Hypothetically, the amino acids homocysteine and serine (potentially sourced from host diet and
metabolism) are combined to form cystathionine through the enzyme cystathionine β-synthase (provided by the host’s
endosymbiont). The host enzyme cystathionine γ-lyase then breaks down cystathionine to form cysteine.

The term “phylosymbiosis” is not intended to imply coevolution (12, 38); however,
coevolution of a host and microbiome may reinforce patterns of phylosymbiosis. There
are many host traits that correlate with host phylogeny, some of which can act as
environmental filters, preventing the establishment of microbes in the host environ-
ment. Thus, neutral population dynamics, with host traits acting as an ecological filter
to microbial immigration, may be sufficient to result in phylosymbiotic patterns (39, 40).
However, host traits are not static; thus, the evolution of these microbial niches may
further drive the radiation of the microbes that reside within them. In turn, the
continuous colonization over many generations of a microbial community likely adds
to the selective pressure on host traits. Therefore, ecological filtering of microbes
through host traits and coevolution of a host and microbiome need not be mutually
exclusive in the appearance of phylosymbiosis (39). Moreover, assessing patterns of
phylosymbiosis and neutral population dynamics also allows the detection of microbes
that deviate from these patterns and may identify important microbial species that are
actively selected for (or against) by the host. In this context, neutral models can
simulate expected microbial abundance, allowing easier detection of microbes that do
not fit these patterns (41). This reasoning justifies consideration of phylosymbiosis and
microbial population dynamics in assessing coevolution in complex holobionts.
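
As a rough illustration of how a neutral expectation can flag candidate microbes, the sketch below compares how often a taxon is observed across host samples with how often it would be detected under purely random sampling. It is a deliberately simplified stand-in and not the neutral model fitted in the cited studies (40, 41): it assumes that a taxon with relative abundance p in the source community is detected in a sample of n reads with probability 1 - (1 - p)^n, and it ignores dispersal limitation. The taxa, their values, the read depth, and the tolerance of 0.2 are all hypothetical.

```python
def expected_occurrence(rel_abundance, reads_per_sample):
    """Chance that a taxon is detected at least once when reads are drawn at random."""
    if not 0.0 <= rel_abundance <= 1.0:
        raise ValueError("relative abundance must lie in [0, 1]")
    return 1.0 - (1.0 - rel_abundance) ** reads_per_sample


def classify(taxa, reads_per_sample, tolerance=0.2):
    """Label each taxon by comparing observed and expected occurrence frequency."""
    labels = {}
    for name, (rel_abundance, observed) in taxa.items():
        expected = expected_occurrence(rel_abundance, reads_per_sample)
        if observed > expected + tolerance:
            labels[name] = "above neutral expectation"
        elif observed < expected - tolerance:
            labels[name] = "below neutral expectation"
        else:
            labels[name] = "consistent with neutral expectation"
    return labels


# Hypothetical taxa: (mean relative abundance, fraction of host samples where detected).
taxa = {
    "Bacteria spp. 1": (0.0005, 0.95),
    "Bacteria spp. 2": (0.01, 0.30),
    "Bacteria spp. 3": (0.002, 0.85),
}
labels = classify(taxa, reads_per_sample=1000)
for name, label in labels.items():
    print(name, "->", label)
```

Taxa that occur far more often than random sampling predicts are candidates for active selection by the host, while those that occur far less often may be selected against; either group would then be a target for the codivergence and metabolic analyses discussed later.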

Patterns of phylosymbiosis are frequently detected in complex holobionts. One
particular study tested for phylosymbiosis across 24 species of terrestrial animals from
4 groups that included Peromyscus deer mice, Drosophila flies, mosquitos, and Nasonia
wasps and an additional data set of 7 hominid species (42). Since these animals (with
the exception of hominids) could be reared under controlled laboratory conditions,
environmental influences could be eliminated, leaving the host as the sole factor
influencing the microbial community. Under these conditions, phylosymbiotic patterns
were clearly observed for all five groups, with phylogenetically related taxa sharing
similar microbial communities and microbial dendrograms mirroring host phylogenies.
Similar patterns of phylosymbiosis have been observed in a growing number of
terrestrial systems, including all five gut regions in rodents (43), the skin of ungulates
(44), the distal gut in hominids (45), and roots of multiple plant phyla (46), providing
evidence that such patterns are common among host-associated microbiomes.

In the marine environment, two major studies, one involving 236 colonies across 32
genera of scleractinian coral collected from the east and west coasts of Australia (47)
and the other involving 804 samples of 81 sponge species collected from the Atlantic
Ocean, Pacific Ocean, and Indian Ocean and the Mediterranean Sea and Red Sea (32),
have provided the most convincing examples of phylosymbiosis. Both studies found a
significant evolutionary signal of the host with respect to microbial diversity and
composition. Specifically, Mantel tests were used to show that closely
related corals and sponges hosted microbial communities that were more similar in
composition than would be expected by chance. In the case of corals, the
similarity was seen in the skeleton and, to a lesser extent, in the tissue microbiome,
while the mucus microbiome was more highly influenced by the surrounding environ-
ment (47). However, both studies found that host species was the strongest factor in
explaining dissimilarity among microbial communities. Additional studies on both cold
water and tropical sponges have found similar phylogenetic patterns within the
microbiome of the host species (48, 49). Together, these results suggest that host
phylogeny (or associated traits) has a significant role in structuring associated microbial
communities, although there are additional factors related to host identity (and unre-
lated to phylogeny) that also likely play a major role.
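
For readers unfamiliar with the Mantel test mentioned above, the following is a minimal permutation-based sketch. It correlates the pairwise distances in two matrices (here, host phylogenetic distance and microbiome dissimilarity) and asks how often a random relabelling of one matrix gives a correlation at least as strong. The matrices are invented for five hypothetical hosts, the function names are assumptions, and the cited studies (32, 47) may have used different distance measures, software, and settings.

```python
import math
import random


def upper_triangle(matrix, order):
    """Flatten the upper triangle of a square matrix after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances must not all be identical")
    return cov / math.sqrt(var_x * var_y)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, p) for a one-sided Mantel test of positive association."""
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    observed = pearson(flat_a, upper_triangle(dist_b, identity))
    rng = random.Random(seed)
    at_least_as_large = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if pearson(flat_a, upper_triangle(dist_b, order)) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)


# Hypothetical distances among five hosts (A to E).
host_dist = [
    [0, 2, 6, 6, 8],
    [2, 0, 6, 6, 8],
    [6, 6, 0, 4, 8],
    [6, 6, 4, 0, 8],
    [8, 8, 8, 8, 0],
]
microbiome_dist = [
    [0.0, 0.2, 0.6, 0.7, 0.9],
    [0.2, 0.0, 0.5, 0.6, 0.8],
    [0.6, 0.5, 0.0, 0.3, 0.9],
    [0.7, 0.6, 0.3, 0.0, 0.8],
    [0.9, 0.8, 0.9, 0.8, 0.0],
]
r, p = mantel(host_dist, microbiome_dist)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

Whole rows and columns are permuted together, rather than individual cells, because the pairwise distances within a matrix are not independent of one another.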

Most studies to date have focused on the microbes that adhere to these patterns of
phylosymbiosis, though more-useful information arguably could be determined from
the microbes that do not. Since phylosymbiosis is a pattern that shows correlations
between microbiome dissimilarity and host phylogeny, it does not indicate active
microbial selection or cospeciation (38), and the species that deviate from these
patterns would be interesting targets for studies of codivergence and metabolic
collaboration (see below). Neutral models have been applied to three species of
sponges, a jellyfish, and a sea anemone, and while neutral models have been shown to
fit well to the expectation of microbial abundance in sponges (which also show
phylosymbiosis), jellyfish and sea anemone microbiomes were found to be associated
with a higher level of nonneutrality (40). Potential reasons for nonneutrality include a
more sophisticated immune system in cnidarians that actively selects certain microbial
taxa, microbiomes that are more transient in these hosts, or a combination of the two.
In summary, neutral population dynamics filtered
through phylogenetically related host traits likely result in, or at least contribute to, the
observed patterns of phylosymbiosis. This does not necessarily mean that the pattern
is unimportant or is not contributing to coevolution at the hologenome level, and it
may be that the communities of microbes that follow these patterns are responsible for
broad ecological functions (50). On the other hand, microbes that deviate from these
patterns may be responsible for more-specific functions and are of high interest to
those trying to identify symbionts and coevolution at the microbial species or strain
level.

(ii) Codivergence—microbial phylogeny and host phylogeny are congruent.
The second criterion in assessing host-microbe coevolution is that of whether individ-
ual microbial lineages and their hosts have matching phylogenies (22, 24, 51). Codi-
vergence implies a tightly coupled, long-term interaction between two species and can
potentially identify beneficial symbionts (or parasites) that have coevolved with the
host (26). However, it is also important to note that codivergence can arise due to processes
other than coevolution, such as one species adaptively tracking another, which would
imply that the evolution is not reciprocal, or two species responding independently to
the same speciation event or environmental stress (37). In known cases of coevolution,
phylogenies of hosts and their microbial symbionts are congruent (16, 51, 52). However,
in complex and uncharacterized systems, this strategy can be reversed to identify
potential symbionts. Therefore, the main value of investigating codivergence in com-
plex associations is to identify those specific microbes on which to focus further
attention.
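
Topological congruence between a host tree and a symbiont tree can be quantified in many ways. As one simple illustration, the sketch below counts the clades that appear in one rooted tree but not the other, a clade-based distance in the style of the Robinson-Foulds metric, after renaming each symbiont tip to the host it was sampled from. The trees, tip names, and host assignments are hypothetical, the sketch assumes exactly one symbiont lineage per host, and it does not test statistical significance or divergence times, both of which a real codivergence analysis would need.

```python
def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
            found.add(leaves)
            return leaves
        return frozenset([node])

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in one rooted tree but not in the other."""
    return len(clades(tree_a) ^ clades(tree_b))


def map_tips(tree, mapping):
    """Rename symbiont tips to the host each symbiont was sampled from."""
    if isinstance(tree, tuple):
        return tuple(map_tips(child, mapping) for child in tree)
    return mapping[tree]


# Hypothetical host tree and two alternative symbiont trees.
host_tree = (("A", "B"), ("C", "D"))
sampled_from = {"s1": "A", "s2": "B", "s3": "C", "s4": "D"}
congruent_symbionts = (("s1", "s2"), ("s3", "s4"))
incongruent_symbionts = (("s1", "s3"), ("s2", "s4"))

print(rf_distance(host_tree, map_tips(congruent_symbionts, sampled_from)))
print(rf_distance(host_tree, map_tips(incongruent_symbionts, sampled_from)))
```

A distance of zero means the two topologies match exactly; larger values indicate more conflicting clades. As noted above, even a perfect match does not by itself demonstrate reciprocal evolution.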

Codivergence has been demonstrated in the case of Hydra viridissima, a freshwater
relative of marine cnidarians, and its photosymbiont Chlorella (53). In this system,
photosynthetically fixed carbohydrates from Chlorella are transported to its host (54),
and phylogenetic analysis of 6 strains of H. viridissima and their vertically transmitted
symbionts revealed clear congruency of host and symbiont topologies (55). In more-
complex systems, patterns of codivergence have been illustrated in the gut microbiota
of hominids (25). Analysis of fecal samples from humans, wild chimpanzees, wild
bonobos, and wild gorillas showed that four clades of bacteria from the dominant
families Bacteroidaceae and Bifidobacteriaceae codiverged with host phylogeny. Impor-
tantly, this example illustrates one possible way of identifying codivergence in complex
holobionts where the symbionts are unknown. Since bacteria from the families Bacte-
roidaceae, Bifidobacteriaceae, and Lachnospiraceae are known to dominate the gut of
hominids, multiple primer sets targeting each individual family were utilized, and
phylogenetic analyses of the families were completed independently. Furthermore,
instead of using the relatively slowly diverging 16S rRNA gene, the fast-evolving and
variable gene encoding DNA gyrase subunit B was used for bacterial phylogenetics.
Similar methods may be applied to complex marine invertebrates such as coral and
sponges, where 16S rRNA gene studies have identified prominent bacteria.

Within complex marine invertebrate holobionts, codivergence has been most clearly
demonstrated in cold-water sponges in the family Latrunculiidae. The microbiomes of
six species within this family were dominated by a single betaproteobacterial OTU, and
the phylogeny of this OTU was highly congruent with that of the host (56). Further-
more, gene expression analysis suggested that the dominant betaproteobacteria are
active members of the microbiome rather than dormant or nonviable members;
however, whether or not this potential symbiont and its host participate in metabolic
collaboration is unknown, highlighting an example warranting further investigation.
The microbiomes of many other marine invertebrates are dominated by members of
the genus Endozoicomonas (57). A pan-genomic analysis of the genomes of seven
Endozoicomonas strains representing a broad range of hosts (corals, sponges, and sea
slugs) provided some evidence for codivergence (58). Strikingly, the two closely related
corals Stylophora pistillata and Pocillopora verrucosa hosted Endozoicomonas with
highly similar genomes. A second, large-scale study (47) found that Endozoicomonas
species within the coral tissues showed strong signals of codivergence with their hosts;
however, they were grouped into two major divisions, namely, the host-specific and
host-generalist divisions. The presence of a host-generalist clade may partly explain
why the …

ESSAY

Natural experiments and long-term monitoring are critical to understand and predict marine host–microbe ecology and evolution

Matthieu Leray1‡*, Laetitia G. E. Wilkins2¤‡, Amy Apprill3, Holly M. Bik4, Friederike Clever1,5, Sean R. Connolly1, Marina E. De León1,2, J. Emmett Duffy6, Leïla Ezzat7, Sarah Gignoux-Wolfsohn8, Edward Allen Herre1, Jonathan Z. Kaye9, David I. Kline1, Jordan G. Kueneman1, Melissa K. McCormick8, W. Owen McMillan1, Aaron O’Dea1,10*, Tiago J. Pereira4, Jillian M. Petersen11, Daniel F. Petticord1, Mark E. Torchin1, Rebecca Vega Thurber12, Elin Videvall13,14, William T. Wcislo1, Benedict Yuen11, Jonathan A. Eisen2,15,16

1 Smithsonian Tropical Research Institute, Balboa, Ancon, Republic of Panama
2 UC Davis Genome Center, University of California, Davis, Davis, California, United States of America
3 Marine Chemistry and Geochemistry Department, Woods Hole Oceanographic Institution, Woods Hole, Massachusetts, United States of America
4 Department of Marine Sciences and Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America
5 Department of Natural Sciences, Manchester Metropolitan University, Manchester, United Kingdom
6 Tennenbaum Marine Observatories Network, Smithsonian Environmental Research Center, Edgewater, Maryland, United States of America
7 Department of Ecology, Evolution and Marine Biology, University of California Santa Barbara, Santa Barbara, California, United States of America
8 Smithsonian Environmental Research Center, Edgewater, Maryland, United States of America
9 Gordon and Betty Moore Foundation, Palo Alto, California, United States of America
10 Department of Biological, Geological and Environmental Sciences, University of Bologna, Bologna, Italy
11 Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
12 Department of Microbiology, Oregon State University, Corvallis, Oregon, United States of America
13 Center for Conservation Genomics, Smithsonian Conservation Biology Institute, Washington, DC, United States of America
14 Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
15 Department of Evolution and Ecology, University of California, Davis, Davis, California, United States of America
16 Department of Medical Microbiology and Immunology, University of California, Davis, Davis, California, United States of America

¤ Current address: Max Planck Institute for Marine Microbiology, Department of Symbiosis, Bremen,
Germany

‡ These authors share first authorship on this work.

* [email protected] (ML); [email protected] (AO)

Abstract

Marine multicellular organisms host a diverse collection of bacteria, archaea, microbial eukaryotes, and viruses that form their microbiome. Such host-associated microbes can significantly influence the host’s physiological capacities; however, the identity and functional role(s) of key members of the microbiome (“core microbiome”) in most marine hosts coexisting in natural settings remain obscure. Also unclear is how dynamic interactions between hosts and the immense standing pool of microbial genetic variation will affect marine ecosystems’ capacity to adjust to environmental changes. Here, we argue that significantly advancing our understanding of how host-associated microbes shape marine hosts’ plastic and adaptive responses to environmental change requires (i) recognizing that individual host–microbe systems do not exist in an ecological or evolutionary vacuum and (ii) expanding the field toward long-term, multidisciplinary research on entire communities of hosts and microbes. Natural experiments, such as time-calibrated geological events associated with well-characterized environmental gradients, provide unique ecological and evolutionary contexts to address this challenge. We focus here particularly on mutualistic interactions between hosts and microbes, but note that many of the same lessons and approaches would apply to other types of interactions.

PLOS BIOLOGY

PLOS Biology | https://doi.org/10.1371/journal.pbio.3001322 August 19, 2021 1 / 18

OPEN ACCESS

Citation: Leray M, Wilkins LGE, Apprill A, Bik HM, Clever F, Connolly SR, et al. (2021) Natural experiments and long-term monitoring are critical to understand and predict marine host–microbe ecology and evolution. PLoS Biol 19(8): e3001322. https://doi.org/10.1371/journal.pbio.3001322

Published: August 19, 2021

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: Financial support for the workshop was provided by grant GBMF5603 (https://doi.org/10.37807/GBMF5603) from the Gordon and Betty Moore Foundation (W.T. Wcislo, J.A. Eisen, co-PIs), and additional funding from the Smithsonian Tropical Research Institute and the Office of the Provost of the Smithsonian Institution (W.T. Wcislo, J.P. Meganigal, and R.C. Fleischer, co-PIs). JP was supported by a WWTF VRG Grant and the ERC Starting Grant ’EvoLucin’. LGEW has received funding from the European Union’s Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Sklodowska-Curie Grant Agreement No. 101025649. AO was supported by the Sistema Nacional de Investigadores (SENACYT, Panamá). A. Apprill was supported by NSF award OCE-1938147. D.I. Kline, M. Leray, S.R. Connolly, and M.E. Torchin were supported by a Rohr Family Foundation grant for the Rohr Reef Resilience Project, for which this is contribution #2. This is contribution #85 from the Smithsonian’s MarineGEO and Tennenbaum Marine Observatories Network. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: LTER, Long-Term Ecological Research; MarineGEO, Marine Global Earth Observatory; MBON, Marine Biodiversity Observation Network; TEP, Tropical Eastern Pacific.

Main

It is widely recognized that host-associated microbes play profound roles in the health of their marine hosts and the ecosystems they inhabit. Although some such interactions with microbes are transient, many are more persistent and can be generally described as symbioses. Symbioses come in many flavors including parasitism, commensalism, and mutualism (see Box 1), and, in this paper, we focus in particular on the mutually beneficial (i.e., mutualistic) subset of such interactions involving marine hosts. Despite the wide recognition of the importance of such mutualisms, it remains less clear how these associations scale up to drive broader ecological and evolutionary patterns and processes. For example, the contribution of microbes to host acclimatization and adaptation (see Box 1 for definitions) is an active new field of experimental research with much potential. Studies, mostly conducted in controlled laboratory settings, have evaluated the ecological costs/benefits for hosts to associate temporarily with different microbes (e.g., corals [1–4]) or to engage in obligate intimate relationships (e.g., bobtail squid with the bioluminescent bacteria Aliivibrio fischeri [5]).

Experimental studies are, however, intrinsically limited in several ways. They limit themselves to a small number of experimentally tractable hosts and microbes, and, in doing so, fail to account for the enormous complexity of interactions and variation that exist in nature between multiple hosts and their multitudes of associated microbes. Short-lived experiments (e.g., days to weeks) cannot replicate the scales of time and space involved in the potential coevolution of hosts and microbes (Box 1). Attempts to merge long-term datasets to reveal overarching patterns (e.g., [6–9]) have provided valuable insights but are shadowed by the limits and biases introduced by mixing information from different contexts or methodologies [10]. These limitations obscure general principles on the roles (mutualistic or otherwise) of host-associated microbes across host individuals, species, and communities [11–13].

Here, we demonstrate the value of moving beyond taxon-centric approaches to studying host–microbe associations in their natural evolutionary and ecological context. We suggest intensifying long-term research in well-documented “natural experiments”. Such natural experiments, including well-calibrated geological events (e.g., vicariance and creation of novel habitats accurately dated using fossil and geological data) and environmental gradients where multiple hosts and associated microbes are subjected to the same range of environmental conditions, can be particularly useful (Fig 1). These phenomena provide a unique framework for comparative studies where the processes of interest occur over spatial and evolutionary time scales that are nearly impossible to capture in laboratory experiments. The value of combining experimental and long-term field studies at natural experiments has been recognized by ecologists [14–16]. We argue that similar approaches should be applied to the study of host–microbe interactions. We highlight several natural experiments that can advance our understanding of the ecological and evolutionary mechanisms shaping host–microbe interactions (with a focus on mutualistic ones) in marine communities and ecosystems.

Identifying important players

Marine organisms have evolved complex structural, behavioral, and chemical mechanisms to regulate the presence, abundance, and activity of their microbial associates. Hosts can limit colonization by transient opportunistic microbes that would use space and resources without providing any benefits, and some hosts can even block pathogens entirely [17–19]. Host-specific and obligate microbial associates, often called the “core microbiome” of a host population or species, are generally assumed to play more important functional roles than opportunistic and transient taxa [20]. This core microbiome is exemplified by an obligate nutritional microbial symbiosis, in which the host relies extensively on microbial partners for survival by synthesis of food, often in a nutrient-limited habitat. The host may acquire these partners horizontally (from the surrounding environment), vertically (from the parent to the offspring), or in both ways (mixed mode) [21]. Many evolved symbioses result in codependency; for example, the genomes of host-associated microbes have lost genes encoding pathways that were previously essential, such as those for motility or environmental stress responses, but that became obsolete in obligate symbiotic lifestyles [22]. In return, hosts have evolved mechanisms to maintain their associated microbes in stable intracellular environments and to support their nutritional needs [23]. Some of these nutritional associations are clearly identifiable because symbionts form massive and dense populations, sometimes only consisting of a single microbial species, in or on the bodies of their hosts. Examples include photosynthetic symbioses in cnidarians [24] and chemosynthetic symbioses in invertebrate animals such as bathymodiolin mussels, lucinid clams, Riftia tubeworms, and Astomonema nematodes [25,26]. Although widespread, host reliance on a single or few microbes for nutrition is the exception rather than the rule. The vast majority of animals and plants are instead associated with a diverse assemblage of microbes where it is challenging to differentiate between members of the core microbiome and the myriad of transient microbes and even more challenging to determine what, if any, key functional roles such microbes play.

Box 1. Definitions of key terms

Acclimatization: The process by which an organism becomes accustomed to new environmental conditions during its lifetime.

Adaptation: A heritable trait of an organism that increases its fitness in its surrounding environment. In comparison to acclimatization, adaptations will be passed on to the next generation.

Convergent evolution: Independent origins of similar features in different organisms in response to separately experiencing similar selective pressures. Importantly, convergently originated features, also known as analogous features, were not present in the common ancestor of the taxa in question.

Genetic drift: Change in the relative frequency of genotypes due to random variation in reproduction. Such drift is more common in small populations and leads to changes in genotype frequencies independent of adaptive forces.

Host–microbe coevolution: During host–microbe coevolution, multicellular hosts and their associated microbes show a concerted and heritable response to an environmental change.

Homologous recombination: The process by which two pieces or stretches of DNA that are very similar in their sequence physically align and exchange nucleotides.

Horizontal gene transfer: The unidirectional movement of DNA, usually only small fractions of a genome, from one organism to another. Though this generally occurs more frequently within species than between, it can also occur across vast evolutionary distances.

Metagenomics: Studies of the genetic material of communities of organisms.

Phenotypic plasticity: Phenotypic plasticity is the ability of a specific genotype to produce more than one phenotype in response to a changing environment during an individual’s lifetime. These phenotypic changes may include an organism’s behavior, morphology, physiology, or other features. Phenotypic plasticity is adaptive if it increases an individual’s survival and if the ability is passed on to the next generation.

Symbioses: Symbioses are broadly defined as intimate interactions between at least two organisms where at least one of them benefits. We focus here specifically on mutually beneficial interactions (aka mutualisms) between multicellular eukaryotes and their associated microbes. These interactions may include disease resistance, predator avoidance, and nutrition. These interactions will ultimately increase host survival and fitness.

Fig 1. Examples of marine natural experiments as observatories of host–microbe interactions. Regionally focused, long-term, and taxonomically broad research programs will help fill key knowledge gaps about the nature of microbe functions and the dynamics of host–microbe interactions in changing oceans. We highlight areas of the world’s oceans where environmental gradients are well characterized, where the taxonomy and evolutionary history of the local host fauna and flora is already well established, where paleoecological studies can provide important historical context, where a long-term monitoring program is ongoing, and where there is significant research infrastructure. Long-term monitoring sites (white dots) include sites of the NSF’s LTER Network, the Smithsonian Institution’s MarineGEO network of partners, the MBON, the AIMS, and the ASSEMBLE. (1) NASA MODIS data; (2) Adapted from [93]; (3) Adapted from [73]; (4) Adapted from [74]; (5) Adapted from [94]; (6) Adapted from [95]. AIMS, Australian Institute of Marine Science; ASSEMBLE, Association of European Marine Biological Laboratories; LTER, Long-Term Ecological Research; MarineGEO, Marine Global Earth Observatory; MBON, Marine Biodiversity Observation Network.

https://doi.org/10.1371/journal.pbio.3001322.g001

Several approaches have been proposed to identify key microbes or functions within complex host microbiomes (reviewed in [27]). The most common practice is to identify microbial taxa that are consistently associated with a host population or species using marker gene sequencing, usually above some arbitrary prevalence threshold ([28]; but see [29,30] for alternative methods). The prevalence of a host–microbe association is typically measured without explicit attention to co-occurring and closely related host taxa, the surrounding environment, or adequacy of spatial and temporal sampling. This limited sampling and lack of context, often resulting from funding constraints, lead to several major limitations. First, a microbial taxon can be prevalent in a host population for reasons unrelated to its functional role. For example, it may originate from the host’s food or habitat, including seawater or sediment [31]. Second, even the core microbiome can change over time [32]. Functionally important microbes may fluctuate in abundance throughout host ontogeny and may also vary seasonally. Essential host-associated microbes may be overlooked if the sampling method cannot detect low abundance reliably, resulting in false negatives, or if sampling is sporadic, missing the life stage or season when particular microbes are essential. Third, many studies rely upon sequencing of rRNA genes to characterize communities, yet rRNA genes are generally too conserved to distinguish closely related taxa and reveal little directly about genomic functional potential. Clearly, understanding the functional roles of host-associated microbes requires analyses that go far beyond individual marker gene profiles and instead encompass other types of information such as whole genomes or metagenomes, transcriptomes, metabolomes, localization, biochemistry, and more. Fourth, taxon-focused studies may miss valuable information about interactions that could be gleaned from broader comparative analyses. Microbes that are specific to particular host genotypes, host species, or closely related groups of hosts, indicating a shared evolutionary history, are likely candidates for core microbes with specialized functions (e.g., gut fermenters associated with herbivores). These existing limitations could be robustly circumvented via whole-ecosystem studies where long-term collection of comprehensive genomic-level datasets (e.g., ‘omic scale information) would transform our understanding of host–microbe interactions at all levels.
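
As an illustration of the prevalence-threshold practice described above, the following minimal Python sketch flags taxa detected in at least a chosen fraction of host samples. The function name `core_taxa`, the `threshold` and `min_count` parameters, and the toy read counts are all assumptions made for this example; none of them comes from a cited study, and the sketch deliberately ignores the environmental and temporal context discussed above.

```python
from collections import defaultdict


def core_taxa(samples, threshold=0.9, min_count=1):
    """Return {taxon: prevalence} for taxa found in >= `threshold` of samples.

    `samples` maps a sample ID to a dict of {taxon: read count}.
    A taxon counts as detected in a sample when its count is >= `min_count`.
    """
    if not samples:
        return {}
    detected = defaultdict(int)
    for counts in samples.values():
        for taxon, n in counts.items():
            if n >= min_count:
                detected[taxon] += 1
    total = len(samples)
    return {t: k / total for t, k in detected.items() if k / total >= threshold}


# Hypothetical marker-gene counts for four individuals of one host species.
host_samples = {
    "coral_1": {"OTU_A": 120, "OTU_B": 3, "OTU_C": 0},
    "coral_2": {"OTU_A": 95, "OTU_B": 0, "OTU_C": 7},
    "coral_3": {"OTU_A": 143, "OTU_B": 1},
    "coral_4": {"OTU_A": 88, "OTU_C": 2},
}

core = core_taxa(host_samples, threshold=0.75)
print(core)
```

Changing `threshold` or `min_count` changes which taxa qualify, which is one way to see why such a cutoff is described as arbitrary. A prevalent taxon in this output could still derive from food or seawater, so the result is a list of candidates rather than evidence of function.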

To instigate this new approach, we recommend strategically intensifying research within a few ocean regions. This entails collecting large-scale data on host-associated microbes across phylogenetically diverse sets of co-occurring host organisms, together with data on surrounding free-living microbes (i.e., in seawater and sediments) through time in areas where the surrounding abiotic environment and community dynamics have been well characterized. A regionally focused and coordinated approach will make it possible to identify environmental sources and hosts that serve as reservoirs of key host-associated microbial taxa and genes. Long-term investments in research on particular communities of hosts and microbes will also help establish links between changes in core microbiome composition, environmental factors, ecosystem function, and resilience. Public archival of genomic data and samples (available for complementary analysis using emerging technologies) collected from a few intensively studied ocean regions will foster transformative discoveries on dynamic host–microbe relationships. Habitat-forming corals, sponges, seagrasses, and mangrove trees are important focal groups, since breakdowns in the associations between these species and their microbiomes likely disproportionately influence other taxa and ecosystem functions. However, this should not come at the expense of research on more inconspicuous and overlooked, yet functionally important taxa that comprise the majority of the oceans’ biological diversity (e.g., small fish that fuel marine food webs [33] and urchins and crustaceans that feed on algae that can displace corals [34]). Systematic biases toward studying certain taxa (vertebrates, species with large body sizes, charismatic fauna), partly caused by the lack of coordination, have clearly affected our understanding of the distribution and roles of host-associated microbes. For example, a recent microbiome comparison of several Indo-Pacific invertebrate species demonstrated that sponges have a less specific microbiome than had been assumed for many years [35]. Expanding the taxonomic breadth of host–microbe studies will be most fruitful in areas where taxonomically rigorous field guides, ecological survey data, and functional trait databases are available. Substantial progress will also occur where phylogenetic relationships are known and local expert taxonomists can be engaged. One of the numerous potential outcomes is building community-wide association matrices to unveil the extent of reliance between hosts and microbial partners (specificity versus ubiquity, obligate versus facultative) and the interactions that promote the stability of core microbiomes.
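
To make the idea of a community-wide association matrix concrete, the sketch below builds a host-by-microbe presence/absence matrix and counts how many host taxa each microbe occurs in. It is only a toy illustration: the function names `association_matrix` and `host_range`, the host and microbe labels, and the rule that "specific" means one host and "ubiquitous" means all hosts are assumptions for this example, not definitions from the essay.

```python
def association_matrix(pairs):
    """Build a host-by-microbe presence/absence matrix from (host, microbe) pairs."""
    hosts = sorted({host for host, _ in pairs})
    microbes = sorted({microbe for _, microbe in pairs})
    observed = set(pairs)
    matrix = [[int((h, m) in observed) for m in microbes] for h in hosts]
    return hosts, microbes, matrix


def host_range(microbes, matrix):
    """Number of host taxa in which each microbe was detected."""
    return {m: sum(row[j] for row in matrix) for j, m in enumerate(microbes)}


# Hypothetical detections; the names are placeholders, not real survey data.
pairs = [
    ("coral_sp", "microbe_1"),
    ("sponge_sp", "microbe_1"),
    ("urchin_sp", "microbe_1"),
    ("sponge_sp", "microbe_2"),
    ("coral_sp", "microbe_3"),
    ("coral_sp", "microbe_3"),  # repeated detections collapse to presence
]

hosts, microbes, matrix = association_matrix(pairs)
ranges = host_range(microbes, matrix)

# Assumed working definitions for this sketch only.
specific = [m for m, n in ranges.items() if n == 1]
ubiquitous = [m for m, n in ranges.items() if n == len(hosts)]
print(specific, ubiquitous)
```

Presence/absence is the simplest possible encoding. Abundance, sampling effort, and time would all need to be added before such a matrix could address the obligate versus facultative distinction raised above.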

Role of microbes in host acclimatization and adaptation

Host-associated microbes can rapidly respond to extrinsic factors such as extreme or anomalous environmental conditions (e.g., heatwaves, hypoxia), pathogens, anthropogenic disturbances (e.g., pollution, overfishing, aquaculture, invasive species), and acute and chronic stressors [36,37]. They can also quickly change in response to factors intrinsic to the host (e.g., changes in host physiology [38]). The dynamic nature of microbes may provide a source of ecological and evolutionary novelty to support potential host response mechanisms that augment the host’s own evolutionary potential. Host-associated microbial communities can shift rapidly through the loss, gain, or replacement of individual members. Individual microbial cells can make rapid physiological adjustments during their lifetime (plasticity) or within a few generations (adaptation) [39] (Fig 2). In many microbes, relatively high rates of mutation and exchange of genetic material among divergent lineages (through homologous recombination and horizontal gene transfer) generate a high frequency of new genetic variants, some of which may be better suited to novel conditions (Fig 2). These mechanisms contribute to fueling an immense standing pool of genetic variation that hosts can potentially draw upon. The outcomes of the collective ecological and evolutionary response of hosts and their associated microbes to environmental change may comprise one of four non-mutually exclusive scenarios [40,41]: (1) Imbalance: a temporary or permanent change of host fitness and microbial functions leading to increased disease susceptibility; (2) Resistance: the microbiome continues performing its functions and the host does not lose or gain fitness; (3) Acclimatization: the newly formed microbial community in conjunction with host phenotypic plasticity enables the individual host to adjust and maintain performance under changing environmental conditions (Fig 2); and (4) Adaptation: in the long term, newly formed interactions between host genotypes and associated microbes increase the fitness of the symbiosis and they become heritable (Fig 2).

The role that host-associated microbes play in their host’s response to environmental change is also influenced by their mode of transmission (Fig 3). While vertical transmission may help ensure the intergenerational stability of mutualistic symbioses, the dependence on symbionts with highly simplified and inflexible genomes is a risky strategy under variable or unpredictable stressful conditions [42,43]. Vertically transmitted symbionts have fewer opportunities to exchange genes with the vast pool of genetic diversity available in the external environment, which could constrain the adjustment of these associations to rapidly changing conditions. In the marine environment, the vast majority of mutualistic symbionts are acquired horizontally from the surrounding environment or from other hosts [44]; this includes associations where a host is entirely dependent upon a single or a few symbionts for nutrition (e.g., tubeworms [45]; mussels [46]). Horizontal transmission has important implications for the adaptive potential of hosts [47]. The ability to acquire microbes and genes from the surrounding environment allows hosts to access the huge evolutionary potential contained within the larger microbial communities. Hosts with horizontally acquired microbes could thus be better positioned to adjust and become resilient to changing environmental conditions. Selection that maintains and fine-tunes the relationship could subsequently lead to adaptive genetic change.

Fig 2. Conceptual representation of the role of microbes in host acclimatization and adaptation. Microbes can frequently adapt to environmental changes more rapidly than their host because of shorter generation times and higher standing genetic variation. Changes that occur at the levels of individual microbes and microbiomes can rapidly generate phenotypic plasticity in a broad range of host traits (i.e., one host genotype expresses multiple phenotypes induced by microbes). Microbially induced phenotypes may promote host adaptation if they become heritable traits. Within microbiomes, transient microbes (thin dashed circles) have limited effects on host phenotype. On the other hand, core microbes (thick dashed circles) that engage in prolonged relationships with hosts and potentially coevolve with hosts likely alter host phenotypes and promote host adaptation. Note that the time scale at which evolutionary changes occur varies widely between organisms, but adaptation is generally slower than acclimatization. Plain line: nonaltered interaction; dashed line: altered interaction; colors of microbes represent different microbial taxa.

https://doi.org/10.1371/journal.pbio.3001322.g002

Several key bottlenecks currently impede our understanding of how host-associated microbes drive the initial response as well as long-term, evolutionary adaptation to climate change–related disturbances in hosts with diverse microbial communities. First, changes in microbiomes that confer adverse or beneficial outcomes for the host cannot be distinguished from natural variability …

Resource

Deep-Learning Resources for Studying Glycan-Mediated Host-Microbe Interactions

Graphical Abstract

Highlights

• Glycan-focused language models can be used for sequence-to-function models

• Information in glycans predicts immunogenicity, pathogenicity, and taxonomic origin

• Glycan alignments shed light into bacterial virulence

Bojar et al., 2021, Cell Host & Microbe 29, 132–144
January 13, 2021 © 2020 The Author(s). Published by Elsevier Inc.
https://doi.org/10.1016/j.chom.2020.10.004

Authors

Daniel Bojar, Rani K. Powers, Diogo M. Camacho, James J. Collins

Correspondence
[email protected] (D.M.C.), [email protected] (J.J.C.)

In Brief

Bojar et al. present a workflow that combines machine learning and bioinformatics techniques to analyze the prominent role of glycans in host-microbe interactions. The herein developed glycan-focused language models and alignments allow for the prediction and analysis of glycan immunogenicity, association with pathogenicity, and taxonomic classification.

OPEN ACCESS

Resource

Deep-Learning Resources for Studying
Glycan-Mediated Host-Microbe Interactions
Daniel Bojar,1,2 Rani K. Powers,1,2 Diogo M. Camacho,1,4,* and James J. Collins1,2,3,4,5,*
1Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
2Department of Biological Engineering and Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
3Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
4These authors contributed equally
5Lead Contact

*Correspondence: [email protected] (D.M.C.), [email protected] (J.J.C.)

https://doi.org/10.1016/j.chom.2020.10.004

SUMMARY

Glycans, the most diverse biopolymer, are shaped by evolutionary pressures stemming from host-microbe interactions. Here, we present machine learning and bioinformatics methods to leverage the evolutionary information present in glycans to gain insights into how pathogens and commensals interact with hosts. By using techniques from natural language processing, we develop deep-learning models for glycans that are trained on a curated dataset of 19,299 unique glycans and can be used to study and predict glycan functions. We show that these models can be utilized to predict glycan immunogenicity and the pathogenicity of bacterial strains, as well as investigate glycan-mediated immune evasion via molecular mimicry. We also develop glycan-alignment methods and use these to analyze virulence-determining glycan motifs in the capsular polysaccharides of bacterial pathogens. These resources enable one to identify and study glycan motifs involved in immunogenicity, pathogenicity, molecular mimicry, and immune evasion, expanding our understanding of host-microbe interactions.
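
To give a feel for how a glycan sequence might be prepared for a language model, the sketch below splits a linear glycan string into tokens and then into overlapping windows. This is a hedged illustration, not the authors' method: the bracket-style string notation, the function names `glycoletters` and `glycowords`, and the choice of a five-token window (three monosaccharides joined by two linkages) are all assumptions made for this example, and branched glycans are explicitly not handled.

```python
import re


def glycoletters(glycan):
    """Split a linear glycan string into monosaccharide and linkage tokens."""
    if "[" in glycan or "]" in glycan:
        raise ValueError("branched glycans are outside the scope of this sketch")
    # Linkages are assumed to be written in parentheses between monosaccharides.
    return [tok for tok in re.split(r"[()]", glycan) if tok]


def glycowords(tokens, size=5, stride=2):
    """Slide a window over the tokens so that each window starts on a monosaccharide."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, stride)]


# Illustrative linear glycan in an assumed notation.
glycan = "NeuNAc(a2-3)Gal(b1-4)GlcNAc(b1-3)Gal"
letters = glycoletters(glycan)
words = glycowords(letters)
print(letters)
print(words)
```

Tokens or windows produced this way could be mapped to integer indices and fed to a sequence model, in the same way that words are in natural language processing. How the published models actually encode branching and linkage information is not specified in this excerpt.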

© 2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

INTRODUCTION

In contrast to RNA and proteins, whose sequences can be elucidated from their associated DNA sequence, glycans are the only biopolymer outside the rules of the central dogma of molecular biology. Although glycans are synthesized by DNA-encoded enzymes (Lairson et al., 2008), an individual glycan sequence is dependent on the interplay between multiple enzymes and cellular conditions. Additionally, the expansive glycan alphabet of hundreds of different monosaccharides allows for a large number of potential oligosaccharides, built with different monosaccharides, lengths, connectivity, and branching. Glycans are present as modifications on all other biopolymers (Varki, 2017), exerting varying effects on biomolecules, including stabilization and modulation of their functionality (Dekkers et al., 2017; Solá and Griebenow, 2009). Apart from influencing the function of individual proteins, glycans are also crucial for cell-cell contact in the case of glycan-glycan interactions during the attachment of pathogenic bacteria to host cells (Day et al., 2015), and they mediate essential developmental processes such as nervous system development (Haltiwanger and Lowe, 2004). Recently, Lauc et al. hypothesized that the plethora of available glycoforms and their plasticity facilitated the evolution of complex multicellular lifeforms (Lauc et al., 2014), reasoning that is supported by the essential roles of glycans in developmental processes and cell-cell communication and emphasizes the evolutionary information in glycans.

Because glycans make up the outermost layer of both eukaryotic and prokaryotic cells, cross-kingdom interactions will necessarily involve these molecules (Day et al., 2015). The prominent role of glycans in host-pathogen interactions (Varki, 2017) has resulted in evolutionary pressures and opportunities on both sides of the interaction: natural selection can modify host glycan receptors used by pathogens without losing their functionalities, whereas pathogens and commensals need to alter their glycans to evade the host immune system. These interactions provide a window into understanding glycan-mediated host-microbe relationships. Glycans display great phenotypic variability: sequences can be changed depending on environmental conditions, such as the level of extracellular metabolites (Park et al., 2017), without the need for genetic mutations, potentially facilitating rapid responses to changes in host-microbe relationships.

Given the aforementioned glycan-mediated host-microbe interactions, glycans could provide insights into pathogenicity and commensalism determinants, as, for instance, molecular mimicry of host glycans by both pathogens and commensals facilitates their immune evasion (Carlin et al., 2009; Varki and Gagneux, 2015). Additional therapeutic potential is enabled by the widespread usage of glycans by viruses for cell adhesion and

C

D
Bonds Made
by NeuNAc

α2-8

α2-3

α2-6

N
um

b
e

r o
f G

ly
c

a
ns

0

200

400

α2-6

α2-3

α2-8

Bonds Made
by NeuNGc

0

40

80

NeuNAc

NeuNGc Kdo

Monosaccharides
with Bond α2-3

0

200

400

Monosaccharides Paired with Fuc

N
um

b
e

r o
f G

ly
c

a
ns

Branching

O
c

c
ur

e
nc

e
s

Position

Main

Side

G
a

l
G

lc

G
a

lN
A

c
M

a
n

Fu
c

G
lc

N
A

c
N

G
lc

N
A

c
O

S

G
lc

A
Rh

a

G
lc

N
A

c
G

ro
G

a
lO

A
c

G
a

lA
G

lc
N

1200

800

400

0

3000

2000

1000

0

G
lc

N
A

c

G
a

lO
S

A 12,674 Species-Specific Glycans

6,969 eukaryotic
6,119 prokaryotic
152 viral

19,299 Unique Glycans

1,027 Glycoletters
19,866 Glycowords
9,152 Glycans with at
least one label

1600

6000

Domain

Order

Kingdom

Family

Phylum

Genus

Class

Species

Number of Glycans

Number of Glycans
0 2000 4000 6000 0 2000 4000 0 20001000 3000

0 800 0 400 800 0 400 800

Virus

Archaea

Primates
Pseudomonadales
Fabales

Rhizobiales

Saccharomycetales

Lactobacillales
Artiodactyla

Burkholderiales

Actinomycetales

0 1000 2000

Plantae
Animalia

Fungi
Excavata

Virus
Euryarchaeota
Riboviria
Chromista
Proteoarchaeota

0 2000 4000

Hominidae

Pseudomonadaceae
Fabaceae

Saccharomycetaceae
Rhizobiaceae

Pasteurellaceae

Burkholderiaceae

Solanaceae
Muridae

Angiosperms
Chordata

Ascomycota
Firmicutes

Basidiomycota
Actinobacteria

Euglenozoa

Virus
Arthropoda

Homo

Salmonella

Burkholderia
Shigella

Bos
Sus
Streptococcus

Lactobacillus

Dicotyledons
Mammalia

Bacilli
Alphaproteobacteria
Monocotyledons

Saccharomycetes

Betaproteobacteria

Sordariomycetes
Actinobacteria

Sus scrofa
Mus musculus

Rattus norvegicus

Shigella dysenteriae
Gallus gallus

Pseudomonas sp.

Saccharomyces cerevisiae

B
Eukarya

Bacteria

Bacteria Proteobacteria Gammaproteobacteria

Enterobacterales Enterobacteriaceae Escherichia
Homo

Pseudomonas
Homo sapiens

Bos taurus

Escherichia coli

Figure 1. Using a Curated Glycan Dataset as a Resource for Glycobiology and Analyzing Host-Microbe Interactions
(A) Building curated datasets of species-specific and unique glycan sequences. Glycans stemming from proteins, lipids, small molecules, or cellular surfaces

were gathered from UniCarbKB, CSDB, GlyTouCan, and the academic literature. We deposited these datasets in our database SugarBase, containing additional

associated metadata, such as linkage and immunogenicity information.

(legend continued on next page)

ll
OPEN ACCESSResource

Cell Host & Microbe 29, 132–144, January 13, 2021 133

ll
OPEN ACCESS Resource

entry (Thompson et al., 2019) and pathogenic bacteria (Poole

et al., 2018).

In addition to previous work developing computational approaches to glycan analysis (McDonald et al., 2016; Spahn et al., 2016), identifying relevant glycan motifs and their roles in host-microbe interactions at scale would benefit from pattern-learning algorithms, such as machine learning, that can uncover statistical dependencies in biological sequences (Camacho et al., 2018). Research on other biopolymers has shown that language models, originally developed for the analysis of human languages, perform best in this task (Alley et al., 2019; Almagro Armenteros et al., 2020; Strodthoff et al., 2020), because they can leverage evolutionarily conserved regularities and language-like properties in such sequences. Language models, with their memory-like features, are well suited for leveraging patterns and implicit structure in biopolymers such as those underlying nucleic acids (Valeri et al., 2020) and proteins (Alley et al., 2019), because information in these sequences is order dependent, and non-neighboring residues can have meaningful interactions. Applying a natural language-processing approach to biological sequences also enables learning a representation of a molecule that can be used to analyze sequence motifs and predict functional properties. These types of models are therefore a suitable starting point for the analysis of glycan sequences.

Here, we present a resource toolkit comprising machine learning and bioinformatics methods as well as a large glycan database to leverage the evolutionary information present in glycans for predictive purposes in the context of host-microbe interactions, e.g., by understanding pathogenicity-associated glycan motifs. This toolkit can be used as a complete workflow for investigating host-microbe interactions, from a glycan dataset to glycan motifs identified by machine learning and further investigated by glycan alignments, or as separate modules. Underlying all of this is our language model for glycans, SweetTalk, trained on a dataset of 19,299 unique glycan sequences. With this, we demonstrate that similarities between glycans can be visualized and used to predict glycan properties such as human immunogenicity. Another part of our platform is SweetOrigins, a language-model-based classifier predicting the taxonomic origin of glycans that we use to obtain evolution-informed representations of glycans. To achieve this in the context of glycan-mediated host-microbe interactions, we manually curated a comprehensive dataset comprising 12,674 glycans with species annotations. These datasets were combined into a database, SugarBase, that is amenable to programmatic access and integration into deep-learning pipelines, thus providing resources for analyses involving host-microbe interactions.

In this work, we demonstrate the potential and generalizability of using SugarBase, SweetTalk, SweetOrigins, and a glycan-alignment methodology for studying glycan-mediated host-microbe interactions. We show that a language-model-based classifier trained on glycan sequences can accurately predict glycan immunogenicity and the pathogenicity of E. coli strains, revealing predictive glycan motifs. We also leverage the evolutionary information gained by SweetOrigins to analyze glycan motifs that could be used for molecular-mimicry-mediated immune evasion by commensals and pathogens. Applying our glycan-alignment methodology to the example of the capsular polysaccharides of Staphylococcus aureus and Acinetobacter baumannii, we uncover a potential connection to the enterobacterial common antigen and hypothesize a mechanism for the increased virulence mediated by these glycan motifs. Taken together, these resources offer a powerful and generalizable platform for studying and understanding the role of glycans in host-microbe interactions.

Figure 1 (continued).
(B) Glycan species distribution in the species-specific glycan dataset. For all glycans with species information, up to the 10 most abundant classes for each taxonomic level are shown with their number of glycans.
(C and D) Analyzing the local structural context of glycoletters. We identified the most frequent monosaccharides following fucose in glycans (C), highlighting its local structural context together with its likely position in the glycan structure (main versus side branch). Additionally, we compared the binding behavior of several sialic acids (D).

RESULTS

Curating Glycan Datasets for Glycobiology and Glycan-Mediated Host-Microbe Interactions
To investigate the role of glycans in host-microbe interactions, we constructed a dataset of species-specific glycan sequences that could be used to train machine-learning models. For this, we gathered and curated a dataset with glycans from GlyTouCan (Tiemeyer et al., 2017), UniCarbKB (Campbell et al., 2014), the Carbohydrate Structure Database (CSDB) (Toukach and Egorova, 2016), and targeted literature searches (see STAR Methods). To facilitate training deep-learning models on glycan sequences, we only included glycans with fully elucidated sequences, including the determination of linkages between monosaccharides. Our dataset contained 12,674 highly diverse glycans with a deposited species association (Figure 1A; Table S1) and included glycans from 1,726 species (corresponding to 39 taxonomic phyla; Figure 1B). Specifically, our dataset contained 6,969 eukaryotic, 6,119 prokaryotic, and 152 viral glycans. Because we included all species for which we could find glycans, this dataset constituted a comprehensive snapshot of currently known species-specific glycans, with glycans from numerous bacteria, facilitating the study of glycan-mediated host-microbe interactions.

We further reasoned that the inclusion of glycan sequences without a deposited species label would strengthen the language models we describe below. This approach is supported by the success of transfer learning in the field of machine learning (Howard and Ruder, 2018), in which models are initially trained on large datasets without labels and then finetuned on smaller datasets with labels. This makes more data available to learn general patterns, such as sequence motifs, that can be leveraged to predict glycan properties. Accordingly, we curated a separate dataset in which we used the databases mentioned above to gather 19,299 unique glycan sequences, irrespective of whether species information was available (Figure 1A; STAR Methods; Table S2). To gain a comprehensive view of glycobiology, we included all glycan categories, encompassing protein-, lipid-, and small molecule-associated glycans, as well as capsular and extracellular polysaccharides.

[Figure 2 graphic not reproduced. Panel A: SweetTalk workflow (datasets from GlyTouCan and the literature, 19,299 glycans; data processing; featurized input and embedding; forward and reverse LSTM layers; language model on glycoletters; classifiers on glycowords). Panel B: glycoletter embeddings (t-SNE Dim 1 versus t-SNE Dim 2). Panel C: number of glycowords with the existing alphabet, possible versus realized. Panel D: possible versus realized glycowords (UMAP Dim 1 versus UMAP Dim 2). Panel E: immunogenic versus non-immunogenic glycans (UMAP Dim 1 versus UMAP Dim 2). Panels F and G: probability immunogenic, plotted against the number of unmasked glycowords from the non-reducing and reducing ends (F) and for altered glycans relative to wild type (G).]

In our dataset, we observed 1,027 unique monosaccharides or bonds that were present in glycan sequences and comprised the smallest units of an alphabet for a glycan language. Analogous to natural language processing, we termed these entities “glycoletters” and constructed “glycowords” by considering trisaccharides (i.e., three monosaccharides and two connecting bonds, or five glycoletters), yielding 19,866 unique glycowords in our dataset. With this, we sought to incorporate local structural information into our models and enable the discovery of relevant motifs, which usually contain subsequences larger than a single monosaccharide. Even larger substructures would preclude the analysis of shorter glycans and lead to an exponential increase in the size of the resulting vocabulary. We would also like to note that although we chose trisaccharides as building blocks, glycan substructures of any length can be used to build a vocabulary for our models without considerable changes.
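The glycoword construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an unbranched glycan that has already been split into an alternating list of monosaccharides and bonds. The function name `glycowords`, the bond spelling, and the example sequence are hypothetical choices made for the sketch, and branch handling in bracket notation is deliberately left out.

```python
def glycowords(glycoletters, n_mono=3):
    """Slide a window over an alternating [mono, bond, mono, ...] list.

    Each window holds n_mono monosaccharides and n_mono - 1 bonds, so the
    default gives trisaccharide glycowords of five glycoletters.
    Assumes a linear glycan; branches are not handled in this sketch.
    """
    if len(glycoletters) % 2 == 0:
        raise ValueError("expected an odd-length monosaccharide/bond list")
    size = 2 * n_mono - 1
    # Step by 2 so that every window starts on a monosaccharide.
    return [
        tuple(glycoletters[i:i + size])
        for i in range(0, len(glycoletters) - size + 1, 2)
    ]


# Hypothetical linear example: four monosaccharides joined by three bonds.
example = ["NeuNAc", "a2-3", "Gal", "b1-4", "GlcNAc", "b1-3", "Gal"]
words = glycowords(example)
for word in words:
    print(" ".join(word))
```

Calling `glycowords(example, n_mono=2)` would yield disaccharide windows instead, which mirrors the remark that substructures of other lengths could serve as the vocabulary.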

To make these data and analysis resources readily accessible and facilitate further advances in glycobiology, we created SugarBase, a comprehensive glycan database with metadata and analytical tools based on this work (Figure S1A; Table S2; https://webapps.wyss.harvard.edu/sugarbase). SugarBase offers accessible glycan data, explorable glycan representations learned by our language models, and many of the methods developed here as tools, such as the local structural context of any glycoletter (Figure S1B) and glycan alignments, described below.

Reasoning that our glycan datasets constitute broad resources for glycobiology and host-microbe interactions, we set out to investigate host glycan substructures that could be emulated by microbes for molecular mimicry. Analyzing the environment of the monosaccharide fucose as an example, we observed N-acetylglucosamine (GlcNAc) and galactose (Gal) as typical connected monosaccharides (Figure 1C), which is consistent with the fucosyltransferase substrate specificities annotated in glycosyltransferase family 10 (Lombard et al., 2014). Thus, microbial glycans containing fucose could potentially include either GlcNAc or Gal in direct proximity to maximize similarity with host glycans. This insight aids in formulating hypotheses and identifying glycan motifs relevant for molecular mimicry, as we describe below. We also differentiated binding orientation preferences for different sialic acids, a crucial monosaccharide type in host-pathogen interactions (Figure 1D; Haines-Menges et al., 2015), revealing a preference for the characteristic human monosaccharide NeuNAc to be (α2-3)-linked, relative to other sialic acids such as NeuNGc. These types of analyses can directly lead to hypotheses of glycan motifs that can be investigated by using the methods presented in this work.

Using Natural Language Processing to Learn the Grammar of Glycans
Next, we used our curated dataset of 19,299 glycan sequences (Table S2) to develop a deep-learning-based language model, SweetTalk. For this, we chose a bidirectional recurrent neural network (RNN; Figure 2A; Sherstinsky, 2020), because this type of model has delivered state-of-the-art results for other biopolymers, such as protein sequences (Alley et al., 2019; Almagro Armenteros et al., 2020; Strodthoff et al., 2020). Originally developed for human languages, RNNs exhibit memory-like elements by predicting the next word given the preceding words (Sherstinsky, 2020); this enables RNNs to learn complex, order-dependent interactions in proteins by viewing amino acids as letters and predicting the next amino acid given the preceding sequence (Alley et al., 2019). Two of the main usages for a trained language model are as follows: (1) extracting a learned representation for each word and (2) finetuning the model for predicting structural or functional properties of a sequence. For the former, a representation or embedding that characterizes a word in terms of context, usage, and meaning is constructed in the parameters of the trained model for each word in the vocabulary. This learned representation can be used to quantify the similarity of two glycan sequences or analyze language properties, which we demonstrate with the analysis of molecular mimicry in host-microbe interactions. The latter (finetuning a general language model on a predictive task such as predicting pathogenicity) is also known as transfer learning (Howard and Ruder, 2018; Tan et al., 2018), and in our case it involves general glycan features that are learned by the language model to predict functional properties.

Figure 2. Learning the Language of Glycans Revealed Regularities in Substructures and Can Be Used to Predict Glycan Immunogenicity
(A) Building a language model for glycobiology. We used glycowords, overlapping units consisting of three monosaccharides and two bonds, for our glycoletter-based bidirectional RNN, SweetTalk, that was trained by predicting the next glycoletter given previous glycoletters. Glycans are drawn in accordance with the symbol nomenclature for glycans (SNFG).
(B) Learned representation of glycoletters by SweetTalk. We visualized the embedding for every glycoletter by t-distributed stochastic neighbor embedding (t-SNE). Areas enriched for modified monosaccharides of one type are colored.
(C) Comparing the abundance of possible and observed glycowords. Possible glycowords were calculated from the pool of observed glycoletters and their exhaustive combination (36 bonds and 991 monosaccharides).
(D) Comparing the distribution of possible and observed glycowords. We generated 250,000 glycowords by randomly sampling from the observed pool of monosaccharides and bonds and formed their embedding by averaging their constituent glycoletter embeddings. A uniform manifold approximation and projection (UMAP) of these generated glycowords (blue) and all observed glycowords (orange) is shown.
(E) Glycan embeddings learned by the immunogenicity classifier. Embeddings for glycans from our immunogenicity dataset are shown via UMAP and colored according to whether they were immunogenic (blue) or non-immunogenic (orange).
(F) Glycoword masking to probe the immunogenicity classifier. Glycowords were progressively exchanged with padding (“masking”) from both termini (“Non-Reducing”/“Reducing”) and used as input for the trained immunogenicity classifier. Inferred immunogenicity probability indicates how crucial each region of a glycan is for prediction, with the bar representing the full-length glycan at the bottom.
(G) Glycan in silico alterations to probe immunogenicity classifier. For 4,000 iterations, single monosaccharides or bonds were replaced with a random monosaccharide or bond. If the resulting glycowords were observed, we used them as input for the trained immunogenicity classifier. Inferred immunogenicity probability is plotted together with the altered glycan sequences, with the wildtype glycan found at the bottom. In case of ambiguity, a number indicates which monosaccharide was modified. The addition of an “S” implies a sulfurylated monosaccharide, whereas “Me” implies a methylated monosaccharide.

Glycans are the only nonlinear biopolymer, with up to multiple branches per sequence. To enable a language model despite this branching, we extracted partially overlapping “glycowords” from the non-reducing end to the reducing end of glycans in the bracket notation (Figure 2A), comprising three monosaccharides and two bonds. These glycowords represented snapshots of structural contexts that characterize a glycan sequence. By using monosaccharides and bonds as “glycoletters,” we then trained a glycoletter-based language model, SweetTalk, predicting the next most probable glycoletter given the preceding glycoletters in the context of these glycowords (Table S3). This operation, instead of directly training on full sequences, avoids learning specious relationships between glycoletters that are close in the bracket notation but far apart in the actual glycan structure due to branching. We then demonstrated the necessity of accounting for the order-dependent information in glycans by training SweetTalk on scrambled glycan sequences, randomizing the order but keeping the composition of a sequence; this resulted in severely degraded model performance, emphasizing the language-like elements inherent in glycan sequences (Table S3). Analyzing the learned embeddings of glycoletters after training SweetTalk revealed similar positions in embedding space for monosaccharides and their modified counterparts (e.g., sulfurylated galactose, GalOS, and sulfurylated N-acetylgalactosamine, GalNAcOS; Figure 2B), implying similarity in their language characteristics and context. This finding is reminiscent of observations made on the popular word2vec embeddings that also learn a representation of words in a human language by considering their neighboring words/context, in which semantically similar words form clusters (Mikolov et al., 2013).
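The scrambling control mentioned above (same composition, randomized order) can be illustrated with a short sketch. The text does not say exactly how sequences were scrambled, for example whether the alternation of monosaccharides and bonds was kept, so this version simply shuffles all glycoletters of a sequence. The helper name `scramble` and the example sequence are hypothetical.

```python
import random
from collections import Counter


def scramble(glycoletters, seed=None):
    """Return a shuffled copy: same glycoletters, randomized order."""
    rng = random.Random(seed)
    shuffled = list(glycoletters)  # copy so the input stays untouched
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical linear glycan written as alternating monosaccharides and bonds.
original = ["NeuNAc", "a2-3", "Gal", "b1-4", "GlcNAc", "b1-3", "Gal"]
control = scramble(original, seed=0)

# Composition is preserved even though the order generally is not.
assert Counter(control) == Counter(original)
print(control)
```

A model trained on such controls sees the same glycoletter frequencies as the real model, so any drop in performance can be attributed to the lost ordering.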

We then constructed glycoword embeddings by averaging the embeddings of their constituent glycoletters. Our first observation was that from the close to 1.2 trillion possible glycowords (given our observed glycoletters), only 19,866 distinct glycowords (~0.0000016%) were observed here (Figure 2C). Moreover, these 19,866 glycowords were not evenly distributed in the learned embedding space, as existing glycowords formed clusters compared to in silico-generated, possible glycowords (Figure 2D). The observation that the glycoword space (and, thus, glycan space) is sparsely populated is potentially a consequence of having to evolve dedicated enzymes for constructing specific glycan substructures from a species-specific set of monosaccharide building blocks, making most combinations inaccessible.
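The size of the possible glycoword space can be checked with a quick calculation. The sketch below assumes an exhaustive combination in which any of the 991 observed monosaccharides can fill each of the three monosaccharide slots and any of the 36 observed bonds can fill each of the two bond slots (counts taken from the Figure 2C legend). The authors' exact counting procedure may differ, and the variable names are illustrative.

```python
N_MONOSACCHARIDES = 991   # observed monosaccharides (Figure 2C legend)
N_BONDS = 36              # observed bonds (Figure 2C legend)
N_OBSERVED = 19_866       # distinct glycowords observed in the dataset

# A glycoword has 3 monosaccharide slots and 2 bond slots.
possible = N_MONOSACCHARIDES ** 3 * N_BONDS ** 2
observed_pct = 100 * N_OBSERVED / possible

print(f"possible glycowords: {possible:,}")
print(f"observed share: {observed_pct:.7f}%")
```

Under these assumptions the product is on the order of a trillion, and the observed share comes out at roughly 0.0000016%, in line with the figure quoted in the text.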

Predicting Glycan Immunogenicity with a Glycan-Based
Language Model
Given the important role glycans play in human immunity (Kappler and Hennet, 2020; Reusch and Tejada, 2015), we curated known immunogenic glycans from the literature (Table S2) to finetune a SweetTalk-based classifier with glycan sequences as input to predict their immunogenicity to humans. On an independent validation dataset, our model achieved an accuracy of ~92% (F1 score or balanced F score: 0.915), in comparison with an accuracy of ~51% for a model trained on scrambled glycan sequences (Figures 2E–2G; Table S4). Alternative machine-learning models that did not treat glycan sequences as a language, such as random forest classifiers, only achieved accuracies ranging from ~80%–88% for this task (Table S4), emphasizing the importance of order and patterns for elucidating glycan properties.
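For readers unfamiliar with the two metrics reported above, the following sketch shows how accuracy and the F1 score are computed for a binary classifier. The labels are made up for illustration and are not data from the study; the function name `accuracy_and_f1` is likewise hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels (1 = immunogenic, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels for illustration only; not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Accuracy counts all correct predictions, whereas F1 ignores true negatives, so the two can diverge when the classes are imbalanced.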

Rhamnose-rich glycans, a common monosaccharide in bacteria but not in mammals, were unambiguously assigned to an immunogenic cluster by our RNN-based model and presented the most striking motif for glycan immunogenicity (Figure …

Running Head: CRITIQUE OF OCEAN TEMPERATURES IN CORAL REEFS

CRITIQUE OF OCEAN TEMPERATURES IN CORAL REEFS Madison McNeill

Introduction

Coral reef ecosystems are among the most diverse marine ecosystems in the world, providing a home to thousands of species of plants and animals. In the last few decades, rising atmospheric carbon dioxide has driven both global warming, which increases sea surface temperatures, and ocean acidification. Elevated temperatures can lead to the bleaching of coral reefs as well as the death of coral reef fishes that are unable to acclimate. These three papers were chosen because they illustrate the environmental impact higher temperatures have on coral reefs and the organisms that live within them.

· Dias, M., Ferreira, A., Gouveia, R., Cereja, R., & Vinagre, C. (2018). Mortality, growth and regeneration following fragmentation of reef-forming corals under thermal stress. Journal of Sea Research, 141, 71-82. doi: 10.1016/j.seares.2018.08.008.

· De’ath, G., Lough, J., & Fabricius, K. (2009). Declining Coral Calcification on the Great Barrier Reef. Science, 323(5910), 116-119. doi: 10.1126/science.1165283.

· Nilsson, G., Östlund-Nilsson, S., & Munday, P. (2010). Effects of elevated temperature on coral reef fishes: Loss of hypoxia tolerance and inability to acclimate. Comparative Biochemistry and Physiology Part A: Molecular & Integrative Physiology, 156(4), 389-393. doi: 10.1016/j.cbpa.2010.03.009.

Dias et al. (2018) evaluated how elevated sea surface temperatures affected growth, mortality, and regeneration following the fragmentation of nine coral species in the Indo-Pacific, while De’ath et al. (2009) suggested that the ability of Great Barrier Reef corals to deposit calcium carbonate may have declined due to a decrease in the saturation state of aragonite and rising temperature stress in this region. The third paper, Nilsson et al. (2010), examined whether elevated temperature decreased tolerance of low-oxygen conditions in two species of coral reef fishes. This experiment used adult fishes of two species and tested their ability to acclimate to higher temperatures; it differed from the other two studies in that Dias and De’ath studied only the coral in these ecosystems, not the fishes. Dias found that previous injury did not affect the mortality, partial mortality, or growth rate of a coral fragment, whereas the species of coral and the ocean temperature had significant effects. Although De’ath’s study did not determine the cause of the decline in calcification of Great Barrier Reef corals, the authors did find that it was largely related to increasing ocean temperatures, which caused more thermal stress in coral populations. The Nilsson paper, in contrast, showed that certain species of coral reef fishes were unable to adjust to higher ocean temperatures, which are rising because of global warming.

Analysis


Introduction

When the three articles’ introductions were evaluated, both similarities and differences stood out. For example, the titles of the articles varied in appropriateness. Nilsson’s title told its audience what the researchers examined, but it was long and bulky. In my opinion, it could have been shortened or rephrased to grab the audience’s attention more quickly, even with a change as simple as “Effects of elevated temperature on coral reef fishes.” Dias’s title, however, was accurate and concise: “Mortality, growth and regeneration following fragmentation of reef-forming corals under thermal stress” explained exactly what was examined in the study. De’ath’s title matched the contents of the paper as well.

The statements of purpose in all three abstracts matched the introductions. The Dias abstract stressed that the impacts of thermal stress on regenerating coral fragments urgently needed to be explored, while De’ath’s abstract was well written, telling readers how many coral colonies were studied and what the results showed. The abstract of Nilsson’s paper stated its purpose within the first two sentences: to show that two species of coral reef fishes in the Great Barrier Reef are failing to acclimate to higher sea surface temperatures. This was plainly stated in both the abstract and the introduction of the article.

The hypotheses of the three articles varied greatly. Dias stated that global climate change has led to rising sea surface temperatures and ocean acidification, which jeopardize coral reef survival. With this sentence, the authors made it clear why their study was so urgent. Nilsson followed a similar pattern, clearly stating concern about the inability of coral reef fishes to acclimate to rising water temperatures. De’ath’s hypothesis was stated in the abstract, which suggested that increasing thermal stress may be depleting the ability of Great Barrier Reef corals to deposit calcium carbonate. Thus, the hypotheses of all three articles were given. I also found that Dias, De’ath, and Nilsson all arranged their information well, building toward what the experimental design included and what the researchers hoped to accomplish.


Methods

The sample selection among the three articles showed great contrast. The Dias paper used nine reef-forming coral species, while De’ath’s study examined 328 colonies of coral from the same genus, Porites, which is a stony coral. Nilsson studied adults of two species of coral reef fishes. For Dias’s paper, the methods were easy to follow and seemed easy to repeat, while De’ath’s methods were harder to follow because not all of the details appeared to be listed. The Methods section of the Nilsson article was valid and detailed enough that another group could repeat most of the study. I say only most because, although the article listed when and where the experiment was conducted, the number of adult fish of each species caught and analyzed was not given in the Methods section. This information is crucial, because a small sample size could undermine the conclusions, while a large sample size would support them more strongly. Furthermore, if one species had a much larger or smaller sample than the other, the two species would not be equally well represented in the results. The number of samples was given in both of the other articles.

While some articles had strong Methods sections, others were missing key components. The experimental design for the three articles chosen all seemed valid. De’ath’s study seemed valid for the experiment being conducted, though I am unsure that this study could be repeated using the paper alone. The experimental design did make sense overall, in that Porites is commonly chosen for sclerochronological analyses because they have annual density bands that are widely distributed. Portites coral also has the capability of growing for hundreds of years, so choosing this genus of coral for a large analysis made sense. Using the three growth parameters De’ath mentioned—skeletal density, calcification rate, and annual extension rate—are good parameters to look at in a genus of coral that has such a long life span. Dias justified every step of his experimental design, making it easy to repeat the process. For instance, Dias utilized contrasting morphologies, because they have different susceptibilities to thermal stress, giving the overall results more credit. These corals were held in captivity for several years, giving the researchers knowledge of the corals’ thermal history. Twenty fragments were cut from each of the nine species, half of which were used as a control. Sources of variations were eliminated in this process by cutting only one coral from each colony. These methods appear valid, and each is given a reason as to why a scientist would conduct the experiment in this way, making the overall flow of the methods logical and easy to follow. This was similar to De’ath’s paper in that De’ath listed the parameters used to test the samples, and he mention that Porites has such a long lifespan, so these types of corals have been proven to record environmental changes within their skeletons. This statement justified why De’ath chose this coral and explained why these particular parameters were chosen. However, he did not specify how to conduct these analyses. 
Also, although the data was collected within a two-month period for both Dias’s and Nilsson’s experiments, De’ath’s study was a composite collection from the years 1900-2006, containing over sixteen thousand annual records from corals ranging from ten to 436 years old. Hence, the broad range of years over which the specimens were collected was overwhelming, not to mention the three growth parameters that the paper mentioned but again failed to explain. Lastly, I found that Nilsson’s experimental design was carried out well, using adults of the two species of coral reef fishes and varying temperatures suited to testing the hypothesis. However, not including the number of each species caught weakens the data to a certain degree. Overall, I found that the Methods section of Nilsson’s paper was logical but left out details that were pertinent to the experiment, whereas Dias included all pertinent information and De’ath failed to explain how the chosen parameters were measured.


Results

Since the focus of each study varied, the results were also quite different in composition. Dias et al. (2018) found that injury, whether present or absent, had no impact on the death or growth rate of the coral fragments studied. The researchers determined that the true factors that affected the death and growth rates of the corals analyzed were temperature and the coral species itself. These results were illustrated using tables and figures. Table 1 was difficult to follow, because some of the column headings were abbreviated using terms not explained within the article. However, the numbers in the table coincide with the text, showing that injury did not affect the growth or mortality rate of the coral fragments used in this experiment. The results in De’ath’s paper were easy to follow, but showed that the causes of the decline in coral calcification on the Great Barrier Reef (GBR) were still not known. Within the Results sections of the articles written by Nilsson and De’ath, I saw that the figures and tables matched the text without repeating the same information to the audience. The figures and tables were consistent with what the text had stated, showing P values that were statistically significant, and the data was very easy to understand. The table in Dias’s article could have been better presented if the abbreviations had been accompanied by a key that explained what each header meant. I did not find any discrepancies between the figures and text of the three articles as far as percentages were concerned.

The results found in the three studies did test the hypotheses of the researchers. For Dias’s paper, these results were shown in Figure 1, which illustrated that as temperature increased, the mortality rate of the coral species also increased. As Dias mentioned in the Abstract section, two coral species survived this experiment, Turbinaria reniformis and Galaxea fascicularis. The results of the Nilsson paper likewise tested that study’s hypothesis, which was that after a given number of days at varying temperatures, the coral reef fishes studied would fail to acclimate to those temperature changes. Again, in the third paper, De’ath’s results tested the hypothesis, drawing on 328 colonies of massive corals from 69 reefs, which made the results broader.


Discussion

I found that none of the three articles repeated the same information in both the figures and the text of the article. From the Discussion section of the Dias paper, a reader could identify the main point of the article, which was to show that different coral reef species vary in their susceptibility to thermal stress. These coral reef species had the lowest mortality, partial mortality, and levels of bleaching at 26 degrees Celsius, and their growth rate was also at its highest at this temperature. Dias found that the regeneration rate of corals generally increased as the temperature increased. These results also show that the bleaching resistance capacity of most of the corals analyzed was overcome at 32 degrees Celsius. Because this paper is so new (published in 2018), I could not find other research that supported its interpretations. However, the article does state the direction in which the research is headed and lists other studies similar to this one.

The findings of De’ath’s and Dias’s articles were supported by each other, as well as by many other articles on coral reef ecology. However, I found very few articles that supported the conclusions drawn by Nilsson (2010) about the effects of increasing temperatures on two species of coral reef fishes, which was a weakness of the paper. In all three journal articles, I found that the interpretation of data was logical.

Conclusion


Summary

The three articles, overall, had both strengths and weaknesses. For instance, all three papers were peer reviewed. Nilsson’s paper had few other articles that backed up its findings, while the Dias and De’ath papers backed up each other’s findings. The article by Dias (2018) was very recent, which made it one of the newest published papers in its field, while the De’ath and Nilsson papers were a few years older. Although the Dias paper had tables and figures that were not entirely straightforward, the content of the article itself was very easy to follow. Each section within the article was set apart, whereas in the article by De’ath, the sections (introduction, methods, etc.) were not separated from each other. The Dias and De’ath papers had appropriate titles, while Nilsson’s title was too long. The abstracts and introductions matched for all three articles. The pace of the Dias article was great, leading the audience straight into the hypothesis and objectives of the analysis. Overall, Dias and the other researchers took many steps to ensure accurate results, and the Methods section of this article is explained well enough to be repeated, which differed from Nilsson’s paper in that Nilsson did not include his sample size. When it came to reproducing the experiment, Dias included all pertinent information, but De’ath failed to explain how the chosen parameters were measured.

De’ath had a concise paper with text that accurately related to the figures mentioned. Overall, the article presented its findings briefly, but it was not as easy to follow as it could have been if proper sections and subsections had been used. The title seemed appropriate, and readers know from the statement of purpose and the introduction that the primary goal of this study was to help determine what is causing the decline in corals’ ability to lay down the calcium carbonate skeleton that builds coral reef ecosystems. The De’ath paper used a large sample, which made the results seem more inclusive than those from Dias’s sample of nine species of corals with twenty fragments of each species. Nilsson’s article had excellent figures that were easy to interpret, in contrast to Dias’s paper, which did not explain what some of the abbreviations in the tables meant. The strengths and weaknesses of the three papers varied greatly.


Significance

When evaluating the role these articles play in the world, I found striking similarities. Nilsson’s article was primarily concerned with populations of two fish species that live in the Great Barrier Reef, while De’ath and Dias wrote papers on the responses of different coral species to increasing surface temperatures. De’ath’s article has practical significance similar to that of the Dias paper, in that major ecosystems are dying as a result of rising ocean surface temperatures, and these researchers tried to find ways to explain these issues. Nilsson’s article examined whether or not increasing temperatures reduced the hypoxia tolerance of coral reef fishes. It has been cited sixty-nine times, in papers on the hypoxia tolerance of coral reef fishes, on how temperature and hypoxia affect the respiratory performance of certain tropical fishes, and in many other similar studies (Nilsson et al., 2010). De’ath et al. (2009) has been cited twenty-four times and has sparked interest in similar research on the Great Barrier Reef in the last decade. The Dias et al. paper was only published in 2018, so not many other researchers have cited it yet. This can be seen as a potential problem; however, the article was peer-reviewed by individuals who are well-educated in this particular field. The currency of the Dias article may also be seen as a good attribute, showing that this information was some of the newest in its field of interest. The research in all three articles has significance for today’s society, in that the bleaching of coral reefs has become a growing problem, and without more research to determine what factors are causing this issue, large hypoxic zones in aquatic ecosystems may result.

Overall, all three of these articles illustrated environmental significance. Human survival depends on the biodiversity of plants and animals, and many animals live in these coral reef ecosystems.


Works Cited

De’ath, G., Lough, J., & Fabricius, K. (2009). Declining coral calcification on the Great Barrier Reef. Science, 323(5910), 116-119. doi: 10.1126/science.1165283

Dias, M., Ferreira, A., Gouveia, R., Cereja, R., & Vinagre, C. (2018). Mortality, growth and regeneration following fragmentation of reef-forming corals under thermal stress. Journal of Sea Research, 141, 71-82. doi: 10.1016/j.seares.2018.08.008

Nilsson, G., Östlund-Nilsson, S., & Munday, P. (2010). Effects of elevated temperature on coral reef fishes: Loss of hypoxia tolerance and inability to acclimate. Comparative Biochemistry and Physiology Part A: Molecular & Integrative Physiology, 156(4), 389-393. doi: 10.1016/j.cbpa.2010.03.009
