Website tutorial¶
Warning
This page is under construction. Contact Trestan Pillonel (trestan.pillonel@chuv.ch) if you have any questions or suggestions regarding the website or its documentation.
This page helps the user understand and perform the analyses offered in the web interface.
This tutorial presents a database composed of a collection of Enterobacteriaceae genomes and uses the fliLMNOPQR operon, and FliL in particular, to illustrate the available analyses. This operon is FlhDC-dependent and encodes important regulatory factors and structural components of the membrane-spanning basal body of the flagellum.
HOME page¶
In the HOME page, the ‘Genome’ and ‘Phylogeny’ sections (see the paragraph below) give the user an overview of the genomes included in the database, and a summary reports the status of the database: which analyses are available and which were not included when the database was generated. Note that adding new analyses requires re-running the pipeline (-resume argument; see the documentation).
- Direct access to the available analyses is provided in:
the home page
the left menu to facilitate the navigation
Genome table and phylogeny¶
The genome table is a first step to globally evaluate the content of the database: it gives details about the contigs and the loci identified in each genome (the clickable locus tags redirect to the Protein annotation view page).
The phylogeny, built with FastTree on concatenated single-copy orthologs, shows the evolutionary relationships between the input genomes, together with essential data for assessing the quality of the given sequences.
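To illustrate what "concatenated single-copy orthologs" means, the following is a minimal sketch, not the pipeline's actual implementation. It assumes each orthogroup is already aligned, keeps only the orthogroups with exactly one sequence per genome, and joins the aligned sequences genome by genome. The function name, the genome labels and the sequences are hypothetical placeholders.

```python
def concatenate_single_copy(alignments, genomes):
    """Concatenate aligned sequences of single-copy orthogroups.

    alignments: dict mapping orthogroup id -> list of (genome, aligned_sequence).
    genomes: names of all genomes that must each contribute exactly one sequence.
    """
    genomes = list(genomes)
    parts = {genome: [] for genome in genomes}
    for group in sorted(alignments):
        members = alignments[group]
        owners = sorted(genome for genome, _ in members)
        if owners != sorted(genomes):
            continue  # orthogroup is missing from a genome or duplicated in one
        by_genome = dict(members)
        for genome in genomes:
            parts[genome].append(by_genome[genome])
    return {genome: "".join(seqs) for genome, seqs in parts.items()}


# Made-up alignments for two placeholder genomes.
alignments = {
    "og1": [("A", "MK-L"), ("B", "MKTL")],
    "og2": [("A", "AA"), ("A", "AG"), ("B", "AA")],  # two copies in A: skipped
    "og3": [("A", "GG"), ("B", "G-")],
}
concatenated = concatenate_single_copy(alignments, ["A", "B"])
print(concatenated)
```

The concatenated alignment (one long sequence per genome) is the kind of input a tree-building tool such as FastTree works from.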
Homology search - Blast¶
Perform a blast search of one or more sequences of interest against one or more genomes of the database.
Either an amino-acid or a nucleotide sequence can be given as input.
- Set up:
- the type of homology search according to the input file:
blastp, tblastn with an aa sequence
blast_ffn, blast_fna, blastx with a nt sequence
e-value
maximum number of hits to display
target genome or all genomes
Note: If the search is performed against all genomes, the maximum number of hits should be set to ‘all’ to avoid losing high-identity hits.
In the reported example, the protein sequences of the genes of the fliLMNOPQR operon, extracted from the Enterobacter soli genome, are blasted (blastp) against all genomes of the database (Fig. 1).
This analysis shows whether any of these genes is present in the genomes and reports the number of hits and the alignment identity of each hit (Fig. 2 - Result 1; see the Protein annotation view page). Additionally, the generated annotated phylogeny facilitates the interpretation of their distribution and conservation across all the genomes. As shown in Fig. 2 - Result 2, four genomes carry all the investigated genes, fourteen genomes carry none of them, and the remaining ones have an incomplete set.
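The grouping of genomes into those carrying all, some, or none of the queried genes can be sketched as follows. This is only an illustration of the reasoning, assuming the hits are available as (query gene, genome) pairs; the function name, genome labels and hits are hypothetical and are not results from the database.

```python
from collections import defaultdict

# Genes of the operon used as queries in the tutorial example.
OPERON_GENES = ["fliL", "fliM", "fliN", "fliO", "fliP", "fliQ", "fliR"]


def classify_genomes(hits, genomes, genes=OPERON_GENES):
    """Group genomes by how many of the query genes have at least one hit.

    hits: iterable of (query_gene, genome) pairs, one per reported hit.
    genomes: all genome names, so that genomes without any hit are counted too.
    """
    found = defaultdict(set)
    for gene, genome in hits:
        found[genome].add(gene)

    summary = {"complete": [], "incomplete": [], "absent": []}
    for genome in genomes:
        n_found = len(found[genome] & set(genes))
        if n_found == len(genes):
            summary["complete"].append(genome)
        elif n_found == 0:
            summary["absent"].append(genome)
        else:
            summary["incomplete"].append(genome)
    return summary


# Made-up hits for three placeholder genomes.
example_hits = [(gene, "genome_A") for gene in OPERON_GENES]
example_hits += [("fliL", "genome_B"), ("fliM", "genome_B")]
result = classify_genomes(example_hits, ["genome_A", "genome_B", "genome_C"])
print(result)
```

Passing the full list of genomes matters: a genome with no hit never appears in the hit list, so it would otherwise be silently dropped rather than reported as lacking the genes.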
TIPS:
If you are interested in a specific gene expected to be present in one of the genomes of the database, you can either retrieve its sequence from a public database, such as SwissProt, or use the search bar in the left-side menu of the web interface. Type the gene name and identify which loci are annotated with that gene; clicking on one of them gives direct access to both the nucleotide and the amino acid sequence of the gene - see the Protein annotation view page below.
Compare the genomic regions around a protein of interest in selected genomes with the ‘MENU/Genome alignments/Plot region’ analysis - see the Genome alignments page below.
Comparisons¶
The content of this block of analyses depends on the settings defined during the generation of the database - see the documentation for an extensive explanation. It allows the user to compare several aspects of selected genomes and to perform comparative analyses for each annotation: a) Orthogroups are identified by default, whereas b) KEGG orthologs, c) COG clusters, and d) PFAM domains are identified only if requested during database generation.
- Before proceeding, here is a brief summary of the mentioned annotations and the links to their databases:
Kegg: Kegg annotations refer to the Kyoto Encyclopedia of Genes and Genomes (KEGG). The genome annotation has two aspects: a) KO assignment (a KO is the identifier given to a functional ortholog defined from experimentally characterized genes and proteins in specific organisms), and b) KEGG mapping, where each KO is assigned to a PATHWAY or MODULE identified from molecular networks. This database provides a highly curated and reliable description of the metabolic pathways of the annotated genomes.
COG: COG annotations refer to the database of Clusters of Orthologous Genes (COGs). In this database each COG is assigned to a functional category, including metabolic, signal transduction, repair and other pathways. This database allows an easy comparison of organisms based on their preference for certain pathways.
Pfam: Pfam annotations refer to the Pfam database, used to identify protein families and domains. Because proteins are combinations of fixed structural units, this database is based on the idea that identifying the domains within a protein can provide insights into its function.
Overview of Orthogroups analyses
Orthogroups are identified with OrthoFinder, an accurate platform that clusters sets of genes descended from a single gene in the last common ancestor of all the species being considered, as reported in its publication. In the following example, the orthogroup content is compared between the Enterobacter soli, Enterobacter asburiae, Enterobacter ludwigii, and Klebsiella variicola genomes.
List of analyses:
1A. Summary of the selected settings for the comparative analysis: the orthogroups of 4 genomes are compared, no orthogroup is excluded if present in other genomes, and orthogroups present in 3 out of the 4 selected genomes are also reported.
1B. List of identified orthogroups, with their description and distribution in the selected genomes: clicking on an Orthogroup entry redirects the user to the Orthogroup annotation summary page.
1C. List of locus tags for each orthogroup and genome: clicking on a locus tag redirects the user to the Protein annotation view page.
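The selection logic summarized in 1A (orthogroups present in at least 3 of the 4 selected genomes, with or without excluding those found in other genomes) can be sketched as below. This is an illustrative sketch only, not the code behind the web interface; the function name, parameters and orthogroup identifiers are hypothetical.

```python
def shared_orthogroups(presence, selected, min_present, excluded=()):
    """Return orthogroups found in at least `min_present` of the `selected`
    genomes and in none of the `excluded` genomes.

    presence: dict mapping orthogroup id -> set of genomes that carry it.
    """
    selected, excluded = set(selected), set(excluded)
    kept = []
    for group, genomes in sorted(presence.items()):
        if genomes & excluded:
            continue  # carried by a genome the user asked to exclude
        if len(genomes & selected) >= min_present:
            kept.append(group)
    return kept


# Made-up presence data for five placeholder genomes A-E.
presence = {
    "group_1": {"A", "B", "C", "D"},
    "group_2": {"A", "B", "C"},
    "group_3": {"A", "B"},
    "group_4": {"A", "B", "C", "E"},
}
kept = shared_orthogroups(presence, selected=["A", "B", "C", "D"], min_present=3)
kept_strict = shared_orthogroups(
    presence, selected=["A", "B", "C", "D"], min_present=3, excluded=["E"]
)
print(kept, kept_strict)
```

With no excluded genome, `group_4` is kept even though genome E also carries it, which corresponds to the "no orthogroup is excluded if present in other genomes" setting.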
In Fig. 4:
This analysis generates three plots that display the content and conservation of Orthologous groups in selected genomes of interest.
A: this plot shows the total number of Orthologous groups present in a set of genomes. If the green curve reaches a plateau, the pangenome can be described as ‘closed’, since additional genomes bring no new Orthogroups; if, on the contrary, the curve keeps growing as further genomes are added, the pangenome is ‘open’.
B: the red curve represents the core Orthogroups shared by all the genomes; it decreases more steeply the more the compared genomes differ.
C: the blue curve represents the number of Orthologous groups present in exactly n genomes, with n displayed on the x-axis. This representation is useful to appreciate, for example, how many Orthologous groups are present in all the genomes of interest, or the diversity brought by single genomes. For instance, if the value at tot-1 is low, it means that no specific genome brings unique Orthologous groups.
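The three curves can be reproduced on toy data with the sketch below. It is a simplified illustration that adds the genomes in one fixed order (the plots in the web interface may be computed differently, for example by averaging over several orders); the function name and the data are hypothetical.

```python
from collections import Counter


def pangenome_curves(genome_groups, order):
    """Return (pan, core, distribution) for genomes added in `order`.

    genome_groups: dict mapping genome name -> set of orthogroup ids.
    pan[i]  : orthogroups seen in the first i+1 genomes (curve A).
    core[i] : orthogroups shared by all of the first i+1 genomes (curve B).
    distribution[n-1] : orthogroups present in exactly n genomes (curve C).
    """
    pan, core = set(), None
    pan_counts, core_counts = [], []
    for name in order:
        groups = genome_groups[name]
        pan |= groups
        core = set(groups) if core is None else core & groups
        pan_counts.append(len(pan))
        core_counts.append(len(core))

    genomes_per_group = Counter(g for name in order for g in genome_groups[name])
    exactly_n = Counter(genomes_per_group.values())
    distribution = [exactly_n.get(n, 0) for n in range(1, len(order) + 1)]
    return pan_counts, core_counts, distribution


# Made-up orthogroup content for three placeholder genomes.
toy = {"g1": {"a", "b", "c"}, "g2": {"a", "b", "d"}, "g3": {"a", "e"}}
pan, core, distribution = pangenome_curves(toy, ["g1", "g2", "g3"])
print(pan, core, distribution)
```

In this toy case the pan curve keeps growing (3, 4, 5) while the core shrinks (3, 2, 1), the pattern described above for an ‘open’ pangenome.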
Additional plots for Kegg Orthologs and Cluster of Orthologous Groups (COGs)¶
As anticipated, the comparative analyses of Kegg and COGs come with additional plots:
1. Barchart of the distribution of the entries annotated with a COG/KEGG category in selected genomes. It allows the evaluation of a potential increase or decrease, in some genomes of interest, of entries known to be relevant for a certain function (Fig. 6).
Focusing on the COG ‘Cell motility’ category, we see that Klebsiella variicola has fewer annotations in that category than Enterobacter soli, Enterobacter asburiae, and Enterobacter ludwigii.
2 and 3. Heatmaps of the COGs across all the genomes, expressed as frequency or as number of identified entries (these plots are available only for COGs). Here the focus is again on the COG ‘Cell motility’ category: Klebsiella variicola has 67 loci annotated in this category, which represent 1.29% of its total number of loci, while in Enterobacter soli the proportion is more than double, at 2.76% of its loci.
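The difference between the two heatmaps (raw counts versus frequency) comes down to dividing each category count by the total number of loci of the genome. A minimal sketch, using placeholder numbers rather than values from the database, and a hypothetical function name:

```python
def category_frequency(category_counts, total_loci):
    """Convert per-category locus counts into percentages of all loci."""
    return {
        category: 100.0 * count / total_loci
        for category, count in category_counts.items()
    }


# Placeholder counts for one hypothetical genome with 4000 loci in total.
counts = {"Cell motility": 50, "Signal transduction": 120}
frequency = category_frequency(counts, total_loci=4000)
print(frequency)
```

Frequencies make genomes of different sizes comparable, whereas counts show the absolute number of annotated loci.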
Genome alignments¶
This set of analyses allows the user to align the genomes and check the conservation of specific regions of interest.
- Two plots can be generated:
Circos
Plot region
Circos¶
Genome alignments are visualized in an interactive circular layout. This plot can help identify genomic regions that are differentially distributed among the genomes of interest, the presence of potential plasmid(s), or, when looking at the GC composition for example, the products of other HGT events. Following the help box, it is possible to recognize which regions encode genes or tRNAs and to evaluate the conservation of the sequences by checking the identity percentages.
In Fig. 8 B, Enterobacter asburiae, Enterobacter ludwigii, and Klebsiella variicola are mapped against Enterobacter soli. The genomes appear similar in terms of gene content; however, Enterobacter soli carries a plasmid that is absent from the other genomes.
When the user clicks on a gene of interest, the Protein annotation view page is displayed and provides all the information about the function, distribution and conservation of this protein.
NOTE: regions present in one of the compared genomes but absent from the reference are not visualized. To see them, generate a new plot using that genome as the reference.
Plot region¶
The ‘Plot region’ analysis allows the user to explore a specific genomic region of interest. It plots the genomic features located in the neighborhood of a provided target locus (max 20000 bp) and displays the conservation of the protein of interest and of the genes in the flanking region among selected genomes.
In Fig. 8 B, the focus is on the fliL gene of the fliLMNOPQR operon in Enterobacter soli, Enterobacter asburiae, Enterobacter ludwigii, and Klebsiella variicola. The operon is highly conserved in the Enterobacter genomes but absent from Klebsiella variicola, which is therefore not shown in the plot (Fig. 8 B). (Note that the phylogeny obtained in Homology search - Blast already highlights the lack of these genes in Klebsiella variicola.)
Metabolism¶
This section provides a set of analyses to explore the metabolism of given genomes based on the KEGG Orthology database. It relies on the functional orthologs of the KO database, which are organized into molecular interaction, reaction and relation networks, named KEGG pathway maps, and into functional units of gene sets, named Kegg modules, associated with metabolism.
Kegg maps¶
This analysis shows the Kegg pathways of a genome of interest and which Kegg orthologs of each pathway are present, and compares their distribution in the other genomes. In the following example (Fig. 9), the Kegg pathways present in the Enterobacter soli genome are listed (235 pathways in total) and a heatmap of the KOs of the flagellar pathways is shown. The page also provides a direct link to the official Kegg page to evaluate how complete this Kegg map is (the KOs present in Enterobacter soli are shown in red).
Kegg modules¶
Discover the KOs of Kegg modules, organized in categories and subcategories, for a genome of interest or a subset of genomes (Fig. 10). Three types of search are available:
NOTE: Searches 1 and 3 come with a link to the Kegg module overview page (see below).
Kegg module overview page¶
This page is accessible by clicking on a Kegg module entry in the ‘Metabolism/Kegg module’ analysis or in the ‘Locus tag overview page’. It gives access to the list of KO entries that form the Kegg module of interest and indicates the completeness of the Kegg module in the genomes of the database.
The reported example is based on the KO entries of Kegg module M00049, which describes adenine ribonucleotide biosynthesis (IMP => ADP, ATP) and belongs to the Nucleotide metabolism category and the Purine metabolism subcategory. Four genes are required for a complete module, and one of them can be any one of a set of four redundant genes. Among the genomes of the dataset, all except three have a complete module.
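The idea of module completeness with redundant genes can be sketched as follows: each required step is a set of alternative KOs, and a step is satisfied when the genome carries at least one of them. This is a simplified illustration (real KEGG module definitions can be more deeply nested), and the KO identifiers below are placeholders, not the actual KOs of M00049.

```python
def module_completeness(steps, genome_kos):
    """Return (satisfied_steps, total_steps, is_complete).

    steps: list of sets; each set holds the alternative KOs that fulfil one step.
    genome_kos: set of KOs annotated in the genome.
    """
    satisfied = sum(1 for alternatives in steps if alternatives & genome_kos)
    return satisfied, len(steps), satisfied == len(steps)


# Four required steps; the last one accepts any of four redundant KOs.
steps = [
    {"KO_1"},
    {"KO_2"},
    {"KO_3"},
    {"KO_4a", "KO_4b", "KO_4c", "KO_4d"},
]
complete = module_completeness(steps, {"KO_1", "KO_2", "KO_3", "KO_4c"})
partial = module_completeness(steps, {"KO_1", "KO_4a", "KO_4b"})
print(complete, partial)
```

Carrying several redundant KOs for the same step does not increase the count: the second genome satisfies only two of the four steps.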
Protein annotation view¶
This page provides a complete overview of a selected locus of interest. The annotations are automatically retrieved from the .gbk files given as input, while further annotations from the COG, KEGG, Pfam, SwissProt, and RefSeq databases are assigned only upon request. (Note that RefSeq annotations are highly demanding in computation and time.)
In the example reported (Fig. 12), the page displays the locus tag ENTAS_RS13815 of Enterobacter soli annotated with the fliL gene. The following info can be retrieved from the ‘Overview’ page:
From the ‘Overview’ page further plots are accessible (Fig. 13): the phylogenetic distribution of the orthogroup of the locus tag (A) and a phylogeny of its homologs, with dedicated attention to the Pfam domains composing them (D). Additionally, SwissProt and RefSeq annotations are listed to further evaluate the best homologs according to these databases (B and C), and the best RefSeq hits are included in the homologs phylogeny (E). These analyses help to better characterize the locus when, for example, the other annotations are not consistent, to infer horizontal gene transfer occurrences, and to observe potential dissimilarities/similarities in terms of Pfam domains between members of the same orthogroup.
NOTE: Clicking an entry in the boxes with Kegg, COG, and Pfam annotations redirects you to the corresponding explanatory overview page. The three outputs are similar: links to external sources, occurrences in the proteins of the orthologous groups, the list of locus tags with that annotation in all the genomes of the database, and a phylogeny of the dataset annotated with the number of hits for that annotation and their distribution in the orthologous groups.
Orthogroup annotation summary¶
This page overlaps in several respects with the Protein annotation view page; however, it focuses on the orthogroup rather than on a single member and its homologs. Indeed, the homologs of a locus tag may be split across several orthogroups.
Notably, this page provides the alignment of the members of the orthogroup, in which amino acid substitutions can be easily observed (Fig. 14 A).
KO/COG/Pfam annotation summary¶
A summary page of each COG, Pfam, and Kegg entry is accessible in the web interface through the analyses in the Comparisons section pages, through the Protein annotation view page, and from the Metabolism section pages.
Each page provides a complete overview of the investigated annotation within the database and also includes external links.
- It is organized in three sections, shown in Fig. 15 for the Pfam domain PF03748:
General: reports how many loci carry that annotation, combined with the Orthogroups classification.
Protein list: list of all locus tags with that annotation
Profile: phylogeny annotated with a heatmap of the entries with that annotation and their distribution into Orthogroups
Search bar¶
The search bar at the top of the left-side menu recognizes the following entries:
Name | Example |
---|---|
KO entry | K02415 |
COG entry | COG1580 |
COG name | Glutamate-1-semialdehyde aminotransferase |
Gene name | fliL |
Gene product | flagellar basal body-associated protein FliL |
Locus tag accession name | ENTAS_RS13815 |
Organism | Enterobacter soli |
It is built with Whoosh and also accepts combinations of terms separated by AND/OR for more complex searches.