Methods¶
Overview¶
A simplified scheme of zDB
annotation workflow is shown in Figure 1. In summary:
Annotated genome assemblies integrated into ChlamDB were downloaded from GenBank (or RefSeq if the GenBank assembly was not annotated).
Protein sequences were annotated based on data from multiple source databases and annotation softwares.
- Protein sequences were clustered into orthologous groups (orthogroups) with OrthoFinder.
An alignment and a phylogeny was reconstructed for each orthologous group.
Data were then integrated into a custom SQL database derived from the bioSQL relational model.
NextFlow pipeline¶
The annotation pipeline summarized in Figure 1 was fully automated using NextFlow . The code with detailed software versions and command line parameters is available on github here.
Selection of PVC genomes integrated into the database¶
They are more than 2’000 genome assemblies classified as part of the PVC superphylum (NCBI Taxid 1783257)
on the NCBI taxonomy website. Those genomes are highly redundant (with a high number of genomes of the
same species, in particular for species of the Chlamydia genus). In order to limit the size of the database,
not all genome assemblies were included into the database.
ChlamDB version 2.0
(June 2019) includes:
All complete PVC genomes (June 2019).
All draft Chlamydiae genomes excluding draft genomes from species of the Chlamydia genus with at least one complete genome (June 2019).
Note
Draft genomes of the most studied Chlamydia species were excluded because they are highly redundant
Note
Draft genomes of Planctomycetes, Lentisphaerae, Verrucomicrobia and Kiritimatiellaeota phyla were excluded to limit the total number of genomes in the database. Additional genomes of clades of interest might be integrated in future releases of the database.
The list of genomes is available on ChlamDB home page. Complete genomes can be distinguished from draft genomes based on the number of contigs (“N. contigs” column) since draft assemblies exhibit more than one contig.
Protein annotation¶
Protein sequences were annotated using data from multiple databases and various bioinformatic softwares (Figure 1). Those annotations are summarized on the “locus” page available for each protein (see for instance the entry of CT_495).
Transporters
were annotated using the Transporter Classification Database (TCDB) and the gBLAST3 software from the BioV suite. Detailed scores including the E-value, score, percentage of identity, the number of transmembrane domains identified in the query and the reference protein, the query and hit coverage are reported on the locus page.
COG
annotation was done with rpsblast (with an e-value cutoff of 0.001) using position-specific scoring matrix (PSSM) from the Conserved Domain Database (CDD)
- Proteins were assigned to Kegg Orthologs (KO) numbers with KofamScan (default parameters) and KOfam (a database of Hidden Markov Models [HMM] of KEGG Orthologs). KofamScan rely on HMMER to search the HMM database.
The KEGG API was used to retrieve the annotation of
Kegg Orthologs
and mapping toKEGG modules
,KEGG pathways
and Enzyme Commission numbers.Enzyme Commission numbers were assigned based on Kegg Orthologs annotations and with PRIAM, a method for automated enzyme detection in fully sequenced genomes. PRIAM uses position-specific scoring matrix (PSSM) of all sequences available in the ENZYME database.
- InterProScan was used to identify protein signatures such as
transmembrane domains
,Pfam domains
,CDD domains
,… Detailed InterPro annotations as well as scores associated with each annotation are reported on the “InterproScan” tab of each locus page.
The InterProScan match lookup service was used to retrieve pre-calculated InterProScan results for protein integrated into the InterPro database. For sequences not in the lookup service, InterProScan was used to analyse sequences from scratch.
Mapping to STRING (a database of protein-protein interactions) was done by exact protein sequence match. Protein sequences from SRING were indexed using SEGUID checksums. Associations to the scientific literature (
PMID
tab visible only if references were found) were retrived using the STRING API.Protein subcellular localization predictions were done using PSORTb (withe the
--negative
argument).
- Effectors of the type III secretion system were predicted using four different softwares:
BPBAac (default parameters)
effectiveT3 (model
TTSS_STD-2.0.2
, cutoff of0.9999
)DeepT3 (default parameters)
T3_MM (default parameters)
TODO: UNIPARC - SwissProt
Homology search (RefSeq and SwissProt)¶
The closest identifiable RefSeq homologs were searched with Diamond (parameters:
--max-target-seqs
200-e
0.01--max-hsps
1). The 200 first hits were retained for each protein.The closest identifiable SwissProt homologs were searched with BLASTp (parameters:
-evalue
0.001). The 100 first hits were retained for each protein.
Identification of Orthogroups¶
Orthogroups (or orthologous groups) were identified with OrthoFinder. This tools identify orthogroups based on BLASTp (parameters: -evalue
0.001)
results using the MCL clustering software.
Note
Orthologs
are pairs of genes that descended from a single gene in the last common ancestor (LCA) of two species.
Note
An orthogroup
is the group of genes descended from a single gene in the last common ancestor (LCA) of a group of species.
As gene duplication and loss occur frequently in bacteria, we rarely have exactly one ortholog in each considered genome.
Orthogroup multiple sequence alignments & phylogenies¶
Protein sequences of each orthogroup were aligned with MAFFT (default parameters). A phylogeny was then reconstructed for each orthogroup of three or more sequences. The phylogeny was reconstructed with FastTree with default parameters. The node support values are not traditionnal boostrap support values. FastTree uses the Shimodaira-Hasegawa test with 1,000 bootstrap replicates to quickly estimate the reliability of each split in the tree. Values higher than 0.95 can be considered as “strongly supported”.
Phlogeny including top RefSeq hits¶
A second phylogeny was reconstructed for each orthogroup. This phylogeny includes the 4 best non-PVC RefSeq hits of each protein (see the Homology search paragraph).
First, the NCBI taxon ID of each RefSeq hit was retrieved using the prot.accession2taxid
mapping file available from the NCBI taxonomy ftp website. PVC hits were removed and the amino acid sequence of the 4 best non-PVC hits was retrieved from the NCBI using the biopython interface to Entrez. Protein sequences of each orthogroup + RefSeq homologs were aligned with MAFFT (default parameters) and the phylogeny was reconstructed with FastTree (default parameters).
This phylogeny allows users to check whether PVC proteins form a monophyletic group or if they are for instance hints of horizontal gene transfer(s) with bacteria from other phyla.
Calculation of pairwise protein identity¶
Pairwise protein sequence identities reported on ChlamDB were calculated based on the multiple sequence alignments of orthologous groups (two see previous paragraph). The identity between two sequences can be calculated in different ways (see this blog post discussing that topic). We calculated the identity by excluding gaps and calculating the identity based on aligned positions only:
number of matches / ( number of matches + number of mismatches)
Nevertheless, if the alignment covered less than 30% of one of the two compared sequences, the identity was set to 0.
Circular genome plots (Circos)¶
Circular genome plots are generated dynamically with Circos. These plots are not generated based on the alignment of DNA sequences but based on orthology data (see previous paragraphs). The two outer gray circles show the location of open reading frames (ORFs) encoded on the leading and lagging strand of the reference genome (Figure 2 A and B). The red/blue inner circles show the conservation of each protein encoding ORF in one or multiple other genomes. Identity values were pre-calculated from protein alignments (see previous paragraphs). If the compared genomes encode more than one ortholog, the highest identity is used to color the region.
Species phylogeny¶
The reference phylogeny was reconstructed with FastTree (default parameters, JTT+CAT model) based on the concatenated alignment of 32 single copy orthogroups conserved in at least 266 out of the 277
genomes part of ChlamDB 2.0
.
Prediction of protein-protein interactions¶
The interactions reported in the “interactions” tab were predicted based on two different approaches: phylogenetic profiling
and conservation of gene neighborhood
. Indeed, proteins that physically interact tend to be encoded in a close neighborhood and tend to co-occur in the same genomes (see Dandekar et al and Kensche et al. for more details).
1 Phylogenetic profiling¶
Orthogroup exhibiting similar patterns of presence/absence were identified by calculating the euclidian and jaccard distances of all pairs of orthologroups phylogenetic profiles (see Figure 3). See Kensche et al. for a review on phylogenetic profiling methods.
Phylogenetic profiles were collapsed at the species level. If an orthogroup was present in only a subset of the strains of the considered species, it was still considered as present in that species.
Pairwise euclidean and jaccard distances were calculated between pairs of profiles
The default euclidian distance cutoff to report interactions is
2.2
. If more than 30 profiles had an euclidian distance smaller or equal to2.2
, the stringency was adjusted incrementally with cutoffs of2
,1
and finally0
. If more than 30 profiles exhibit the exact same profile, nothing is reported. Indeed, it would by typically poorly informative profiles (e.g. proteins conserved in all species).
Note
Have a look at the predicted interactors of the cell shape-determining protein MreB. The detailed profile of each orthogroup is available in the “profile” tab.
Warning
The profiles 5 and 6 in Figure 3 are strictly identitcal but poorly informative since orthologs were identified in all considered species. Spurious associations like this one cannot always be automatically discarded. Thus, those associations should always be interpreted with caution. The detailed profile of associated orthogroups is always provided (“profile” tab).
2. Identification of conserved gene neighborhood in different species¶
Genes encoded in a close neighborhood in distantly related species were identified based on orthology data (see Identification of Orthogroups). See for example the case of the subunits of the F-type ATPase that are systematically encoded in a close neighborhood. The method was the following:
protein encoding genes were iterated for each genome
if the considered protein encoding gene had one (or multiple) homolog(s) in distantly related genomes (a cutoff of 60% median protein identity was used to consider two genomes sufficently distant)
the 10 kilobases upstream and downstream of each ortholog were compared to the reference genome to identify conserved neighbors (see Figure 4.A)
the ratio of co-occurence of pairs of orthogroups was used to score the association between pairs of locus (Figure 4.B)
only proteins exhibiting a
conservation score of 0.8
are reported in the “interactions” tab of locus pages
Warning
Since orthologous groups identified by OrthoFinder include both orthologs and paralogs, the relationship between pair of genomes is not necessarily 1 to 1 (have a look at the OMA website to read detailed explanations of the different types of orthologs). If multiple orthologs were encoded in the compared genome, only the most similar one was used for the comparison (Figure 4.C). The choice of the closest locus was based on protein identity and not based on phylogenetic trees, which means that we might not always consider the most closely related protein encoded in the compared genome for the comparison.
Prediction of candidate type III secretion system effectors¶
Effectors of the type III secretion system are poorly conserved. In addition, the signal allowing to specifically secrete effector proteins is not clearly identified. It is thus difficult to identify effectors in newly sequenced genomes. Predition methods generally rely on the use classifiers that are trained on a set of known effector proteins. Since all Chlamydiae genomes encode a type III secretion system, candidate effectors were identified using four different classifiers:
BPBAac (default parameters)
effectiveT3 (model
TTSS_STD-2.0.2
, cutoff of0.9999
)DeepT3 (default parameters)
T3_MM (default parameters)
Results of all four tools are reported on the locus page of each protein (in the T3SS effectors prediction
section).
Taxonomic profile of Pfam domains and COGs¶
Coding sequences of the 6661 reference and representative genomes available from RefSeq (September 2017) were downloaded from RefSeq ftp (ftp://ftp.ncbi.nlm.nih.gov/refseq/). Hidden Markov models of the PFAM database were used to search for sporulation-related domains using hmmsearch (HMMER version 3.1b2). Hits were filtered based on PFAM trusted cutoffs. Genomes were then grouped by order based on the NCBI Taxonomy database with the python library ete3. Only orders exhibiting a minimum of 5 genomes are reported of the figure.
see for instance the profile of the Pfam domain PF01823 (MAC/Perforin domain): https://chlamdb.ch/pfam_profile/PF01823/phylum
Phylogenetic profiles of COGs are based on data from the eggnog database (v4.51). Eggnog provides Clusters of Orthologous Groups annotation of 2,031 genomes. Taxonomic profiles of non-supervised orthologous groups (NOGs) were extracted from the file NOG.members.tsv . The 2,031 genomes were classified according to the classification of the NCBI taxonomy database.
see for instance the profile of COG2385 (Peptidoglycan hydrolase enhancer domain protein): https://chlamdb.ch/eggnog_profile/COG2385/phylum
Evaluation of the quality and completeness of draft genomes¶
The completeness and contamination of each genome was evaluated with checkM (v1.0.12). Completeness and contamination estimates are reported on the the home page phylogeny.
Pubmed References¶
Links to Pubmed references were retreived from two difference sources: STRING (v11) and PaperBLAST (September 25 2019_release).
STRING references are derived from text mining of scientific publications. Only PubMed references from identical protein sequences were retrieved from STRING.
PaperBLAST links to research publications come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), CharProtDB, MetaCyc, EcoCyc, REBASE, and the Fitness Browser (see PaperBLAST paper). All proteins were blasted against the PaperBLAST database (evalue cutoff of 10^-3). Links to research publications from the top 20 best hits were integrated into ChlamDB.
Protein accessions mapping¶
Protein accessions from multiple reference databases such as RefSeq, GenBank and UniProt can be used to search for ChlamDB entries. The mapping between the various identifiers was donc
by exact protein sequence match for UniParc/UniProt
was extracted from RefSeq genome assembly records for RefSeq. The “old_locus_tag” qualifier was used to map RefSeq locus tags with Genbank locus tags (see RefSeq documentation)
SQL database and Web Interface¶
TODO * database size 80Gb * Figure generated from stores data
Note
Thee SQL database is about 80GB in size. Protein alignments are not stored into the database.
Source databases¶
Database name |
Version |
---|---|
UniprotKB-UNIPARC |
2019.03 |
TCDB |
2019.06 |
RefSeq |
90 |
CDD (COG) |
v3.17 |
InterProScan |
5.35-74.0 |
STRING |
v11 |
PDB |
2019.06 |
COG/CDD |
v3.17 |
KoFam |
2019.04.09 |
PRIAM |
2018.06 |
EggNOG |
4.51 |
Software versions¶
Software name |
Version |
---|---|
FastTree |
2.1.10 |
Diamond |
0.9.24 |
OrthoFinder |
2.2.7 |
BLAST |
2.7.1 |
CheckM |
1.0.12 |
KoFamScan |
2019/4/9 |
Mafft |
7.407 |
PSORTb |
3.0.6 |
HMMER |
3.1b2 |
Code availability¶
Website interface |
|
Annotation pipeline |
|
Public database download and indexing |