Citation
Example
in silico ‘subtractive hybridization’
Estimate of the species gene pool
Data format
Prepare user's genome sequence
References
Tools/Databases Used
Contact
mGenomeSubtractor performs a mpiBLAST-based in silico ‘subtractive hybridization’ by comparing selected closely related genomes to generate a list of conserved or strain-specific fragments which may provide clues as to the phenotype, specific environmental adaptation and/or disease-association linked with a particular bacterium.
Citation
Y. Shao, X. He, C. Tai, H.Y. Ou, K. Rajakumar and Z. Deng (2010) mGenomeSubtractor: a web-based tool for parallel in silico subtractive hybridization analysis of multiple bacterial genomes. Nucleic Acids Research, doi:10.1093/nar/gkq326. [Abstract]

mGenomeSubtractor performs an mpiBLAST-based comparison of reference bacterial genomes against multiple user-selected genomes for investigation of strain variable accessory regions. With parallel computing architecture, mGenomeSubtractor is able to run rapid BLAST searches of the segmented reference genome against multiple subject genomes at the DNA or amino acid level within a minute. In addition to comparison of protein coding sequences, the highly flexible sliding window-based genome fragmentation approach offered can be used to identify short unique sequences within or between genes. mGenomeSubtractor provides powerful schematic outputs for exploration of identified core and accessory regions, including searches against databases of mobile genetic elements, virulence factors or bacterial essential genes, examination of G+C content and binucleotide distribution bias, and integrated primer design tools. mGenomeSubtractor also allows for the ready definition of species-specific gene pools based on available genomes. Pan-genomic arrays can be easily developed using the efficient oligonucleotide design tool. This simple high-throughput in silico ‘subtractive hybridization’ analytical tool will support the rapidly escalating number of comparative bacterial genomics studies aimed at defining genomic biomarkers of evolutionary lineage, phenotype, pathotype, environmental adaptation and/or disease-association of diverse bacterial species. mGenomeSubtractor is freely available to all users without any login requirement at http://bioinfo-mml.sjtu.edu.cn/mGS_llm/.
Comparative genomics based in silico ‘subtractive hybridization’
Example: All annotated CDS in the Acinetobacter baumannii strain AYE genome were compared by BLASTN against the complete genome sequences of A. baumannii SDF and ATCC 17978.
Fragment approach: protein-coding sequences (CDS)
Reference genome: A. baumannii AYE (NC_010410)
Subject genomes: A. baumannii SDF (NC_010400) and ATCC 17978 (NC_009085)
 
(I) Determination of strain-specific CDS (or fragment)
mGenomeSubtractor uses an H-value criterion that reflects the degree of similarity, in terms of the length of match and the degree of identity at the nucleotide level, between each CDS examined and each of the comparator genomes (Fukiya et al., 2004).
(1) Each CDS was used as a query in a similarity search against a subject genome sequence using a locally installed version of mpiBLAST and default NCBI BLASTN parameters.
(2) The stretch of sequence from the subject genome with the highest Bit score for each query sequence was retrieved and a homology score (H-value) calculated for each annotated CDS in turn. This homology score was proposed by Fukiya et al. (2004, J. Bacteriol., 186, 3911-3921) and reflects the degree of similarity between the matching subject genome sequence and the CDS itself in terms of the length of match and the degree of identity at the DNA level. For each query, the H-value was calculated as follows:
$H = i \times (l_m / l_q)$, where $i$ is the level of identity of the region with the highest Bit score expressed as a frequency between 0 and 1, $l_m$ the length of the highest scoring matching sequence (including gaps), and $l_q$ the query length. Therefore $H$ belongs to the set $\{H \in \mathbb{R} \mid 0 \le H \le 1\}$.
(3) For each annotated CDS, a threshold value of H was used to determine whether the CDS was classed as 'conserved' or 'strain-specific'. Taking Acinetobacter baumannii strain AYE (RefSeq accession number: NC_010410) as the query (reference) genome, all 3607 annotated AYE CDS were analysed against the two subject genomes, A. baumannii strain SDF (NC_010400) and ATCC 17978 (NC_009085), in turn. If both H-values for a given AYE CDS were < 0.42, the CDS was considered to be 'strain-specific' with respect to SDF and ATCC 17978. The value of 0.42 was chosen as an appropriate threshold based on the markedly bipolar distribution of H-values (see Fig. 1). Threshold values ranging from 0.3 to 0.6 were tested and produced only minor differences in the number of CDS identified as strain-specific (unpublished data, Ou and Rajakumar). In this study, mGenomeSubtractor was written to perform the mpiBLASTN search, parse the resulting report and generate the list of strain-specific CDS. Applying this algorithm, 234 MG1655 CDS were classified as specific with respect to the other two genomes.
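The scoring and classification steps can be illustrated with a minimal Python sketch. It assumes that H is the identity fraction multiplied by the ratio of matched length to query length, and that a CDS is strain-specific only when H falls below 0.42 in every subject genome. The function names, the cap at 1.0 and the toy hit values are illustrative assumptions and are not taken from the mGenomeSubtractor source code.

```python
from typing import Iterable

H_THRESHOLD = 0.42  # threshold quoted in the text


def h_value(identity: float, match_length: int, query_length: int) -> float:
    """Return H = identity * (match_length / query_length).

    identity is a fraction between 0 and 1; match_length is the length of
    the best-scoring alignment (including gaps).
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    # Assumption: cap at 1.0, since a gapped alignment can be slightly
    # longer than the query while H is defined on the interval [0, 1].
    return min(1.0, identity * match_length / query_length)


def is_strain_specific(h_values: Iterable[float],
                       threshold: float = H_THRESHOLD) -> bool:
    """True only if H is below the threshold in every subject genome."""
    return all(h < threshold for h in h_values)


# Hypothetical best hits (identity, match length) for two toy CDS
# against two subject genomes; a missing hit is encoded as (0.0, 0).
query_lengths = {"cds_A": 900, "cds_B": 1200}
best_hits = {
    "cds_A": {"SDF": (0.98, 900), "ATCC17978": (0.99, 891)},
    "cds_B": {"SDF": (0.80, 300), "ATCC17978": (0.0, 0)},
}

for cds, hits in best_hits.items():
    scores = [h_value(i, lm, query_lengths[cds]) for i, lm in hits.values()]
    print(cds, [round(h, 2) for h in scores], is_strain_specific(scores))
```

In this toy input, cds_A matches both subject genomes almost perfectly and is not strain-specific, whereas cds_B has only a short partial hit in one genome and no hit in the other, so both of its H-values fall below the threshold.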
Figure 1. Histograms of H-values for all 3607 annotated CDS in Acinetobacter baumannii strain AYE (NC_010410) against the two subject genomes, A. baumannii strain SDF (NC_010400) and ATCC 17978 (NC_009085).
(II) Investigate genomic mosaicism and examine strain-variable regions of Acinetobacter baumannii
Figure 2(A) Genome map of Acinetobacter baumannii strain AYE with CDS colour-coded by the number of comparator A. baumannii genomes found to harbour a nucleotide sequence-conserved homologue. CDS shown in black (2) are conserved across the two other fully sequenced A. baumannii comparator genomes, A. baumannii strain SDF (NC_010400) and ATCC 17978 (NC_009085), while at the other extreme those shown in white (0) are unique to strain AYE. The red downward arrow indicates the tRNA gene. AYE strain-specific CDS were identified based on an H-value cutoff of less than 0.42. The H-value reflects the degree of similarity, in terms of the length of match and the degree of identity at the nucleotide level, between the matching comparator genome sequence and the CDS examined.
Figure 2(B) An expanded view of the hypervariable region highlighted by a yellow rectangle in (A) corresponding to the 86-kb TnAbaR1 resistance mega-transposons of AYE and its upstream/downstream flanking regions. A. baumannii strain AYE CDS are colour-coded based on their COG assignment. The five comparator genomes are shown below. Grey bars indicate the extent of AYE sequences within this region that are also present in the individual comparator genomes. The black bars at the end represent the ends of the region selected for the expanded view. A G+C profile of the selected AYE region is shown topmost.
Figure 2(C) A zoom-in view of the region of the AYE genome highlighted by the yellow rectangle in (B) showing selected TnAbaR1 CDS coding for an integrase, transposases and antibiotic resistance genes.
Estimate of the species-specific gene pool: the known Pseudomonas aeruginosa species gene pool
mGenomeSubtractor analysis allowed for the ready definition of the Pseudomonas aeruginosa species-specific gene pool based on the seven completely sequenced Pseudomonas aeruginosa chromosomes with the mGenomeSubtractor::Gene Pool tool.
Figure 3 mGenomeSubtractor-based determination of the Pseudomonas aeruginosa species gene pool as represented within seven fully sequenced P. aeruginosa chromosomes. (A) Screenshot of the mGenomeSubtractor::Gene Pool interface used to define input genomes and configuration options. The following completely sequenced chromosomes (NCBI RefSeq accession no.) were used in this analysis: LESB58 (NC_011770), PA7 (NC_009656), UCBPP-PA14 (NC_008463), PAO1 (NC_002516), PACS2, C3719 and 2192. (B) Screenshot of the output page showing a summary table with links to COG classification pie charts and Excel and FASTA files. Tabs shown offer an alternative bar graph output and the YODA probe design utility. (C) – (E) COG classification pie charts for the sets of newly added genes listed in (B) as indicated by the linking arrows. (F) COG classification pie chart for the full complement of de-duplicated genes within the defined P. aeruginosa gene pool. (G) Matching COG classification table for the defined de-duplicated P. aeruginosa gene pool. (H) Histogram showing how the gene pool was built up. As a first step, all the annotated chromosomal CDS of P. aeruginosa LESB58 were added into the gene pool. Next, unique chromosomal CDS (BLASTN-based H-values < 0.42) present in the six other completely sequenced P. aeruginosa chromosomes were added sequentially to the growing gene pool, with each chromosome examined in turn, in the order listed, against the expanding gene pool. Finally, single representative unique CDS were selected from the gene pool by removing any redundant 'duplicated' genes as determined using the H-value criterion (BLASTN-based H-values > 0.81).
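The three-step gene pool procedure described in the caption can be sketched in Python as follows. This is only an illustration under stated assumptions: `toy_h` is a hypothetical stand-in for the BLASTN-based H-value (it simply counts matching positions), the pool is assumed to grow immediately as each unique CDS is found, and all names and sequences are invented for the example.

```python
from typing import Callable, Dict, List

UNIQUE_BELOW = 0.42     # a CDS is added if its best H against the pool is below this
DUPLICATE_ABOVE = 0.81  # pool members more similar than this count as duplicates


def toy_h(a: str, b: str) -> float:
    """Hypothetical stand-in for a BLASTN-based H-value:
    fraction of identical positions over the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def build_gene_pool(genomes: List[Dict[str, str]],
                    h: Callable[[str, str], float] = toy_h) -> Dict[str, str]:
    # Step 1: start with every CDS of the first genome.
    pool: Dict[str, str] = dict(genomes[0])
    # Step 2: examine the remaining genomes in the order given and add
    # each CDS whose best H against the current pool is below the cutoff.
    for genome in genomes[1:]:
        for name, seq in genome.items():
            best = max((h(seq, s) for s in pool.values()), default=0.0)
            if best < UNIQUE_BELOW:
                pool[name] = seq
    # Step 3: keep one representative of each group of near-identical CDS.
    kept: Dict[str, str] = {}
    for name, seq in pool.items():
        if all(h(seq, s) <= DUPLICATE_ABOVE for s in kept.values()):
            kept[name] = seq
    return kept


# Invented toy genomes: the first contains an exact duplicate pair.
genomes = [
    {"g1_a": "ATGAAACCCGGG", "g1_b": "ATGAAACCCGGG"},
    {"g2_a": "ATGAAACCCGGA", "g2_b": "TTTTTTTTTTTT"},
]
pool = build_gene_pool(genomes)
print(sorted(pool))
```

Here g2_a is nearly identical to a pool member and is not added, g2_b is unrelated and is added, and the duplicate g1_b is removed in the final step.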
References
(1) H.Y. Ou, L.L. Chen, J. Lonnen, R.R. Chaudhuri, A.B. Thani, R. Smith, N.J. Garton, J.C. Hinton, M. Pallen, M. Barer and K. Rajakumar (2006). A novel strategy for identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites in closely related bacteria. Nucleic Acids Res., 34, e3. [Abstract]

(2) H.Y. Ou, X. He, E.M. Harrison, B.R. Kulasekara, A.B. Thani, A. Kadioglu, S. Lory, J.C. Hinton, M.R. Barer, Z. Deng and K. Rajakumar (2007). MobilomeFINDER: web-based tools for in silico and experimental discovery of bacterial genomic islands. Nucleic Acids Res., 35, W97-W104. [MobilomeFINDER webserver] [Abstract]

(3) H.Y. Ou, C.T.S. Ju, K.L. Thong, N. Ahmad, Z. Deng, M.R. Barer and K. Rajakumar (2007). Translational Genomics to Develop a Salmonella enterica Serovar Paratyphi A Multiplex PCR Assay. Journal of Molecular Diagnostics, 9, 624-630. [Abstract]
Comments/Questions
Feel free to send comments or questions about mGenomeSubtractor to
Hong-Yu Ou at hyou@sjtu.edu.cn
Laboratory of Molecular Microbiology
School of Life Sciences & Biotechnology
Shanghai Jiaotong University
1954 Huashan Road
Shanghai 200030 P.R. China
Tel: +86 21 62933765
Fax: +86 21 62932418
http://mml.sjtu.edu.cn/
Useful links
Bioinformatics Tools/Databases used by mGenomeSubtractor
  • mpiBLAST, parallel implementation of NCBI BLAST
  • NCBI BLAST, NCBI Basic Local Alignment Search Tool
  • MUSCLE, protein multiple sequence alignment
  • Jalview, a multiple alignment editor written in Java
  • YODA, design specific oligonucleotide probes for DNA sequences for use in microarrays
  • Primer3Plus, pick primers from a DNA sequence
  • CGview, generate circular genome maps

  • ACLAME, A Classification of Genetic Mobile Elements
  • VFDB, Virulence Factor database
  • DEG, Database of essential genes
  • DrugBank, a knowledgebase for drugs, drug actions and drug targets
  • TTD, Therapeutic Target Database
  • ARDB, Antibiotic Resistance Genes Database
Other tools or servers of interest
  • WebACT, Database of sequence comparisons between prokaryotic genome sequences
  • Mauve, Multiple Genome Alignment
  • MUMmer, Ultra-fast alignment of large-scale DNA and protein sequences
  • xBASE, Database for comparative bacterial genomics
  • xBASE Annotation Service, quick annotation for unfinished bacterial genome sequences where a similar reference sequence is available
  • CGview server, a comparative genomics tool for circular genomes

  • MobilomeFINDER, in silico and experimental discovery of bacterial genomic islands
  • IslandViewer, a computational tool that integrates three different genomic island prediction methods
  • SIGI-HMM, Prediction of Genomic Islands in Procaryotic Genomes Using HMMs

  • Islander, Database of Genomic Islands
  • IslandPath, An aid to the identification of genomic islands
  • HGT-DB, Horizontal Gene Transfer Database (HGT-DB)
  • PAIDB, Pathogenicity island database

  • IS Finder, Reference centre for bacterial insertion sequences

  • Primaclade, Web-based application that accepts a multiple species nucleotide alignment file as input and identifies a set of PCR primers that will bind across the alignment
Data file format
A single genome sequence file: FASTA

A single genome sequence file is prepared in FASTA format. It begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. Users are advised to download the *.fna or other required genome files in FASTA format from the NCBI at ftp.ncbi.nih.gov/genome/bacteria or from specified genome sequencing centres.
[Example] The genome sequence file of Acinetobacter baumannii AYE : NC_010410.fna.gz
Note: only the gzip-compressed file is acceptable (.gz).

Note: The current version of mGenomeSubtractor supports a FASTA file storing a single nucleotide sequence of a given bacterial genome as the reference. It is preferable to select a finished and annotated genome as the reference (or query) genome. For a nearly complete genome with multiple contigs from 454 or Solexa de novo assemblies, please refer to the 'pre-processing protocol'. For user-uploaded complete or nearly complete genome sequences serving as subject genomes, a multi-FASTA file containing the contigs is allowed.
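As a rough illustration of the requirements above (a single description line starting with ">", sequence lines shorter than 80 characters, gzip compression), the following Python sketch writes a toy sequence to a .gz file and checks how many records it holds. The file name, description and sequence are invented for the example, and the 70-character line width is an arbitrary choice below the recommended limit.

```python
import gzip
import os
import tempfile


def write_gzipped_fasta(path: str, description: str, sequence: str,
                        width: int = 70) -> None:
    """Write one sequence as gzip-compressed FASTA with wrapped lines."""
    with gzip.open(path, "wt") as handle:
        handle.write(">" + description + "\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


def count_fasta_records(path: str) -> int:
    """Count description lines (those starting with '>') in a .gz FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Hypothetical toy genome written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "my_genome.fna.gz")
write_gzipped_fasta(demo_path, "my_genome toy example", "ACGT" * 50)
print(count_fasta_records(demo_path))
```

A count of exactly one record corresponds to the single-sequence reference file described in the note; a multi-contig draft would give a higher count.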

CDS annotation file: NCBI PTT format
A tabular list of all protein-coding regions (CDS) in the corresponding genome sequence should be prepared in the NCBI PTT format.
The PTT file format is a table of protein features. It is used mainly by NCBI who produce PTT files for all their published genomes found in ftp://ftp.ncbi.nih.gov/genomes/. It has the following format:
Line 1 (optional)
Description of the sequence to which the features belong
e.g. "Leptospira interrogans chromosome II, complete sequence - 0..358943"
It is usually equivalent to the DEFINITION line of a GenBank file,
with the length of the sequence appended. It is unclear why "0" is
used as the start of the range; it should be "1".
Line 2 (optional)
Number of feature lines in the table
e.g. "367 proteins"
Line 3 (*required)
Column headers, tab separated
e.g. "Location Strand Length PID Gene Synonym Code COG Product"
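To make the layout concrete, here is a minimal Python sketch that reads a PTT table along the lines described above. It assumes the column header line starts with "Location" followed by a tab, and that locations are written as "start..end". The sample rows and identifiers are invented for illustration and do not come from a real genome.

```python
import csv
from typing import Dict, List


def parse_ptt(text: str) -> List[Dict[str, object]]:
    """Parse PTT-formatted text into a list of row dictionaries."""
    lines = text.splitlines()
    # The description and count lines are optional, so search for the
    # required column header line instead of assuming it is line 3.
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("Location\t")),
        -1,
    )
    if header_idx < 0:
        raise ValueError("no PTT column header line found")
    reader = csv.DictReader(lines[header_idx:], delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    records: List[Dict[str, object]] = []
    for row in reader:
        record: Dict[str, object] = dict(row)
        start, end = row["Location"].split("..")
        record["Start"], record["End"] = int(start), int(end)
        records.append(record)
    return records


# Invented two-row example following the header quoted in the text.
SAMPLE = (
    "Example chromosome, complete sequence - 1..5000\n"
    "2 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "100..999\t+\t299\t-\tgeneA\tEX_0001\t-\t-\thypothetical protein\n"
    "1200..2000\t-\t266\t-\t-\tEX_0002\t-\t-\thypothetical protein\n"
)

records = parse_ptt(SAMPLE)
for rec in records:
    print(rec["Synonym"], rec["Start"], rec["End"], rec["Strand"])
```

Searching for the header line rather than hard-coding its position keeps the sketch tolerant of files that omit the two optional leading lines.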

[Example] the CDS annotation file of Acinetobacter baumannii AYE : NC_010410.ptt.gz
Note: only the gzip-compressed file is acceptable (.gz).
Query gene sequence(s) in multi-FASTA format
The multi-FASTA format consists of alternating description lines followed by sequence data. It is important that each ">" symbol appear on a new line.
[Example 1] the 86 CDS (ABAYE3552..ABAYE3668) in the 86-kb TnAbaR1 resistance mega-transposons of A. baumannii AYE:
TnAbaR1_gene_seq.fas.gz
Note: only the gzip-compressed file is acceptable (.gz).

[Example 2] the 86 proteins (ABAYE3552..ABAYE3668) encoded by the 86-kb TnAbaR1 resistance mega-transposons of A. baumannii AYE:
TnAbaR1_gene_aa_seq.fas.gz
Note: only the gzip-compressed file is acceptable (.gz).
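A multi-FASTA file of this kind can be read with a small generator such as the Python sketch below, which yields one (description, sequence) pair per ">" line. The record names and sequences in the demonstration are invented, and the strictness about data appearing before the first description line is an assumption of this sketch.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from a multi-FASTA text stream."""
    description: Optional[str] = None
    chunks: List[str] = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Invented two-record example held in memory.
demo = io.StringIO(">gene_1 toy record\nATGAAA\nCCCTAA\n>gene_2 toy record\nATGTTTTAG\n")
records = list(read_fasta(demo))
print(len(records), [len(seq) for _, seq in records])
```

For a gzip-compressed file, the same generator can be applied to a handle opened with `gzip.open(path, "rt")` from the Python standard library.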
All annotated protein-coding genes in the subject genome in multi-FASTA format
The multi-FASTA format consists of alternating description lines followed by sequence data. It is important that each ">" symbol appear on a new line.
[Example I] the 2913 proteins of Acinetobacter baumannii SDF (NC_010400):
NC_010400.faa.gz
Note: only the gzip-compressed file is acceptable (.gz).

[Example II] the 3351 proteins of Acinetobacter baumannii ATCC 17978 (NC_009085):
NC_009085.faa.gz
Note: only the gzip-compressed file is acceptable (.gz).
Simple pre-processing protocol for preparing non-NCBI bacterial genomes (a user-created genome sequence from unfinished genome data, e.g. from 454 or Solexa de novo assemblies)
1. The complete genome assembled as a single nucleotide sequence (finished but unannotated)

2. The nearly complete genome with multiple contigs (draft with unclosed gaps)
1. The complete genome assembled as a single nucleotide sequence (finished but unannotated)
Step 1.1 Annotate the genome automatically and quickly using xBASE Annotation Server with the input of the user's sequence in the FASTA format.

[Example]: Downloaded Pseudomonas aeruginosa C3719 whole genome sequence scaffold from the Broad Institute. Right-click the link to save the FASTA file (with the size of 6.1 Mb) containing the complete sequence of P. aeruginosa C3719 to your local drive.

Step 1.2
Download the xBASE-generated file 'seq.gbk' via the link 'Annotation in GenBank format' in 'Sequence Files', which contains the xBASE-annotated genome in GenBank format (.gbk).
[Example]: Downloaded annotated genome from the xBASE Annotation Server in the .gbk format. Right-click the link to save the .gbk file (with the size of 13.1 Mb) containing the sequence and annotation of P. aeruginosa C3719 to your local drive.

Step 1.3
Convert the .gbk file into a .ptt file using the online GBK2PTT tool. Right-click the link to save the .ptt file (with the size of 0.4 Mb) containing the annotation of P. aeruginosa C3719 to your local drive.


Step 1.4
Upload the single complete sequence (in FASTA format) and the annotation file (in .ptt format) to run mGenomeSubtractor with this genome as the reference, using both the BLASTN and BLASTP approaches.

2. The nearly complete genome with multiple contigs (draft with unclosed gaps)
Step 2.1 Annotate the genome automatically and quickly using xBASE Annotation Server with the input of the user's sequence in the FASTA format.
Step 2.2 Download the xBASE-generated file 'seq.protein.faa' via the link 'Predicted protein sequences in FASTA format' in 'Sequence Files', which contains the entire xBASE-predicted putative proteins in the multi-FASTA format.
Step 2.3 Upload the multi-FASTA file storing all the putative proteins of the reference genome to run mGenomeSubtractor with this genome as the reference, using the BLASTP approach.
Last updated on 18 April 2010.