Sections

Zymo Mock Community Analysis

ATCC Mock Human Communities Analysis

Madison Water Samples Analysis

Computing Environment

R v3.6.1

Bioconductor v3.10
DADA2 v1.14.0
ShortRead v1.44.0
ggplot2 v3.2.1

CentOS6 Linux

Bowtie v2.2.9
samtools v1.3.1
Alfred v0.1.17
Kraken v2.0.8-beta
Bracken v2.5
trimmomatic v0.36
FLASH v1.2.11
cutadapt v1.16

Zymo Mock Community Analysis

The Zymo mock community was analyzed using three methods:

Mapping LoopSeq 16S contigs and Illumina V3V4 short reads to provided references
Mapping LoopSeq 16S contigs to NCBI RefSeq references
Classifying LoopSeq 16S contigs with LCA classifier

Data Preparation

Mock Community Description

The ZymoBIOMICS Microbial Community DNA Standard is composed of ten microbial species: eight bacteria and two fungi, the latter two of which were not amplified by the LoopSeq 16S rRNA primers used. Genomic DNA quantity from the eight species are evenly distributed and vary in Gram stain, genome size, GC content, 16S copy number and phylogenetic relatedness.

Species	16S copies	Expected contig %
Bacillus subtilis	10	17.4
Enterococcus faecalis	4	9.9
Escherichia coli	7	10.1
Lactobacillus fermentum	5	18.4
Listeria monocytogenes	6	14.1
Pseudomonas aeruginosa	4	4.2
Salmonella enterica	7	10.4
Staphylococcus aureus	6	15.5

LoopSeq 16S Contig QC & Filtering

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

zymo <- file.path("zymo.fastq")

#Basic statistics of raw contigs
summary(nchar(getSequences(zymo)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500    1494    1508    1430    1510    1572 

#Distribution of raw contigs
hist(nchar(getSequences(zymo)), 100) 

#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(zymo, paste0("zymo_noprimers.fastq"), primer.fwd=F_prim, primer.rev=rc(R_prim), compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
Read in 38629, output 27123 (70.2%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
summary(nchar(getSequences(zymo_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1426    1464    1472    1474    1474    1509 

#Distribution of filtered contigs (full length 16S, primers trimmed)
zymo_noprimers <- file.path("zymo_noprimers.fastq")
hist(nchar(getSequences(zymo_noprimers)), 10) 

Filtered contigs fastq file

Mapping to Provided References

Zymo replaced five strains in the Zymo standard with similar strains beginning with Lot ZRC190633. As has been noted elsewhere (Callahan Benjamin C., et al. “High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution”, Nucleic Acids Research, 47.18 (2019): e103), the exact strains used are not disclosed and the reference sequences provided by Zymo Research do not exactly correspond to the strains provided. There are two versions of the references: version 1 and version 2. In both versions, there are systematic nucleotide substitutions that cannot be explained by sequencing or contig assembly error. The references include whole genomes and 16S genes only. In version 1 of the references, the whole genome references are in multiple contigs and in the 16S gene references, only one 16S copy is given for each species. Version 2 of references aren’t correct either but is still more correct than version 1. The new version has complete whole genome references and all 16S gene copies are given for each species.

Concatenate all bacterial whole genome fasta files together and build a Bowtie2 index with version 2 of the whole genome references and map the filtered Zymo contigs.

cat *.fasta > zymo_bacteria_genome_ref.fasta

bowtie2-build zymo_bacteria_genome_ref.fasta zymo_bacteria_genome_ref

bowtie2 --very-sensitive -p 40 --un zymo.un.fastq -x zymo_bacteria_genome_ref/zymo_bacteria_genome_ref -q zymo_noprimers.fastq -S zymo.sam
27123 reads; of these:
  27123 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    27123 (100.00%) aligned >1 times
100.00% overall alignment rate 

Many of the 16S genes are highly similar/identical and multiple valid alignments may occur. Even though the provided reference sequences are incorrect, given that the contigs are full-length 16S sequences and the contigs are accurate enough, all eight bacterial species are sufficiently phylogenetically distant that the multiple mappings occur within a single genome reference. It is the case generally that accurate full-length 16S sequences are able to discriminate between species (Johnson, Jethro S., et al. “Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis.” Nature communications 10.1 (2019): 1-11.). There are exceptions (eg. Mycobacterium genus) where inter-species 16S copies are so similar (~99.9%), that with even minor sequence errors, it may be difficult to unambiguously map the sequences to the correct references.

#Extract names of hits and count number of each unique hit
samtools view -F 4 zymo.sam | cut -f3 | sort | uniq -c

BS.pilon.polished.v3.ST170922
Enterococcus_faecalis_complete_genome
Escherichia_coli_chromosome
Lactobacillus_fermentum_complete_genome
Listeria_monocytogenes_complete_genome
Pseudomonas_aeruginosa_complete_genome
Salmonella_enterica_complete_genome
Staphylococcus_aureus_chromosome

Species	Expected contig %	Mapped contig %
Bacillus subtilis	17.4	21.0
Enterococcus faecalis	9.9	9.7
Escherichia coli	10.1	10.9
Lactobacillus fermentum	18.4	11.3
Listeria monocytogenes	14.1	17.4
Pseudomonas aeruginosa	4.2	4.2
Salmonella enterica	10.4	11.2
Staphylococcus aureus	15.5	14.3

Species ratios are as expected.

#R
library(ggplot2)
zymo <- read.table("zymo_exp_vs_map.txt", header=TRUE, sep="\t")
ggplot(zymo, aes(fill=Species, y=Percentage, x=Contigs)) + geom_bar(position="stack", stat="identity")

Intergenomic 16S Gene Copies & Contig Accuracy

B. subtilis has the most 16S gene copies (ten) of the species in the Zymo standard so it can provide a fine-grain view of relative intergenomic 16S gene ratios. It also appears to be one of the correct references for the strain provided (no systematic mismatches between contig and reference) so it would allow for an unbiased assessment of LoopSeq contig accuracy.

Extract the full length 16S contigs that orginally mapped to B. subtilis.

samtools view zymo.sam | cut -f1,3 | awk -F "\t" '{ if($2 == "BS.pilon.polished.v3.ST170922") { print $1}}' > bsubtilis_contigs_list
seqtk subseq zymo_noprimers.fastq bsubtilis_contigs_list > bsubtilis_contigs.fastq

Sequences for each of the ten 16S genes is given in the reference fasta file. Multiple sequence alignment shows that there are five non-redundant copies . Copy #1, #4, #8, #9, #10 are identical. #2, #3 are identical. #5, #6, #7 are all unique. To prevent multi-mapping, build a bowtie2 index only using #1, #2, #5, #6, #7. Thus a mapping ratio of 5:2:1:1:1 is expected.

bowtie2 --very-sensitive -p 40 --un bsubtilis_contigs_trimmed.un.fastq -x zymo_bsubtilis_ref/Bacillus_subtilis_16S_nr -q bsubtilis_contigs.fastq -S bsubtilis.sam
reads; of these:
(100.00%) were unpaired; of these:
(0.00%) aligned 0 times
(0.00%) aligned exactly 1 time
(100.00%) aligned >1 times
00% overall alignment rate #all mapped to 16S copies

#Extract reference hits
samtools view -F 4 bsubtilis.sam | cut -f3 | sort | uniq -c > bsubtilis_hits_counts

Bacillus_subtilis_16S_1
Bacillus_subtilis_16S_2
Bacillus_subtilis_16S_5
Bacillus_subtilis_16S_6
Bacillus_subtilis_16S_7
    

16S Gene Copy	Expected contig %	Mapped contig %
Bacillus_subtilis_16S_1	50	43.3
Bacillus_subtilis_16S_2	20	26
Bacillus_subtilis_16S_5	10	10.6
Bacillus_subtilis_16S_6	10	10.8
Bacillus_subtilis_16S_7	10	9.1

Intergenomic 16S gene ratios are as expected.

zymo_bsubtilis <- read.table("zymo_bsubtilis_16S_copies.txt", header=TRUE, sep="\t")
ggplot(zymo_bsubtilis, aes(fill=Contigs, y=Percentage, x=Gene.Copy)) + 
  geom_bar(position="dodge", stat="identity") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visually inspect the alignments to check if there is anything unusual about the alignments. Sort and index the SAM file and load in IGV.

samtools view -b -S bsubtilis.sam | samtools sort -o bsubtilis.sorted.bam - && samtools index bsubtilis.sorted.bam

Bacillus_subtilis_16S_1

Bacillus_subtilis_16S_2

Bacillus_subtilis_16S_5

Bacillus_subtilis_16S_6

Bacillus_subtilis_16S_7

Quantify the quality of the alignments in Alfred.

alfred qc -r zymo_bsubtilis_ref/Bacillus_subtilis_16S_nr.fasta -o alignment_qc.tsv.gz bsubtilis.sorted.bam
zgrep ^ME alignment_qc.tsv.gz | cut -f 2- | datamash transpose | column -t > zymo_bsubtilis_16S_alignment_qc.tsv

Result file

QC Summary

MatchRate 0.999798
MismatchRate 0.000201824
DeletionRate 2.38704e-06
InsertionRate 2.14834e-06

The error rates are extremely low considering these contigs have only been filtered to cover the full-length of the 16S gene. They have not been quality filtered, denoised or any other type of algorithmic error correction. Indels are negligible. Since consensus sequence accuracy is contigent on read depth, higher per base accuracy and a higher proportion of full-length 16S contigs can be expected with increased sequencing yield. But since there is a limited number of UMI barcodes per 16S fragment, there will be an upper limit to gains in accuracy.

Examine the distribution of mismatch errors in the contigs. Use a custom script to quantify the position and frequencies of variants.

samtools mpileup -B -L 1 -Q 0 -F 0 -f zymo_bsubtilis_ref/Bacillus_subtilis_16S_nr.fasta bsubtilis.sorted.bam > bsubtilis.pileup
pileup2SNVbaseCounts.pl -i bsubtilis.pileup -snvDepth 0 -indelDepth 0

Result file

The five non-redundant copies of the 16S gene are the same length, so they can be overlapped on the same plot. Only base mismatches are shown since there are practically no indels to visualize.

#R
bsubtilis_variants <- read.csv("zymo_bsubtilis_mpileup_baseCount.csv",header=TRUE)
ggplot(data=bsubtilis_variants,aes(x=bsubtilis_variants$Position,fill=Reference,y=bsubtilis_variants$Variant_Ratio)) + geom_bar(stat="identity")

Mismatch errors are randomly distributed. The mismatches at the two peaks (gene copy #2 at pos: 285 [G->A] and gene copy #1 pos: 268 [G->A]) likely represent real variants since they are too high to be explained by sequencing or assembly error: the sequencing library consists of short randomly fragmented segments of the 16S gene. Consequently, the true mismatch error rate may be even lower: the bases in the two peaks combined accounts for 454/1691 total mismatch errors.

Comparison with Illumina Short Paired-End Reads

Short read 16S barcoding studies frequently sequence the V3V4 hypervariable region in 300bp paired-end mode. These paired-end reads are then filtered, trimmed and merged into ~440bp fragments which are then used for denoising and/or OTU clustering and classification. To make a fair comparison with LoopSeq contigs, the short reads are processed in a way as might be seen in a typical 16S barcoding study. The denoising step is omitted because it would potentially produce error-free reads; as would be the case if LoopSeq contigs were to undergo denoising. This would not permit a meaningful comparison of accuracies between the two methodologies.

The fastq files are too large to host on Github so have not been provided.

#Check how many pairs of reads in raw fastq file
echo $(zcat zymo-v3v4_R1.fastq.gz|wc -l)/4|bc
3278936

#Map against the PhiX reference to remove spike-in sequences
bowtie2 -p 24 --un-conc zymo-v3v4.fastq -x phiX/phiX -1 zymo-v3v4_R1.fastq.gz -2 zymo-v3v4_R2.fastq.gz -S zymo-v3v4_phiX.sam

#Filter out reads with low average quality and clip low quality bases in the tails
java -jar trimmomatic-0.36.jar PE zymo-v3v4.1.fastq.gz zymo-v3v4.2.fastq.gz zymo-3V4_clean_1.fastq.gz zymo-v3v4_unclean_1.fastq.gz zymo-v3v4_clean_2.fastq.gz zymo-v3v4_unclean_2.fastq.gz AVGQUAL:20
java -jar trimmomatic-0.36.jar PE zymo-v3v4_clean_1.fastq.gz zymo-v3v4_clean_2.fastq.gz zymo-v3v4_good_1.fastq.gz zymo-v3v4_bad_1.fastq.gz zymo-v3v4_good_2.fastq.gz zymo-v3v4_bad_2.fastq.gz TRAILING:20 

#Merge reads
flash -t 24 -m 20 -M 250 -d merged -o zymo-v3v4 zymo-v3v4_good_1.fastq.gz zymo-v3v4_good_2.fastq.gz 

#Trim V3V4 primers. 3’ primer sequence needs to be given in the reverse complement.
cutadapt -e 0.05 -g CCTACGGGNGGCWGCAG -a GGATTAGATACCCBDGTAGT -o zymo-v3v4.noprimers.fastq zymo-v3v4.extendedFrags.fastq --discard-untrimmed --no-indels

Check merged read distribution.

#R
zymo_v3V4_noprimers <- file.path("zymo-v3v4.noprimers.fastq")
summary(nchar(getSequences(zymo_v3V4_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   17.0   444.0   444.0   444.1   444.0   574.0 
   
#Some merged reads are too short/long to be correct so these need to be filtered out
fastqFilter(zymo_v3V4_noprimers, "zymo-v3v4_noprimers_length.fastq", minLen=440, maxLen=450, verbose=TRUE, compress=FALSE)
Read in 2761523, output 2239559 (81.1%) filtered sequences.

#Map to Zymo whole genome bacteria references
bowtie2 --very-sensitive -p 40 --un zymo-v3v4.un.fastq -x zymo_bacteria_genome_ref/zymo_bacteria_genome_ref -q zymo-v3v4_noprimers_length.fastq -S zymo-v3v4.sam
2239559 reads; of these:
  2239559 (100.00%) were unpaired; of these:
    2051 (0.09%) aligned 0 times
    51 (0.00%) aligned exactly 1 time
    2237457 (99.91%) aligned >1 times
99.91% overall alignment rate

#Extract names of hits and count number of each unique hit
samtools view -F 4 zymo-v3v4.sam | cut -f3 | sort | uniq -c
 348730 BS.pilon.polished.v3.ST170922
 202215 Enterococcus_faecalis_complete_genome
 266642 Escherichia_coli_chromosome
 351894 Lactobacillus_fermentum_complete_genome
 282619 Listeria_monocytogenes_complete_genome
 177842 Pseudomonas_aeruginosa_complete_genome
 268401 Salmonella_enterica_complete_genome
 339165 Staphylococcus_aureus_chromosome

Species	Expected %	Mapped %
Bacillus subtilis	17.4	15.6
Enterococcus faecalis	9.9	9.0
Escherichia coli	10.1	11.9
Lactobacillus fermentum	18.4	15.7
Listeria monocytogenes	14.1	12.6
Pseudomonas aeruginosa	4.2	8.0
Salmonella enterica	10.4	12.0
Staphylococcus aureus	15.5	15.2

The species ratios are as expected. It does not appear that short read methodologies significanlty skew the relative abundances of the species examined.

To compare read accuracies with the LoopSeq contigs, extract the merged reads that mapped to B. subtilis and map the reads to the five non-redundant B. subtilis 16S gene copy references.

samtools view zymo-v3v4.sam | cut -f1,3 | awk -F "\t" '{ if($2 == "BS.pilon.polished.v3.ST170922") { print $1}}' > bsubtilis-v3v4_reads_list
seqtk subseq zymo-v3v4.noprimers.fastq bsubtilis-v3v4_reads_list > bsubtilis-v3v4_reads.fastq
bowtie2 --very-sensitive -p 40 --un bsubtilis-v3v4_reads.un.fastq -x zymo_bsubtilis_ref/Bacillus_subtilis_16S_nr -q bsubtilis-v3v4_reads.fastq -S bsubtilis-v3v4_reads.sam
348730 reads; of these:
  348730 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    4 (0.00%) aligned exactly 1 time
    348726 (100.00%) aligned >1 times
100.00% overall alignment rate

#Extract names of hits and count number of each unique hit
samtools view -F 4 bsubtilis-v3v4_reads.sam | cut -f3 | sort | uniq -c
  78053 Bacillus_subtilis_16S_1
  77557 Bacillus_subtilis_16S_2
  38240 Bacillus_subtilis_16S_5
  77371 Bacillus_subtilis_16S_6
  77509 Bacillus_subtilis_16S_7

16S Gene Copy	Expected contig %	Mapped reads %
Bacillus_subtilis_16S_1	50	22.4
Bacillus_subtilis_16S_2	20	22.3
Bacillus_subtilis_16S_5	10	11.0
Bacillus_subtilis_16S_6	10	22.2
Bacillus_subtilis_16S_7	10	22.2

Many of the nucleotide differences between the five non-redundant B. subtilis 16S copies are not within the V3V4 region (pos: 348-792). Within the V3V4 region of B. subtilis, only 16S copy #5 is different (pos: 373 C->A) while copies #1, #2, #6 and #7 are all identical in the V3V4 region. Reads corresponding to these identical sequences would map equally well to any of the 16S copies so are randomly placed, resulting in the approximately equal ratios observed for copy #1, #2, #6 and #7. This shows that sequencing short hypervariable regions may not permit an accurate assessment of intergenomic 16S gene ratios. This limits the use of short reads in inferring bacterial subtypes which may be distinguished on the basis of their intergenomic 16S copies. In general, the use of short hypervariable regions (V1-V2, V1-V3, V4, etc) has limited utility in identifying prokaryotes to the species level (Johnson, Jethro S., et al. “Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis.” Nature Communications 10.1 (2019): 1-11.).

Quantify the quality of the alignments in Alfred.

samtools view -b -S bsubtilis-v3v4_reads.sam | samtools sort -o bsubtilis-v3v4_reads.sorted.bam - && samtools index bsubtilis-v3v4_reads.sorted.bam
alfred qc -r zymo_bsubtilis_ref/Bacillus_subtilis_16S_nr.fasta -o bsubtilis-v3v4_alignment_qc.tsv.gz bsubtilis-v3v4_reads.sorted.bam
zgrep ^ME bsubtilis-v3v4_alignment_qc.tsv.gz | cut -f 2- | datamash transpose | column -t > zymo_bsubtilis-v3v4_alignment_qc.tsv

Result file

QC Summary

MatchRate 0.992513
MismatchRate 0.00748692
DeletionRate 8.36867e-05
InsertionRate 3.11562e-05

The mismatch, insertion and deletion error rates for short reads are at least an order of magnitude larger than the LoopSeq contigs. Optimized experimental protocols and stricter quality filtering criteria may result in reads with even lower mean error rates. There are also denoising algorithms that can correct PCR and sequencing errors, but such corrected reads would not accurately reflect the underlying error profile. Conversely, not performaing any quality filtering would result in reads with a higher error rate, thus not meeting basic quality standards for analysis. The filtering utilized above simply represents a typical pre-processing workflow which allows us to make a fair comparison with LoopSeq contigs which have only been filtered to full-length.

Mapping to NCBI RefSeq References

The eight bacterial species in the Zymo mock community are sufficiently phylogenetically distant that if given reads/contigs are long and accurate enough, reads/contigs are unlikely to me errorneously mapped to the wrong species. However, it is not always the case that species within a sample are known and/or reference sequences for the species are available. It it more often the case that samples with unknown composition are compared to a more general database. To determine if full-length 16S sequences can be used to accurately identify the correct species using a general database, the full-length LoopSeq 16S Zymo contigs were mapped to all bacterial genomes in the NCBI RefSeq CG (complete genomes) database as of October 2019. The database contains genomes of the same species as those in the Zymo sample (although it is unclear if the strains are the same since this information is lacking). There are also many genomes of closely related species (species from the same genus) that could potentially counfound a direct mapping approach to species identification.

Bacterial genomes were download from NCBI RefSeq using Kraken2. A Bowtie2 index was built (this takes some time) and full-length LoopSeq 16S contigs were mapped to the database.

bowtie2 --very-sensitive -p 40 --un zymo_refseq.un.fastq -x bt2_refseq_bacteria_genomes/refseq_bacteria -q zymo_noprimers.fastq -S zymo_refseq.sam
27123 reads; of these:
  27123 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (0.00%) aligned exactly 1 time
    27122 (100.00%) aligned >1 times
100.00% overall alignment rate

#Extract reference hits
samtools view -F 4 zymo_refseq.sam | cut -f3 > zymo_refseq_hits

The bacterial references and taxonomy files were downloaded with Kraken2. The header does not contain the original species names. It does have taxids which can be converted and reformatted using taxonkit.

#Cut out the taxids from the header and convert to a 7-rank taxonomy
cat zymo_refseq_hits | cut -f2 -d"|" \
| taxonkit lineage --data-dir Kraken2/standard_DB/taxonomy \
| taxonkit reformat --data-dir Kraken2/standard_DB/taxonomy \
| cut -f3 | sort | uniq -c
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus halotolerans
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus sp. JS
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus sp. SJ-10
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus subtilis
Bacteria;Firmicutes;Bacilli;Bacillales;Listeriaceae;Listeria;Listeria monocytogenes
Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus argenteus
Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus
Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis
Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus fermentum
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella aerogenes
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri
Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa
Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas fluorescens
Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas sp. AK6U

Species	Expected %	Mapped %
Bacillus subtilis	17.4	21.0
Enterococcus faecalis	9.9	9.7
Escherichia coli	10.1	10.9
Lactobacillus fermentum	18.4	11.3
Listeria monocytogenes	14.1	17.4
Pseudomonas aeruginosa	4.2	4.2
Salmonella enterica	10.4	11.2
Staphylococcus aureus	15.5	14.0
Other species	0	0.4

#R
library(ggplot2)
zymo_refseq <- read.table("zymo_refseq_exp_vs_map.txt", header=TRUE, sep="\t")
ggplot(zymo_refseq, aes(fill=Species, y=Percentage, x=Contigs, group=customOrder)) + geom_bar(position="stack", stat="identity")

Only 0.4% of contigs were mapped to incorrect species. With the exception of a single contig that mapped to a different genus (Klebsiella), the incorrect mappings were consistent with the genuses in the Zymo sample. This demonstrates that it is possible to accurately classify bacteria to the species level using full-length LoopSeq 16S contigs against a general whole genome database. This also demonstrates that even minor differences may produce incorrect species mappings.

Classification with LCA Classifier

Although the direct mapping approach may provide higher classification precision when references are available, it is frequently the case that exact references are not availble. One may nontheless still be interested in classifying the sequence to higher taxonomic levels (genus, family etc.). For such cases, lowest common ancestor (LCA) classifiers may be utilized. Commonly used 16S classification tools include mothur and QIIME used in conjunction with the SILVA, GreenGenes or RDP databases. The databases may contain 16S sequences from many representative species but the full intergenomic 16S diversity is usually not represented. Thus, it may be more advantageous to use whole genome references (which contain all the 16S copies) when they are available. The obvious disadvantage is that fewer species are represented as compared to 16S databases.

Kraken2 is a LCA classifer that searches for exact alignment of k-mers against an index built from whole genome references. The database used is the standard Kraken2 database (NCBI RefSeq CG: bacteria, archaea, virus, human, common cloning vectors).

kraken2 --db Kraken2/standard_DB --threads 24 --out zymo.kraken --report zymo.kreport zymo_noprimers.fastq

The result columns are interpreted as follows:

Percentage of fragments covered by the clade rooted at this taxon
Number of fragments covered by the clade rooted at this taxon
Number of fragments assigned directly to this taxon
A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., “G2” is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
NCBI taxonomic ID number
Indented scientific name

00	27123	0	R	1	root
00	27123	0	R1	131567	  cellular organisms
00	27123	0	D	2	    Bacteria
66	19978	0	D1	1783272	      Terrabacteria group
66	19978	0	P	1239	        Firmicutes
66	19978	0	C	91061	          Bacilli
66	14284	0	O	1385	            Bacillales
98	5691	0	F	186817	              Bacillaceae
98	5691	21	G	1386	                Bacillus
86	5658	3877	G1	653685	                  Bacillus subtilis group
51	1767	1759	S	1423	                    Bacillus subtilis
03	7	4	S1	135461	                      Bacillus subtilis subsp. subtilis
00	1	1	S2	224308	                        Bacillus subtilis subsp. subtilis str. 168
00	1	1	S2	1052588	                        Bacillus subtilis subsp. subtilis str. RO-NN-1
00	1	1	S2	1302650	                        Bacillus subtilis subsp. subtilis str. BAB-1
00	1	0	S1	96241	                      Bacillus subtilis subsp. spizizenii
00	1	1	S2	1052585	                        Bacillus subtilis subsp. spizizenii TU-B-10
04	11	0	G2	1938374	                    Bacillus amyloliquefaciens group
03	8	8	S	492670	                      Bacillus velezensis
01	3	2	S	1390	                      Bacillus amyloliquefaciens
00	1	1	S1	1034836	                        Bacillus amyloliquefaciens XH7
01	2	2	S	1402	                    Bacillus licheniformis
00	1	1	S	119858	                    Bacillus sonorensis
02	5	1	G1	86661	                  Bacillus cereus group
01	3	1	S	1396	                    Bacillus cereus
00	1	1	S1	1454382	                      Bacillus cereus D17
00	1	0	S1	1179100	                      Bacillus cereus biovar anthracis
00	1	1	S2	637380	                        Bacillus cereus biovar anthracis str. CI
00	1	0	S	1392	                    Bacillus anthracis
00	1	1	S1	673518	                      Bacillus anthracis str. A16R
01	3	3	S	152268	                  Bacillus litoralis
01	2	0	S	1404	                  Bacillus megaterium
01	2	2	S1	592022	                    Bacillus megaterium DSM 319
01	2	0	S	665099	                  Bacillus oceanisediminis
01	2	2	S1	1196031	                    Bacillus oceanisediminis 2691
41	4722	0	F	186820	              Listeriaceae
41	4722	0	G	1637	                Listeria
41	4722	4722	S	1639	                  Listeria monocytogenes 
27	3871	0	F	90964	              Staphylococcaceae
27	3871	2725	G	1279	                Staphylococcus
14	1122	1115	S	1280	                  Staphylococcus aureus
03	7	1	S1	46170	                    Staphylococcus aureus subsp. aureus
02	5	5	S2	985006	                      Staphylococcus aureus subsp. aureus LGA251
00	1	1	S2	548473	                      Staphylococcus aureus subsp. aureus TCH60
01	4	4	S	1295	                  Staphylococcus schleiferi
01	3	3	S	29379	                  Staphylococcus auricularis
01	3	3	S	28035	                  Staphylococcus lugdunensis
01	2	2	S	1282	                  Staphylococcus epidermidis
01	2	2	S	29384	                  Staphylococcus kloosii
00	1	1	S	70258	                  Staphylococcus piscifermentans
00	1	1	S	70255	                  Staphylococcus condimenti
00	1	1	S	246432	                  Staphylococcus equorum
00	1	0	S	29388	                  Staphylococcus capitis
00	1	1	S1	72758	                    Staphylococcus capitis subsp. capitis
00	1	1	S	29385	                  Staphylococcus saprophyticus
00	1	1	S	283734	                  Staphylococcus pseudintermedius
00	1	1	S	308354	                  Staphylococcus simiae
00	1	1	S	1286	                  Staphylococcus simulans
00	1	1	S	1283	                  Staphylococcus haemolyticus
00	1	1	S	2044912	                  Staphylococcus sp. SDB 2975
99	5694	0	O	186826	            Lactobacillales
30	3066	0	F	33958	              Lactobacillaceae
30	3066	0	G	1578	                Lactobacillus
30	3066	3064	S	1613	                  Lactobacillus fermentum
01	2	2	S1	767453	                    Lactobacillus fermentum F-6
69	2628	0	F	81852	              Enterococcaceae
69	2628	0	G	1350	                Enterococcus
69	2628	2628	S	1351	                  Enterococcus faecalis
34	7145	0	P	1224	      Proteobacteria
34	7145	0	C	1236	        Gammaproteobacteria
10	5994	1	O	91347	          Enterobacterales
10	5993	2038	F	543	            Enterobacteriaceae
19	3034	0	G	590	              Salmonella
19	3034	3013	S	28901	                Salmonella enterica
08	21	9	S1	59201	                  Salmonella enterica subsp. enterica
01	3	1	S2	149539	                    Salmonella enterica subsp. enterica serovar Enteritidis
01	2	2	S3	1244111	                      Salmonella enterica subsp. enterica serovar Enteritidis str. EC20110357
01	2	0	S2	57046	                    Salmonella enterica subsp. enterica serovar Paratyphi C
01	2	2	S3	476213	                      Salmonella enterica subsp. enterica serovar Paratyphi C str. RKS4594
00	1	1	S2	90370	                    Salmonella enterica subsp. enterica serovar Typhi
00	1	0	S2	90371	                    Salmonella enterica subsp. enterica serovar Typhimurium
00	1	1	S3	1008297	                      Salmonella enterica subsp. enterica serovar Typhimurium str. 798
00	1	1	S2	98360	                    Salmonella enterica subsp. enterica serovar Dublin
00	1	0	S2	108619	                    Salmonella enterica subsp. enterica serovar Newport
00	1	1	S3	1454618	                      Salmonella enterica subsp. enterica serovar Newport str. USDA-ARS-USMARC-1925
00	1	1	S2	605	                    Salmonella enterica subsp. enterica serovar Pullorum
00	1	0	S2	58712	                    Salmonella enterica subsp. enterica serovar Anatum
00	1	1	S3	1454592	                      Salmonella enterica subsp. enterica serovar Anatum str. CDC 06-0532
00	1	1	S2	29474	                    Salmonella enterica subsp. enterica serovar California
33	903	0	G	561	              Escherichia
33	902	899	S	562	                Escherichia coli
01	2	0	S1	83333	                  Escherichia coli K-12
01	2	2	S2	1110693	                    Escherichia coli str. K-12 substr. MDS42
00	1	1	S1	2048777	                  Escherichia coli O15:H11
00	1	1	S	1499973	                Escherichia marmotae
04	10	3	G	620	              Shigella
03	7	3	S	623	                Shigella flexneri
01	3	0	S1	424718	                  Shigella flexneri 5a
01	3	3	S2	1086030	                    Shigella flexneri 5a str. M90T
00	1	1	S1	42897	                  Shigella flexneri 2a
01	4	0	G	570	              Klebsiella
01	4	4	S	573	                Klebsiella pneumoniae
01	3	0	G	547	              Enterobacter
01	2	0	G1	354276	                Enterobacter cloacae complex
00	1	1	S	550	                  Enterobacter cloacae
00	1	1	S	1915310	                  Enterobacter cloacae complex sp. ECNIH7
00	1	1	S	881260	                Enterobacter bugandensis
00	1	0	F1	191675	              unclassified Enterobacteriaceae
00	1	0	F2	84563	                ant, tsetse, mealybug, aphid, etc. endosymbionts
00	1	0	F3	146507	                  aphid secondary symbionts
00	1	1	S	1199245	                    secondary endosymbiont of Ctenarytaina eucalypti
24	1151	0	O	72274	          Pseudomonadales
24	1151	0	F	135621	            Pseudomonadaceae
24	1151	1143	G	286	              Pseudomonas
01	2	0	S	101564	                Pseudomonas alcaliphila
01	2	2	S1	741155	                  Pseudomonas alcaliphila JAB1
01	2	0	G1	136842	                Pseudomonas chlororaphis group
01	2	2	S	296	                  Pseudomonas fragi
00	1	1	S	1981174	                Pseudomonas sp. M30-35
00	1	1	S	1302376	                Candidatus Pseudomonas adelgestsugas
00	1	0	G1	136841	                Pseudomonas aeruginosa group
00	1	1	S	287	                  Pseudomonas aeruginosa
00	1	0	G1	136843	                Pseudomonas fluorescens group
00	1	1	S	47883	                  Pseudomonas synxantha

Most contigs were correctly assigned to the species level. The exception was P. aureginosa, for which most contigs could only be assigned to the genus level. However, the percentage of contigs (4.24%) that were classified to the Pseudomonas genus does conform to expectations. This lack of species resolution for P. aureginosa is due to characteristics of the Pseudomonas genus in general and not a particular shortcoming of LoopSeq contigs. A previous study (Gomila, Margarita, et al. “Phylogenomics and systematics in Pseudomonas.” Frontiers in Microbiology 6 (2015): 214.) found that:

In conclusion, because the resolution of the 16S rRNA tree was not sufficient to differentiate 63 genomes from other closely related Pseudomonas species, the classification of these bacteria should follow the phylogeny of the housekeeping genes until the whole genome sequence of the type strains of all Pseudomonas species is known.

Using Bracken, the companion tool to Kraken that redistributes all the classified reads to a single taxonomic level, it is apparent that at the genus level, the ratios of all eight species are as expected.

bracken -d Kraken2/standard_DB -i zymo.kreport -o zymo_genus.bracken -r 1500 -l G -t 1

Result file

Genus	Expected %	Classified %
Bacillus	17.4	21.0
Enterococcus	9.9	9.7
Escherichia	10.1	10.0
Lactobacillus	18.4	11.3
Listeria	14.1	17.4
Pseudomonas	4.2	4.2
Salmonella	10.4	11.9
Staphylococcus	15.5	14.3
Other species	0	0.3

Redistributing contigs to the species level results in a higher proportion of false positives. Of the expected 4.2% P. aureginosa species, approximately 3/4 were classied to other Pseudomonas species.

bracken -d Kraken2/standard_DB -i zymo.kreport -o zymo_species.bracken -r 1500 -l S -t 1

Result file

Species	Expected %	Classified %
Bacillus subtilis	17.4	19.0
Enterococcus faecalis	9.9	9.7
Escherichia coli	10.1	10.0
Lactobacillus fermentum	18.4	11.3
Listeria monocytogenes	14.1	17.4
Salmonella enterica	10.4	11.9
Staphylococcus aureus	15.5	13.8
Pseudomonas aeruginosa	4.2	1.0
Other Pseudomonas spp.	0	3.3
Other species	0	2.6

ATCC Mock Human Communities Analysis

The two ATCC mock human communities were analyzed using two methods:

Mapping LoopSeq 16S contigs to whole genome strain references
Strain detection using LCA classifier

Data Preparation

Mock Community Description

Although the Zymo mock community is widely used and is arguably the common standard in microbiome studies, many of the references that are provided do not match the strains in the sample. Zymo Research also does not offer the extensive supporting materials that the American Type Culture Association (ATCC) offers through their mock community samples (BEI Resources, USA). Many of the species in the ATCC mock communities have exact strain genome references along with assembly details and genome annotations. Where exact reference strains are not provided, strains names allows researchers to look for identical or similar references in common genome repositories (eg. NCBI RefSeq CG database).

Two ATCC mock communities representing human-derived samples were analyzed. These two mock communities demonstrates the utility of applying LoopSeq for human site-specific studies. The provided exact strain names also allows for a more fine-grained assessment of the level of taxonomic resolution that can be achieved through accurate full-length 16S sequences. Two mock communities were analyzed: ATCC-gut and ATCC-oral. Six out of the 12 genome references are provided by the ATCC for ATCC-gut, while the remaining six have equivalent NCBI RefSeq assemblies. All six of genome references are provided by the ATCC for ATCC-oral.

Gut Microbiome Genomic Mix (MSA-1006) (“ATCC-gut”)

Species	Strain	Genomic DNA %	ATCC genome reference	Reference	RefSeq assembly	RefSeq assembly strain	16S copy information	16S copies
Bacteriodes fragilis	ATCC 25285	8.3	No	link	Yes	NCTC 9343 (ATCC 25285)	Yes	6
Bacteroides vulgatus	ATCC 8482	8.3	No	link	Yes	ATCC 8482	Yes	7
Clostridioides difficile	ATCC 9689	8.3	No	link	Yes	ATCC 9689	Yes	12
Enterobacter cloacae	ATCC 13047	8.3	No	link	Yes	ATCC 13047	link	8
Escherichia coli	ATCC 700926	8.3	No	link	Yes	MG1655 (ATCC 700926)	link	7
Salmonella enterica subsp. enterica	ATCC 9150	8.3	No	link	Yes	ATCC 9150	Yes	7
Bifidobacterium adolescentis	ATCC 15703	8.3	Yes	link				5
Enterococcus faecalis	ATCC 700802	8.3	Yes	link				4
Lactobacillus plantarum	ATCC BAA-793	8.3	Yes	link				5
Helicobacter pylori	ATCC 700392	8.3	Yes	link				2
Yersinia enterocolitica	ATCC 27729	8.3	Yes	link				7
Fusobacterium nucleatum subsp. nucleatum	ATCC 25586	8.3	Yes	link				5

Oral Microbiome Genomic Mix (MSA-1004) (“ATCC-oral”)

Species	Strain	Genomic DNA %	ATCC genome reference	Reference	16S copies
Schaalia odontolytica	ATCC 17982	16.7	Yes	link	3
Prevotella melaninogenica	ATCC 25845	16.7	Yes	link	4
Fusobacterium nucleatum subsp. nucleatum	ATCC 25586	16.7	Yes	link	5
Streptococcus mitis	ATCC 49456	16.7	Yes	link	4
Veillonella parvula	ATCC 17745	16.7	Yes	link	4
Haemophilus parainfluenzae	ATCC 33392	16.7	Yes	link	6

LoopSeq 16S Contig QC & Filtering

ATCC-gut

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

atcc_gut <- file.path("atcc_gut.fastq")

#Basic statistics of raw contigs
summary(nchar(getSequences(atcc_gut)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    259    1436    1462    1376    1489    2242 


#Distribution of raw contigs
hist(nchar(getSequences(atcc_gut)), 100) 

#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(atcc_gut, "atcc_gut_noprimers.fastq", primer.fwd=F_prim, primer.rev=rc(R_prim),  compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.
86 sequences out of 20027 are being reverse-complemented.
Read in 20027, output 9087 (45.4%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
atcc_gut_noprimers <- file.path("atcc_gut_noprimers.fastq")
summary(nchar(getSequences(atcc_gut_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1416    1426    1451    1451    1464    1585 

#Distribution of filtered contigs (full length 16S, primers trimmed)
hist(nchar(getSequences(atcc_gut_noprimers)), 10) 

Filtered contigs fastq file

ATCC-oral

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

atcc_oral <- file.path("atcc_oral.fastq")

#Basic statistics of raw contigs
summary(nchar(getSequences(atcc_oral)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    205    1444    1478    1372    1498    2255   
    
#Distribution of raw contigs
hist(nchar(getSequences(atcc_oral)), 100) 

#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(atcc_oral, "atcc_oral_noprimers.fastq"), primer.fwd=F_prim, primer.rev=rc(R_prim),  compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
193 sequences out of 33222 are being reverse-complemented.
Read in 33222, output 16392 (49.3%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
atcc_oral_noprimers <- file.path("atcc_oral_noprimers.fastq")
summary(nchar(getSequences(atcc_oral_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    300    1442    1462    1460    1469    2219 

#The max contig length after primer trimming is 2219 bp. 16S genes shouldn't be that long. There are also contigs that are too short. These may represent experimental or assembly errors. These are filtered out.
filterAndTrim(atcc_oral_noprimers, "atcc_oral_noprimers_lengthfilter.fastq", minLen=1400, maxLen=1600, verbose=TRUE, compress=FALSE)
Read in 16392, output 16173 (98.7%) filtered sequences.

atcc_oral_noprimers_lengthfilter <- file.path("atcc_oral_noprimers_lengthfilter.fastq")
summary(nchar(getSequences(atcc_oral_noprimers_lengthfilter)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1409    1442    1462    1460    1469    1595 
   
#Distribution of filtered contigs (full length 16S, primers trimmed, length filtered)
hist(nchar(getSequences(atcc_oral_noprimers_lengthfilter)), 10) 

Filtered contigs fastq file

Mapping to Whole Genome References

ATCC-gut

Concatenate all bacterial whole genome fasta files together and build a Bowtie2 index with the whole genome references and map the filtered ATCC-gut contigs.

ATCC-gut genome references

bowtie2 --very-sensitive -p 40 --un atcc_gut.un.fastq -x atcc_gut_ref/atcc_gut_ref -q atcc_gut_noprimers.fastq -S atcc_gut.sam
reads; of these:
(100.00%) were unpaired; of these:
(0.03%) aligned 0 times
(0.00%) aligned exactly 1 time
(99.97%) aligned >1 times
97% overall alignment rate

#Extract names of hits and count number of each unique hit
samtools view -F 4 atcc_gut.sam | cut -f3 | sort | uniq -c 
Bacteroides_fragilis_NCTC_9343
Bacteroides_vulgatus_ATCC_8482
bifidobacterium_adolescentis_atcc_15703_contig1
bifidobacterium_adolescentis_atcc_15703_contig2
Clostridioides_difficile_ATCC_9689
Enterobacter_cloacae_ATCC_13047
enterococcus_faecalis_atcc_700802_contig1
Escherichia_coli_MG1655
fusobacterium_nucleatum_subsp_nucleatum_atcc_25586
helicobacter_pylori_atcc_700392
lactobacillus_plantarum_atcc_baa-793_contig1
Salmonella_enterica_ATCC_9150
yersinia_enterocolitica_atcc_27729_contig1

Species	Expected contig %	Mapped contig %
Bacteriodes fragilis	8.0	11.7
Bacteroides vulgatus	9.3	10.7
Clostridioides difficile	16.0	21.6
Enterobacter cloacae	10.7	7.6
Escherichia coli	9.3	8.1
Salmonella enterica subsp. enterica	9.3	4.3
Bifidobacterium adolescentis	6.7	0.3
Enterococcus faecalis	5.3	6.4
Lactobacillus plantarum	6.7	6.1
Helicobacter pylori	2.7	5.8
Yersinia enterocolitica	9.3	7.4
Fusobacterium nucleatum subsp. nucleatum	6.7	9.8

The proportion of species 16S ratios are converted from 16S copy numbers. All 12 species are observed in approximately the expected ratios, except B. adolescentis which is much lower than expected.

ATCC-oral

Concatenate all bacterial whole genome fasta files together and build a Bowtie2 index with the whole genome references and map the filtered ATCC-oral contigs.

ATCC-oral genome references

bowtie2 --very-sensitive -p 40 --un atcc_oral.un.fastq -x atcc_oral_ref/atcc_oral_ref -q atcc_oral_noprimers_lengthfilter.fastq -S atcc_oral.sam
reads; of these:
(100.00%) were unpaired; of these:
(0.01%) aligned 0 times
(0.00%) aligned exactly 1 time
(99.99%) aligned >1 times
99% overall alignment rate

#Extract names of hits and count number of each unique hit
samtools view -F 4 atcc_oral.sam | cut -f3 | sort | uniq -c 
fusobacterium_nucleatum_subsp_nucleatum_atcc_25586
haemophilus_parainfluenzae_atcc_33392_contig1
prevotella_melaninogenica_atcc_25845_contig1
prevotella_melaninogenica_atcc_25845_contig2
schaalia_odontolytica_atcc_17982
streptococcus_mitis_atcc_49456
veillonella_parvula_atcc_17745

Species	Expected contig %	Mapped contig %
Schaalia odontolytica	11.5	3.9
Prevotella melaninogenica	15.4	13.7
Fusobacterium nucleatum subsp. nucleatum	19.2	28.1
Streptococcus mitis	15.4	15.4
Veillonella parvula	15.4	13.4
Haemophilus parainfluenzae	23.1	25.6

Strain-level Detection Using LCA Classifier

ATCC-gut

kraken2 --db Kraken2/standard_DB --threads 24 --out atcc_gut.kraken --report atcc_gut.kreport atcc_gut_noprimers.fastq 
Loading database information... done.
9087 sequences (13.19 Mbp) processed in 0.384s (1418.5 Kseq/m, 2058.48 Mbp/m).
  9087 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

00	9087	0	R	1	root
00	9087	0	R1	131567	  cellular organisms
00	9087	0	D	2	    Bacteria
52	3137	0	D1	1783272	      Terrabacteria group
18	3106	0	P	1239	        Firmicutes
65	1967	0	C	186801	          Clostridia
65	1967	0	O	186802	            Clostridiales
64	1966	0	F	186804	              Peptostreptococcaceae
64	1966	0	G	1870884	                Clostridioides
64	1966	1965	S	1496	                  Clostridioides difficile
01	1	1	S1	699037	                    Clostridioides difficile M68
01	1	0	F	31979	              Clostridiaceae
01	1	0	G	1485	                Clostridium
01	1	1	S	1502	                  Clostridium perfringens
52	1138	0	C	91061	          Bacilli
52	1138	0	O	186826	            Lactobacillales
40	582	0	F	81852	              Enterococcaceae
40	582	0	G	1350	                Enterococcus
40	582	582	S	1351	                  Enterococcus faecalis
12	556	0	F	33958	              Lactobacillaceae
12	556	229	G	1578	                Lactobacillus
57	324	324	S	1590	                  Lactobacillus plantarum
01	1	1	S	1613	                  Lactobacillus fermentum
01	1	0	S	109790	                  Lactobacillus jensenii
01	1	1	S1	525329	                    Lactobacillus jensenii JV-V16
01	1	1	S	1610	                  Lactobacillus coryniformis
01	1	0	C	909932	          Negativicutes
01	1	0	O	1843489	            Veillonellales
01	1	0	F	31977	              Veillonellaceae
01	1	0	G	29465	                Veillonella
01	1	1	S	29466	                  Veillonella parvula
34	31	0	P	201174	        Actinobacteria
34	31	0	C	1760	          Actinobacteria
34	31	0	O	85004	            Bifidobacteriales
34	31	0	F	31953	              Bifidobacteriaceae
34	31	0	G	1678	                Bifidobacterium
34	31	29	S	1680	                  Bifidobacterium adolescentis
02	2	2	S1	367928	                    Bifidobacterium adolescentis ATCC 15703
23	3020	0	P	1224	      Proteobacteria
39	2489	0	C	1236	        Gammaproteobacteria
39	2489	0	O	91347	          Enterobacterales
00	1817	1503	F	543	            Enterobacteriaceae
45	223	0	G	561	              Escherichia
45	223	222	S	562	                Escherichia coli
01	1	0	S1	1072458	                  Escherichia coli O7:K1
01	1	1	S2	1072459	                    Escherichia coli O7:K1 str. CE10
77	70	0	G	590	              Salmonella
77	70	56	S	28901	                Salmonella enterica
14	13	13	S1	59201	                  Salmonella enterica subsp. enterica
01	1	0	S1	59202	                  Salmonella enterica subsp. salamae
01	1	1	S2	2500152	                    Salmonella enterica subsp. salamae serovar 42:r:-
14	13	1	G	570	              Klebsiella
09	8	7	S	573	                Klebsiella pneumoniae
01	1	1	S1	72407	                  Klebsiella pneumoniae subsp. pneumoniae
01	1	1	S	244366	                Klebsiella variicola
01	1	1	S	1134687	                Klebsiella michiganensis
01	1	1	S	1463165	                Klebsiella quasipneumoniae
01	1	1	S	2153354	                Klebsiella sp. WCHKl090001
03	3	0	G	620	              Shigella
01	1	1	S	621	                Shigella boydii
01	1	1	S	622	                Shigella dysenteriae
01	1	1	S	623	                Shigella flexneri
02	2	0	F1	191675	              unclassified Enterobacteriaceae
02	2	0	F2	84563	                ant, tsetse, mealybug, aphid, etc. endosymbionts
01	1	0	F3	146507	                  aphid secondary symbionts
01	1	1	S	1199245	                    secondary endosymbiont of Ctenarytaina eucalypti
01	1	0	G	1906660	                  Candidatus Mikella
01	1	1	S	1778264	                    Candidatus Mikella endobia
01	1	0	G	547	              Enterobacter
01	1	0	G1	354276	                Enterobacter cloacae complex
01	1	1	S	158836	                  Enterobacter hormaechei
01	1	0	G	544	              Citrobacter
01	1	0	G1	1344959	                Citrobacter freundii complex
01	1	1	S	2066049	                  Citrobacter freundii complex sp. CFNIH2
01	1	0	G	1780190	              Izhakiella
01	1	1	S	2579935	                Izhakiella sp. KSNA2
40	672	0	F	1903411	            Yersiniaceae
40	672	3	G	629	              Yersinia
36	669	669	S	630	                Yersinia enterocolitica
84	531	0	P1	68525	        delta/epsilon subdivisions
84	531	0	C	29547	          Epsilonproteobacteria
84	531	0	O	213849	            Campylobacterales
84	531	0	F	72293	              Helicobacteraceae
84	531	0	G	209	                Helicobacter
84	531	530	S	210	                  Helicobacter pylori
01	1	1	S1	1163741	                    Helicobacter pylori Shi169
39	2035	0	D1	1783270	      FCB group
39	2035	0	D2	68336	        Bacteroidetes/Chlorobi group
39	2035	0	P	976	          Bacteroidetes
39	2035	0	C	200643	            Bacteroidia
39	2035	0	O	171549	              Bacteroidales
39	2035	0	F	815	                Bacteroidaceae
39	2035	0	G	816	                  Bacteroides
73	1066	1066	S	817	                    Bacteroides fragilis
65	968	271	S	821	                    Bacteroides vulgatus
67	697	697	S1	435590	                      Bacteroides vulgatus ATCC 8482
01	1	1	S	820	                    Bacteroides uniformis
84	894	0	P	32066	      Fusobacteria
84	894	0	C	203490	        Fusobacteriia
84	894	0	O	203491	          Fusobacteriales
84	894	0	F	203492	            Fusobacteriaceae
84	894	0	G	848	              Fusobacterium
84	894	0	S	851	                Fusobacterium nucleatum
84	894	379	S1	76856	                  Fusobacterium nucleatum subsp. nucleatum
67	515	515	S2	190304	                    Fusobacterium nucleatum subsp. nucleatum ATCC 25586
01	1	0	D1	1783257	      PVC group
01	1	0	P	74201	        Verrucomicrobia
01	1	0	C	203494	          Verrucomicrobiae
01	1	0	O	48461	            Verrucomicrobiales
01	1	0	F	1647988	              Akkermansiaceae
01	1	0	G	239934	                Akkermansia
01	1	1	S	239935	                  Akkermansia muciniphila

Species	Strain	Species detected	Strain detected	Name of strain detected
Bacteriodes fragilis	ATCC 25285	Yes	No
Bacteroides vulgatus	ATCC 8482	Yes	Yes	Bacteroides vulgatus ATCC 8482
Clostridioides difficile	ATCC 9689	Yes	Yes	Clostridioides difficile M68
Enterobacter cloacae	ATCC 13047	No
Escherichia coli	ATCC 700926	Yes	Yes	Escherichia coli O7:K1
Salmonella enterica subsp. enterica	ATCC 9150	Yes	Yes	Salmonella enterica subsp. Enterica & Salmonella enterica subsp. Salamae
Bifidobacterium adolescentis	ATCC 15703	Yes	Yes	Bifidobacterium adolescentis ATCC 15703
Enterococcus faecalis	ATCC 700802	Yes	No
Lactobacillus plantarum	ATCC BAA-793	Yes	No
Helicobacter pylori	ATCC 700392	Yes	Yes	Helicobacter pylori Shi169
Yersinia enterocolitica	ATCC 27729	Yes	No
Fusobacterium nucleatum subsp. nucleatum	ATCC 25586	Yes	Yes	Fusobacterium nucleatum subsp. nucleatum ATCC 25586

Of the 12 species in the sample, 11 could be detected to species level. Of these 11 detected species, seven could be detected to strain level. Of these seven detected strains, three are exact matches to the ATCC strain designation.

ATCC-oral

kraken2 --db Kraken2/standard_DB --threads 24 --out atcc_oral.kraken --report atcc_oral.kreport atcc_oral_noprimers_lengthfilter.fastq 
Loading database information... done.
16173 sequences (23.61 Mbp) processed in 0.481s (2018.2 Kseq/m, 2945.82 Mbp/m).
  16173 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

00	16173	0	R	1	root
00	16173	0	R1	131567	  cellular organisms
00	16173	0	D	2	    Bacteria
61	5274	0	D1	1783272	      Terrabacteria group
76	4651	0	P	1239	        Firmicutes
37	2485	0	C	91061	          Bacilli
37	2485	0	O	186826	            Lactobacillales
37	2485	0	F	1300	              Streptococcaceae
37	2485	2450	G	1301	                Streptococcus
10	16	16	S	2420308	                  Streptococcus sp. FDAARGOS_520
02	3	3	S	1346	                  Streptococcus iniae
02	3	3	S	1313	                  Streptococcus pneumoniae
01	2	0	G1	119603	                  Streptococcus dysgalactiae group
01	2	0	S	1334	                    Streptococcus dysgalactiae
01	2	0	S1	119602	                      Streptococcus dysgalactiae subsp. equisimilis
01	2	2	S2	486410	                        Streptococcus dysgalactiae subsp. equisimilis GGS_124
01	1	1	S	2420310	                  Streptococcus sp. FDAARGOS_522
01	1	1	S	1917441	                  Streptococcus ruminantium
01	1	1	S	150055	                  Streptococcus lutetiensis
01	1	1	S	28037	                  Streptococcus mitis
01	1	1	S	1319	                  Streptococcus sp. 'group B'
01	1	0	S	1311	                  Streptococcus agalactiae
01	1	1	S1	1309807	                    Streptococcus agalactiae ILRI005
01	1	1	S	1309	                  Streptococcus mutans
01	1	1	S	1308	                  Streptococcus thermophilus
01	1	1	S	1111760	                  Streptococcus troglodytae
01	1	1	S	1305	                  Streptococcus sanguinis
01	1	0	G1	671232	                  Streptococcus anginosus group
01	1	1	S	1338	                    Streptococcus intermedius
39	2166	0	C	909932	          Negativicutes
39	2166	0	O	1843489	            Veillonellales
39	2166	0	F	31977	              Veillonellaceae
39	2166	0	G	29465	                Veillonella
39	2166	2166	S	29466	                  Veillonella parvula
85	623	0	P	201174	        Actinobacteria
85	623	0	C	1760	          Actinobacteria
85	623	0	O	2037	            Actinomycetales
85	623	0	F	2049	              Actinomycetaceae
80	615	0	G	2529408	                Schaalia
80	615	615	S	1660	                  Schaalia odontolytica
05	8	0	G	1654	                Actinomyces
05	8	8	S	1852377	                  Actinomyces pacaensis
08	4542	0	P	32066	      Fusobacteria
08	4542	0	C	203490	        Fusobacteriia
08	4542	0	O	203491	          Fusobacteriales
08	4542	0	F	203492	            Fusobacteriaceae
08	4542	0	G	848	              Fusobacterium
08	4542	0	S	851	                Fusobacterium nucleatum
08	4542	2056	S1	76856	                  Fusobacterium nucleatum subsp. nucleatum
37	2485	2485	S2	190304	                    Fusobacterium nucleatum subsp. nucleatum ATCC 25586
01	1	1	S2	525283	                    Fusobacterium nucleatum subsp. nucleatum ATCC 23726
62	4144	0	P	1224	      Proteobacteria
62	4144	0	C	1236	        Gammaproteobacteria
62	4144	0	O	135625	          Pasteurellales
62	4144	0	F	712	            Pasteurellaceae
62	4144	0	G	724	              Haemophilus
62	4144	4144	S	729	                Haemophilus parainfluenzae
68	2213	0	D1	1783270	      FCB group
68	2213	0	D2	68336	        Bacteroidetes/Chlorobi group
68	2213	0	P	976	          Bacteroidetes
68	2213	0	C	200643	            Bacteroidia
68	2213	0	O	171549	              Bacteroidales
68	2212	0	F	171552	                Prevotellaceae
68	2212	0	G	838	                  Prevotella
68	2212	2212	S	28132	                    Prevotella melaninogenica
01	1	0	F	815	                Bacteroidaceae
01	1	0	G	816	                  Bacteroides
01	1	1	S	817	                    Bacteroides fragilis

Species	Strain	Species detected	Strain detected	Name of strain detected
Schaalia odontolytica	ATCC 17982	Yes	No
Prevotella melaninogenica	ATCC 25845	Yes	No
Fusobacterium nucleatum subsp. nucleatum	ATCC 25586	Yes	Yes	Fusobacterium nucleatum subsp. nucleatum ATCC 25586
Streptococcus mitis	ATCC 49456	Yes	No
Veillonella parvula	ATCC 17745	Yes	No
Haemophilus parainfluenzae	ATCC 33392	Yes	No

Of the six species in the sample, all six could be detected to species level. Of these six species detected, one could be detected to strain level, which was an exact match to the ATCC strain designation.

Madison Water Samples Analysis

To demonstrate the the application of LoopSeq 16S kit on real world environmental samples, four water samples were collected from Madison (Wisconsin, USA): 1) Home aquarium (Aquarium) 2) Feynman Pond outside the Promega office buildings (FeynmanPond) 3) Residential rain barrel collecting rain runoff (RainBarrel) 4) Riva Pond in a residential neighbourhood (RivaPond)

Samples were classified using an LCA classifer and diversity analysis was performed on the classification results.

The 16S + 18S kit was used but only the 16S contigs are examined.

Data Preparation

LoopSeq 16S Contig QC & Filtering

Madison-Aquarium

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

madison_aquarium <- file.path("madison_aquarium.fastq")

#Basic statistics of raw contigs
summary(nchar(getSequences(madison_aquarium)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    231    1393    1457    1355    1489    2921 
    
#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(madison_aquarium, "madison_aquarium_noprimers.fastq", primer.fwd=F_prim, primer.rev=rc(R_prim), compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
6 sequences out of 18585 are being reverse-complemented.
Read in 18585, output 6967 (37.5%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
madison_aquarium_noprimers <- file.path("madison_aquarium_noprimers.fastq")
summary(nchar(getSequences(madison_aquarium_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1176    1442    1453    1443    1455    1544 
   
#Still some contigs that are too short. Filter for length.
filterAndTrim(madison_aquarium_noprimers, "madison_aquarium_noprimers_lengthfilter.fastq", minLen=1400, maxLen=1600, verbose=TRUE, compress=FALSE)
Read in 6967, output 6428 (92.3%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed, length filtered)
madison_aquarium_noprimers_lengthfilter <- file.path("madison_aquarium_noprimers_lengthfilter.fastq")
summary(nchar(getSequences(madison_aquarium_noprimers_lengthfilter)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1402    1451    1453    1447    1455    1544 

Filtered contigs fastq file

FeynmanPond

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

madison_feynman_pond <- file.path("madison_feynman_pond.fastq")

#Basic statistics of raw contigs
summary(nchar(getSequences(madison_feynman_pond)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    222    1376    1451    1352    1487    2714 
    
#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(madison_feynman_pond, "madison_feynman_pond_noprimers.fastq", primer.fwd=F_prim, primer.rev=rc(R_prim), compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
5 sequences out of 20189 are being reverse-complemented.
Read in 20189, output 8251 (40.9%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
madison_feynman_pond_noprimers <- file.path("madison_feynman_pond_noprimers.fastq")
summary(nchar(getSequences(madison_feynman_pond_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1376    1434    1446    1442    1457    1522 

Filtered contigs fastq file

RainBarrel

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

madison_rain_barrel <- file.path("madison_rain_barrel.fastq")
     
#Basic statistics of raw contigs
summary(nchar(getSequences(madison_rain_barrel)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    204    1337    1443    1349    1482    2671 
 
#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(madison_rain_barrel, "madison_rain_barrel_noprimers.fastq", primer.fwd=F_prim, primer.rev=rc(R_prim), compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
2 sequences out of 15140 are being reverse-complemented.
Read in 15140, output 4673 (30.9%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
madison_rain_barrel_noprimers <- file.path("madison_rain_barrel_noprimers.fastq")
summary(nchar(getSequences(madison_rain_barrel_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1375    1436    1446    1443    1456    1496 

Filtered contigs fastq file

RivaPond

Load fastq file containing raw contigs as downloaded from the Loop Genomics cloud platform.

Raw contigs fastq file

#R
library(dada2)
library(ShortRead)

F_prim <- "AGAGTTTGATCMTGGCTCAG" #forward primer
R_prim <- "TACCTTGTTACGACTT" #reverse primer

madison_riva_pond <- file.path("madison_riva_pond.fastq")

#Basic statistics of raw contigs
summary(nchar(getSequences(madison_riva_pond)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    242    1090    1412    1253    1448    2036 
    
#Remove forward and reverse primers. This step also ensures that the contigs that remain after filtering are the full length 16S gene. 
removePrimers(madison_riva_pond, "madison_riva_pond_noprimers.fastq", primer.fwd=F_prim, primer.rev=rc(R_prim), compress=FALSE, max.mismatch=2, allow.indels=FALSE, verbose=TRUE)
7 sequences out of 14352 are being reverse-complemented.
Read in 14352, output 3808 (26.5%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed) 
madison_riva_pond_noprimers <- file.path("madison_riva_pond_noprimers.fastq")
summary(nchar(getSequences(madison_riva_pond_noprimers)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1350    1410    1419    1428    1449    2000 
 
#Still some contigs are too long. Filter for length.
filterAndTrim(madison_riva_pond_noprimers, "madison_riva_pond_noprimers_lengthfilter.fastq", minLen=1400, maxLen=1600, verbose=TRUE, compress=FALSE)
Overwriting file:C:\Users\Nico\Documents\workdir\madison_riva_pond_noprimers_lengthfilter.fastq
Read in 3808, output 3564 (93.6%) filtered sequences.

#Basic statistics of filtered contigs (full length 16S, primers trimmed, length filtered)
madison_riva_pond_noprimers_lengthfilter <- file.path("madison_riva_pond_noprimers_lengthfilter.fastq")
summary(nchar(getSequences(madison_riva_pond_noprimers_lengthfilter)))

Filtered contigs fastq file

Classification with LCA Classifier

Taxonomic classification with Kraken 2 outputs reports which show a hierarchical LCA scheme. While most full length 16S contigs can be classified to species level, some contigs can only be classifed to higher taxonomic levels. To obtain species abundances, the companion tool Bracken is run to redistribute all contigs to species level. This redistribution is limited to lineages with at least one species identified. While this method may not fully represent the species diversity in the sample, it does allow a comparison of relative abundances of species that can be classified.

Aquarium

kraken2 --db Kraken2/standard_DB --threads 24 --out madison_aquarium.kraken --report madison_aquarium.kreport madison_aquarium_noprimers_lengthfilter.fastq 
Loading database information... done.
6428 sequences (9.30 Mbp) processed in 0.440s (876.8 Kseq/m, 1268.39 Mbp/m).
  6428 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

Kraken 2 report

bracken -d Kraken2/standard_DB -i madison_aquarium.kreport -o madison_aquarium.bracken -r 1500 -l S -t 1
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i madison_aquarium.kreport -o madison_aquarium.bracken -k Kraken2/standard_DB/database1500mers.kmer_distrib -l S -t 1
PROGRAM START TIME: 12-31-2019 05:36:30
BRACKEN SUMMARY (Kraken report: madison_aquarium.kreport)
    >>> Threshold: 1 
    >>> Number of species in sample: 118 
	  >> Number of species with reads > threshold: 118 
	  >> Number of species with reads < threshold: 0 
    >>> Total reads in sample: 6428
	  >> Total reads kept at species level (reads > threshold): 6129
	  >> Total reads discarded (species reads < threshold): 0
	  >> Reads distributed: 68
	  >> Reads not distributed (eg. no species above threshold): 231
	  >> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: madison_aquarium.bracken
PROGRAM END TIME: 12-31-2019 05:36:31
  Bracken complete.

Bracken report

FeynmanPond

kraken2 --db Kraken2/standard_DB --threads 24 --out madison_feynman_pond.kraken --report madison_feynman_pond.kreport madison_feynman_pond_noprimers.fastq 
Loading database information... done.
8251 sequences (11.89 Mbp) processed in 0.516s (959.3 Kseq/m, 1382.90 Mbp/m).
  8251 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

Kraken 2 report

bracken -d Kraken2/standard_DB -i madison_feynman_pond.kreport -o madison_feynman_pond.bracken -r 1500 -l S -t 1
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i madison_feynman_pond.kreport -o madison_feynman_pond.bracken -k Kraken2/standard_DB/database1500mers.kmer_distrib -l S -t 1
PROGRAM START TIME: 12-31-2019 05:45:10
BRACKEN SUMMARY (Kraken report: madison_feynman_pond.kreport)
    >>> Threshold: 1 
    >>> Number of species in sample: 332 
	  >> Number of species with reads > threshold: 332 
	  >> Number of species with reads < threshold: 0 
    >>> Total reads in sample: 8251
	  >> Total reads kept at species level (reads > threshold): 5870
	  >> Total reads discarded (species reads < threshold): 0
	  >> Reads distributed: 496
	  >> Reads not distributed (eg. no species above threshold): 1885
	  >> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: madison_feynman_pond.bracken
PROGRAM END TIME: 12-31-2019 05:45:10
  Bracken complete.

Bracken report

RainBarrel

kraken2 --db Kraken2/standard_DB --threads 24 --out madison_rain_barrel.kraken --report madison_rain_barrel.kreport madison_rain_barrel_noprimers.fastq
Loading database information... done.
4673 sequences (6.74 Mbp) processed in 0.274s (1022.6 Kseq/m, 1475.63 Mbp/m).
  4673 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

Kraken 2 report

bracken -d Kraken2/standard_DB -i madison_rain_barrel.kreport -o madison_rain_barrel.bracken -r 1500 -l S -t 1
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i madison_rain_barrel.kreport -o madison_rain_barrel.bracken -k Kraken2/standard_DB/database1500mers.kmer_distrib -l S -t 1
PROGRAM START TIME: 12-31-2019 06:28:32
BRACKEN SUMMARY (Kraken report: madison_rain_barrel.kreport)
    >>> Threshold: 1 
    >>> Number of species in sample: 165 
	  >> Number of species with reads > threshold: 165 
	  >> Number of species with reads < threshold: 0 
    >>> Total reads in sample: 4673
	  >> Total reads kept at species level (reads > threshold): 2336
	  >> Total reads discarded (species reads < threshold): 0
	  >> Reads distributed: 1489
	  >> Reads not distributed (eg. no species above threshold): 848
	  >> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: madison_rain_barrel.bracken
PROGRAM END TIME: 12-31-2019 06:28:32
  Bracken complete.

Bracken report

RivaPond

kraken2 --db Kraken2/standard_DB --threads 24 --out madison_riva_pond.kraken --report madison_riva_pond.kreport madison_riva_pond_noprimers_lengthfilter.fastq 
Loading database information... done.
3564 sequences (5.10 Mbp) processed in 0.339s (631.5 Kseq/m, 903.08 Mbp/m).
  3564 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

Kraken 2 report

bracken -d Kraken2/standard_DB -i madison_riva_pond.kreport -o madison_riva_pond.bracken -r 1500 -l S -t 1
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i madison_riva_pond.kreport -o madison_riva_pond.bracken -k Kraken2/standard_DB/database1500mers.kmer_distrib -l S -t 1
PROGRAM START TIME: 12-31-2019 06:16:39
BRACKEN SUMMARY (Kraken report: madison_riva_pond.kreport)
    >>> Threshold: 1 
    >>> Number of species in sample: 166 
	  >> Number of species with reads > threshold: 166 
	  >> Number of species with reads < threshold: 0 
    >>> Total reads in sample: 3564
	  >> Total reads kept at species level (reads > threshold): 2507
	  >> Total reads discarded (species reads < threshold): 0
	  >> Reads distributed: 87
	  >> Reads not distributed (eg. no species above threshold): 970
	  >> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: madison_riva_pond.bracken
PROGRAM END TIME: 12-31-2019 06:16:39
  Bracken complete.

Bracken report

Diversity Analysis

The number of species identified and top 3 species by abundance.

Sample	Species count	First	Second	Third
Aquarium	118	Limnohabitans sp. 63ED37-2 (0.55)	Polynucleobacter necessarius (0.12)	Novosphingobium ginsenosidimutans (0.05)
FeynmanPond	332	Candidatus Planktophila limnetica (0.09)	Arachidicoccus ginsenosidivorans (0.09)	Limnohabitans sp. 103DPR2 (0.05)
RainBarrel	165	Steroidobacter denitrificans (0.25)	Pseudomonas sp. SWI6 (0.11)	Planctomycetes bacterium Pan189 (0.08)
RivaPond	166	Scytonema sp. HK-05 (0.09)	Niabella soli (0.07)	Planctomycetes bacterium Pan97 (0.06)

Combine all bracken reports into a contingency table, convert to biom file and run plot rank abundance.

Madison water combined

biom convert -i madison_water_combined.txt -o madison_water_combined.biom --to-hdf5 --table-type="OTU table" --process-obs-metadata taxonomy
plot_rank_abundance_graph.py -i madison_water_combined.biom -s '*' -x -f png -v -o madison_water_combined_rank_abundance.png

The Aquarium sample is dominated by fewer high abundance species while the FeynmanPond sample is most diverse and with a more even species distribution.

#R
library(dplyr)
library(ggplot2)
library(RColorBrewer)

#Melt and add sample info to madison_water_combined.txt
data <- read.csv("madison_water_combined.csv", header = TRUE, stringsAsFactors = FALSE)
data <- data %>% arrange(Sample, Abundance)

x <- ggplot(data, aes(x=Sample, y=Abundance, fill=factor(Species),group=Abundance)) + 
     geom_bar(stat="identity",position="stack") + 
     theme(legend.position="none",
           axis.text.x = element_text(size=10,angle=90,vjust=0))
	   
cols <- colorRampPalette(brewer.pal(12, "Paired")) 
myPal <- cols(length(unique(data$Species)))

x + scale_fill_manual(values = sample(myPal))

newpath <- paste0("madison_water_combined_barchart.png")
ggsave(filename=newpath,width=3,height=8,units="cm",dpi=600)

LoopSeq 16S

Analysis for LoopSeq 16S manuscript

Sections

Zymo Mock Community Analysis

ATCC Mock Human Communities Analysis

Madison Water Samples Analysis

Computing Environment

R v3.6.1

CentOS6 Linux

Zymo Mock Community Analysis

Data Preparation

Mock Community Description

LoopSeq 16S Contig QC & Filtering

Mapping to Provided References

Intergenomic 16S Gene Copies & Contig Accuracy

QC Summary

Comparison with Illumina Short Paired-End Reads

QC Summary

Mapping to NCBI RefSeq References

Classification with LCA Classifier

ATCC Mock Human Communities Analysis

Data Preparation

Mock Community Description

LoopSeq 16S Contig QC & Filtering

ATCC-gut

ATCC-oral

Mapping to Whole Genome References

ATCC-gut

ATCC-oral

Strain-level Detection Using LCA Classifier

ATCC-gut

ATCC-oral

Madison Water Samples Analysis

Data Preparation

LoopSeq 16S Contig QC & Filtering

Madison-Aquarium

FeynmanPond

RainBarrel

RivaPond

Classification with LCA Classifier

Aquarium

FeynmanPond

RainBarrel

RivaPond

Diversity Analysis