[...] of the samples using the CloneSmart Blunt Cloning Kit (Lucigen). Plasmid sequencing of the clones from the three libraries was conducted with dye-terminator Sanger sequencing at the University of Hawaiʻi Advanced Studies in Genomics, Proteomics, and Bioinformatics sequencing facility. Paired-end reads were obtained from 391 of the 1651 sequenced inserts for a total of 1942 sequences.

[...] sequences from uncultivated organisms. Each sequence was classified based on the identity of the sequence with which it shared the greatest similarity, except when the most similar sequence was non-viral but the sequence also displayed significant similarity (E-value ≤ 0.001) to a virus. In the latter case, the sequence was classified according to the most similar virus-derived sequence. Sequences classified as viral were further classified by family and protein type.

Phylogenetic Analysis

To assess the phylogenetic diversity of viruses in our library, sequences that had any significant similarity (not just the highest similarity) to a viral DNA polymerase were used to construct a phylogram. These sequences were translated and aligned with other translated DNA polymerase gene sequences from viral genomes present in GenBank using custom scripts. A maximum-likelihood tree was then constructed from this amino acid alignment as previously described [30] with RAxML [31] using the WAG substitution matrix with a subset estimation of invariable sites and a gamma distribution in four discrete categories (WAG+Γ4+I).

Sequence Assembly and Contig Analysis

Sequencher was used to assemble forward and reverse reads using the “Assemble by Name” function. Those that assembled were merged into consensus sequences. The resulting 1723 sequences were then assembled using the criteria of a minimum overlap of 20 bp and a minimum identity of 98% according to Breitbart et al. [13].
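As a rough illustration of what the two assembly thresholds mean, the sketch below tests whether the end of one sequence overlaps the start of another by at least 20 bp at 98% identity or better. It is not Sequencher's algorithm, which is not described here: the function name `best_overlap` and its parameters are hypothetical, and the check is simplified to ungapped suffix-to-prefix overlaps on the same strand.

```python
def best_overlap(left, right, min_overlap=20, min_identity_pct=98):
    """Find the longest ungapped overlap between the end of `left` and the
    start of `right` that meets both thresholds.

    Returns (overlap_length, identity_fraction), or None if no overlap
    qualifies. Simplified illustration only: no gaps, no reverse complement.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink toward the minimum.
    for length in range(max_len, min_overlap - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        matches = sum(1 for a, b in zip(suffix, prefix) if a == b)
        # Integer comparison avoids floating-point edge cases at exactly 98%.
        if matches * 100 >= min_identity_pct * length:
            return length, matches / length
    return None


# Hypothetical toy sequences sharing a 25 bp end-to-start overlap.
core = "TTGACCATGGCATCGATCGGATCCA"
read_a = "A" * 30 + core
read_b = core + "C" * 30
result = best_overlap(read_a, read_b)
```

With these toy inputs the only qualifying overlap is the shared 25 bp segment. A shared segment shorter than 20 bp, such as a 7 bp match, returns `None` because it falls below the minimum overlap.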
Open reading frames (ORFs) were predicted only in the larger assembled contigs (>4 kb) using GeneMark.hmm 2.0 [32] and annotated by comparing the ORF sequences to the GenBank non-redundant protein database using BLASTx [28,29] with the same criteria used when analyzing the trimmed sequence library.

Analysis of Sequences

Sequences from the 3 libraries were pooled and analyzed as one library. Sequence trimming and assembly were performed with Sequencher 4.10.1 (Gene Codes Corp.). Vector sequence was removed using the automatic recognition function in the software. Assembly of all sequences to the vector sequence as a template revealed additional vector-only sequences, which were removed. Forward and reverse reads of the same clone were assembled using the “Assemble by Name” function. Some of these assemblies produced odd results, with forward and reverse reads in the same direction. In some cases, the second strand assembled to the first immediately after a string of Ns in the middle of the first strand. These odd assemblies (11 contigs of 22 sequences) were removed. The remaining sequences were trimmed such that the first and last 99 base pairs (bp) contained <1 ambiguity and the first and last 20 bp contained <2 bp with a confidence value <40%. These conditions were applied repeatedly until all sequences met the criteria. The sequences were then trimmed further using the criterion that the first and last 20 bp had <1 bp with a confidence <20%. In some cases, sequences with poor quality regions (strings of Ns) in the middle of the sequence were not identified by these criteria and these.