Description:

TrIdent is a reference-independent bioinformatics tool that automates the analysis of transductomics data by detecting, classifying, and characterizing potential transducing events. Transductomics is a DNA sequencing-based method for the detection and characterization of transduction events. Developed by Kleiner et al. (2020), transductomics relies on mapping reads from the virome (the virus-like particle, or VLP, fraction) of a sample to contigs assembled from the metagenome (whole-community) of the same sample. Reads from bacterial DNA carried by viruses and other VLPs will map back to the bacterial contigs of origin, creating read coverage patterns indicative of potential ongoing transduction.

Reference: Kleiner, M., Bushnell, B., Sanderson, K.E. et al. Transductomics: sequencing-based detection and analysis of transduced DNA in pure cultures and microbial communities. Microbiome 8, 158 (2020). https://doi.org/10.1186/s40168-020-00935-5


TrIdent tutorial:

TrIdent: Transduction Identification

TrIdent consists of three main functions to automatically detect, classify, and characterize potential transducing events:

  • TrIdent_Classifier(): Classifies contigs as ‘Prophage-like’, ‘Sloping’, ‘HighCoverageNoPattern’, or ‘InsufficientCoverage’
  • Plot_TrIdentPatternMatches(): Plots the results of TrIdent_Classifier()
  • SpecializedTransduction_ID(): Detects potential specialized transduction on contigs classified as Prophage-like

Running TrIdent in default mode is the easiest option, but this tutorial also shows how to use various arguments to modify TrIdent’s results.

Input Data:

The datasets used in this tutorial, ‘VLPFraction_sampledata’ and ‘WholeCommunity_sampledata’, were generated from a conventional mouse fecal metagenome. The homogenized feces represents the whole-community. The VLP-fraction of the fecal sample was separated and purified via CsCl density gradient ultracentrifugation. Both the whole-community and VLP-fraction were sequenced with Illumina (paired-end mode, 150 bp reads), after which the metagenome was assembled from the whole-community reads. The whole-community and VLP-fraction raw reads were mapped to the metagenome contigs using BBMap. The resulting .bam files were sorted and indexed using the Samtools sort and index functions, respectively. Finally, two pileup files were generated to summarize the respective read coverages across each contig. The contigs were pre-filtered to remove contigs shorter than 30 kbp. A subset of 10 contigs from the mouse fecal metagenome was selected for the sample dataset used in this tutorial.

Note: Transductomics has specific sequencing requirements! Sample preparation and sequencing procedures are detailed in Kleiner et al. (2020).

The reads were mapped with BBMap using ambiguous=random, qtrim=lr, and minid=0.97, and the two pileup files were generated with binsize=100. The binsize/windowsize must be 100!

We recommend using the commands below (with your own sorted .bam files) to generate the pileup files needed for TrIdent:

pileup.sh in=VLPFraction_ReadMappingSorted.bam out=VLPFraction.pileupcovstats bincov=VLPFraction.bincov100 binsize=100 stdev=t

pileup.sh in=WholeCommunity_ReadMappingSorted.bam out=WholeCommunity.pileupcovstats bincov=WholeCommunity.bincov100 binsize=100 stdev=t

In R…

Load the package into your library. If you don’t already have TrIdent, follow the package’s installation instructions first.

library(TrIdent)

Import your pileup files:

Note: The ‘VLPFraction_sampledata’ and ‘WholeCommunity_sampledata’ pileup files needed for the tutorial come preloaded with the TrIdent package. There is no need to load or import these files.

Here is what the raw pileup file should look like: (note that the information in the first column may be formatted differently depending on your contig accession format)

##                                           V1   V2  V3      V4
##  NODE_4 length_493049_cov_5.62057_ID_9556231 2.62 100 1938832
##  NODE_4 length_493049_cov_5.62057_ID_9556231 6.94 200 1938932
##  NODE_4 length_493049_cov_5.62057_ID_9556231 6.39 300 1939032
##  NODE_4 length_493049_cov_5.62057_ID_9556231 5.98 400 1939132
##  NODE_4 length_493049_cov_5.62057_ID_9556231 8.12 500 1939232
##  NODE_4 length_493049_cov_5.62057_ID_9556231 5.02 600 1939332
  • The first column contains character strings of the contig accessions given by the assembler used.
  • The second column contains numerical values of read coverages binned over 100 bp regions.
  • The third column contains integer values for the starting position of each 100 bp bin. The position restarts at the start of each new contig.
  • The fourth column contains integer values for the starting position of each 100 bp bin. The position does NOT restart at the start of each new contig.

The ‘VLPFraction.bincov100’ and ‘WholeCommunity.bincov100’ files generated by BBMap’s pileup.sh in the example above can be used directly as input to TrIdent. TrIdent has built-in data cleaning and reformatting for files output specifically by BBMap’s pileup.sh. If you do not use BBMap’s pileup.sh to generate the pileup files, then you are responsible for data cleaning and reformatting. Your pileup files must be in the following format (same column names, same column classes, etc.) and you must use the cleanup=FALSE argument in TrIdent_Classifier(), Plot_TrIdentPatternMatches(), and SpecializedTransduction_ID().

##  ref_name coverage position
##    NODE_4     2.62      100
##    NODE_4     6.94      200
##    NODE_4     6.39      300
##    NODE_4     5.98      400
##    NODE_4     8.12      500
##    NODE_4     5.02      600

The CleanVLPFraction_sampledata comes preloaded with TrIdent and provides an example of a cleaned and reformatted pileup file. Please use this dataframe as an example if you are doing your own data-cleaning and reformatting.
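If you are reformatting pileup files yourself, the conversion from the raw four-column layout to the cleaned three-column layout might look like the sketch below. It assumes that the contig name is the first whitespace-delimited token of the accession string, as in the example tables above. The helper reformat_pileup() is hypothetical and is not part of TrIdent; compare your result against CleanVLPFraction_sampledata before relying on it.

```r
# Hypothetical helper (not part of TrIdent): convert a raw four-column
# pileup data frame (V1..V4) into the ref_name/coverage/position layout.
reformat_pileup <- function(raw) {
  data.frame(
    # Assumption: the contig name is the first whitespace-delimited token
    ref_name = sub("\\s.*$", "", raw$V1),
    coverage = as.numeric(raw$V2),
    # V3 restarts at every contig, so it is the per-contig position
    position = as.integer(raw$V3),
    stringsAsFactors = FALSE
  )
}

# Toy input that mimics the first rows of the raw pileup shown above
raw <- data.frame(
  V1 = rep("NODE_4 length_493049_cov_5.62057_ID_9556231", 3),
  V2 = c(2.62, 6.94, 6.39),
  V3 = c(100L, 200L, 300L),
  V4 = c(1938832L, 1938932L, 1939032L),
  stringsAsFactors = FALSE
)

clean <- reformat_pileup(raw)
print(clean)
```

A data frame prepared this way would then be passed to the TrIdent functions together with cleanup=FALSE.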

Running TrIdent:

TrIdent_Classifier():

TrIdent_Classifier() is the main function that TrIdent relies on. This function cleans and reformats your input data, filters contigs based on length and read coverage, performs pattern-matching to classify contigs, identifies active/highly abundant and heterogeneously integrated prophage-like elements, determines which contigs have high VLP-fraction:whole-community read coverage ratios, identifies the start and stop positions and sizes of pattern matches, calculates slopes for Sloping pattern matches, generates a pattern-match quality score, and outputs all of this information in a summary table.

TrIdent_Classifier() features:

1. Contig filtering:

Contigs are filtered out based on short length or low read coverage. TrIdent filters out contigs that do not have at least 10x coverage on a total of 5,000 bp across the whole contig. With 100 bp bins, 5,000 bp corresponds to 50 bins, so a contig whose 50th greatest coverage value is less than 10 does not have 5,000 bp with read coverage of 10x or more and is removed for insufficient read coverage. The low read coverage filtering is done in this way to avoid removing long contigs with short Prophage-like patterns, which might be lost if filtering were based on averages or medians. Additionally, contigs shorter than 30 kbp are filtered out by default; this can be changed with the mincontiglength parameter. Contigs shorter than 30 kbp may be of poor quality and are not long enough to show clear transduction patterns. If you would like to speed up TrIdent’s processing time, consider pre-filtering your assembly for contigs longer than 30 kbp!
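The read coverage filter described above can be sketched as follows. This is a minimal illustration that assumes equal-width 100 bp bins; passes_coverage_filter() is a hypothetical helper and is not TrIdent's internal code.

```r
# Hypothetical helper: does a contig have at least `min_bp` worth of bins
# with coverage >= `min_cov`? Assumes equal-width bins of `bin_bp` bp.
passes_coverage_filter <- function(coverage, min_cov = 10,
                                   min_bp = 5000, bin_bp = 100) {
  n_bins <- min_bp / bin_bp                     # 5000 / 100 = 50 bins
  if (length(coverage) < n_bins) return(FALSE)  # contig too short to qualify
  nth_greatest <- sort(coverage, decreasing = TRUE)[n_bins]
  nth_greatest >= min_cov
}

# Long, mostly low-coverage contig with a 6 kbp high-coverage block
long_contig <- c(rep(1, 2000), rep(40, 60), rep(1, 2000))
kept_long <- passes_coverage_filter(long_contig)   # TRUE
median_long <- median(long_contig)                 # 1

# Contig with only 3 kbp of high coverage
kept_short <- passes_coverage_filter(c(rep(1, 500), rep(40, 30)))  # FALSE
```

The first contig passes even though its median coverage is only 1, which is the situation that a filter based on averages or medians would handle poorly.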

2. Pattern-matching:

Contigs that are not filtered out proceed to pattern-matching where they are matched with a variety of patterns representing transduction events. Patterns are ‘built’ and the x and y-axis values are scaled specific to the characteristics of each contig to ensure the pattern-matching is data agnostic. After a pattern is built, it is translated across the contig being assessed, and the mean absolute difference in coverage (match-score) between the contig and the pattern is calculated at each translation. Theoretically, if a pattern is a perfect match to the coverages on a contig, then taking the mean absolute difference in y-axis values will result in a 0. Obviously, no pattern will be a perfect match to a contig, but the closer to 0 the match-score is, the better that pattern matches the read coverage pattern on the contig. The contig is classified based on the pattern that achieves the lowest match-score.
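The translation and scoring step can be illustrated with a small sketch. It uses a single block-shaped pattern with a fixed width and height as a stand-in; TrIdent itself tests several pattern shapes, widths, and heights, so block_pattern() and best_block_match() are hypothetical helpers for illustration only.

```r
# Hypothetical helper: a contig-length pattern with one block of height `top`
block_pattern <- function(n, start, width, base, top) {
  pattern <- rep(base, n)
  pattern[start:(start + width - 1)] <- top
  pattern
}

# Translate the block across the contig and score every position with the
# mean absolute difference between the coverage and the pattern.
best_block_match <- function(coverage, width) {
  n <- length(coverage)
  starts <- seq_len(n - width + 1)
  scores <- vapply(starts, function(s) {
    pattern <- block_pattern(n, s, width, min(coverage), max(coverage))
    mean(abs(coverage - pattern))
  }, numeric(1))
  list(start = starts[which.min(scores)], score = min(scores))
}

# Toy contig: flat coverage of 2 with a block of 50 in bins 21 to 30
coverage <- c(rep(2, 20), rep(50, 10), rep(2, 20))
best <- best_block_match(coverage, width = 10)
best$start  # 21
best$score  # 0, a perfect match
```

Real coverage data is noisy, so the best score is greater than 0 in practice; only the ranking of scores between patterns matters for classification.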

Patterns:
  • Sloping:

There are four sloping pattern variations in the Sloping class. The sloping pattern is representative of large transfers of bacterial DNA which take place during generalized, lateral and gene transfer agent transduction. Other unknown mechanisms of DNA transfer may also be responsible for sloping patterns. The sloping read coverage is due to the decreasing frequency of DNA packaging moving away from the packaging initiation sites. All tested patterns are adapted by the software to the length of the contig being assessed. The peak of the slope is set to start slightly above the contig’s maximum coverage value and the base of the slope to start at the contig’s minimum coverage value. Different slopes are tested by both increasing the minimum value and decreasing the maximum value until a minimum slope of 0.00015 (a change in read coverage of 15 over 100,000 bp) is reached. Generalized, lateral and gene transfer agent transduction events can span many kbp of DNA and a single contig typically does not capture the entire event. Depending on which part of the transducing event is captured by the contig, the slope can be very steep or close to 0. Patterns 1 and 2 represent contigs that capture a Sloping transducing event somewhere in the middle of the pattern. Patterns 3 and 4 represent contigs that capture the packaging initiation of a Sloping transduction event. Patterns 3 and 4 are translated across the contig in addition to having their slopes changed, while only the slopes are changed on patterns 1 and 2.

  • Prophage-like:

There are three patterns in the Prophage-like class. The block pattern is representative of reads from integrated genetic elements, like prophages or phage-inducible chromosomal islands (PICIs), mapping back to their respective integration sites in the host bacterium’s chromosome. The block patterns are built based on the length of the contig being assessed. The top of the block starts at the contig’s maximum coverage value while the base starts at the contig’s minimum coverage value. The block width starts close to the length of the contig; however, this can be changed with the user-defined variable maxblocksize. While most prophages tend to be in the 30-60 kbp range, some are much larger. By default, we do not set an upper limit for the Prophage-like pattern because the upper limit of prophage sizes is not known. The block widths never get smaller than 10 kbp by default, as it can be difficult to distinguish between prophage-like elements and other mobile genetic elements, like transposons, in that size range. However, the minimum block width is a user-defined variable (minblocksize) and can be changed if smaller mobile genetic elements are of interest. The block heights are decreased, followed by the block widths, to ensure that a variety of block height/width combinations are tested. Pattern 1 represents a prophage-like element that is entirely on the contig while patterns 2 and 3 represent a prophage-like element that trails off the right or left side of the contig, respectively. Each pattern variation is translated across the contig being assessed.

  • No pattern:

There is one pattern in the 'No pattern' class. We use the 'No pattern' pattern to generate a baseline match score, which we compare to the match scores generated with the Sloping or Prophage-like patterns. A contig without a Prophage-like or Sloping pattern is likely to have fairly even read coverage across the contig (i.e. no pattern), which the 'No pattern' pattern tries to replicate. If a contig has a Prophage-like or Sloping pattern, its match score to the 'No pattern' pattern will likely be higher (i.e. worse) than the match score to its true pattern match. The pattern is built to the length of the contig being assessed and is a horizontal line at the mean read coverage of the contig.
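As a small illustration of the baseline, the sketch below scores two toy contigs against a horizontal line at their mean coverage. The helper no_pattern_score() is hypothetical and only mirrors the description above.

```r
# Hypothetical helper: mean absolute difference between the coverage and a
# horizontal line at the contig's mean coverage.
no_pattern_score <- function(coverage) {
  mean(abs(coverage - mean(coverage)))
}

# Fairly even coverage: the flat line fits well (low score)
flat <- c(10, 11, 9, 10, 10, 11, 9, 10)
flat_score <- no_pattern_score(flat)      # 0.5

# Block-shaped coverage: the flat line fits poorly (high score)
blocky <- c(rep(2, 20), rep(50, 10), rep(2, 20))
blocky_score <- no_pattern_score(blocky)  # 15.36
```

A block pattern matched to the second contig would score close to 0, so the contig would be classified by the block pattern rather than by 'No pattern'.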


3. Identifying highly active/abundant OR heterogeneously present prophage-like elements:

A prophage or prophage-like genetic element that is actively replicating or exists in high abundance will generate more reads than its respective host bacterium. This may create a region of elevated read coverage at the element's insertion site that can be visualized in the whole-community fraction contigs. Conversely, if a prophage-like genetic element is only integrated into a portion of the host bacterial population, the read coverage at the insertion site will drop in comparison to the read coverage neighboring the insertion site. Since TrIdent locates prophage-like elements as part of its pattern-matching functionality, we can use these genetic 'coordinates' to see if the associated prophage-like region has elevated or decreased read coverage in the whole-community fraction. Contigs where the ratio of mean read coverage in the prophage-like region to mean read coverage in the non-prophage-like region is greater than 1.3 are labeled as highly active/abundant, whereas contigs where this ratio is less than 0.5 are labeled as not homogeneously integrated into the host population ('mixed'). If the non-prophage-like region is less than 20,000 bp, then the contig is labeled as ‘CBD’ (Can’t Be Determined), as it is difficult to determine whether the prophage-like region is truly elevated or depressed when there is so little non-prophage-like region to compare to.
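The thresholds above can be expressed as a short sketch. The function classify_prophage_activity() is a hypothetical helper that assumes evenly spaced bins and uses the labels from the summary table (YES, NO, MIXED) plus CBD; it is not TrIdent's internal implementation.

```r
# Hypothetical helper: label a prophage-like region from whole-community
# coverage, given the start and stop positions of the pattern match.
classify_prophage_activity <- function(wc_coverage, position,
                                       start_pos, stop_pos) {
  inside <- position >= start_pos & position <= stop_pos
  bin_bp <- position[2] - position[1]   # assumes evenly spaced bins
  # Too little non-prophage-like region to compare against
  if (sum(!inside) * bin_bp < 20000) {
    return(list(label = "CBD", ratio = NA_real_))
  }
  ratio <- mean(wc_coverage[inside]) / mean(wc_coverage[!inside])
  label <- if (ratio > 1.3) "YES" else if (ratio < 0.5) "MIXED" else "NO"
  list(label = label, ratio = ratio)
}

# Toy 100 kbp contig in 1000 bp windows; prophage-like region at 40-60 kbp
position <- seq(1000, 100000, by = 1000)
wc <- ifelse(position >= 40000 & position <= 60000, 30, 20)
res <- classify_prophage_activity(wc, position, 40000, 60000)
res$label  # "YES"
res$ratio  # 1.5
```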

4. Identifying contigs with high VLP-fraction:whole-community read coverages:

Contigs that are classified as having 'No pattern' via pattern-matching are assessed to see if they have high VLP-fraction:whole-community median read coverage ratios. A contig with no pattern match that has an unusually high amount of bacterial DNA in the VLP-fraction may represent the ‘tail’ of a sloping pattern formed by a Sloping event, unknown transduction pathways, or contamination from other sources, e.g. very small, very dense cells that were co-purified with VLPs. As mentioned in the pattern-matching section for Sloping transduction events above, depending on which part of the sloping read coverage pattern is captured by a contig, the slope can vary from being very steep to almost non-existent. Typically the ‘tails’ of sloping events have little to no slope, but still represent transduction events. To differentiate contigs classified as having 'No pattern' that represent real transduction events from those that represent no transduction, we use the median read coverage ratio between the VLP-fraction and whole-community metagenome. The idea is that contigs with a high amount of VLP-fraction read coverage relative to the whole-community metagenome read coverage may represent real transfer events rather than just contaminating bacterial DNA. If the VLP-fraction has a median read coverage of greater than 50% of the median read coverage in the whole-community metagenome, then the contig is classified as having high VLP-fraction read coverage but no distinct read coverage pattern (HighCoverageNoPattern). It is up to the user to decide if they would like to include or exclude these classifications in their assessment.

NOTE: The HighCoverageNoPattern phenomenon is strongly affected by how many reads are sequenced for the whole-community versus the VLP-fraction. For example, if you sequence many more reads for the VLP-fraction than for the whole-community, the ratio may be less meaningful because it increases automatically with more reads sequenced for the VLP-fraction. As such, these ratios need to be interpreted with care.
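The ratio test, and its sensitivity to sequencing depth, can be sketched as follows. The helper has_high_vlp_coverage() is hypothetical and the coverage values are made up for illustration.

```r
# Hypothetical helper: is the VLP-fraction median coverage more than
# `threshold` (50% by default) of the whole-community median coverage?
has_high_vlp_coverage <- function(vlp_coverage, wc_coverage, threshold = 0.5) {
  median(vlp_coverage) / median(wc_coverage) > threshold
}

wc  <- c(20, 22, 21, 19)   # whole-community coverage, median 20.5
vlp <- c(4, 5, 4, 5)       # VLP-fraction coverage, median 4.5

shallow <- has_high_vlp_coverage(vlp, wc)      # FALSE (ratio about 0.22)

# Sequencing the VLP-fraction three times deeper triples its coverage and
# pushes the same contig over the threshold without any biological change.
deep <- has_high_vlp_coverage(3 * vlp, wc)     # TRUE (ratio about 0.66)
```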


TrIdent_Classifier() user parameters:

  • VLP_pileup
  • WC_pileup
  • windowsize
  • minblocksize
  • maxblocksize
  • mincontiglength
  • SaveFilesTo
  • cleanup

VLP_pileup A dataframe containing contig names, coverages averaged over 100 bp windows, and contig positions associated with mapping VLP-fraction reads to whole-community contigs.

WC_pileup A dataframe containing contig names, coverages averaged over 100 bp windows, and contig positions associated with mapping whole-community reads to whole-community contigs.

windowsize TrIdent resizes the bins or ‘windows’ used by pileup.sh to improve processing time and reduce noise in the data. Resizing is done by averaging the read coverages across the specified windowsize. TrIdent_Classifier() resizes windows to 1000 bp by default. Depending on the dataset, the user may want to select a different windowsize. Users can choose between windowsizes of 200, 500, 1000 or 2000 ONLY. We recommend increasing the windowsize to 2000 if processing speed is important or if the data is noisy (i.e. VLP-fraction contaminated with external bacterial DNA). We recommend decreasing the windowsize if your data is very clean and/or small and you are interested in increasing the resolution of read coverage patterns for the initial classification of contigs. Note that decreasing the windowsize will increase TrIdent’s processing time! Increasing/decreasing the windowsize may alter the results of TrIdent_Classifier() slightly. Prophage-like classifications tend to stay the same when the windowsize is changed, but Sloping and HighCoverageNoPattern classifications may switch classes with each other. This is due to how averaging the read coverages affects the sloping pattern.
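Resizing by averaging can be sketched with base R. The helper resize_windows() below is hypothetical; in particular, labeling each resized window by its end position is an assumption made for this illustration and may differ from what TrIdent does internally.

```r
# Hypothetical helper: average 100 bp bins into larger windows per contig.
resize_windows <- function(pileup, windowsize = 1000) {
  stopifnot(windowsize %in% c(200, 500, 1000, 2000))
  # Assumption: each window is labeled by its end position
  pileup$window_end <- ceiling(pileup$position / windowsize) * windowsize
  out <- aggregate(coverage ~ ref_name + window_end, data = pileup, FUN = mean)
  names(out)[names(out) == "window_end"] <- "position"
  out[order(out$ref_name, out$position), ]
}

# Toy contig: 20 bins of 100 bp with coverage 1, 2, ..., 20
pileup <- data.frame(
  ref_name = "NODE_4",
  coverage = as.numeric(1:20),
  position = seq(100, 2000, by = 100),
  stringsAsFactors = FALSE
)

resized <- resize_windows(pileup, windowsize = 1000)
resized$position  # 1000 2000
resized$coverage  # 5.5 15.5
```

Twenty 100 bp bins collapse into two 1000 bp windows, which is why larger windowsizes speed up pattern-matching and smooth out noise at the cost of resolution.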

minblocksize The minimum size of prophage-like patterns. The default is 10,000bp.

maxblocksize The maximum size of prophage-like patterns. The default is undefined/NA (i.e. no max size).

Be aware that changing the minblocksize and maxblocksize will not necessarily remove contigs with prophage-like patterns smaller/larger than the defined parameters from the resulting classifications. Contigs with prophage-like patterns smaller than the minblocksize or larger than the maxblocksize may still be classified; however, the classifications and associated pattern matches may be poor. For example, if a contig has a clear prophage-like pattern that’s ~50,000 bp but the user sets maxblocksize=40000, TrIdent will likely still classify the contig as Prophage-like, as the maximum block-like pattern of 40,000 bp will still achieve a lower pattern-match score than any of the Sloping or 'No pattern' patterns.

mincontiglength The minimum length of contigs processed for pattern-matching. Contigs shorter than the mincontiglength will be filtered out prior to pattern-matching. The default is 30,000 bp.

SaveFilesTo A file path to which TrIdent saves its outputs. This is useful if using TrIdent in a command-line environment. By default, files are NOT saved to an output folder.

cleanup If TRUE, TrIdent will clean and re-format the input pileup files to ensure they are in the correct format for pattern-matching. If FALSE, users are responsible for putting their input files into the correct format (specified above). TRUE by default.

Default:

To run TrIdent_Classifier() in default-mode, run the following:

TrIdent_results <- TrIdent_Classifier(VLP_pileup=VLPFraction_sampledata, WC_pileup=WholeCommunity_sampledata)
## Starting pattern-matching... 
## A quarter of the way done with pattern_matching 
## Half of the way done with pattern_matching 
## Almost done with pattern_matching! 
## Identifying potential transducing events 
## Determining sizes (bp) of potential transduction events 
## Identifying highly active/abundant or heterogenously integrated prophage-like elements 
## Finalizing output 
## Execuion time: 23.9757790565491 
## 1 contigs were filtered out based on low read coverage 
## 0 contigs were filtered out based on length  
## 
##             Sloping   HighCoverageNoPattern  InsufficientCoverage  Prophage-like 
##                2                2                      1                  4 
## 2 of the prophage-like classifications are highly active or abundant 
## 1 of the prophage-like classifications are 'mixed', i.e. heterogenously integrated into their bacterial host population 
## 

TrIdent_Classifier() outputs a histogram containing the distribution of normalized pattern-match scores for your dataset. The normalized pattern-match score is the pattern-match score for a specific classification divided by the contig's average read coverage. This normalized match score is an indicator of a pattern match's quality, i.e. how well the TrIdent pattern fits the associated contig's read coverage profile. Smaller match scores, and smaller normalized match scores, indicate better pattern matches. The histogram can be used to filter the resulting TrIdent classifications by the quality of their pattern matches. A suggested filtering threshold is marked on the plot with a vertical line; however, the suggested threshold tends to be stringent, and quality pattern matches may be filtered out if values above the suggested threshold are not explored. For this reason, we encourage users to initially test a filtering value slightly greater than the suggested threshold.
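To make the normalization and filtering concrete, here is a minimal sketch. The NormMatchScore values are those of the Prophage-like contigs in the summary table shown later in this tutorial; the threshold of 0.2 and the helper normalized_score() are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: pattern-match score divided by mean read coverage
normalized_score <- function(match_score, coverage) {
  match_score / mean(coverage)
}
example_score <- normalized_score(3, c(10, 20, 30))  # 3 / 20 = 0.15

# NormMatchScore values of the Prophage-like contigs in the sample results
summary_table <- data.frame(
  ref_name = c("NODE_44", "NODE_62", "NODE_125", "NODE_368"),
  NormMatchScore = c(0.19117212, 0.14338316, 0.23903635, 0.16050720),
  stringsAsFactors = FALSE
)

# Hypothetical threshold; choose your own from the histogram
threshold <- 0.2
kept <- summary_table[summary_table$NormMatchScore <= threshold, ]
kept$ref_name  # "NODE_44" "NODE_62" "NODE_368"
```

Filtering the table by hand like this is one way to explore thresholds before settling on a value for the MatchScoreFilter argument described below.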

For the 10-contig test dataset, the histogram contains too few values to show a clear distribution. For a 'real' dataset, the histogram looks more like a normal distribution.

Obtain results of main classifier:

The output of TrIdent_Classifier() is a list containing five objects:

  1. Full_summary_table: A summary table containing the classification information for all contigs that were not filtered out.
  2. Cleaned_summary_table: A cleaned summary table containing the classification information for all contigs classified as either Prophage-like, Sloping, or HighCoverageNoPattern (i.e. ‘InsufficientCoverage’ classifications removed)
  3. PatternMatchInfo: A list of pattern-match info that is used by other functions in TrIdent.
  4. FilteredOutContig_table: A table of contigs that were filtered out and the reason why (either low read coverage or too short, i.e. < 30 kbp).
  5. Windowsize: The windowsize used.

Save the desired list-item to a new variable using its associated name:

TrIdent_summary_table <- TrIdent_results$Full_summary_table

The TrIdent_Classifier output summary table:

ref_name   classifications        NormMatchScore  match_size  start_pos  stop_pos  active_prophage  elevation_ratio  slope
NODE_25    InsufficientCoverage   0.20711667      NA          NA         NA        NA               NA               NA
NODE_44    Prophage-like          0.19117212      65000       151000     216000    YES              1.3657           NA
NODE_62    Prophage-like          0.14338316      171000      63000      234000    YES              1.4700           NA
NODE_125   Prophage-like          0.23903635      31000       153000     184000    NO               1.1175           NA
NODE_238   Sloping                0.14225027      146000      1000       147000    NA               NA               -0.2060
NODE_251   Sloping                0.11734790      144000      1000       145000    NA               NA               0.0524
NODE_368   Prophage-like          0.16050720      28000       27000      55000     MIXED            0.3986           NA
NODE_560   HighCoverageNoPattern  0.07094737      NA          NA         NA        NA               NA               NA
NODE_1088  HighCoverageNoPattern  0.08964437      NA          NA         NA        NA               NA               NA

filteredout_contigs <- TrIdent_results$FilteredOutContig_table

The TrIdent_Classifier filtered-out contig summary table:

filteredout_contigs  reason
NODE_4               Low VLP-fraction read cov

Plot_TrIdentPatternMatches():

Plot_TrIdentPatternMatches() will output a list of read coverage plots of all contigs predicted as either Sloping, Prophage-like, or HighCoverageNoPattern and their respective pattern matches.

Plot_TrIdentPatternMatches() user parameters:

  • VLP_pileup
  • WC_pileup
  • transductionclassifications
  • MatchScoreFilter
  • SaveFilesTo
  • cleanup

VLP_pileup A dataframe containing contig names, coverages averaged over 100 bp windows, and contig positions associated with mapping VLP-fraction reads to whole-community contigs.

WC_pileup A dataframe containing contig names, coverages averaged over 100 bp windows, and contig positions associated with mapping whole-community reads to whole-community contigs.

transductionclassifications The complete output from TrIdent_Classifier().

MatchScoreFilter Used to filter the TrIdent_Classifier() classifications by the quality of their respective pattern matches. Choose the filtering threshold using the histogram of normalized pattern-match scores output by TrIdent_Classifier(). There is no filter set by default.

SaveFilesTo A file path to which TrIdent saves its outputs. Each plot is saved as an individual png file named by the associated contig reference name. This is useful if using TrIdent in a command-line environment. By default, files are NOT saved to an output folder.

cleanup If TRUE, TrIdent will clean and re-format the input pileup files to ensure they are in the correct format for pattern-matching. If FALSE, users are responsible for putting their input files into the correct format (specified above). TRUE by default.

Default:

TrIdent_patternmatches <- Plot_TrIdentPatternMatches(VLP_pileup=VLPFraction_sampledata, WC_pileup=WholeCommunity_sampledata, transductionclassifications=TrIdent_results)

View either all plots at once or one plot at a time. All of the output plots can be saved as individual ggplot objects for further manipulation by the user. Each plot is named by its respective contig accession.

View all plots:

TrIdent_patternmatches

View one plot:

TrIdent_patternmatches$NODE_44

SpecializedTransduction_ID():

Specialized transduction occurs when a prophage-like element excises improperly from the host bacterium’s chromosome and accidentally packages a small portion of bacterial DNA directly outside the prophage-like region. SpecializedTransduction_ID() searches contigs classified as Prophage-like for dense read coverage outside the borders of the Prophage-like pattern that could represent specialized transduction. Because specialized transduction tends to be fairly short (several kbp) compared to Sloping transduction (tens to hundreds of kbp), averaging over a 1000 bp distance (i.e. using windowsize=1000) can ‘blur’ specialized transduction patterns depending on their size. This is why specialized transduction is not identified in TrIdent_Classifier(). Instead, we use the locations of prophage-like elements identified with TrIdent_Classifier() to guide our search for specialized transduction in SpecializedTransduction_ID().

SpecializedTransduction_ID() does not resize the windows of the input pileup files, in order to preserve the resolution of potential specialized transduction patterns. Because of this, we cannot use the locations of the Prophage-like pattern matches to determine the exact border locations of prophage-like elements. The locations generated with TrIdent_Classifier() using windowsize=1000 (or one of the other options) will not perfectly translate back to a windowsize of 100. Instead, we use the locations of the Prophage-like pattern matches to ‘zoom in’ on the region of a contig where an associated Prophage-like pattern match is located. SpecializedTransduction_ID() then searches this region, starting from the left and moving inwards, for the first coverage value that is at least 20% of the maximum coverage value in the defined region. This represents the left ‘border’. The search is repeated starting from the right side and moving inwards, and the first coverage value that is at least 20% of the maximum value represents the right ‘border’. For contigs that have a Prophage-like match that trails off the side of the contig, only the border that falls on the contig is searched for.
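The border search can be sketched in a few lines. The helper find_borders() is hypothetical and operates on the coverage values of an already ‘zoomed-in’ region; it illustrates the 20%-of-maximum rule rather than reproducing TrIdent's code.

```r
# Hypothetical helper: find the left and right borders of a prophage-like
# region as the outermost bins with coverage >= `frac` of the maximum.
find_borders <- function(coverage, position, frac = 0.2) {
  cutoff <- frac * max(coverage)
  above <- which(coverage >= cutoff)
  # First qualifying bin from the left and from the right, respectively
  c(left = position[min(above)], right = position[max(above)])
}

# Toy region in 100 bp bins: low coverage flanking a block of coverage 100
position <- seq(100, 3000, by = 100)
coverage <- c(rep(1, 10), rep(100, 10), rep(1, 10))
borders <- find_borders(coverage, position)
borders  # left = 1100, right = 2000
```

Because the search runs on unresized 100 bp bins, the borders it returns can differ from the start_pos and stop_pos values reported by TrIdent_Classifier().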

Once the prophage-like borders are identified, SpecializedTransduction_ID() starts from the borders and searches outwards for dense read coverage that meets the ‘requirements’ for specialized transduction as defined by the arguments in SpecializedTransduction_ID(). SpecializedTransduction_ID() uses two arguments to define specialized transduction:

  • noreadcov
  • spectranslength

SpecializedTransduction_ID() first makes sure that any coverage it detects outside the borders is not disrupted by a defined region of no read coverage (noreadcov). The default value for noreadcov is 500 bp. Secondly, SpecializedTransduction_ID() ensures that any read coverage it detects outside of the prophage/PICI borders meets a minimum length requirement (spectranslength). The default value for spectranslength is 2000 bp. So by default, SpecializedTransduction_ID() will search for coverage immediately outside the left and right prophage/PICI boundaries that is at least 2000 bp long and is not interrupted at any point by more than 500 bp of no read coverage. If these requirements are met, SpecializedTransduction_ID() will mark the contig as having specialized transduction. We suggest using the default values for initial usage of SpecializedTransduction_ID() and only changing the noreadcov and spectranslength arguments when adapting the specialized transduction search for your specific dataset.
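The two requirements can be illustrated with a sketch of the outward search on one side of a prophage-like element. The helper search_outward() is hypothetical; it assumes 100 bp bins ordered from the border outwards and treats a coverage of 0 as no read coverage.

```r
# Hypothetical helper: walk outwards from a border and measure how far read
# coverage extends before a gap of more than `noreadcov` bp is reached.
search_outward <- function(coverage, bin_bp = 100,
                           noreadcov = 500, spectranslength = 2000) {
  gap_bins <- 0      # length of the current run of zero-coverage bins
  last_covered <- 0  # index of the last covered bin before the search stops
  for (i in seq_along(coverage)) {
    if (coverage[i] > 0) {
      gap_bins <- 0
      last_covered <- i
    } else {
      gap_bins <- gap_bins + 1
      if (gap_bins * bin_bp > noreadcov) break  # gap too long, stop here
    }
  }
  length_bp <- last_covered * bin_bp
  list(specialized = length_bp >= spectranslength, length_bp = length_bp)
}

# Coverage to the right of a border: 1200 bp covered, a 300 bp gap,
# 1200 bp covered, then a 1000 bp gap followed by unrelated coverage
right_flank <- c(rep(5, 12), rep(0, 3), rep(4, 12), rep(0, 10), rep(6, 20))
hit <- search_outward(right_flank)
hit$specialized  # TRUE
hit$length_bp    # 2700

# Only 800 bp of coverage next to the border: too short
miss <- search_outward(c(rep(5, 8), rep(0, 30)))
miss$specialized  # FALSE
```

The 300 bp gap is tolerated because it is shorter than noreadcov, while the 1000 bp gap ends the search, so the coverage beyond it is not counted.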

Default:

Search all contigs classified as Prophage-like for specialized transduction:

Specialized_transduction <- SpecializedTransduction_ID(VLP_pileup=VLPFraction_sampledata, transductionclassifications=TrIdent_results, noreadcov=500, spectranslength=2000, cleanup=TRUE)
## 2 contigs have potential specialized transduction

When you search all contigs, the output of SpecializedTransduction_ID() will be a list. The first object contains a summary table for the specialized transduction search results:

SpecializedTransduction_summary_table <- Specialized_transduction$Summary_table

ref_name  Specialized_transduction  Left  Right  Length_left  Length_right
NODE_44   yes                       yes   no     2700         NA
NODE_62   yes                       yes   no     45300        NA
NODE_125  no                        no    no     NA           NA
NODE_368  no                        no    no     NA           NA

The second object in the output list contains another list with the resulting log10 read coverage plots for all contigs classified as Prophage-like. The coverages are put on a log scale to help users visualize specialized transduction patterns, as they are sometimes too low in frequency to be seen with raw coverages alone. Additionally, the plots are ‘zoomed in’ on the Prophage-like pattern to further aid with specialized transduction visualization. The borders of the prophage/PICI as identified by SpecializedTransduction_ID() are marked on each plot with a black vertical line. If SpecializedTransduction_ID() identifies potential specialized transduction, it will color the plot green, whereas if it does not identify specialized transduction, it will color the plot blue. The end of specialized transduction as determined by SpecializedTransduction_ID() will be marked with a red vertical line. Each plot is named by the associated contig accession and can be saved as a ggplot object for further manipulation by the user.

View all the plots:

Specialized_transduction$Plots

View a specific plot:

Specialized_transduction$Plots$NODE_44

If desired, the user can also search a single contig for specialized transduction by specifying the contig’s reference name with the specificcontig parameter:

Specialized_transduction_NODE44 <- SpecializedTransduction_ID(VLP_pileup=VLPFraction_sampledata, transductionclassifications=TrIdent_results, specificcontig="NODE_44", noreadcov=500, spectranslength=2000, cleanup=TRUE)

Create final summary table:

If you’d like to combine the summary tables produced by TrIdent_Classifier() and SpecializedTransduction_ID(), try the following code:

Final_TrIdentSummaryTable <- merge(TrIdent_summary_table, SpecializedTransduction_summary_table, by="ref_name", all.x=TRUE)

ref_name   classifications        NormMatchScore  match_size  start_pos  stop_pos  active_prophage  elevation_ratio  slope    Specialized_transduction  Left  Right  Length_left  Length_right
NODE_1088  HighCoverageNoPattern  0.08964437      NA          NA         NA        NA               NA               NA       NA                        NA    NA     NA           NA
NODE_125   Prophage-like          0.23903635      31000       153000     184000    NO               1.1175           NA       no                        no    no     NA           NA
NODE_238   Sloping                0.14225027      146000      1000       147000    NA               NA               -0.2060  NA                        NA    NA     NA           NA
NODE_25    InsufficientCoverage   NA              NA          NA         NA        NA               NA               NA       NA                        NA    NA     NA           NA
NODE_251   Sloping                0.11734790      144000      1000       145000    NA               NA               0.0524   NA                        NA    NA     NA           NA
NODE_368   Prophage-like          0.16050720      28000       27000      55000     MIXED            0.3986           NA       no                        no    no     NA           NA
NODE_44    Prophage-like          0.19117212      65000       151000     216000    YES              1.3657           NA       yes                       yes   no     2700         NA
NODE_560   HighCoverageNoPattern  0.07094737      NA          NA         NA        NA               NA               NA       NA                        NA    NA     NA           NA
NODE_62    Prophage-like          0.14338316      171000      63000      234000    YES              1.4700           NA       yes                       yes   no     45300        NA