Recent metagenome studies of the gut microbiomes of wood-degrading higher termites and the Australian Tammar wallaby, together with two studies of the cow rumen metagenome, have revealed new insights into the mechanisms of cellulose degradation in uncultured organisms and microbial communities. The microbial communities of different herbivores have been shown to be dominated by lineages affiliated with the Bacteroidetes and Firmicutes, and different Bacteroidetes lineages among them exhibited endoglucanase activity. Notably, exo-acting families and cellulosomal structures have a low representation in, or are entirely absent from, the gut metagenomes sequenced to date. Thus, current knowledge about the genes and pathways involved in plant biomass degradation in different species, particularly uncultured microbial ones, is still incomplete.
We describe a method for the de novo discovery of protein domains and CAZy families associated with microbial plant biomass degradation from genome and metagenome sequences. It takes protein domain and gene family annotations as input and identifies those domains or gene families that, in concert, are most distinctive for lignocellulose degraders. Among the gene families and protein domains identified with our method were known key genes of plant biomass degradation. Additionally, it identified several novel protein domains and gene families as being relevant to the process. These might represent novel leads towards elucidating the mechanisms of plant biomass degradation in currently less well understood microbial species.
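To illustrate the kind of input such a method works on, the following minimal sketch turns per-genome family annotations into a genome-by-family matrix. It assumes a binary presence/absence encoding, which the text above does not specify. The function name `build_presence_matrix`, the genome labels, and the toy annotation sets are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


def build_presence_matrix(
    annotations: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Encode per-genome family annotations as a binary matrix.

    Rows are genomes, columns are protein families, and an entry is 1
    if the family was annotated in that genome and 0 otherwise.
    """
    genomes = sorted(annotations)
    # Union of all families seen in any genome, in a stable order.
    families = sorted(set().union(*annotations.values()))
    matrix = [
        [1 if family in annotations[genome] else 0 for family in families]
        for genome in genomes
    ]
    return genomes, families, matrix


# Hypothetical toy annotations (not taken from any real genome).
toy_annotations = {
    "genome_A": {"GH5", "GH9", "CBM6"},
    "genome_B": {"GH5", "PF00005"},
    "genome_C": {"PF00005"},
}

genomes, families, matrix = build_presence_matrix(toy_annotations)
for genome, row in zip(genomes, matrix):
    print(genome, dict(zip(families, row)))
```

Each row of the resulting matrix can then serve as the feature vector of one genome in a downstream classifier.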
Our method can furthermore be used to identify plant biomass-degrading species from the genomes of cultured or uncultured microbes. Applied to draft genomes assembled from the metagenome of a switchgrass-adherent microbial community in cow rumen, it predicted genomes from several Bacteroidales lineages that encode active glycoside hydrolases, as well as a relative of a known plant biomass degrader, to represent lignocellulose degraders. In technical terms, our method selects the most informative features from an ensemble of L1-regularized, L2-loss linear Support Vector Machine (SVM) classifiers, trained to distinguish genomes of cellulose-degrading species from those of non-degrading species based on protein family content. Protein domain annotations are available in public databases, and new protein sequences can be annotated rapidly with Hidden Markov Models or, somewhat more slowly, with BLAST searches of individual proteins against the NCBI nr database.
Co-occurrence of protein families in the biomass-degrading fraction of samples, together with the absence of these families from the non-degrading fraction, allows the classifier to link these proteins to biomass degradation without requiring sequence homology to known proteins involved in lignocellulose degradation. Classification with SVMs has previously been used successfully for phenotype prediction from genetic variations in genomic data.
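The feature selection idea described above can be sketched with scikit-learn, whose `LinearSVC` supports an L1 penalty combined with the squared hinge (L2) loss. This is an illustrative sketch, not the authors' implementation: building the ensemble by stratified bootstrap resampling, the values of `n_models`, `C` and `min_fraction`, and the synthetic data are all assumptions made here.

```python
import numpy as np
from sklearn.svm import LinearSVC


def select_features(X, y, n_models=20, C=0.5, min_fraction=0.5, seed=0):
    """Return indices of features that receive a positive weight in at
    least `min_fraction` of an ensemble of L1-regularized, L2-loss
    linear SVMs. Labels are 1 (degrader) and 0 (non-degrader)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_models):
        # Assumption: ensemble members differ by a stratified bootstrap
        # resample, so both classes are present in every training set.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == label),
                       size=int(np.sum(y == label)), replace=True)
            for label in (0, 1)
        ])
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                        C=C, max_iter=5000, random_state=0)
        clf.fit(X[idx], y[idx])
        # A positive weight links the family to the degrader class (label 1).
        counts += clf.coef_.ravel() > 0
    return np.flatnonzero(counts / n_models >= min_fraction)


# Synthetic toy data: 40 genomes, 6 protein families (presence/absence).
rng = np.random.default_rng(1)
y = np.array([1] * 20 + [0] * 20)
X = rng.integers(0, 2, size=(40, 6)).astype(float)
X[:, 0] = y  # hypothetical family present only in the degraders

selected = select_features(X, y)
print("selected feature indices:", selected)
```

The L1 penalty drives the weights of uninformative families to zero, so counting how often a family keeps a positive weight across the ensemble gives a simple stability criterion for selecting it.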