16S Microbial Analysis with QIIME

Overview

:grey_question: Questions

:dart: Objectives

:heavy_check_mark: Requirements

:hourglass: Time estimation: 1h-1h30

Introduction

In this tutorial you will learn how to perform a 16S Microbial Analysis using the QIIME toolsuite in Galaxy. It was based on already existing tutorials about QIIME, more precisely this one, but it was adapted to the Galaxy interface.

We will work with Illumina sequencing data that is from the Adaptive immunity increases the pace and predictability of evolutionary change study in commensal gut bacteria but that has been shortened to make this tutorial quicker to execute. We'll use the Greengenes reference OTUs, which is the default reference database used by QIIME. Data for this tutorial may be found in Zenodo.

In the study mentioned before, it was performed high-throughput 16S rRNA gene sequencing of DNA extracted from faecal samples from two timepoints of the evolution experiment (day 0 and day 3) of wild-type mice. Also the mice from day 3 of experiment were treated with Streptomycin and were subsequently colonized with Escherichia coli.

Agenda

In this tutorial, we will deal with:

  1. Uploading the data
  2. Validating the mapping file
  3. Quality treatment and demultiplexing
  4. Picking OTU's
  5. Diversity Analysis
    1. Alpha Diversity
    2. Beta Diversity
    3. Taxa Summaries
    4. Group Significance
  6. Create your workflow

Uploading the data

This tutorial covers the case where you are starting with two specific input files: your sample metadata mapping file which contains information about the samples, especially the data needed to perform the microbial analysis, and 10 fastq files containing your amplicon sequence reads.

:nut_and_bolt: Comment

Sequencing centers nowadays will likely perform demultiplexing of sequences for you, so in this tutorial you will work with already demultiplexed data. However, and as we are going to show you later, if you need to demultiplex your data you will need an additional file, which is a fastq file containing the barcode reads for each amplicon sequence in your reads file.

:pencil2: Hands-on: Data Upload

  1. Create a new history
  2. Upload to Galaxy the files provided (you can download them from Zenodo). You should have 5 files that start with 16S_WT_unt (untreated mice) and another five files starting with 16S_WT_day3.

    • Click on the upload button in the upper left of the interface.
      Uploading files
    • Press Choose local file and search for your file.
    • Press Start and wait for the upload to finish. Galaxy will automatically unpack the file.
  3. Upload to Galaxy the mapping file called metadata.tab
  4. Upload to Galaxy the file called 97_otus.fasta that corresponds to the database reference we are going to use in this tutorial, which is the Greengenes database.

    :nut_and_bolt: Comment

    In this tutorial we are only working with the forward reads of the sequencing data, but if you have to work with both forward and reverse reads there's the tool 'Join paired-end Illumina reads' that can pair up your reads so you can use the resulting files in the rest of the analysis.

Validating the mapping file

The QIIME mapping file contains all of the per-sample metadata, including technical information such as primers and barcodes that were used for each sample, and information about the samples, such as the treatment of the mice and a timepoint. In this data set we're looking at mice microbiome samples from the gut of 5 untreated wild-type mice and 5 Streptomycin treated mice, taken at two different time points.

You should ensure that your mapping file is formatted correctly. There's a tool in Galaxy called 'Validate mapping file' that will print 3 types of files to help you understand if and what errors do you have in your mapping file. It will output an HTML file showing the location of errors and warnings and a text log file that will tell you explicitly which errors/warnings you have.

Errors appear in red and they can cause problems further down the analysis, so it's strongly recommended that you fix them. Warnings are usually typos in your mapping file, often resultant from the use of invalid characters. These may impact your analysis, for example if you have typos in your SampleIDs or barcodes, therefore you should also correct these ones. The tool creates a file ending with corrected map will, which will be a copy of your mapping file with invalid characters flagged by warnings replaced by a valid character, which is underscore by default.

:nut_and_bolt: Comment

In this tutorial we will not validate the mapping file as we are going to provide you the correct one. However we recommend that you do this when you're performing your analysis. To have an idea what errors/warnings are flagged by the tool you can search 'Validate the mapping file' in the search bar on the left in Galaxy and check the list of errors and warnings at the end of the tool page.

Quality treatment and demultiplexing

As mentioned before, sequencing centers often perform demultiplexing of sequences for you, which is the case for this tutorial. Our sequences are already separated but they still need to be assigned to the respective SampleID and put together in the same fasta file (right now you have 10 files!). So in this step we will take care of that, and also perform some quality filtering, removing ambiguous and poor quality reads. 

:pencil2: Hands-on:

  1. Split fastq libraries:wrench:
    • As input fastq files select all the 10 fastq files uploaded initially.
    • Metadata mapping files?:  Select the metadata.tab file
    • Barcode read files?Don't select anything
    • Comma-separated list of samples ids to be applied to all sequences: This step is very important, because we don't have a barcode file, so it's here we are going to assign the sequences to the respective ID. So here you should write all the sample ID's separated by commas in the order they appear in your history, otherwise you'll swap the samples: WT.unt.7,WT.unt.4,WT.unt.3,WT.unt.2,WT.unt.1,WT.day3.15,WT.day3.14,WT.day3.13,WT.day3.11,WT.day3.9
    • Maximum unacceptable Phred quality score: 19
    • Type of barcode: Data not barcoded
    • Ascii offset to use when decoding phred scores: 33
    • The rest of the parameters you can leave on default
  2. Count sequences:wrench::Run this tool on the "Split fastq libraries: sequences" file.

    :question: Question

    1.  Which type of files does the split libraries tool outputs to your history?

    2.  How many sequences do we have?

    Click to view answer

    1.  It creates a sequences file which is a fasta formatted file where each sequence is renamed according to the sample it came from; an histograms file which is a tabular file that shows the number of reads at regular size intervals before and after splitting; it also outputs a log file which contains the summary of demultiplexing and quality filtering, including the number of reads detected for each sample and a brief summary of any reads that were removed due to quality considerations. A part of the log file is represented here:

    Input file paths
    Sequence read filepath: /home/galaxy/galaxy/galaxy-dist/database/files/000/dataset_577.dat (md5: b4cac16234f31d76da52f208f311ec5c) Quality filter results Total number of input sequences: 10000 Barcode not in mapping file: 0 Read too short after quality truncation: 951 Count of N characters exceeds limit: 0 Illumina quality digit = 0: 0 Barcode errors exceed max: 0 Result summary (after quality filtering) Median sequence length: 294.00 WT.unt.1 9049 Total number seqs written 9049 ---
    2.  Checking the file from count sequences we see that we have a total of 90,849 sequences.

    :nut_and_bolt: Comment

    To cover the situation where your data is not already demultiplexed, you can run the same tool mentioned in 1 but in that case you need to put the barcodes file as input so it can correctly separate your sequences. The type of the barcode is usually golay_12 and you no longer need to list the sample ID's if you have a barcodes file. The remaining of the settings should be the same.

Picking Operational Taxonomic Units (OTU)

Now that we have demultiplexed sequences, we're ready to cluster these sequences into OTUs. There are three high-level ways to do this in QIIME. We can use de novo, closed-reference, or open-reference OTU picking. Open-reference OTU picking is currently the preferred method among QIIME users, and it's the one we're going to use. Discussion of these methods can be found in Rideout et. al (2014).

In an open-reference OTU picking process, reads are clustered against a reference sequence collection and any reads which do not hit the reference sequence collection are subsequently clustered de novo.  This process is achieved with Perform open-reference OTU picking tool in QIIME, and includes taxonomy assignment, sequence alignment, and tree-building steps.

:pencil2: Hands-on: Perform open-reference OTU picking

  1. Perform open-reference OTU picking:wrench:
    • As input sequence files select the fasta file from the split fastq libraries
    • Reference sequences to querySelect databases from your history and then select the 97_otus.fasta file
    • The rest of the parameters you can leave on default or change accordingly to the characteristics and needs of your data.

    :nut_and_bolt: Comment

    The primary output that we get from this command is the OTU table, or the number of times each operational taxonomic unit (OTU) is observed in each sample. Several OTU tables are generated by this command. The one we typically want to work with is OTU table with taxonomic assignment and without sequences failing the alignment to the OTU representative sequences. This has OTUs with a total count of 1 removed, as well as OTUs whose representative sequence couldn't be aligned with PyNAST. It also contains taxonomic assignments for each OTU as observation metadata.

    The open-reference OTU picking command also produces a phylogenetic tree where the tips are the OTUs. The file containing the tree is Representative set tree, and is the file that should be used with the previous one in downstream diversity analysis.

  2. Summarize sample or observation data:wrench:
    • As input BIOM table select the 'Perform open-reference OTU picking: OTU table with taxonomic assignment and without sequences failing the alignment to the OTU representative sequences'
    • Present counts as number of unique observation ids per sample:  No
    • Summarize over observationsNo

      • The key piece of information you need to pull from this output is the depth of sequencing that should be used in diversity analyses. Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the Counts/sample detail and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away.

        16S_summarize_table.png
      • Click on the download button displayed in your dataset in the history
      • Open With a text editor such as TextEdit, WordPad or gedit

      :question: Question

      What do you think should be the sample depth to use in the diversity analysis?

      Click to view answerThe sample depth should be the lowest number of counts from the smallest sample, which in this case is 7786.

Diversity Analysis

Diversity indices provide valuable mathematical tools to describe the ecological complexity of a single sample (alpha diversity) or to detect species differences between samples (beta diversity). In QIIME there's a tool that performs both of these analysis and also outputs some other results.

:pencil2: Hands-on: Perform alpha and beta diversity analysis

  1. Compute core set of diversity analysis:wrench:
    • As OTU table select the 'Perform open-reference OTU picking: OTU table with taxonomic assignment and without sequences failing the alignment to the OTU representative sequences'
    • As mapping file select metadata.tab
    • Sequencing depth:  7786
    • Tree file: Select 'Perform open-reference OTU picking: Representative set tree'
    • Metadata category: Write 'AntibioticUsage'
      NOTE: Here you can choose any category from your metadata file that you want to compare. In this tutorial we are interested in seeing the changes of the microbiome population between treated and untreated WT mice.

    :nut_and_bolt: Comment

    This tool creates and index.html file that will have all the resulting files from the analysis.

    Click on the eye next to the Core diversity report and you'll see a list of files, some of them which we're going to examine.

Alpha diversity

:pencil2: Hands-on: Analyse alpha diversity results

  1. Click on the rarefaction_plots.html file
    1. Select the metric observed_otus and the category as SampleID.

      :question: Question

      1.  Can you identify any patterns or groups? Change the category to AntibioticUsage

      2.  Change back to SampleID. What can you say about the fraction of OTUs sequenced? Do they cover our sample diversity? 

      Click to view answer

      1.  In the SampleID view you can clearly see two separated groups, one with a lot more observed_otus than the other. If you change to AntibioticUsage we can see that those groups correspond to the presence/absence of streptomycin and, as expected, there's a lot more observed OTUs in the untreated samples than in the ones that were treated.
      2. We can see that the curves corresponding to the group with less OTUs have started to level off, which indicates that we covered a great part of our sample diversity. The other curves are not quite there yet, but we can see a slight reduction of slope, which can signal that we are closer to a plateau.

      :nut_and_bolt: Comment

      You can select different metrics and confirm that the results are very similar. You can know more about the chao1 metric here and about the PD metric here.

  2. Now click on the AntibioticUsage_boxplots.pdf file with the observed_otus metric. Verify that there's a very big difference in the OTUs observed in the two groups.
  3. Finally open the AntibioticUsage_stats.txt file with the observed_otus metric. In this file you have the information about the mean and the standard deviation of OTUs for each group, as well as a value for a non parametric test, and a p-value determined by Monte Carlo permutations.

Beta diversity

Beta diversity metrics assesses the differences between microbial communities. The fundamental output of these comparisons is a square, hollow matrix where a “distance” or dissimilarity is calculated between every pair of community samples, reflecting the dissimilarity between those samples. The data in this distance matrix can be visualized with analyses such as Principal Coordinates Analysis (PCoA).

:pencil2: Hands-on: Analyse beta diversity results

  1. Click on the index.html file that refers to PCoA plot (unweighted_unifrac). It should open Emperor.
    • Look at the top right of the page and you'll see some options to change your PCoA plot. Click on Colors→SampleID.

      :question: Question

      1.  Can you identify any patterns or groups?

      2.  Now go to Colors→AntibioticUsage. What can you say about the homogeneity/heterogeneity of the groups? 

      Click to view answer

      1.  In the SampleID view you can clearly see one group to the right, which is an aglomerate of points, very close together. To the left we see some other points but seem to be distant from each other, but all of those are also very distant from the aglomerate on the right.
      2. Now checking the antibiotic view we can see that the group on the right corresponds to the untreated mice. We can therefore take some conclusions:
              ♦The distance between points of the non-treated group is a lot smaller than the distance between points of the treated group.
                 ♦The distance between points of the non-treated group is smaller than the distance between the non-treated and treated group.
                 ♦The distance between points of the treated group is also smaller than the distance between the non-treated and treated group.
      As a result of these observations we can verify that the group corresponding to the untreated mice is very homogeneous, as opposed to the group of the treated mice which is a lot more heterogeneous.

      :nut_and_bolt: Comment: The logic behind a PCoA plot

      In an ordination graph, such as a PCoA graph, samples are plotted so that distances between them in the graph reflect the ecological differences. The aim of ordination methods is to represent the data along a reduced number of orthogonal axes, constructed in such a way that they represent, in decreasing order, the main trends of the data. Borcard, Gillet, and Legendre 2011

      Most ordination methods are based on the extraction of the eigen-vectors of an association matrix. The basic principle of ordination is the following: Imagine an n × p data set containing n samples and p OTUS.The n samples can be represented as a cluster of points in the p-dimensional space. Now, this cluster is elongated in some directions and flattened in others. These directions are not necessarily aligned with a single dimension (= a single OTU). The direction where the cluster is most elongated corresponds to the direction of largest variance of the cluster. This is the first axis that an ordination will extract. The next axis to be extracted is the second most important in variance, provided that it is orthogonal (linearly independent, uncorrelated) to the first one. The process continues until all axes have been computed. Kindt and Coe 2005.

  2. Now click on the index.html file that refers to PCoA plot (weighted_unifrac). The difference between weighted and unweighted unifrac is that the weighted takes into account the OTU abundance when it calculates the distance between communities. So in this plot we can still distinguish two groups, but now the points from the non-treated group are dispersed.

Taxa Summaries

One of the results obtained by the diversity analysis is, obviously, the different phylums, classes, families and OTUs obtained in each sample. For this analysis it's important to remember that the treated mice were also colonized with E.coli.

:pencil2: Hands-on: Analyse Taxa results

  1. Click on the bar_charts.html file that refers to Taxonomy summary results.
    • Look at the first figure that appears. This represents the phylums observed in each sample, with the coloured bars representing the abundance of the phylum in each sample, with the respective legend below.

      :question: Question

      Can you identify any patterns related with treatment?

      Click to view answer

      We are going to adress three of the most visible phylums/colours. The Bacteroidetes phylum represented in green seems to be very present in all of the samples (treated and untreated), so it's somewhat homogeneous between samples. The Firmicutes phylum is significally present in the untreated samples, but is almost inexistent in the treated ones, which can indicate that it was killed by the steptomycin or that it couldn't compete with the E.coli. Lastly, looking at the rose phylum, which is Proteobacteria, we can see that it's not very present in the untreated samples, but it is visible in the treated samples. It's important to remember that this is the phylum of E.coli, so the colonization is likely the main reason for that fact.

    • Now look at the figure in family level (4th figure).

      :question: Question

      Can you identify any patterns related with treatment? What can you say about the homogeneity of the samples?

      Click to view answer

      We can verify that the phylum that seemed independent of the treatment, which was the Bacteroidetes phylum, has different families in treated and non-treated samples, which distinguishes completly the samples with and without antibiotic.  We can still identify a lot of families that belong to the Firmicutes phylum in the untreated samples, and a lot of Proteobacteria in the treated samples.
      Looking at the purple family, which is Alcaligenaceae we notice that it is around 10% for mice 9 and 11, but it's about 1% for the rest of the treated mice. The rest of the colours behave differently in the treated samples, so we can say that this samples are not very homogeneous. The untreated samples however seem more homegenous.

  2. Now click on the bar_charts.html file that refers to Taxonomy summary results (by AntibioticUsage). Here you're only comparing between groups, which gives you a more clear idea of what OTUs are present in which group. However you lose sight of the heterogeneity of the samples.

Group Significance

The function responsible for this output compares OTU frequencies in sample groups and ascertains whether or not there are statistically significant differences between the OTU abundance in the different sample groups.
At a basic level, it constructs a OTUxSample (rowXcolumn) contingency table, and tests whether or not each OTU is differentially represented in certain groups of columns.[1]

:pencil2: Hands-on: Analyse group significance

  1. Click on the group_significance_AntibioticUsage.txt file.
    • Save this file to your computer as a .txt file and open it with a spreadsheet application such as Excel.

      :question: Question

      Look at the headers of the file. This tool outputs three p-values, but you should look at the FDR_P value, which is the Benjami-Hochberg corrected p-value. At the level of 0.05 how many significant OTUs can you see? Take some conclusions about the abundance of these OTUs in each sample type.

      Click to view answer

      At the 0.05 level we identify 170 significantly different OTUs. Note that for everyone of these, each OTU is only present in one type of sample. For example, we can see that the OTUs correspondent to the Firmicutes family are only present in the untreated samples, which is in agreement with the conclusions we took observing the taxa bar charts.

Creation of your workflow

As you learnt in Galaxy 101 tutorial you can create a workflow that runs all the steps at once, saving you a lot of time. We're are also going to create one for this analysis.

:pencil2:Hands-on: Extract workflow

  1. Clean up your history. If you had any failed jobs (red), please remove those datasets from your history by clicking on the x button. This will make the creation of a workflow easier.

  2. Go to the history Options menu (gear symbol) and select the Extract Workflow option.

    The center pane will change as shown below and you will be able to choose which steps to include/exclude and how to name the newly created workflow.

  3. Uncheck any steps that shouldn’t be included in the workflow (if any), and rename the workflow to something descriptive, for example 16S Microbial Analysis. We personally chose to uncheck Count sequences.

  4. Click on the Create Workflow button near the top.

    You will get a message that the workflow was created. But where did it go?

  5. Click on Workflow in the top menu of Galaxy. Here you have a list of all your workflows. Your newly created workflow should be listed at the top:

The workflow editor

We can examine the workflow in Galaxy’s workflow editor. Here you can view/change the parameter settings of each step, add and remove tools, and connect an output from one tool to the input of another, all in an easy and graphical manner. You can also use this editor to build workflows from scratch.

:pencil2:Hands-on: Extract workflow

  1. Click on the triangle to the right of your workflow name.

  2. Select Edit to launch the workflow editor. You should see something like this:

    When you click on a component, you will get a view of all the parameter settings for that tool on the right-hand side of your screen.

  3. Click the asterisk next to log.txt in the Split libraries, output_fp (text) in the Summarize sample and also the html report in the Core diversity analysis tools. Now, when we run the workflow, we will only see these outputs. Once you have done this, you will notice that the minimap at the bottom-right corner of your screen will have a colour-coded view of your workflow, with orange boxes representing a tool with an output that will be shown.

    If you didn't specify a name for the input files at the beginning they will be labeled Input Dataset. In this case you can rename them now to avoid confusion when using the workflow later on.

  4. Save your workflow (important!) by clicking on the gear icon at the top right of the screen, and selecting Save.

  5. Return to the analysis view by clicking on Analyze Data at the top menu bar.

Comments

We could validate our newly built workflow by running it on the same input datasets than the ones in this history  in order to make sure we do obtain the same results.


:clap: Congratulations on successfully completing this tutorial!