(PART ONE)
BREAK
(PART TWO)
We’ll start by reviewing the basics on genes and genomes: what they are, how we represent and talk about them, and how we go from a DNA sequence- a string of letters- to making predictions about gene function. We’ll look at this in the context of WormBase ParaSite and other online database resources.
WormBase ParaSite gathers together nematode and flatworm genomes that have been assembled and annotated by the research community, adds additional analyses and makes these data available to the scientific community. We’ll look at the kind of data you can retrieve, initially by exploring the website. We’ll also look at BioMart, a data-mining tool that allows you to retrieve WormBase ParaSite data in bulk.
Throughout this course, we’ll assume that you’re familiar with genes and genomes.
A gene is a unit of the genome, a DNA sequence, that is transcribed into an RNA molecule, or a transcript. A gene’s transcript may go on to be translated into a protein (in that case it is an mRNA), or it may have a role as a non-coding RNA, such as a ribosomal RNA (rRNA), transfer RNAs (tRNA), microRNA (miRNA), or long non-coding RNA (lncRNA).
In eukaryotes, most protein-coding genes comprise alternating exons and introns (some genes have a single exon), flanked by untranslated regions (UTRs). The exons constitute the parts of the gene that are translated into a polypeptide. Introns are transcribed but soon after excised and the final mature mRNA is formed by a 5’UTR, joined exons and a 3’UTR. A CAP and poly-A tail are added to the 5’ and 3’ ends respectively. These structures are essential to guarantee the molecular stability and downstream processing of the mRNAs.
This figure represents the steps that are needed to transform information encoded in the DNA into a polypeptide and eventually a functional protein. The starting information is encoded in the genome. A gene encodes, among other things, the transcription start and transcription end. Transcription produces an RNA copy of the gene, known as pre-mRNA, which comprises exons and introns. Mature mRNA is produced by splicing together exons, removing introns, adding a CAP at the 5’ end, and polyadenylating the 3’end, to add a poly(A) tail. The mature mRNA is the template for the translation into a polypeptide by the ribosome.
In the cell, genomes are organised into chromosomes. In practice, current DNA sequencing methods are unable to read the DNA sequence of a whole chromosome without errors. We therefore use the technique of sequencing shorter segments of chromosomes, and do it in such a way that the segments overlap and can be pieced together like a jigsaw puzzle. This process is referred to as genome assembly. For now, we will focus on what genome assemblies look like, and how they are represented in genome databases.
The diagram below shows the structure of a typical assembly. It has 3 layers: the contigs are stretches of contiguous DNA sequence without gaps. The scaffolds are ordered sets of contigs separated by gaps of estimated length. In order to make scaffolds from contigs, techniques such as optical mapping and Hi-C are used. Finally, the chromosome is an ordered set of scaffolds separated by gaps on unknown length. To make the chromosome sequence from the scaffold, techniques such linkage mapping and FISH are used.
Sometimes, there is insufficient (or no) data to reliably place a scaffold into a specific position on a chromosome. In the figure above, this is true of the scaffold on the right. The assembly above therefore comprises 2 top-level sequences: 1 chromosome, and one unplaced scaffold.
As technology has evolved, there has been an explosion in the number of genomes that have been sequenced. Sequence databases provide a place where these sequences can be deposited, stored and made available to the world. There are three widely-used nucleotide repositories (or primary databases) for the submission of nucleotide and genome sequences:
Together they form the International Nucleotide Sequence Database Collaboration and luckily for users, they all “mirror” each other. This means that irrespective of where a sequence is submitted, the entry will appear in all three databases. Once data are deposited in primary databases, they can be accessed freely by anyone around the world.
WormBase ParaSite takes sequencing data from INSDC (a genome assembly and a set of gene predictions) and adds additional value by performing additional analyses and making the data available in a user-friendly interactive way. In this part of the workshop, we will explore the basic functionality of the website for looking at helminth genomes and genes.
In this section, we’ll explore how genome assemblies are presented in WormBase ParaSite, and look at some commonly used metrics of assembly quality.
In an ideal world, each genome assembly would fully reconstruct contiguous chromosomes. Many of the genomes in WormBase ParaSite are much more fragmented than this; a chromosome might actually be represented by hundreds or even thousands of smaller scaffolds or contigs. Having a more fragmented genome makes identifying genes much more challenging, as gene models are more likely to be split across scaffolds.
This will take you to a list of all of the genomes currently available in WormBase ParaSite, divided phylogenetically into the phyla Nematoda and Platyhelminthes.
Table Features:
2. Scroll down the page to find Brugia malayi and click the species name link- this will take you to the B. malayi genome landing page.
The genome page has useful summary information about the species and the assembly. You can see a summary of the methods used to produce the assembly and the annotation, and links to the publication describing it in more detail (where this is available). 3. Look now at the ‘Assembly statistics’ box.
The information in this box tells us about two metrics related to the quality of the assembly: contiguity and completeness.
1. Find the two other genome assemblies from different Brugia species in WormBase ParaSite, which are of lower quality than Brugia malayi. 2. According to their scaffold statistics and BUSCO scores, which of these two assemblies is more contiguous and complete?
For each genome in WormBase ParaSite, there are gene and transcript pages available for browsing. The aim of this section is to familiarise you with retrieving information about a gene of interest from WormBase ParaSite.
We will use a walk through example to illustrate how to use the website to find out about the function of an Opisthorcis viverrini gene.
The page should look something like this:
2. Paste “T265_10539” into the search bar in the top right of the page and click return. T265_10539 is a stable (i.e. permanent) identifier for a gene. These identifiers are usually allocated by the scientist or group that sequenced and annotated the genome.
You should get one result, matching a gene in Opisthorchis viverrini, the Southeast Asian liver fluke. Let’s look at the page for the T265_10539 gene:
3. Click T265_10539
Every gene in WormBase ParaSite has a gene page, which presents sequence data and a summary of information from various sources about the gene.
The gene page has three main sections. In the summary section, together with the description of our gene we can also see the genomic location of the gene (“opera_v5_385”, in this case) and the INSDC Sequence ID. This is an identifier that links to an entry for the scaffold in the ENA.
Underneath, we can see some information about the gene: it has one transcript isoform and a number of orthologues and paralogues. We’ll revisit this later. We can also see that the gene is protein-coding.
On the left, we have a navigation menu, which can be used to explore the information that is available for each gene. We’ll be going through each of these menu options in turn.
The “Genomic context” image underneath shows us a snapshot of the scaffold that our gene is on.
4. Click the ‘Region in Detail’ link in the “Genomic context” section.
Here, each of the three boxes gives us an increasingly zoomed-in view of the gene’s genomic position. The top box shows the whole scaffold, and the middle box below it shows a zoomed-in part of the scaffold. In this case, the scaffold (“opera_v5_385”) is short so the middle box is showing the whole scaffold. Looking at the middle box, it shows us that out gene of interest is located approximately a quarter of the way along the scaffold. The bottom box shows us the structure of the gene model.
We can see that:
You can learn more about the Genome Browser here.
5. Navigate back to the gene page by clicking the “Gene:inx” tab at the top of the page.
As well as gene pages, WormBase ParaSite has a page for each transcript that a gene produces. In this case, only one transcript isoform has been annotated.
6. On the gene page, click the “Show transcript table” button to show the trancript table. Then click the transcript ID in the transcipt table to navigate to the transcript page.
Again using the navigation menu on the left hand side of the page, we can retrieve three main types of information on the transcript: sequences, information about the protein domains, and external references.
7. Click “Exons”, “cDNA” and “Protein” in the “Sequence” section of the navigation menu to see the different types of sequence that are available for the transcript.
Many users use sequences retrieved from these pages to design primers.
Note that this protein sequence is what is known as a conceptual translation: the amino acids have not been sequenced directly, but we can infer the sequence given the predicted structure of the gene (the coordinates of the introns and exons), the underlying DNA sequence and a given codon usage table.
So far we have gathered general information about this Opisthorcis viverrini gene. We have also inspected their genomic location and sequence. However, we don’t have any clues about the genes’ function! What does the protein encoded from this gene do?
A fast way to find out about the function of a gene’s product is to see which Gene Ontology (GO) terms have been associated with it. GO is a project that seeks to describe complex biology in a logical and consistent way that is humanreadable and and computer-processable. GO is a controlled vocabulary, whereby gene products are associated with GO terms that describe their function. There are three aspects to GO: Cellular Component, Molecular Function and Biological Process. Cellular Component GO terms describe where a protein is localised (in the membrane, extracellular, in the nucleus etc). Molecular Function GO terms describe the biochemical activity of the protein. Biological Process GO terms describe the pathways and broader processes that the protein contributes to.
WormBase ParaSite imports GO annotations from three sources:
The GO terms associated with this gene make sense given what we already know about the Innexin protein family: this protein likely forms part of the gap junction, which is a channel connecting the cytoplasm of two cells.
For the vast majority of predicted protein sequences, experiments to determine function have not yet been performed. However, we can use homology to take proteins that are well-studied in one experimental system and infer that proteins of similar sequence in other organisms are likely to have similar structure, and therefore similar function.
Protein sequences are commonly analysed by looking at their domains - subsequences of a protein that have a defined tertiary structure or sequence motif and confer a defined function. A protein can consist of several domains. When comparing proteins between organisms, often the region encoding a protein domain is highly conserved whilst the bit that connects different domains together is more divergent.
The InterPro consortium: There are many protein domain databases. A well known example of a protein domain database is Pfam. Pfam uses multiple sequence alignments of the known proteins with a certain domain to capture a representative model (a profile Hidden Markov Model) of that domain. Other protein domain databases, that might use slightly different methods to define domains, are: CATH, CDD, HAMAP, MobiDB Lite, Panther, PIRSF, PRINTS, Prosite, SFLD, SMART, SUPERFAMILY and TIGRfams. Luckily for us, all of these databases are united under the InterPro consortium.
InterPro provides a tool, InterProScan, that we can use to search protein sequences against all of the member databases to identify any protein domains that the protein might have: InterProScan is an extremely useful tool for predicting gene and protein function.
At WormBase ParaSite, we pre-run InterProScan to annotate protein domains for all of the genes in our database so you don’t have to do it yourself every time!
On this page we see a pictorial representation of the protein domains that have been annotated to this polypeptide. We can see here that this protein has a match with an Innexin domain in several protein domain databases, and four transmembrane helices.
2. The same data is available in tabular format. To view this format, click the “Domains & features” menu option.
A protein’s function is determined by its 3D structure. Knowing its structure can enable detailed predictions to be made about its function, such as mechanistic insights into important biological interactions with other proteins, or even drugs.
Very few protein structures have been directly solved but the ability to predict structures has dramatically improved recently.
AlphaFold, is a state-of-the-art AI system developed by DeepMind that is able to computationally predict protein structures with unprecedented accuracy and speed. Working in partnership with EMBL’s European Bioinformatics Institute (EMBL-EBI), AlphaFold released over 200 million protein structure predictions that are freely and openly available to the global scientific community. Included are nearly all catalogued proteins known to science – with the potential to increase humanity’s understanding of biology by orders of magnitude.
All AlphaFold-predicted models available for proteins encoded by WormBase ParaSite species have been imported.
You can now view the interactive 3D AlphaFold structure of the protein. The interactive molecular viewer visualizes the structure, coloured by the per-residue pLDDT confidence measure.
Drag and drop with your mouse over the protein model to rotate it and use your mouse wheel to zoom in/out. You can use the right panel to visualise exons as well protein domains and features on the 3D model. This might give you a better understanding of where your domains of interest are.
External references are the identifiers by which the gene (or transcript or protein, in this case) is known in other databases outside WormBase ParaSite.
These usually include RefSeq (the reference sequence database of the NCBI) and UniProt, and sometimes (though not in this case), WormBase ParaSite’s sister database, WormBase.
Another approach to understanding what a gene does is comparing its sequence to other genes, both within the same genome, and across different genomes.
WormBase ParaSite groups all helminth genes, together with comparator genes from a number of model organisms, into families, based on the similarity of their protein sequences.
For each family, we arrange the genes into an evolutionary tree.
The gene tree shows the inferred evolutionary history for the family that this gene is a member of.
Note that the most closely related gene in the tree is from another Opisthorchis species, O. felineus, these two genes are orthologous to each other.
It can be useful to look at alignments of these related proteins to see how well conserved they are. Highly conserved regions are more likely to be essential for the function of the protein. To do this:
2. Click on the section of the tree labelled “Blood flukes” and click “expand this subtree”.
Next to the main tree, in green, we can see a pictorial summary of the multiple alignment of the proteins of these four genes, with green coloured regions representing alignments and non-coloured regions representing gaps. You may be interested in exploring these alignments at a higher resolution.
3. Click the node that separates the Opisthorchis spp. from the blood flukes and then click “View in Wasabi” in the pop-up box.
A multiple alignment of the 25 proteins will appear in a new window: we can see that parts of these protein sequences are extremely well conserved.
Orthologues and paralogues are also available in tabular format, where they can be easily exported.
4. Select “Orthologues” in the navigation menu.
In the main table on this page, each row represents an orthologue of inx. The table gives details on the nature of the relationship between our O. viverrini gene and the gene in the other species, such as whether the gene has one or multiple orthologues in the other species (1-to-1 or 1-to-many), and how similar the two proteins are. Multiple alignments can be viewed by following the links.
WormBase ParaSite uses a computational pipeline developed by the Ensembl project to group related genes into families and define the evolutionary relationships between them. Below is a summary of the steps of the pipeline:
You can read more about this pipeline here.
The aim of this exercise is to familiarise yourself with the WormBase ParaSite gene page. Go to the gene page for the Trichuris muris gene TMUE_3000014465 and retrieve the following information:
So far we have seen how you can manually browse WormBase ParaSite by searching for genes and then navigating to their gene/transcript/protein pages. However, in many cases you might have to automatically extract information from WormBase ParaSite for multiple entries. Or simply you might need to extract information about your favourite genome’s features that fullfil some criteria.
BioMart is an extremely powerful tool that allows you to query WormBase ParaSite data in bulk, with no programming knowledge. Consider the information that we gathered on our O. viverrini gene of interest, by clicking around the gene page. Now imagine that rather than having one gene of interest, we actually have a list of 100 genes. That would be a lot of clicking around on gene pages! BioMart allows you to output all of this data for multiple genes in a few clicks.
Retrieving data for a list of known genes isn’t the only thing that BioMart can do. In this section, we’ll go through a series of examples and exercises that aim to illustrate the power of this tool.
There are two main steps involved in building a BioMart query.
Some of the filters allow you to enter data to filter on, e.g. a list of gene names.
The table below lists some examples of filters and attributes for BioMart queries:
Examples of Filters | Examples of Attributes |
---|---|
A genome | Gene, transcript or protein IDs |
A genomic region | Sequences |
A list of gene IDs | Identifiers from external databases (eg, Uniprot IDs) |
All genes that have GO term x, or protein domain Y | Protein domains or GO terms associated with a gene ID |
All genes that have GO term x, or protein domain Y | IDs of orthologous genes, % identity |
Query Filters and Output attributes can be combined to produce more complex queries and customised output.
Let’s try to do this: Let’s say we want to retrieve the IDs and predicted protein domains of all of the genes from Schistosoma mansoni chromosome 1 that have a predicted AlphaFold 3D protein structure. We’ll walk through this example to get started.
We have to set three Query Filters: the genome (the S. mansoni genome), genomic location (chromosome 1), and a protein domain (genes whose protein have a predicted 3D AlphaFold model).
2. Select “Species”, tick the “genome” checkbox and scroll down to select “Schistosoma mansoni (PRJEA36577)”.
3. Select “Region”, tick the “Chromosome/scaffold” check box and type “Sm_V10_1” into the text field (you must know the exact name of the chromosome).
4. Select “Protein domains”, tick the “Limit to genes…” checkbox and select “with AlphaFold protein structures”
Note that as we have built up the query, the filters have appeared on the left hand side of the page.
5. Click “count” to count the number of genes in the database that fulfil these filter criteria.
Next we will select the output attributes. “Genome project” and “Gene stable ID” are already pre-selected as attributes:
6. Select “Output attributes”
BioMart lets us generate two types of output: data tables, and sequence (FASTA) files. In this example we’ll be generating a data table. We want to retrieve the gene IDs and associated protein domains of the 215 genes that fulfil our filter criteria.
7. Select “Interpro protein domains” and check the tick boxes for “InterPro ID”, “InterPro short description”, “Start position” and “End position”.
Use the following S. mansoni gene stable IDs to answer questions 1-4:
Smp_000090
Smp_000120
Smp_000180
Smp_000210
Smp_000220
Smp_000250
Smp_000330
Smp_000380
Smp_000400
Smp_000520
Smp_000030
Smp_000040
Smp_000050
Smp_000070
Smp_000080
Smp_000130
Smp_000140
Smp_000150
Smp_000160
Smp_000170
Smp_000320
Smp_001085
Smp_002080
Smp_002180
Smp_002550
Smp_000020
Smp_000075
Smp_000100
Smp_000110
Smp_000370
1. How many of these genes have orthologues in S. haematobium?
2. Generate a table listing the genes. The table should also has the gene stable ID for the homologue in both species, the homology type (1-1, 1-many, etc), and the % identity between the two orthologues.
3. Of these genes, how many also do not have a human orthologue?
4. Retrieve (a) a FASTA file with the CDS sequence of each transcript encoded by these genes. Make sure that the transcript stable ID is in the header; and (b) a FASTA file containing the CDS sequence plus 100 nt downstream of the stop codon of each of those transcripts. In the header, include the transcript stable ID and the name of the scaffold that the transcript is on.
Next, you will analyses a region of the Trichuris muris (murine whipworm) genome, from position 20000000–20500000 on chromosome 1 (“TMUE_LG1”). From these coordinates, generate an output with:
6. The WormBase gene IDs and UniProtKB/TrEMBL IDs for T. muris genes from the region.
7. the InterPro domains that they have been annotated with (InterPro short description). [Q: why do some of the output rows appear multiple times?]
8. the gene stable IDs of their T. trichiura (human whipworm) orthologues. [Q: which gene has more than one T. trichiura orthologue?].
9. the names of any GO terms associated with the genes.
10. FASTA file of their peptide sequences.
Welcome back!
We will start by looking at commonly-used tools in WBPS:
Finally, the section ends with a an additional section introducing our Expression browser.
BLAST (Basic Local Alignment Search Tool) is one of the most commonly used tools to search for sequences that are similar to each other. It is a fast searching programme that is able to compare a query sequence with hundreds to millions of sequences quickly.
You can use BLAST to search a query sequence against the sequences in WormBase ParaSite.
How BLAST works?
BLAST uses three steps: 1) It ‘chops’ the query sequence into small ‘words’ of typically 3-4 amino acids for proteins or 10-12 nucleotides for DNA sequences. 2) It uses these short words to look for perfect matches across all the entries in the database. 3) When a match is found it then tries to extend the alignment by comparing consecutive letters of the word. For each new pair of letters, it evaluates whether it is a good match.
When do we need to use BLAST?
BLAST in WormBase ParaSite
The BLAST tool is accessible:
Both options will take you to WormBase ParaSite’s BLAST tool page:
How to evaluate Blast results?
Metrics used in the results table:
TTTGCAGATGCTTCTCCCTTCAAACTTGACGACGTCAACATTAATGACGTCATCATCAGA
ATCGTACGACGCTGATAATCCGGGGCTTCCGCCTGAGCCAATCCTGTCGGATTACGTGGA
AATGTTCACTTTGGTGCTCAATTTTATTGTTGGCGCGCCGTTGAACCTGGCCGCTTATAC
ACAGCTAAGCGAACGACCTACATCAACGCGGTTAGACCTTCTGAAGCGATCACTCAACTA
TTCGGATCTTCTCGTTCTATTCATCTACGTACCATCTCGTGCCTGCTGGTTATTGACCTA
CGATTGGCGGGGTGGAGATGCACTCTGTAAAATTGTCAAGATGTTTCATACGTTCGCGTT
TCAGAGCTCCTCCAACGTGATCGTGTGCATCGCCGTGGATCGCCTGCTATCCGTCCTCTC
CCCATCCCATCACAGCCCCAACAAAGCCCTGAAACGGACTAAAATGATGTTAATAGTCGC
GTGGATAGTAGCGCTAGTAATCTCATGCCCACAACTTTTCATCTGGAAAGCATATCTAGC
ACTTCCCGAGTATAATTGGAGCCAGTGTCTGCAAATTTGGGAGATTGCACGGATGGAAAA
ATTCAACAAACCACAGGTAGTGCCAGAGTTTGACGCCGAGTTCTGGTACAGCATACTGCA
TATTAGTCTCGTTTTTTGGATCCCTTGTATCATTATCATGCTATCCTACATCATAGTCAT
CTCATGGGTATGGATCAACTCTCGGCCGTCCATCCGTCACACCTCTTCATTTTCCTTCCA
CACCGGCTGCGATACGGTAGATACAGTACTGACTAGAGCCTCTGAATGGAATCCTTTGAA
GACATTCTCCCGTCACGTCAACATCAAGGAGCCCGAGAAGCCGATGACGACTCCCAGAAT
CGTGGTCAGCGACGAGACGGAGGTCCCACTGACGCAGCGACCATCGATTTCTCCGTCGGA
AGCGTCGGCGGTGATGAGGACCGGTGTGCACACGAGTACCTCGTATAATGCTAATTTGAA
TCGATCCCGAGCCCTGCGAGTTTCCTTGCTACTAGTCGTCGCGTACATCATCTGCTGGCT
ACCATATAACCTCATAAGTCTTATCCAATTTCTTGATCGGGACTTTTTTTCGTCATATCT
TAAACATGTCCACTTCTGCCAACAACTAATCATTTTTAACTCGGTCGTCAATCCATGGCT
CTACGGTTTCTTCGGTCCCCGCCGCCCGTCTACCACCGGTGCCGGCCGTCACTGATCTCC
AAACATCAAACATCGAATTCGCCATATCTTTCCAAAATCCCCCCAACGTTCCAGTTTTCA
AGCCCAACGAATTGCCAATGCCATATCTTTAACAACTTTTATGGTTTCTTGTTTGTTTTT
TTTTATTTATTTTATTGTAATGTTTGATTCTCGGTGAAAAATTTGTGTAAAATAAATTAT
TTTTTATGTGAAA
A genome browser is a tool that allows you to visualise a genome assembly and its features, together with experimental data aligned to the genome.
There are several commonly used genome browsers in bioinformatics, each with different features. In WormBase ParaSite we have two:
Ensembl - this can be used to browse a large catalog of genomes across the tree of life. WormBase ParaSite has an instance of the Ensembl browser built in, and we explored it earlier.
JBrowse 1 - this is the genome browser that we’ll be using today. WormBase ParaSite has an instance of JBrowse for every genome that it hosts. The Apollo project is a well known extension of JBrowse, which can be used to edit gene models.
There are many other genome browsers for different needs out there. Feel free to explore them at your own time:Integrative Genomics Viewer (IGV), UCSC Genome Browser, and the new version of Jbrowse
In this example we’ll introduce the basic functionality of WormBase ParaSite’s JBrowse 1, and demonstrate how to use the various tracks.
For this example, we’ll consider that you’re interested in the gene Smp_312440.
Here, you can see the forward and reverse DNA strands, together with the six possible translational reading frames (3 forward and 3 reverse).
Scrolling down the content of the box, you can extract genomic or cDNA sequence, or the sequence of specific subfeatures (specific exons or UTRs, for example).
Alternatively, you may wish to extract the genomic sequence of a whole region:
We can also use JBrowse to view other types of data aligned to the genome.
For most species, in addition to the gene model (“Genome Annotation”) track, there are RNASeq tracks. WormBase ParaSite finds and aligs RNASeq data in the ENA for our species of interest. These can be useful, for example, for checking that a gene model is well supported by expression data, or seeing in which life stages, or under which conditions, a gene of interest is transcribed.
Let’s say you want to see in which life stages Smp_312440 is expressed:
As well as looking at publicly available data, you can use WormBase ParaSite JBrowse to visualise your own data.
We’ll demonstrate how to do this using a local BAM file.
What is a BAM file? It’s based on a Sequence Alignment Map (SAM) file format, used in genomics to store sequencing read alignments against a reference sequence. BAM is binary version that is compressed and indexed, so that it is quick to access computationally.
Trying to read a BAM file won’t be very informative because it’s binary. It need to first be converted back into the SAM file format (non-binary, human-readable). We can do that with samtools:
To look at how a BAM file is structured, you’d need to use a command like this in a terminal window:
samtools view -h your_file.bam | less
####A brief aside on the BAM and SAM file formats:####
The SAM file starts with a header section. All header lines begin with a ‘@’ character.
Moving down through the file (by pressing the space bar) you come to the alignment section. Here, each line represents a sequencing read (though be aware that the lines are long, so a single line will probably wrap around your terminal window a few times). Some of the key fields are labelled below:
The full SAM specification is available here: http://samtools.github.io/hts-specs/
Before the file can be visualized in JBrowse, an index needs to be created. An index is another file that often accompanies a BAM file, and acts like a table of contents. Software such as JBrowse can look inside the index file and find where exactly in the corresponding BAM file it needs to look, without having to go through all of the reads (which would be computationally very expensive).
BAM index files should have exactly the same name as their corresponding BAM file, with the addition of a .bai suffix. You can index a BAM file using samtools. Type:
samtools index your_file.bam
This should put a file called your_file.bam.bai in your working directory, which can be loaded into WormBase ParaSite JBrowse.
We can only add an indexed BAM file to Jbrowse. To add a BAM track to Jbrowse :
In the demonstrated example, you can see the reads aligned to the genome. Notice that this RNA-Seq data is stranded- this means that the library preparation protocol preserved information on which end of the RNA molecule was 5-prime and which end was 3-prime, so we can infer which strand of DNA it was transcribed from. This information is encoded in the BAM file, and JBrowse colours the reads accordingly:
Another WormBase ParaSite tool that we will look at today is the Variant Effect Predictor, or VEP.
A common approach to understanding the genetic basis of phenotypic differences is to identify genetic variants that are over-represented in some populations of individuals.
For example, you might sequence two populations of worm: one that is susceptible to a drug and one that is resistant to the drug. You could then identify genomic positions where each of these populations differs from the reference genome.
VEP is a tool that allows you to predict what the consequences of these variants are: whether they fall within or near genes, and whether they result in a change to the amino acid sequence of a protein.
The standard file format for storing variation data is the Variant Call Format (VCF); this is another tab-delimited text format. Later in the course, you’ll see how to make one of these files. In the meantime, for some helminth genomes, these files have already been shared by other researchers. Today you’ll be using an available VCF file for Strongyloides ratti.
First, we’ll download a VCF file from the European Variation Archive (EVA). Then will upload it to WormBase ParaSite
You can download complete studies from the “Study Browser” tab but today we are using the “Variant Browser” to download a much smaller file corresponding to a 250 kb region of the genome.
3. Download the first 250kb of S. ratti chromosome 2:
4. Move to your Downloads directory and have a look at the file to see how it is structured:
# look at the contents
less sratti*.vcf
You’ll have to scroll down beyond the headers (lines starting with ##) to see the data lines. The actual data lines looks like:
5. From the WormBase ParaSite homepage, select “Tools” from the toolbar. 6. From the “Tools” page, select Variant Effect Predictor 7. To submit a VEP job, just select the correct species (Strongyloides ratti), upload the VCF file we just downloaded and click “Run”.
8. Once you have clicked “Run”, your input will be checked and submitted to the VEP as a job. All jobs associated with your session or account are shown in the “Recent Tickets” table. You may submit multiple jobs simultaneously.
9. Navigate to the results page:
The results are presented in pie-charts and an interactive table:
10. You can explore the results interactively on the webpage using the Results Preview filter panel at the centre. Use this panel and filter for variant that cause (select “consequence”) changes to amino acids (select “missense_variant”).
You can actually visualise the affected Amino acid by the “missense_variant” on the protein’s 3D AlphaFold model.
To do this:
11. Go to the “Protein matches” column of the results table. If the “Protein matches” column has not been switched on you can do so by using the “Show/hide columns” button at the top left of the table”. If the protein affected by the “missense_variant” has an AlphaFold protein model available, then you should see an “AlphaFold model” button in the “Protein matches” column. Click it.
12. Explore the 3D protein model. You can Click the “Focus” button underneath the variant information to zoom-in to the affected residue.
Download the VEP results from the example above as a “VEP file”. Use this file and the original VCF file to answer the following questions:
How many variants were there in the original dataset?
What are the different types of consequence that are found in the file, and how often does each occur?
List all of the variants found in SRAE_2000005500.1. Which variant or variants show the greatest impact?
Create a list of genes where a missense variant is found.
Find out which genes has the highest number of missense mutations.
It is possible to view the distribution of variants along the coding sequence in Jbrowse. To view the VCF in JBrowse, you first need to compress and index using ‘tabix’, a generalised indexer of tab-delimited files from the samtools package
bgzip file.vcf && tabix -p vcf file.vcf.gz
Earlier in this section, we looked at a gene in JBrowse and used RNAseq tracks to see in which life stages it was expressed. What if you were interested in transcriptional differences between life stages, but didn’t have a specific gene in mind?
You might want to retrieve all of the S. mansoni genes that are differentially expressed between 2 life cycle stages.
WormBase ParaSite has collated RNAseq data from publicly available studies and analysed it against our genomes and annotations.
This means that if somebody has already conducted a study to compare the conditions that you’re interested in, you can look up pre-calculated differentially expressed genes.
2. We can see a summary of the different studies that have been conducted. We’re interested in life cycle stages, so select the first study “Schistosoma mansoni transcriptomics at different life stages”.
For each study, we have a summary of the conditions for which data is available. A detailed description of transcriptomic experiments is beyond the scope of this workshop, but for those who are interested, HISAT2 was used to align reads to the genome, HTSeq to quantify counts per gene and DESeq2 to compute differential expression per condition.
Several files are available for download. These are:
3. Download the full results files for the “Schistosoma mansoni transcriptomics at different life stages” “24-hour-schistosomule-vs-cercariae” experiment by clicking “Full result files for 3 contrasts (zipped)” and place it into the “Downloads” directory.
cd ~/Downloads
# Extract the compressed directory
unzip ERP000427.de.contrasts.zip
# move inside the results directory
cd ERP000427.de.contrasts
# have a look at the 24-hour-schistosomule-vs-cercariae file
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | less
Use some of the commands you learned yesterday to extract the following information from the “24-hour-schistosomule-vs-cercariae.tsv” file:
4. Extract the top 5 most significantly regulated genes (hint: the final column, “padj”, gives the adjusted p value. A smaller adjusted p value = more significant).
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | grep -v "^gene_id" | sort -g -k 7,7 | awk -F'\t' '$7 != "NA"' | head -n 5
5. Of the genes with an adjusted p-value that is less than 0.05, which is (a) most highly upregulated in the 24h schistosomules v the cercariae (b) most strongly upregulated in the cercariae v the 24h schistosomules?
# upregulated in the 24h schistosomules means tha Log2FoldChange (column 3) should be a positive number
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | grep -v "^gene_id" | awk -F'\t' '$7 != "NA" && $7 < 0.05 && $3 > 0' | sort -r -g -k 3,3 | head -n 10
# upregulated in the cercariate means tha Log2FoldChange (column 3) should be a negative number
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | grep -v "^gene_id" | awk -F'\t' '$7 != "NA" && $7 < 0.05 && $3 < 0' | sort -g -k 3,3 | head -n 10
Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes.
Earlier, we talked about Gene Ontology (GO). “Gene Ontology” enrichment analysis is one of the most commonly used enrichment analyses.
Gene ontology is a formal representation of knowledge about a gene with respect to three aspects: Molecular Function, Cellular Component and Biological Process.
Gene Ontology (GO) enrichment analysis answers the very simple question: “What biological processes, molecular functions, and cellular components are significantly associated with a set of genes or proteins?”
For example, we can take the list of genes we identified as significantly upregulated in cercariae vs 24h-schistosomules and try to identify what are the biological processes, cellular components and molecular functions that are implicated in this developmental stage comparison.
Instead of manually trying to identify which genes in your list of differentially expressed genes have similar biological processes, cellular component and molecular functions, the GO enrichment analysis does it for you. More specifically, it clusters the genes into gene ontologies group, performs statistical analysis and shows you the most significantly overepressented ontologies!
So basically a GO enrichment analysis takes a list of gene identifiers like this:
and organises them to Gene Ontology terms (GO):
WormBase ParaSite offers a tool that performs this kind of analysis: g:Profiler that can be accessed from the “Tools” page:
Use the 24-hour-schistosomule-vs-cercariae.tsv from the previous section and print a list of genes with an adjusted p-value that is less than 0.05, which are most strongly upregulated in the cercariae v the 24h schistosomules.
Use gProfiler and perform a Gene-set enrichment analysis for these 40 genes from the “Schistosoma mansoni (PRJEA36577)” organism.
Which are the 3 most significantly enriched Cellular Component terms? Are they relevant to this developmental stage comparion we’re performing?
Expand the stats by using the “»” in the header of the GO:CC result table. Try to interpret the T, Q, TnQ and U metrics. What do they represent?