BSP2024

WormBase ParaSite

Table of contents

(PART ONE)

  1. Overview and Aims
  2. Genes and genomes
  3. Looking at genomes in WormBase ParaSite
  4. Looking at genes in WormBase ParaSite
  5. BioMart

BREAK

(PART TWO)

  1. Overview and Aims
  2. Tools
  3. The WormBase ParaSite Expression browser
  4. Gene-set enrichment analysis

1. Overview and Aims

We’ll start by reviewing the basics on genes and genomes: what they are, how we represent and talk about them, and how we go from a DNA sequence- a string of letters- to making predictions about gene function. We’ll look at this in the context of WormBase ParaSite and other online database resources.

WormBase ParaSite gathers together nematode and flatworm genomes that have been assembled and annotated by the research community, adds additional analyses and makes these data available to the scientific community. We’ll look at the kind of data you can retrieve, initially by exploring the website. We’ll also look at BioMart, a data-mining tool that allows you to retrieve WormBase ParaSite data in bulk.


2. Genes and Genomes

Throughout this course, we’ll assume that you’re familiar with genes and genomes.

Genes: the basics

A gene is a unit of the genome, a DNA sequence, that is transcribed into an RNA molecule, or a transcript. A gene’s transcript may go on to be translated into a protein (in that case it is an mRNA), or it may have a role as a non-coding RNA, such as a ribosomal RNA (rRNA), transfer RNAs (tRNA), microRNA (miRNA), or long non-coding RNA (lncRNA).

In eukaryotes, most protein-coding genes comprise alternating exons and introns (some genes have a single exon), flanked by untranslated regions (UTRs). The exons constitute the parts of the gene that are translated into a polypeptide. Introns are transcribed but soon after excised and the final mature mRNA is formed by a 5’UTR, joined exons and a 3’UTR. A CAP and poly-A tail are added to the 5’ and 3’ ends respectively. These structures are essential to guarantee the molecular stability and downstream processing of the mRNAs.

This figure represents the steps that are needed to transform information encoded in the DNA into a polypeptide and eventually a functional protein. The starting information is encoded in the genome. A gene encodes, among other things, the transcription start and transcription end. Transcription produces an RNA copy of the gene, known as pre-mRNA, which comprises exons and introns. Mature mRNA is produced by splicing together exons, removing introns, adding a CAP at the 5’ end, and polyadenylating the 3’end, to add a poly(A) tail. The mature mRNA is the template for the translation into a polypeptide by the ribosome.


Genomes: the basics

In the cell, genomes are organised into chromosomes. In practice, current DNA sequencing methods are unable to read the DNA sequence of a whole chromosome without errors. We therefore use the technique of sequencing shorter segments of chromosomes, and do it in such a way that the segments overlap and can be pieced together like a jigsaw puzzle. This process is referred to as genome assembly. For now, we will focus on what genome assemblies look like, and how they are represented in genome databases.

The diagram below shows the structure of a typical assembly. It has 3 layers: the contigs are stretches of contiguous DNA sequence without gaps. The scaffolds are ordered sets of contigs separated by gaps of estimated length. In order to make scaffolds from contigs, techniques such as optical mapping and Hi-C are used. Finally, the chromosome is an ordered set of scaffolds separated by gaps on unknown length. To make the chromosome sequence from the scaffold, techniques such linkage mapping and FISH are used.

Sometimes, there is insufficient (or no) data to reliably place a scaffold into a specific position on a chromosome. In the figure above, this is true of the scaffold on the right. The assembly above therefore comprises 2 top-level sequences: 1 chromosome, and one unplaced scaffold.


Sequence databases

As technology has evolved, there has been an explosion in the number of genomes that have been sequenced. Sequence databases provide a place where these sequences can be deposited, stored and made available to the world. There are three widely-used nucleotide repositories (or primary databases) for the submission of nucleotide and genome sequences:

Together they form the International Nucleotide Sequence Database Collaboration and luckily for users, they all “mirror” each other. This means that irrespective of where a sequence is submitted, the entry will appear in all three databases. Once data are deposited in primary databases, they can be accessed freely by anyone around the world.

WormBase ParaSite takes sequencing data from INSDC (a genome assembly and a set of gene predictions) and adds additional value by performing additional analyses and making the data available in a user-friendly interactive way. In this part of the workshop, we will explore the basic functionality of the website for looking at helminth genomes and genes.

Back to top


3. Looking at genomes in WormBase ParaSite

In this section, we’ll explore how genome assemblies are presented in WormBase ParaSite, and look at some commonly used metrics of assembly quality.

In an ideal world, each genome assembly would fully reconstruct contiguous chromosomes. Many of the genomes in WormBase ParaSite are much more fragmented than this; a chromosome might actually be represented by hundreds or even thousands of smaller scaffolds or contigs. Having a more fragmented genome makes identifying genes much more challenging, as gene models are more likely to be split across scaffolds.

  1. From the WormBase ParaSite homepage, click either the ”Genome List” tab in the tools bar, or the “Genomes” icon.

This will take you to a list of all of the genomes currently available in WormBase ParaSite, divided phylogenetically into the phyla Nematoda and Platyhelminthes.

Table Features:

2. Scroll down the page to find Brugia malayi and click the species name link- this will take you to the B. malayi genome landing page.

The genome page has useful summary information about the species and the assembly. You can see a summary of the methods used to produce the assembly and the annotation, and links to the publication describing it in more detail (where this is available). 3. Look now at the ‘Assembly statistics’ box.

The information in this box tells us about two metrics related to the quality of the assembly: contiguity and completeness.


Genome assembly metrics exercise

1. Find the two other genome assemblies from different Brugia species in WormBase ParaSite, which are of lower quality than Brugia malayi. 2. According to their scaffold statistics and BUSCO scores, which of these two assemblies is more contiguous and complete?


4. Looking at genes in WormBase ParaSite

For each genome in WormBase ParaSite, there are gene and transcript pages available for browsing. The aim of this section is to familiarise you with retrieving information about a gene of interest from WormBase ParaSite.

The Gene Page

We will use a walk through example to illustrate how to use the website to find out about the function of an Opisthorcis viverrini gene.

Basic Navigation

  1. Open up a web browser, and type this URL into the address bar: https://parasite.wormbase.org/

The page should look something like this:

2. Paste “T265_10539” into the search bar in the top right of the page and click return. T265_10539 is a stable (i.e. permanent) identifier for a gene. These identifiers are usually allocated by the scientist or group that sequenced and annotated the genome.

You should get one result, matching a gene in Opisthorchis viverrini, the Southeast Asian liver fluke. Let’s look at the page for the T265_10539 gene:

3. Click T265_10539

Every gene in WormBase ParaSite has a gene page, which presents sequence data and a summary of information from various sources about the gene.

The gene page has three main sections. In the summary section, together with the description of our gene we can also see the genomic location of the gene (“opera_v5_385”, in this case) and the INSDC Sequence ID. This is an identifier that links to an entry for the scaffold in the ENA.

Underneath, we can see some information about the gene: it has one transcript isoform and a number of orthologues and paralogues. We’ll revisit this later. We can also see that the gene is protein-coding.

On the left, we have a navigation menu, which can be used to explore the information that is available for each gene. We’ll be going through each of these menu options in turn.

The “Genomic context” image underneath shows us a snapshot of the scaffold that our gene is on.

4. Click the ‘Region in Detail’ link in the “Genomic context” section.

Here, each of the three boxes gives us an increasingly zoomed-in view of the gene’s genomic position. The top box shows the whole scaffold, and the middle box below it shows a zoomed-in part of the scaffold. In this case, the scaffold (“opera_v5_385”) is short so the middle box is showing the whole scaffold. Looking at the middle box, it shows us that out gene of interest is located approximately a quarter of the way along the scaffold. The bottom box shows us the structure of the gene model.

We can see that:

You can learn more about the Genome Browser here.

Sneak peek There is an alternative interactive Genome Browser in WormBase ParaSite that can be accessed by the "View region in Jbrowse" button at the top-right of the gene page but we're going to talk about it after the break!


5. Navigate back to the gene page by clicking the “Gene:inx” tab at the top of the page.

As well as gene pages, WormBase ParaSite has a page for each transcript that a gene produces. In this case, only one transcript isoform has been annotated.

6. On the gene page, click the “Show transcript table” button to show the trancript table. Then click the transcript ID in the transcipt table to navigate to the transcript page.

Again using the navigation menu on the left hand side of the page, we can retrieve three main types of information on the transcript: sequences, information about the protein domains, and external references.

7. Click “Exons”, “cDNA” and “Protein” in the “Sequence” section of the navigation menu to see the different types of sequence that are available for the transcript.

Many users use sequences retrieved from these pages to design primers.

Note that this protein sequence is what is known as a conceptual translation: the amino acids have not been sequenced directly, but we can infer the sequence given the predicted structure of the gene (the coordinates of the introns and exons), the underlying DNA sequence and a given codon usage table.

Back to top


Functional annotation: Gene Ontology (GO) terms, protein domains and protein structure

So far we have gathered general information about this Opisthorcis viverrini gene. We have also inspected their genomic location and sequence. However, we don’t have any clues about the genes’ function! What does the protein encoded from this gene do?

Gene ontology (GO)

A fast way to find out about the function of a gene’s product is to see which Gene Ontology (GO) terms have been associated with it. GO is a project that seeks to describe complex biology in a logical and consistent way that is humanreadable and and computer-processable. GO is a controlled vocabulary, whereby gene products are associated with GO terms that describe their function. There are three aspects to GO: Cellular Component, Molecular Function and Biological Process. Cellular Component GO terms describe where a protein is localised (in the membrane, extracellular, in the nucleus etc). Molecular Function GO terms describe the biochemical activity of the protein. Biological Process GO terms describe the pathways and broader processes that the protein contributes to.

  1. Click the “Gene:inx” tab at the top of the page to return to the main gene page, then select “Biological process” and/or “Cellular component” from the Gene Ontology section of the navigation menu.

WormBase ParaSite imports GO annotations from three sources:

The GO terms associated with this gene make sense given what we already know about the Innexin protein family: this protein likely forms part of the gap junction, which is a channel connecting the cytoplasm of two cells.


Protein Features and Domains

For the vast majority of predicted protein sequences, experiments to determine function have not yet been performed. However, we can use homology to take proteins that are well-studied in one experimental system and infer that proteins of similar sequence in other organisms are likely to have similar structure, and therefore similar function.

Protein sequences are commonly analysed by looking at their domains - subsequences of a protein that have a defined tertiary structure or sequence motif and confer a defined function. A protein can consist of several domains. When comparing proteins between organisms, often the region encoding a protein domain is highly conserved whilst the bit that connects different domains together is more divergent.

The InterPro consortium: There are many protein domain databases. A well known example of a protein domain database is Pfam. Pfam uses multiple sequence alignments of the known proteins with a certain domain to capture a representative model (a profile Hidden Markov Model) of that domain. Other protein domain databases, that might use slightly different methods to define domains, are: CATH, CDD, HAMAP, MobiDB Lite, Panther, PIRSF, PRINTS, Prosite, SFLD, SMART, SUPERFAMILY and TIGRfams. Luckily for us, all of these databases are united under the InterPro consortium.

InterPro provides a tool, InterProScan, that we can use to search protein sequences against all of the member databases to identify any protein domains that the protein might have: InterProScan is an extremely useful tool for predicting gene and protein function.

At WormBase ParaSite, we pre-run InterProScan to annotate protein domains for all of the genes in our database so you don’t have to do it yourself every time!

  1. To view the annotated protein domains, click the “Protein summary” menu option in the navigation menu on the T265_10539 transcript page.

On this page we see a pictorial representation of the protein domains that have been annotated to this polypeptide. We can see here that this protein has a match with an Innexin domain in several protein domain databases, and four transmembrane helices.

2. The same data is available in tabular format. To view this format, click the “Domains & features” menu option.

How to explore the protein domains of a protein that is not available in WormBase ParaSite? * Go to the [Interpro Search page](https://www.ebi.ac.uk/interpro/search/sequence/), paste your protein sequence into the box and click search. You may need to wait a few minutes for the search to run. ![](/BSP2024/manual/figures/figure_3.8.5.png) On the results page, each horizontal coloured line represents a match between our protein sequence and a domain or motif in one of the InterPro member databases. Mouse over these, to get more information. InterPro groups the same domain represented in different databases under a single InterPro accession. * Click through to read more about the annotated protein family on the Interpro site.

Protein Structure: Explore the 3D protein model of the gene using AlphaFold

A protein’s function is determined by its 3D structure. Knowing its structure can enable detailed predictions to be made about its function, such as mechanistic insights into important biological interactions with other proteins, or even drugs.

Very few protein structures have been directly solved but the ability to predict structures has dramatically improved recently.

AlphaFold, is a state-of-the-art AI system developed by DeepMind that is able to computationally predict protein structures with unprecedented accuracy and speed. Working in partnership with EMBL’s European Bioinformatics Institute (EMBL-EBI), AlphaFold released over 200 million protein structure predictions that are freely and openly available to the global scientific community. Included are nearly all catalogued proteins known to science – with the potential to increase humanity’s understanding of biology by orders of magnitude.

All AlphaFold-predicted models available for proteins encoded by WormBase ParaSite species have been imported.

  1. AlphaFold predicted model is browsable from the transcript page. To view the model click the “AlphaFold predicted model” menu option in the left navigation menu on the T265_10539 transcript page.

You can now view the interactive 3D AlphaFold structure of the protein. The interactive molecular viewer visualizes the structure, coloured by the per-residue pLDDT confidence measure.

Drag and drop with your mouse over the protein model to rotate it and use your mouse wheel to zoom in/out. You can use the right panel to visualise exons as well protein domains and features on the 3D model. This might give you a better understanding of where your domains of interest are.

What can I do with an AlphaFold protein structure?
To download the raw files for an AlphaFold protein structure you found in WormBase ParaSite:
1. Take a note of the AlphaFold accession ID, in our case it's: AF-A0A074Z666-F1 2. Go to the AlphaFold web-page and search for it (https://alphafold.ebi.ac.uk/entry/A0A074Z666). 3. At the top of the page you can downlaod the prediction in PDF, mmCIF or Predicted Align Error format. Then you could use the structure file to perform subsequent analyses. Online tools that can be used with the downloaded structures from AlphaFold: - [SwissDock](http://www.swissdock.ch/), a web service to predict the molecular interactions that may occur between a target protein and a small molecule. It is used alongside S3DB, a database of manually curated target and ligand structures, inspired by the Ligand-Protein Database. - [Zhang group Online-serivces portal](https://zhanggroup.org/) - Docking simulations (https://zhanggroup.org/EDock/) against different ligands. - Protein structure alignment with another protein (https://zhanggroup.org/TM-align/).

Back to top


From WormBase ParaSite to the world: External References

External references are the identifiers by which the gene (or transcript or protein, in this case) is known in other databases outside WormBase ParaSite.

These usually include RefSeq (the reference sequence database of the NCBI) and UniProt, and sometimes (though not in this case), WormBase ParaSite’s sister database, WormBase.

Back to top


Comparative genomics

Another approach to understanding what a gene does is comparing its sequence to other genes, both within the same genome, and across different genomes.

WormBase ParaSite groups all helminth genes, together with comparator genes from a number of model organisms, into families, based on the similarity of their protein sequences.

For each family, we arrange the genes into an evolutionary tree.

  1. Select “Gene tree” from the Comparative Genomics section of the navigation menu on the gene page.

The gene tree shows the inferred evolutionary history for the family that this gene is a member of.

Note that the most closely related gene in the tree is from another Opisthorchis species, O. felineus, these two genes are orthologous to each other.

It can be useful to look at alignments of these related proteins to see how well conserved they are. Highly conserved regions are more likely to be essential for the function of the protein. To do this:

2. Click on the section of the tree labelled “Blood flukes” and click “expand this subtree”.

Next to the main tree, in green, we can see a pictorial summary of the multiple alignment of the proteins of these four genes, with green coloured regions representing alignments and non-coloured regions representing gaps. You may be interested in exploring these alignments at a higher resolution.

3. Click the node that separates the Opisthorchis spp. from the blood flukes and then click “View in Wasabi” in the pop-up box.

A multiple alignment of the 25 proteins will appear in a new window: we can see that parts of these protein sequences are extremely well conserved.

Orthologues and paralogues are also available in tabular format, where they can be easily exported.

4. Select “Orthologues” in the navigation menu.

In the main table on this page, each row represents an orthologue of inx. The table gives details on the nature of the relationship between our O. viverrini gene and the gene in the other species, such as whether the gene has one or multiple orthologues in the other species (1-to-1 or 1-to-many), and how similar the two proteins are. Multiple alignments can be viewed by following the links.

Back to top

WormBase ParaSite Gene Trees: technical overview

WormBase ParaSite uses a computational pipeline developed by the Ensembl project to group related genes into families and define the evolutionary relationships between them. Below is a summary of the steps of the pipeline:

  1. A library of protein family Hidden Markov Models (HMMs) is used as a starting point. Gene sequences are scored against these models, giving a probability of how likely each sequence is to be a member of the corresponding family. The HMM library used in the Compara pipeline is based on the Panther and TreeFam databases.
  2. Any proteins that were not classified into a family in the HMM search are then compared with each other by all-against-all BLAST.
  3. Any family with more than 400 members is broken down into smaller families (max 400 proteins).
  4. All of the protein sequences in each family are aligned against each other using efficient multiple alignment software.
  5. For each family, a phylogenetic tree is built (using TreeBeST5). Tree building is guided by a species phylogenetic tree.
  6. Orthologues and paralogues are called on the basis of the resulting tree: any two genes separated by speciation events are orthologs, and any two genes in the same species separated by a duplication event are paralogues.

You can read more about this pipeline here.

Back to top

Gene page exercise

The aim of this exercise is to familiarise yourself with the WormBase ParaSite gene page. Go to the gene page for the Trichuris muris gene TMUE_3000014465 and retrieve the following information:

  1. What is the summary description of the gene? Do you have any idea what the gene might be doing from this description?
  2. How many transcripts of the gene are annotated?
  3. Which strand is the gene on? What is the name of the 5’ neighbouring gene?
  4. Download the 3’UTR sequence.
  5. What identifier would you use to search for the gene in Uniprot?
  6. Where is this gene’s protein predicted to localise to?
  7. Which Pfam domains is the protein predicted to have? Which of these is responsible for its DNA binding activity?
  8. Download the protein alignment of TMUE_3000014465 and its C. elegans orthologue. Is there any published literature on the C. elegans orthologue?
    HintFollow the link to the WormBase ParaSite page for the _C. elegans_ orthologue and look in the “Literature” tab.
  9. Are there any phenotypes associated with this T. muris gene according to the gene page? Which one(s)? Where are these gene-phenotype associations inferred from?
    HintGo back to the TMUE_3000014465 gene page and look in the "Phenotypes" tab.

    Back to top


5. BioMart

So far we have seen how you can manually browse WormBase ParaSite by searching for genes and then navigating to their gene/transcript/protein pages. However, in many cases you might have to automatically extract information from WormBase ParaSite for multiple entries. Or simply you might need to extract information about your favourite genome’s features that fullfil some criteria.

BioMart is an extremely powerful tool that allows you to query WormBase ParaSite data in bulk, with no programming knowledge. Consider the information that we gathered on our O. viverrini gene of interest, by clicking around the gene page. Now imagine that rather than having one gene of interest, we actually have a list of 100 genes. That would be a lot of clicking around on gene pages! BioMart allows you to output all of this data for multiple genes in a few clicks.

Retrieving data for a list of known genes isn’t the only thing that BioMart can do. In this section, we’ll go through a series of examples and exercises that aim to illustrate the power of this tool.

There are two main steps involved in building a BioMart query.

Some of the filters allow you to enter data to filter on, e.g. a list of gene names.

The table below lists some examples of filters and attributes for BioMart queries:

Examples of Filters Examples of Attributes
A genome Gene, transcript or protein IDs
A genomic region Sequences
A list of gene IDs Identifiers from external databases (eg, Uniprot IDs)
All genes that have GO term x, or protein domain Y Protein domains or GO terms associated with a gene ID
All genes that have GO term x, or protein domain Y IDs of orthologous genes, % identity

Query Filters and Output attributes can be combined to produce more complex queries and customised output.

Let’s try to do this: Let’s say we want to retrieve the IDs and predicted protein domains of all of the genes from Schistosoma mansoni chromosome 1 that have a predicted AlphaFold 3D protein structure. We’ll walk through this example to get started.

  1. From the WormBase ParaSite homepage, select BioMart from the tool bar, or the BioMart icon.

We have to set three Query Filters: the genome (the S. mansoni genome), genomic location (chromosome 1), and a protein domain (genes whose protein have a predicted 3D AlphaFold model).

2. Select “Species”, tick the “genome” checkbox and scroll down to select “Schistosoma mansoni (PRJEA36577)”.

3. Select “Region”, tick the “Chromosome/scaffold” check box and type “Sm_V10_1” into the text field (you must know the exact name of the chromosome).

4. Select “Protein domains”, tick the “Limit to genes…” checkbox and select “with AlphaFold protein structures”

Note that as we have built up the query, the filters have appeared on the left hand side of the page.

5. Click “count” to count the number of genes in the database that fulfil these filter criteria.

Next we will select the output attributes. “Genome project” and “Gene stable ID” are already pre-selected as attributes:

6. Select “Output attributes”

BioMart lets us generate two types of output: data tables, and sequence (FASTA) files. In this example we’ll be generating a data table. We want to retrieve the gene IDs and associated protein domains of the 215 genes that fulfil our filter criteria.

7. Select “Interpro protein domains” and check the tick boxes for “InterPro ID”, “InterPro short description”, “Start position” and “End position”.

Back to top


BioMart exercise

Use the following S. mansoni gene stable IDs to answer questions 1-4:

Smp_000090 
Smp_000120 
Smp_000180 
Smp_000210 
Smp_000220 
Smp_000250 
Smp_000330 
Smp_000380 
Smp_000400 
Smp_000520 
Smp_000030 
Smp_000040
Smp_000050 
Smp_000070 
Smp_000080 
Smp_000130 
Smp_000140 
Smp_000150 
Smp_000160 
Smp_000170 
Smp_000320 
Smp_001085 
Smp_002080 
Smp_002180 
Smp_002550 
Smp_000020 
Smp_000075 
Smp_000100 
Smp_000110 
Smp_000370

1. How many of these genes have orthologues in S. haematobium?

2. Generate a table listing the genes. The table should also has the gene stable ID for the homologue in both species, the homology type (1-1, 1-many, etc), and the % identity between the two orthologues.

3. Of these genes, how many also do not have a human orthologue?

4. Retrieve (a) a FASTA file with the CDS sequence of each transcript encoded by these genes. Make sure that the transcript stable ID is in the header; and (b) a FASTA file containing the CDS sequence plus 100 nt downstream of the stop codon of each of those transcripts. In the header, include the transcript stable ID and the name of the scaffold that the transcript is on.

Next, you will analyses a region of the Trichuris muris (murine whipworm) genome, from position 20000000–20500000 on chromosome 1 (“TMUE_LG1”). From these coordinates, generate an output with:

6. The WormBase gene IDs and UniProtKB/TrEMBL IDs for T. muris genes from the region.

7. the InterPro domains that they have been annotated with (InterPro short description). [Q: why do some of the output rows appear multiple times?]

8. the gene stable IDs of their T. trichiura (human whipworm) orthologues. [Q: which gene has more than one T. trichiura orthologue?].

9. the names of any GO terms associated with the genes.

10. FASTA file of their peptide sequences.

Back to top

Break time!

6. Overview and Aims

Welcome back!

We will start by looking at commonly-used tools in WBPS:

Finally, the section ends with a an additional section introducing our Expression browser.


Tools

BLAST

BLAST (Basic Local Alignment Search Tool) is one of the most commonly used tools to search for sequences that are similar to each other. It is a fast searching programme that is able to compare a query sequence with hundreds to millions of sequences quickly.

You can use BLAST to search a query sequence against the sequences in WormBase ParaSite.

How BLAST works?

BLAST uses three steps: 1) It ‘chops’ the query sequence into small ‘words’ of typically 3-4 amino acids for proteins or 10-12 nucleotides for DNA sequences. 2) It uses these short words to look for perfect matches across all the entries in the database. 3) When a match is found it then tries to extend the alignment by comparing consecutive letters of the word. For each new pair of letters, it evaluates whether it is a good match.

When do we need to use BLAST?

BLAST in WormBase ParaSite

The BLAST tool is accessible:

Both options will take you to WormBase ParaSite’s BLAST tool page:

How to evaluate Blast results?

Metrics used in the results table:

Back to top


BLAST exercise

TTTGCAGATGCTTCTCCCTTCAAACTTGACGACGTCAACATTAATGACGTCATCATCAGA
ATCGTACGACGCTGATAATCCGGGGCTTCCGCCTGAGCCAATCCTGTCGGATTACGTGGA
AATGTTCACTTTGGTGCTCAATTTTATTGTTGGCGCGCCGTTGAACCTGGCCGCTTATAC
ACAGCTAAGCGAACGACCTACATCAACGCGGTTAGACCTTCTGAAGCGATCACTCAACTA
TTCGGATCTTCTCGTTCTATTCATCTACGTACCATCTCGTGCCTGCTGGTTATTGACCTA
CGATTGGCGGGGTGGAGATGCACTCTGTAAAATTGTCAAGATGTTTCATACGTTCGCGTT
TCAGAGCTCCTCCAACGTGATCGTGTGCATCGCCGTGGATCGCCTGCTATCCGTCCTCTC
CCCATCCCATCACAGCCCCAACAAAGCCCTGAAACGGACTAAAATGATGTTAATAGTCGC
GTGGATAGTAGCGCTAGTAATCTCATGCCCACAACTTTTCATCTGGAAAGCATATCTAGC
ACTTCCCGAGTATAATTGGAGCCAGTGTCTGCAAATTTGGGAGATTGCACGGATGGAAAA
ATTCAACAAACCACAGGTAGTGCCAGAGTTTGACGCCGAGTTCTGGTACAGCATACTGCA
TATTAGTCTCGTTTTTTGGATCCCTTGTATCATTATCATGCTATCCTACATCATAGTCAT
CTCATGGGTATGGATCAACTCTCGGCCGTCCATCCGTCACACCTCTTCATTTTCCTTCCA
CACCGGCTGCGATACGGTAGATACAGTACTGACTAGAGCCTCTGAATGGAATCCTTTGAA
GACATTCTCCCGTCACGTCAACATCAAGGAGCCCGAGAAGCCGATGACGACTCCCAGAAT
CGTGGTCAGCGACGAGACGGAGGTCCCACTGACGCAGCGACCATCGATTTCTCCGTCGGA
AGCGTCGGCGGTGATGAGGACCGGTGTGCACACGAGTACCTCGTATAATGCTAATTTGAA
TCGATCCCGAGCCCTGCGAGTTTCCTTGCTACTAGTCGTCGCGTACATCATCTGCTGGCT
ACCATATAACCTCATAAGTCTTATCCAATTTCTTGATCGGGACTTTTTTTCGTCATATCT
TAAACATGTCCACTTCTGCCAACAACTAATCATTTTTAACTCGGTCGTCAATCCATGGCT
CTACGGTTTCTTCGGTCCCCGCCGCCCGTCTACCACCGGTGCCGGCCGTCACTGATCTCC
AAACATCAAACATCGAATTCGCCATATCTTTCCAAAATCCCCCCAACGTTCCAGTTTTCA
AGCCCAACGAATTGCCAATGCCATATCTTTAACAACTTTTATGGTTTCTTGTTTGTTTTT
TTTTATTTATTTTATTGTAATGTTTGATTCTCGGTGAAAAATTTGTGTAAAATAAATTAT
TTTTTATGTGAAA

Back to top


The genome browser

A genome browser is a tool that allows you to visualise a genome assembly and its features, together with experimental data aligned to the genome.

There are several commonly used genome browsers in bioinformatics, each with different features. In WormBase ParaSite we have two:

There are many other genome browsers for different needs out there. Feel free to explore them at your own time:Integrative Genomics Viewer (IGV), UCSC Genome Browser, and the new version of Jbrowse

Using JBrowse: basic functionality

In this example we’ll introduce the basic functionality of WormBase ParaSite’s JBrowse 1, and demonstrate how to use the various tracks.

  1. Navigate to the S. mansoni genome page and select the “Genome Browser (JBrowse)” icon.

For this example, we’ll consider that you’re interested in the gene Smp_312440.

  1. Start by typing the identifier into the search box and clicking “Go” to navigate to the gene.
  2. Zoom in by clicking the large magnifying glass with a “+” symbol until you can see the reference sequence.

Here, you can see the forward and reverse DNA strands, together with the six possible translational reading frames (3 forward and 3 reverse).

  1. Zoom out again so that you have the whole gene model in your field of view.
  2. To extract sequence information about the gene, click the gene model such that a dialogue box pops up.

Scrolling down the content of the box, you can extract genomic or cDNA sequence, or the sequence of specific subfeatures (specific exons or UTRs, for example).

Alternatively, you may wish to extract the genomic sequence of a whole region:

  1. Click the arrow next to the “Reference sequence” track label in the top left of the screen, select “Save track data”, then download the sequence as a FASTA file.

Tracks

We can also use JBrowse to view other types of data aligned to the genome.

  1. Click the “select tracks” button in the top left of the screen.

For most species, in addition to the gene model (“Genome Annotation”) track, there are RNASeq tracks. WormBase ParaSite finds and aligs RNASeq data in the ENA for our species of interest. These can be useful, for example, for checking that a gene model is well supported by expression data, or seeing in which life stages, or under which conditions, a gene of interest is transcribed.

Let’s say you want to see in which life stages Smp_312440 is expressed:

  1. Click the “developmental stage” facet
  2. Select a few of the available libraries (in the example below we’ve selected 3h schistosomules and miracidia) and click “back to browser”.

Visualising your own data

As well as looking at publicly available data, you can use WormBase ParaSite JBrowse to visualise your own data.

We’ll demonstrate how to do this using a local BAM file.

What is a BAM file? It’s based on a Sequence Alignment Map (SAM) file format, used in genomics to store sequencing read alignments against a reference sequence. BAM is binary version that is compressed and indexed, so that it is quick to access computationally.

Trying to read a BAM file won’t be very informative because it’s binary. It need to first be converted back into the SAM file format (non-binary, human-readable). We can do that with samtools:

####A brief aside on the BAM and SAM file formats:####

The SAM file starts with a header section. All header lines begin with a ‘@’ character.

Moving down through the file (by pressing the space bar) you come to the alignment section. Here, each line represents a sequencing read (though be aware that the lines are long, so a single line will probably wrap around your terminal window a few times). Some of the key fields are labelled below:

The full SAM specification is available here: http://samtools.github.io/hts-specs/

Before the file can be visualized in JBrowse, an index needs to be created. An index is another file that often accompanies a BAM file, and acts like a table of contents. Software such as JBrowse can look inside the index file and find where exactly in the corresponding BAM file it needs to look, without having to go through all of the reads (which would be computationally very expensive).

BAM index files should have exactly the same name as their corresponding BAM file, with the addition of a .bai suffix. You can index a BAM file using samtools. Type:

samtools index your_file.bam


This should put a file called your_file.bam.bai in your working directory, which can be loaded into WormBase ParaSite JBrowse.

We can only add an indexed BAM file to Jbrowse. To add a BAM track to Jbrowse :

In the demonstrated example, you can see the reads aligned to the genome. Notice that this RNA-Seq data is stranded- this means that the library preparation protocol preserved information on which end of the RNA molecule was 5-prime and which end was 3-prime, so we can infer which strand of DNA it was transcribed from. This information is encoded in the BAM file, and JBrowse colours the reads accordingly:

Back to top


VEP

Another WormBase ParaSite tool that we will look at today is the Variant Effect Predictor, or VEP.

A common approach to understanding the genetic basis of phenotypic differences is to identify genetic variants that are over-represented in some populations of individuals.

For example, you might sequence two populations of worm: one that is susceptible to a drug and one that is resistant to the drug. You could then identify genomic positions where each of these populations differs from the reference genome.

VEP is a tool that allows you to predict what the consequences of these variants are: whether they fall within or near genes, and whether they result in a change to the amino acid sequence of a protein.

The standard file format for storing variation data is the Variant Call Format (VCF); this is another tab-delimited text format. Later in the course, you’ll see how to make one of these files. In the meantime, for some helminth genomes, these files have already been shared by other researchers. Today you’ll be using an available VCF file for Strongyloides ratti.

First, we’ll download a VCF file from the European Variation Archive (EVA). Then will upload it to WormBase ParaSite

  1. Go to the European Variation Archive (EVA).
  2. Select the “Variant Browser” tab.

You can download complete studies from the “Study Browser” tab but today we are using the “Variant Browser” to download a much smaller file corresponding to a 250 kb region of the genome.

3. Download the first 250kb of S. ratti chromosome 2:

4. Move to your Downloads directory and have a look at the file to see how it is structured:

# look at the contents
less sratti*.vcf

You’ll have to scroll down beyond the headers (lines starting with ##) to see the data lines. The actual data lines looks like:

5. From the WormBase ParaSite homepage, select “Tools” from the toolbar. 6. From the “Tools” page, select Variant Effect Predictor 7. To submit a VEP job, just select the correct species (Strongyloides ratti), upload the VCF file we just downloaded and click “Run”.

8. Once you have clicked “Run”, your input will be checked and submitted to the VEP as a job. All jobs associated with your session or account are shown in the “Recent Tickets” table. You may submit multiple jobs simultaneously.

9. Navigate to the results page:

The results are presented in pie-charts and an interactive table:

10. You can explore the results interactively on the webpage using the Results Preview filter panel at the centre. Use this panel and filter for variant that cause (select “consequence”) changes to amino acids (select “missense_variant”).

You can actually visualise the affected Amino acid by the “missense_variant” on the protein’s 3D AlphaFold model.

To do this:

11. Go to the “Protein matches” column of the results table. If the “Protein matches” column has not been switched on you can do so by using the “Show/hide columns” button at the top left of the table”. If the protein affected by the “missense_variant” has an AlphaFold protein model available, then you should see an “AlphaFold model” button in the “Protein matches” column. Click it.

12. Explore the 3D protein model. You can Click the “Focus” button underneath the variant information to zoom-in to the affected residue.

Back to top

VEP exercise

Download the VEP results from the example above as a “VEP file”. Use this file and the original VCF file to answer the following questions:

  1. How many variants were there in the original dataset?

  2. What are the different types of consequence that are found in the file, and how often does each occur?

  3. List all of the variants found in SRAE_2000005500.1. Which variant or variants show the greatest impact?

  4. Create a list of genes where a missense variant is found.

  5. Find out which genes has the highest number of missense mutations.

It is possible to view the distribution of variants along the coding sequence in Jbrowse. To view the VCF in JBrowse, you first need to compress and index using ‘tabix’, a generalised indexer of tab-delimited files from the samtools package

bgzip file.vcf && tabix -p vcf file.vcf.gz


Back to top


The WormBase ParaSite Expression browser

Earlier in this section, we looked at a gene in JBrowse and used RNAseq tracks to see in which life stages it was expressed. What if you were interested in transcriptional differences between life stages, but didn’t have a specific gene in mind?

You might want to retrieve all of the S. mansoni genes that are differentially expressed between 2 life cycle stages.

WormBase ParaSite has collated RNAseq data from publicly available studies and analysed it against our genomes and annotations.

This means that if somebody has already conducted a study to compare the conditions that you’re interested in, you can look up pre-calculated differentially expressed genes.

  1. Navigate back to the S. mansoni genome landing page, and select “Gene expression”

2. We can see a summary of the different studies that have been conducted. We’re interested in life cycle stages, so select the first study “Schistosoma mansoni transcriptomics at different life stages”.

For each study, we have a summary of the conditions for which data is available. A detailed description of transcriptomic experiments is beyond the scope of this workshop, but for those who are interested, HISAT2 was used to align reads to the genome, HTSeq to quantify counts per gene and DESeq2 to compute differential expression per condition.

Several files are available for download. These are:

3. Download the full results files for the “Schistosoma mansoni transcriptomics at different life stages” “24-hour-schistosomule-vs-cercariae” experiment by clicking “Full result files for 3 contrasts (zipped)” and place it into the “Downloads” directory.

cd ~/Downloads

# Extract the compressed directory
unzip ERP000427.de.contrasts.zip

# move inside the results directory
cd ERP000427.de.contrasts

# have a look at the 24-hour-schistosomule-vs-cercariae file
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | less

Use some of the commands you learned yesterday to extract the following information from the “24-hour-schistosomule-vs-cercariae.tsv” file:

4. Extract the top 5 most significantly regulated genes (hint: the final column, “padj”, gives the adjusted p value. A smaller adjusted p value = more significant).

grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | grep -v "^gene_id" | sort -g -k 7,7 | awk -F'\t' '$7 != "NA"' | head -n 5

5. Of the genes with an adjusted p-value that is less than 0.05, which is (a) most highly upregulated in the 24h schistosomules v the cercariae (b) most strongly upregulated in the cercariae v the 24h schistosomules?

# upregulated in the 24h schistosomules means tha Log2FoldChange (column 3) should be a positive number
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | grep -v "^gene_id" | awk -F'\t' '$7 != "NA" && $7 < 0.05 && $3 > 0' | sort -r -g -k 3,3 | head -n 10

# upregulated in the cercariate means tha Log2FoldChange (column 3) should be a negative number
grep -v "^#" 24-hour-schistosomule-vs-cercariae.tsv | grep -v "^gene_id" | awk -F'\t' '$7 != "NA" && $7 < 0.05 && $3 < 0' | sort -g -k 3,3 | head -n 10

Back to top


Performing Gene-set enrichment analysis

Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes.

image

Earlier, we talked about Gene Ontology (GO). “Gene Ontology” enrichment analysis is one of the most commonly used enrichment analyses.

Gene ontology is a formal representation of knowledge about a gene with respect to three aspects: Molecular Function, Cellular Component and Biological Process.

Gene Ontology (GO) enrichment analysis answers the very simple question: “What biological processes, molecular functions, and cellular components are significantly associated with a set of genes or proteins?”

For example, we can take the list of genes we identified as significantly upregulated in cercariae vs 24h-schistosomules and try to identify what are the biological processes, cellular components and molecular functions that are implicated in this developmental stage comparison.

Instead of manually trying to identify which genes in your list of differentially expressed genes have similar biological processes, cellular component and molecular functions, the GO enrichment analysis does it for you. More specifically, it clusters the genes into gene ontologies group, performs statistical analysis and shows you the most significantly overepressented ontologies!

So basically a GO enrichment analysis takes a list of gene identifiers like this:

and organises them to Gene Ontology terms (GO):



WormBase ParaSite offers a tool that performs this kind of analysis: g:Profiler that can be accessed from the “Tools” page:

Back to top


Gene-set enrichment analysis exercise

Use the 24-hour-schistosomule-vs-cercariae.tsv from the previous section and print a list of genes with an adjusted p-value that is less than 0.05, which are most strongly upregulated in the cercariae v the 24h schistosomules.

  1. Use gProfiler and perform a Gene-set enrichment analysis for these 40 genes from the “Schistosoma mansoni (PRJEA36577)” organism.

  2. Which are the 3 most significantly enriched Cellular Component terms? Are they relevant to this developmental stage comparion we’re performing?

  3. Expand the stats by using the “»” in the header of the GO:CC result table. Try to interpret the T, Q, TnQ and U metrics. What do they represent?

    Help! You can read more here: https://biit.cs.ut.ee/gprofiler/page/docs
    * T - Term Size: How many S. mansoni genes are in general associated with this term.
    * Q - Query Size: The number of genes in our gene list (the one we run the analysis for). In our case this number should theoretically be 40, however it is 14, why?
    * TnQ - Overlap Size: How many genes from our list are associated with this term.
    * U - Total number of S. mansoni genes.

Back to top