Teaching Online

Masters project ideas

This is where I will collect project ideas. My main research interests are exploring and visualising information on biodiversity. I am interested in maps, evolutionary trees, DNA barcoding, taxonomy, museum collections, digital libraries, artificial intelligence, etc.

DNA barcoding

DNA barcoding is a popular technique for identifying species; for animals the barcode is typically a 600-700 base pair sequence of the COX1 gene. There are millions of such sequences publicly available in the Barcode of Life Data System (BOLD). However, we lack simple tools to explore this data. Specifically, we don't have a simple tool that takes a sequence and returns a tree for a set of similar sequences (e.g., something like BLAST), nor do we have a global alignment for all the sequences (hence people have to align sequences every time they want to analyse the data).

These two topics (alignment-free sequence search and global sequence alignment) could be treated as separate projects, or as different aspects of the same project, depending on time and resources.

Project (a) Alignment-free sequence search and tree construction

Given a DNA barcode sequence it would be useful to be able to quickly query the database of barcodes for similar sequences, and to construct a tree for those sequences. At present we can query GenBank using BLAST, or BOLD using the Identification Engine, but these tools are slow and don't display a phylogeny (at least not straight away).

Can we develop a quick way to explore DNA barcodes, for example using alignment-free methods? I have experimented with this approach using Elasticsearch to store a limited number of sequences and searching them using "n-grams" (i.e., k-mers), see DNA barcode browser. However, this approach does not seem to scale well. Recent developments in AI have led to a lot of interest in vector databases, and given that k-mers define a vector for a sequence (i.e., the frequencies of each k-mer), perhaps we can use vector databases to speed up sequence search? I describe some experiments in the blog post "Sub-second searching of millions of DNA barcodes using a vector database". Those experiments use the PostgreSQL database, which can support vectors. More recently I have experimented with Elasticsearch, which also supports vector searching.
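To make the idea of a k-mer vector concrete, here is a minimal sketch in Python. It is an illustration only, not the code used in the experiments above: the function names, the default k = 3, and the choice of cosine similarity as the comparison are all assumptions. A vector database would store one such vector per barcode and return the nearest neighbours of a query vector.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_vector(seq, k=3):
    """Return the relative frequency of every possible k-mer in seq.

    The result always has 4**k entries, in a fixed (lexicographic) order,
    so vectors from different sequences are directly comparable.
    k-mers containing ambiguity codes such as N are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With k = 3 each sequence becomes a 64-dimensional vector; larger values of k give more discriminating but much longer vectors (4 to the power k entries), which is one of the trade-offs the project could explore.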

Tools

Managing the data will require basic scripting skills to read and process large CSV and FASTA files, hence the ability to use a language such as PHP, Python, or Perl would be very useful. Tree construction will require familiarity with command line tools for building phylogenies. The project also requires the ability to install and use databases such as PostgreSQL and Elasticsearch, and/or the ability to use online vector databases. For PostgreSQL and Elasticsearch, experience with Docker would be useful.
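As an indication of the level of scripting involved, the following is a minimal sketch of a streaming FASTA reader in Python. The function name is hypothetical, and the sketch assumes plain, well-formed FASTA input; it yields one record at a time so that large files need not be loaded into memory.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    `lines` can be an open file handle or any iterable of strings.
    Sequences split over several lines are joined; blank lines are skipped.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

Typical use would be to open a file and loop over the records, e.g. `with open("barcodes.fasta") as handle: for name, seq in read_fasta(handle): ...`, where the file name is a placeholder.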

Output

The key question is whether vector databases can support simple sequence queries, whether they return results consistent with other approaches, and whether they offer any performance advantage over existing methods.
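One simple way to quantify "consistent with other approaches" is to compare the top hits returned by two methods for the same query. The sketch below is an assumption about how this might be measured, not an established protocol for this project: it treats the hits from a reference method (for example a BLAST search) as the ground truth and reports what fraction of its top k hits the candidate method also returns in its top k.

```python
def recall_at_k(reference_hits, candidate_hits, k=10):
    """Fraction of the reference method's top-k hits found in the candidate's top-k.

    Both arguments are lists of sequence identifiers, best hit first.
    Returns 0.0 if the reference list is empty.
    """
    reference = set(reference_hits[:k])
    if not reference:
        return 0.0
    return len(reference & set(candidate_hits[:k])) / len(reference)
```

Averaging this value over many query sequences would give a single figure for how closely the two search methods agree.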

Extras

I have some preliminary code for a web site that can provide an interface to a sequence search engine (written in PHP and JavaScript), so that could be the starting point for a web interface to the results.

Reading

Project (b) A global alignment of all DNA barcodes

Regardless of the potential of alignment-free methods, many analytical tools still require aligned sequences. Typically a user of BOLD will download sequences, align them, and do an analysis. The more users that do this, the more the same costly step (sequence alignment) will be repeated, wasting effort. Can we avoid this by providing a single, global alignment of DNA barcodes?

There are two problems to tackle here. The first is how can we align millions of sequences? The second is how do we update that alignment as new sequences are added? One approach to building the alignment is divide and conquer. For example, split the sequences into taxonomic groups, choose one or more representative sequences for each group (for example, sequences from complete mitochondrial genomes), construct a local alignment, then assemble larger alignments from the smaller alignments (e.g., using profile alignments). A complementary approach might be to first assemble an alignment of representative sequences (e.g., one per family) and use that to align the remaining sequences.
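The first two steps of the divide and conquer approach (splitting by taxonomic group and choosing representatives) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, each record is assumed to be an (identifier, taxon, sequence) tuple, and the longest sequence in each group is used as the representative purely for simplicity (the text above suggests sequences from complete mitochondrial genomes as a better choice). The alignment steps themselves would be done by an external alignment tool and are not shown.

```python
from collections import defaultdict


def group_by_taxon(records):
    """Split (seq_id, taxon, sequence) records into one list per taxon."""
    groups = defaultdict(list)
    for seq_id, taxon, sequence in records:
        groups[taxon].append((seq_id, sequence))
    return dict(groups)


def pick_representatives(groups, per_group=1):
    """Choose representative sequence ids for each taxon.

    Assumption for this sketch: the longest sequences are the best
    representatives. Ties are broken by identifier so the result is stable.
    """
    representatives = {}
    for taxon, members in groups.items():
        ranked = sorted(members, key=lambda m: (-len(m[1]), m[0]))
        representatives[taxon] = [seq_id for seq_id, _ in ranked[:per_group]]
    return representatives
```

Each group could then be aligned separately, and the representatives aligned to each other to provide the scaffold onto which the group alignments are merged.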

Tools and skills

Managing the data will require basic scripting skills to read and process large CSV and FASTA files, hence the ability to use a language such as PHP, Python, or Perl would be very useful. Sequence alignment will require familiarity with using command line tools to align sequences. The chosen alignment tool will need to be able to align protein-coding sequences, and ideally would support "profile" alignments.

Output

The key output would be the alignment, which could be published in a data repository such as Zenodo. It would also be useful to think about developing a tool that could return an alignment for a set of taxa. This could be implemented as a simple web site.

Given the number of sequences involved, it may be prudent to aim for an alignment of a subset of sequences as a proof of concept.

Extras

Depending on the speed with which the alignment can be assembled, it might be interesting to cluster the sequences to see if they support the BINs described in the paper "A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System".
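As a starting point for such a comparison, here is a minimal sketch of threshold-based clustering of aligned sequences in Python. It is a naive single-linkage clustering on uncorrected pairwise distances and is not the method used to define BINs; the function names and the 2% default threshold are assumptions chosen for illustration. It compares every pair of sequences, so it is only practical for small subsets.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap (-) or N are ignored.
    Returns 1.0 if no sites can be compared.
    """
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 1.0


def single_linkage_clusters(seqs, threshold=0.02):
    """Cluster aligned sequences: two sequences share a cluster if they are
    linked by a chain of pairs whose distance is <= threshold.

    `seqs` maps sequence id -> aligned sequence. Returns a sorted list of
    clusters, each a sorted list of ids.
    """
    ids = sorted(seqs)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The resulting clusters could then be compared with the BIN assigned to each sequence in BOLD, for example by counting how many clusters contain sequences from exactly one BIN.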

Reading