Project ideas

The following are ideas for Honours projects. If you have your own idea then feel free to discuss that as well. If you have an idea that you want to work on, it would be useful to describe it along the lines of the projects below, i.e., what is the idea, what questions do you want to ask, what is the outcome, and what skills are needed.

Note that some of these topics have been studied in past years; I have added links to those projects so you can get an idea of what a project in this area looks like.

My main research interests are exploring and visualising information of biodiversity. I am interested in maps, evolutionary trees, taxonomy, museums, artificial intelligence, machine learning, etc.

Where are the new species being found?

My BioRSS project monitors journals for new species, and classifies the papers by taxonomic group (e.g., "Insecta") and country (e.g., "China").

Questions

  1. In which country are the most species found?
  2. For which taxonomic group are the most species found?
  3. How many papers are actually about new species? Can we determine that from the title? This may require some machine learning.
  4. How many of the new species descriptions are Open Access?
  5. Where are the people who do taxonomy based? Are they in the same country as the new species?
  6. Who is funding taxonomy?
  7. How do these results compare to 2022?

Outcomes

A survey of the new species found in 2023, their geographic and taxonomic distribution, and an understanding of who is doing that research, where they are based, and who funds them.

Skills

The data will be provided and will need to be analysed with standard techniques (spreadsheets and/or R).

Have any students done a project like this before?

Yes, in 2022: Where Are the New Species?: Taxonomic Discoveries in 2022.

Here are some notes made for last year's version of this project:

The BioRSS project is described here: https://github.com/rdmpage/biorss There is also a blog post about it here: https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html

BioRSS regularly harvests papers from journals that publish new species descriptions, as well as Google Scholar searches for terms such as “new species” and “n. sp.”. To harvest from journals it uses RSS feeds (see https://en.wikipedia.org/wiki/RSS ), a common way for journals to make lists of new articles available to tools such as feed readers.
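As a rough sketch of what harvesting such a feed involves, the snippet below parses an RSS 2.0 document with Python's standard library and flags items whose titles suggest a new species description. The feed XML here is invented for illustration, not a real journal feed:

```python
import xml.etree.ElementTree as ET

# Invented example of an RSS 2.0 feed from a journal.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <item>
      <title>A new species of Anopheles from Borneo</title>
      <link>https://example.org/article/1</link>
    </item>
    <item>
      <title>Range extension of a known beetle</title>
      <link>https://example.org/article/2</link>
    </item>
  </channel>
</rss>"""

def new_species_items(feed_xml):
    """Return (title, link) pairs for items whose title suggests a new species."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if "new species" in title.lower() or "n. sp." in title.lower():
            hits.append((title, link))
    return hits
```

A real harvester would fetch the feed over HTTP and store hits in a database, but the filtering step is essentially this simple string match on titles.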

BioRSS was inspired by uBioRSS. Unlike uBioRSS, it also has the ability to “geolocate” papers based on their titles and abstracts using the “Glasgow geoparser”, see https://github.com/rdmpage/glasgow-geoparser . It also uses a tool to find taxonomic names in text, and then maps those to taxonomic names in GBIF (https://www.gbif.org). Hence for each paper we can (ideally) place it on a map and say what taxa it is about.
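GBIF exposes a public name-matching service at https://api.gbif.org/v1/species/match that takes a scientific name and returns the matching backbone taxon. The helper below is a small sketch of that lookup: it builds the request URL and pulls a few fields from a parsed JSON response (the sample response here is abbreviated and illustrative, not a live API result):

```python
from urllib.parse import urlencode

GBIF_MATCH = "https://api.gbif.org/v1/species/match"

def match_url(name):
    """Build a GBIF name-matching request URL for a scientific name."""
    return GBIF_MATCH + "?" + urlencode({"name": name})

def summarise_match(response):
    """Pull the fields of interest from a GBIF match response (a parsed JSON dict)."""
    return {
        "scientificName": response.get("scientificName"),
        "rank": response.get("rank"),
        "usageKey": response.get("usageKey"),
    }

# Abbreviated, illustrative example of the kind of JSON the service returns:
sample = {"usageKey": 212, "scientificName": "Aves", "rank": "CLASS", "matchType": "EXACT"}
```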

For this project, data was collected for the year 2022, and this data can be filtered by the country mentioned in the paper (e.g., “China”), or by taxonomic group (e.g., “Aves”).

To understand more about the origins of the publications for 2022 we can use web services that take the DOI for an article and extract additional information. CrossRef.org has detailed metadata for articles with DOIs, which can include things such as the ORCID iD for one or more authors, and a DOI for the funding agency that supported the work. For each DOI, software retrieved the metadata from CrossRef, for example http://api.crossref.org/works/10.1515/pjen-2017-0018 (need a better DOI to use here, one from the data files I sent might be a better example).
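CrossRef's REST API returns a JSON "message" object for each DOI, and "author", "ORCID", "funder", and "DOI" are standard fields in that record. The helper below is a sketch that pulls any author ORCIDs and funder DOIs out of such a record; the sample record itself is abbreviated and invented:

```python
def authors_and_funders(message):
    """Extract ORCID iDs and funder DOIs from a CrossRef 'message' dict."""
    orcids = [a["ORCID"] for a in message.get("author", []) if "ORCID" in a]
    funder_dois = [f["DOI"] for f in message.get("funder", []) if "DOI" in f]
    return orcids, funder_dois

# Abbreviated, invented example of a CrossRef record (the ORCID shown is
# ORCID's own well-known example iD):
record = {
    "DOI": "10.1234/example",
    "author": [
        {"family": "Smith", "ORCID": "http://orcid.org/0000-0002-1825-0097"},
        {"family": "Jones"},
    ],
    "funder": [{"name": "Example Research Council", "DOI": "10.13039/501100000001"}],
}
```

Note that not every author has an ORCID and not every funder entry has a DOI, which is why both extractions skip records lacking those keys.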

If ORCIDs were found, the web site https://orcid.org was queried for information on the author, specifically the country in which the author was based. This is not available for all authors, so only a subset of the data could be used for this. For those authors that had country information, it was possible to compare the country the paper was about with the country the researchers were based in. Might be a good idea to show an ORCID record as an example (one of the files I sent will have lots of these).
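ORCID's public API exposes a researcher's addresses; the field layout assumed below (person → addresses → address[] → country → value) follows the v3.0 public API as I understand it, so treat it as an assumption to verify against a real record. The sample record is invented:

```python
def author_country(person):
    """Return the first country code in an ORCID 'person' record, or None.

    Assumed ORCID v3.0 public API layout:
    person -> addresses -> address[] -> country -> value
    """
    addresses = person.get("addresses", {}).get("address", [])
    for addr in addresses:
        country = addr.get("country", {}).get("value")
        if country:
            return country
    return None

# Abbreviated, invented example record:
person = {"addresses": {"address": [{"country": {"value": "GB"}}]}}
```

Returning None for records without a country is what makes it easy to count how large the usable subset actually is.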

For funders, CrossRef.org often has a DOI for the funding agency. This DOI was used to look up information on the funder in Wikidata, such as the country where the funder was based. This enables a comparison between the geography of the paper, the authors, and the funders.

A lot of scientific literature is behind a paywall, and hence not readily accessible to readers who aren’t associated with a major institution such as a university. You could mention SciHub here as one attempt to open all science up, see for example https://doi.org/10.7554/eLife.32822

The Unpaywall project tries to find a legal, free-to-read version of a paper. For an Open Access journal this may be on the publisher's website; for others it might be in a repository. To determine whether a paper was open access, the Unpaywall service was queried for each DOI. The number of accessible versus non-accessible papers was counted and compared across countries.
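Unpaywall's API returns, per DOI, a JSON record that includes an "is_oa" flag. The sketch below tallies open-access versus paywalled records per country from pre-fetched records; the country attribution is whatever the project's own geolocation step produced, and the records here are invented:

```python
from collections import Counter

def oa_counts_by_country(records):
    """Count open-access vs closed papers per country.

    Each record is a dict with 'country' (from the project's geolocation step)
    and 'is_oa' (the flag Unpaywall returns for a DOI).
    """
    counts = Counter()
    for rec in records:
        status = "oa" if rec.get("is_oa") else "closed"
        counts[(rec["country"], status)] += 1
    return counts

# Invented example records:
records = [
    {"country": "China", "is_oa": True},
    {"country": "China", "is_oa": False},
    {"country": "Brazil", "is_oa": True},
]
```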

Lots of things to think about, such as the requirement that papers be online, ideally in a journal that has an RSS feed or can be searched by Google Scholar, otherwise the papers won’t be found. Some of the papers included will be for new fossil species, not living ones, and not all papers will be on new species. There can also be errors in geolocating papers. For example, if a paper mentions “India” it will be geolocated to that country, but if it mentions “Indian” then it won’t.

Patrick R. Leary, David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109

BBC Wildlife meets ChatGPT

A while ago the BBC had a rich web site full of details about species that featured in their wildlife programmes. This site is now dead, but I grabbed the data and made a crude demo here. Given growing interest in the use of artificial intelligence I would like to see if we can use ChatGPT to create a useful guide to the biology of species covered by the BBC.

Questions

  1. For the BBC species how reliable is ChatGPT in providing information?
  2. Can this reliability be improved by adding data from the BBC web site (e.g., on diet, location)?

Approach

Get a list of species from the BBC site, and a list of their features (e.g., ecology, behaviour, etc.). Develop queries to ask ChatGPT (e.g., "is species x a predator?") and measure how accurate the answers are. Then seed ChatGPT with the BBC data, ask the questions again, and measure the change in accuracy.
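The accuracy measurement can be as simple as scoring model answers against a hand-checked answer key, before and after seeding with BBC data. A minimal sketch (the questions and answer sets are invented for illustration):

```python
def accuracy(answers, key):
    """Fraction of questions answered correctly against a hand-checked key."""
    correct = sum(1 for q, a in answers.items() if key.get(q) == a)
    return correct / len(key)

# Invented example: yes/no questions about two species, with a baseline
# answer set (model alone) and a seeded one (model plus BBC data).
key      = {"Is the aye-aye nocturnal?": "yes", "Is the kakapo a predator?": "no"}
baseline = {"Is the aye-aye nocturnal?": "yes", "Is the kakapo a predator?": "yes"}
seeded   = {"Is the aye-aye nocturnal?": "yes", "Is the kakapo a predator?": "no"}
```

Comparing `accuracy(baseline, key)` with `accuracy(seeded, key)` gives the change in accuracy the project wants to measure.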

Outcomes

The outcome would be a measure of how accurate ChatGPT is by itself (e.g., how many facts does it get wrong?), and a measure of whether adding BBC data improves it. This could potentially be used as the basis of a quiz (e.g., "which of these species are monogamous?"). It could also be extended to other sources: for example, can we add information from Wikipedia (in multiple languages), or from scientific papers?

Skills

ChatGPT is easy to use; we would need to extract the BBC data into a more useful form so that it can be fed to ChatGPT (I can do that). No programming is needed, but if you have programming skills you could explore further applications, such as making a Q&A site.

Have any students done a project like this before?

Not yet (2024).

How can Uber help us map biodiversity?

GBIF provides (literally) billions of observations on where species are found, but the maps based on this data are highly biased, and display numbers of observations, not numbers of species. Hence GBIF doesn't tell us where the greatest species richness is, just where the most data comes from.

There are interesting techniques for taking uneven sampling into account, such as ES50 (see Exploring es50 for GBIF) that should be explored further. There is also a new approach to standardising geographic data called "hex tiles" (see Your Guide to Our Next-Gen Geospatial Tile System), which covers the planet in a grid of equally sized hexagons (and a few pentagons). There are different ways to do this, but the system that seems to be most popular is based on work by Uber (yes, that Uber).

Questions

  1. Can we build a biodiversity map that shows numbers of species not numbers of observations?
  2. How easy is it to build a map using H3?
  3. Can we compare diversity across taxa (e.g., mammals versus butterflies), or based on different data sets (e.g., DNA barcodes)?
  4. Can we measure the amount of sampling in each area?

Approach

This study will be data intensive, so it is likely to start with a smallish dataset to explore how easy it is to assign observations to H3 tiles, and what level of tile works best (you can explore H3 tiles here). Once different datasets have been mapped onto the same set of tiles we can compare them (for example, is the number of species in each tile in one taxon correlated with those in another?).
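In practice an H3 library would assign each observation to a hexagonal cell; the sketch below instead uses a crude rounded lat/lon grid as a stand-in for H3 cells, purely to show the shape of the computation: counting distinct species per tile rather than observations. The observation records are invented:

```python
from collections import defaultdict

def tile_id(lat, lon, size=5.0):
    """Crude square-grid stand-in for an H3 cell index (snap to a degree grid)."""
    return (round(lat / size) * size, round(lon / size) * size)

def richness_per_tile(observations, size=5.0):
    """Map each tile to its number of distinct species (not observations)."""
    species_by_tile = defaultdict(set)
    for species, lat, lon in observations:
        species_by_tile[tile_id(lat, lon, size)].add(species)
    return {tile: len(spp) for tile, spp in species_by_tile.items()}

# Invented observations: (species, lat, lon)
obs = [
    ("Vipera berus", 55.9, -4.3),
    ("Vipera berus", 55.8, -4.2),
    ("Anguis fragilis", 55.9, -4.3),
]
```

Swapping `tile_id` for a real H3 cell index would leave the rest of the computation unchanged, which is the appeal of doing the binning first.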

Outcomes

The outcome would be interactive maps of biodiversity (e.g., showing species richness for reptiles or other groups). There would also be comparisons between taxa, and comparisons between sampling effort (number of observations) and diversity (number of species).

Skills

If you are familiar with R there are packages that support H3. Otherwise we would need code to get data from GBIF and sort it into tiles.

Have any students done a project like this before?

Not yet (2024). However there is some overlap with the project Mapping the global distribution of phylogenetic diversity using DNA barcodes.

Mapping taxonomy: where are the museums and where are the taxonomic journals?

Taxonomy is a science that often feels embattled and underappreciated, despite its importance in cataloguing biodiversity. We lack a lot of basic detail about the state of taxonomy, in particular where it is done and where it is published. Is it concentrated in rich countries, or is it more widely spread? Is it mostly closed and behind a paywall, or is a lot of it open access? How does the distribution of taxonomic journals compare to other disciplines? See, e.g., Recalibrating the scope of scholarly publishing: A modest step in a vast decolonization process.

Questions

  1. Where in the world are the major natural history collections?
  2. Where in the world are the major taxonomy journals?
  3. Is the distribution of museums, herbaria, and journals correlated with each other?
  4. Is the distribution of museums, herbaria, and journals correlated with biodiversity? (i.e., do biologically rich countries also have museums and taxonomic journals?)
  5. How many taxonomy journals are open access?
  6. What languages is taxonomic work published in?

Outcomes

A global overview of the distribution of museums and herbaria, the distribution of taxonomic journals, a measure of how many are open access, and what languages they publish in.

Skills

Most of the data for this study would come from Wikidata. There is likely to be a lot of missing data, so this would need to be added. Wikidata has a powerful query language that can be used to find museums, plot maps, etc.
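Wikidata's SPARQL endpoint (https://query.wikidata.org/sparql) accepts queries like the one sketched below, which would list natural history museums with their coordinates. The item QID used for "natural history museum" is my best guess and should be checked on Wikidata; P31/P279* is "instance of (or subclass of)" and P625 is "coordinate location":

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# wd:Q2772772 ("natural history museum") is a best guess; verify on Wikidata.
MUSEUM_QUERY = """
SELECT ?museum ?museumLabel ?coord WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q2772772 .
  ?museum wdt:P625 ?coord .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def query_url(sparql):
    """Build a GET URL asking the endpoint to return JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
```

The same pattern (swap the QID, add a country property such as P17) would cover herbaria and journals, which is what makes Wikidata attractive for this project.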

Have any students done a project like this before?

Not yet (2024).

Where is species discovery happening and who is funding it?

Every year new species are being discovered. Where are these discoveries happening? Are discoveries being made by researchers based in those countries, or by researchers elsewhere? Who is funding that research? Are these species descriptions available to anyone (i.e. open access) or are they behind paywalls? Open access publications are more likely to be accessible by “citizen scientists” using iNaturalist or editing Wikipedia.

You can choose a taxonomic group you are interested in, although it is important to check that there is enough data to do something useful. Possible groups include reptiles, fish, insects, etc.

Questions

  1. Where are the most species in this group being discovered?
  2. Is research in a country mostly funded by that country, or is it externally funded?
  3. Do researchers describing new species mostly come from that country?
  4. Where are the funders of taxonomy based?
  5. How much of the work is open access?

Outcomes

The project will provide answers to the questions above. There is scope for interesting visualisations such as maps and Sankey diagrams.

Skills

It would be useful to be able to manage data using a database such as SQLite. This is a fairly easy program to use: if you can use Excel or Google Docs you can use SQLite. Getting additional data may require programming; I can help with this.
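A minimal sketch of the kind of SQLite work involved, using Python's built-in sqlite3 module with an invented table of papers, counting descriptions per country:

```python
import sqlite3

# In-memory database with an invented 'papers' table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE papers (doi TEXT, country TEXT, is_oa INTEGER)")
con.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("10.1/a", "Brazil", 1),
        ("10.1/b", "Brazil", 0),
        ("10.1/c", "India", 1),
    ],
)

def papers_per_country(connection):
    """Return (country, paper count) pairs, most prolific first."""
    rows = connection.execute(
        "SELECT country, COUNT(*) FROM papers GROUP BY country ORDER BY COUNT(*) DESC"
    )
    return rows.fetchall()
```

The same GROUP BY pattern answers most of the questions above (counts by funder, by author country, by open-access status), which is why a small database beats a spreadsheet once the data gets joined across sources.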

Have any students done a project like this before?

Yes, in 2023 an Honours student did this for amphibians, you can read her project here: Trends of amphibian species discovery: an overview of the geographical and social factors affecting amphibian taxonomy between 2000-2019.

How accurate is text mining scientific papers?

Quality control in biodiversity is a major topic of concern. We have global databases that combine data from museums, herbaria, citizen science projects like iNaturalist, and data extracted from scientific publications. This data is used to infer species distributions, both now and in the future (e.g., under models of climate change). How reliable is this data?

The Plazi project generates a lot of data on where species are found by extracting it from published papers. This data is uploaded into GBIF and is downloaded and cited a lot. But how accurate is it?

The Plazi web site is http://plazi.org. For some background on how accurate Plazi is see Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?. There is also a simple testing tool here.

Questions

  1. Is there any difference between accuracy for plants and animals?
  2. What sort of errors occur?
  3. How many downloads might be affected?
  4. Does Plazi duplicate existing data in GBIF, and if so, does it recreate the data correctly?

Outcomes

This project should generate a measure of the accuracy of Plazi, and whether that accuracy varies by taxonomic group (e.g., plants versus animals), journals, etc.

Skills

Initially just the ability to use online tools to make comparisons. The ability to program would be useful if the project drills down into further detail, I could help with that.

Have any students done a project like this before?

No (2024).