The following are ideas for Honours projects. If you have your own idea then feel free to discuss that as well. If you have an idea that you want to work on, it would be useful to describe it along the lines of the projects below, i.e., what is the idea, what are the questions you want to ask, what is the outcome, and what skills are needed.
Note that some of these topics have been studied in past years; I have added links to those projects so you can get an idea of what a project in this area looks like.
My main research interests are exploring and visualising biodiversity information. I am interested in maps, evolutionary trees, taxonomy, museums, artificial intelligence, machine learning, etc.
My BioRSS project monitors journals for new species, and classifies the papers by taxonomic group (e.g., "Insecta") and country (e.g., "China").
The data will be provided and will need to be analysed with some standard techniques (spreadsheets and/or R).
Here are some notes made for last year's version of this project:
The BioRSS project is described here: https://github.com/rdmpage/biorss There is also a blog post about it here: https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html
BioRSS regularly harvests papers from journals that publish new species descriptions, as well as Google Scholar searches for terms such as “new species” and “n. sp.”. To harvest from journals it uses RSS feeds (see https://en.wikipedia.org/wiki/RSS ), a common way for journals to make lists of new articles available to tools such as feed readers.
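To give a flavour of what harvesting a feed involves, here is a minimal sketch (using only the Python standard library) of parsing RSS 2.0 item titles and links. The sample feed and its contents are made up for illustration; a real harvester would fetch the journal's feed URL and deal with many more fields.

```python
# Minimal sketch of RSS harvesting: parse the feed XML and pull out each
# item's title and link. SAMPLE_FEED is an invented example feed.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example journal</title>
  <item><title>A new species of Aus</title><link>https://example.org/1</link></item>
  <item><title>Revision of Bus</title><link>https://example.org/2</link></item>
</channel></rss>"""

def parse_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

# In practice the XML would come from the journal's feed URL via urlopen().
items = parse_items(SAMPLE_FEED)
```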
BioRSS was inspired by uBioRSS. Unlike uBioRSS, it also has the ability to “geolocate” papers based on their titles and abstracts using the “Glasgow geoparser” (see https://github.com/rdmpage/glasgow-geoparser ). It also uses a tool to find taxonomic names in text, and then maps those to taxonomic names in GBIF (https://www.gbif.org). Hence for each paper we can (ideally) place it on a map and say what taxa it is about.
For this project data was collected for the year 2022, and this data can be filtered by country mentioned by the paper (e.g., “China”), or taxonomic group (e.g., “Aves”).
To understand more about the origins of the publications for 2022 we can use web services that take the DOI for an article and extract additional information. CrossRef.org has detailed metadata for articles with DOIs, which can include things such as the ORCID iD for one or more authors, and a DOI for the funding agency that supported the work. For each DOI, software retrieved the metadata from CrossRef, for example http://api.crossref.org/works/10.1515/pjen-2017-0018 (need a better DOI to use here, one from the data files I sent might be a better example).
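As a sketch of this step, the CrossRef REST API returns JSON where authors may carry an "ORCID" key and funders a "DOI" key; the code below builds the API URL and pulls those fields out of the "message" object (the network call is shown commented out so the example is self-contained).

```python
# Sketch: fetch CrossRef metadata for a DOI and extract author ORCIDs and
# funder DOIs. Uses the public CrossRef REST API (api.crossref.org).
import json
from urllib.request import urlopen

def crossref_url(doi):
    return "https://api.crossref.org/works/" + doi

def extract_orcids_and_funders(message):
    """Pull author ORCID iDs and funder DOIs out of a CrossRef 'message' dict.
    Both fields are optional, so missing keys are simply skipped."""
    orcids = [a["ORCID"] for a in message.get("author", []) if "ORCID" in a]
    funder_dois = [f["DOI"] for f in message.get("funder", []) if "DOI" in f]
    return orcids, funder_dois

# Real usage (needs network access):
# record = json.load(urlopen(crossref_url("10.1515/pjen-2017-0018")))
# print(extract_orcids_and_funders(record["message"]))
```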
If ORCIDs were found, the web site https://orcid.org was queried for information on the author, specifically the country in which the author was based. This is not available for all authors, so only a subset of the data could be used. For those authors that had country information, it was possible to compare the country the paper was about with the country the researchers were based in. Might be a good idea to show an ORCID record as an example (one of the files I sent will have lots of these).
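A sketch of this lookup, assuming the public ORCID API (pub.orcid.org, v3.0) and its /addresses endpoint; the JSON shape handled by parse_countries is an assumption based on the v3.0 schema, and country codes may be absent for many researchers.

```python
# Sketch: read an author's country from the public ORCID API.
# The response is assumed to look like {"address": [{"country": {"value": "GB"}}]}.
import json
from urllib.request import Request, urlopen

def orcid_addresses_url(orcid_id):
    return "https://pub.orcid.org/v3.0/%s/addresses" % orcid_id

def parse_countries(addresses_json):
    """Return ISO country codes from an ORCID /addresses response; may be empty."""
    return [a["country"]["value"]
            for a in addresses_json.get("address", [])
            if a.get("country")]

# Real usage (needs network access); 0000-0002-1825-0097 is ORCID's own
# documented sandbox example iD:
# req = Request(orcid_addresses_url("0000-0002-1825-0097"),
#               headers={"Accept": "application/json"})
# print(parse_countries(json.load(urlopen(req))))
```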
For funders, CrossRef.org often has a DOI for the funding agency. This DOI was used to look up information on the funder in Wikidata, such as the country where the funder was based. This enables a comparison between the geography of the paper, the authors, and the funders.
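One way to sketch the Wikidata step: Wikidata items carry DOIs in property P356 and country in P17, so a funder's country can be found with a SPARQL query against https://query.wikidata.org/sparql. The query below is an assumption about how the lookup might be written, not the exact query used (note that Wikidata stores DOIs upper-cased).

```python
# Sketch: build a SPARQL query that finds a Wikidata funder item by its DOI
# (P356) and returns the country (P17) where the funder is based.
from urllib.parse import urlencode

def funder_country_query(funder_doi):
    # Wikidata normalises DOIs to upper case, so do the same here.
    return """SELECT ?funder ?countryLabel WHERE {
  ?funder wdt:P356 "%s" .
  ?funder wdt:P17 ?country .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}""" % funder_doi.upper()

def sparql_url(query):
    """URL for the public Wikidata query service, asking for JSON results."""
    return "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"})
```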
A lot of scientific literature is behind a paywall, and hence not readily accessible to readers who aren’t associated with a major institution such as a university. You could mention SciHub here as one attempt to open all science up, see for example https://doi.org/10.7554/eLife.32822
The Unpaywall project tries to find a legally free-to-read version of a paper. For an Open Access journal this may be on the publisher's website; for others it might be in a repository. To determine whether a paper was open access, the Unpaywall service was queried for each DOI. The number of accessible versus non-accessible papers was counted, and compared across countries.
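A sketch of the query-and-count step: Unpaywall's public API is queried at /v2/{doi} (an email address is required as a parameter) and the response includes an "is_oa" flag; the per-country tallying below is our own bookkeeping, not part of the API.

```python
# Sketch: build an Unpaywall API URL for a DOI, and tally open access vs
# closed papers per country from (country, is_oa) pairs.
from collections import Counter

def unpaywall_url(doi, email):
    """Unpaywall requires an email address as a query parameter."""
    return "https://api.unpaywall.org/v2/%s?email=%s" % (doi, email)

def tally_by_country(papers):
    """papers: (country, is_oa) pairs -> Counter keyed by (country, is_oa)."""
    return Counter(papers)

# Example with invented data: two papers about China (one OA, one closed)
# and one OA paper about Brazil.
counts = tally_by_country([("China", True), ("China", False), ("Brazil", True)])
```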
Lots of things to think about, such as the requirement that papers be online, ideally in a journal that has a RSS feed or can be searched by Google Scholar, otherwise the papers won’t be found. Some of the papers included will be for new fossil species, not living ones, and not all papers will be on new species. There can also be errors in geolocating papers. For example, if a paper mentions “India” it will be geolocated to that country, but if it mentioned “Indian” then it won’t.
Patrick R. Leary, David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109
A while ago the BBC had a rich web site full of details about species that featured in their wildlife programmes. This site is now dead, but I grabbed the data and made a crude demo here. Given growing interest in the use of artificial intelligence I would like to see if we can use ChatGPT to create a useful guide to the biology of species covered by the BBC.
GBIF provides (literally) billions of observations on where species are found, but the maps based on this data are highly biased, and display numbers of observations, not numbers of species. Hence GBIF doesn't tell us where the greatest species richness is, just where the most data comes from.
There are interesting techniques for taking uneven sampling into account, such as ES50 (see Exploring es50 for GBIF) that should be explored further. There is also a new method for standardising geographic data called "hex tiles" (see Your Guide to Our Next-Gen Geospatial Tile System), which covers the planet in a grid of equally sized hexagons (and a few pentagons). There are different ways to do this, but the system that seems to be most popular is based on work by Uber (yes, that Uber).
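To make ES50 concrete: it is the expected number of species in a random draw of 50 occurrence records from a cell (Hurlbert's rarefaction), which corrects for cells having wildly different numbers of records. Below is a minimal sketch of the standard formula; GBIF's own implementation may differ in edge-case handling (here cells with fewer than 50 records just return their raw species count).

```python
# ES50 sketch (Hurlbert's rarefaction): the expected number of species seen
# in a random sample of n=50 occurrence records from one cell.
from math import comb

def expected_species(counts, n=50):
    """counts: number of occurrence records per species in one cell.
    E[S_n] = sum over species i of (1 - C(N - N_i, n) / C(N, n))."""
    N = sum(counts)
    if N < n:
        # Not enough records to rarefy; fall back to the raw species count.
        return float(len(counts))
    return sum(1 - comb(N - Ni, n) / comb(N, n) for Ni in counts)
```

For example, a cell with 100 species each recorded once has ES50 of exactly 50, while a cell with one species recorded 1000 times has ES50 of 1: heavy sampling of a species-poor cell no longer inflates its apparent richness.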
This study will be data intensive, so it is likely to start with a smallish dataset to explore how easy it is to assign observations to H3 tiles, and what level of tile works best (you can explore H3 tiles here). Once different datasets have been mapped onto the same set of tiles we can compare them (for example, is the number of species per tile in one taxon correlated with that in another?).
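A sketch of the comparison step, assuming each occurrence record has already been assigned an H3 cell id (for example with the `h3` Python package, where h3.latlng_to_cell(lat, lng, resolution) does the assignment in v4 of the API). The cell ids and species names below are invented for illustration.

```python
# Sketch: count distinct species per H3 cell, then pair up the counts for
# two taxa cell-by-cell, ready for a correlation test.
from collections import defaultdict

def species_per_cell(records):
    """records: (cell_id, species_name) pairs -> {cell_id: species count}."""
    cells = defaultdict(set)
    for cell, species in records:
        cells[cell].add(species)
    return {cell: len(spp) for cell, spp in cells.items()}

def paired_counts(taxon_a, taxon_b):
    """Counts for cells present in both taxa (cells seen in only one taxon
    are dropped); each pair is (count in a, count in b)."""
    shared = taxon_a.keys() & taxon_b.keys()
    return [(taxon_a[c], taxon_b[c]) for c in sorted(shared)]

# Invented example: birds and insects mapped onto the same cells.
birds = species_per_cell([("cell1", "Aus a"), ("cell1", "Aus b"), ("cell2", "Aus a")])
insects = species_per_cell([("cell1", "Bus x"), ("cell3", "Bus y")])
pairs = paired_counts(birds, insects)
```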
Taxonomy is a science that often feels embittered and underappreciated, despite its importance in cataloguing biodiversity. We lack a lot of basic details about the state of taxonomy, in particular where it is done, and where it is published. Is it concentrated in rich countries, or is it more widely spread? Is it mostly closed and behind a paywall, or is a lot of it open access? How does the distribution of taxonomic journals compare to other disciplines, see e.g. Recalibrating the scope of scholarly publishing: A modest step in a vast decolonization process.
Every year new species are discovered. Where are these discoveries happening? Are discoveries being made by researchers based in those countries, or by researchers elsewhere? Who is funding that research? Are these species descriptions available to anyone (i.e., open access) or are they behind paywalls? Open access publications are more likely to be accessible to “citizen scientists” using iNaturalist or editing Wikipedia.
You can choose a taxonomic group that you are interested in, although it is important to check that there is enough data to do something useful. Possible groups include reptiles, fish, insects, etc.
The project will provide answers to the questions above. There is scope for interesting visualisations such as maps and Sankey diagrams.
It would be useful to be able to manage data using a database such as SQLite. This is a fairly easy-to-use program; if you can use Excel or Google Docs you can use SQLite. Getting additional data may require programming; I can help with this.
Yes, in 2023 an Honours student did this for amphibians; you can read her project here: Trends of amphibian species discovery: an overview of the geographical and social factors affecting amphibian taxonomy between 2000-2019.
Quality control in biodiversity is a major topic of concern. We have global databases that combine data from museums, herbaria, citizen science projects like iNaturalist, and data extracted from scientific publications. This data is used to infer species distributions, both now and in the future (e.g., under models of climate change). How reliable is this data?
The Plazi project generates a lot of data on where species are found by extracting it from published papers. This data is uploaded into GBIF and is downloaded and cited a lot. But how accurate is it?
The Plazi web site is http://plazi.org. For some background on how accurate Plazi is see Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?. There is also a simple testing tool here.
This project should generate a measure of the accuracy of Plazi, and whether that accuracy varies by taxonomic group (e.g., plants versus animals), journals, etc.
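One simple way such an accuracy measure might be computed (a sketch, not Plazi's or the paper's actual method): for each hand-checked "gold" record, score the fraction of fields where the automatically extracted value matches, then average over records. The field names below are illustrative, not Plazi's actual schema.

```python
# Sketch of a per-field accuracy measure for extracted specimen records.
def field_accuracy(extracted, gold):
    """Fraction of gold-record fields where the extracted value matches."""
    fields = list(gold)
    matches = sum(1 for f in fields if extracted.get(f) == gold[f])
    return matches / len(fields)

def mean_accuracy(pairs):
    """pairs: (extracted, gold) record pairs -> average per-record accuracy.
    Grouping pairs by taxon or journal before averaging gives the
    per-group comparisons the project is after."""
    return sum(field_accuracy(e, g) for e, g in pairs) / len(pairs)
```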
Initially just the ability to use online tools to make comparisons. The ability to program would be useful if the project drills down into further detail, I could help with that.