Throughout the Fall of 2021 and the Spring of 2022, Virus Hunters’ approaches to sequence analysis matured, and by June we were just beginning to write code in R to automate some of our analysis. Right before I left for vacation, I managed to cobble together a little code that would tally the number of mutations or amino acid changes in a set of genomes according to which viral gene each one falls in. We can download ALL of the mutations for any of these viruses from Nextclade in CSV format and edit/clean those up with Excel. At the time I had a test set of only about 11 viruses. It worked, but I wasn’t sure how it would go with hundreds, let alone thousands, of genomes.
Last night I tried more: I had my little code-jalopy tackle 200 BA.5.2.1 genomes from Georgia, taken from patients between August 30 and September 7, 2022, and it worked!
Ignore the first “0” – R hates data with empty cells, so I entered a zero in all of the 1,467 empty cells (these viruses have different numbers of mutations, and thus there are empty cells in the file). To read the output: going from left to right, there are 602 amino acid changes among these 200 viruses in the viral M gene, 1,241 changes in the viral N gene, and so on.
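For anyone curious what a per-gene tally like this looks like in code, here is a minimal sketch. It is written in Python rather than the R I actually used, and it is not my script. It assumes the amino acid changes sit in a single column (called `aaSubstitutions` here) as comma-separated strings of the form `GENE:change`, and that the file is semicolon-delimited; check both against your own download. The sample rows are made up for illustration, and the `tally_by_gene` name is hypothetical.

```python
import csv
import io
from collections import Counter


def tally_by_gene(rows, column="aaSubstitutions"):
    """Count amino acid changes per gene across all genomes.

    Assumes each cell holds comma-separated entries like "N:R203K",
    where the text before the colon is the gene name.
    """
    counts = Counter()
    for row in rows:
        cell = (row.get(column) or "").strip()
        for change in cell.split(","):
            change = change.strip()
            # Skip blanks and any "0" placeholder used to fill empty cells.
            if not change or change == "0" or ":" not in change:
                continue
            gene = change.split(":", 1)[0]
            counts[gene] += 1
    return counts


# Tiny made-up sample standing in for a real export (not real data).
sample = io.StringIO(
    "seqName;aaSubstitutions\n"
    "genome_1;M:Q19E,N:R203K,S:D614G\n"
    "genome_2;N:R203K,S:D614G\n"
    "genome_3;\n"
)
rows = list(csv.DictReader(sample, delimiter=";"))
totals = tally_by_gene(rows)

for gene in sorted(totals):
    print(gene, totals[gene])
```

Because the counting works on a list of changes per genome instead of a fixed grid of cells, genomes with different numbers of mutations do not need any padding.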
My short-term goal here is to make an analysis pipeline that would allow us to do regular updates on viral sequences collected recently from [name your region, although Georgia is a good start]. This isn’t extremely high-level science, but I will bet you that there are no tools currently available that show you this information for Georgia.
One idea I have is to take that kind of information and regularly update it on a website as sort of an advertisement of “look what Perimeter College students can do!”. This certainly isn’t the only analysis that could be done; it’s just the easiest one I could think of. I’m imagining something along the lines of:
Now for the problems
- I wouldn’t call this a pipeline yet, because at the beginning I have to clean up the CSV file so that R will accept it – cleaning up a CSV file with 200 viral genomes took me a total of 3 hours, and I am still making my chart with Excel. I don’t currently have any good ideas for automating the CSV file cleanup, although it seems like those methods must exist. If they don’t, I honestly do not mind spending 3 hours once a month to do it myself. Or maybe if I find a student who is also an Excel nerd... Automating the graphing is easier – I know it can be done in R (I have played around with it but need to spend more time to make the graphs attractive), and I have seen code that will export these kinds of things to HTML so that the chart would auto-update on a website.
- The mutations that we download are defined by Nextstrain, which compares each sequence to a genome isolated in China in December of 2019. Therefore our tally combines ALL mutations in the selected viruses’ history going back to the beginning. It would be FAR more sophisticated to compare each sequence to a more recent one or, probably more intelligent, to a recent average of viruses of the same variant from the region you are interested in... that would tally only NEW mutations and thus be much more powerful. Doing that is going to be REALLY complicated and is currently impossible for me (although it’s obvious the method must exist, otherwise we wouldn’t be able to make phylogenetic trees from genomes like those found at Nextstrain), so what we have here is a case of not letting perfect be the enemy of good enough for now, or something like that.
- While I am achieving the flow state, it was really me who did all of the “coding”, and finding a way to interest students in that is a goal of mine... or rather – finding students who are interested in coding who can join the group and nerd out with us is a goal of mine. Last year a Computer Science major regularly attended meetings, but I discovered R too late – she had already transferred to Georgia Tech!
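On the “only NEW mutations” problem above, one crude approximation needs no new alignment at all: treat any change carried by most genomes in an earlier batch as the baseline, then count only the changes in the recent batch that are not in that baseline. The sketch below is in Python, not R, and is an assumption-laden illustration rather than a proper phylogenetic method. The function names, the 50% threshold, and the toy genomes are all hypothetical.

```python
from collections import Counter


def baseline_changes(genomes, threshold=0.5):
    """Return changes found in at least `threshold` of the genomes.

    `genomes` maps a genome name to its list of "GENE:change" strings.
    This stands in for a "recent average" of the variant.
    """
    seen = Counter()
    for changes in genomes.values():
        seen.update(set(changes))  # count each change once per genome
    cutoff = threshold * len(genomes)
    return {change for change, n in seen.items() if n >= cutoff}


def new_changes_by_gene(genomes, baseline):
    """Tally, per gene, the changes that are absent from the baseline."""
    counts = Counter()
    for changes in genomes.values():
        for change in set(changes) - baseline:
            counts[change.split(":", 1)[0]] += 1
    return counts


# Toy data for illustration only (not real sequencing results).
earlier = {
    "e1": ["S:D614G", "N:R203K", "M:Q19E"],
    "e2": ["S:D614G", "N:R203K", "M:Q19E"],
    "e3": ["S:D614G", "N:R203K"],
}
recent = {
    "r1": ["S:D614G", "N:R203K", "M:Q19E", "S:K444T"],
    "r2": ["S:D614G", "N:R203K", "M:Q19E", "S:K444T", "N:E136D"],
    "r3": ["S:D614G", "N:R203K", "M:Q19E"],
}

baseline = baseline_changes(earlier)
new_counts = new_changes_by_gene(recent, baseline)

for gene in sorted(new_counts):
    print(gene, new_counts[gene])
```

The obvious weakness is that the result depends entirely on which earlier batch you choose as the baseline, and the sketch says nothing about mutations the recent viruses have lost.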

