Semiautomated analysis of 200 genomes – proof of concept

Throughout the Fall of 2021 and the Spring of 2022, Virus Hunters’ approaches to sequence analysis matured, and by June we were just beginning to write code in R to automate some of our analysis.  Right before I left for vacation, I managed to cobble together a little code that would tally the number of mutations or amino acid changes in genomes according to which viral gene they fall in.  We can download ALL of the mutations for any of these viruses at Nextclade in a CSV format and edit/clean those up with Excel.  At the time I had a test set of only about 11 viruses.  It worked, but I wasn’t sure how it would go with hundreds, let alone thousands, of genomes.

Last night I tried more: I had my little code-jalopy tackle 200 BA.5.2.1 genomes from Georgia, taken from patients between August 30 and September 7, 2022, and it worked!

RStudio coding with output

The code I pasted together from Googling and its output

Ignore the first “0” – R hates data with empty cells, so I entered a zero in all of the 1,467 empty cells (these viruses have different numbers of mutations and thus there are empty cells in the file).  To read the output: going from left to right, there are 602 amino acid changes among these 200 viruses in the viral M gene, 1,241 changes in the viral N gene, and so on.
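For readers who want to see the shape of such a tally, here is a minimal sketch. It is written in Python rather than the R shown above, and it is not the code from the screenshot. It assumes the raw Nextclade export is semicolon-separated and has an aaSubstitutions column whose cells hold comma-separated entries such as S:D614G; check both assumptions against your own download. The function name and the three example rows are made up for illustration. Because each genome's changes are read from a single cell, there are no ragged rows to pad with zeros.

```python
import csv
import io
from collections import Counter


def tally_changes_by_gene(csv_text, column="aaSubstitutions", delimiter=";"):
    """Count amino acid changes per gene across every genome in the export.

    Assumes each cell in `column` looks like "M:Q19E,N:P13L,S:D614G".
    """
    counts = Counter()
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        cell = (row.get(column) or "").strip()
        if not cell:
            continue  # genome with no listed changes: skip it, no padding needed
        for change in cell.split(","):
            # The gene name is everything before the first colon.
            gene, _, _ = change.strip().partition(":")
            if gene:
                counts[gene] += 1
    return counts


# Illustrative rows only, not real data from the 200-genome set.
example = (
    "seqName;aaSubstitutions\n"
    "virus_1;M:Q19E,N:P13L,S:D614G\n"
    "virus_2;N:P13L,S:D614G,S:N501Y\n"
    "virus_3;\n"
)

for gene, n in sorted(tally_changes_by_gene(example).items()):
    print(gene, n)
```

With a real file, the text passed in would come from reading the downloaded CSV instead of the example string.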

My short-term goal here is to make an analysis pipeline that would allow us to do regular updates on viral sequences collected recently from [name your region, although Georgia is a good start].  This isn’t extremely high-level science, but I will bet you that there are no tools currently available that show you this information for Georgia.

One idea I have is to take that kind of information and regularly update it on a website as sort of an advertisement of “look what Perimeter College students can do!”.  This certainly isn’t the only analysis that could be done; it’s just the easiest one I could think of.  I’m imagining something along the lines of:


Now for the problems 

  1. I wouldn’t call this a pipeline yet because at the beginning I have to clean up the CSV file so that R will accept it – cleaning up a CSV file with 200 viral genomes took me a total of 3 hours, and I am still making my chart with Excel.  I don’t currently have any good ideas for automating the CSV file cleanup, although it seems like those methods must exist.  If they don’t, I honestly do not mind spending 3 hours once a month to do it myself.  Or maybe if I find a student who is also an Excel nerd…  Automating the graphing is easier – I know it can be done in R (I have played around with it but need to spend more time to make the graphs attractive) and I have seen code that will export these kinds of things to HTML so that they would auto-update on a website.
  2. The mutations that we download are defined by Nextstrain, which compares each sequence to a genome isolated in China in December of 2019.  Therefore our tally combines ALL mutations in the selected viruses’ history going back to the beginning.  It would be FAR more sophisticated to compare each sequence to a more recent one or, probably more intelligent, a recent average of viruses of the same variant from the same region you are interested in… that would tally only NEW mutations and thus be much more powerful.  Doing that is going to be REALLY complicated and is currently impossible for me (although it’s obvious the method must exist, otherwise we wouldn’t be able to make phylogenetic trees from genomes like those found at Nextstrain), so for now I am not letting the perfect be the enemy of the good enough, or something like that.
  3. While I do reach a flow state doing this, it was really me who did all of the “coding”, and finding a way to interest students in that is a goal of mine… or rather – finding students who are interested in coding who can join the group and nerd out with us is a goal of mine.  Last year a Computer Science major regularly attended meetings, but I discovered R too late – she had already transferred to Georgia Tech!
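On problem 2, a very crude stand-in for "a recent average of viruses of the same variant" is to treat any change carried by nearly all genomes in the set as shared background and tally only what is left over. The Python sketch below illustrates that idea under loud assumptions: the function name, the 0.9 default threshold, and the example genomes (including the change labels) are all hypothetical, and this is a frequency filter, not the phylogenetic comparison that tree-building tools perform.

```python
from collections import Counter


def split_common_and_new(genomes, threshold=0.9):
    """Separate widely shared changes from rarer ones.

    genomes: dict mapping a genome name to a set of "GENE:change" strings.
    threshold: fraction of genomes that must carry a change for it to count
               as shared background (an arbitrary, adjustable assumption).
    Returns (baseline set, Counter of non-baseline changes per gene).
    """
    total = len(genomes)
    # How many genomes carry each change.
    freq = Counter(change for changes in genomes.values() for change in changes)
    baseline = {c for c, n in freq.items() if n / total >= threshold}

    new_by_gene = Counter()
    for changes in genomes.values():
        for change in changes - baseline:
            gene = change.split(":", 1)[0]
            new_by_gene[gene] += 1
    return baseline, new_by_gene


# Made-up genomes for illustration only.
genomes = {
    "virus_1": {"S:D614G", "N:P13L", "M:Q19E"},
    "virus_2": {"S:D614G", "N:P13L", "S:T547K"},
    "virus_3": {"S:D614G", "N:P13L"},
    "virus_4": {"S:D614G", "N:P13L", "ORF1a:A100V"},
}

baseline, new_by_gene = split_common_and_new(genomes, threshold=0.75)
print(sorted(baseline))
print(sorted(new_by_gene.items()))
```

In this toy set the two changes present in all four genomes drop out as background, and each of the three remaining genes is left with a single change. Whether a frequency cutoff like this is a fair proxy for "new" mutations in real data is an open question.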

The origins of Virus Hunters

I spent all of 2021 working with my good buddy Dr Pyatt making an online viral bioinformatics research course that we taught Spring and Fall at Keane University, and over the Summer I ran it with a group of high school students.  It was outrageously successful in allowing students with very little experience with molecular genetics to produce high quality posters on their *individual* research … usually students are put in groups and each individual works on a part of the project – not so with Virus Hunters, where everyone has their own research questions.

Working on this stuff quickly brings me into a state of flow, and since high school students can do it, there is no reason why my non-majors here at Perimeter College couldn’t.  So in the Fall of 2021 I began it here as an informal club – basically, I spent 2 hours every Friday working on it – either doing the science or, if someone showed up who wanted to learn, teaching them how to access the databases and use the analysis tools.  Typically these sequences are from viruses that were in the noses of people 10 days to 2 weeks ago.  Let me rephrase that: undergraduates with no college coursework in biology were analyzing viral genomes that 10 days ago were in someone’s nose, and these undergraduates were finding novel mutations, locating the changed amino acids in 3D structures of the proteins and addressing very basic questions of molecular epidemiology.  The number of students wasn’t extremely high (anywhere from 1 to 4 students at meetings) but that’s OK – more students makes it difficult to make progress, and right now I need to build out something that is more robust and has a hope of working in a classroom setting.