Scientists from the University of Sydney, Alibaba Cloud Intelligence's Apsara Lab and Sun Yat-sen University have achieved a breakthrough in virology. Their study, published in the journal Cell, identifies more than 161,000 new RNA virus species. A discovery on this scale is unprecedented, and it was made possible by LucaProt, a machine learning tool developed to analyze viral genome data.
Until now, vast amounts of genetic information obtained by sequencing environmental samples, ranging from soil and water to plant and animal tissues, have remained largely unprocessed. Millions of fragments of genetic sequences potentially belonging to viruses have been scattered across vast databases that traditional methods cannot analyze effectively. Processing such volumes of data manually would have been practically impossible, requiring decades of work by a large number of researchers.
Unlike its predecessors, LucaProt uses deep learning algorithms capable of recognizing characteristic patterns in genomic sequences. Instead of simply looking for matches with known viruses, it analyzes the structure of genes, predicts the functions of the proteins they encode, and then classifies new viruses based on this information, even if they have no close relatives among known viruses. Crucially, LucaProt takes many factors into account, including features of the genetic code, the size of the genome and the presence of specific genes. This allows it to accurately distinguish viruses belonging to different families and genera.
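To make the contrast between matching against known viruses and weighing several factors at once more concrete, here is a toy Python sketch. It is not LucaProt's code or method: the function names, the example sequences and the marker motifs are all hypothetical, and a real tool would feed signals like these into a trained deep learning model rather than simply printing them.

```python
from collections import Counter


def kmer_profile(seq, k=3):
    """Relative frequency of each overlapping k-mer in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def describe(seq, marker_motifs):
    """Combine several simple signals into one feature dictionary."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    return {
        "length": len(seq),                      # stand-in for genome size
        "gc_content": gc,                        # a crude composition feature
        "markers_present": sorted(m for m in marker_motifs if m in seq),
        "kmer_profile": kmer_profile(seq),       # local sequence patterns
    }


# Hypothetical data: one "known" sequence and a new, slightly different fragment.
known_sequences = {"ATGGCGTACGTTAGC"}
fragment = "ATGGCTTACGTAAGCGGC"

# Exact lookup finds nothing, because the fragment is not in the database.
exact_match = fragment in known_sequences

# A feature-based description still yields usable information about it.
features = describe(fragment, marker_motifs={"TACGT", "GGGGG"})

print("exact match:", exact_match)
print("length:", features["length"])
print("markers present:", features["markers_present"])
```

The lookup comes back empty for any sequence that is not already catalogued, while the feature dictionary can be computed for every fragment and passed to a classifier.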