
Congratulations to Xubo for his outstanding work in developing SegVir, a new tool for reconstructing the genomes of segmented viruses!

August 10, 2024

The article titled "SegVir: Reconstruction of Complete Segmented RNA Viral Genomes from Metatranscriptomes" has been accepted by Molecular Biology and Evolution. Our tool, SegVir, can identify viral segments, reconstruct full genomes, and quantify the completeness of segmented viruses. This is a collaborative effort with Prof. Shi Mang, who provided invaluable inspiration and suggestions throughout the project.

Congratulations to Guowei on the acceptance of his paper on RNA virus host identification, "RNAVirHost: A Machine Learning-Based Method for Predicting Hosts of RNA Viruses Through Viral Genomes", by GigaScience!

August 10, 2024

In collaboration with Dr. Jiang Jinzhe, our initial goal was to identify hosts of ocean viruses, a highly challenging problem due to the diversity and complexity of potential hosts. Guowei excelled in formulating a feasible computational problem for practical host prediction. There remains a significant need for better host prediction solutions. In our discussion section, we have summarized and visualized the major challenges.

A visualization illustrating the potential sequencing bias in the reference database, posing a challenge to the prediction of hosts. On the virus genus axis, there are three reference genera represented by colors: yellow for G1, blue for G2, and red for G3. The host group axis consists of three distinct host groups (H1, H2, H3) present in the reference database. The cylinders at the intersections of the dashed lines represent viruses belonging to the respective genera that infect the corresponding host groups. The height of each cylinder indicates the relative number of viruses. G1 is extensively studied, and its hosts are well-documented. G2 and G3 are under-studied, resulting in limited information about their hosts. More information can be found in the paper: “RNAVirHost: A Machine Learning-Based Method for Predicting Hosts of RNA Viruses Through Viral Genomes,” which will appear in GigaScience soon.

Congratulations to Dr. Yu Runzhou and Ziyi on publishing a review on using profile hidden Markov models (pHMMs) for virus discovery in metagenomic data!

August 10, 2024

We conducted a thorough comparison and evaluation of multiple pHMM databases, including comprehensive ones like Pfam and virus-specific databases like CheckV. While using pHMMs to search for RNA viruses is generally effective, finding DNA viruses (such as phages) remains challenging with these databases. Some virus-specific models incur a high false positive rate. More information can be found in the paper: “Utilizing Profile HMM Databases for Discovering Viruses from Metagenomic Data: A Comprehensive Review,” available at Briefings in Bioinformatics: https://pubmed.ncbi.nlm.nih.gov/39003531/.

Congratulations! Ziyi and Dehan's work on microbial source tracking has been accepted for presentation at ISMB 2024 and will be included in the conference proceedings, which will be published online in the journal Bioinformatics.

April 11, 2024

This work developed a novel tool called SourceID-NMF for more precise microbial source tracking. SourceID-NMF utilizes a non-negative matrix factorization (NMF) algorithm to trace the microbial sources contributing to a target sample. By leveraging the taxa abundance in both available sources and the target sample, SourceID-NMF estimates the proportion of available sources present in the target sample. A series of benchmarking experiments using simulated and real data were conducted to evaluate the performance of SourceID-NMF. The simulated experiments aimed to mimic realistic yet challenging scenarios, including identifying highly similar sources, irrelevant sources, unknown sources, low abundance sources, and noise sources. The results demonstrate the superior accuracy of SourceID-NMF compared to existing methods. In particular, SourceID-NMF accurately estimated the proportion of irrelevant and unknown sources, while other tools tended to either over- or under-estimate them. Additionally, the noise sources experiment showcased the robustness of SourceID-NMF for microbial source tracking. SourceID-NMF is available online at https://github.com/ZiyiHuang0708/SourceID-NMF.
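To give a rough feel for the idea of estimating source proportions from taxa abundances, here is a minimal sketch. It is not the SourceID-NMF algorithm: it holds the source profiles fixed and applies the standard NMF multiplicative update to the mixing weights only, and it does not model unknown sources. The function name `estimate_source_proportions` and the toy abundance profiles are assumptions made up for illustration.

```python
from typing import List


def estimate_source_proportions(sources: List[List[float]],
                                target: List[float],
                                n_iter: int = 500,
                                eps: float = 1e-12) -> List[float]:
    """Find non-negative weights w so that sum_j w[j] * sources[j] ~ target.

    Simplified illustration: the source profiles are held fixed and only the
    weights are updated with the NMF multiplicative rule for squared error.
    """
    k = len(sources)
    # Dot product of each source profile with the target (update numerator).
    s_dot_x = [sum(s * x for s, x in zip(src, target)) for src in sources]
    # Gram matrix of the source profiles (used in the update denominator).
    gram = [[sum(a * b for a, b in zip(si, sj)) for sj in sources]
            for si in sources]
    # Start from equal weights; all quantities stay non-negative.
    w = [1.0 / k] * k
    for _ in range(n_iter):
        denom = [sum(gram[j][i] * w[i] for i in range(k)) for j in range(k)]
        w = [w[j] * s_dot_x[j] / (denom[j] + eps) for j in range(k)]
    return w


# Toy example with hypothetical taxa abundances for two sources.
source_a = [0.6, 0.3, 0.1, 0.0]
source_b = [0.0, 0.1, 0.3, 0.6]
# Target built as 70% of source_a plus 30% of source_b.
target_sample = [0.42, 0.24, 0.16, 0.18]

proportions = estimate_source_proportions([source_a, source_b], target_sample)
print([round(p, 3) for p in proportions])
```

On this toy input the estimated weights should land close to 0.7 and 0.3, since the target was constructed as exactly that mixture. Real data would also need a way to account for sources that are missing from the reference set, which this sketch leaves out.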

Congratulations! Herui has been accepted into the MIT-Novo Nordisk AI Postdoc Program.

Feb 9, 2024

The MIT-Novo Nordisk Artificial Intelligence Postdoctoral Fellows Program supports postdoctoral fellows conducting research at the intersection of AI and data science with life sciences. Each year, the program will support a cohort of up to ten postdoctoral fellows for two-year terms. Postdoctoral fellows participating in the program will receive professional development opportunities, including entrepreneurship-focused workshops and mentorship from experts in both life science and data science.

You can visit this website for more information.

Congratulations! Herui and Runzhou Successfully Defend Their Dissertations!

Herui's thesis title is "Computational Methods for high-resolution microbial composition analysis and relevant applications". In this thesis, Herui developed a viral strain identification tool, VirStrain; a bacterial strain identification tool, StrainScan; and a microbiome-based host disease classification method, GDmicro. These tools were applied to large-scale real sequencing samples, including pathogen-infected patients, individuals from different countries, colorectal cancer cohorts, etc. The experimental results demonstrate that these tools have the potential to enhance our comprehension of microbes and facilitate the development of diagnostic and therapeutic strategies for diseases.

Runzhou's thesis title is "Computational Methods for accurate reconstruction of RNA viral genomes using third-generation sequencing data". This thesis presents three main works. The first work introduces AccuVIR, a tool for accurate viral genome assembly and polishing by generating high-quality paths in alignment graphs and then ranking them using multiple criteria. It performs well with both known and novel viruses, as demonstrated through evaluations with simulated and real sequencing data. The second work is HMMPolish, a pipeline designed to refine protein-coding regions of known RNA viruses sequenced using third-generation sequencing. HMMPolish utilizes profile Hidden Markov Models (pHMMs) of viral proteins and has been validated with clinically important virus datasets. The third work reviews commonly used profile HMM databases for viral metagenomic data. It evaluates their model properties using quantitative metrics and assesses their performance with simulated and real metagenomic datasets. The findings provide practical suggestions for optimizing the use of these databases in viral metagenomics research.

Congratulations! Herui and Jiayu's work GDmicro was accepted for publication by Bioinformatics.

Dec. 12th, 2023

With advances in metagenomic sequencing technologies, a growing number of studies have revealed associations between the human gut microbiome and some human diseases. These associations shed light on using gut microbiome data to distinguish case and control samples of a specific disease, a task also called host disease status classification. However, available tools have not fully addressed two challenges associated with this task: limited labeled microbiome data and decreased accuracy in cross-study settings. To address these challenges, we developed a new tool, GDmicro, which combines semi-supervised learning and domain adaptation to achieve a more generalized model using limited labeled samples. GDmicro is available at GDmicro. Check out the paper at GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data.

Congratulations! Jiaojiao and Cheng's work PhaGenus was published in Briefings in Bioinformatics (BIB).

Nov. 15, 2023

Bacteriophages prey on and replicate within bacterial cells and have a significant role in modulating microbial communities. However, the taxonomic classification of assembled phage contigs still faces several challenges. In this work, we developed a learning-based model named PhaGenus, which conducts genus-level taxonomic classification for phage contigs. PhaGenus utilizes a powerful Transformer model to learn the association between protein clusters and supports the classification of up to 508 genera. For more information, refer to the paper: PhaGenus: genus-level classification of bacteriophages using a Transformer model.

Congratulations! Our collaborative work with the Chinese Academy of Sciences, "Identifying ARG-carrying bacteriophages in a lake replenished by reclaimed water using deep learning techniques", was accepted for publication by Water Research.

Nov. 10, 2023

As important mobile genetic elements, phages support the spread of antibiotic resistance genes (ARGs). Previous analyses of metaviromes or metagenome-assembled genomes (MAGs) failed to assess the extent of ARGs transferred by phages, particularly in the generation of antibiotic-resistant pathogens. Therefore, we have developed a bioinformatic pipeline that utilizes deep learning techniques to identify ARG-carrying phages and predict their hosts, with a special focus on pathogens. Using this method, we discovered that the predominant types of ARGs carried by temperate phages in a typical landscape lake, which is fully replenished by reclaimed water, were related to multidrug resistance and β-lactam antibiotics. Check out the paper at Identifying ARG-carrying bacteriophages in a lake replenished by reclaimed water using deep learning techniques.

Congratulations! Jiayu was the CHAMPION in the competition named “Postgraduate Student Research Paper Competition 2022 – 2023”.

Aug 26, 2023

The competition is held by the IEEE (Hong Kong) Computational Intelligence Chapter. Jiayu was the CHAMPION for his outstanding academic performance.

Congratulations! Herui and Yongxin’s work for identifying bacterial strains from short reads was accepted for publication by Microbiome.

Aug. 17th, 2023

In this work, we developed a novel software tool, StrainScan, for identifying bacterial strains from short reads. By utilizing a novel tree-based k-mer indexing structure, StrainScan can strike a balance between strain identification accuracy and computational complexity. We tested StrainScan extensively on a large number of simulated and real sequencing datasets and benchmarked StrainScan against popular strain-level analysis tools. The results show that StrainScan has higher accuracy and resolution than the state-of-the-art tools on strain-level composition analysis. It improves the F1 score by 20% in identifying multiple strains at the strain level. StrainScan is available at StrainScan. Check out the paper at High-resolution strain-level microbiome composition analysis from short reads.

We participated in the ISMB/ECCB 2023 conference!

Aug 02, 2023

Congratulations! Dr. Sun and four talented Ph.D. students, namely Jiayu Shang, Xubo Tang, Dehan Cai, and Cheng Peng, recently participated in the prestigious ISMB/ECCB 2023 conference from 23rd to 27th July. During the conference, Jiayu Shang delivered an outstanding proceedings presentation titled PhaVIP: Phage VIrion Protein Classification based on chaos game representation and Vision Transformer. Xubo Tang showcased his work on identifying plasmid contigs from metagenomic data using a Transformer, while Dehan Cai presented his report on HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization. Cheng Peng exhibited the work on PhaBOX: A web server for identifying and characterizing bacteriophage contigs in metagenomic data through a poster presentation. Their innovative research work has garnered significant attention from the scientific community. If you are interested in learning more about their work, please refer to the published papers.

You can visit this website for more information.

Congratulations! Runzhou and Umer’s work for correcting (polishing) errors in protein-coding regions of RNA viruses was accepted for publication by Briefings in Bioinformatics.

July 21st, 2023

In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses. HMMPolish is available at HMMPolish. Check out the paper at HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses.

Congratulations! Xubo, Jiayu and Yongxin’s work on plasmid identification was accepted for publication by Nucleic Acids Research.

Jun. 24, 2023

Plasmids are mobile genetic elements that carry crucial accessory genes. Cataloging plasmids is a fundamental step in elucidating their role in promoting horizontal gene transfer between bacteria. To identify plasmid contigs from short-read assemblies, we developed a tool called PLASMe that utilizes the Transformer. PLASMe leverages the strengths of both alignment and learning-based methods. The alignment component in PLASMe facilitates the easy identification of closely related plasmids, while order-specific Transformer models predict diverged plasmids with accuracy. The tool is available at PLASMe. Check out the paper at PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer.

Congratulations! Xubo received 2nd Prize for the presentation titled “Identify Plasmid from Metagenomic Data Using Transformer” in the EE Graduate Research Seminar Awards.

May 10, 2023

Congratulations! Yongxin, Jiayu, and Xubo’s work on plasmid host prediction was accepted for publication by Bioinformatics!

Apr. 19, 2023

In this work, we constructed a tool named HOTSPOT for predicting the host association of plasmids. By incorporating the state-of-the-art language model, Transformer, in each node’s taxon classifier, the top-down tree search achieves an accurate host taxonomy prediction for the input plasmid contigs. We rigorously tested HOTSPOT on multiple datasets, and all experiments show that HOTSPOT outperforms other popular methods. The tool is available at HOTSPOT. Check out the paper at HOTSPOT: hierarchical host prediction for assembled plasmid contigs with transformer.
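The top-down tree search described above can be pictured with a small sketch. This is an illustration of hierarchical classification in general, not HOTSPOT's implementation: the toy taxonomy, the fixed score table standing in for the per-node Transformer classifiers, the function name `top_down_predict`, and the confidence threshold used for early stopping are all assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy: each internal node maps to its child taxa.
TAXONOMY: Dict[str, List[str]] = {
    "Bacteria": ["Proteobacteria", "Firmicutes"],
    "Proteobacteria": ["Gammaproteobacteria", "Alphaproteobacteria"],
    "Firmicutes": ["Bacilli", "Clostridia"],
}

# Made-up scores for one contig; a real system would obtain these from a
# trained classifier attached to each node.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "Bacteria": {"Proteobacteria": 0.9, "Firmicutes": 0.1},
    "Proteobacteria": {"Gammaproteobacteria": 0.55,
                       "Alphaproteobacteria": 0.45},
}


def toy_score_children(node: str, children: List[str]) -> Dict[str, float]:
    """Stand-in for a per-node classifier: look up fixed child scores."""
    table = TOY_SCORES.get(node, {})
    return {child: table.get(child, 0.0) for child in children}


def top_down_predict(root: str,
                     taxonomy: Dict[str, List[str]],
                     score_children: Callable[[str, List[str]],
                                              Dict[str, float]],
                     min_conf: float = 0.6) -> List[str]:
    """Walk from the root toward the leaves, following the best-scoring child.

    The walk stops at a leaf, or earlier when the best child's score falls
    below min_conf (an assumed stopping rule for this sketch).
    """
    path = [root]
    node = root
    while node in taxonomy:
        scores = score_children(node, taxonomy[node])
        best = max(scores, key=scores.get)
        if scores[best] < min_conf:
            break
        path.append(best)
        node = best
    return path


print(top_down_predict("Bacteria", TAXONOMY, toy_score_children))
```

With the default threshold the walk stops at the phylum level because neither class is scored confidently enough; lowering `min_conf` to 0.5 lets it continue one level deeper. Returning a partial path instead of forcing a low-level label is one way such a search can trade resolution for accuracy.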

Congratulations! The Chinese version of VirBot is published on the official WeChat account "宏基因组".

Apr. 16, 2023

You can find detailed information here.

Congratulations! Guowei and Xubo’s work on RNA virus identification and taxonomy was accepted for publication by Bioinformatics.

Feb. 17, 2023

In this work, we describe a tool, VirBot, for identifying RNA virus contigs in metagenomic data. Based on profile hidden Markov models (pHMMs), VirBot can better detect novel RNA virus sequences that share little similarity with known species while maintaining high precision. In benchmark experiments on both simulated and real sequencing data, VirBot shows high specificity in metagenomic datasets and superior sensitivity in detecting novel RNA viruses. The tool is available at VirBot. Check out the paper at VirBot: an RNA viral contig detector for metagenomic data.

A new integrated software named PhaBOX for phage identification and analysis has been developed!

Jan. 3rd, 2023

Bacteriophages (phages) play key roles in regulating the composition/function of the microbiome by infecting their host bacteria. Because integrated software for phage identification and analysis is lacking, novel phages waiting to be discovered constitute a large portion of “viral dark matter”. In this work, we developed a web server, named PhaBOX, to accurately identify and analyze phage contigs in metagenomic data. To the best of our knowledge, this is the first web server for comprehensive phage contig analysis in metagenomic data. PhaBOX integrates our previously published tools: PhaMer, PhaTYP, PhaGCN, and CHERRY, for phage identification, lifestyle prediction, taxonomy classification, and host prediction, respectively. To help users conduct downstream analysis, PhaBOX also provides visualization of the essential features for making the predictions, such as the similarity-based relationships between the contigs and other phages, predicted proteins on the contigs, and protein homology. All the predictions and intermediate results are provided for users. We hope that it can help advance the field of phage study in various ecosystems. To try PhaBOX, please click the link PhaBOX.

Congratulations! Runzhou and Dehan’s work on viral genome assembly and polishing was accepted for publication by Bioinformatics.

Dec. 26th, 2022

In this work, we introduce a new tool, AccuVIR, for viral genome assembly and polishing using error-prone long reads. It can better distinguish sequencing errors from true variants based on the key observation that sequencing errors can disrupt the gene structures of viruses, which usually have a high density of coding regions. Our experimental results on both simulated and real third-generation sequencing data demonstrated its superior performance in generating more accurate viral genomes than generic assembly or polishing tools. The tool is available at https://github.com/rainyrubyzhou/AccuVIR. Check out the paper at https://doi.org/10.1093/bioinformatics/btac827

Congratulations! Yilin, Jiayu and Peng Cheng's work on phage taxonomy classification was accepted for publication by Frontiers in Microbiology.

Dec. 22nd, 2022

This is the first review conducted under the new ICTV classification framework since several large families were removed from ICTV in August 2022. This study provides a comprehensive review of phage classification in different scenarios and practical guidance for choosing appropriate taxonomic classification pipelines. (https://www.frontiersin.org/articles/10.3389/fmicb.2022.1032186)

Congratulations! Dehan and Jiayu’s work on viral haplotype reconstruction was accepted for publication by Bioinformatics.

Oct. 21st, 2022

This work developed a tool “HaploDMF” to reconstruct viral haplotypes from TGS data. Unlike existing tools that reconstruct haplotypes by checking the identity of overlap between reads, HaploDMF utilizes a deep matrix factorization model with an adapted loss function to automatically learn latent features from aligned reads. It is able to achieve highly robust performance on data with different properties while existing tools’ performance can be affected by the overlap size between reads. The tool is available at https://github.com/dhcai21/HaploDMF.

Congratulations! Jiayu's work on phage lifestyle prediction is accepted today in Briefings in Bioinformatics!

Oct. 17th, 2022

Bacteriophages (or phages), which infect bacteria, have two distinct lifestyles: virulent and temperate. Predicting the lifestyle of phages helps decipher their interactions with their bacterial hosts, aiding phages' applications in fields such as phage therapy. PhaTYP adopts Bidirectional Encoder Representations from Transformers (BERT) to learn the protein composition and associations from phage genomes and achieves more stable performance on short contigs. arXiv version: [PhaTYP: Predicting the lifestyle for bacteriophages using BERT](https://arxiv.org/abs/2206.09693)

Congratulations! Jiayu, Herui, Xubo, and Dehan received awards!

Sep 9, 2022

They got the Research Tuition Scholarship ($42,096) and the Outstanding Academic Performance Award ($1,000) for their outstanding academic performance in the 2021-2022 academic year!

Congratulations! Jiayu's work on identifying phage is published today in Briefings in Bioinformatics!

Jun 30, 2022

Accurate identification of bacteriophages from metagenomic data using Transformer

Congratulations! Herui and Dehan received the 2nd Prize in "The 8th Hong Kong University Student Innovation and Entrepreneurship Competition"!

June 07, 2022

You can visit this website [https://www.hkchallengeplus.com/news/] for more information.

Herui and Dehan attended the final round of the competition named "The 8th Hong Kong University Student Innovation and Entrepreneurship Competition" at Science Park of Hong Kong

May 27, 2022

We recorded a video about our work in the competition. If you are interested in it, please scan this QR code.

Congratulations! Jiayu's work on host prediction is published today in Briefings in Bioinformatics!

May 22, 2022

CHERRY: a Computational metHod for accuratE pRediction of virus–pRokarYotic interactions using a graph encoder–decoder model

Congratulations! Xubo's poster presentation received the 2nd Prize ($2,000) in the EE faculty!

May 19, 2022

Ph.D. student Tang Xubo gave a talk titled "Sensitive RNA Virus Read Binning Using Learning-Based Models"

April 27, 2022

The talk was held at the Centre for Intelligent Multidimensional Data Analysis Limited.

Welcome! A new research assistant, Liyan, joined the group today

April 13, 2022

Welcome Liyan!

An invited talk: Characterizing viral haplotypes using long reads

April 6, 2022

Yanni Sun will give an invited talk “Characterizing viral haplotypes using long reads” at the Workshop on Combinatorial Problems of Strings and Graphs and Their Applications in Bioinformatics. The workshop is organized by the National University of Singapore and the Institute for Mathematical Sciences.

Combinatorial Problems of Strings and Graphs and Their Applications in Bioinformatics Part 2

Talk: Understanding the language of life using AI

Mar. 18, 2022

Yanni Sun gave a talk titled “Understanding the language of life using AI” to middle and high school students in Hong Kong at CityU-Learning Classroom.

Dehan’s work is published

Feb. 14, 2022

Dehan’s work on viral haplotype reconstruction “Reconstructing viral haplotypes using long reads” is published today!

Reconstructing viral haplotypes using long reads

Xubo and Jiayu’s work on RNA virus classification is published

Feb. 7, 2022

"RdRp-based sensitive taxonomic classification of RNA viruses for metagenomic data" is published today!

RdRp-based sensitive taxonomic classification of RNA viruses for metagenomic data

Herui’s work on "strain-level" RNA virus composition analysis is published

Jan. 31, 2022

"VirStrain: a strain identification tool for RNA viruses" is published!

VirStrain: a strain identification tool for RNA viruses
