Research
Our research aims to develop the next generation of genome foundation models. Inspired by the recent success of large natural language models such as ChatGPT, several large genomic models have been developed. Yet this area remains underexplored in several respects, including tokenization techniques and training objectives.
On the one hand, unlike natural languages, it is not obvious what a “word” means in the language of genomes. This project therefore explores the different strategies these models employ to delimit “words” in genome sequences, and how those strategies affect reading and comprehending the various “stories”, or biological functions, that genome sequences narrate. Is one strategy superior to the others, or do different strategies benefit different aspects of biological function? This is the question the project seeks to answer.
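To make the contrast concrete, here is a minimal sketch of two common ways to define “words” over DNA: overlapping k-mers (in the spirit of DNABERT-style tokenization) and a toy greedy byte-pair-encoding (BPE) pass. The BPE function is a simplified illustration, not any published vocabulary, and the sequence and parameters are made up for the example.

```python
# Two toy tokenization strategies for a DNA sequence.
from collections import Counter

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Overlapping k-mers: every position starts a new 'word'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def toy_bpe_tokenize(seq: str, num_merges: int = 3) -> list[str]:
    """Greedy pair merging: frequent adjacent pairs become single tokens."""
    tokens = list(seq)                       # start from single nucleotides
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)         # merge the most frequent pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

seq = "ATGCGATATGCGAT"
print(kmer_tokenize(seq, k=6))   # overlapping 6-mers, one per position
print(toy_bpe_tokenize(seq))     # data-driven, variable-length "words"
```

The same sequence comes out segmented very differently under the two schemes, which is exactly the kind of difference whose downstream effect on biological-function prediction the project aims to measure.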
On the other hand, the optimal training objective for genome foundation models is unclear. While most previous work relies on masked language modeling, recent work also suggests the effectiveness of autoregressive language modeling, an objective that becomes even more promising as model size scales up. We aim to explore the potential of autoregressive language modeling in the context of DNA sequence modeling, to better understand the pros and cons of the different training objectives, and to explore the possibility of genome sequence generation. If this project is successful, scientists will have a better idea of how to build genome foundation models that help illuminate genome function, and will be able to harness the power of pre-trained models in a wide range of scientific problems.
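The following sketch illustrates how the two objectives differ in what they ask the model to predict, using a toy nucleotide vocabulary and PyTorch. The tiny embedding-plus-linear “model” is a stand-in for a real transformer, and the sequence, mask positions, and dimensions are illustrative assumptions, not the project's actual setup.

```python
# Toy comparison of masked vs. autoregressive training targets on DNA.
import torch
import torch.nn.functional as F

vocab = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}
ids = torch.tensor([[vocab[c] for c in "ATGCGATA"]])        # shape (1, 8)

embed = torch.nn.Embedding(len(vocab), 16)                  # stand-in for a transformer
head = torch.nn.Linear(16, len(vocab))

# Masked language modeling: corrupt a few positions, predict only those.
masked = ids.clone()
mask_pos = torch.tensor([2, 5])
masked[0, mask_pos] = vocab["[MASK]"]
logits = head(embed(masked))                                # (1, 8, vocab)
mlm_loss = F.cross_entropy(logits[0, mask_pos], ids[0, mask_pos])

# Autoregressive modeling: shift targets by one so each position predicts
# the next token (a real model would use causal attention over the left context).
logits = head(embed(ids[:, :-1]))                           # (1, 7, vocab)
ar_loss = F.cross_entropy(logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))

print(f"MLM loss: {mlm_loss.item():.3f}  AR loss: {ar_loss.item():.3f}")
```

The key design difference is visible in the target construction: masked language modeling supervises only the corrupted positions using bidirectional context, while the autoregressive objective supervises every position from its left context, which is what makes sequence generation possible.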