AI Models for Biological Engineering: LLMs and Beyond

Takeaway: AI in biology is moving beyond traditional predictive models to generative AI, including Large Language Models (LLMs), that can not only read the language of life but also write novel biological code, designing proteins and genetic circuits that have never existed before.

For the past decade, the application of artificial intelligence in biology has largely been an exercise in prediction. We have used machine learning models to predict a protein's structure from its sequence, to predict a compound's toxicity, or to predict which patients are most likely to respond to a drug. These predictive models are incredibly powerful, but they are only the beginning. The new frontier is generative AI.

We are now entering an era where AI can not only read the language of biology but can also write it. This is made possible by a new class of AI models, most famously the Large Language Models (LLMs) that power systems like ChatGPT, which are now being adapted to understand the grammar and syntax of DNA, RNA, and proteins. This shift from prediction to generation is unlocking the ability to design novel, functional biology from scratch.

Biology as a Language

The conceptual breakthrough behind this shift from prediction to generation is the realization that biological sequences can themselves be treated as a language.

  • Proteins are "sentences" written in an alphabet of 20 amino acid "letters." The order of these letters determines the protein's structure and its ultimate "meaning," or function.

  • Genomes are "books" written in an alphabet of four nucleotide "letters" (A, T, C, G). The grammar and syntax of this code dictate how an organism functions.

By treating biology as a language, we can apply the same powerful AI architectures that have revolutionized natural language processing directly to the challenges of biological engineering.
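To make the analogy concrete, here is a minimal sketch of the first step any language model needs: turning letters into numbers. The function names (build_vocab, encode, decode) and the two example sequences are made up for illustration. Real protein and DNA models typically add special tokens (padding, masking, start and end markers) and may use larger units than single letters, so treat this as a toy under those assumptions rather than how any particular model works.

```python
# The 20 standard amino acids (one-letter codes) and the four DNA bases.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"


def build_vocab(alphabet):
    """Assign each letter of the alphabet an integer id."""
    return {letter: index for index, letter in enumerate(alphabet)}


def encode(sequence, vocab):
    """Turn a sequence string into a list of integer ids."""
    ids = []
    for letter in sequence.upper():
        if letter not in vocab:
            raise ValueError(f"unknown symbol {letter!r}")
        ids.append(vocab[letter])
    return ids


def decode(ids, vocab):
    """Turn a list of integer ids back into a sequence string."""
    inverse = {index: letter for letter, index in vocab.items()}
    return "".join(inverse[i] for i in ids)


protein_vocab = build_vocab(AMINO_ACIDS)
dna_vocab = build_vocab(NUCLEOTIDES)

# Illustrative sequences only; they are not taken from any real gene or protein.
protein_ids = encode("MKTAYIAK", protein_vocab)
dna_ids = encode("ATGGCC", dna_vocab)
```

Once a sequence is a list of integers, it can be fed to the same kinds of architectures used for text, which is the sense in which biology can be "read" like a language.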

The Rise of Generative Models

  • Large Language Models (LLMs) for Proteins: Scientists are now training LLMs on massive databases of known protein sequences. By learning the statistical "grammar" of those sequences, these models can be prompted to generate brand new, plausible protein sequences that have never been seen in nature. A researcher can, in theory, describe the function they want in plain English, and the model can "write" the sequence for a novel protein designed to perform that function.

  • Diffusion Models for Protein Structure: Similar to the AI models that generate photorealistic images (like DALL-E or Midjourney), diffusion models are being used to generate novel, three-dimensional protein structures. They start with a random cloud of atoms and iteratively refine it into a stable, functional protein fold that meets a specific set of design criteria.

  • Generative Adversarial Networks (GANs) for Genetic Circuits: GANs involve two AI models—a "generator" and a "discriminator"—that compete against each other. The generator creates new designs (e.g., for a genetic circuit), and the discriminator tries to distinguish the fake designs from real, functional ones. This competitive process allows the generator to "learn" how to create increasingly sophisticated and effective designs.
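The "learn the grammar, then write" idea behind the first bullet can be sketched at toy scale. The example below is not an LLM: it swaps in a simple bigram model that only counts which letter tends to follow which, then samples new sequences from those counts. The function names (train_bigram, generate) and the three training strings are invented for illustration and are not real proteins. A real protein language model uses a neural network trained on millions of sequences, but the loop of "predict the next letter, append it, repeat" is the same in spirit.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical markers for sequence start and end


def train_bigram(sequences):
    """Count how often each letter follows each other letter."""
    if not sequences:
        raise ValueError("need at least one training sequence")
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=50):
    """Sample a new sequence one letter at a time from the learned counts."""
    out = []
    prev = START
    while len(out) < max_len:
        followers = counts[prev]
        letters = list(followers)
        weights = [followers[letter] for letter in letters]
        nxt = rng.choices(letters, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


# Made-up training strings over the amino acid alphabet (illustrative only).
toy_sequences = ["MKTAYIAKQR", "MKLVINGKTL", "MSTAYLAKQG"]
model = train_bigram(toy_sequences)

rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(model, rng, max_len=30) for _ in range(5)]
```

Each sample is stitched together from letter pairs seen in training, so it can differ from every training sequence while still following their local patterns. Whether a generated sequence actually folds or functions is a separate question that this toy says nothing about.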

This new generation of AI tools is transforming the very nature of biological R&D. We are moving from a process of discovering what nature has already created to a new paradigm of engineering, where we can dream up a novel biological function and use AI as our creative partner to write the genetic code to make it a reality.
