IN A Doodle: When The Mighty Fat Bubble Discovered Artificial Intelligence

Elly Peng
10 min read · Dec 28, 2022

Using Machine Learning to Optimize Lipid Nanoparticles for mRNA Vaccines

Five billion people use the internet today.

Yet, there are more people on Earth vaccinated against COVID-19 (68.2%) than people on the internet (63.5%).

Both Pfizer and Moderna vaccines use messenger RNA (mRNA) technology to replicate the SARS-CoV-2 spike protein and train the immune system.

However, mRNA is delicate.

Our enzymes destroy mRNA before it enters our cells — on its own, it wouldn’t have a chance to help our immune system against the virus.

So how did we solve this problem?

Hold up. What is mRNA?

[I know about mRNA. Skip to Lipid Nanoparticles.]

Before we answer that, let’s talk about nucleic acids. The two main types are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) (source).

Nucleic acids are polymers made of monomers called nucleotides: think of a building (polymer) made of four types of bricks (monomers), repeated.

Across DNA and RNA combined, there are 5 types of building blocks, defined by their nitrogen-containing bases: adenine (A), guanine (G), cytosine (C), thymine (T), and uracil (U). Each nucleic acid uses only four of them.

A and G are purines (contain 2 rings), and C, T, and U are pyrimidines (contain 1 ring). DNA doesn’t have the U, and RNA doesn’t have T. Note: the nitrogenous base is part of the nucleotide but isn’t the only component.

But what are the nucleotides?

Each nucleotide contains three parts:

  1. a 5-carbon sugar (ribose for nucleotides in RNA, deoxyribose for nucleotides in DNA) (source)
  2. a nitrogenous base
  3. one or more phosphate groups (when PO₄³⁻ is part of a carbon-containing molecule, it’s called a “phosphate group”) (source)

The sugar molecule is in the centre: the nitrogenous base attaches to one of its carbons (the 1’ carbon), and the phosphate group(s) attach to another (the 5’ carbon).

Side note: Why are they called nucleic acids if they have nitrogen-containing bases? Nucleic acids contain a basic component (the nitrogenous base) and an acidic component (the phosphate groups). In the overall structure, the acidic properties of the phosphate groups are more prominent than the basic properties of the nitrogenous bases. (source)

What’s so different about DNA and RNA?

Functionally, DNA stores genetic information, and RNA carries instructions for protein synthesis.

Structurally, DNA comes in 2 strands in a double helix, while RNA has one strand. They have different sugar molecules, and RNA swaps out the thymine nitrogenous base in DNA for uracil.

There are four major types of RNA: messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and regulatory RNAs.

Let’s focus on mRNA.

Finally. What is messenger RNA?

Let’s envision a mother and son making a pie.

Let’s also pretend the son can’t cook, but he’s good at delivering instructions from a cookbook. The son takes the recipe and gives the information to the mom, who is actively making each part of the pie.

The cookbook is the DNA, the mom is the ribosome, the pie is the protein, and the son is the messenger RNA.

The process involves transcription (DNA → mRNA) and translation (mRNA → protein). When a cell needs to make a protein, the gene encoding the protein “turns on” (that is, the cell begins transcribing it), and an enzyme (a protein that speeds up chemical reactions) called RNA polymerase makes a copy of the DNA sequence, forming an mRNA strand.

Since nucleotides aren’t symmetrical, we must differentiate the two ends of nucleic acid strands. If you noticed the 1’ to 5’ labels on the sugar molecule in the nucleotide diagram, those would identify the directionality of the nucleic acid.

The chain begins with the 5’ phosphate group of the first nucleotide and ends with the 3’ hydroxyl group of the last nucleotide. We consider the 3’ hydroxyl group the end because the RNA polymerase only adds to the 3’ end of the chain. (source)

Note: the base T in the DNA strand becomes U in the RNA strand.
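The T-to-U swap is easy to see in a few lines of Python. This is only a minimal sketch: `transcribe` is a made-up helper name, and it assumes the input is the coding strand written 5’ to 3’ (RNA polymerase actually reads the complementary template strand, which yields the same result).

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand written 5' to 3'."""
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("DNA may only contain A, C, G, and T")
    # The mRNA matches the coding strand, except that uracil replaces thymine.
    return dna.replace("T", "U")


print(transcribe("ATGGCTTAA"))  # AUGGCUUAA
```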

After the RNA polymerase separates from the chain, an enzyme called poly-A polymerase adds a poly-A tail (a strand of adenine nucleotides connected to each other) to cap the 3’ end of the strand. The length of the tail varies between RNAs, but the one on the mRNA for the SARS-CoV-2 spike protein contains 110 nucleotides. The tail helps prevent the RNA from degrading when it is exported from the nucleus to the cytoplasm. When transcription starts, the 5’ end is also capped with a modified guanine nucleotide to protect it from enzymes that would break it down.

During translation, the mRNA interacts with a ribosome. The ribosome uses the information in the mRNA to create the protein out of amino acids.
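Here is a rough sketch of that reading process, assuming the ribosome starts at the first base and reads three bases (one codon) at a time. `CODON_TABLE` holds only a handful of entries from the standard genetic code, and `translate` is a hypothetical helper; a codon missing from this excerpt would raise a `KeyError`.

```python
# A small excerpt of the standard genetic code (the full table has 64 codons).
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "GCU": "Ala",
    "UUU": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "UAA": "STOP",
    "UAG": "STOP",
    "UGA": "STOP",
}


def translate(mrna: str) -> list[str]:
    """Read the mRNA three bases at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


print(translate("AUGGCUUUUGGCUAA"))  # ['Met', 'Ala', 'Phe', 'Gly']
```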

[3D-visualization of mRNA at work]

So, how is it used in mRNA vaccines…like the COVID-19 ones?

Unlike most vaccines, messenger RNA vaccines don’t contain dead viruses, weakened viruses, or pieces of viruses.

Remember the son and mom making pie? Let’s introduce a daughter, too. She knows the recipe for an apple pie by heart — she learnt this from external sources. The daughter gives the mom instructions on how to make the apple pie without the cookbook (DNA). In this case, the daughter is the mRNA from the vaccine, and the apple pie is the spike protein our body replicates, found on the SARS-CoV-2 virus. Since the mRNA doesn’t need to transcribe anything from our genes, it doesn’t enter the nucleus or interact with DNA.

Ribosomes create the spike protein antigen (a foreign substance our immune system attacks), and the cell displays the antigen on its membrane. This stimulates a type of white blood cell called a cytotoxic T-cell that kills the “infected” cell. B-cells, another type of white blood cell, identify these antigens and create antibodies to neutralize them. When the virus is present, the antibodies bind to its spike proteins and deactivate the virus.

In both cases, your body produces memory cells that last much longer than the mRNA and spike protein. They activate the T-cells and B-cells in the body if the virus enters, building immunity towards the virus without infecting your body. (source)

It’s like taking a practice test before your final math exam — we understand what the test is like and what to look out for without repercussions of doing poorly on the test.

This sounds great, but what if the body destroys the mRNA before it enters the cells?

The Degradation Problem

Unlike the strands made by our body, synthetic mRNA has the additional challenge of getting into the cells before protein synthesis can start.

Even with the 5’ cap and poly-A tail, mRNA is delicate, and enzymes in our body destroy it before it can enter the cells.

Another challenge is that mRNA strands are large, polar, and negatively charged and can’t make it past the lipid bilayer without help.

Lipid Nanoparticles

Scientists discovered that the vaccine could deliver the mRNA into the cell by packaging it in the same substance as the cell membrane: fat.

The fatty droplets, lipid nanoparticles, wrap around the mRNA like a delivery box. When the mRNA safely arrives inside the cell, ribosomes can translate its message into copies of the spike protein found on the SARS-CoV-2 virus. Now, the immune system gets the training to deal with the virus. (cool LNP video)

LNP-based mRNA vaccines usually contain four types of lipids: cholesterol, distearoylphosphatidylcholine (DSPC), polyethylene glycol (PEG) lipid, and ionizable lipid. Here’s what each one does:

  • Cholesterol: helps LNP formation by adjusting the flexibility of lipids during mixing
  • DSPC (the helper lipid): helps LNP structure, interfacial tension, and mRNA release
  • PEG-lipid: influences the LNP stability, size, and potency
  • Ionizable lipid: binds to mRNA, interacts with the endosomal membrane, and is responsible for mRNA release

Ionizable lipids are the most abundant and critical ingredient. Traditionally, ionizable lipids are screened by creating numerous lipids and testing their in vivo efficacy.

However, current experimental screening methods are costly, slow, and material-intensive.

A Bit on Machine Learning & Formulation Prediction

Machine learning (ML) is a branch of artificial intelligence that enables computers to learn from data without being explicitly programmed.

We can use ML to learn the relationship between input and output parameters and then make predictions for many kinds of data, including drug formulations.

Previous studies have applied ML to drug delivery systems such as nanocrystals, solid dispersions, cyclodextrin complexes, and self-emulsifying drug delivery systems (SEDDS).

So… what about building a machine-learning model to predict LNP formulations for mRNA vaccines?

Let’s follow a study that uses the LightGBM algorithm to build a prediction model. (source)

ML joins the mRNA Party

Data Collection

The study uses specific inclusion criteria to control what goes into the model. The dataset draws on 65 publications that meet the following criteria:

  1. Tested antibody titers. A titer test is a type of blood test that determines the presence and level (titer) of antibodies in the blood. In this case, they were Immunoglobulin G (IgG) titers or hemagglutination inhibition (HAI) titers.
  2. LNP composition of ionizable lipids, DSPC, cholesterol, and PEG-DMG
  3. mRNA encoding a single antigen
  4. Antigen has been studied in more than one study
  5. No more than two vaccinations done on subjects
  6. Test time was within one year after the initial vaccination

There were three types of input parameters:

  • Boolean data (assigned “1” or “0”): whether or not the mRNA sequence is self-amplifying, contains pseudouridine, and has undergone codon optimization
  • Multicategorical variables: antigen protein type, cap type, subject type, population or strain, and injection route
  • Numerical variables… everything else.
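To make the three input types concrete, here is a minimal sketch of how one data point could be turned into a row of numbers. Everything in it is an assumption for illustration: the field names, the dose value, the list of injection routes, and the choice of one-hot encoding for categorical values (the study’s actual encoding isn’t described here).

```python
def encode_sample(sample: dict, categories: dict[str, list[str]]) -> list[float]:
    """Turn one record into a flat list of numbers a model can consume."""
    features = []
    for name, value in sample.items():
        if isinstance(value, bool):
            # Boolean data becomes 1 or 0.
            features.append(1.0 if value else 0.0)
        elif name in categories:
            # Categorical data becomes one 0/1 column per possible option.
            features.extend(1.0 if value == option else 0.0 for option in categories[name])
        else:
            # Numerical data is passed through unchanged.
            features.append(float(value))
    return features


# Hypothetical record and category list, for illustration only.
categories = {"injection_route": ["intramuscular", "intradermal", "subcutaneous"]}
sample = {
    "self_amplifying": False,
    "pseudouridine": True,
    "codon_optimized": True,
    "injection_route": "intramuscular",
    "dose_ug": 10.0,
}
print(encode_sample(sample, categories))
# [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 10.0]
```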

The final dataset contained lipid nanoparticles with seven kinds of ionizable lipids:

MC3, L319, Lipid M, SM-102, DLinDMA, Lipid N, and Lipid Q

Structural Representation of Ionizable Lipids

Extended-connectivity fingerprints (ECFPs) represent the structural characteristics of each ionizable lipid. A fingerprint is a bit string of “1”s and “0”s in which each bit corresponds to a set of chemical substructures: “1” means the lipid contains one of them, “0” means it does not.
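The idea of mapping substructures onto a fixed-length bit string can be sketched with a toy example. This is not a real ECFP: real fingerprints are computed from atom neighbourhoods by a cheminformatics toolkit, whereas here the substructure labels are made-up strings and `toy_fingerprint` is a hypothetical helper. It does show why one bit can stand for several substructures: different labels may hash to the same position.

```python
import hashlib


def toy_fingerprint(substructures: list[str], n_bits: int = 16) -> str:
    """Hash each substructure label to a bit position and switch that bit on."""
    bits = [0] * n_bits
    for label in substructures:
        digest = hashlib.sha256(label.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits[position] = 1
    return "".join(str(b) for b in bits)


# Made-up labels standing in for substructures found in a lipid.
fingerprint = toy_fingerprint(["tertiary amine", "ester", "alkyl chain"])
print(fingerprint)
```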

Data Splitting

The dataset is split into two sets: one for training the models (260 data points) and one for testing to find the best model (75 data points).
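A split like this can be sketched as follows, assuming a simple random shuffle (how the study actually assigned points to each set isn’t stated here). `split_indices` and the seed are illustrative choices.

```python
import random


def split_indices(n_samples: int, n_test: int, seed: int = 0) -> tuple[list[int], list[int]]:
    """Shuffle the row indices, then hold out the first n_test of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=335, n_test=75)
print(len(train_idx), len(test_idx))  # 260 75
```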

Evaluation Criteria

We use these four values to determine errors between real labels and predictions and evaluate model performance:

  1. Mean absolute error (MAE)
  2. Mean squared error (MSE)
  3. Root mean squared error (RMSE)
  4. Coefficient of determination (R²)
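For reference, all four values can be computed in a few lines using their standard definitions. The numbers below are invented purely to show the arithmetic, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute MAE, MSE, RMSE, and R² from real labels and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R² compares the model's squared error with that of always predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)  # MAE 0.25, MSE 0.125, RMSE about 0.354, R2 0.9
```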

Model Building with LightGBM

The prediction model uses the light gradient-boosting machine (LightGBM), a gradient-boosting framework based on a decision tree algorithm.

In this case, the model predicts the antibody titer, which serves as the measure of the mRNA vaccine’s immunological performance.

First… what is a decision tree?

Decision Trees

A decision tree uses follow-up questions to separate and categorize data. Here’s an example of a simple decision tree:

The problem with decision trees is their tendency to overfit. Overfitting happens when the model fits too closely to the training set, memorizing its noise, and fails to generalize the important patterns to new data.
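To see how a tree picks one of its questions, here is a minimal sketch for a single numeric feature. It tries every threshold and keeps the one whose two groups have the smallest squared error around their own means. The data and the `best_split` helper are made up for illustration; a full tree repeats this step inside each group, and if it keeps going until every training point has its own leaf, it has memorized the training set.

```python
def best_split(xs, ys):
    """Find the question "is x <= threshold?" that best separates the ys."""
    def sse(values):
        # Sum of squared errors around the group's own mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    # Skip the largest x so that the right-hand group is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_split(xs, ys)
print(threshold)  # 3
```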

One method to reduce overfitting error is using a random forest.

Random Forests

Random forests combine many decision trees to enhance accuracy and performance. Each tree in the random forest makes its own prediction, and the class with the most votes (for classification) or the average of the predictions (for regression) becomes the model’s prediction.
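Combining the trees’ answers is the simple part. A small sketch, with hypothetical helper names and invented predictions:

```python
from collections import Counter
from statistics import mean


def aggregate_votes(predictions: list[str]) -> str:
    """Classification: the class predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def aggregate_average(predictions: list[float]) -> float:
    """Regression: the forest's prediction is the mean of the trees' predictions."""
    return mean(predictions)


print(aggregate_votes(["high", "low", "high"]))  # high
print(aggregate_average([2.0, 3.0, 4.0]))  # 3.0
```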

Random forests use the bagging method to combine many decision trees and create an ensemble.

And… what is bagging?


Bagging (or bootstrap aggregation) trains each decision tree on a random sample of the dataset, drawn with replacement. Instead of seeing every data point, each tree sees only part of the data. Individual trees make decisions based on their own selection of data points and predict outcomes solely from those points.

Within a random forest, each tree trains on different data and uses different features to make decisions. This variation acts as a buffer: the errors of individual trees tend to cancel out, which reduces incorrect predictions.

Each tree’s bootstrap sample contains only about two-thirds of the distinct data points, so the remaining third (the “out-of-bag” points) can be used as a test set for that tree.
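The two-thirds figure comes from sampling with replacement: when you draw n points from n, the expected share of distinct points is about 1 − 1/e ≈ 63%. A quick simulation shows it (the sample size and seed are arbitrary choices, and `bootstrap_sample` is a hypothetical helper):

```python
import random


def bootstrap_sample(n_samples: int, rng: random.Random) -> tuple[list[int], list[int]]:
    """Draw n_samples row indices with replacement; return in-bag and out-of-bag rows."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_sample(10_000, rng)
unique_fraction = len(set(in_bag)) / 10_000
# Roughly 0.63 of the rows land in the bag; the rest are left out for testing.
print(unique_fraction, len(out_of_bag) / 10_000)
```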

Random forests are great because they randomize trees' construction to get various predictions.

However, the algorithm must build and evaluate each decision tree independently, making random forests harder to interpret and slower to build.

To address this issue, an alternative to random forests is gradient boosting.

Gradient Boosted Decision Trees

Like random forests, gradient-boosted decision trees (GBDTs) also combine decision trees to enhance prediction accuracy. Instead of using bagging, GBDTs use the boosting method to connect the trees.


Boosting combines learning algorithms to achieve a strong learner from a series of weak learners.

Weak learners are models that perform slightly better than random guessing but aren’t great by themselves (such as simple regression models or shallow decision trees). Strong learners are models that can reach arbitrarily good accuracy.

In GBDT algorithms, the weak learners are decision trees.

Each tree attempts to minimize the errors of the previous trees. While individual trees are weak learners, adding many trees in series makes boosting a highly efficient and accurate model. Instead of bootstrap sampling, each tree fits on a modified version of the initial dataset: the errors (residuals) left over by the trees that came before it.
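Here is a from-scratch sketch of that loop for a single numeric feature, using one-question trees (“stumps”) as the weak learners and squared error as the loss, in which case the residuals are exactly what each new tree should fit. It is an illustration of the boosting idea only, not LightGBM itself, which adds many optimizations on top. The data, the helper names, and the settings (50 trees, learning rate 0.1) are all assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-question tree: predict the mean residual on each side of a threshold."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_trees(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then let each new stump correct the remaining errors."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # The "modified dataset": what the model so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9, 5.0, 5.1]
one_tree = fit_boosted_trees(xs, ys, n_trees=1)
fifty_trees = fit_boosted_trees(xs, ys, n_trees=50)
print(mean_squared_error(one_tree, xs, ys) > mean_squared_error(fifty_trees, xs, ys))  # True
```

The small learning rate means each tree only nudges the prediction, which is why many trees in series are needed.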

Ground truth is the target for training or validating the model with a labelled dataset.


Turns out, the LightGBM model performed well.

The MAE and RMSE were 0.2 and 0.3 log10 units for the training and validation sets, corresponding to the error commonly seen in experiments. The R² was above 0.9, suggesting that the model captures the major factors influencing variation in the IgG titer.
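What does an error in log10 units mean in practice? Assuming the titers were log10-transformed before modelling, as the units suggest, an error of x log10 units corresponds to being off by a factor of 10^x on the original scale:

```python
mae_log10, rmse_log10 = 0.2, 0.3

# Convert errors on the log10 scale into fold-differences on the original scale.
mae_fold = 10 ** mae_log10
rmse_fold = 10 ** rmse_log10
print(f"{mae_fold:.2f}x, {rmse_fold:.2f}x")  # 1.58x, 2.00x
```

In other words, a typical prediction lands within a factor of roughly 1.6 to 2 of the measured titer.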

While the ionizable lipid and LNP formulations for COVID-19 vaccines have been settled, the results showcase the capabilities of machine learning for drug formulation predictions.

It’s only becoming increasingly popular: if you search “machine learning directed drug formulation development,” the list of research papers does not stop — and they’re recent, too.