Machine learning for biological sequence data using Python

Byeungchun Kwon
4 min read · Jun 28, 2021

In this post, I present an ML modeling demo on RNA sequence data that analyzes the similarity of targets (proteins) based on their nucleotide sequences. The aim of this article is not model performance but showing how to use biological sequences in ML models.

To a software engineer, an RNA (cDNA) sequence is just a string over a four-letter alphabet (A, T, C, G), and sequence lengths vary widely. We need to convert this string into a numeric vector that represents its attributes. One popular featurization method is the k-mer.

A k-mer is a substring of a sequence, and k is the substring length. For example, the 1-mers are A, T, C and G, and the 2-mers are AT, AC, AG and so on. The number of possible k-mers for an RNA sequence is 4^k.

Fig. 1. mRNA sequence featurization with the k-mer method

For example, take an RNA sequence of 9 nucleotides (AATCGCGCT) with k = 3. There are 64 (4^3) possible 3-mers, and the sequence contains 7 of them (9 characters − 3 + 1) (Fig. 1). Only 6 of those 7 are distinct, because CGC occurs twice.

RNA (cDNA) k-mer subset generation
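As a minimal sketch of k-mer generation, a sliding window over the string is enough. The function name `kmers` is my own choice for illustration and is not taken from any library.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


subsets = kmers("AATCGCGCT", 3)
print(subsets)            # ['AAT', 'ATC', 'TCG', 'CGC', 'GCG', 'CGC', 'GCT']
print(len(subsets))       # 7 k-mers in total
print(len(set(subsets)))  # 6 distinct k-mers, because CGC occurs twice
```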

The frequency of each k-mer in a sequence can serve as a numeric input vector for an ML model. Let’s calculate the k-mer frequencies for the 5'UTR sequence in Fig. 1, AATCGCGCT.

k-mer frequency calculation
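One simple way to count k-mers, assuming plain Python with the standard library, is `collections.Counter`. The helper name `kmer_counts` is hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = kmer_counts("AATCGCGCT", 3)
print(counts["CGC"])  # 2
print(counts["AAT"])  # 1
print(counts["AAA"])  # 0, a Counter returns 0 for k-mers it never saw
```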

If we have hundreds of mRNA sequences of varying lengths, raw k-mer counts are not comparable as ML input vectors: a longer sequence gets larger counts simply because it contains more k-mers. Instead, it is better to use each k-mer's share of all k-mers in the sequence. Let’s apply this concept to real target data.
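To make the point about length concrete, here is a small sketch that compares raw counts with ratios for a short sequence and a ten-fold repeat of it. The repeated sequence is an artificial example, and `kmer_ratios` is a hypothetical helper name.

```python
from collections import Counter


def kmer_ratios(sequence: str, k: int) -> dict[str, float]:
    """Return each k-mer's share of all k-mers in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


short = "AATCGCGCT"
longer = short * 10  # artificial longer sequence, for illustration only

for name, seq in (("short", short), ("longer", longer)):
    ratios = kmer_ratios(seq, 3)
    # Raw k-mer totals grow with length, but the ratios always sum to 1.
    print(name, len(seq) - 3 + 1, round(sum(ratios.values()), 6))
```

Because every ratio vector sums to 1, sequences of different lengths end up on the same scale.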

In this post, we estimate drug-target similarity using K-means clustering, an unsupervised ML algorithm. We can retrieve the drug-target list from DrugCentral, as I showed in my previous post.

In the drug-target list, the ACCESSION column holds the UniProt protein accession number for each target. We can map each accession number to an Ensembl gene ID through the UniProt API service, and then retrieve the mRNA coding sequence data for that gene ID from the Ensembl API service.

Nucleotide sequence retrieval for target protein

Now we prepare the sequence data. We follow the four steps below to run the K-means clustering algorithm.

  • Step 1: generate target (protein) list
  • Step 2: download target sequences
  • Step 3: convert the sequence to k-mer frequency distribution vector
  • Step 4: execute ML model

Step 1: generate target (protein) list

This step downloads the drug-target file and extracts the list of targets. There are 769 targets in the file as of 27 June 2021.

Unique target list on DrugCentral site
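A rough sketch of the extraction step, assuming the downloaded file is tab-separated and has an ACCESSION column. The sample rows, the DRUG_NAME column, and the accession values below are invented placeholders rather than real DrugCentral data, and `unique_targets` is a hypothetical helper.

```python
import csv
import io

# Placeholder content standing in for the downloaded drug-target file.
# Only the ACCESSION column name comes from the article; the rest is made up.
SAMPLE_TSV = (
    "DRUG_NAME\tACCESSION\n"
    "drug_a\tP00001\n"
    "drug_b\tP00002\n"
    "drug_c\tP00001\n"
)


def unique_targets(handle, column="ACCESSION"):
    """Return the distinct accession values, in order of first appearance."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = {}
    for row in reader:
        accession = (row.get(column) or "").strip()
        if accession:
            seen.setdefault(accession, None)  # dict keeps insertion order
    return list(seen)


targets = unique_targets(io.StringIO(SAMPLE_TSV))
print(targets)  # ['P00001', 'P00002']
```

For the real file, the same function could be given an open file handle instead of the in-memory sample.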

Step 2: download target sequences

Using the UniProt API service, this step retrieves the Ensembl gene ID for each target accession number. It then retrieves the mRNA coding sequences for each gene using the Ensembl API service. Because some genes have multiple isoforms, a single gene ID can yield multiple mRNA sequences.
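The Ensembl half of this step might look roughly like the sketch below. It assumes the public Ensembl REST service at rest.ensembl.org and its `sequence/id` endpoint with the `type=cds` and `multiple_sequences=1` parameters, and it assumes the JSON response is a list of records with `id` and `seq` fields; please check these details against the current Ensembl REST documentation before relying on them. The UniProt mapping is left out, and the gene ID in the example is a placeholder.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed base URL of the Ensembl REST service


def cds_url(gene_id: str) -> str:
    """Build the request URL for all coding sequences of one gene ID."""
    # multiple_sequences=1 asks for every transcript (isoform) of the gene.
    query = urllib.parse.urlencode({"type": "cds", "multiple_sequences": 1})
    return f"{ENSEMBL_REST}/sequence/id/{urllib.parse.quote(gene_id)}?{query}"


def fetch_cds(gene_id: str, timeout: float = 30.0) -> dict[str, str]:
    """Download coding sequences; returns {sequence id: nucleotide string}."""
    request = urllib.request.Request(
        cds_url(gene_id), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        records = json.load(response)  # assumed shape: list of {"id": ..., "seq": ...}
    return {record["id"]: record["seq"] for record in records}


# Placeholder gene ID; no network request is made here.
print(cds_url("ENSG00000000001"))
```

Calling `fetch_cds` for each gene ID would then give one entry per isoform, which is why the number of sequences can exceed the number of genes.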

Step 3: convert the sequence to k-mer frequency distribution vector

This step assumes k = 3 and uses each k-mer's ratio within the sequence as its frequency.
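A possible sketch of this conversion is shown below. It fixes the order of all 64 possible 3-mers so that every sequence maps to a vector with the same layout. The two input sequences are toy examples rather than real targets, and ignoring windows that contain letters other than A, C, G, T is my own assumption.

```python
from collections import Counter
from itertools import product

K = 3
# All 4^3 = 64 possible 3-mers in a fixed, reproducible order.
VOCABULARY = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_vector(sequence: str, vocabulary=VOCABULARY) -> list[float]:
    """Map a sequence to its k-mer ratio vector, one entry per vocabulary k-mer."""
    k = len(vocabulary[0])
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalise over known k-mers only, so windows with other letters are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


sequences = ["AATCGCGCT", "ATGGCCATTGTAATGGGCCGC"]  # toy sequences
matrix = [kmer_vector(s) for s in sequences]
print(len(matrix), len(matrix[0]))  # 2 64
```

Each row of `matrix` has 64 entries regardless of the sequence length, which is the shape a clustering algorithm expects.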

Step 4: execute ML model

We have 524 sequences as the training dataset and 64 features for each sequence. Using the scikit-learn library, we can build a pipeline and run an ML algorithm.
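A minimal sketch of such a pipeline follows. The scaling step, the number of clusters (4), and the random stand-in data are all my assumptions for illustration; the article does not state which steps or settings were used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Random stand-in for the real 524 x 64 matrix; each row sums to 1,
# like a k-mer ratio vector. These are NOT the article's data.
X = rng.dirichlet(np.ones(64), size=524)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # assumed preprocessing step
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),  # 4 is an arbitrary choice
])

labels = pipeline.fit_predict(X)  # one cluster label per sequence
print(X.shape, labels.shape)
print(np.bincount(labels))        # number of sequences per cluster
```

With the real feature matrix in place of `X`, `labels` would assign each target sequence to a cluster.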

Using the K-means clustering result, we can generate a chart that visualizes the multi-dimensional features.

Pairwise feature scatterplot with k-means clustering

In this post, I focused on featurizing biological sequences for ML. The k-mer approach is useful for capturing the composition of a sequence, but it does not take into account where each k-mer sits in the sequence, so it cannot reflect effects that depend on how positions are linked to one another.
