Transposon recognition by machine learning methods



NATIONAL RESEARCH UNIVERSITY

HIGHER SCHOOL OF ECONOMICS

Faculty of Business and Management

School of Business Informatics

Shein Alexander Vladimirovich

Transposon recognition by machine learning methods

MASTER'S THESIS

Field of study: Business Informatics

Degree programme: Big Data Systems

Moscow 2019

Academic Supervisor

Dr. Maria S. Poptsova

Abstract

The role of 3' UTR stem-loop secondary structures in retrotransposition has been experimentally shown for mobile genetic elements of various species in which LINE and SINE retrotransposons share the same 3' UTR sequences containing a stem-loop. In this work, the properties of 3'-end stem-loops of human L1 and Alu elements were investigated. These elements do not share sequence similarity, but all of them carry 3' UTR stem-loops. Two types of machine-learning models were built, a sequence-based one and a structure-based one, in order to recognize 3'-end L1 and Alu stem-loops with high accuracy. The sequence-based models rely only on sequence statistics and capture the compositional bias of the 3'-ends. The structure-based models take into account chemical, physical and geometrical characteristics of dinucleotides in a stem and position-specific nucleotide features of a loop and a bulge. The most significant parameters include shift, rise, tilt, and hydrophilicity. The obtained results point to the existence of structural constraints on the 3' UTR stem-loops of L1 and Alu, which are probably required for retrotransposition.

Keywords: Machine Learning, Big Data, Bioinformatics, Retrotransposons, Data Analysis, Transposons, Stem-loop, Secondary Structures, Supervised Learning, Random Forest, LINE 1, Alu, Biotechnologies


Acknowledgements

I would like to acknowledge the support and great intellectual contribution of my academic supervisor, Dr. Maria Poptsova, to this Master's Thesis, and to express my gratitude for her guidance during the development of this study.

Alexander Shein

Moscow, Russia 2019

List of Figures

  • Figure 1. A) Different vector system contributions to clinical trials that took place in 2016. B) The most popular vector systems employed in clinical trials.
  • Figure 2. Mechanism of DNA repair after the cleavage by the CRISPR-Cas9 complex
  • Figure 3. Possible outcome of non-homologous end joining caused by CRISPR-Cas9
  • Figure 4. Transposon classes distinguished by transposition intermediate steps
  • Figure 5. Simplified view of the main consequences of transposon affecting splicing of a gene.
  • Figure 6. RNA Stem-loop structure example.
  • Figure 7. Random Forest principle.
  • Figure 8. 5-fold cross validation.
  • Figure 9. Stem-loop annotation details. A) Encoding loop (LP) coordinates by binary features. B) Geometrical parameters used for stem annotation C) Stem-loop structure coordinates
  • Figure 10. Receiver Operating Characteristic plots of 50 base pairs based models
  • Figure 11. Precision-Recall of 50 base pairs model
  • Figure 12. Feature importance heatmap of 50 base pairs model
  • Figure 13. Receiver Operating Characteristic of stem-loop sequence based model.
  • Figure 14. Precision-Recall of stem-loop sequence based model.
  • Figure 15. Feature importances heatmap plot of stem-loop sequences based models.
  • Figure 16. Receiver Operating Characteristic of physical, chemical and structure properties-based models.
  • Figure 17. Precision-Recall of physical, chemical and structure properties-based models.
  • Figure 18. Feature importance heatmap of physical, chemical and structure properties-based models.
  • Figure 19. Top-10 most important features of physical, chemical and structural property-based models.
  • Figure 20. Top-10 most important features of physical, chemical and structural property-based models.
  • Figure 21. Top-10 most important features of physical, chemical and structural property-based models mapped to positions of stem-loops.
  • Figure 22. Most important features of physical, chemical and structural property-based models mapped to positions of stem-loops heatmap plot.

List of Tables

Table 1. Recognition of 3'-ends and 3'-end stem-loops of L1 and Alu sequences.

List of Algorithms

Algorithm 1. Random Forest

List of contents

  • Abstract
  • Acknowledgements
  • List of Figures
  • List of Tables
  • List of Algorithms
  • List of contents
  • INTRODUCTION
  • 1 Big data in bioinformatics
    • 1.1. Emergence of Big Data
    • 1.2. Big Data in Bioinformatics
    • 1.3. Molecular medicine and gene therapy
    • 1.4. DNA and RNA Mobile Genetic Elements
    • 1.5. RNA secondary structures
    • 1.6. Possible applications of transposons
    • Summary
  • 2 MACHINE LEARNING APPLICATIONS TO ANALYSIS OF GENOME DATA
    • 2.1. Machine Learning and Bioinformatics
    • 2.2. Machine Learning Applications for Classification problem
    • 2.3. Random Forest Algorithm
    • 2.4. Cross Validation
    • 2.5. Software tools used
    • Summary
  • 3 Transposable elements recognition PIPELINE
    • 3.1. Used data
    • 3.2. Stem-loop secondary structures annotation pipeline
    • 3.3. Machine learning pipeline
    • Summary
  • 4 Experiment results
    • 4.1. Overview
    • 4.2. 50 base pairs composition-based models
    • 4.3. Stem-loop statistical models
    • 4.4. Physical, chemical and structure properties-based models
    • 4.5. Combined results analysis
    • Summary
  • DISCUSSION
  • CONCLUSION
  • Bibliography
  • Appendix. Scripts for merging tables

INTRODUCTION

Bioinformatics, or computational biology, is a relatively new field of study which is closely related to medicine. It emerged some 30 years ago and gained a boost of popularity along with the development of the Human Genome Project. A BIS Research report claims that the "global precision medicine market is expected to be $141.7 billion by 2026" [1].

Bioinformatics has become a fast-growing field with many scientists involved and large investments present, hence more and more research will be related to it. The results of this research can affect everybody's lives: for example, new-generation medications based on RNA interference could inactivate mutated genes that would otherwise lead to the development of cancer, and longevity studies could suggest methods of increasing the human lifespan. This field already implies processing large amounts of data with specific techniques in order to make the work efficient, and biological data is processed by special algorithms and approaches.

Today genome-editing technologies attract a lot of attention because they can have a dramatic impact on our lives: genomes of different agricultural organisms are already being modified in order to add resistance to pathogens or to increase fertility. It is theoretically possible to cure diseases by replacing broken genes and endogenous viruses.

Since Barbara McClintock's discovery of the so-called "jumping genes", there has been great interest in these genomic elements. The existence of these transposable elements (transposons) has been demonstrated both in vitro and in vivo. They can be subdivided into two groups by the mechanism of transposition: cut-and-paste (DNA) transposons and copy-and-paste transposons (RNA transposons, retrotransposons). The former are already being tested as an instrument to deliver genes into a cell's genome, but retrotransposons may become an even more efficient tool for this purpose, because they can produce many copies of themselves. The problem is that the exact mechanism of retrotransposition remains unclear.

This study is devoted to the most common human retrotransposons, L1 and Alu, with the aim of understanding the mechanism of transposon RNA recognition by the reverse transcriptase. The properties of retrotransposons are evaluated with a machine-learning approach. The study consists of creating a data preprocessing and machine learning pipeline which was applied to poorly structured biological data. The computational part of this study includes annotation of sequences with stem-loop structures and extraction of physical, chemical and geometrical features in order to build a machine-learning model with interpretable predictors.

The goal of the present research is to understand the mechanism of retrotransposition, which will be helpful in the design of experimental genome-editing systems. The tasks required to achieve this goal include:

- extracting L1 and Alu sequences from the human genome;

- annotating sequences with stem-loop structures;

- characterizing stem-loop structures with physical, chemical and geometrical properties;

- constructing machine learning models trained to separate stem-loop structures at the 3'-end of human L1 and Alu mobile genetic elements from stem-loops of shuffled sequences;

- extracting the most important properties of stem-loop structures at the 3'-ends of L1 and Alu retrotransposons.

1. Big data in bioinformatics

1.1 Emergence of Big Data

Humanity is constantly gathering and storing new information. The evolution of technology allows creating more and more sophisticated methods of retaining data, which results in exponential growth of the volume of stored data. For example, "In 2011 1.8 zettabytes of data had been created" [2], and there are forecasts that the volume of data will increase 50 times by 2020 [2].

The capabilities of information processing are developing very fast, so some studies which required the power of a mainframe twenty years ago can now be completed on a personal computer. The community is turning towards scalable approaches such as the utilization of cloud resources, which further increases the possibilities of cheap and effective computing. All these factors help to conduct research that requires a lot of powerful computation, including research related to bioinformatics.

Big Data has a great impact on the economy: it is present in almost every sphere of human life. It provides a lot of different challenges, like building efficient storage and processing pipelines, increasing computational power, and inventing new algorithms. Today's market requires a lot of specialists in this field and offers many different opportunities.

1.2 Big Data in Bioinformatics

Bioinformatics is a new interdisciplinary field of science which uses different methods and software for processing and understanding biological data. It brings together areas of human knowledge such as mathematics, physics and computer science for information processing. This recently developed scientific field aims at the integration of heterogeneous data analysis and the generation of experimentally testable hypotheses. It has many different applications, such as disease prediction based on genome sequencing, understanding evolutionary relations between different species, and inventing new medicines, e.g. antibiotics. Bioinformatics has traditionally been used in molecular biology, especially in dealing with data related to genomics.

Since the discovery of the DNA structure by Watson and Crick [3], the scientific community has been interested in decoding "the code of life" - sequencing the genome of a living organism. It was believed that this would explain the causes of diseases and the details of heredity. Solving the task of reading a whole genome of a mammal took almost half a century and was accomplished in 2002 [4]. This delay may be explained if we take into account several facts:

· Processing a raw DNA molecule to obtain its genetic code as a sequence of letters is a very complex task which requires sophisticated tools.

· The full genome of an animal requires a lot of disk space to be stored; e.g., the size of the human genome is about 3,000,000,000 base pairs.

Processing that data is an even more complex task. Genomic sequences are unstructured: coding and non-coding regions are hard to distinguish at first glance. It took a lot of effort and many different studies to isolate genes from the other, non-coding parts of the genome.

Also, research has revealed that some genomic elements occur in extremely large numbers: for example, the family of mobile genetic elements called Alu accounts for approximately 11% of the whole human genome, with approximately a million such sequences in everybody's genome [5].

Hence, we can conclude that processing such data requires powerful computers, sometimes even cloud-based distributed solutions, and sophisticated state-of-the-art algorithms.

Bioinformatics is a rapidly developing field of science with big potential, and it is popular among people of different professions. Increasingly, big genomic data sets are being used in biotechnology companies, drug firms and medical centers, which also have specific needs. It is becoming clear that biology is turning into a data-driven science, and future breakthroughs will depend on strong collaborations between experimental and computational biologists. Biologists will need to adapt to the data-driven nature of the discipline, and the training of future researchers is likely to reflect these changes as well. Aspects of computational biology are integrating into all levels of medicine and health care, and the usage of modern computational tools affects the further development of new ones, because manufacturers have to make tools compatible and operational in different environments. Given that big-data analysis in biology is incredibly difficult, open science is becoming increasingly important.

1.3 Molecular medicine and gene therapy

Molecular medicine is a new trend of the high-tech age. It is a broad field representing a fusion of biology, bioinformatics, physics, chemistry and medicine. Molecular medicine aims at finding the molecular and genomic issues which cause diseases and at developing interventions to fix them. It provides complex treatments that may cure diseases by engineering cell genomes. Even though it has some drawbacks, such as the requirement of stable genomic integration methods, it can achieve significant results in curing diseases that were previously perceived as very hard to treat.

One of the key features of this approach is that it provides more and more sources of data related to any patient's specific condition. Medical practitioners obtain sophisticated tools to make a fact-based diagnosis. This leads to a comprehensive analysis of the patient's situation based not only on symptoms but also on DNA-array-based prognostic screening, which results in treatments that correspond best to the patient's condition.

Another feature of molecular medicine is the idea that some diseases could be treated not only with drugs and other previously used approaches but also with so-called gene therapy. It is a new way to deal with gene malfunctions. Modern biomedical technologies provide powerful tools to intervene in a cell's metabolism and heredity.

This field of study is developing at an impressive speed; for example, there were more than 2500 gene therapy clinical studies [6] back in 2016 (see Figure 1).

Figure 1. A) Different vector system contributions to clinical trials that took place in 2016. B) The most popular vector systems employed in clinical trials.

As we can see, back in 2016 the vast majority of gene delivery systems approved for clinical trials were viral vector-based. Viral vectors are virus-derived tools for delivering material into cells. Around ten years ago the main challenge of gene therapy was not a lack of therapeutic genes, but the paucity of efficient gene delivery systems [7]. Viruses are naturally very efficient genetic information "couriers", since this capability is essential for their own replication. Engineering them usually implies removing the parts of the viral genome related to replication. This is a good alternative to manually inserting naked DNA directly into cells, because the latter method may damage or even kill target cells and is less stable and sustainable.

The viral vector DNA delivery method implies the usage of specifically modified viruses with the needed DNA sequences cloned into the viral genome and the parts related to replication removed. The latter ensures that the virus is no longer virulent. Vectors are replicated in laboratory conditions and then delivered to the organism (cells), and, depending on the virus type, they may infect up to 100% of the cell population. Thus, this method has proven to be a highly efficient way to deliver genetic material.

However, even vector-based methods have some solid drawbacks. Firstly, they can induce an immune response of the patient against vector-encoded proteins and hence lose efficiency [6], [7]. This also results in some therapy-related adverse effects [6]. Another drawback is that large genes integrated into viral vectors dramatically reduce the efficiency of viral packaging [8]. Another weak point is the sustainability of viral plasmids in eukaryotic cells: only a small fraction of viruses, called retroviruses, insert their genetic material into the host cell's genome, and HIV is a well-known example of this group. Plasmids can be maintained for a long time only in prokaryotes. This fact limits the possible carrier options to two viral families (lenti-/retro-). "However, the utility of retro-/lenti- viral vectors is heavily restricted by the size of genes" [8]. There could be problems of another origin as well: a patient could have preexisting cellular immunity against such a virus, which would make the method inefficient [9].

There is another trending approach to editing a cell's genome, the CRISPR-Cas9 system. CRISPR stands for "Clustered Regularly Interspaced Short Palindromic Repeats". This mechanism was derived from the bacterial adaptive immune system. The method is based on the ability of the CRISPR-Cas9 complex to cut DNA matching a provided template. This results in target DNA cleavage followed by repair by the cell's special machinery. It is then possible to supply a repair template with a desired gene in it, which results in repairing the original cell's genome along with the pattern sequence (see Figure 2).

Figure 2. Mechanism of DNA repair after the cleavage by the CRISPR-Cas9 complex

One of the problems associated with the CRISPR-Cas9 system is that the repair of double-strand breaks in DNA is error-prone and can result in "deletions or insertions leading to loss of function" (see Figure 3) [10]. Another challenge is to increase the specificity of this method: off-target modifications could occur, which may lead to malfunction of other genes [10].

Figure 3. Possible outcome of non-homologous end joining caused by CRISPR-Cas9

There is another group of possible genome-editing tools called zinc-finger proteins (ZNFs). They are an abundant group of proteins with a wide range of molecular functions [11]. The first ZNF was discovered in the late 1980s and was capable of binding specific sequences of DNA. The zinc finger is one of the most frequently used sequence-specific DNA-binding motifs found in eukaryotes. It is possible to exploit this functionality using proteins containing several ZNF domains, each one recognizing a specific triplet of DNA letters. Fusing this recognition module with a sequence-independent endonuclease was the first successful strategy to introduce breaks at specific sites of genomic DNA [12].

Even though this technology provides a way to edit the genome, it introduces some drawbacks as well: it is a complex task to construct a complex of several ZNF domains which will recognize the desired DNA sequence.

As we can see, modern genome-editing technologies have drawbacks, and there is no ready-to-go ultimate solution. But together they provide several alternative methods to modify genomes, and hence each one could be substituted by another if required. Molecular medicine in general and gene therapy in particular are very important and promising fields of study, and more and more new research should take place.

1.4 DNA and RNA Mobile Genetic Elements

Mobile genetic elements are pieces of DNA which are capable of multiplying and changing their positions inside a genome. There is strong evidence that they have accumulated in different species and that at least some of them are still active [13]. They can be found in almost every genome, with very rare exceptions, as is confirmed by the increasing amount of whole-genome sequencing data available. Transposable elements occupy significant parts of the genomes of different organisms (46% of the human genome, 40% of the rat genome, 85% of the corn genome) [13].

The human genome harbors significant quantities of mobile genetic elements, some of which are still active, and sometimes they can act as a "tool" of evolution: they modify existing genes by making insertions into them, or change the expression of genes by breaking promoters or by "jumping" nearby.

Generally speaking, there are two distinct types of transposons, which are identified on the basis of their mechanism of displacement: DNA transposons and RNA transposons (retrotransposons). The key difference between them is the mechanism of transposition: the former move by a cut-and-paste method and the latter use a copy-and-paste method (see Figure 4). Both classes are subdivided into families and subfamilies by sequence similarity, structure or transposition mechanisms.

Figure 4. Transposon classes distinguished by transposition intermediate steps

Retrotransposons use an RNA intermediate in the process of transposition - i.e. they are transcribed from DNA into mRNA and then reverse-transcribed back into DNA. The last step is the insertion of the newly built sequence into the organism's genome.

DNA transposons, on the other hand, are not reverse-transcribed: they are moved, not copied. This strongly affects their replication mechanism - they are duplicated only through "indirect mechanisms that rely on the host machinery" [14].

Transposable elements were considered "junk" DNA for a long time, but now they are believed to play a significant role in a cell's life. They can take part in gene expression regulation using their own promoters, affect alternative splicing (see Fig. 5), and even trigger chromosome rearrangements [15]. In some cases they play a significant role in diseases. Transposon activity can lead to interruption of a gene sequence by the jumping gene itself, which could possibly lead to the gene's malfunction. It can also result in deletions or chromosomal rearrangements [15].

Figure 5. Simplified view of the main consequences of transposon affecting splicing of a gene.

This study considers RNA transposable genetic elements, because they are the only transposons that are still capable of displacement within the human genome (there are experiments that prove this by showing de novo insertions) [16]. There are three families of human RNA mobile genetic elements that are currently active in vivo: long interspersed nuclear elements class 1 (L1); Alu, which was first associated with the action of a restriction endonuclease (gene-cutting protein) from the bacterium Arthrobacter luteus, hence the name; and SVA (SINE-VNTR-Alu) elements, which are somewhat similar to endogenous retroviruses [17]. The latter transposon family is out of the scope of this study.

For some species it was previously shown experimentally that replication of transposable elements highly depends on RNA secondary structures - stem-loops - located at the end of the transposon sequence [18]. It was also shown in silico that stem-loop structures are present in almost all Alu and L1 mobile genetic elements in the human genome [19]; thus, these structures may be crucial for retrotransposition of an even wider range of retrotransposons.

1.5 RNA secondary structures

Nucleic acid secondary structures are the forms into which RNA or DNA can fold through base-pairing interactions. RNA secondary structures differ from DNA structures because DNA exists mostly as a double helix, in contrast to RNA, which is mostly represented as a single strand. Also, RNA is more likely to form complex and intricate base-pairing interactions because of its increased ability to form hydrogen bonds due to the extra hydroxyl group located on the ribose sugar.

Stem-loop secondary structures, also known as hairpins or hairpin loops, are base-pairing patterns that commonly occur in single-stranded RNA (see Fig. 6). These structures are common in all genomes; there are studies related to the discovery of non-random stem-loop patterns in prokaryotic [20] and eukaryotic genomes.

Figure 6. RNA Stem-loop structure example.

This study includes stem-loop annotation of transposon sequences in order to extract more interpretable features.

1.6 Possible applications of transposons

Transposons are the subject of much research. Scientists tend to be interested in this topic because of the possible outcomes: biological mechanisms that are capable of changing a cell's genome in vivo are extremely powerful and may be harnessed as tools to engineer cell genomes [21].

Newly discovered mobile genetic elements may be used to achieve different results. We can consider the Sleeping Beauty DNA transposable element as an example: it originates from fish species and has some interesting features. Firstly, it was reconstructed from an ancient and currently inactive transposon found in the fish genome [8]. Secondly, it was modified to increase its efficiency by 100 times [8]. Now it is often used in gene therapy as a non-viral tool for delivering and inserting DNA into target cells.

Retrotransposons naturally take part in some genome-rearrangement processes in living organisms. For example, it is known that LINE-1 (Long Interspersed Nuclear Element 1) can promote gene duplication [22]. Generally speaking, retrotransposons play a significant role in cells: they may provide a substrate for DNA repair systems during nonallelic homologous recombination [22], alter gene regulation, and take part in cell regulatory pathways. It is known that up to 63% "of primate-specific regulatory sequences are derived from transposable elements" [23].

Such elements should not be ignored by science because of the abundance of possible applications. RNA transposable elements are not yet used as genome-editing tools, but it is clear that understanding the way they work may give us the possibility to exploit them and widely affect the genomes of living organisms.

Summary

As we can conclude, there are several existing genome-editing techniques which already include the usage of mobile genetic elements - DNA transposons. All of the discussed methods are associated with some risks and drawbacks. There is a need for further research in order to increase the efficiency of those methods and possibly invent new ones. Retrotransposon machinery may be a promising field of study because retrotransposons are naturally active in the human genome, hence it is already proven that they are capable of modifying human genetic sequences in vivo.

2. MACHINE LEARNING APPLICATIONS TO ANALYSIS OF GENOME DATA

2.1 Machine Learning and Bioinformatics

Machine learning is a broad scientific field related to algorithms and statistical models. It is a fusion of optimization methods, mathematical statistics and classical mathematical disciplines. The field has its own challenges, like improving prediction accuracy and increasing computational efficiency.

The scope of tasks it solves is continuously expanding. Generally speaking, it is a subset of artificial intelligence. Machine learning is widely used in numerous spheres: economics and finance, health care, transport, production, business, etc. The main tasks are forecasting and classification, where models are trained on historical data or examples.

There are several biological domains where machine learning techniques are applied for knowledge extraction from data.

Machine learning is widely used in different domains of bioinformatics. One of the key directions of further research is the creation of tools and methods capable of transforming all this heterogeneous data into solid biological knowledge [24].

There are two distinct classes of machine learning algorithms: supervised and unsupervised. Unsupervised algorithms are designed to search for patterns in data or to cluster it without knowledge of the number of classes; usually these models require a lot of data. Supervised algorithms are quite different: their key purpose is to find a function that maps data to a class label or a value. The interpretability of a supervised model's output is higher than that of an unsupervised one, because the latter searches for previously unknown patterns while the former is built to find the parameters of a predictor function.

Machine-learning models solve two types of problems: regression and classification. A regression model generally finds relations between one dependent variable and one or more independent variables. A classification model determines which class a sample belongs to.

2.2 Machine Learning Applications for Classification problem

There is a wide range of quite different approaches that could be applied to classification tasks. Some of them are:

· Logistic regression classifiers, which are generally built on a statistical method that maps predictors onto a dichotomous variable (one with only two possible outcomes). It is possible to do multiclass classification using several such models with the one-vs-all method.

· Naïve Bayes classifiers, which are based on Bayes' theorem. These models are built under the assumption that all predictors are independent. These classifiers are easy to build and especially useful for large amounts of data.

· Support Vector Machines. These models are discriminative classifiers based on the principle of finding the best separating hyperplane. In other words, given labeled training data (supervised learning), this classification algorithm outputs an optimal hyperplane which categorizes new examples with the smallest error. One of the key drawbacks of this model is that in its basic form it is only suitable for binary classification.

· Decision trees - a model which breaks down a dataset into smaller subsets step by step, resulting in a tree with nodes and leaves (outcomes). This model may struggle if the relation between predictors and the target variable is complex, and it has a significant drawback: it is prone to overfitting.

· Random forest, an extremely powerful ensemble method built on the idea that several simple models like decision trees, combined together, will do significantly better than any of those models alone. This method is simple to use, robust, and it corrects the decision tree's tendency to overfit.

· Boosting models. Boosting is a machine-learning meta-algorithm that "chains" simple prediction models: examples which were classified incorrectly receive higher weights, so that every next model learns from the mistakes of the previous one. This may add up to very good results compared to any single model.

· Neural networks, extremely powerful algorithms which imply creating a model consisting of artificial neurons connected into layers. They may have an arbitrary layer structure, which allows data scientists to create a model that best fits the stated problem.

· Nearest neighbors. This is a simple classification method that uses the N closest already labeled points to label new points. It is prone to overfitting with small values of N, while with large numbers of neighbors the accuracy of labels degrades towards random guessing.

There is an abundance of different algorithms which may be used for solving classification problems, and they all have strengths and weaknesses. The key idea of this study was to build a robust model that is not prone to overfitting (mainly because there may, and even should, be repeats in the data due to its biological origin - transposons are copied across the genome), so the Random Forest Classifier was chosen.

It has several advantages: it requires minimal hyper-parameter tuning, it does not tend to overfit, it can be parallelized in order to reduce computation time, it is robust to outliers, and it deals well with high-dimensional data, which is important because the processed sequences are annotated with extra features. This classifier can be applied to a wide range of problems; it "has become a major data analysis tool used with success in various scientific areas" [25].

Even though overall interpretability of this machine learning model is limited, it is possible to take several steps to obtain some information about the top features: first of all, after each run we can extract the model's feature importances using the built-in class methods of the scikit-learn Python library. We can also compare the lists of the top most important features from different training-evaluation folds, which yields a list of steadily important features. So, after several iterations we obtain a stable list of features to be considered.
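As a rough illustration of this procedure, the following sketch (illustrative only: the variable names, number of trees and number of top features are not taken from the thesis code; X and y are assumed to be NumPy arrays of features and labels) trains a scikit-learn Random Forest on each fold and keeps only the features that fall into the top-10 of every fold:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def stable_top_features(X, y, feature_names, n_top=10, n_splits=5):
    # Collect the top-n_top feature names from every training fold
    top_sets = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True).split(X):
        model = RandomForestClassifier(n_estimators=500)
        model.fit(X[train_idx], y[train_idx])
        # Indices of the n_top largest feature importances in this fold
        order = np.argsort(model.feature_importances_)[::-1][:n_top]
        top_sets.append({feature_names[i] for i in order})
    # Keep only the features that appear in the top list of every fold
    return set.intersection(*top_sets)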

2.3 Random Forest Algorithm

Random forest was suggested back in 1995 by Tin Kam Ho [26]. This algorithm is based on the idea that several simple algorithms that predict only slightly better than random guessing, the so-called "weak learners", can be combined into a "strong learner" with much better prediction results.

The essence of the Random Forest algorithm is to create multiple trees on random subspaces of the initial feature space, which combines the bootstrap aggregation and random feature selection concepts.

Bootstrap aggregation (bagging) is a machine learning meta-algorithm that resamples the initial training dataset into several smaller samples uniformly and with replacement. Usage of bagging helps to achieve a significant decrease in model variance.
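A minimal sketch of drawing one bootstrap sample with NumPy (illustrative; X and y are assumed to be NumPy arrays of features and labels):

import numpy as np

def bootstrap_sample(X, y):
    # Draw len(X) row indices uniformly and with replacement
    idx = np.random.randint(0, len(X), size=len(X))
    return X[idx], y[idx]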

Random forest implies building a number of relatively small decision trees, and its output is defined by majority voting (see Fig. 7).

Figure 7. Random Forest principle.

Algorithm of Random Forest:

1. define the number of trees as n
2. for i in 1..n:
|   create a bootstrap sample from the training dataset
|   train a decision tree sub-model on the bootstrap sample:
|   for each split:
|   |   select m predictors at random from the original predictor set
|   |   select the best predictor among these m and split the dataset
|   end
|   stop growing the tree when some tree-model stopping criterion is met
end
3. output the class predicted by the majority vote of the n trees

Algorithm 1. Random Forest

This research uses the Random Forest implementation from the Python library scikit-learn and K-fold cross-validation to tune the hyperparameters of the model in order to prevent overfitting.

2.4 Cross Validation

Cross-validation is a statistical technique used to estimate how the results of training a model will generalize to an independent data set. It includes splitting the training dataset into training and testing subsets to perform out-of-sample testing.

Out-of-sample testing helps to ensure that the model does not overfit, in contrast to evaluating a model on its own training dataset.

K-fold cross-validation implies k train-test iterations, so that every part of the dataset is used both as a training and as a testing subset (see Fig. 8).

Figure 8. 5-fold cross validation.

The k-fold implementation from the sklearn Python library was used during the computations in this study.
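A minimal sketch of such a 5-fold evaluation with scikit-learn (the toy data and the number of trees are illustrative placeholders, not the exact thesis configuration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the annotated stem-loop feature table
X, y = make_classification(n_samples=1000, n_features=134)

# Five train-test iterations; every sample is used for testing exactly once
clf = RandomForestClassifier(n_estimators=500)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("mean ROC AUC:", scores.mean())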

2.5 Software tools used

The present research includes a computational part which required a high-level programming language. Python 3.6 was chosen due to the speed and simplicity of software development, the abundance of different libraries, and its popularity in bioinformatics [27].

The following libraries were used during this research:

· BioPython, which is developed as a part of the Open Bioinformatics Foundation (OBF) project. It can be used for specific biological tasks like sequence data processing [28]. This tool is especially handy for working with FASTA sequence files.

· Scikit-learn (sklearn), which contains a lot of different machine learning models built as classes and has numerous utility functions. This package is widely used in bioinformatics [29]; there are numerous studies which include sklearn usage, e.g. for estimation and robust modeling of translation dynamics at codon resolution [30], for processing neural data [31], and for analysis of spinal cord damage [32]. It was used in this study to create and evaluate machine-learning models.

· Pandas - a Python library designed to work with tabular data; it provides "high-performance, easy-to-use data structures and data analysis tools for the Python programming language" [33].

· Matplotlib, a Python plotting library which produces good-quality figures in a variety of formats and interactive environments across all platforms [34]. Matplotlib is often used through Jupyter notebooks, which makes it a very convenient tool.

· Seaborn - a Python data visualization library based on matplotlib [35]. This library provides some plot types that are not present in matplotlib, along with extra style options.

Also, building the data processing pipelines included some tasks that are more conveniently solved directly in the system shell - Bash. This is a command scripting language for UNIX-like operating systems, which can be used for automating routine operations like invoking other programs.

Summary

This chapter is a brief review of the different methods and technical approaches used during this study, including several machine-learning classification methods.


3. Transposable elements recognition PIPELINE

3.1 Used data

The L1 and Alu transposon families were used in this study. Only full-length L1 transposons were selected, taken from the evolutionary study of Khan et al. [36]. This dataset consists of 6622 elements. The selection was done by searching the annotation of the human genome for L1 elements longer than 6 kb.

Also, full-length Alu mobile genetic elements were taken from the study of Price et al. [37]. A set of 12,431 sequences (724 AluS and 11,707 AluY) was considered in this research.

Shuffled sequences were used as the alternative class. They were obtained with a dinucleotide shuffling method that preserves dinucleotide frequencies. This prevents the models from relying on low-level statistics of genomic regions [38].

The present study includes experiments with the following classes of sequences:

· L1 3'-end untranslated regions (L1 3' UTR).

· L1 5'-end untranslated regions (L1 5' UTR).

· Alu 3'-end untranslated regions (Alu 3' UTR).

· Shuffled sequences.

3.2 Stem-loop secondary structures annotation pipeline

This research includes building a robust pipeline used for two annotation steps:

· Sequence annotation by stem-loop structures using DNA Punctuation software [39].

· Annotation of stem-loops by physical, chemical and structural features, taken from DiProDB (Dinucleotide Property Database) [40].

The DNA Punctuation stem-loop annotation algorithm is designed to search for base-wise complementary parts of sequences of some pre-defined length, allowing up to a user-defined number of mismatches. The following parameters were used: a stem length range of 10-20 bp, a maximum loop length of 10 bp, and up to 5 bp of mismatches allowed in the stem. For the 3'-end set of stem-loops, we took stem-loops located within the last 50 bp of the sequences.

Two categories of features were extracted:

· Sequence-based: frequencies of di- and trinucleotides, counting occurrences of each k-mer in a window moving with a 1 bp step along the sequence (a counting sketch follows this list).

· Structure-based: RNA dinucleotide properties from DiProDB [40], which include the structural parameters shift, slide, rise, tilt, roll and twist, and physical and chemical properties such as enthalpy, entropy, free energy, and hydrophilicity.
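A minimal sketch of this k-mer counting step (the function and the example sequence are illustrative, not the thesis implementation):

from collections import Counter

def kmer_frequencies(seq, k):
    # Count every k-mer in a window moving along the sequence with a 1 bp step
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Normalize raw counts into frequencies
    return {kmer: n / total for kmer, n in counts.items()}

# Di- and trinucleotide frequencies of one sequence combined into a single feature dict
features = {**kmer_frequencies("ACGTACGTTT", 2), **kmer_frequencies("ACGTACGTTT", 3)}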

The stem-loop structure annotation covered the first 10 bp of the stem, counting from the loop, which were split into 9 dinucleotides (see Figure 9). Each dinucleotide was characterized by 10 parameters: 6 structural and 4 physical and chemical ones. The minimum loop length was considered to be 5 base pairs. The loop was represented by a position-specific nucleotide binary vector where each loop position was annotated by the 4 possible nucleotides (see Figure 9A); thus, the loop was represented by a 20-dimensional binary vector. The bulge size was considered to be up to 3 bp on both stems, giving 6 possible bulge positions: 3 on the left stem and 3 on the right stem. As with the loop property vector, each position was characterized as a binary vector of 4 nucleotides, resulting in a 24-dimensional binary vector for all bulge positions. If a bulge structure in a stem-loop was not present or was shorter than 3 nucleotides, the missing positions were filled with zeros. As a result, the stem-loop annotation algorithm emitted property vectors consisting of 90+20+24=134 parameters.

Figure 9. Stem-loop annotation details. A) Encoding loop (LP) coordinates by binary features. B) Geometrical parameters used for stem annotation C) Stem-loop structure coordinates
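A minimal sketch of the position-specific binary (one-hot) loop encoding described above (the helper name and the fixed number of positions are illustrative):

def encode_loop(loop_seq, n_positions=5):
    # One-hot encode each loop position over the alphabet A, C, G, T;
    # positions beyond the loop length are filled with zeros
    alphabet = "ACGT"
    vector = []
    for i in range(n_positions):
        base = loop_seq[i] if i < len(loop_seq) else None
        vector.extend(1 if base == b else 0 for b in alphabet)
    return vector  # 4 * n_positions = 20-dimensional binary vector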

The data processing pipeline includes the following steps:

· A Bash script for FASTA file preprocessing - cutting off the poly-A sequence tail, if it is present.

· A Bash script for stem-loop annotation - it is used to run DNA Punctuation for raw sequences stored in a single folder as separate files.

· A Python program performing annotation of the DNA Punctuation output with nucleotide pair and triplet occurrence statistics and physical, chemical and geometrical features.

· A Python function for merging the previous step's output into a single .csv file.

The Bash script that was used to cut poly-A sequence tails:

#!/bin/bash
# First arg: directory with input sequence files
# Second arg: directory for the trimmed output files

for filename in $(ls $1); do
    # Length of the trailing poly-A run: offset of the first non-A base pair in the reversed sequence
    POLY_A_END=$(tail -n1 $1$filename | rev | grep -aob '[CTG]\{2,\}' | head -n1 | grep -oE '[0-9]+')
    SEQ=$(tail -n1 $1$filename)
    # Cut the sequence at the position of the last non-A base pair
    echo $SEQ | cut -c1-$((${#SEQ}-$POLY_A_END)) >> "$2$filename"
done;

The following Bash script was designed to process every file in the raw sequences directory:

#!/bin/bash
# First arg: directory with raw sequence files to annotate

LEN=$(ls $1 | wc -l)

# Iterate over all files (in shuffled order) and run the DNA Punctuation stem-loop search on each one
for filename in $(ls $1 | shuf | head -n$LEN); do
    transp/src/cpp/search $1$filename 123.ptt 10 20 0 8 5
done;

The Python annotation logic includes several functions written in a parallelizable way. This optimization is necessary due to the large amounts of data processed. The multiprocessing Python library was used to create several subprocesses:

from multiprocessing import Pool, cpu_count

import pandas as pd  # needed for the resulting dataframe

STREAMS = cpu_count()

def begin_processing(
    path,
    lines,
    omit,
    output_file=None,
    n_lines=0,
):
    # NOTE: log, process_lines, get_chunks and the omitted setup code are defined elsewhere in the module
    . . .

    log.info("Got {0} lines".format(len(data_to_process)))
    chunk_size = len(data_to_process) // STREAMS + 1
    log.info("Processing with chunk_size = {0}. Starting {1} workers".format(chunk_size, STREAMS))

    # Split the data into chunks and annotate them in parallel worker processes
    with Pool(processes=STREAMS) as pool:
        processed_data = pool.map(
            process_lines, get_chunks(data_to_process, chunk_size)
        )

    log.info("Combining results into single dict")
    for chunk in processed_data:
        results += chunk

    log.info("Creating df")
    result_df = pd.DataFrame(results)

This function includes the merging logic and emits the resulting pandas dataframe.

Dinucleotide shuffling logic was used as well. It is invoked by a Bash script, which processes a directory with sequences to shuffle and writes the results into another folder:

#!/bin/bash
# First arg: directory with sequences to shuffle
# Second and third args: components of the output path

LEN=$(ls $1 | wc -l)

for filename in $(ls $1); do
    # Pipe each sequence through the Altschul-Erikson dinucleotide shuffling script
    SEQ=$(tail -n1 $1$filename | tr -d '\n' | $(dirname $0)/../py_scripts/altschulEriksonDinuclShuffle.py)
    echo $SEQ > "$2$3$filename"
done;

echo -ne "\r#Done\n"

The Python script used for merging tables is presented in the appendix. It supports a command line interface and can be used for merging several files.

3.3 Machine learning pipeline

One of the key steps of this research is using a machine learning model for solving the classification problem. The model was built using the Python programming language and Jupyter Notebooks - a simple web application which allows running live code alongside plots and markdown text [41].

The experiment pipelines consist of several steps:

· Data preparation - reading the csv files created by the annotation pipeline.

· True negative and true positive dataset normalization - random sampling of the bigger one in order to make their sizes equal (see the sketch after this list).

· ROC AUC and Precision-Recall metrics calculation in a 5-fold loop to get a better estimate of these values.

· Feature importance extraction.

· Export of metrics and feature importances into csv.
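A minimal sketch of the class balancing step with pandas (the csv file names are hypothetical placeholders for the annotation pipeline outputs):

import pandas as pd

# Hypothetical csv outputs of the annotation pipeline
pos = pd.read_csv("l1_3utr_stemloops.csv")    # true positive class
neg = pd.read_csv("shuffled_stemloops.csv")   # true negative class

# Randomly downsample the bigger class so that both classes have equal size
n = min(len(pos), len(neg))
balanced = pd.concat(
    [pos.sample(n=n).assign(label=1), neg.sample(n=n).assign(label=0)],
    ignore_index=True,
)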

This pipeline structure allows quickly running several experiments by changing the paths to the csv dataset files. The developed software is modular and parallelizable; its components, such as the scikit-learn library, allow running tasks on several processor cores simultaneously.

Also, the pipeline includes visualization steps, which were added in order to check the results. The Matplotlib and Seaborn Python libraries were used for the visualization tasks to create the following plots:

· Receiver Operating Characteristic plot.

· 2-class Precision-Recall plot.

· Feature importance scores plot.

Summary

This chapter reviews two modules of this research: the data preparation and machine learning pipelines. Several submodules of these pipelines are designed to run in multiprocessing mode in order to increase computation speed.

4. Experiment results

4.1 Overview

This study includes several different experiments comparing L1, Alu and shuffled sequences. Three kinds of feature sets were used:

· The first type of machine learning models was trained to recognize the last 50 bp of Alu and L1 sequences against shuffled sequences, based on sequence composition and taking into account di- and trinucleotide frequencies.

· The second type was built on the sequence composition of the raw stem-loop sequences, also taking into account di- and trinucleotide frequencies.

· The third type of models is based on the stem-loop annotation by physical, chemical and structural features.

The experiments were built as pairwise binary classification using the Random Forest machine learning model.

4.2 50 base pairs composition-based models

This feature set was chosen to distinguish the compositional features of transposon tails from those of shuffled sequences.

All six constructed models achieved good performance with Receiver Operating Characteristic Area Under Curve (ROC AUC) >= 97% (see Fig. 10). It was shown that not only are the 3'-ends of both L1 and Alu transposons clearly separable from the other genomic sequences, but L1 3' UTRs can also be distinguished from 5' UTRs.

Figure 10. Receiver Operating Characteristic plots of 50 base pairs based models

As can be seen in the precision-recall plot, all classes are easily distinguishable. The model comparing L1 3' UTR with L1 5' UTR is the worst one in terms of ROC AUC; this could be explained by the fact that the 3'- and 5'-regions are subject to different selection pressures.

Even though quite different classes of sequences were considered in this series of experiments, they are all clearly distinguishable by the constructed machine learning models.

As can be seen in the precision-recall plot (Fig. 11), the constructed model is well balanced and has a good rate of both true-positive and true-negative predictions.

Figure 11. Precision-Recall of 50 base pairs model

As can be seen in the feature importance heatmap (Fig. 12), the top-10 most important features are dinucleotides for the experiments opposing L1 to Alu, and trinucleotides for recognizing L1 and Alu 3'-end stem-loops, jointly or separately, versus shuffled sequences. This could be explained by two facts:

...
