Disease Prediction in User Generated Chinese Medical Texts Based on Deep Learning


Enze Gong

Abstract

Keywords: automatic online system, medical text

Building an accurate automatic triage system requires a large amount of annotated data. Due to compliance requirements, obtaining a big medical dataset is difficult. Annotating the dataset also requires intensive human labor, which increases the cost of automatic triage. This study tests whether an accurate disease prediction model can be built on free-to-access medical texts in Chinese. Five respiratory diseases with similar symptoms are selected to raise the difficulty of prediction. Different methods, including simple neural networks, Bi-LSTM, tf-idf, CNN and the attention mechanism, are tested on the gathered corpora. The best performance is yielded by a combination of Bi-LSTM and attention, with an F1 score of 0.865. The result of this study suggests that Chinese medical institutions may be able to construct a reliable automatic triage system without needing to hire annotators or purchase formulated data.

Introduction

The medical industry of China is currently facing two major problems:

1) General lack of medical resources:

According to the World Bank (Current health expenditure per capita, current US$, China; URL: https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD?locations=CN), China's medical expenditure per capita was 398 USD, around 40% of the world average for the year 2016. In the same year, China's GDP per capita was 80% of the world average. The gap between a relatively higher GDP and a relatively lower medical expenditure indicates a general lack of medical resources in China. For patients with harder-to-treat diseases, which cannot be treated at a much lower price in China than in other countries, this lack of medical resources means that medical help is less affordable. Some patients, deterred by the cost of treatment, choose to “survive” with the disease, which in many situations exacerbates their condition and drives the treatment cost even higher. Thus, a vicious cycle is formed.

2) Unequal distribution of medical resources:

Between developed and developing regions, between big cities and the countryside, and between rich and poor families, medical resources are not equally distributed. More medical resources are concentrated in the top-tier hospitals of the top-tier cities. In 2015, less than 8% of hospitals in China were top-tier hospitals, but they received 48.7% of visiting patients. Lower-tier hospitals, 65.1% of all hospitals, received only 13.3% of visiting patients (State Information Center, Sharing Economy Research Center, China Medical Sharing Development Report, Feb. 2017, URL: http://www.sic.gov.cn/archiver/SIC/UpFile/Files/Default/20180801173851887747.pdf). People do not trust the medical capabilities of lower-tier hospitals, which makes the top-tier hospitals less available, and makes lower-tier hospitals less able to retain good doctors and equipment. This is another vicious cycle.

Internet healthcare can provide an expedient relief for the two problems mentioned above. Real doctors answer questions of real patients on the internet for a relatively low price, or even for free. Patients can get help from prestigious doctors in big cities without traveling, waiting in lines, and paying various fees. Online healthcare is convenient, affordable, and very popular. According to Huajin Securities (industry report on internet healthcare, URL: https://www.investank.com/static/upload/system/201807/1891.pdf), the compound growth rate of China's internet healthcare market size from 2012 to 2026 is expected to be between 33.6% and 38.7%.

At the beginning of the Covid-19 pandemic in China, the number of people asking for medical help on the internet surged to tens of thousands every hour. Many people rushed to hospitals in panic for medication. Hundreds of people waited in hospital lobbies to be directed to different departments and medical experts by only a few nurses. The sudden inflow of patients greatly increased their chances of getting infected.

To cope with this difficulty, an automatic triage system is required to quickly and accurately predict a patient's disease based on the information the patient provides. Currently, many leading medical platforms in China are selling their AI models to hospitals to reduce the waiting time for triage. Those platforms first gather medical data by providing consultation services online. The conversations between patients and medical experts on their platforms are annotated, formulated, and stored in their respective databases. A small fraction of those conversations is made public. When the AI models are sold to hospitals, the price covers the cost of laborious annotation and formulation. The final payers for the expensive AI models are ordinary patients. If an accurate AI model can be built on a smaller, free-to-access corpus with minimal human labor for annotation, hospitals can build the triage system themselves without having to purchase it from big companies. This lowers the cost for hospitals to provide care, and lowers the financial burden on patients.

Currently, most studies in this field use annotated datasets, which are expensive to build and hard to obtain in the real world. This study focuses on testing the feasibility of building an automatic triage model that does not require a big dataset or complicated annotations, but still predicts a patient's disease quickly and accurately, based on the information provided by the patient.

Literature Review

The task of disease prediction can be categorized as text classification, a popular topic in the study of NLP on Chinese medical texts. Dai et al. (2003) tested Information Gain (IG), Mutual Information (MI), and Chi-Statistics (Chi) methods on Chinese texts. The results of their experiments show that methods which perform well on English texts do not perform well on Chinese texts if they are not modified for the Chinese language. When tested on Chinese, IG, MI and Chi need much higher dimension counts to reach the same level of performance as on English. The increased dimension counts slow the training process and require a larger dataset, which raises the overall cost. Another challenge for these methods is that when the dataset is not very big, there are many words with low frequencies. IG considers words with lower frequencies less important, but this assumption does not hold when important words that categorize a text appear only a few times throughout the corpus. The solution proposed in the study is to combine the IG, MI and Chi methods and apply the combined method to a support vector machine. The researchers conducted a test on a Chinese news corpus of 6,000 texts in 6 categories. Compared to using a support vector machine alone, the combined method of document frequency, information gain and mutual information reduces training time by 5 to 13 times and enhances classification accuracy, raising the F1 score from 0.89 to 0.93.

Xia et al. (2015) used a combination of Information Gain and automatic feature extraction to classify Chinese user reviews of goods sold online. They came up with two length thresholds to judge the validity of each review. If a review exceeds the bigger length threshold, it is considered valid even if no words with high information gain appear in it. If a review falls between the two thresholds, it is valid when it contains words with high information gain. If a review is shorter than the minimum threshold, it is discarded. The researchers gathered 1,700 reviews of 2 items from internet stores. The results show that different min/max length thresholds need to be manually set for different topics to get the best automatic classification performance. For one item, the best pair of length thresholds is 30 and 70; for the other, 3 and 10.

Lu et al. (2019) studied information extraction from surgery records in Chinese. To solve the problem of surgical texts not containing explicit mention of incision counts, the researchers turned the task into text classification. They gathered a corpus of 3,000 surgical records and filtered out sentences that did not contain the keyword “incision” (in Chinese). Each text with the remaining sentences was then annotated with an incision count. The researchers used LSTM, Bi-LSTM, SVM, the attention mechanism and TextCNN for training. The results show that a combination of Bi-LSTM, CNN and the attention mechanism yields the highest F1 score of 0.981.

Fan et al. (2020) proposed using semantic dependency parsing for graph-network text classification. A semantic graph network (SGN) unit contains 2 convolutional layers. Words are one-hot encoded into nodes, and the semantic dependencies between words are encoded into edges on the network graph. As the graph network units are trained, the nodes and edges on the graph are updated simultaneously. The experiment was conducted on 4 public review corpora in which semantic dependencies are annotated. Methods including TextCNN, TextRCNN, GNblock and TextSGN were tested on the corpora. The results show that the SGN method performs better by 1-3% on short news texts compared to the other methods, with prediction accuracy ranging from 85.6% to 95.2%.

Hu et al. (2019) conducted research on the categorization of hypertension medical records. Methods including Bag of Words (BOW), tf-idf, and SVM were tested on a corpus of hypertension medical records with detailed annotations. The results show that BOW outperforms tf-idf by 0.02 in F1 score.

Huang et al. (2019) studied non-negative matrix factorization, a method that does not utilize external resources for feature extension. To address the problem that online texts are often short and contain sparse, low-frequency terms, they constructed a word-cluster matrix to store the relation between words and texts. The matrices are reduced in dimensionality to create a feature space. The method is compared with BOW and CNN on 3 corpora compiled by other researchers, with a total size of 150,000 texts. The results show that their method raises the accuracy score by 25.8%, 10.9% and 1.8% on the 3 corpora, respectively.

Guo et al. (2019) used simulated annealing (SA) to optimize SVM for text classification. A penalty factor c was applied to the original SVM layer to randomly create a new result after every iteration. If the difference between the new result and the ground truth is lower than that between the original layer and the ground truth, the new result substitutes the original layer. The research was conducted on 2 corpora: the Fudan Chinese Corpus (9,800 texts) and the Sogou Corpus (1,900 texts). The results show that the SA-SVM method raises the F1 score of classification by 0.03-0.05 compared to Naïve Bayes and KNN.

Methodology

Core notions

Tf-idf

Term Frequency-Inverse Document Frequency (tf-idf) is an algorithm that estimates the importance of a word in a set of texts. It has two parts: Term Frequency and Inverse Document Frequency. The algorithm works on the assumption that if a word is seen more often in a specific text than in other texts, that word may carry more of the characteristic information of that text. This is the Term Frequency part. Punctuation and words commonly seen in medical texts, like “disease” and “doctor”, are frequently used but do not carry meaningful information. Therefore, a normalizing factor is introduced to balance out those frequent yet trivial words. This is the Inverse Document Frequency part.

The Term Frequency is measured in the following way:

$\mathrm{tf}(t, d) = \frac{n_{t,d}}{N_d}$

where $n_{t,d}$ stands for the number of times a term $t$ appears in a text $d$, and $N_d$ stands for the total term count of that text.

The Inverse Document Frequency is measured in the following way:

$\mathrm{idf}(t) = \log \frac{N}{n_t + 1}$

where $N$ represents the total number of texts in a corpus, and $n_t$ represents the number of documents in that corpus which contain a specific term $t$. To prevent the situation in which no text contains the term and $n_t$ is zero, 1 is added to the denominator. Since $\mathrm{tf}$ falls into the range between 0 and 1, $\mathrm{idf}$ should also fall into a comparably small range. When $N$ is very big and $n_t$ is very small, $N/(n_t + 1)$ would be too big without normalization; therefore, a logarithm is taken to prevent that from happening.

The tf-idf value is thus calculated:

$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$

Terms that are concentrated in a few texts of a corpus will have a higher tf-idf score, which is an indication that these terms may better summarize those texts.
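For illustration, the following minimal Python sketch computes tf-idf exactly as defined above (tf as raw count over text length, idf with the +1 denominator and the logarithm); the toy corpus of pre-segmented phrases is hypothetical, not part of the study's data.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf per the formulas above:
    tf = count / text length, idf = log(N / (n_t + 1))."""
    N = len(corpus)
    doc_freq = Counter()                     # n_t: number of texts containing term t
    for text in corpus:
        doc_freq.update(set(text))
    scores = []
    for text in corpus:
        counts = Counter(text)
        scores.append({t: (c / len(text)) * math.log(N / (doc_freq[t] + 1))
                       for t, c in counts.items()})
    return scores

# toy corpus of pre-segmented texts
docs = [["咳嗽", "发烧"], ["咳嗽", "粉尘"], ["发烧", "畏寒"]]
print(tf_idf(docs)[0])
```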

CNN

A Convolutional Neural Network (CNN) is a neural network that extracts local features to make predictions (Bouvrie, 2006). It consists of 3 parts: an input layer, hidden layers, and an output layer. The hidden layers contain two kinds of layers: convolutional layers and fully-connected layers. Convolutional outputs are downsampled by max-pooling to create smaller hidden layers.

An input layer has a volume of:

$W_1 \times H_1 \times D_1$

The corresponding output layer will thus have a volume of:

$W_2 \times H_2 \times D_2$, where $W_2 = \frac{W_1 - F + 2P}{S} + 1$, $H_2 = \frac{H_1 - F + 2P}{S} + 1$, and $D_2 = K$.

$K$ represents the number of filters.

$F$ represents the spatial extent of those filters.

$S$ represents the stride.

$P$ represents the amount of padding.

ReLU

Rectified Linear Units (ReLU) is an activation function used for the hidden layers of a neural network model (Agarap, 2018). The ReLU function yields the input directly as output if the input is positive; if the input is negative, the function yields 0. The ReLU function is expressed by the following equation and graph:

$f(x) = \max(0, x)$

(Graph: the ReLU function; source: https://ailephant.com/glossary/relu-function/)
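A one-line check of this definition with PyTorch's built-in ReLU (PyTorch is the library used for the experiments later in this study):

```python
import torch

relu = torch.nn.ReLU()                    # f(x) = max(0, x)
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                            # tensor([0.0000, 0.0000, 0.0000, 1.5000])
```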

The benefits of adopting ReLU for activation are:

• Simplicity, which saves computational resources.

• Sparse representation. Negative inputs will return true zero values, instead of values that are very close to zero. This attribute simplifies the model and reduces the time required for training.

• Linearity. Gradients and node activations maintain relative proportions to avoid the problem of vanishing gradients.

• No need for pre-training before running deep learning models.

LSTM

A Long Short-Term Memory (LSTM) network is a recurrent neural network designed for sequences of input data (Srivastava et al., 2015). There are 3 gates in an LSTM: the Forget Gate, Input Gate, and Output Gate. Each gate is an activation function that in most cases outputs values close to either 0 or 1. The gates can be expressed by:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$

$f_t$, $i_t$, $o_t$ represent the Forget Gate, Input Gate, and Output Gate, respectively.

$t$ represents the current timestamp.

$W_f$, $W_i$, $W_o$ represent the weights at the corresponding gates.

$h_{t-1}$ represents the output of the previous LSTM block.

$b_f$, $b_i$, $b_o$ represent the biases at the corresponding gates.

$x_t$ represents the input at the timestamp $t$.

$\sigma$ represents the activation function (ReLU is used as the activation function in this study).

$\tilde{C}_t$ represents the candidate cell state at the current timestamp $t$.

$C_t$ represents the cell state at the current timestamp $t$.

At timestamp t-1, all information is kept in the cell state. The cell state is sent to the Forget Gate, where non-significant information is discarded. Then the cell state goes through the Input Gate, where new information is added. The resulting cell state is the output of the LSTM block at timestamp t (Graph 3.1).

Graph 3.1: LSTM Memory Cell

LSTM carries previous information forward to solve the problem of long-term dependencies. It also mitigates vanishing and exploding gradients, saving computational resources.
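The gate equations above can be made concrete with a short sketch of a single LSTM step. The weights here are random stand-ins, and sigmoid/tanh are used as in the standard formulation (this study substitutes ReLU for the activation):

```python
import torch

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations above."""
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(z @ W["f"] + b["f"])    # Forget Gate
    i_t = torch.sigmoid(z @ W["i"] + b["i"])    # Input Gate
    o_t = torch.sigmoid(z @ W["o"] + b["o"])    # Output Gate
    C_cand = torch.tanh(z @ W["C"] + b["C"])    # candidate cell state
    C_t = f_t * C_prev + i_t * C_cand           # updated cell state
    h_t = o_t * torch.tanh(C_t)                 # output of the block
    return h_t, C_t

# demo with random weights: input size 4, hidden size 3
W = {k: torch.randn(7, 3) for k in "fioC"}
b = {k: torch.zeros(3) for k in "fioC"}
h, C = lstm_step(torch.randn(1, 4), torch.zeros(1, 3), torch.zeros(1, 3), W, b)
```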

Bi-LSTM

Bi-directional LSTM (Bi-LSTM) is a model that runs two LSTMs: one forward and one backward. Running LSTM in both directions takes more contextual information into account, making the prediction more accurate.
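A minimal demonstration with PyTorch's built-in implementation; the sizes are illustrative, not the study's settings. Because the hidden states from both directions are concatenated, the output feature size doubles:

```python
import torch

bilstm = torch.nn.LSTM(input_size=500, hidden_size=250,
                       bidirectional=True, batch_first=True)
x = torch.randn(64, 50, 500)       # (batch, sequence, features)
out, _ = bilstm(x)
print(out.shape)                   # torch.Size([64, 50, 500]): 2 x 250 features
```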

Attention

The attention mechanism puts a contextual vector between the encoder and decoder layers (Bahdanau et al., 2015; Vaswani et al., 2017). The contextual vector is applied to all cells to put different weights on the information, based on relative positions. The contextual vector is expressed by:

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

where

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$

and

$e_{ij} = a(s_{i-1}, h_j)$

$h_j$ represents the annotations.

$a$ represents the alignment model that scores how well the input around position $j$ and the output at position $i$ match.

$s_{i-1}$ represents the hidden state.
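A small sketch of the weighted sum above: given annotations h_j and alignment scores e_ij, a softmax produces the weights alpha, and the context vector is their weighted combination. The shapes are illustrative:

```python
import torch

def attention_context(h, e):
    """Context vector c_i = sum_j alpha_ij * h_j, where alpha is
    a softmax over the alignment scores e_ij."""
    alpha = torch.softmax(e, dim=-1)            # (batch, T)
    c = torch.bmm(alpha.unsqueeze(1), h)        # (batch, 1, hidden)
    return c.squeeze(1), alpha

h = torch.randn(2, 6, 500)          # annotations h_j for 6 positions
e = torch.randn(2, 6)               # alignment scores e_ij
c, alpha = attention_context(h, e)
print(c.shape, alpha.sum(dim=-1))   # torch.Size([2, 500]), weights sum to 1
```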

Cross-Entropy Loss

Cross-entropy loss is used to measure the loss between predictions and ground truth.

The cross-entropy loss is measured by:

$L = -\sum_{i} y_i \log(\hat{y}_i)$

where

$y_i$ represents the ground truth.

$\hat{y}_i$ represents the prediction.

The lower the cross-entropy loss is, the closer the prediction is to the ground truth. Cross-entropy loss is widely used for classification tasks.
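In PyTorch, which is used for this study's experiments, the loss is computed as follows; the batch of random logits and labels is purely illustrative:

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()   # expects raw logits, applies softmax internally
logits = torch.randn(64, 5)             # one score per disease class
labels = torch.randint(0, 5, (64,))     # ground-truth disease index
print(loss_fn(logits, labels))
```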

F1 Score

F1 score is used to measure the performance of the prediction models:

$F_1 = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

where

$\mathrm{precision} = \frac{TP}{TP + FP}$

and

$\mathrm{recall} = \frac{TP}{TP + FN}$
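For illustration, scikit-learn computes this score directly; macro averaging across the five disease classes is an assumption here, since the thesis does not state which averaging it uses:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 4]   # ground-truth disease indices
y_pred = [0, 1, 2, 3, 4]   # model predictions
print(f1_score(y_true, y_pred, average="macro"))
```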

Experiment Design

The online medical texts have the following characteristics:

• Diseases can be determined by key terms (symptoms, explicit mentions of diseases, etc.), and those terms may not be frequently seen across other texts.

• Key terms do not have to be reiterated many times. Patients mention them once, and once is enough for doctors online to give advice.

• Contextual information matters. The same symptoms, when connected with different body parts, may imply different diseases.

Based on those characteristics, tf-idf is chosen to test whether singling out key phrases generates accurate predictions. Bi-LSTM and the attention mechanism are used to capture contextual information. Cross-entropy loss and F1 score are used to measure the robustness of the prediction models.

Experiment Environment

The experiment is implemented with the PyTorch library in Python. Training is done on an Nvidia RTX 2070 GPU.

Data Description

General Description

Five respiratory diseases with similar symptoms are selected for this study: influenza, pneumonia, pneumoconiosis, tracheitis, and asthma. Picking diseases with similar symptoms increases the difficulty of categorization. If the model can tell the nuances between diseases of the same category, it can be expected to correctly distinguish drastically different diseases. The texts are parsed from 4 leading online medical platforms: 39.net (39健康网), 120ask.com (有问必答), chunyuyisheng.com (春雨医生), and haodf.com (好大夫在线).

The texts from these platforms come in different templates.

On 120ask.com, patients give out all their information in a non-formatted paragraph. Some patients already have a diagnosis, and some do not. Doctors answer a patient's question and give medical advice. Patients and doctors may exchange more information afterwards.

Graph 4.1: An example of the 120ask format

On 39.net, the texts come in a similar format as on 120ask.com. Patients give out non-formatted questions, and doctors answer them. No follow-up conversations are shown on 39.net.

Graph 4.2: An example of the 39.net format

On chunyuyisheng.com, texts come in the format of pure conversations. All information is exchanged by conversations between patients and doctors.

Graph 4.3: An example of the Chunyuyisheng format

On haodf.com, patients fill in a form before initiating a conversation with doctors. Patients give out information about their symptoms, the drugs they have taken, and the hospitals they have visited, along with a detailed description of their questions. An exchange of information between patients and doctors then follows.

Graph 4.4: An example of the Haodf format

The text formats of the 4 platforms are summarized in the chart below.

Platform      | Template                       | Formatted text?
120ask        | Question-Answer + Conversation | Non-formatted
39            | Question-Answer                | Non-formatted
haodf         | Question-Answer + Conversation | Formatted
chunyuyisheng | Conversation                   | Non-formatted

Chart 4.5: Corpora Summary

In total, 11,804 texts were parsed. The spread of the texts across platforms and diseases is shown in the chart below.

Platform | Asthma | Pneumonia | Pneumoconiosis | Tracheitis | Influenza | Total
120      | 795    | 527       | 800            | 797        | 798       | 3717
39       | 100    | 100       | 100            | 100        | 100       | 500
Chunyu   | 600    | 599       | 584            | 600        | 589       | 2972
Haodf    | 855    | 933       | 944            | 930        | 953       | 4615
Total    | 2350   | 2159      | 2428           | 2427       | 2440      | 11804

Chart 4.6: Corpora Summary by Disease

On 39.net only the most recent 100 records are shown, so only 100 texts for each disease are collected.

The average length of the texts is 557.9 tokens, including punctuation.

Graph 4.7: Text Distribution by Length

The corpora are collected with the Python library BeautifulSoup4. One text file is created to save the text collected from every page. Each file is named after the platform from which the text was crawled and the disease category to which the text belongs. The created files are then combined into one single csv file, in which all the gathered information, including texts, platforms and diseases, is stored. The study is conducted on this combined csv file.
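A sketch of how such files could be merged into the single csv file; the directory layout and the <platform>_<disease>_<id>.txt naming scheme are hypothetical, not necessarily the repository's actual layout:

```python
import csv
from pathlib import Path

# Hypothetical layout: files named <platform>_<disease>_<id>.txt
rows = []
for path in Path("raw_texts").glob("*.txt"):
    platform, disease, _ = path.stem.split("_", 2)
    rows.append({"platform": platform, "disease": disease,
                 "text": path.read_text(encoding="utf-8")})

with open("corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["platform", "disease", "text"])
    writer.writeheader()
    writer.writerows(rows)
```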

Data Pre-processing

The following pipeline of procedures is designed to carry out the experiment:

Since only the texts on haodf.com are formatted, all texts are treated as plain text. They are not tagged with information such as part of speech or symptom type. The information that comes with the text format on haodf.com is ignored. The collected texts are only noted with the platforms from which they were gathered and the diseases with which they are tagged.

Two copies of the corpus are created. No initial modification is applied to the first one. In the second copy, all explicit mentions of the 5 diseases are removed. The goal is to make the prediction more difficult, as explicit mentions of the diseases would greatly reduce the significance of symptoms for prediction. These two copies are named original and reduced. Both copies are stripped of stop words and punctuation with a stop word and punctuation list provided by Harbin Institute of Technology (URL: https://github.com/goto456/stopwords).

The Jieba segmentation module (URL: https://github.com/fxsjy/jieba) is used to segment both copies into phrases. Only the longest matches of phrases are considered as segments, and segments do not overlap. Jieba is one of the most popular Python libraries for Chinese segmentation. To make the segmentation more precise, a Sogou dictionary of medical phrases with 90,047 terms (医学词汇大全, URL: https://pinyin.sogou.com/dict/detail/index/15125) is used as an additional dictionary for segmentation. Sogou is the most widely used input software in China, and it keeps updating dictionaries of terms for specific themes. This dictionary is not specifically compiled for a certain type of disease, nor for the topic of this study.
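A minimal sketch of this segmentation step using Jieba's user-dictionary support; the file names for the converted Sogou dictionary and the stop word list are assumptions:

```python
import jieba

# assumed filename for the Sogou medical dictionary converted to plain text
jieba.load_userdict("sogou_medical_dict.txt")

with open("stopwords.txt", encoding="utf-8") as f:   # assumed filename
    stopwords = set(line.strip() for line in f)

def segment(text):
    # jieba's default (accurate) mode yields non-overlapping segments
    return [w for w in jieba.cut(text) if w not in stopwords]

print(segment("我咳嗽三天了"))   # e.g. ['咳嗽', '三天']
```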

Based on the segmentation results, the tf-idf of every phrase is calculated across the entire corpus. Two versions of the original and reduced corpora are created:

• In the first version, the texts are kept nearly whole. For every text, if its length exceeds 500 phrases without punctuation and stop words (a phrase here refers to a segment after segmentation, and every phrase contains at least one character), the part that goes beyond 500 phrases is discarded. If the length of a text is below 500 phrases, placeholders are added to extend the length to 500. For the original corpus, 94.5% of the texts are shorter than 500 phrases; for the reduced corpus, 95.8% are. Doctors should be able to judge a patient's disease based on the first 500 phrases, so setting the threshold to 500 should not cause a loss of vital information that affects prediction accuracy. This pair is named full.

• In the other version, the 50 phrases with the highest tf-idf scores are extracted for every text in both the original and reduced corpora, in order to reduce training time. For texts that contain fewer than 50 phrases, placeholders are added to fill the length up to 50. This pair is named tfidf.

After the preprocessing, there are 4 corpora: original-full, original-tfidf, reduced-full, reduced-tfidf.

For each corpus, a dictionary is built which contains all the unique phrases in that corpus. Every phrase in a dictionary is represented by a unique vector of size (1, 500). Texts in the full corpora are converted into matrices of size (500, 500); texts in the tfidf corpora are converted into matrices of size (50, 500).
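A sketch of this encoding step: texts are truncated or padded to a fixed length, and each phrase index is mapped to a 500-dimensional vector through an embedding layer. The vocabulary size and example phrases are illustrative:

```python
import torch

PAD = 0

def encode(texts, vocab, max_len):
    """Map each text to a fixed-length sequence of phrase indices,
    truncating beyond max_len and padding shorter texts."""
    batch = []
    for phrases in texts:
        ids = [vocab.get(p, PAD) for p in phrases][:max_len]
        ids += [PAD] * (max_len - len(ids))
        batch.append(ids)
    return torch.tensor(batch)

# each unique phrase maps to a 500-dimensional vector
embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=500)
ids = encode([["咳嗽", "发烧"]], {"咳嗽": 1, "发烧": 2}, max_len=500)
print(embedding(ids).shape)    # torch.Size([1, 500, 500])
```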

The entire process of the corpora preprocessing can be summarized in the following graph.

Graph 4.8: Data Processing Summary

Training Parameters

Two neural networks are built. One consists of 3 seq2seq layers, with ReLU activation functions between every two layers. This network is named the simple network. The other consists of one Bi-LSTM layer, one attention layer, and one CNN layer. This network is named the complex network.

The complex network has a 0.2 dropout rate after each layer to avoid overfitting. For both networks, the size of the hidden layer is set to 500.
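A possible PyTorch reading of the complex network is sketched below; the exact wiring of the Bi-LSTM, attention and CNN layers is not specified in the text, so this is an assumption rather than the thesis code:

```python
import torch
import torch.nn as nn

class ComplexNet(nn.Module):
    """Sketch of the described Bi-LSTM + attention + CNN stack."""
    def __init__(self, emb_dim=500, hidden=500, classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                              batch_first=True)
        self.att = nn.Linear(hidden, 1)        # one attention score per step
        self.drop = nn.Dropout(0.2)            # 0.2 dropout after each layer
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(hidden, classes)

    def forward(self, x):                      # x: (batch, seq, emb_dim)
        h, _ = self.bilstm(x)
        h = self.drop(h)
        alpha = torch.softmax(self.att(h), dim=1)          # attention weights
        h = self.drop(torch.relu(self.conv((alpha * h).transpose(1, 2))))
        return self.out(h.max(dim=2).values)  # max-pool over the sequence

logits = ComplexNet()(torch.randn(4, 50, 500))
print(logits.shape)                            # torch.Size([4, 5])
```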

The training corpus is fed into the neural networks in batches of 64 texts. When the training corpus has been run through entirely, one training epoch is finished, and a cross-entropy loss is calculated between the model's predictions and the testing corpus. An early stop mechanism is used for the training process. When the cross-entropy loss of the newest epoch is higher than the average cross-entropy loss of the previous epochs, the model is no longer improving. To avoid overfitting, the training process is stopped, and the newest loss and F1 values are yielded as the final scores of the model.
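The batching and early-stop rule could look like the following self-contained sketch; the linear model and random tensors are stand-ins for the real networks and the encoded corpora:

```python
import torch

model = torch.nn.Linear(500, 5)              # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()

# dummy data standing in for the encoded corpora
train_x, train_y = torch.randn(256, 500), torch.randint(0, 5, (256,))
test_x, test_y = torch.randn(64, 500), torch.randint(0, 5, (64,))
history = []

while True:
    for i in range(0, len(train_x), 64):     # batches of 64 texts
        optimizer.zero_grad()
        loss = loss_fn(model(train_x[i:i + 64]), train_y[i:i + 64])
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        test_loss = loss_fn(model(test_x), test_y).item()
    # early stop: newest loss above the average of previous epochs
    if history and test_loss > sum(history) / len(history):
        break
    history.append(test_loss)
```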

Results

Simple Network Results

The final cross entropy loss of the original full corpus (explicit mentioning of diseases is not filtered) trained with the Simple Network is 0.705. The F1 score is 0.738.

Graph 5.1: Training Loss of Original Full Corpus with Simple Network

The final cross entropy loss of the tfidf full corpus (50 terms with the highest tf-idf scores are kept; the rest are filtered) trained with the Simple Network is 1.144. The F1 score is 0.524.

Graph 5.2: Training Loss of tfidf Full Corpus with Simple Network

The final cross entropy loss of the original reduced corpus (explicit mentioning of diseases is filtered) trained with the Simple Network is 0.837. The F1 score is 0.699.

Graph 5.3: Training Loss of Original Reduced Corpus with Simple Network

The final cross entropy loss of the tfidf reduced corpus trained with the Simple Network is 1.405. The F1 score is 0.402.

Graph 5.4: Training Loss of tfidf Reduced Corpus with Simple Network

Complex Network Results

The final cross entropy loss of the original full corpus trained with the Complex Network is 0.494. The F1 score is 0.865.

Graph 5.5: Training Loss of Original Full Corpus with Complex Network

The final cross entropy loss of the original reduced corpus trained with the Complex Network is 0.717. The F1 score is 0.786.

Graph 5.6: Training Loss of Original Reduced Corpus with Complex Network

The final cross entropy loss of the tfidf full corpus trained with the Complex Network is 0.677. The F1 score is 0.738.

Graph 5.7: Training Loss of Tfidf Full Corpus with Complex Network

The final cross entropy loss of the tfidf reduced corpus trained with the Complex Network is 0.949. The F1 score is 0.611.

Graph 5.8: Training Loss of Tfidf Reduced Corpus with Complex Network

The models are also run on augmented corpora: the texts are reversed and added to the original corpora. For example, the original sentence “I love you” is reversed into “you love I” and added to the original corpus to form a new one.

The non-augmented corpus:

我咳嗽

(I cough)

The augmented corpus:

我咳嗽

(I cough)

嗽咳我

(cough I)

The other preprocessing and model parameters remain the same.
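A minimal sketch of this reversal augmentation on segmented texts:

```python
def augment_by_reversal(texts, labels):
    """Append a reversed copy of every segmented text, as described above."""
    return texts + [list(reversed(t)) for t in texts], labels + labels

texts, labels = augment_by_reversal([["我", "咳嗽"]], ["tracheitis"])
print(texts)   # [['我', '咳嗽'], ['咳嗽', '我']]
```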

The results by F1 score are summarized in the following chart:

Corpus           | Simple Network | Simple Network + Data Aug | Complex Network | Complex Network + Data Aug
Original full    | 0.738          | 0.683                     | 0.865           | 0.564
Original reduced | 0.699          | 0.666                     | 0.786           | 0.561
Tfidf full       | 0.524          | 0.504                     | 0.738           | 0.714
Tfidf reduced    | 0.402          | 0.421                     | 0.611           | 0.629

Chart 5.9: F1 Score Summary

Analysis

Two trends are observed: the full corpora perform better than the reduced corpora, and the original corpora perform better than the tfidf corpora. On the non-augmented corpora, the complex network yields better performance; on the augmented corpora, the complex network returns worse performance on the original corpora but better performance on the tfidf corpora.

The following conclusions are drawn from the results:

• Explicit mention of diseases improves prediction performance. However, diseases do not need to be explicitly mentioned to be accurately predicted, as the complex network yields an F1 score of 0.786 on the original reduced corpus. This means that the symptoms and other information in the corpora can be reliable factors for disease prediction.

• The complex network yields better results, except on the augmented original corpora. The extra computing resources expended on the complex network improve the performance significantly.

• Data augmentation by reversing texts lowers the performance. The biggest performance reduction is observed between the non-augmented original corpora and the augmented ones run on the complex network. The performance difference between the tfidf corpora run on the complex network is much smaller. This phenomenon may be due to the attention mechanism, which takes contextual information into account. Reversing a text drastically changes its contextual structure, which confuses the attention mechanism.

• Tfidf generally yields worse performance, except on the augmented original corpora with the complex network. This may be caused by the Bag of Words (BOW) nature of the tfidf model. In the tfidf corpora, terms with high tf-idf values are extracted from the original texts and ranked by their tf-idf values in the training data. The contextual information between the selected terms is fully discarded in the process. Therefore, when compared to the results of the complex network run on the augmented original corpora, whose contextual information is also lost, tfidf is not at a big disadvantage. Also, since the tfidf corpora carry little contextual information to begin with, augmentation by reversal harms them far less.

Conclusions

In this study I use Bi-LSTM, CNN and the attention mechanism on free-to-access medical texts gathered from Chinese platforms for disease prediction. The study is carried out without annotation or specifically built dictionaries. With ready-to-use resources, the complex model with the attention mechanism returns an F1 score of up to 0.865. The diseases selected for this study are all respiratory diseases with similar and overlapping symptoms, which increases the difficulty of disease prediction. In real-world cases, patients will also consult about diseases that are more distinctive in their symptoms, which makes it easier to make correct predictions. An automatic online triage system built on limited unannotated texts in Chinese can make predictions with high accuracy. Medical institutions may not need to purchase formulated data to create reliable triage systems for their own use.

References

1. G. Fan et al., 2020. Text classification model with graph network based on semantic dependency parsing, Application Research of Computers, Vol. 37 No. 12

2. L. Dai et al., 2003. A Comparative Study on Feature Selection in Chinese Text Categorization, JOURNAL OF CHINESE INFORMATION PROCESSING, Vol.18 No.1

3. H. Xia et al., 2015. The Classification Method for Online Reviews' Effectiveness Based on Feature Extraction Improvement, Journal of the China Society for Scientific and Technical Information, Vol 34 No.5

4. X. Ni, 2017. Machine Learning from Non-structured Electronic Medical Record on Relation Extraction, China Digital Machine, 2017 12(6)

5. S. Lu et al., 2019. Research on Structural Data Extraction in Surgical Cases, Chinese Journal of Computers, 2019 Vol 42 No.12

6. M. Zhu et al., 2014. Research on Entity Linking of Chinese Microblog, Acta Scientiarum Naturalium Universitatis Pekinensis, Vol. 50, No. 1 (Jan. 2014)

7. J. Feng et al., 2019. Chinese Collective Entity Linking Method Based on Multiple Features, JISUANJI YU XIANDAIHUA, 2019 Vol. 1

8. Y. Zhao, 2016. Research on Medical Entity Link, Harbin Institute of Technology, TP391.1

9. Hu Jing, Liu Wei, Ma Kai, 2019. Text Categorization of Hypertension Medical Records Based on Machine Learning, Science Technology and Engineering, 19(33): 296-301

10. Huang Meng-ting, Zhang Ling, Jiang Wen-chao, 2019. Short Text Feature Expansion and Classification Based on Non-negative Matrix Factorization, Computer Science, Vol. 46, No. 12

11. Guo Chaolei, Chen Junhua, 2019. Chinese Text Categorization Based on SA-SVM, Computer Applications and Software, Vol. 36, No. 3

12. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention Is All You Need, arXiv:1706.03762

13. Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio, 2015. Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473

14. Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, Unsupervised Learning of Video Representations using LSTMs, arXiv:1502.04681

15. Abien Fred M. Agarap, 2018. Deep Learning using Rectified Linear Units (ReLU), arXiv:1803.08375

16. ReLU function, URL: https://ailephant.com/glossary/relu-function/

17. Jake Bouvrie, 2006. Notes on Convolutional Neural Networks

18. Sogou medical vocabulary dictionary (医学词汇大全), URL: https://pinyin.sogou.com/dict/detail/index/15125

19. Chinese common stop word list (中文常用停用词表), URL: https://github.com/goto456/stopwords

20. Jieba repository, URL: https://github.com/fxsjy/jieba

21. State Information Center, Sharing Economy Research Center, China Medical Sharing Development Report (中国医疗分享发展报告), Feb. 2017, URL: http://www.sic.gov.cn/archiver/SIC/UpFile/Files/Default/20180801173851887747.pdf

22. Huajin Securities (华金证券), industry report on internet healthcare (互联网医疗发展迅速，行业龙头平安好医生上市在即), URL: https://www.investank.com/static/upload/system/201807/1891.pdf

23. World Bank, Current health expenditure per capita (current US$) - China URL: https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD?locations=CN

Data and Code

The data and code used in this study can be found at: https://github.com/Icefrozenbite/disease_prediction
