ReaderBench: multilevel analysis of Russian text characteristics


Dragos Corlatescu

Stefan Ruseti

Mihai Dascalu

Abstract

This paper introduces an adaptation of the open-source ReaderBench framework that now supports multilevel analyses of Russian text characteristics, while integrating both textual complexity indices and state-of-the-art language models, namely Bidirectional Encoder Representations from Transformers (BERT). The evaluation of the proposed processing pipeline was conducted on a dataset containing Russian texts from two language levels for foreign learners (A - Basic user and B - Independent user). Our experiments showed that the ReaderBench complexity indices are statistically significant in differentiating between the two language levels, both from: a) a statistical perspective, where a Kruskal-Wallis analysis was performed and features such as the «nmod» dependency tag or the number of nouns at the sentence level proved to be the most predictive; and b) a neural network perspective, where our model combining textual complexity indices and contextualized embeddings obtained an accuracy of 92.36% in a leave-one-text-out cross-validation, outperforming the BERT baseline. ReaderBench can be employed by designers and developers of educational materials to evaluate and rank materials based on their difficulty, as well as by a larger audience for assessing text complexity in different domains, including law, science, or politics.

Keywords: ReaderBench framework, text complexity indices, language model, neural architecture, multilevel text analysis, assessing text difficulty


Introduction

The Natural Language Processing (NLP) field focuses on empowering computers to process and then understand written or spoken language in order to perform various tasks. The performance of Artificial Intelligence and Machine Learning approaches on common NLP tasks has increased over the years, but there are still many tasks where computers are far from human performance. Nonetheless, the processing speed of computer programs is not to be neglected, and the current tradeoff between the response time of an algorithm and its errors is shifting the balance towards automated analyses - for example, a human invests tens of hours to correctly extract all the parts of speech from a novel, while a computer can perform the same task in a couple of minutes, with only 1-5% of words mislabeled. As such, NLP tools are becoming more widely used to provide valuable inputs for developing and testing various hypotheses.

Tailoring reading materials for learners is a practical and essential field where NLP tools can have a high impact. Designing such materials can prove to be a difficult task since texts below readers' level of understanding will make them lose interest, while texts too difficult to comprehend will demotivate learners. Automated NLP frameworks provide valuable insights in those situations, especially the ones that focus on identifying the complexity of a text. One such tool is the ReaderBench (Dascalu et al. 2013) framework, which previously supported other languages besides English, namely French (Dascalu et al. 2014), Dutch (Dascalu et al. 2017), and Romanian (Gifu et al. 2016), and has now been adapted to also support Russian.

The new version of ReaderBench (https://github.com/readerbench/ReaderBench) is a Python library that extracts multilevel textual characteristics from texts in multiple languages. These characteristics, also named textual complexity indices, provide valuable insights into text difficulty on multiple levels, namely surface, word, morphology, syntax, and semantics (i.e., cohesion), all described in the following sections. The purpose of this study is to present the adaptation process of ReaderBench to support the Russian language, starting from the computation of Russian complexity indices, and followed by the integration of new methods for building the Cohesion Network Analysis (CNA, Dascalu et al. 2018) graph using state-of-the-art language models, namely Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019).

Various neural network architectures and statistical analyses were employed to assess the performance of our processing pipeline. Our experiment uses Russian texts from two language level groups that reflect an individual's language proficiency: A (Basic User) and B (Independent User). The corpus is a part of the Russian as a Foreign Language Corpus (RuFoLC) compiled by language experts from the «Text Analysis» laboratory, Kazan Federal University. Our goal is to build an automated model and to perform statistical analyses of the texts in order to differentiate between the two classes, while assessing the importance of the textual complexity indices in making this decision.

Assessing Russian text complexity

The manner in which people understand and study languages has changed during the last two centuries; Russian is no different. A brief history of the approaches used to analyze textual complexity in Russian texts is presented by Guryanov et al. (2017). The authors documented that, at the beginning of the 20th century, such analyses were conducted by linguists mostly by hand. Even though key terms such as readability or text complexity were not completely defined, a general understanding of the concepts existed, and simple indices, such as word length or the number of words, were considered. Towards the end of the 20th century and the beginning of the 21st century, researchers started to include semantic features, such as the polysemy of words. In recent times, additional features were introduced, detailed in the next subsection dedicated to automated measures of text complexity.

One important part of research on text complexity revolves around its educational theme: are texts appropriate in terms of complexity for the students reading or studying them? Linguists can provide an expert opinion on this question; however, this requires considerable resources, including a substantial amount of time. Thus, a system that can provide meaningful insights into the difficulties encountered when reading a text is desirable.

McCarthy et al. (2019), who are language experts, developed a Russian language test to assess text comprehension. The test was conducted on approximately 200 students (~100 fifth graders, ~100 ninth graders), and the results showed that students struggle to understand the ideas of the texts. Additionally, the paper provided an overview of the entire evaluation process in the Russian educational system, and it offered a viable evaluation alternative designed by linguists in the form of a test.

One of the initial papers on the same matter, but written from a more statistical perspective, was the work by Gabitov et al. (2017), which addressed the problem of text complexity in Russian manuals. Specifically, the investigation focused on the 8th grade social studies manual by Bogolyubov. All analyses were performed mostly manually, starting from selecting 16 texts from the book and then computing readability formulas, such as Flesch-Kincaid, Coleman-Liau, the Dale-Chall readability formula, the Automated Readability Index, and the Simple Measure of Gobbledygook (SMOG). The unevenness of those indices across the texts raised questions whether the texts were suitable for students and represented the underlying reason for further research in this domain.

The syntactic complexity of social studies texts was explored by Solovyev et al. (2018). The authors used ETAP-3 (Boguslavsky et al. 2004), a syntactic analyzer for Russian grammar, to compute the dependency parse tree for each sentence. Fourteen indices were extracted based on the dependency tree that looked at key components of the Russian sentence structure in order to deduce its complexity, namely the length of the path between two nodes and various counts of nodes, leaves, verbal participles, verbal adverb phrases, modifiers in a nominal group, syndetic elements, participial constructs, compound sentences, coordinating chains, subtrees, and finite dependent verbs. Their statistical analyses showed a high correlation between the extracted features and grade level; however, syntactic features were less correlated than the lexical ones.

Solovyev et al. (2020) also explored how predictive specific quantitative indices were in ranking academic Russian texts and in determining their complexity. Their corpus was composed of texts from the field of Social Studies grouped by grade level, i.e., 5th-11th grades. The texts were extracted from manuals written by two authors (Bogolyubov and Nikitin) used at that time for teaching social studies. The corpus required a preprocessing step, where the parts of speech were extracted using TreeTagger (Schmid et al. 2007) for Russian, the texts were split into sentences, and outliers (i.e., sentences that were either too short or too long) were eliminated. The following indices were used in their analysis: Flesch-Kincaid Grade, Flesch Reading Ease, frequency of content words, average words per sentence, average syllables per word, and additional features based on the part of speech tags (such as the number of nouns or verbs). The authors performed a statistical analysis using both Pearson (1895) and Spearman (1987) coefficients to inspect the correlation between the indices and the complexity of the texts (i.e., their grade levels). All features proved to be statistically significant, except for «average words per sentence» and «average syllables per word». Additionally, the authors proposed slightly modified formulas for the Flesch-Kincaid Grade and Flesch Reading Ease that better reflect the field of Social Studies.

In further studies of quantitative indices on the corpus containing texts from Social Sciences manuals, Churunina et al. (2020) introduced new indices such as type-token ratio (TTR), an abstractness index, and word frequencies based on Sharoff's dictionary (Sharoff et al. 2014) that proved to be statistically significant in differentiating the grades of the texts. Out of the specified indices, abstractness was proven to be closely related to textual complexity. In fact, the study by Sadoski et al. (2000) claimed that concreteness (the opposite of abstractness) is the most predictive feature for comprehensibility. As a follow-up, Solovyev et al. (2020) provided an in-depth analysis of the abstractness of words in the Russian Academic Corpus (RAC, Solnyshkina et al. 2018) and in a corpus containing students' recalls of academic texts. The core of the experiments was the Russian dictionary of concrete/abstract words (RDCA, Akhtiamov 2019). A notable result was obtained in terms of students' recall: the texts produced by students used more concrete words than the original ones, underlining the idea that abstract terms are harder to digest.

Quantitative indices provide significant insights into the textual complexity of writings, but they are not the only approach that can be applied to analyze text difficulty. One example is topic modelling, as applied in an experiment performed by Sakhovskiy et al. (2020) on the Social Studies corpus. The authors implemented Latent Dirichlet Allocation with Additive Regularization of Topic Models (ARTM, Vorontsov & Potapenko 2015). Topics were extracted at three granularity levels: paragraph, segment (i.e., sequences of 1000 words maximum), and full text. The topics were manually verified by expert linguists, and they were further used in an experiment to determine the correlations between topics and the grades of the texts in four different ways: a) correlation between grade and topic weight, b) correlation between grade and the distance between topic words in a semantic space, c) correlation between grade and topic coherence, and d) correlation between topic properties and complexity-based topic proportion growth. The conclusion of their study highlighted that topic models can be successfully used to assess text complexity.

Textual complexity as a Natural Language Processing task

Readability reflects how easily a text can be understood. Extracting features using NLP techniques is a common approach when exploring the readability of a given text. There are multiple tools readily available; however, most of them support only English. Nevertheless, the underlying ideas can be extrapolated to other languages as well. We further describe recent tools that cover the most frequently integrated textual complexity indices and that are also present to some extent in the Russian version of ReaderBench.

One of the first freely available systems is Coh-Metrix (Graesser et al. 2004), currently at its 3rd version. Coh-Metrix provides 108 textual complexity indices from eleven categories: descriptive, text easability principal component scores, referential cohesion, LSA, lexical diversity, connectives, situation model, syntactic complexity, syntactic pattern density, word information, and readability. The framework can be freely accessed on a website, but the code is not open-source. Coh-Metrix offers support for languages other than English, namely Traditional Chinese, while adaptations for other languages exist - for example, Coh-Metrix-Esp (Quispesaravia et al. 2016) for Spanish.

The Automatic Readability Tool for English (ARTE, Choi 2020) is a Java library available on all platforms that processes plain text files and outputs a CSV file with all the computed indices. The list of indices includes the Flesch Reading Ease Formula (Flesch 1949), the Flesch-Kincaid Grade Level Formula (Kincaid et al. 1975), and the Automated Readability Index (Senter & Smith 1967), which take into consideration the average number of words per sentence and the average number of syllables per word, differing among themselves in the weights assigned to each parameter. Other examples of indices are SMOG Grading (Mc Laughlin 1969) and the New Dale-Chall Readability Formula (Dale & Chall 1948). Lastly, there are multiple «crowdsourced» indices that are computed by aggregating different counts and statistics from other libraries.
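To make the mechanics of these formulas concrete, the sketch below computes the three ARTE formulas named above from raw counts. It is a minimal illustration, assuming naive regex tokenization and a crude vowel-group syllable counter; real tools rely on proper tokenizers and hyphenation dictionaries.

```python
import re

def readability_scores(text: str) -> dict:
    # Naive preprocessing: sentence split on terminal punctuation,
    # word split on letter runs, vowel groups as a syllable proxy.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    chars = sum(len(w) for w in words)

    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    cpw = chars / len(words)            # average characters per word

    return {
        # Flesch (1949) Reading Ease: higher scores mean easier texts
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        # Kincaid et al. (1975): approximates a US school grade level
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        # Senter & Smith (1967): character-based Automated Readability Index
        "automated_readability_index": 4.71 * cpw + 0.5 * wps - 21.43,
    }

print(readability_scores("This is a short text. It has two simple sentences."))
```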

Another library is the Constructed Response Analysis Tool (CRAT, Crossley et al. 2016), which provides over 700 indices that also take into consideration text cohesion. The indices are grouped into specific categories, namely: a) indices that count or compute percentages for words, sentences, paragraphs, content words, function words, and parts of speech; and b) indices based on the MRC Psycholinguistic Database (Coltheart 1981), the Kuperman Age of Acquisition scores (Kuperman et al. 2012), the Brysbaert Concreteness scores (Brysbaert et al. 2014), the SUBTLEXus corpus (Brysbaert et al. 2012), the British National Corpus (BNC, BNC Consortium 2007), and the COCA corpus (Davies 2010). Complementarily, the Custom List Analyzer (CLA, Kyle et al. 2015) is a library written in Python that computes various occurrences of text sequences (i.e., a word, an n-gram, or a wildcard) in a corpus.

The Grammar and Mechanics Error Tool (GAMET, Crossley et al. 2019) is a Java library that identifies errors in a plain text file from the perspective of grammar, spelling, punctuation, white space, and repetitions. The core of the library integrates two packages, one for Java, Java LanguageTool (LanguageTool 2021), and one for Python, language-check (Myint 2014). The GAMET project was also tested and evaluated on two datasets (Crossley et al. 2019): a) a TOEFL-iBT corpus containing 480 essays written by English as a Second Language learners, and b) 100 essays written by high school students in the Writing Pal Intelligent Tutoring System project (Roscoe et al. 2014). The errors reported by GAMET were evaluated by two expert raters, and the results showed that GAMET offered relevant feedback throughout the experiments.

Next, we explore a collection of four tools (TAACO, TAALED, TAALES, and TAASSC) that cover a wide spectrum of analysis levels. All the tools have a graphical interface that accepts plain text files as input and produces CSV files with all indices as output. First, the Tool for the Automatic Analysis of Cohesion (TAACO) is a framework that focuses on text cohesion. The indices are separated into multiple categories: a) TTR and Density, where TTR stands for type-token ratio, computed as the number of unique words/lemmas in a category divided by the total number of words/lemmas in the same category; b) Sentence overlap, where statistics are computed regarding the repetition of the same word with certain properties in the following sentences; c) Paragraph overlap, which is similar to sentence overlap, only that the metrics are computed at the paragraph level; d) Semantic overlap, where similarity scores between adjacent blocks (sentences and paragraphs) are computed with three methods: Latent Semantic Analysis (Landauer et al. 1998), Latent Dirichlet Allocation (LDA, Blei et al. 2003), and word2vec (Mikolov et al. 2013); e) Connectives, where statistics are computed based on the types of English connectives (e.g., conjunctions, disjunctions); and f) Givenness, which is a measure of new information in the context of previous information, based on pronoun counts and repeated content lemmas (two of these measures are sketched in code after this paragraph). Second, the Tool for the Automatic Analysis of Lexical Diversity (TAALED, Kyle et al. 2021) provides 9 indices for measuring the lexical diversity of a text.
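As a toy illustration of the first two categories, the snippet below computes a type-token ratio and a naive adjacent-sentence lexical overlap on pre-tokenized input; it only approximates the idea behind these measures and is not TAACO's actual implementation.

```python
def type_token_ratio(tokens):
    # TTR: unique types divided by total tokens
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def adjacent_sentence_overlap(sentences):
    # Average share of tokens in a sentence that also occur in the previous one
    overlaps = []
    for prev, curr in zip(sentences, sentences[1:]):
        overlaps.append(len(set(prev) & set(curr)) / len(set(curr)))
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

sents = [["мама", "мыла", "раму"], ["мама", "читала", "книгу"]]
print(type_token_ratio([t for s in sents for t in s]))  # 5 types / 6 tokens
print(adjacent_sentence_overlap(sents))                 # 1 shared type / 3
```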

Third, the Tool for the Automatic Analysis of Lexical Sophistication (TAALES, Kyle et al. 2018) offers 484 indices addressing lexical sophistication divided into 4 major categories: a) Academic Language containing wordlists and formulas based on counts and percentages of words, b) indices based on the COCA corpus, c) indices based on other corpora (BNC, MRC, SUBTLEXus), and d) other types of indices, such as Age of Exposure or Contextual Distinctiveness. Fourth, the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC, Kyle 2016) focuses on analyzing the sentence components and the relations between them. It provides statistics at the clause and noun phrase level for measuring complexity. The syntactic sophistication is computed based on indices that focus on verbs and lemmas.

Textstat (Bansal 2014) is a Python library available online in the PyPI archives, which provides textual complexity indices for multiple languages. Textstat includes 16 indices, most of which are English readability formulas: Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Coleman-Liau, Automated Readability, and Dale-Chall.

ReaderBench (Dascalu et al. 2013) is an open-source framework that offers multiple natural language processing tools. ReaderBench was initially developed in Java, but the library migrated to Python, given that all major NLP frameworks, including TensorFlow (Abadi et al. 2016), Scikit-learn (Pedregosa et al. 2011), spaCy (Honnibal & Montani 2017), and Gensim (Rehurek & Sojka 2010), are written in Python to enable Graphics Processing Unit (GPU) optimizations. ReaderBench is grounded in Cohesion Network Analysis (CNA, Dascalu et al. 2018), a method similar to Social Network Analysis, but instead of representing relations between people or entities, the CNA graph contains links between text elements. The weights of the links are given by the semantic similarity between the components using different semantic models, such as LSA, LDA, or word2vec. Both local and global cohesion are computed based on the strength of intra- and inter-paragraph edges extracted from the CNA graph.

The library comes with a demo website (Gutu-Robu et al. 2018), making it available to multiple audiences. On one hand, the Python library can be installed and used by machine learning/NLP developers from the Pip library archives (https://pypi.org/project/rbpy-rb/). On the other hand, the website provides multiple interactive interfaces, where linguists or any other person interested in studying texts can perform their own analyses using the capabilities of the library, without any programming knowledge. Demos include, for example: Multi-document CNA (i.e., a detailed analysis and visualization of multiple documents grounded in Cohesion Network Analysis), Keywords extraction (i.e., a list and a graph of the keywords from a text), AMoC (Automated Model of Comprehension, a model that simulates reading comprehension), Sentiment Analysis (i.e., extracting the polarity of a text in terms of expressed sentiments), and Textual Complexity (i.e., an export of the complexity indices computed for the input text). All publicly available analyses cover multiple languages, not just English, and all the additional information required for each experiment is also present on the website. In this study, we focus on the extension of the framework to also accommodate textual complexity indices and prediction models for Russian texts.
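For developers, computing the indices for a Russian text with the pip-installed library might look like the sketch below. The module paths and attribute names here are assumptions for illustration only; the authoritative entry points are documented in the repository and its Wiki.

```python
# Hypothetical usage sketch - the names below are assumptions, not the
# documented rbpy-rb API; consult the ReaderBench repository for the real one.
from rb.core.lang import Lang                                # assumed language enum
from rb.core.document import Document                        # assumed document wrapper
from rb.complexity.complexity_index import compute_indices   # assumed entry point

doc = Document(Lang.RU, "Мама мыла раму. Потом она читала книгу.")
compute_indices(doc)                       # assumed to populate the index values
for index, value in doc.indices.items():   # assumed container of computed indices
    print(index, value)
```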

It is important to note that ReaderBench provides a viable alternative to all the previously mentioned text analysis software. ReaderBench leverages state-of-the-art NLP models to explore the semantics of texts and was effectively employed in various comprehension tasks in multiple languages, including English, French, Dutch, and Romanian. The project is open-source under the Apache 2 license, the library can be easily integrated into multiple Python projects, and the presentation website can be used freely for the remote processing of texts.

Current study objectives

Our study focuses on an in-depth multilevel analysis of Russian texts by employing textual complexity indices and the CNA graph updated with language models, together with neural network models and statistical analyses, all integrated into the ReaderBench framework. As such, we assess to what extent the Russian textual complexity indices integrated into ReaderBench are predictive of the differences between Russian texts from two language levels (i.e., A - Basic User and B - Independent User). We perform this analysis to explore the predictive power of our models and underline the most predictive features for this task.

Method

Corpus

This study considers Russian texts from two language levels for foreign learners (A - Basic User and B - Independent User) with the aim of predicting a text's difficulty class. The selection of texts in terms of complexity assessment was performed by Russian linguists, members of the «Text Analytics» Laboratory from the Kazan Federal University. The corpus used in the follow-up experiments is a subpart of the Russian as a Foreign Language Corpus (RuFoLC). The initial corpus was in a raw format containing texts from 3 language levels: A1 (Breakthrough or beginner), A2 (Waystage or elementary), and B1 (Threshold or intermediate). However, since only 3 texts were available for the A1 level, we decided to merge A1 and A2 (see Table 1 for corpus statistics). Since the overall number of examples was too low for a neural network to learn meaningful representations, we decided to use paragraphs as input in order to ensure an increased number of samples.

Table 1. Language levels corpus statistics.

Class | # Documents | # Paragraphs | # Sentences | # Words
A | 37 | 465 | 1663 | 18,307
B | 48 | 333 | 1105 | 13,741

The ReaderBench Framework adapted for Russian

A specific set of resources is required for a new language to be integrated into ReaderBench. Some of these resources are mandatory, while others are optional. One mandatory requirement is to have a language model available in spaCy (Honnibal & Montani 2017), an open-source library written in Python that offers support for NLP pre-processing tasks, such as part-of-speech tagging, dependency parsing, and named entity recognition. spaCy offers a unified pipeline structure for any language and, at the moment of writing, has reached version 3.1 with support for 18 languages, including Russian, which has been integrated with a fixed version for reproducibility reasons. Additionally, spaCy includes a multi-language model that can be used for any language, but with lower performance. All languages have multiple models (i.e., small, medium, and large) available to address memory or time constraints. Smaller models are faster to run and require fewer resources, but yield lower performance.
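A minimal sketch of this pre-processing step, assuming the small Russian model ru_core_news_sm has been downloaded (python -m spacy download ru_core_news_sm): it prints the lemma, part of speech, and dependency relation of each token, and also computes the dependency-tree depth that later serves as a syntax index (see Table 4).

```python
import spacy

nlp = spacy.load("ru_core_news_sm")  # spaCy 3.x Russian pipeline
doc = nlp("Мама мыла раму, а потом читала интересную книгу.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

def tree_depth(token):
    # Depth of the dependency subtree rooted at `token`
    return 1 + max((tree_depth(child) for child in token.children), default=0)

for sent in doc.sents:
    print(tree_depth(sent.root))  # parse tree depth per sentence
```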

Semantic models are a key component of the ReaderBench pipeline and of building the CNA graph: all indices that are calculated based on the meaning of words, or on the relations between words, sentences, and paragraphs, require a semantic language model. ReaderBench generally uses word2vec as a language model because it is available for most languages from multiple sources. During the development of this paper, we also considered it fit to align the semantic models across the languages available in ReaderBench. Thus, we added support for the MUSE (Conneau et al. 2018) version of word2vec, where the semantic spaces are aligned across languages.
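The sketch below shows how such a model can produce a cohesion edge weight for the CNA graph: sentence vectors are obtained by averaging word vectors, and the edge weight is their cosine similarity. The random 300-dimensional vectors are a stand-in for a real MUSE word2vec model loaded from disk.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_vector(tokens, word_vectors, dim=300):
    # Average the vectors of in-vocabulary tokens
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Toy embedding table standing in for a MUSE/word2vec model
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=300) for w in ["мама", "мыла", "раму", "книга"]}

s1 = sentence_vector(["мама", "мыла", "раму"], word_vectors)
s2 = sentence_vector(["мама", "книга"], word_vectors)
print(cosine(s1, s2))  # weight of the CNA edge linking the two sentences
```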

Previous versions of ReaderBench computed similarity scores between textual elements from the CNA graph using LSA, LDA, and word2vec; however, these models have been outperformed by BERT-based (Devlin et al. 2019) derivatives. The Transformer architecture introduced by Vaswani et al. (2017) obtained state-of-the-art results in most NLP tasks, especially through its encoder component, namely the Bidirectional Encoder Representations from Transformers (BERT). The original BERT was trained on two tasks: language modeling (where 15% of the tokens were masked and the model tried to predict the best word that fitted the mask, given the context) and next sentence prediction (given a pair of sentences, the model tried to predict whether the second sentence plausibly followed the first). The language modeling component is used to represent words in a latent vector space.

Nowadays, almost all languages have a custom BERT model available, and Russian is no exception. The ReaderBench library now integrates the DeepPavlov rubert-base-cased (Kuratov & Arkhipov 2019) BERT-base model to compute contextualized embeddings. It is important to note that this is the first study in which ReaderBench indices are computed using BERT-based embeddings.

Besides the above-mentioned libraries and models, ReaderBench can also benefit from specific word lists, which were adapted for Russian, including: a list of stop words (i.e., words with no semantic meaning that are ignored in preprocessing stages), a list of connectives and discourse markers, and a list of pronouns grouped by type and person; all of these word lists were provided by Russian linguists.

Additional improvements were made to the ReaderBench Python codebase, including performance optimizations and a refactoring to provide a more efficient and cleaner implementation of the textual complexity indices. New cohesion-centered textual complexity indices were added to ReaderBench, as well as a new aggregation function on top of them - the maximum value at a certain granularity level (more details are presented in the next section).

Textual Complexity Indices for Russian

The textual complexity indices provided by ReaderBench ensure a multilevel analysis of text characteristics and are grouped by their scope (see the dedicated Wiki page: https://github.com/readerbench/ReaderBench/wiki/Textual-Complexity-Indices). Tables 2-6 present the names of the indices, their descriptions, the components from the above enumeration that are used, as well as their availability in terms of granularity. Note that, as previously mentioned, all indices require the spaCy pre-processing pipeline to be executed; thus, spaCy does not appear as a dependency. The «Granularity» column reflects four possible levels on which an index is calculated: Document (D), Paragraph (P), Sentence (S), or Word (W). In general, the value at one level of granularity is computed recursively as a function of values coming from one level below. For example, word counts are calculated at the sentence level by considering word occurrences in each sentence; at the paragraph level, we then report the count of words from all sentences belonging to the targeted paragraph. The final values presented as indices are the results of three aggregation functions: mean (abbreviated «M»), standard deviation (abbreviated «SD»), and maximum (abbreviated «Max»). Thus, an index can look like «M (Wd / Sent)», which can be translated as the mean value of words per sentence in a text. For consistency across languages, all ReaderBench indices, their acronyms, and descriptions are provided in English.

The surface indices available in ReaderBench are presented in Table 2. These indices are computed using simple algorithms that involve counting appearances of words, punctuation marks, and sentences. Starting from Shannon's information theory (Shannon 1948), the idea of entropy at the word level is also included as an index; the hypothesis is that a more varied vocabulary (i.e., higher entropy) may result in a text that is more difficult to understand (a minimal sketch of this computation follows Table 2).

Table 2. ReaderBench Surface indices

Abbreviation | Description | Dependencies | Granularity
Wd | Words | - | D, P, S
UnqWd | Unique words | - | D, P, S
Comma | Commas | - | D, P, S
Punct | Punctuation marks (including commas) | - | D, P, S
Sent | Sentences | - | D, P
WdEnt | Word Entropy | - | D, P, S
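As announced before Table 2, the word entropy index can be sketched as follows over a pre-tokenized word list; repeated words lower the entropy, while a fully varied vocabulary maximizes it.

```python
import math
from collections import Counter

def word_entropy(tokens):
    # Shannon entropy (in bits) of the word distribution
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(word_entropy(["a", "a", "a", "a"]))  # 0.0 - no lexical variety
print(word_entropy(["a", "b", "c", "d"]))  # 2.0 - maximal variety
```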

The morphology category (see Table 3) contains indices computed using the part of speech tagger from spaCy. Statistics are computed for each part of speech (e.g., nouns, verbs), while more detailed statistics are considered for sub-types of pronouns provided by linguists as predefined lists.

Table 3. ReaderBench Morphology indices

Abbreviation | Description | Dependencies | Granularity
PosMain | Words with a specific POS | - | D, P, S
UnqPosMain | Unique words with a specific POS | - | D, P, S
Pron | Specific pronoun types | Pronoun lists | D, P, S

From the syntax point of view (see Table 4), ReaderBench provides indices derived from the dependency parsing tree. An index is computed for each dependency type available in the spaCy parser, such as «nsubj» or «cc». The depth of the parsing tree is also an important feature in quantifying textual complexity: if the depth is high, then the text may become harder to understand.

Table 4. ReaderBench Syntax indices

Abbreviation | Description | Dependencies | Granularity
Dep | Dependencies of a specific type | - | D, P, S
ParseTreeDpth | Depth of the parsing tree | - | S

Table 5 presents the indices that take into consideration text cohesion derived from the CNA graph. Cohesion is an important component when assessing text difficulty, as a lack of cohesion or cohesion gaps can make a text harder to follow (Dascalu 2014). As expected, a semantic model is required, either word2vec or the newly introduced BERT-base models. Note that the indices AdjSentCoh, AdjParCoh, IntraParCoh, and InterParCoh were newly added to ReaderBench for this research.

Table 5. ReaderBench Cohesion indices

Abbreviation | Description | Dependencies | Granularity
AdjSentCoh | Cohesion between two adjacent sentences | Semantic Model | D, P
AdjParCoh | Cohesion between two adjacent paragraphs | Semantic Model | D
IntraParCoh | Cohesion between sentences contained within a given paragraph | Semantic Model | D, P
InterParCoh | Cohesion between paragraphs | Semantic Model | D
StartEndCoh | Cohesion between the first and last text element | Semantic Model | D, P
StartMiddleCoh | Cohesion between the start and all middle text elements | Semantic Model | D, P
MiddleEndCoh | Cohesion between all middle and the last elements | Semantic Model | D, P
TransCoh | Cohesion between the last sentence of the current paragraph and the first sentence of the upcoming paragraph | Semantic Model | D

ReaderBench also provides statistics at the individual word level (see Table 6). Named entity features are computed based on the Named Entity Recognizer from spaCy, while the specific tags depend on the corpus on which the NER model was trained. For example, the Russian model is trained on a Wikipedia corpus and offers only 3 tags: location («LOC»), organization («ORG»), and person («PER»), while other models, such as the English one, offer 18 categories. This may affect the global statistics when comparing the complexity of texts from two languages, as observed in follow-up experiments. Syllables are computed using the «Pyphen» library for each language (Kozea 2016).
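A small sketch of this syllable count, assuming Pyphen ships a Russian hyphenation dictionary under the ru_RU code: the number of syllables is approximated as the number of hyphenation points plus one, which does not always coincide with true syllable boundaries.

```python
import pyphen

dic = pyphen.Pyphen(lang="ru_RU")  # hyphenation dictionary for Russian

def syllable_count(word: str) -> int:
    # `inserted` marks hyphenation points, e.g. "биб-лио-те-ка"
    return dic.inserted(word).count("-") + 1

print(syllable_count("библиотека"))
```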

For other languages besides Russian, ReaderBench also includes additional textual complexity indices. For example, none of the Wordnet indices (e.g., sense counts, depths in hypernym trees) are currently available as the Russian WordNet (Loukachevitch et al. 2016) is in a different format when compared to the models integrated in Natural Language Toolkit (NLTK). Additionally, specific word lists like Age of Acquisition, Age of Exposure, and discourse connectors are not yet available for Russian; as such, their corresponding indices are not computed.

Table 6. ReaderBench Word indices

Abbreviation | Description | Dependencies | Granularity
WdLen | Number of characters in a word | - | W
WdDiffLemma | Distance in characters between a word (inflected form) and its corresponding lemma | - | W
Repetition | Number of occurrences of the same lemma | - | D, P, S
NmdEnt | Number of specific types of named entities | Named Entity Recognizer | D, P, S
Syllab | Number of syllables in a word | Rules or Dictionary | W

Neural Network Architectures combining Textual Complexity Indices and Language Models

Our first approach for predicting text difficulty involved using ReaderBench to extract the complexity indices available for the Russian language, which were then fed into the neural network depicted in Figure 1.a. The architecture starts with an input layer that receives the complexity indices for each text as a list. An optional layer with 128 units and a Rectified Linear Unit (ReLU) activation function can be added to increase the complexity of the function computed by the neural network. Next, a dense layer with 32 units and ReLU activation is used as a hidden layer. Finally, the output layer is a dense layer with a single output and a sigmoid activation function, which provides the class of the text.

Second, BERT and its derived models hold state-of-the-art results in multiple text classification tasks. Thus, we decided to test an architecture that uses only RuBERT, a BERT-base model trained for the Russian language. We obtained a semantic representation of each text by computing the mean of the last hidden state from the RuBERT output. Then, the embedding was fed into a neural network with an architecture similar to the previous one (see Figure 1.b).
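A sketch of this pooling step, assuming the Hugging Face transformers library is used to load the DeepPavlov model (the toolkit itself is not stated here): the token vectors of the last hidden state are averaged into a single fixed-size representation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

inputs = tokenizer("Мама мыла раму, а потом читала книгу.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state  # shape (1, tokens, 768)

embedding = last_hidden.mean(dim=1).squeeze(0)       # mean over tokens -> (768,)
print(embedding.shape)
```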

Third, we tested a combination of the two inputs: the RuBERT embeddings were concatenated with the ReaderBench indices and fed as input into the neural network. The architecture of this neural network can be observed in Figure 1.c.
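A minimal TensorFlow/Keras sketch of variant (c) follows; dropping one of the two inputs and the optional 128-unit layer recovers variants (a) and (b). The input sizes are illustrative assumptions, while the optimizer and loss anticipate the Experimental Setup below.

```python
import tensorflow as tf

N_INDICES, BERT_DIM = 300, 768  # illustrative sizes; N_INDICES depends on the run

indices_in = tf.keras.Input(shape=(N_INDICES,), name="complexity_indices")
bert_in = tf.keras.Input(shape=(BERT_DIM,), name="rubert_embedding")

x = tf.keras.layers.Concatenate()([indices_in, bert_in])  # variant (c)
x = tf.keras.layers.Dense(128, activation="relu")(x)      # optional hidden layer
x = tf.keras.layers.Dense(32, activation="relu")(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # class A vs. class B

model = tf.keras.Model([indices_in, bert_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```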

Statistical Analyses

A statistical approach was adopted to determine which features were significant in differentiating between textual complexity classes. The Shapiro normality test (Shapiro & Wilk 1965), as well as the skewness and kurtosis tests (Hopkins & Weeks 1990), were used to filter ReaderBench indices in terms of normality. Since most indices were not normally distributed, the Kruskal-Wallis analysis of variance (Kruskal & Wallis 1952) was employed to determine the statistical importance of the indices.
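Both tests are available in SciPy; the sketch below runs them on synthetic, deliberately non-normal index values for the two classes (the sample sizes mirror the paragraph counts in Table 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
index_a = rng.lognormal(mean=2.0, sigma=0.4, size=465)  # synthetic class A values
index_b = rng.lognormal(mean=2.3, sigma=0.4, size=333)  # synthetic class B values

# Normality check: skewed data fails the Shapiro-Wilk test ...
w, p_normal = stats.shapiro(index_a)
print(f"Shapiro-Wilk p = {p_normal:.4f}")

# ... so the classes are compared with the non-parametric Kruskal-Wallis H test
h, p = stats.kruskal(index_a, index_b)
print(f"H = {h:.2f}, p = {p:.4f}")
```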

Figure 1. Neural network architectures: a) neural network with ReaderBench indices as input; b) neural network with RuBERT embeddings as input; c) neural network with both ReaderBench indices and RuBERT embeddings as input


Experimental Setup

The process of training neural networks requires setting up hyperparameters. The Adam optimizer was used with a learning rate of 1e-3. The loss function was binary cross-entropy, given that only two classes were predicted. Finally, each model was trained for 64 epochs with a batch size of 16.

The neural network architectures were used to classify the Russian texts into the two language levels: A (Basic User) and B (Independent User). The paragraphs were extracted from each text and labeled with the category of the source text. We decided to perform cross-validation to evaluate the models due to the limited number of examples. There are multiple ways in which cross-validation can be performed, the most common ones being 5-fold or 10-fold cross-validation. However, employing those methods would limit even more the input of the neural network, which requires a substantial amount of data to be trained. Thus, given the limited number of entries, we elected to use a «leave-one-out» approach, where the entire corpus except a single entry is used for training a model at each iteration, followed by an evaluation on the remaining entry; the process is repeated for each entry until the corpus is exhausted, and performance is computed as the mean of all evaluation scores. However, our corpus was composed of paragraphs, and leaving out a single paragraph would have meant that the other paragraphs from the same text would be used in the training process, which could have generated bias. Thus, we decided to employ «leave-one-text-out» cross-validation. In this approach, an entire text (i.e., all the paragraphs belonging to the selected text) was left out, while the models were trained on all the other paragraphs. The final accuracy was reported as the mean of the results for each text.
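This protocol corresponds to scikit-learn's LeaveOneGroupOut splitter when the group ids identify the source texts; the sketch below uses toy features and a logistic regression as a stand-in for the neural models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins: rows are paragraphs, `groups` holds the id of the source text,
# so every fold holds out all paragraphs of exactly one text.
rng = np.random.default_rng(0)
X = rng.random((10, 5))                             # 10 paragraphs, 5 features
y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 1])        # class A = 0, class B = 1
groups = np.array([0, 0, 1, 2, 2, 3, 3, 4, 4, 4])   # 5 source texts

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(scores))  # mean accuracy over held-out texts, as reported in Table 7
```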

Results

Table 7 depicts the results for the three neural architectures. The complexity indices from ReaderBench, as well as the RuBERT embeddings, were used as input to two different architectures: the first with only one hidden layer of 32 units, and the second with two hidden layers of 128 and 32 units. The scenario where the two input sources were combined is also presented.

Table 7. Neural networks results

Model Input Features | Hidden Layers | Leave-one-text-out cross-validation accuracy (%)
Complexity Indices | 1 hidden layer - 32 units | 90.58
RuBERT | 1 hidden layer - 32 units | 87.49
Complexity Indices | 2 hidden layers - 128, 32 units | 87.05
RuBERT | 2 hidden layers - 128, 32 units | 88.69
Complexity Indices + RuBERT | 1 hidden layer - 32 units | 88.23
Complexity Indices + RuBERT | 2 hidden layers - 128, 32 units | 92.36

Table 8 presents a summary of the results obtained by applying the Kruskal-Wallis test. The indices are divided by categories and subcategories, and each slot lists the specific indices that are either statistically significant in differentiating the texts from classes A and B, or not. The notation is condensed, and indices with the same characteristics are grouped using the «|» character. For example, the first entry considers the category «Surface» and subcategory word («Wd»); the notation «M|Max (Wd / Doc|Par|Sent)» can be expanded to all the possibilities where «|» appears: M (Wd / Doc), M (Wd / Par), M (Wd / Sent), Max (Wd / Doc), Max (Wd / Par), Max (Wd / Sent). Additionally, in the «Dep» subcategory, there is a list of dependency types that fitted the same pattern, and they are represented mathematically as a set. An important observation is that all features at the document granularity were disregarded in this analysis, given the structure of our data - i.e., all documents in the dataset contain only 1 paragraph; as such, the indices for the two granularities were the same. Similarly, the maximum values at the paragraph level were ignored, since the maximum and the mean of only one entry are the same. An extended table with the descriptive statistics and corresponding χ2 and p values for all statistically significant textual complexity indices is provided in Appendix 1.

Table 8. Summary of the predictive power of textual complexity indices.

Surface
- Wd. Significant (p < .05): M|Max (Wd / Sent), M (Wd / Par), SD (Wd / Sent). Not significant (p > .05): SD (Wd / Par).
- UnqWd. Significant: M|Max (UnqWd / Sent), M (UnqWd / Par), SD (UnqWd / Sent). Not significant: SD (UnqWd / Par).
- Comma. Significant: M|Max (Commas / Sent), M (Commas / Par), SD (Commas / Sent). Not significant: SD (Commas / Par).
- Punct. Significant: M (Punct / Par), SD (Punct / Sent). Not significant: M|Max (Punct / Sent), SD (Punct / Par).
- Sent. Significant: M (Sent / Par). Not significant: SD (Sent / Par).
- WdEntr. Significant: M|Max|SD (WdEntr / Sent), M|SD (WdEntr / Par). Not significant: -
- NgramEntr. Significant: M|Max|SD (NgramEntr_2 / Word). Not significant: -

Morphology
- POS. Significant: M|Max (POS_noun|_adj|_adv / Sent), M (POS_noun|_adj|_adv / Par), SD (POS_noun|_adj|_adv / Sent). Not significant: SD (POS_noun|_adj|_adv / Par).
- POS. Significant: SD (POS_pron / Sent). Not significant: M|Max (POS_noun / Sent), M (POS_noun / Par), SD (POS_noun / Par).
- POS. Significant: M (POS_verb / Par), SD (POS_verb / Sent). Not significant: M|Max (POS_verb / Sent), SD (POS_verb / Par).
- UnqPOS. Significant: M|Max (UnqPOS_noun|_adj|_adv / Sent), M (UnqPOS_noun|_adj|_adv / Par), SD (UnqPOS_noun|_adj / Sent). Not significant: SD (UnqPOS_noun|_adj|_adv / Par).
- UnqPOS. Significant: SD (UnqPOS_pron / Sent). Not significant: M|Max (UnqPOS_noun / Sent), M|SD (UnqPOS_noun / Par).
- UnqPOS. Significant: M (UnqPOS_verb / Par), SD (UnqPOS_verb / Sent). Not significant: M|Max (UnqPOS_verb / Sent), SD (UnqPOS_verb / Par).
- Pron. Significant: M|Max (Pron_indef / Sent), M (Pron_indef / Par), SD (Pron_indef / Sent). Not significant: SD (Pron_indef / Par).
- Pron. Significant: M|Max|SD (Pron_fst|Pron_thrd / Sent), M|SD (Pron_fst|Pron_thrd / Par). Not significant: -
- Pron. Significant: SD (Pron_snd / Sent). Not significant: M|Max (Pron_snd / Sent), M (Pron_snd / Par), SD (Pron_snd / Par).

Syntax
- Dep, for X ∈ {nmod, amod, case, acl, obl, det, xcomp, nummod, conj, appos, mark, cc, obj}. Significant: M|Max (Dep_X / Sent), M (Dep_X / Par), SD (Dep_X / Sent). Not significant: SD (Dep_X / Par).
- Dep. Significant: M (Dep_nsubj / Par), SD (Dep_nsubj / Sent). Not significant: M|Max (Dep_nsubj / Sent), SD (Dep_nsubj / Par). All other types of dependencies were not significant.
- ParseDepth. Significant: M|Max (ParseDepth / Sent), M (ParseDepth / Par), SD (ParseDepth / Sent). Not significant: SD (ParseDepth / Par).

Cohesion
- AdjCoh. Significant: M|Max (AdjCoh / Par). Not significant: SD (AdjCoh / Par).
- IntraCoh. Significant: M|Max (IntraCoh / Par). Not significant: SD (IntraCoh / Par).
- StartEndCoh, StartMidCoh, MidEndCoh, TransCoh: not relevant for this analysis.

Word
- Chars. Significant: M|Max|SD (Chars / Sent|Word), M|SD (Chars / Par). Not significant: -
- LemmaDiff. Significant: Max|SD (LemmaDiff / Word). Not significant: Max|M|SD (LemmaDiff / Sent), M (LemmaDiff / Word), M|SD (LemmaDiff / Par).
- Repetitions. Significant: M|Max|SD (Repetitions / Sent), M|SD (Repetitions / Par). Not significant: -
- NmdEnt. Significant: M|Max (NmdEnt_loc|_org / Sent|Word), SD (NmdEnt_loc|_org / Sent|Word), M (NmdEnt_loc|_org / Par). Not significant: SD (NmdEnt_loc|_org / Par), all indices for NmdEnt_per.
- Syllab. Significant: M|Max (Syllab / Sent|Word), M (Syllab / Par), SD (Syllab / Sent|Word). Not significant: SD (Syllab / Par).

Note: mean (abbreviated «M»), standard deviation (abbreviated «SD»), and maximum (abbreviated «Max») are the aggregation functions applied at various granularities.

Two methods were employed to determine the efficiency of the textual indices from ReaderBench in differentiating texts from the two language levels (i.e., A versus B): neural networks and statistical analyses. In the first approach, the ReaderBench features performed better than the RuBERT embeddings (see Table 7). Nonetheless, the neural networks that used only the RuBERT embeddings as input also performed well (i.e., an accuracy of 88.69%), as the BERT embeddings are recognized for their capability to model the meaning of a text. Note that this result does not imply that ReaderBench indices are better than BERT on text classification tasks in general, but rather argues that ReaderBench textual complexity indices can be successfully employed to assess text difficulty.

Both inputs, the ReaderBench textual complexity indices and the RuBERT embeddings, were used in different versions of the initial neural network. The results from Table 7 indicate that adding an extra hidden layer to the neural network with only textual complexity indices decreased performance, thus arguing that the function that maps the inputs to the predicted class should be a simple one. In contrast, the BERT embeddings benefitted from the additional layer, therefore arguing that the mapping between the encodings and the complexity of a text is more complex than in the previous case. In the third configuration, the two input sources were combined and tested on the same task; this architecture achieved the highest score (92.36%) with two hidden layers, benefiting from both handcrafted features and BERT contextualized embeddings. The intuition behind the performance increase is that the two approaches complement each other.

The statistical analysis using the Kruskal-Wallis test showed that the majority of indices were significant in differentiating between the two classes. In general, the indices aggregated with the standard deviation function were less often statistically significant, while the mean- and maximum-related indices proved to be more predictive. Considering Appendix 1, the «nmod» dependency category was the most influential one, ranking first in the Kruskal-Wallis χ2(1) score with the index Max (Dep_nmod / Sent) (χ2 = 84.48, p < .001), as well as having 6 appearances in the top 10 most influential features. The nominal modifier appeared more frequently in the more complex texts (B) than in the less complex texts (A). In the same syntactic category, the «amod» dependency also exhibited similar patterns.

In terms of morphology, the number of nouns was higher in B texts than in A texts, both as a raw count and as a unique count. The mean value of nouns at the sentence level ranked 2nd in terms of effect size (M (POS_noun / Sent); χ2 = 84.31, p < .001), while 3 other related indices made it into the top 10 most predictive features. The number of adjectives was also statistically significant, with the most predictive index in this subcategory (i.e., M (POS_adj / Par); χ2 = 69.28, p < .001) ranking in the top 5% of all the indices.

From the Word category, character indices performed best in terms of separating the two types of texts (e.g., M (Chars / Word); χ2 = 76.03, p < .001), with all three variations close to each other in the ranking. This finding supports the intuition that easier texts generally contain shorter words. Strongly related to this subcategory is the syllables subcategory, which also had an important impact (e.g., M (Syllab / Word); χ2 = 73.08, p < .001).

From the remaining two categories, Surface and Cohesion, the highest impact was obtained by the features regarding the number of unique words (e.g., M (UnqWd / Par); χ2 = 32.74, p < .001) and, respectively, the middle-end cohesion feature (e.g., M (MidEndCoh / Par); χ2 = 25.89, p < .001). As can be seen in Table 8, these features were still statistically significant in differentiating the two categories of texts, but they sit in the middle of the overall rankings in terms of predictive power (i.e., ranks between places 70 and 100).

Our findings indicate that the ReaderBench textual complexity indices, which span multiple levels of analysis, provide valuable insights into the differences between the two language levels for foreign Russian learners (A - Basic User and B - Independent User). From a machine learning perspective, the results are interesting, as a simple neural network using the features extracted with ReaderBench outperformed the Russian version of BERT, namely RuBERT, in the task of text classification. Nonetheless, this result likely occurred because the complexity indices were specifically fitted for this task. In addition, we observed that the combination of features from both methods improved the overall classification scores. As such, the methods complement one another, and the texts from the two categories differ from each other in terms of both textual complexity features and underlying themes (represented by meaning).

A follow-up analysis was centered on the textual complexity features; as such, the Kruskal-Wallis test was used to identify the most predictive indices, individually and per category. From the syntactic point of view, we can observe that the two most impactful features were «nmod» and «amod». The nominal modifier (i.e., «nmod») consists of a noun or a noun phrase that is expressed in Russian using the genitive, while showing the possessiveness of another noun; «amod» is similar, with the difference that the syntactic modifier of the noun is an adjective rather than a noun.


Подобные документы

  • Developed the principles that a corpus of texts containing code-mixing should have and built a working prototype of Udmurt/Russian Code-Mixing Corpus. Discussed different approaches to studying code-mixing and various classifications of code-mixing.

    дипломная работа [1,7 M], добавлен 30.12.2015

  • Анализ существующего программного обеспечения эмпирико-статистического сравнения текстов: сounter оf сharacters, horos, graph, advanced grapher. Empirical-statistical comparison of texts: функциональность, процедуры и функции тестирование и внедрение.

    дипломная работа [4,4 M], добавлен 29.11.2013

  • Характеристика программных продуктов Open Source: Umbrello - среды UML-моделирования на языке, Rational Rose - средства визуального моделирования объектно-ориентированных информационных систем. Описание и сравнение сайтов по созданию онлайн UML диаграмм.

    контрольная работа [1,5 M], добавлен 03.11.2013

  • Lists used by Algorithm No 2. Some examples of the performance of Algorithm No 2. Invention of the program of reading, development of efficient algorithm of the program. Application of the programs to any English texts. The actual users of the algorithm.

    курсовая работа [19,3 K], добавлен 13.01.2010

  • Перспективные направления анализа данных: анализ текстовой информации, интеллектуальный анализ данных. Анализ структурированной информации, хранящейся в базах данных. Процесс анализа текстовых документов. Особенности предварительной обработки данных.

    реферат [443,2 K], добавлен 13.02.2014

  • Program automatic system on visual basic for graiting 3D-Graphics. Text of source code for program functions. Setting the angle and draw the rotation. There are functions for choose the color, finds the normal of each plane, draw lines and other.

    лабораторная работа [352,4 K], добавлен 05.07.2009

  • Високовольтний імпульсний драйвер MOSFET з синхронним випрямлянням від фірми Intersil. Ключові властивості драйверів SCALE. Концепція захисту драйверів SCALE. Технологія та характеристики драйверів SCALE для IGBT-модулів. Режими роботи драйверів SCALE.

    реферат [180,3 K], добавлен 08.11.2010

  • Program of Audio recorder on visual basic. Text of source code for program functions. This code can be used as freeware. View of interface in action, starting position for play and recording files. Setting format in milliseconds and finding position.

    лабораторная работа [87,3 K], добавлен 05.07.2009

  • Актуальность и значимость создания web-сайта образовательного учреждения - школы. Функциональное моделирование предметной области. Основные этапы разработки сайта. Программная реализация. Установка, настройка и работа с локальным сервером Open Server.

    дипломная работа [990,5 K], добавлен 01.01.2018

  • Program game "Tic-tac-toe" with multiplayer system on visual basic. Text of source code for program functions. View of main interface. There are functions for entering a Players name and Game Name, keep local copy of player, graiting message in chat.

    лабораторная работа [592,2 K], добавлен 05.07.2009

  • Creation of the graphic program with Visual Basic and its common interface. The text of program code in programming of Visual Basic language creating in graphics editor. Creation of pictures in Visual Basic, some graphic actions with graphic editor.

    лабораторная работа [1,8 M], добавлен 06.07.2009

  • Обзор рынка Информационных технологий. Современные автоматизированные системы управления проектами и их классификация. Open Plan (Welcom Software) - система, предлагающая решение по управлению проектами масштаба корпорации. Основные модули Open Plan.

    курсовая работа [630,9 K], добавлен 24.02.2010

  • Використання програмованих логічних інтегральних схем для створення проектів пристроїв, їх верифікації, програмування або конфігурування. Середовища, що входять до складу пакету "MAX+PLUS II": Graphic, Text, Waveform, Symbol та Floorplan Editor.

    курсовая работа [1,8 M], добавлен 16.03.2015

  • Электронные библиотеки, проблемы авторского права и их решение. Форматы выкладываемых произведений: графические растровые, графические векторные с оформлением, простой текст (plain text). Обзор по самым известным программам для чтения электронных книг.

    реферат [29,7 K], добавлен 16.07.2010

  • Basic assumptions and some facts. Algorithm for automatic recognition of verbal and nominal word groups. Lists of markers used by Algorithm No 1. Text sample processed by the algorithm. Examples of hand checking of the performance of the algorithm.

    курсовая работа [22,8 K], добавлен 13.01.2010

  • Значение атрибута TITLE тега HTML-документа. Возможности HTML для разработчиков Web-страниц. Параметры тега , регулирующие отступы вокруг изображения. Оформление комментариев в CSS. Теги логического форматирования текста (phrase elements).

    тест [19,9 K], добавлен 11.10.2012

  • Анализ оптово-розничной торговли в сфере флористики. Методы автоматизации предпринимательской деятельности, электронная коммерция и бесплатные Open-Source СУБД. Базы данных основного и архивного сервера. Запуск интернет-магазина и установка OpenCart.

    дипломная работа [3,2 M], добавлен 18.07.2012

  • Инсталляция программы Adobe PageMaker 6.5. Элементы интерфейса, палитра инструментов и меню, настройка параметров. Создание новой публикации. Форматирование текста. Масштаб отображения страниц. Инструмент Pointer и Text. Экспорт и импорт объектов.

    курсовая работа [949,2 K], добавлен 12.01.2011

  • Hyper Text Markup Language (html) как стандартный язык для создания гипертекстовых документов в среде web. Тэги списков, гипертекстовые ссылки, графика внутри документа, специальные тэги html и таблицы. Планирование фреймов. Этапы создания сайтов.

    контрольная работа [126,9 K], добавлен 18.11.2010

  • Опис мови програмування PHP. Стратегія Open Source. Мова розмітки гіпертекстових документів HTML. Бази даних MySQL. Обґрунтування потреби віддаленого доступу до БД. Веб-сервер Apache. Реалізація системи. Інструкція користувача і введення в експлуатацію.

    курсовая работа [42,9 K], добавлен 21.12.2012

Работы в архивах красиво оформлены согласно требованиям ВУЗов и содержат рисунки, диаграммы, формулы и т.д.
PPT, PPTX и PDF-файлы представлены только в архивах.
Рекомендуем скачать работу.