Anonymous Vs. Attributed: Cluster Analysis of Tolstovskii Sbornik Texts and Its Interpretation in Terms of Cultural Heritage
Lexico-semantic dominants and markers that distinguish the texts of a medieval anthology from one another. Analysis of the statistical distance between anonymous and attributed texts. Differences between the anonymous Parable of Wisdom and Cyril of Turov's homilies.
Category | Foreign languages and linguistics |
Type | article |
Language | English |
Date added | 01.04.2022 |
Oleg F. Zholobov, Kazan Federal University, Kazan, Russian Federation
Victor A. Baranov, Kalashnikov Izhevsk State Technical University, Izhevsk, Russian Federation
and Maria O. Novak, Vinogradov Russian Language Institute of RAS, Moscow, Russian Federation; FRC «Kazan Scientific Center of RAS», Kazan, Russian Federation
Abstract
Quantitative analysis revealed the lexical and semantic dominants and markers that distinguish the texts of a medieval anthology from one another. To verify whether three anonymous homilies in the thirteenth-century Tolstovskii Sbornik might be attributed to Cyril of Turov, the authors examined the statistical distance between the anonymous and the already attributed texts. Using a clustering method based on the ranks of the most frequent tokens in one text and the corresponding ranks in the other texts, they constructed dendrograms showing how the texts group. This technique demonstrated the statistical proximity of six texts by Cyril of Turov, their contrast to seven texts by Cyril of Jerusalem, and the formation of a third cluster from texts by other authors. Cluster analysis made it possible to identify several crucial thematic keys in Cyril of Turov's homilies and to establish a characteristic feature of his preaching discourse, the widespread use of role deixis. The analysis confirmed the sharp difference between the anonymous Parable of Wisdom and Cyril of Turov's homilies. Isolated convergences of two anonymous sermons with Cyril of Turov's homilies were discovered. However, the level of convergence in this case contrasts sharply with the level of convergence among Cyril of Turov's own homilies. This suggests that the isolated convergences are not due to common authorship.
Keywords: 13th-century Tolstovskii Sbornik, Cyril of Turov, anonymous texts, attribution, cluster analysis, tokens' frequency ranks, lexical and grammatical convergence.
Аннотация
Анонимность vs. атрибутированного: кластерный анализ текстов Толстовского сборника и его интерпретация в аспекте культурного наследия
О.Ф. Жолобов, Казанский федеральный университет Российская Федерация, Казань.
В.А. Баранов, Ижевский государственный технический университет имени М. Т Калашникова, Российская Федерация, Ижевск.
М.О. Новак, Институт русского языка им. В. В. Виноградова РАН, Российская Федерация, Москва; ФИЦ «Казанский научный центр РАН», Российская Федерация, Казань.
В статье с помощью квантитативного анализа выявлены лексико-семантические доминанты и маркеры, отличающие тексты средневекового сборника друг от друга. С целью выяснения принадлежности трех анонимных произведений Толстовского сборника XIII века Кириллу Туровскому исследуется статистическое расстояние между анонимными и авторскими текстами. С помощью метода кластеризации, данными для которого являются ранги наиболее частотных текстовых форм и соответствующие им ранги других текстов, построены дендрограммы, показывающие группировку исследуемых произведений. Последовательно демонстрируется статистическая близость шести текстов подкорпуса Кирилла Туровского и их контрастность семи текстам Кирилла Иерусалимского, формирование третьего кластера из текстов других авторов. Кластерный анализ позволил выявить в гомилиях Кирилла Туровского несколько наиболее важных для автора тематических ключей, а также установить такую особенность его проповеднического дискурса, как широкое использование ролевого дейксиса. Анализ подтвердил резкое отличие анонимного Слова о премудрости от гомилий Кирилла Туровского. Были обнаружены отдельные схождения двух анонимных поучений с гомилиями Кирилла Туровского. Однако уровень схождений в этом случае, как показал анализ, резко контрастирует с уровнем схождений в гомилиях самого Кирилла Туровского. Это свидетельствует о том, что причины отдельных схождений не связаны с единым авторством.
Ключевые слова: Толстовский сборник XIII века, Кирилл Туровский, анонимные тексты, атрибуция, кластерный анализ, ранговая корреляция, лексическая и грамматическая конвергенция.
The source and tasks
The Kazan digital collection on the «Manuscript» website now contains a new online edition of an Old Russian manuscript, the Tolstovskii Sbornik from the second half of the 13th century (National Library of Russia, F.p.I.39; hereafter SbTol) (Kazan Collection). Owing to the fragmentation module, every structural part of the collection has been published separately in addition to the general edition; thus, for the first time, it has become possible to compare the parts using computer methods and to interpret this comparison linguistically.
SbTol has a uniquely complex composition. Its contents determine its particular value: it includes the earliest copies of Cyril of Turov's homilies (ff. 1r-23r, 25r-46r), the Parable of Wisdom (ff. 48r-49v), a compilation attributed to John Chrysostom (ff. 49v-56v; the compilatory nature of this homily was recently discovered in Maria Novak's research (Novak, 2019)), two apocryphal texts, the Legend of Aphroditian (ff. 56v-62r) and the Abgar Legend (ff. 62v-68v), the Life of Basil the Great (ff. 68v-88v; this version remained unknown to Bulgarian researchers of hagiography, and its incipit differs from those presented in (Ivanova, 2008: 410-418), which indicates a particular translation), and a peculiar version of Cyril of Jerusalem's catechetical lectures (ff. 89v-184r; the special version of the translation of the catechetical lectures was preliminarily confirmed in (Novak, Penkova, 2020)). Two anonymous sermons stand between Cyril of Turov's homilies, and the Parable of Wisdom adjoins them. These texts could belong to Cyril of Turov, although, unlike the other six homilies, they carry no indication of authorship. Indeed, in (Svodnyi katalog, 1984: 324), the anonymous sermons on the 5th Sunday after Easter (ff. 23r-25r) and on Pentecost (ff. 46r-48r) are attributed to Cyril of Turov. The Parable of Wisdom is associated with Cyril of Turov's authorship in (Svodnyi katalog, 2002: 494; Zalizniak, 2004: 464; Slovar' drevnerusskogo iazyka, VI: 34). In (Zholobov, 2018a; Zholobov, 2018b; Zholobov, Novak, 2018), we found cases of orthographic, lexical, and grammatical contrast proving that, contrary to the existing assumptions, these three texts did not belong to Cyril of Turov. The «intrigue» of the research below lies in using new information technologies to search for accurate quantitative indicators of authorship and in interpreting them linguistically.
Of great interest are the statistical parameters that determine the differences in the creative methods of the two authors of homiletic texts, Cyril of Turov and Cyril of Jerusalem. This first attempt to combine cluster analysis with linguistic analysis produced results that were unexpected in many ways.
At the stage of the statistical experiment, the following tasks are to be solved:
- to find statistical differences in Cyril of Turov's and Cyril of Jerusalem's texts,
- to identify the degree of contrast of John Chrysostom's text and two subcorpora with the established authorship,
- to determine the degree of proximity or remoteness of three anonymous texts to Cyril of Turov's texts,
- to demonstrate the general grouping of Cyril of Turov's and Cyril of Jerusalem's works and other texts of SbTol.
Quantitative and statistical methods in linguistics.
Quantitative and statistical methods of data analysis have long been actively and productively used in various theoretical and applied linguistic studies: text systematization, text attribution, identification of topics and keynotes, and other areas. These methods are based on a statistical analysis of text units (characters, words, syntactic constructions) and/or their distribution. Started in Russia by N.A. Morozov (Morozov, 1915), analysis based on quantitative and statistical characteristics had become one of the leading research methods by the end of the 20th century. As more machine-readable texts appeared, it became an integral part of linguistic research (cf., for example, (Marusenko, 1990; From Nestor to Fonvizin, 1994; Mukhin, 2011; Shaikevich, Andriushchenko, Rebetskaia, 2013; Zakharov, Khokhlova, 2014; Mitrofanova, 2015; Borunov, Malygin, 2016; Litvinova, T., Litvinova, O., 2016; Litvinova, T., Zagorovskaia, Seredin, 2016; Mimno, Blei, 2011; Bing, 2012; Blei, 2012; Daud, 2012; Guest, MacQueen, Namey, 2012; Baranov, 2018; Jurafsky, Martin, 2019: 325-415) and many other works devoted to general and peripheral issues of linguistic statistics).
Establishing authorship.
Determining the author of a piece of text is one of the traditional and long-developing areas using quantitative and statistical information. Over more than 100 years of searching for effective methods for solving this problem, a large number of methods and techniques have been proposed and tested using both various formal indicators of texts and various analysis procedures and algorithms (cf., for example, (Morozov, 1915; Marusenko, 1990; Martynenko, 2014; Martynenko, 2015; Gurova, 2016), and many others).
The work (Gurova, 2016) reviews various approaches and systematizes the methods used for attributing anonymous texts: those based on the analysis of vocabulary, of syntactic constructions, and of sequences of linguistic units, as well as complex ones (Gurova, 2016: 30). The author concludes that neither the lexical or syntactic quantitative characteristics of texts (for instance, the frequency of function or modal words, of syntactic constructions, or sentence length) nor the counting of letter combinations with, for example, Markov processes applied to them is universal (Gurova, 2016: 31-34). The author recognizes complex methods as the most effective but notes that even then, mathematical or linguistic methods «определения авторства могут быть лишь подспорьем для филолога: полностью доверять им нельзя» («of determining authorship can only be an aid to the philologist: one cannot trust them completely») (Gurova, 2016: 35).
Methods.
To achieve the goal set in this paper, we use cluster analysis. It organizes the studied objects, according to their characteristics, into groups (clusters) in which the objects are more similar to one another than to objects from other groups, and it visualizes the grouping as a dendrogram (Mandel', 1988: 10, 11, 22, 40; Manning, Raghavan, Schütze, 2011: 353-354, 379-381; Vorontsov; Klasternyi Analiz).
The taxonomy method proposed in (Tryon, 1939) is a family of algorithms (procedures) that combine various ways of measuring multidimensional distances between objects (numerical expressions of their similarities and differences) with various ways of forming clusters (cf., for example, (Cattell, 1944; Sokal, Sneath, 1963; Mandel', 1988; Prikladnaia statistika, 1989; Zagoruiko, 1999; Paul, Gore, 2000; Manning, Raghavan, Schütze, 2011: 383-396)).
The Euclidean distance, the squared Euclidean distance, the Manhattan distance, the power distance, the Chebyshev distance, and others are used as distance measures (metrics). Clusters are formed using the nearest-neighbour rule (single linkage), the farthest-neighbour rule (complete linkage), Ward's method, and k-means clustering (Mandel', 1988: 30-35, 41-52 onwards; Jain, Murty, Flynn, 1999; Prikladnaia statistika, 1989: 147-181, 249-260; Avtomaticheskaia obrabotka, 2011: 193-194; Klasternyi Analiz).
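Four of the metrics named above can be sketched in a few lines. This is only an illustration in plain Python; the function names are ours and are not taken from Statistica or from the works cited.

```python
import math

def euclidean(u, v):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def squared_euclidean(u, v):
    # Same sum without the square root; penalizes large differences more.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def manhattan(u, v):
    # Sum of absolute coordinate differences (city-block distance).
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    # Largest absolute difference over all coordinates.
    return max(abs(a - b) for a, b in zip(u, v))
```

For the points (0, 0) and (3, 4), these functions give 5.0, 25, 7, and 4 respectively, which shows how strongly the choice of metric can change the numerical distance between the same two objects.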
In addition to hierarchical methods, there is a group of non-hierarchical ones, which includes, in particular, k-means: the algorithm constructs a given number of clusters so that the distances between objects within a cluster are as small as possible and the distances between clusters are as large as possible (Mandel', 1988: 40; Prikladnaia statistika, 1989: 221-222, 291-293; Manning, Raghavan, Schütze, 2011: 363-370; Klasternyi Analiz).
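A minimal k-means sketch may make the non-hierarchical idea concrete. It assumes squared Euclidean distance and a naive deterministic start from the first k points; both are simplifications chosen for illustration, and the paper itself relies on hierarchical clustering rather than on this procedure.

```python
def kmeans(points, k, iterations=100):
    # Naive deterministic initialization: the first k points become centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the partition has stabilized
        assignment = new_assignment
        # Move every centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Unlike a dendrogram, the result is a flat partition into exactly k groups, so the number of clusters has to be chosen in advance.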
The cluster method has its peculiarities. The researchers indicate that the results of clustering depend, in particular, on the selection and proportions of the source data, on the choice of method for metrics' evaluation, on grouping rules, on the degree of compactness, on the proportionality of the selected measure for each feature of grouping objects (Mandel', 1988: 30, 111-112; Prikladnaia statistika, 1989: 148, 180-181, 300-301; Bureeva, 2007: 8-11; Vorontsov).
At the same time, they emphasize that the method allows presenting data in accordance with the tasks at hand, for which one can select appropriate algorithms for measuring distance and for moving from one level of grouping to another (Mandel', 1988: 73, 146; Zagoruiko, 1999: 60-62; Rabinovich, 2007: 74; Bureeva, 2007: 12-14; Avtomaticheskaia obrabotka, 2011: 208-209). For example, N.G. Zagoruiko writes: «Одной, «самой естественной», «абсолютно объективной», таксономии не существует. Все реальные объекты имеют бесконечное число свойств, и выделение некоторого конечного подмножества этих свойств - акт субъективный. Меры близости, критерии качества также выбираются субъективно. Если известна цель, для достижения которой делается таксономия (т. е. при наличии «суперцели»), то качество таксономии проверяется тем, хорошо ли она способствует достижению этой цели, удобна ли, экономична и т. д. Эта проверка носит объективный характер, но выбор суперцели опять-таки субъективен и для одной суперцели данная таксономия будет хорошей, для другой - нет» («A single, «most natural», «absolutely objective» taxonomy does not exist. All real objects have an infinite number of properties, and the selection of a finite subset of these properties is a subjective act. Proximity measures and quality criteria are also selected subjectively. If the goal for which a taxonomy is made is known (that is, if there is a «super goal»), then the quality of the taxonomy is verified by whether it contributes well to this goal, whether it is convenient, economical, and so on. This verification is objective. However, the choice of a super goal is again subjective, and for one super goal this taxonomy will be good; for another, it will not») (Zagoruiko, 1999: 59). Also: «Надо помнить, что выбранная метрика, как и выбранное пространство, является единственной, и никакая другая такого же результата не гарантирует. Поэтому очень полезно сделать расчеты несколько раз с разными метриками и найти устойчивые общие черты в разбиениях. 
Окончательный критерий кластер-анализа - критерий практической полезности результата; в случае успеха одновременно считаются удачными и расстояние, и алгоритм» («One should remember that the selected metric, like the selected space, is unique, and no other guarantees the same result. Therefore, it is very useful to make the calculations several times with different metrics and to find stable common features in the partitions. The final criterion for cluster analysis is the criterion of the practical utility of the result; if successful, both the distance and the algorithm are considered successful at the same time») (Mandel', 1988: 146). There is also evidence that «одна и та же пара алгоритма кластеризации и метрики дает различные результаты в зависимости от программы кластеризации» («the same pair of clustering algorithm and metric gives different results depending on the clustering program») (Rabinovich, 2007: 74).
For methodology, «познание сущности объекта сводится к выявлению тех его качественных свойств, которые и определяют данный объект, отличают его от других. По этой причине задача построения естественных классификаций в известной мере смыкается с традиционной для статистически задачей построения типологических группировок...» («knowledge of the essence of an object comes down to identifying those of its qualitative properties that define the object and distinguish it from others. For this reason, the task of constructing natural classifications converges, to a certain extent, with the task of constructing typological groupings, which is traditional for statistics...») (Mandel', 1988: 138). Besides: «Однако объекты могут быть однокачественными в одном отношении и разнокачественными в другом, причем выбор этих отношений (целей, точек зрения) полностью находится в руках исследователя» («However, objects can be homogeneous in one respect and heterogeneous in another, and the choice of these respects (goals, points of view) is entirely in the hands of the researcher») (Mandel', 1988: 138).
The quoted statements and the logic of the study require choosing features that correspond to our goal. To verify whether the anonymous pieces from SbTol might be attributed to Cyril of Turov, we should exclude from the analysis the features that unite the anonymous and the attributed texts, namely graphical and spelling peculiarities and common text topics. We should also rely on the primary expert grouping of Cyril of Turov's and Cyril of Jerusalem's works, which is based on their differentiating features.
Argumentation.
To initialize cluster analysis, we can use the information either about the properties (characteristics) of objects or their pairwise mutual distances, in both cases presented in the form of matrices (Prikladnaia Statistika, 1989: 143; Vorontsov).
To compare texts and register the degree of their closeness using cluster analysis, one can use various quantitative characteristics, such as the number of tokens, their order in lists ranked by a quantitative or statistical value, the average length of linguistic units, lists of the most important words or combinations, and discrepancy nodes (places where copies diverge).
There are cases of using cluster methods for the analysis of medieval Slavonic manuscripts. The works of D.M. Mironova (Mironova, 2015; Mironova, 2017) demonstrate the efficiency of automatic clustering of medieval manuscripts of one work (based on several dozen verses of the Gospel of Matthew (Matthew, 14: 14-34) from 525 Slavonic copies of the Gospels from the 11th - 16th centuries). The author also proposes optimal procedures for highlighting textually significant differences and selecting parameters for comparing manuscripts.
Source data selection and algorithm search.
The properties of cluster methods make it possible to select experimentally such characteristics of the analyzed objects, which provide a result that most closely corresponds to the intuitively (expertly) established grouping of objects.
Hierarchical clustering algorithms are the most popular, in particular because their results are easy to see on dendrograms and can (and, moreover, should) be compared even when they are obtained by different methods (Mandel', 1988: 73).
There are several approaches to finding an objective grouping of objects: a) data arrays with an unknown structure are analyzed by various methods and the results are compared; b) data are analyzed using algorithms verified on similar data arrays; c) algorithms are verified on artificial arrays; and others (Mandel', 1988: 108).
In this paper, we use the first approach, which also involves comparing the analysis results to the existing (expert) grouping into two groups - texts of Cyril of Turov and texts of Cyril of Jerusalem. Searching for the actual results presumes, at the same time, the search for the most appropriate method of analysis and the need for experiments with various data sets.
Algorithms for grouping objects into clusters have various properties. One can assume that texts written by the same author have some similar features represented by numerical values, and can choose a grouping method that puts these texts into a separate group. After finding such a method, it is possible to determine the location of anonymous texts in relation to established clusters.
Texts to analyze
Our material contains six texts of Cyril of Turov and seven texts of Cyril of Jerusalem.
The sermons of Cyril of Turov are:
On Sunday of St. Thomas the Apostle (without beginning) (hereafter CT_Thom, ff. 1r-5v), On Descent from the Cross (hereafter CT_Desc, ff. 5v-16r), On Sunday of the Paralytic (hereafter CT_Paral, ff. 16r-23r), On Sunday of the Blind Man (hereafter CT_Blind, ff. 25r-32r), On Ascension of the Lord (hereafter CT_Asc, ff. 32r-37v), On Nicaea Council Fathers' commemoration (hereafter CT_Fath, ff. 37v-46r).
The catechetical lectures of Cyril of Jerusalem are: 1st - 3rd (hereafter CJ1, ff. 89v-92v; CJ2, ff. 92v-97v; CJ3, ff. 97v-103r) and 13th - 16th (hereafter CJ13, ff. 158v-170v; CJ14, ff. 170v-176v; CJ15, ff. 176v-181v; CJ16, ff. 181v-184v).
The volumes of the texts of the two authors are approximately equal: 39,899 and 44,524 tokens, respectively.
The volume of the three anonymous texts is 5,215 tokens (2,173, 1,572, and 1,470).
The analysis also involves one text partially attributed to John Chrysostom, the Nativity sermon (hereafter Chrys_Nat, ff. 49v-56v, 5,391 tokens), as well as the Legend of Aphroditian (hereafter Aphr, ff. 56v-62r, 4,649 tokens), the Abgar Legend (hereafter Abg, ff. 62v-68v, 5,016 tokens), and the Life of St. Basil the Great (hereafter Life_Bas, ff. 68v-88v, 19,421 tokens).
The pieces of Cyril of Turov, Cyril of Jerusalem, and the Nativity sermon, the authors of which are known, as well as four texts with non-established authorship, act as expert texts.
Machine-readable text transcription
A peculiarity of the text transcriptions on the «Manuscript» website is that they correspond to the original as closely as possible: they reproduce the manuscripts letter for letter, line for line, and page for page. Since a manuscript may have been written by several scribes, using an exact transcription that conveys all the features of the original would base the analysis on the graphic characteristics of the writing rather than on the linguistic features of the texts. Therefore, when preparing the data, we level graphic and orthographic variation as far as possible, using the modern Cyrillic alphabet in the token lists and eliminating all diacritics.
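The levelling step can be illustrated with a short sketch. It assumes that diacritics are encoded as Unicode combining marks, and it uses a small, purely hypothetical letter mapping (`LETTER_MAP`); the actual correspondence rules used on the «Manuscript» website are not given in the text.

```python
import unicodedata

# Hypothetical mapping of historical letters to modern Cyrillic (illustration only).
LETTER_MAP = {"ѡ": "о", "ѥ": "е", "ꙗ": "я", "ѧ": "я", "і": "и"}

def normalize_token(token):
    # Decompose characters so that diacritics become separate combining marks.
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Drop every combining mark (titlo, breathings, accents, and so on).
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.category(ch).startswith("M"))
    # Replace historical letters with their assumed modern counterparts.
    return "".join(LETTER_MAP.get(ch, ch) for ch in stripped)
```

Note that this simple approach also removes the breve of «й», which decomposes into «и» plus a combining mark; whether that is acceptable depends on the transcription rules in use.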
Tools
The statistics module of the «Manuscript» corpus (http://manuscripts.ru/mns/!cred2.stat) provides a query form and a form for visualizing samples. It allows creating comparable subcorpora; entering a wildcard pattern for linguistic units; choosing a quantitative ranking measure (the absolute or relative use of units) or a statistical one (Log-Likelihood, TF * ICTF, Weirdness); selecting a contrasting subcorpus; and sorting and displaying the result lists in a table. Each subcorpus's numerical data (a unit's number, its rank, its absolute or relative quantity, and the value of the statistical measure) can be viewed in tables and studied with the traditional comparative method, or exported to other programs for processing and evaluating numerical data. In this work, we carry out correlation and cluster analysis using the professional statistical software package Statistica (StatSoft-Dell / TIBCO Software Inc.).
Data extraction
For clustering, each text should be represented as a point in a feature space, that is, as a vector of numerical values. For such a vector, we selected a set of quantities that describe the relation of a text to the other texts, namely the values of pairwise correlations, which can be arranged in an n*m matrix.
The correlations between the texts were calculated with Spearman's rank correlation method, based on the ranks of tokens in a ranked list.
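Spearman's coefficient is the Pearson correlation computed on ranks. The sketch below shows one common way to calculate it in plain Python, with tied values receiving the average of their positions; it is not the Statistica implementation, and the function names are ours.

```python
import math

def average_ranks(values):
    # Rank values from smallest to largest; ties share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    # Pearson correlation; assumes neither sequence is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman's rho: Pearson correlation of the two rank sequences.
    return pearson(average_ranks(x), average_ranks(y))
```

Applied to two texts, `x` would hold the ranks of the most frequent tokens in the main text and `y` the ranks of the same tokens in the text being compared.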
We used various methods to identify how much thematic and semantic features of texts influence the analysis and minimize the peculiarities of writing: the generalization of tokens, their sorting based on quantitative parameters, and taking into account their part-of-speech characteristics.
Experimental technique
Using the query form of the «Manuscript» corpus, based on the SbTol transcription, we prepared 20 samples, including six sermons of Cyril of Turov, seven lectures of Cyril of Jerusalem, three controversial anonymous works, and four other texts.
The prepared samples were loaded into the statistics module, and the necessary query parameters were set (unit type: token; step type: sampling; measure: relative quantity; accuracy: 1 (diacritics removed, ligatures are dis[...])); see Fig. 1:
Fig. 1. The query form of the statistics module
The result of the query is a table that includes all tokens of samples, and the information about each of them: their absolute and relative number, their index number, and their rank in every text (see Fig. 2).
Fig. 2. The result web form: a table of tokens regularized by frequency criterion
In the web form, there is a possibility to resort the table columns. It allows choosing each of the texts as the main one and sorting its tokens in descending order of their quantity, and then establishing the correspondence of the sequence of forms for each pair of texts (see Tables 1 and 2).
The lists are 100 token forms long; the results were saved in files of the Statistica program.
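A ranked list of this kind can be sketched as follows. The sketch assumes that tokens with the same absolute quantity share one rank and that the next distinct quantity receives the next rank, which is the pattern visible in the R and F columns of Table 1; the alphabetical tie-break and the definition of relative quantity as count divided by text length are our assumptions.

```python
from collections import Counter

def frequency_table(tokens, top_n=100):
    counts = Counter(tokens)
    total = len(tokens)
    # Descending count; alphabetical order among ties keeps the output reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    rank = 0
    previous = None
    for ordinal, (token, count) in enumerate(ordered, start=1):
        if count != previous:
            rank += 1  # a new distinct quantity opens the next rank
            previous = count
        # (token, ordinal number, rank, absolute quantity, relative quantity)
        rows.append((token, ordinal, rank, count, count / total))
    return rows[:top_n]
```

Each row mirrors the №, R, F, and Freq columns of the result table, so two such lists can be aligned token by token before the correlation is computed.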
In the Statistica program, using Spearman's rank correlation method on the text forms (when lemmas were analyzed, only function words and adverbs were selected from a list of 200 forms), we established the correlation distances between each text and the other texts (see Table 3) and saved the results in temporary files.
Then, we collected the data in an n*m matrix (see Table 4), where each text gained a description with a set of correlation values relative to all other texts.
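Assembling such a matrix might look like the sketch below. The input format (a dictionary of token ranks per text), the fallback rank for tokens absent from the compared text, and the value 0.0 returned for a constant sequence are all assumptions made for the illustration; the text does not specify how these cases were handled.

```python
import math

def _ranks(values):
    # Average ranks, ties sharing the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out

def spearman(x, y):
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    # Assumption: a constant sequence is treated as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(ranks, top_n=100):
    # ranks: {text name: {token: rank in that text}} (hypothetical input format)
    names = sorted(ranks)
    matrix = []
    for main in names:
        # Most frequent tokens of the main text, best rank first.
        top = sorted(ranks[main], key=lambda t: (ranks[main][t], t))[:top_n]
        x = [ranks[main][t] for t in top]
        row = []
        for other in names:
            # Assumption: a token missing from the other text gets a rank past its last one.
            fallback = max(ranks[other].values()) + 1
            row.append(spearman(x, [ranks[other].get(t, fallback) for t in top]))
        matrix.append(row)
    return names, matrix
```

Each row of the result describes one text by its correlations with all the texts, which is the kind of vector the clustering step takes as input.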
In the Statistica program, we obtained a series of dendrograms (see Fig. 3).
Experimentally, for the texts of Cyril of Turov and Cyril of Jerusalem, we selected such combinations of proximity measures and association rules that gave two distinct clusters.
Fig. 3. Building dendrograms in the Statistica program
Table 1. Correlation of quantitative characteristics of the most frequent tokens in Cyril of Turov's sermon «On Sunday of St. Thomas the Apostle» with the corresponding values of other texts (a fragment)
Texts and sample volumes (the number of tokens used in the text): Thom = Cyril of Turov's sermon «On Sunday of St. Thomas the Apostle» (without a beginning), 4293; Desc = Cyril of Turov's sermon «On Descent from the Cross», 9925; Paral = Cyril of Turov's sermon «On Sunday of the Paralytic», 6605; Blind = Cyril of Turov's sermon «On Sunday of the Blind Man», 6009.
Columns for each text: № = a token's ordinal number; R = a token's rank; F = a token's absolute quantity; Freq = a token's relative quantity.
Token | Thom № | Thom R | Thom F | Thom Freq | Desc № | Desc R | Desc F | Desc Freq | Paral № | Paral R | Paral F | Paral Freq | Blind № | Blind R | Blind F | Blind Freq |
H | 1 | 1 | 95 | 0.076 | 1 | 1 | 186 | 0.068 | 1 | 1 | 125 | 0.067 | 1 | 1 | 107 | 0.062 |
BH | 2 | 2 | 20 | 0.016 | 2 | 2 | 42 | 0.015 | 3 | 3 | 39 | 0.021 | 3 | 3 | 32 | 0.019 |
o(t) | 3 | 2 | 20 | 0.016 | 4 | 3 | 41 | 0.015 | 9 | 9 | 16 | 0.009 | 6 | 5 | 26 | 0.015 |
He | 4 | 3 | 15 | 0.012 | 9 | 8 | 28 | 0.010 | 2 | 2 | 49 | 0.026 | 2 | 2 | 35 | 0.020 |
CH | 5 | 3 | 15 | 0.012 | 3 | 3 | 41 | 0.015 | 20 | 14 | 10 | 0.005 | 7 | 6 | 21 | 0.012 |
6o | 6 | 4 | 13 | 0.010 | 6 | 5 | 37 | 0.013 | 5 | 5 | 25 | 0.013 | 10 | 8 | 16 | 0.009 |
Ha | 7 | 4 | 13 | 0.010 | 5 | 4 | 40 | 0.015 | 8 | 8 | 17 | 0.009 | 4 | 4 | 30 | 0.017 |
HbIHfl | 9 | 5 | 12 | 0.010 | 39 | 20 | 8 | 0.003 | 30 | 17 | 7 | 0.004 | 31 | 15 | 6 | 0.003 |
HKO | 8 | 5 | 12 | 0.010 | 36 | 20 | 8 | 0.003 | 14 | 12 | 12 | 0.006 | 15 | 12 | 10 | 0.006 |
Zta | 10 | 6 | 9 | 0.007 | 17 | 14 | 16 | 0.006 | 10 | 9 | 16 | 0.009 | 25 | 15 | 6 | 0.003 |
enme | 14 | 6 | 9 | 0.007 | 23 | 17 | 11 | 0.004 | 64 | 20 | 4 | 0.002 | 72 | 18 | 3 | 0.002 |
5K e | 11 | 6 | 9 | 0.007 | 7 | 6 | 32 | 0.012 | 4 | 4 | 27 | 0.014 | 11 | 9 | 15 | 0.009 |
MS | 13 | 6 | 9 | 0.007 | 58 | 23 | 5 | 0.002 | 15 | 12 | 12 | 0.006 | 37 | 16 | 5 | 0.003 |
0 | 12 | 6 | 9 | 0.007 | 8 | 7 | 29 | 0.011 | 17 | 13 | 11 | 0.006 | 8 | 7 | 17 | 0.010 |
a3H | 15 | 7 | 7 | 0.006 | 265 | 26 | 2 | 0.001 | 74 | 20 | 4 | 0.002 | 11386 | 21 | | |
BCfl | 16 | 8 | 6 | 0.005 | 18 | 15 | 14 | 0.005 | 82 | 21 | 3 | 0.002 | 280 | 20 | | |
eCMb | 17 | 8 | 6 | 0.005 | 298 | 26 | 2 | 0.001 | 25 | 17 | 7 | 0.004 | 8834 | 21 | | |
KH | 19 | 8 | 6 | 0.005 | 14 | 12 | 19 | 0.007 | 12 | 11 | 13 | 0.007 | 21 | 14 | | |
MH | 20 | 8 | 6 | 0.005 | 24 | 18 | 10 | 0.004 | 16 | 12 | 12 | 0.006 | 562 | 20 | | |
pe6pa | 18 | 8 | 6 | 0.005 | 85 | 24 | 4 | 0.001 | 3761 | 24 | 0 | 0.000 | 3729 | 21 | | |
Table 2. Correlation of quantitative characteristics of the most frequent tokens in Cyril of Turov's sermon «On Descent from the Cross» with the corresponding values of other texts (a fragment)
Texts and sample volumes: Desc = Cyril of Turov's sermon «On Descent from the Cross», 9925; Thom = Cyril of Turov's sermon «On Sunday of St. Thomas the Apostle» (without a beginning), 4293; Paral = Cyril of Turov's sermon «On Sunday of the Paralytic», 6605; Blind = Cyril of Turov's sermon «On Sunday of the Blind Man», 6009.
Columns for each text: № = a token's ordinal number; R = a token's rank; F = a token's absolute quantity; Freq = a token's relative quantity.
Token | Desc № | Desc R | Desc F | Desc Freq | Thom № | Thom R | Thom F | Thom Freq | Paral № | Paral R | Paral F | Paral Freq | Blind № | Blind R | Blind F | Blind Freq |
H | 1 | 1 | 186 | 0.068 | 1 | 1 | 95 | 0.076 | 1 | 1 | 125 | 0.067 | 1 | 1 | 107 | 0.062 |
Bib | 2 | 2 | 42 | 0.015 | 2 | 2 | 20 | 0.016 | 3 | 3 | 39 | 0.021 | 3 | 3 | 32 | 0.019 |
o(T) | 4 | 3 | 41 | 0.015 | 3 | 2 | 20 | 0.016 | 9 | 9 | 16 | 0.009 | 6 | 5 | 26 | 0.015 |
CB | 3 | 3 | 41 | 0.015 | 5 | 3 | 15 | 0.012 | 20 | 14 | 10 | 0.005 | 7 | 6 | 21 | 0.012 |
Ha | 5 | 4 | 40 | 0.015 | 7 | 4 | 13 | 0.010 | 8 | 8 | 17 | 0.009 | 4 | 4 | 30 | 0.017 |
60 | 6 | 5 | 37 | 0.013 | 6 | 4 | 13 | 0.010 | 5 | 5 | 25 | 0.013 | 10 | 8 | 16 | 0.009 |
)Ke | 7 | 6 | 32 | 0.012 | 11 | 6 | 9 | 0.007 | 4 | 4 | 27 | 0.014 | 11 | 9 | 15 | 0.009 |
0 | 8 | 7 | 29 | 0.011 | 12 | 6 | 9 | 0.007 | 17 | 13 | 11 | 0.006 | 8 | 7 | 17 | 0.010 |
He | 9 | 8 | 28 | 0.010 | 4 | 3 | 15 | 0.012 | 2 | 2 | 49 | 0.026 | 2 | 2 | 35 | 0.020 |
The experiment
1.1. The experiment used lists of the 100 most frequent tokens. We found a correlation between the texts, as described above, and summarized the data in an n*m matrix. The matrix data underwent the cluster analysis.
Experimentally, we selected the combination of metric and linkage method that most closely matched the expert grouping of the texts: Ward's method combined with the 1-r Pearson proximity measure distributed Cyril of Turov's and Cyril of Jerusalem's texts into two clusters (see Fig. 4).
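The combination of the 1-r Pearson measure with Ward's method can be sketched as follows. The sketch assumes the Lance-Williams update formula in its usual form for Ward's criterion; Statistica's internal procedure may differ in detail, and the four correlation profiles and their labels are invented for the illustration and are not data from the paper.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def one_minus_r(rows):
    # Proximity measure 1 - r between every pair of row vectors.
    n = len(rows)
    return [[0.0 if i == j else 1.0 - pearson(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]

def ward_merges(dist, labels):
    n = len(labels)
    clusters = {i: [labels[i]] for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        # Merge the two closest clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k, members in clusters.items():
            if k in (i, j):
                continue
            nk = len(members)
            dki = d[(min(k, i), max(k, i))]
            dkj = d[(min(k, j), max(k, j))]
            # Lance-Williams update for Ward's criterion.
            value = ((ni + nk) * dki ** 2 + (nj + nk) * dkj ** 2 - nk * dij ** 2) / (ni + nj + nk)
            updated[(k, next_id)] = math.sqrt(max(value, 0.0))
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(updated)
        merged = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = merged
        merges.append((sorted(merged), dij))
        next_id += 1
    return merges

# Invented correlation profiles of four hypothetical texts.
labels = ["a", "b", "c", "d"]
profiles = [[1.0, 0.9, 0.1, 0.2], [0.9, 1.0, 0.2, 0.1],
            [0.1, 0.2, 1.0, 0.9], [0.2, 0.1, 0.9, 1.0]]
merges = ward_merges(one_minus_r(profiles), labels)
```

The returned list of merges, with the distance at which each merge happened, carries the same information as a dendrogram: here the texts "a" and "b" join each other, as do "c" and "d", before the two pairs are united at a much larger distance.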
Fig. 4. The dendrogram of Cyril of Turov's and Cyril of Jerusalem's texts (Ward's method, Pearson's 1-r correlation)
1.2. Adding the three anonymous texts to the 13 texts places them in the subcluster of Cyril of Jerusalem's works, not Cyril of Turov's (see Fig. 5). Two of them, the anonymous sermon on the 5th Sunday after Easter and the anonymous sermon on Pentecost, form a separate subcluster, while the anonymous Parable of Wisdom forms a subcluster with the 3rd lecture of Cyril of Jerusalem.
Fig. 5. The dendrogram of the texts of Cyril of Turov, Cyril of Jerusalem, and three controversial texts (Ward's method, 1-r Pearson proximity measure)
1.3. For comparison, the texts of Cyril of Turov and Cyril of Jerusalem were analyzed with the addition of four texts of unknown authors (Fig. 6).
Fig. 6. The dendrogram of the texts of Cyril of Turov, Cyril of Jerusalem and four texts of other authors (Ward's method, 1-r Pearson proximity measure)
All four texts formed a special subcluster close to the subcluster of Cyril of Jerusalem's texts.
1.4. The construction of the dendrogram using all texts gave the result shown in Fig. 7.
Fig. 7. The dendrogram of the texts of Cyril of Turov, Cyril of Jerusalem, three controversial texts, and four texts of unknown authors (Ward's method, 1-r Pearson proximity measure)
Two of the controversial texts, the anonymous sermon on the 5th Sunday after Easter and the anonymous sermon on Pentecost, formed a subcluster within the cluster of Cyril of Turov's works, close to the subcluster of Cyril of Turov's sermon on Descent from the Cross and his sermon on Sunday of the Paralytic. The anonymous Parable of Wisdom was included in the subcluster of works of various authors, close to the subcluster of Cyril of Jerusalem's works.
The linguistic interpretation of the experiment
The experiment results given above may, at first glance, seem paradoxical. Using statistical tools (a combination of linkage rules and proximity measures), we constructed a quantitative picture of the distribution of homiletic texts that are homogeneous in genre and style (see Figs. 4 and 5), and yet the texts turned out to be distinctly separated, i.e., grouped in different clusters and subclusters.
A distinct statistical contrast between Cyril of Turov's original homilies and Cyril of Jerusalem's translated lectures made it possible to reevaluate both the degree of originality of Cyril of Turov's sermons and V. V. Kolesov's idea (which seemed to be an exaggeration) about the russification of Cyril of Turov's language. Cf.: «Художественное открытие Кирилла и заключается в самом раннем в истории русского литературного языка и весьма последовательном сближении двух языковых стихий - церковнославянской и русской, в чрезвычайно тонком понимании их специфики и пределов использования в художественной речи» ("Cyril's artistic discovery is the persistent convergence of two language elements, Church Slavonic and Russian, with an extremely subtle understanding of their specificity and limits of use in artistic speech, a phenomenon that was the earliest in the history of the Russian literary language") (Kolesov, 1981: 38).
Although different clustering conditions give similar results, some dissimilarity remains. Examining the specific lexical and grammatical forms whose quantitative characteristics were analyzed allows us to understand both the lexical and grammatical reasons for the inclusion of the controversial texts in certain subclusters and the differences in the classification results. Comparing the linguistic parameters that underlie the classification in Fig. 7 shows in what respects and how accurately the statistical data correspond to the linguistic picture of convergence and divergence in the texts. In the comparison below, we consider the full-meaning units: nouns, adjectives, verbs, and pronouns. Because we carry out the statistical calculations on word forms (tokens) rather than lemmas, this comparison indicates not just lexical convergence or divergence but a specific morphological and syntactic realization, i.e., a special kind of thematic proximity or remoteness of texts. The coincidence of tokens, rather than lexemes, therefore points to a particularly close proximity between the texts.
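The difference between counting word forms and counting lemmas can be illustrated with a short sketch. The form-to-lemma map and the function name are hypothetical. The counts for землю, землею, and земли follow the figures the article reports for the sermon «On Descent from the Cross», while the count for the form земля itself is an invented placeholder.

```python
from collections import Counter

# Hypothetical form-to-lemma map; a real one would come from a lemmatized corpus.
LEMMA_OF = {
    "земля": "земля",
    "землю": "земля",
    "землею": "земля",
    "земли": "земля",
}


def token_and_lemma_counts(tokens, lemma_of):
    """Count word forms as they stand, then pool the counts under lemmas."""
    token_counts = Counter(tokens)
    lemma_counts = Counter()
    for form, count in token_counts.items():
        # Forms without an entry in the map are treated as their own lemma.
        lemma_counts[lemma_of.get(form, form)] += count
    return token_counts, lemma_counts


# 3 x землю, 2 x землею, 1 x земли as reported; 4 x земля is a placeholder.
tokens = ["земля"] * 4 + ["землю"] * 3 + ["землею"] * 2 + ["земли"]
token_counts, lemma_counts = token_and_lemma_counts(tokens, LEMMA_OF)
```

With token-based counting each form competes for a rank on its own, so a match between two texts means that the same lexeme occurs in the same grammatical form, not merely that the lexeme is shared.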
The homilies «On Descent from the Cross» and «On Sunday of the Paralytic» form a subcluster; the frequent nouns accurately reflect the key themes of both sermons. At the same time, there are not many exact matches of tokens in them, and each list, being unique, characterizes Cyril of Turov's preaching style. Cf. Table 5, where the ranks are indicated.
In these lists of frequent tokens, along with the expected coincidences of the word forms of Богъ (see below), we also register non-trivial ones. For instance, the form словомь (Instr. Sg.) turned out to be quite frequent, as did various declension forms of the word земля. Словомь is a new form that replaces the original *s-stem form словесьмь. In Cyril of Turov's preaching strategy, the form словомь is necessary for confessing, in both homilies, the life-giving, miraculous power of the divine Logos's utterances, when a word becomes a deed (слово ѥго дѣломь бы(с), CT_Paral, 16.2). Cf.:
8 и? мьртвы-
9 эa? словомь въскр?сивъша тво?Мго
10 бжс ?тва мановени?Ммь CT_Desc, 8.1;
14 Како ли
15 въ мо??мь х?д?мь положю т? гро-
16 б? · нбс ?ныи кр?гъ ??твердивъша-
17 го словомь · иМ на х?ровим?хъ съ ??-
18 ц?мь иМ съ с?тымь почиваю?щаго д?хмь CT_Desc, 11.1;
15 БлжЮнъ ?Мси и???сифе · и?же вс? ??живи-
16 въшаго словомь · иМ водами покры-
17 въшаго твердь нбс ?н?ю · сего эaМко мь-
18 ртвьца каменемь покрылъ ?си
19 въ гроб? CT_Desc, 14.2;
11 Егоже ны?н? хс ?ъ бла-
12 гыиМ члвЮколюбець словомь и?ц?ли ·
13 врачь бо ??сть д?шамъ нашимъ и? т?-
14 ломъ · и? слово ??го д?ломь бс ?ы CT_Paral, 16.2;
3 Лазор?
4 ??же раскыс?въша въ гроб? · и че-
5 тыри д?ни иМм?ща въ мьртвыхъ ·
6 словомь жива створихъ · и тоб?
7 нын? г?лю въстани иМ възми ??дръ
8 своиМ · иМ иМди въ домъ своиМ CT_Paral, 20.2;
8 не насытисте ли с? въ · лЮ · иМ ·
9 иЮ · лт ?? · зр?ще мене на ??др? иМспол?-
10 мьртва лежаща нын? же въставъ-
11 шю ми б Южи??мь словомь ??сльпосте
12 ?момь иМ ?? сво??иМ храмлюще пр?-
13 тыка?Мтес? неправд? CT_Paral, 21.1.
It is noticeable that the syntagmas with the form словомь are not repetitive, and in the last context this form occasionally expands a Dativus absolutus. Only once is the instrumental form used as a comitative, where it extends the lexeme's meaning by referring to the Tablets of the Law:
18 дв?дъ бо ? силома киво-
19 тъ съ б ?и??мь словомь принесе нъ
20 въ сво?Ммь ??боэa·с? поставити ?М-
21 го дом? · ты же не скинию? съ зако-
22 номь · нъ самого б ?а при??мъ ? крь-
23 ста CT_Desc, 14.1.
Table 5
CT_Desc | CT_Paral
тело 13 | члвка 14
иосифе 20 | одръ 17
ба (= бога) 21 | г(с)ь 18
гробе 21 | купель 18
бъ (= богъ) 22 | одра 18
кр(с)те 22 | бе (= боже) 18
руце 23 | ба (= бога) 20
страха 23 | болезни 20
х(с)а (= христа) 23 | бу (= богу) 20
х(с)е (= христе) 23 | бъ (= богъ) 20
адъ 24 | вода 20
гробъ 24 | купели 20
животъ 24 | недуга 20
земля 24 | одре 20
и(с)съ (= иисус) 24 | члвкъ 20
кр(с)та 24 | англъ 21
миръ 24 | блг(д)ти 21
мьртвьца 24 | бмь (= богомь) 21
ребра 24 | воду 21
словомь 24 | г(с)ди (= господи) 21
смрти 24 | горе 21
сна (= сына) 24 | земля 21
телесе 24 | крщения 21
адама 25 | народа 21
англа 25 | недугъ 21, недугы 21, слово 21, словомь 21
The form земля (Nom. Sg. and the homonymous bookish form Gen. Pl., different from its East Slavic correlate землѣ) is predictably frequent in both homilies. If the texts were lemmatized, this frequency would be even higher, since the sermon «On Descent from the Cross» also contains three Acc. Sg. forms землю, two Instr. Sg. forms землею, and one Loc. Sg. form земли, and the sermon «On Sunday of the Paralytic» contains two Acc. Sg. forms and one Loc. Sg. form. The different meanings of the word земля in both of Cyril of Turov's homilies convey the universal nature of the events of sacred history. The paired formula небо и земля `heaven and earth' used in the homilies is inseparable from the mythological and folklore poetic tradition. Cf.:
[Quotation lost to encoding damage] CT_Desc, 6.2;
[Quotation lost to encoding damage] CT_Desc, 8.1-2;
[Quotation lost to encoding damage] CT_Paral, 19.2-20.1
The word земля in the sense `terra firma as a part of the universe' appears in the following context:
[Quotation lost to encoding damage] CT_Desc, 8.1
The word земля in the general sense of human habitat is a natural continuation of the previous usage (cf. the entry земля in Slovar' drevnerusskogo jazyka III: 371-376):
[Quotation lost to encoding damage] CT_Desc, 10.1-2;
[Quotation lost to encoding damage] CT_Paral, 17.2
Figure 7 presents a classification where the anonymous sermons on the 5th Sunday after Easter and on Pentecost enter the subcluster of Cyril of Turov's homilies and join the subcluster, which includes «On Descent from the Cross» and «On Sunday of the Paralytic» sermons. See the contents and ranks of the matching noun forms in four texts: «On Descent from the Cross,» «On Sunday of the Paralytic,» the anonymous «On the 5th Sunday after Easter,» and the anonymous «On Pentecost» (Table 6).
The results of the comparison seem unexpected: they demonstrate an extremely narrow lexical and syntactic base of convergence. We observe even less convergence among adjective forms (cf. Table 7).
Verb forms (including participles) demonstrate a similar situation. The rare convergences here are almost entirely limited to individual forms of the existential verb быти, which acts as a copula and is associated with temporal and role deixis (cf. Table 8).
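The kind of comparison summarized in the tables of matching forms can be sketched as a lookup of shared tokens across rank lists. The function name and the threshold parameter are hypothetical. The ranks are those visible in the surviving part of Table 6, so the lists are incomplete by design.

```python
def shared_forms(rank_lists, min_texts=2):
    """Return text names and rows of (form, ranks), one rank (or None) per text.

    Only forms that occur in at least min_texts of the lists are kept.
    """
    names = list(rank_lists)
    forms = sorted(set().union(*rank_lists.values()))
    rows = []
    for form in forms:
        ranks = [rank_lists[name].get(form) for name in names]
        if sum(rank is not None for rank in ranks) >= min_texts:
            rows.append((form, ranks))
    return names, rows


# Ranks taken from the visible rows of Table 6 (abbreviated forms of Богъ).
rank_lists = {
    "CT_Desc": {"ба": 21, "бъ": 22},
    "CT_Paral": {"бе": 19, "ба": 20, "бу": 20, "бъ": 20},
    "An_East": {"бу": 11},
    "An_Pent": {"бе": 9, "бу": 10, "бъ": 9},
}
names, rows = shared_forms(rank_lists)
_, rows_in_three = shared_forms(rank_lists, min_texts=3)
```

Raising `min_texts` narrows the rows to forms attested in more of the texts, which makes the thinness of the shared base easy to see.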
Table 6
CT_Desc | CT_Paral | An_East | An_Pent
- | бе (= боже) 19 | - | бе 9
ба (= бога) 21 | ба 20 | - | -
- | бу (= богу) 20 | бу 11 | бу 10
бъ (= богъ) 22 | бъ 20 | - | бъ 9
...