
Spatial statistical analysis for linguistic data: Gender systems in Nakh-Daghestanian languages

Mapping linguistic features in space and time. Types of gender systems, establishing the number of genders. Spatial autocorrelation, inter-rater agreement. Gender systems in Nakh-Daghestanian languages. Methodology of typological research in the field.

Category	Foreign languages and linguistics
Type	diploma thesis
Language	English
Date added	01.09.2018
File size	5.3 M


Government of the Russian Federation

Federal State Autonomous Educational

Institution of Higher Education

National Research University

“Higher School of Economics”

Faculty of Humanities

Educational programme

“Fundamental and Computational Linguistics”

Spatial statistical analysis for linguistic data: Gender systems in Nakh-Daghestanian languages

Bachelor's thesis

Inga Konstantinovna Kartozia

4th-year bachelor's student, group БКЛ-141

Moscow 2018

Contents

Introduction

1. Literature review

1.1 Mapping linguistic features in space

1.2 Mapping linguistic features in space and time

1.3 Cognitive mapping

1.4 Mapping the relation between features in languages

1.4.1 Regression models

1.4.2 Spatial autocorrelation

2. Gender systems

2.1 Types of gender systems

2.2 Overt and covert gender

2.3 Establishing the number of genders

2.4 Gender systems in Nakh-Daghestanian languages

3. Data

3.1 Data collection

3.2 Inter-rater agreement

3.3 Data analysis

4. Methods

4.1 Spatial autocorrelation

4.2 Database of Gender systems

5. Discussion

6. Results

Conclusion

References

Appendix 1. Code

Appendix 2. Datasets

Introduction

Spatial analysis of linguistic data is a young but promising field. Linguistic data are mapped in many studies by typologists, dialectologists and sociolinguists. The idea of using maps in linguistics first appeared at the end of the 19th century. According to Bottiglioni (1954), linguistic geography originates from Gilliéron's atlas of French dialects, which was published between 1902 and 1910. Gilliéron's work was highly successful and led to “the considerable number of linguistic atlases that were published in Europe and elsewhere from 1910 and 1950” (Bottiglioni, 1954, p. 382). Bottiglioni (1954) says that linguistic geography has become an essential part of linguistics. One of the key works in linguistic geography is Dauzat (1922), who claims that the purpose of linguistic geography is to study language through the analysis of differences across a chosen area.

At present, maps are mostly used in dialectology and sometimes in sociolinguistics. However, in these fields they serve mainly as illustrative material. For instance, in Ellis (2016) dialectological characteristics of Southern American English during the Civil War are mapped for the sake of illustration. In perceptual dialectology, participants of an experiment are asked to map their perception of a language. A good example of an experiment in this field is Bounds (2015), where Polish-speaking participants mark the regions where, in their opinion, people speak differently.

However, some papers include spatial data in their analysis of linguistic features, because it might reveal interesting patterns in the distribution of features. Spatial statistics appeared in the 1950s and has been developing rapidly ever since. Getis (2008) describes the history of a particular concept in spatial statistics: spatial autocorrelation. In his paper, he mentions three statisticians who “laid out the mathematical characteristics of spatial autocorrelation, although they used the term contiguity ratio to describe their work”: Moran, Krishna Iyer and Geary. However, a major contribution to the development of spatial autocorrelation was made by Cliff and Ord with their works “Problems of Spatial Autocorrelation” (1969) and “Spatial Autocorrelation” (1973).

The most popular measures of spatial autocorrelation are Moran's I, Geary's C, and the Getis-Ord G, Gi and Gi* statistics. I am particularly interested in Moran's I and the Getis-Ord Gi* measure. The first was presented in Moran (1950) and is meant for calculating global spatial autocorrelation; it was explicitly formulated and generalized by Cliff and Ord (1973). The second was introduced in Ord and Getis (1995) and is used to detect local clusters.

Nowadays spatial autocorrelation is widely used in econometrics, biology, archaeology and some other fields. Linguistics has only recently started to apply spatial analysis in research, and the method is still not widespread.

Authors who chose the spatial autocorrelation method for their research mostly studied dialectological features of Indo-European languages (e.g. Tamminga, 2013; Grieve, Speelman, and Geeraerts, 2013), while other areas of language were neglected. It would be interesting to investigate whether this method is applicable to other language families and grammatical subsystems.

The present thesis aims to apply methods of spatial analysis to data from Nakh-Daghestanian dialects. I will test the hypothesis that the obtained clusters do not always coincide with the genealogical classification of the languages, which would be a sign of language contact in this area. The thesis will thereby show the prospects of applying the spatial autocorrelation method in linguistic research.

Science is currently suffering from a reproducibility crisis (Thorne et al., 2018), and linguistics is no exception. Results, once obtained, are rarely verified by other researchers. This issue is significant in typology, where the interpretation of a source often depends on the researcher. However, the Cross-Linguistic Linked Data project attempts to standardize the methodology of typological research and make research in the field reproducible.

In order to make my research reproducible and reliable, cross-validation was used: two people collected data for the present thesis independently, and then the inter-rater agreement was measured.

A dataset of all settlements in Daghestan, Chechnya and Ingushetia was analyzed. I started with a dataset of 2048 settlements. Then I examined the literature that contains dialectological descriptions of Nakh-Daghestanian languages. Only villages that were mentioned in the literature were included in the final dataset, which also includes coordinates. Then the description of the gender system of each language was added. If there were any dialectal differences with respect to gender systems, they were also included in the dataset. Finally, spatial autocorrelation was measured with Moran's I and Getis-Ord Gi* tests, which measure spatial autocorrelation globally and locally, respectively.

The present method might show traces of language contact between dialects if a relevant cluster that contains dialects of different languages is found. However, the distribution might have other causes that have to be explored further. Moreover, the thesis provides a dataset of 617 Nakh-Daghestanian settlements with dialect and gender system data (https://raw.githubusercontent.com/kartozia/spatial_analysis_of_NakhDaghestanian_languages/master/data/allvillages_v2.csv). It can be used by other linguists to conduct their own research.

1. Literature review

In recent years the number of papers on linguistic data mapping has grown due to technological progress. The majority of works in linguistic geography neglect spatial statistical analysis when observing spatial patterns (Ellis, 2016; Dinkin, 2013; Nerbonne, 2013). Only two (Tamminga, 2013; Grieve, Speelman, and Geeraerts, 2013) of the 27 articles from Linguistic Geography that I looked through used spatial autocorrelation to analyze their data. No standardized methodology of spatial analysis has been established so far. Various approaches to linguistic mapping are discussed in the subsections below.

1.1 Mapping linguistic features in space

Works described in this section use maps for illustrative rather than analytical purposes. For instance, Ellis (2016) investigates the distinctive features of Southern American English during the Civil War. Maps illustrate which features were typical of the South. They contain information on the distribution of dialect features; however, the observed patterns are not tested for spatial autocorrelation. This paper does give an insight into 19th-century American English and its regional varieties, but a larger number of letters is required for a better representation of the regional varieties.

Another paper is Mitchell, Lesho, and Walker (2017). The authors explore the folk perception of regional variation in African American English. They look at lexical (e.g. y'all, bro, son/sun), phonological (e.g. floor > flo, door > doo) and morphosyntactic (e.g. finna go to the store) features. This study can also be attributed to subsection 1.3, where works in perceptual dialectology are discussed, because the participants of the experiment were asked to write down any language differences that they themselves had noticed in different parts of the country. To analyze the participants' answers, the authors created a dissimilarity matrix and performed a hierarchical cluster analysis. The study showed little awareness of morphological features; however, more production and perception studies are required to fully document African American English. Moreover, this study shows that the approach of perceptual dialectology gives access to speakers' beliefs and their variation.

1.2 Mapping linguistic features in space and time

Another use of maps is to show distribution in both time and space. Most such studies are sociolinguistic. A good example is Dinkin (2013). This work explores the shift at the eastern boundary of the Northern Cities in the USA. It is a thorough sociolinguistic study that raises questions about the persistence of dialect boundaries.

For each location chosen for the research natives of the community were interviewed. The interviewers followed the Short Sociolinguistic Encounter protocol. Between 400 and 600 stressed vowel tokens were measured for each analyzed speaker. Labov's method of formant measurements was applied.

Buchstaller and Alvanides (2013) look for a correlation between socio-economic parameters, commuting to work and dialect distribution. In their opinion, spatial data were largely neglected in previous sociolinguistic works. Their research investigates socio-economic areal characteristics in connection with dialectal norms, which might be sensitive to human-geographical factors. With this method, they test the acceptability of dialectal morphosyntactic forms in North-West England.

The methods in both works are good for analyzing the relationship between social and linguistic variables, but not between linguistic and spatial information, such as longitude and latitude. These spatial relationships are often nonlinear, and as a result the sociolinguistic approach is not applicable to them. Nonlinear relationships can be analyzed with other statistical approaches (see 4.1 for details). The authors believe that geodemographic representativeness deserves more attention from dialectologists, because it can potentially enrich dialectological research.

1.3 Cognitive mapping

Maps are widespread in the field of perceptual dialectology. Participants are usually asked to draw maps based on their perception of dialects. These are later compared to dialect maps created by linguists. This method investigates language stereotypes among native speakers. A good illustration of research in this area is Bounds (2015), where participants mark the regions of Poland where people speak differently.

Montgomery and Stoeckle (2013) conduct similar research for British English. For visualization, they use GIS tools that merge all participants' maps. However, this approach is not suitable for the current thesis, because it lacks spatial statistical analysis. Montgomery and Stoeckle (2013) also discuss the importance of using GIS in linguistic research: it can be useful not only for analyzing hand-drawn maps in perceptual dialectology, but also in other fields of linguistics.

1.4 Mapping the relation between features in languages

1.4.1 Regression models

Regression models are often applied to analyze the relation between languoids according to a chosen set of features. For example, Heeringa and Hinskens (2015) explore the consequences of dialect change in Dutch. Their main hypothesis is that dialect change results from convergence to standard Dutch. Two measurements were made: sound changes that make a dialect converge to standard Dutch and sound changes that make a dialect diverge from the standard language. The hypothesis is supported if the first measurement is significantly higher than the second. The Levenshtein distance metric was used to measure dialect change. A paired-samples t-test on the data showed that convergence to standard Dutch is larger than both neutral changes and changes due to divergence. Maps (see below) in this paper illustrate the convergence and divergence between dialects measured on the basis of various sound changes (Heeringa and Hinskens, 2015). As a result, it was found that the average dialect change is around 13.3% and that more than 50% of the changes were due to convergence to standard Dutch.
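The Levenshtein distance mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used by Heeringa and Hinskens (2015): the unit cost for every operation, the normalization by the longer string, and the two example word forms are assumptions made here purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching symbol
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string
    (one possible normalization, assumed here for illustration)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical transcriptions of one word in two varieties.
print(levenshtein("melk", "molk"), normalized_distance("melk", "molk"))
```

Averaging such distances over a word list gives one number per pair of varieties, which can then be used as the dependent variable in a regression on geographic distance.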

Another study explores the influence of geographical factors on language variation and uses regression to analyze German dialects, with linguistic distances measured by Levenshtein distance (Nerbonne, 2013). According to the author, methods of quantitative dialectology might show new perspectives on how language variation and space interact. Furthermore, it is important to remember that distance should not be considered a physical influence on variation in a language. It is an illustration of possible social contacts that might have caused the variation.

One more work on the dialects of German is Leemann (2016). Swiss dialect data were collected with Dialäkt Äpp (http://www.dialaektaepp.ch), which gave the researchers high spatial resolution. A linear mixed-effects model was applied for data analysis. The results confirm that Swiss German dialects exhibit regional patterns. Moreover, apps are a good method of data collection, for audio data in particular, because they provide high-quality data with high spatial resolution.

Even though dialectometry takes spatial data into account, it usually neglects similar spatial patterns exhibited by regional variation.

1.4.2 Spatial autocorrelation

Only a few works use multivariate spatial analysis to study how linguistic features interact in space across a set of languoids. This type of analysis makes it possible to discover both individual and common spatial patterns of linguistic variation.

Spatial autocorrelation is usually calculated with two tests: Moran's I and Getis-Ord Gi* (see section 4.1 for an explanation). The first shows whether there is a tendency towards global clustering of a variable. The second is able to locate high- or low-value clusters. This analysis is mostly used in dialectology; however, it can in theory be applied in other fields of linguistics. In my research I will analyze the structure of Nakh-Daghestanian gender systems. So far I have not found any works that look at morphological or syntactic features in terms of spatial relations. Thus, it will be an interesting experiment to conduct.
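To make the idea of locating high- and low-value clusters more concrete, here is a minimal Python sketch of the local Gi* statistic in its standardized form, as I understand it from Ord and Getis (1995). The six sites, their values and the binary neighbor matrix are invented toy data, and the code is an illustration rather than the procedure used in this thesis, which relied on R.

```python
import math


def getis_ord_gi_star(values, weights):
    """Standardized Gi* score for every site.

    values  : one numeric observation per site
    weights : n x n matrix; weights[i][j] > 0 if j is a neighbor of i
    Large positive scores point to clusters of high values around a site,
    large negative scores to clusters of low values.
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation; zero if all values are equal,
    # in which case the statistic is undefined.
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        row = list(weights[i])
        row[i] = 1.0  # Gi* (unlike Gi) counts the focal site itself
        w_sum = sum(row)
        w_sq_sum = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Toy data: six sites on a line, each neighboring the adjacent sites.
toy_values = [5, 5, 5, 2, 2, 2]
toy_weights = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(6)]
               for i in range(6)]
for site, score in enumerate(getis_ord_gi_star(toy_values, toy_weights)):
    print(site, round(score, 3))
```

In the toy data the sites at the high-value end receive positive scores and the sites at the low-value end receive negative scores of the same size, which is the kind of local contrast the test is designed to reveal.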

Tamminga (2013) and Grieve, Speelman, and Geeraerts (2013) use spatial autocorrelation analysis in their research. Tamminga (2013) takes a quantitative perspective on the Dutch indefinite determiner and looks for spatial patterns in the data. The analysis uses the Moran's I and Getis-Ord Gi* tests mentioned above. For the masculine, evidence of significant global autocorrelation was found. For the feminine, however, spatial autocorrelation was negative.

Grieve, Speelman, and Geeraerts (2013) study vowels in American English, and multivariate spatial analysis helps them to show that atlas data do not reflect the current language situation. As in Tamminga (2013), global and local spatial autocorrelation are used. In addition, factor analysis is applied to the features in order to identify common patterns of regional variation. The atlas that was taken as the basis for the research had already shown regional patterns of variables in American English. However, multivariate spatial analysis demonstrated a slightly different distribution of regional varieties than the atlas.

As can be seen, most of the works deal with linguistic mapping rather than with the spatial analysis that I want to apply to the data in the present thesis.

2. Gender systems

Aikhenvald (2000) defines gender as an agreement category that at least partly correlates with semantic characteristics, particularly humanness and animacy. A language can have two or more genders. It is also important to mention the problem of terminology. Different grammatical traditions describe gender differently in different languages. In Indo-European languages the term gender is used, whereas for African and Caucasian languages the term noun class is preferred. Aikhenvald uses noun class as a cover term for this category, while Corbett prefers gender and gender systems. In the current thesis I will use the term gender system to describe the agreement category in Nakh-Daghestanian languages.

Gender systems may vary in the number of genders, types of assignment and semantic transparency. However, all of them contain a semantic core. Based on the level of semantic transparency, Corbett (1991) divides gender systems into three types.

2.1 Types of gender systems

Corbett (1991) identifies three types: semantic, morphological and phonological systems. In a semantic system, the assignment of gender to a noun depends on the semantics of the noun. These systems, in turn, are divided into strict and predominantly semantic systems.

In strict systems the gender of a noun is determined by its meaning, and the assigned gender always provides information about the noun's meaning. Corbett (1991) classifies Dravidian and several North-East Caucasian languages, namely Akhvakh, Avar, Bagvalal, Botlikh, Godoberi, Karat, and Tindi, as languages with strict semantic systems. These languages have three genders: masculine, feminine and neuter (or non-human, as in most Nakh-Daghestanian languages). Characteristics of gender systems in Nakh-Daghestanian languages are discussed in greater detail later in this section.

Predominantly semantic systems are systems in which assignment is based on a semantic criterion but which allow sets of exceptions. Such systems are found, for example, in Zande, Dyirbal, Ket, Ojibwa and some Nakh-Daghestanian languages (e.g. Archi, Bezhti, Rutul, Tsakhur, Khinalug, Kryz) (Corbett, 1991, p. 13-29).

For other types of gender systems the semantic criterion is less decisive. Assignment of gender in such systems depends on the form of a noun. Form, in turn, can be of two types: word structure and sound structure. Thus, formal systems can be divided into morphological and phonological systems.

Rules of gender assignment in a morphological system apply to more than one form of a noun. A good example of this system is Russian, where all nouns that belong to the third declension are feminine.

In phonological systems assignment rules refer to only one particular form of a noun. As an illustration, Corbett (1991, p. 51) provides a rule from the Qafar language: “nouns whose citation form ends in an accented vowel are feminine”.

2.2 Overt and covert gender

Gender can be overtly or covertly marked on the noun. However, the marking of gender is a continuum rather than a choice between two options. According to Plungian (2010), Nakh-Daghestanian languages have covert gender marking. In contrast to these languages, the author cites the example of Niger-Congo languages, where each noun is marked with a specific marker, i.e. a morpheme that assigns the noun to a certain gender. However, Plungian (2010) says that the most typical case is when a noun does not have any morphological gender marker, but it is still possible to predict its gender from its form. Therefore, there might be a correlation between the gender of a noun and the morphemes in the noun.

2.3 Establishing the number of genders

Defining the number of genders in a language can sometimes be very problematic. Corbett (1991) suggested two terms: controller and target genders. The number of controller genders is the number of combinations (singular and plural) of surface gender markers. The number of target genders, in turn, equals the number of unique gender markers. In many languages, there is a one-to-one relationship between the two types of gender. Nevertheless, there are also languages where the numbers of controller and target genders differ.
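The difference between the two counts can be shown with a small Python sketch. The noun labels and the agreement markers "A" and "B" are hypothetical, and counting target genders separately for the singular and the plural is my reading of Corbett's definition; the pattern resembles the three-controller, two-target situation that Corbett (1991) discusses for Romanian.

```python
def controller_genders(nouns):
    """Number of distinct (singular marker, plural marker) agreement patterns."""
    return len(set(nouns.values()))


def target_genders(nouns):
    """Number of distinct agreement markers, counted separately
    for the singular and for the plural."""
    singular = {sg for sg, _ in nouns.values()}
    plural = {pl for _, pl in nouns.values()}
    return len(singular), len(plural)


# Hypothetical nouns mapped to the markers they trigger (singular, plural).
toy_nouns = {
    "noun_1": ("A", "A"),
    "noun_2": ("B", "B"),
    "noun_3": ("A", "B"),  # patterns with noun_1 in the singular, noun_2 in the plural
    "noun_4": ("A", "A"),
}
print(controller_genders(toy_nouns), target_genders(toy_nouns))
```

Here three agreement patterns are built from only two markers in each number, so the controller and target counts diverge.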

Another issue in establishing the number of genders is the existence of inquorate genders. An inquorate gender is a gender with a very small number of members. Evidence of such genders was found in some Nakh-Daghestanian languages, e.g. Archi. When establishing the number of genders for my research, I counted the number of controller genders, while the second annotator, Polina Kasyanova (see section 3.2), counted the number of genders in the singular and plural separately. Even though we noted the existence of inquorate genders, we excluded them from the total number of genders for each language.

2.4 Gender systems in Nakh-Daghestanian languages

All Nakh-Daghestanian languages except Aghul, Lezgian and Udi have a gender system. The number of controller genders varies from two in Tabasaran to five (not counting inquorate genders) in Andi, Batsbi, Chamalal, Chechen, Ingush, Hunzib and Hinuq. Most of the languages have a variety of dialects. With respect to gender systems, dialects of Nakh-Daghestanian languages mostly differ from one another in the phonological realization of gender markers. If there is no information about the number of genders in a dialect, the dialect is assumed by default to have the same number of genders as the language. A map with the number of genders in each language can be found below:

Figure 1. Map of number of genders in Nakh-Daghestanian languages (https://kartozia.github.io/Gender-Systems-Database/ ). Created with lingtypology package in R (Moroz, 2017)

Figure 2. Map of Nakh-Daghestanian languages distribution (https://kartozia.github.io/Gender-Systems-Database/). Created with lingtypology package in R (Moroz, 2017)

Only a few dialects differ in the number of genders from the standard language (see Figure 3); if there is no standard language, then we refer to the most common variety. For instance, Andi has five genders (Salimov, 2010, p. 47-62). In Andi Rikvani there is a sixth gender that includes only harmful insects (Sulejmanov, 1957, p. 131-162). This gender can be defined as inquorate in Corbett's terms. Lower Andi has three genders (Tsertsvadze, 1965, p. 312). Therefore, a slow loss of the gender system can be observed. The Khunzan dialect of Avar has lost its gender system (Alekseev, 1997, p. 23), while standard Avar still maintains a three-gender system. Standard Bezhta has only four genders, while the Tliadal dialect of Bezhta has preserved five genders (Testelets and Khalilov, 1998, p. 52). In Mehweb, a dialect of Dargwa, a fourth gender has emerged: the feminine gender has split into mother and daughter genders (Sumbatova and Lander, 2014, p. 433-435).

Figure 3. Map with labeled dialects where the number of genders differs from the standard or most common variety of the language (https://kartozia.github.io/Gender-Systems-Database). Created with lingtypology package in R (Moroz, 2017)

As was mentioned earlier, in Caucasian languages genders are usually referred to as noun classes. Moreover, genders have numeric labels instead of semantically motivated names. Despite that, human is the central semantic category around which the systems are organized.

Masculine and feminine genders tend to appear in all Nakh-Daghestanian languages. They can also include mythical creatures that resemble humans. One exception is Tabasaran, which has preserved only two genders: one for humans and one for non-human objects.

The content of the residual genders is partially semantic and can vary between languages. If there are only three genders, then two of them will be feminine and masculine and one will contain all non-human nouns. If there are more than three genders, then besides the masculine and feminine genders there will be a gender that contains either animate objects only or most of the animate objects and some inanimate objects. All in all, some common patterns may be observed in the gender systems: male is often distinguished from female, human from non-human, animate from inanimate.

Below is the content of genders III-V in Hinuq, from Forker (2016, p. 2):

· Gender III contains, among others, nouns denoting:

o all animals, including mythical beings

o body parts, organs

o various foods and beverages

o places and buildings

o some clothes and objects related to clothing

o some plants

o heavenly bodies

o some expressions referring to language and language use

· Gender IV contains, among others, nouns denoting:

o body parts

o some plants and their parts

o some clothes

o expressions for openings

o some utensils

o names of paper and paper objects

· Gender V contains, among others, nouns denoting:

o some clothes and similar items

o some food items

o meteorological and astronomical phenomena

o utensils and tools (long sharp objects and containers)

o some names for buildings and their parts

o abstract nouns

It is also important to mention that in most Nakh-Daghestanian languages except Andi, Batsbi (Tsova-Tush), Lak, and Tsez, fewer genders are distinguished in the plural than in the singular. In Andi each gender has its own marker in the plural (Salimov, 2010, p. 47-62). The same pattern can also be seen in its dialects (Sulejmanov, 1957, p. 131-162; Alekseev, 1998, p. 228). The situation in Batsbi is slightly different: genders II and III share the same plural gender markers, while the other genders have different markers (Holisky and Gagua, 194, p. 158-159).

Lak has the same marker for the first three genders in the plural (Forker, 2018, p. 3). Most Lak dialects have the same contrast in their gender systems, except Lak Arakul (the same pattern as in Batsbi) and Lak Bartkhin (a separate marker for each gender) (Khaidakov, 1966, p. 20, 43-44). The masculine gender in Tsez has a separate marker, and all other genders are marked alike in the plural (Ibragimov, 1990, p. 54-63).

The most common distinction in the plural is still between human and non-human referents of the noun. Avar and Tabasaran have no gender distinction in the plural at all.

3. Data

3.1 Data collection

The aim of my research is to find spatial correlation in the gender feature across Nakh-Daghestanian dialects. First, for every language and its dialects, I looked up in the grammars the list of villages where it is spoken. Afterwards, dialect information was added to the dataset of Daghestanian villages. The set with coordinates, which had been crawled from Wikipedia, was kindly provided by S. Verhees. Villages that were not mentioned in the grammars were excluded from the dataset. The main difficulty was the villages in Chechnya and Ingushetia. The set contained over a hundred settlements in this area; however, the vast majority of them were not mentioned in any literature. Fortunately, Yu. Koryakov provided me with the census of the region, which helped to complete the dataset. Another problem with the dataset is that some places might be ghost settlements even though the census says the opposite. Moreover, it is hard to collect information about the language that is spoken in a particular settlement due to the lack of valid data. Most of the data were collected by linguists over years of fieldwork.

As a result, the dataset consists of 617 settlements. It contains the settlement name, coordinates, language name, its genetic affiliation, the dialect's glottocode (if any), dialect name, source and the source in BibTeX format.

The second dataset that was collected is dedicated to the gender systems. Its structure and nuances were described in the previous section.

3.2 Inter-rater agreement

In recent years scientists have become concerned about the reproducibility of research (Thorne et al., 2018). Linguistics also suffers from irreproducible research, especially in the field of typology. Typologists collect large samples of linguistic data from various sources. Different sources might interpret the same phenomenon differently. Moreover, linguists may also take different approaches to the interpretation of data. Another problem is that the collected data are rarely verified by an independent reviewer or annotator. Hence the results of research might be unreliable.

I wanted to make my research reproducible and reliable. The first step was to have two people collect the data independently. Thus, data about the structure of gender systems in Nakh-Daghestanian languages were collected by me and my consultant Polina Kasyanova. Each of us looked through the grammars independently. We collected information about the number of genders in each language, the markers of each gender and the semantics of each gender.

The next step was to compare the two datasets. To do this, I had to measure inter-rater agreement. I compared only the number of controller genders established in the singular in each dataset. As the measure, I used the intraclass correlation coefficient (ICC) (Shrout and Fleiss, 1979).

The intraclass correlation coefficient is commonly used when each subject is rated by the same raters. In our case, the subjects are a set of languages with gender systems. To calculate the level of agreement between the raters I used the icc function from the irr R package (Gamer, Lemon, and Puspendra Singh, 2012). I chose the two-way mixed model because each target in the research was rated by the same two judges (Polina Kasyanova and me), as was mentioned earlier.

There were 30 gender systems from 29 languages to rate in total. The resulting ICC equals 0.971 (p-value = 6.2e-17), with a 95% confidence interval between 0.936 and 0.986. Therefore, the agreement between the raters is high and statistically significant. These results make our collected data valid for further investigation.
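For readers who want to see what a two-way ICC measures, the following Python sketch computes the single-rater coefficient from the mean squares of a subjects-by-raters table, in the spirit of Shrout and Fleiss (1979). The ratings are hypothetical, and whether the original calculation used the consistency or the absolute-agreement definition is not stated above, so the sketch offers both as an assumption; it does not reproduce the value of 0.971.

```python
def icc_two_way(ratings, agreement=False):
    """Single-rater ICC for a two-way layout.

    ratings   : list of rows, one per subject, each with one score per rater
    agreement : False for the consistency definition,
                True for the absolute-agreement definition
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares for subjects (rows), raters (columns) and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
                   for i in range(n) for j in range(k))
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error
    if agreement:
        # Systematic differences between raters also count as disagreement.
        denominator += k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denominator


# Hypothetical numbers of genders assigned to five languages by two raters.
toy_ratings = [[3, 3], [4, 4], [5, 6], [2, 2], [4, 4]]
print(round(icc_two_way(toy_ratings), 3))
print(round(icc_two_way(toy_ratings, agreement=True), 3))
```

The two definitions diverge only when one rater is systematically higher or lower than the other, which is worth checking when the raters follow different counting conventions.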

The main difference between our datasets is that I have more dialectal data than Polina. However, this part of the data was not used for measuring inter-rater agreement, because there was nothing to compare it with. We established a different number of genders only for the following languages: Chechen, Ingush, Dargwa and Dargwa Mehweb.

For Chechen and Ingush, I established five genders, while Polina established six. We both used Nichols (1994, p. 21-22) for Chechen and Nichols (1994, p. 93-94) for Ingush. In both papers the masculine and feminine genders are merged into a human gender, which is a macrogender. Therefore, there are two possible interpretations: either to split the human gender into masculine and feminine, or not to split it and have a five-gender system.

In situation with Dargwa we have one source in common with Polina (Sumbatova and Lander, 2014, p. 433-435), however she established four genders and I established three. The fourth gender is apparently a special gender for mass nouns. Thus, we have also distinction in number of genders in Mehweb as well.

Hence, cross-validation of the collected data is crucial for any type of research, because any research should be reproducible and reliable.

3.3 Data analysis

For the analysis of the prepared data, I use spatial autocorrelation (see section 4.1), which helps to identify regional patterns. First, I apply Moran's I, mentioned earlier, for global spatial autocorrelation, and then the Getis-Ord Gi* test for local spatial autocorrelation. The latter might show traces of language contact in Nakh-Daghestanian languages.

The analysis was conducted in R. The spdep package (Bivand, Piras, 2015) contains a variety of functions for spatial analysis of data. First, the spatial weight matrix was created: latitude and longitude from the dataset were converted to a neighbors list with spatial weights by means of the nb2listw function. Neighbors were defined with the k-nearest neighbors algorithm, because it accounts for the uneven density of settlements. The k nearest points are chosen as the neighbors of each point; I chose k = 10, so each point in the dataset has ten neighbors. There is no established method for choosing the number of neighbors, so it is better to look at the data on a map and try several options. If k < 10, clusters are visible, but some irrelevant clusters may appear. If k > 10, some clusters may disappear due to smoothing: the higher k is, the smoother the map. Smoothing is good for illustration purposes, but it might eliminate significant clusters.
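The construction of k-nearest-neighbor weights can be sketched as follows. This is an illustrative Python version, not the spdep code used in the thesis; the helper names, the use of great-circle distance, the row-standardisation of the weights and the five coordinates are all assumptions.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def knn_weights(points, k):
    """Return an n x n row-standardised weight matrix.

    Each point's k nearest neighbours get weight 1/k; every other point,
    including the point itself, gets 0.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must be between 1 and n - 1")
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: haversine_km(points[i], points[j]))
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights


# Hypothetical (lat, lon) pairs, not taken from the settlement dataset.
points = [(42.0, 47.0), (42.1, 47.1), (42.2, 47.0), (43.0, 46.0), (43.1, 46.1)]
w = knn_weights(points, k=2)
```

Note that k-nearest-neighbor relations are not symmetric: an isolated settlement always receives k neighbors, but it need not be among the k nearest neighbors of any of them.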

I ran two Moran's tests. First, Moran's I was calculated with the moran.test function, which also performs a significance test. To verify the result, the moran.mc function was applied; it performs a permutation test for Moran's statistic, i.e. a Monte Carlo simulation. 1000 simulations were generated.
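The idea of the permutation test can be sketched in a few lines of Python: the observed values are repeatedly shuffled over the fixed locations, and the observed statistic is compared with the statistics of the shuffled data. This is a simplified illustration, not the moran.mc implementation; the function names, the one-sided pseudo p-value and the toy chain of eight regions are assumptions.

```python
import random


def morans_i(values, weights):
    """Moran's I for a list of values and an n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))


def permutation_test(values, weights, n_sim=1000, seed=0):
    """Return (observed I, one-sided pseudo p-value) from n_sim shuffles."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)   # break any link between value and location
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)


# Toy example: eight regions in a chain, each adjacent to its direct neighbours.
n = 8
chain = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
values = [1, 1, 1, 1, 5, 5, 5, 5]   # made-up, strongly clustered values
observed_i, p_value = permutation_test(values, chain)
```

A small pseudo p-value means that very few random rearrangements of the same values produce clustering as strong as the observed one.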

After the Moran's tests, I used the same spatial weights matrix to calculate Getis-Ord Gi*. The localG function from the spdep package returns a vector of z-scores. A high z-score is assigned to points that are part of a high-value cluster; points in a low-value cluster have a low z-score.

In order to see the detected clusters, all z-scores were mapped with the map.feature function from the lingtypology package (Moroz, 2017).

4. Methods

4.1 Spatial autocorrelation

Autocorrelation statistics are defined as “basic descriptive statistics for any data that are ordered in a sequence because they provide basic information about the ordering of the data that is not available from other descriptive statistics such as the mean and variance” (Odland, 1988, p. 9). Spatial autocorrelation in particular measures the degree to which features tend to cluster together or disperse in space. Spatial data contain important information not only about the location of a variable in space, but also about its value and about how the values are arranged in space.

In classical statistical models, observations are assumed a priori to be independent of each other. Spatial autocorrelation analysis makes it possible to test whether an observed feature is influenced by its neighbors. Spatial autocorrelation can be positive, negative or zero. If it is positive, neighboring points tend to have similar values. If it is negative, neighboring points tend to have dissimilar values. If it equals zero, the distribution of the feature in space is random.

Spatial autocorrelation analysis is used in various fields: archaeology, ecology, epidemiology, econometrics, geology, sociology and, recently, dialectology. Linguists who have used this analysis in their research were discussed in section 1.4.2.

In my thesis, I apply spatial autocorrelation analysis to explore how idiom clusters intersect with the phylogenetic affiliation of Nakh-Daghestanian languages. The hypothesis is that the resulting clusters are not fully defined by phylogenetic affiliation. Such clusters are the most interesting ones: the discrepancy could be caused by language contact or by other factors that need to be explored.

To test my hypothesis I use two tests: Moran's I and Getis-Ord Gi*. Moran's I (Odland, 1988) and Getis-Ord Gi* (Ord and Getis, 1995) measure spatial autocorrelation globally and locally, respectively.

Moran's I identifies whether a variable exhibits a significant level of spatial clustering, i.e. whether the spatial distribution of a feature is random. The statistic ranges approximately from -1 to 1: positive values indicate a tendency towards clustering of similar values, negative values indicate that neighboring values tend to be dissimilar, and values close to zero indicate a random distribution. The formula for calculating Moran's I is based on spatial autocovariance:

$$\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x}),$$ where

The double summation is a summation over all pairs of regions;

w_ij -- the spatial weight for the pair of regions i and j;

x_i, x_j -- data values of i and j;

$\bar{x}$ -- the mean for the entire sequence.

Spatial autocovariance measures the relation among nearby values of x, where "nearby" is specified by w_ij. The spatial weights of all pairs form a weight matrix that represents the location of features with respect to each other in space. It is common to assign 1 if two features are neighbors and 0 if they are not.

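Moran's I is basically a spatial autocovariance standardized by two terms: the variance of the data series ($\sum_{i}(x_i - \bar{x})^2 / N$) and the measure of connectivity for the set of regions ($\sum_{i}\sum_{j} w_{ij}$). The formula for calculating Moran's I is the following:

$$I = \frac{N}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2},$$ where

The double summation is the summation over all pairs of regions;

N -- the number of regions;

$\sum_{i}\sum_{j} w_{ij}$ -- a measure of connectivity;

$\sum_{i}(x_i - \bar{x})^2 / N$ -- variance of data series;

$\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})$ -- spatial autocovariance.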
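As a worked illustration of the definition above, the following Python sketch computes Moran's I for four regions arranged in a chain, with binary weights (1 for adjacent regions, 0 otherwise). The function name and the toy values are assumptions made for the example; the thesis itself relies on spdep.

```python
def morans_i(values, weights):
    """Moran's I: autocovariance standardised by variance and connectivity."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    connectivity = sum(sum(row) for row in weights)
    autocovariance = sum(weights[i][j] * dev[i] * dev[j]
                         for i in range(n) for j in range(n))
    sum_sq_dev = sum(d * d for d in dev)   # equals N times the variance
    return (n / connectivity) * (autocovariance / sum_sq_dev)


# Four regions in a chain: 0 - 1 - 2 - 3.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 3, 3], chain)     # similar values are adjacent
alternating = morans_i([1, 3, 1, 3], chain)   # every neighbour differs
```

For the clustered arrangement the statistic is 1/3, and for the alternating arrangement it is -1, which shows how the sign separates clustering of similar values from a checkerboard-like pattern.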

Before interpreting the results of Moran's I, the statistical significance must be determined. Significance can be determined by calculating z-score and p-value.
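As a brief supplement, the z-score mentioned here is usually obtained by comparing the observed I with its expected value under the null hypothesis of no spatial autocorrelation. The notation E[I] and Var[I] is introduced here for illustration; the variance term depends on whether a normality or a randomization assumption is adopted and is not spelled out.

$$E[I] = -\frac{1}{N-1}, \qquad z = \frac{I - E[I]}{\sqrt{\mathrm{Var}[I]}}$$

With several hundred observations the expected value is very close to zero, so an observed I far from zero yields a large z-score and a small p-value.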

The Getis-Ord Gi* statistic is also known as hot spot analysis. If the resulting z-score is negative, low values cluster spatially; if it is positive, high values cluster in space. This statistic is widely used by crime analysts, but it might also reveal local patterns in the current research. Getis-Ord Gi* is calculated with the following formula (Ord and Getis, 1995):

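$$G_i^* = \frac{\sum_{j=1}^{N} w_{ij}x_j - \bar{x}\sum_{j=1}^{N} w_{ij}}{S\sqrt{\dfrac{N\sum_{j=1}^{N} w_{ij}^2 - \left(\sum_{j=1}^{N} w_{ij}\right)^2}{N-1}}},$$ where

x_i, x_j -- data values of i and j;

w_ij -- the spatial weight for the pair of regions i and j;

N -- the number of features;

$\bar{x}$ -- the mean for the entire sequence;

$S = \sqrt{\sum_{j=1}^{N} x_j^2 / N - \bar{x}^2}$ -- the standard deviation of the data values.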

Since Gi* is a z-score, it is interpreted by its sign and magnitude. If a point and its neighbors have values well above the mean, Gi* is large and positive (a high-value cluster). If they have values well below the mean, Gi* is large and negative (a low-value cluster). If there is no local clustering around the point, Gi* is close to zero.
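The local statistic can be illustrated with a short Python sketch that follows the Ord and Getis (1995) formulation with binary weights. It is not the localG implementation from spdep; the function name, the toy values and the neighborhood definition (each point together with its adjacent points on a line) are assumptions.

```python
import math


def getis_ord_gi_star(values, weights):
    """Return one Gi* z-score per point.

    `weights` must include the focal point itself (weights[i][i] > 0),
    which is what distinguishes Gi* from Gi.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(x * x for x in values) / n - mean ** 2)
    if s == 0:
        raise ValueError("all values are identical; Gi* is undefined")
    scores = []
    for i in range(n):
        sum_w = sum(weights[i])
        sum_w_sq = sum(w * w for w in weights[i])
        numerator = sum(w * x for w, x in zip(weights[i], values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w_sq - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Six points on a line; a point's neighbourhood is itself plus adjacent points.
n = 6
weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]
values = [1, 1, 1, 5, 5, 5]   # made-up values: low on the left, high on the right
z = getis_ord_gi_star(values, weights)
```

The points at the low end of the line receive negative z-scores and those at the high end positive ones, while the two points at the boundary fall closer to zero.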

4.2 Database of Gender systems

After collecting information about the structure of gender systems in Nakh-Daghestanian languages, I created a freely accessible database (https://zenodo.org/record/1253012#.WwgZeC9eNPM). The database contains data that other linguists can use to conduct their own research. It can also be used as illustrative material, for example in classes.

The database is hosted on GitHub Pages (https://kartozia.github.io/Gender-Systems-Database/). The website for the database was created in RMarkdown (Allaire et al., 2018). The source code is also available on GitHub (https://github.com/kartozia/Gender-Systems-Database/blob/master/noun_classes.Rmd).

Descriptions of gender systems differ from grammar to grammar. For this reason, my advisor G. Moroz, M. Daniel, P. Kasyanova and I decided to set up annotation guidelines for the Nakh-Daghestanian gender system database. The database contains the following information:

· Language -- a name of a language or its dialect

· Gender -- the number of the gender in the language

· SG -- gender marker in singular

· PL -- gender marker in plural

· Source label -- how the gender is labelled in the source (e.g. Roman numerals); if the singular and plural labels differ, both are listed, separated by a dot

· Label -- custom labels:

· F for gender where all women and only women are included*

· M for gender where all men and only men are included*

· N for a third gender that includes everything except F and M (and is the only such gender)

· semantic -- semantic description of the gender from the source, additional information in note mode

· comments -- for any extra information one would like to mention, e.g. inquorate gender

· source -- information source in short and also BibTex format, e.g. [Ihilov 1967: 9-15] and ihilov1967

* Exceptions can be made for special human and human-like beings - old (wo)man, young child, devil, god etc.
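A table laid out according to these guidelines can be processed with a few lines of code. The sketch below is only an illustration: it assumes that the downloadable .csv file uses exactly the field names listed above as column headers, and the rows are placeholders rather than entries from the actual database.

```python
import csv
import io
from collections import defaultdict

# Column names taken from the guidelines; whether the real file uses
# exactly these headers is an assumption.
COLUMNS = ["Language", "Gender", "SG", "PL", "Source label",
           "Label", "semantic", "comments", "source"]

# Placeholder rows for illustration only; not real data from the database.
SAMPLE = """Language,Gender,SG,PL,Source label,Label,semantic,comments,source
Example A,1,x-,y-,I,M,placeholder,,[Placeholder 2000: 1]
Example A,2,z-,y-,II,F,placeholder,,[Placeholder 2000: 1]
Example A,3,q-,q-,III,N,placeholder,,[Placeholder 2000: 1]
Example B,1,x-,y-,I,,placeholder,,[Placeholder 2001: 2]
"""


def genders_per_language(csv_text):
    """Count the distinct genders recorded for each language."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    genders = defaultdict(set)
    for row in reader:
        genders[row["Language"]].add(row["Gender"])
    return {language: len(found) for language, found in genders.items()}


counts = genders_per_language(SAMPLE)
```

Because each row describes one gender of one language, the number of genders per language falls out of a simple grouping by the Language column.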

The online version of the database consists of three sections. The first section contains the description of the database structure. It is also possible to download the database in .csv format.

Figure 4. About section of database of Gender Systems. (https://kartozia.github.io/Gender-Systems-Database/ ). Created with lingtypology package in R (Moroz, 2017)

The second section provides a search over the table, whose columns are described above. The search can be conducted in any column.

Figure 5. Search across database of Gender Systems. (https://kartozia.github.io/Gender-Systems-Database/ ). Created with lingtypology package in R (Moroz, 2017)

The third section contains a reference list in APA format. References include all sources that were used to collect the data. Particular pages for each language can be found in the second section.

5. Discussion

By asking two linguists to collect data individually, we were able not only to obtain more reliable results, but also to see the discrepancies in the annotation. Such differences are crucial, because they lead to irreproducibility of results. Even though the ICC score for inter-rater agreement was high, Polina and I might have arrived at completely different outcomes had we worked separately. This is why cross-validation should become an essential part not only of typological research, but of any scientific research.

In the present research some interesting clusters were detected. However, it would be interesting to perform the same spatial analysis on other features of Nakh-Daghestanian languages, using the same grammars and the same dataset of settlements. Moreover, the relief of Daghestan might also be taken into account. Currently, evidence of some contact is observed; however, without an analysis of more language features, it is impossible to make any firm statements. With more typological research, and with maps of Daghestan in particular, the causes of the observed spatial patterns might be found.

6. Results

As expected, positive spatial autocorrelation was found in the distribution of gender systems in Nakh-Daghestanian languages. Moran's I test provides evidence for global spatial autocorrelation (I = 0.888, p-value = 0.0001). Therefore, the null hypothesis of Moran's test, which expects the distribution to be random, can be rejected. The results were verified by a permutation test: the simulations gave comparable results, and the statistic reported by the permutation test is also 0.888. This means that the results of Moran's I can be trusted and that they support the hypothesis of the current research. The positive global autocorrelation indicates that the number of genders in a language is related to its location in space.

The maps of the local spatial autocorrelation Gi* with different numbers of neighbors (5, 10, 50) can be found below:

Figure 6. Z-scores of Getis-Ord Gi*, if k=5. (https://kartozia.github.io/Gender-Systems-Database/ ). Created with lingtypology package in R (Moroz, 2017)

Figure 7. Z-scores of Getis-Ord Gi*, if k=10. (https://kartozia.github.io/Gender-Systems-Database/ ). Created with lingtypology package in R (Moroz, 2017)

Figure 8. Z-scores of Getis-Ord Gi*, if k=50. (https://kartozia.github.io/Gender-Systems-Database/ ). Created with lingtypology package in R (Moroz, 2017)

Different numbers of neighbors give different results, and there is no clear basis on which k should be chosen. The only reasonable way is to try various options and decide which is best for the research at hand. I chose 10 neighbors for each point, because this does not hide the interesting clusters and gives a high Moran's I (I = 0.888). With higher values of k (e.g. 50) Moran's I drops considerably; however, the map of Getis-Ord Gi* looks smoother and a steady change in the number of classes can be observed. When k = 50, the cluster in the South to which Budukh, Khinalug and Kryz belong merges with the low-value cluster of Aghul and Lezgian.
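The sensitivity of Moran's I to the choice of k can be reproduced on a toy example. The sketch below places twenty points on a line, gives the first half one value and the second half another, and computes Moran's I with k = 2 and k = 10. The data, the helper names and the use of binary weights are assumptions made for illustration and are unrelated to the actual settlement dataset.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and an n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))


def knn_line_weights(n, k):
    """Binary k-nearest-neighbour weights for points at positions 0..n-1.

    Ties in distance are broken by index order (Python's sort is stable).
    """
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: abs(i - j))[:k]
        for j in nearest:
            weights[i][j] = 1.0
    return weights


# Made-up data: ten points with value 3 followed by ten points with value 5.
values = [3] * 10 + [5] * 10
i_small_k = morans_i(values, knn_line_weights(20, 2))
i_large_k = morans_i(values, knn_line_weights(20, 10))
```

In this toy setting the statistic falls from 0.9 at k = 2 to 0.6 at k = 10, because larger neighborhoods reach across the boundary between the two groups and mix dissimilar values.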

The differences between Figures 6 and 7 are visible, but not large. In Figure 6, the above-mentioned cluster of Budukh, Khinalug and Kryz has a color similar to that of other four-gender languages. Various parameters might therefore bring different results. It would be interesting to see whether the same trends persist in a spatial analysis of other linguistic features of Nakh-Daghestanian languages.

The darkest dots on the map constitute the low-value cluster. The cluster includes Aghul and Lezgian, which have lost their gender systems. Yellow points belong to the high-value cluster. The majority of settlements where Ingush, Chechen, Andi or Chamalal are spoken form the cluster of languages with five-gender systems. Between these clusters, a continuum can be observed in which the number of genders grows from the South to the North. An exception is a small cluster of Budukh, Khinalug and Kryz on the territory of Azerbaijan. Unlike Aghul, Lezgian and Udi, these languages have preserved a gender system while being surrounded by Turkic languages. All of them have a four-gender system.

Discrepancies in the number of genders in some dialects of Nakh-Daghestanian languages might also be explained by the influence of neighbors. For example, Dargwa Mehweb has four genders, while standard Dargwa and its other dialects have only three (Sumbatova and Lander, 2014, p. 433-435). Mehweb is situated between Lak and Avar, which have four and three genders, respectively. Moreover, it is completely isolated from other Dargwa settlements. Contact with Lak might therefore have caused the split of the feminine gender, because girls and sisters belong to gender III in Lak and not to the feminine gender (Forker, 2018, p. 3).

According to (Magometov, 1965, p. 79-93), northern dialects of Tabasaran have preserved the gender system better than southern ones, although no relevant examples are given. This may be connected to the fact that in the South Tabasaran neighbors Lezgian, which has lost its gender system, while its northern neighbor is Dargwa, with a three-gender system.

Another interesting case is the Tliadal dialect of Bezhta. Unlike standard Bezhta, Tliadal has five genders instead of four (Testelets and Khalilov, 1998, p. 52). As can be seen on the interactive map, Bezhta Tliadal is surrounded by Hunzib settlements. Hunzib has five genders, and that might have helped to preserve the fifth gender in Tliadal.

Lower Andi is situated close to the settlements where Karata and Botlikh are spoken. Like these languages, Lower Andi has three genders. Thus, this dialect has a reduced gender system in comparison to Andi Gagatli and Rikvani.

Dagestan is known for a high degree of multilingualism, although the younger generation gives preference to major languages like Russian, Georgian and Azerbaijani (Grawunder, 2017). Moreover, it is also known that the inhabitants of highland villages, particularly males, speak the languages of lowland inhabitants, but not vice versa. It can therefore be supposed that highland languages might be significantly influenced by lowland languages. Gender systems in Nakh-Daghestanian languages could also have been influenced by languages of other families, e.g. Azerbaijani (Turkic), Kalmyk (Mongolic) and Georgian (Kartvelian).

In my opinion, spatial autocorrelation analysis is essential if geodata is available and there is a possibility of influence from neighbors. Instead of making assumptions about possible spatial patterns of feature distribution, it is possible to detect global autocorrelation and to find local low- and high-value clusters. This approach to spatial analysis should be widely used in linguistic geography and typology to provide more reliable results.

Besides detecting spatial patterns in the distribution of gender systems in Nakh-Daghestanian languages, this work provides a dataset of 617 settlements in Dagestan, Chechnya and Ingushetia that is available to everyone. For each settlement it contains the coordinates, the spoken language and dialect (if any), the glottocode for the dialect (if any), the number of genders and the source where the settlement is mentioned.

Moreover, the online database of gender systems in Nakh-Daghestanian languages has been created and hosted on GitHub. For each language it contains information about the structure of the gender system (gender markers in singular and plural, semantic components, source).

Conclusion

Idiom clusters that are not fully defined by phylogenetic affiliation were found. For instance, Chechen, Ingush, Chamalal and Andi are part of the same high-value cluster. However, Chechen and Ingush belong to the Nakh languages, whereas Chamalal and Andi belong to the Andic languages. Most languages of the latter group have three or four genders. These clusters serve as evidence of unexpected language clustering that arises due to language contact or other reasons. In further research it would be interesting to take the relief of the region into account in the spatial analysis, to look at more linguistic features and to see whether the same trends in clustering are maintained.

...
