Lancsbox software options for the prospective investigation of the multilingual corpus for European studies

Category: Foreign languages and linguistics
Type: article
Language: English
Date added: 13.11.2023
File size: 5.2 MB

O. Iu. Andrushenko

Kyiv National Linguistic University, Ukraine

Abstract

The paper presents a comparative analysis of the lexeme European in two varieties of English (British and American), based on the built-in corpora licensed with the LancsBox software (BE06 and AmE06 respectively), which comprise newspapers, fiction, etc. The investigation describes the procedure for conducting linguistic research as part of the project taught in the course “Multilingual Corpus and its Resources for European Studies (KNLU)” (Erasmus+ Program). The user-friendly LancsBox software, which runs on the major operating systems, has proved to be a powerful manager for compiling corpora and working with existing ones. It makes it possible to visualize textual data with the following tools of the software package: KWIC, GraphColl, Words, Ngrams, Wizard, etc., which are essential for the study of a specific linguistic unit. The statistical analysis of both corpora has revealed that the word European belongs to the lexemes that are seldom employed in the language. The comparison of the two varieties has shown that the word has similar top-ten collocates in both corpora; however, the GraphColl visualization has indicated major differences between the two corpora. Thus, N+N structures are more common and more diverse in the British English corpus than in the American English corpus. The t-test has shown a statistically significant difference between the corpora with regard to the linguistic variable European. These data may testify to cultural differences between the users of the two varieties, given that both corpora represent the same time frame.

Keywords: European, LancsBox, corpus studies, corpus tools, automated analysis.

Анотація

У статті представлено компаративний аналіз лексеми European у двох мовних варіантах англійської мови (британський та американський) на основі вбудованих ліцензованих корпусів програмного забезпечення LancsBox (AmE06 та BE06 відповідно), що репрезентовані газетними статтями, художньою літературою тощо. Описано алгоритм виконання лінгвістичного дослідження, що є частиною проекту “Мультилінгвальний корпус та його ресурси для дослідження Європеїстики” (КНЛУ) (програма Erasmus+). LancsBox - зручне програмне забезпечення, що працює з основними операційними системами та є ефективним менеджером для укладання й використання вже наявних корпусів. Це дає змогу візуалізувати текстуальні дані на основі наступного пакету програмного забезпечення: KWIC, GraphColl, Words, Ngrams, Wizard. Вони є основними для вивчення окремої лінгвістичної одиниці. Статистичний аналіз обох окреслених корпусів довів, що слово European належить до лексичних одиниць із низькою частотою використання в мові. Порівняння двох мовних варіантів показало, що слово використовується в майже однакових найпоширеніших 10 колокаціях, проте при імплементації інструмента візуалізації GraphColl зауважена основна відмінність між уживанням одиниці в корпусах. Так, у корпусі британської англійської мови найчастіше трапляються структури N+N, що більш динамічні порівняно з відповідними структурами в корпусі американської англійської мови. Окрім цього, T-тест статистично показав значну різницю між корпусами у функціонуванні лінгвістичної змінної European. Отримані дані можуть свідчити про культурну відмінність носіїв у двох мовних варіантах, зважаючи на те, що обидва корпуси представляють тексти, укладені в межах однакових часових рамок.

Ключові слова: European, LancsBox, корпусні дослідження, корпусний інструментарій, автоматичний аналіз.

Introduction

As part of the project “Multilingual Corpus and its Resources for European Studies Research (KNLU)”, the article presents the LancsBox tool (Brezina et al., 2020) and its options for the automated analysis of built-in and self-compiled corpora, which enable the investigation of a specific search term in the language selected. The course, aimed at PhD students of Kyiv National Linguistic University, is carried out within Jean Monnet Activities (Erasmus+ Program) and seeks to provide young researchers with practical instruments for conducting their linguistic research. The aim of the current paper is to study the lexeme European in two balanced corpora (American English and British English) built into the LancsBox software, each containing 500 texts (Baker, 2009; Potts & Baker, 2012), and to present the software's potential for the future automated analysis of the Multilingual corpus for European studies.

Literature review

Recent developments in corpus linguistics and the relevant technological progress have made it possible to develop specific software for analysing language. Such software appears to be more intuitive for scholars who are not experts in computer science (O'Keeffe & McCarthy, 2021). Access to corpora via online interfaces has “empowered a broader number of linguists to explore the data from a greater range of languages, which wasn't the case in the last decade, providing access to multi-million and multi-billion-word corpora of present-day and historical English” (Davies, 2019); moreover, such an interface can serve as a repository of over 500 corpora across 95 languages (Kilgarriff et al., 2014). Computerized corpora have proved to be excellent resources for a wide range of research tasks connected with learning the language (Andrushenko, 2021), since they facilitate the automated search of linguistic data and assist in analysing language phenomena on the basis of large collections of texts that represent various natural languages (Davies, 2019; Johansson, 2009; McEnery & Hardie, 2015; Rissanen, 2009). Modern linguistics has been continually enriched with new collective monographs (see: Lopez-Couso et al., 2016; Whitt, 2018), manuals (see: Collins, 2019; Lange & Leuckert, 2020; Stefanowitsch, 2020) and articles (Andrushenko, 2022; Anokhina, 2023; Lavidas & Haugh, 2020) that lay the fundamental theoretical and methodological ground for research and specify the possibilities of different software aimed at corpus investigation. While a corpus system simplifies the data search, it also requires knowledge of different approaches and methodologies of investigation. This presupposes competence in statistical verification, which helps to support or disprove the hypothesis made (Andrushenko, 2021).

Undoubtedly, artificial intelligence programs are powerful tools for the automated analysis of linguistic phenomena, among which LancsBox stands out as a new-generation software package for the study of languages. Initially developed at the University of Lancaster in 2015 (Brezina et al., 2015), it can work with existing, recently developed corpora or with the linguist's own data, and it assists in visualizing language facts, which presupposes their automatic part-of-speech annotation. The major features of the software are: 1) working with the user's data or existing corpora, which can be loaded in various formats (.pdf, .xml, .docx, .doc, etc.); 2) visualization of language facts; 3) analysis of data irrespective of the language; 4) automatic part-of-speech annotation of data; 5) compatibility with the main operating systems (Mac, Windows, Linux) (Brezina et al., 2018). The main asset of the software, according to its principal developers, lies in “automated research on word associations, identifying collocates based on traditional three criteria: distance (specifying the span around a node word, ‘collocation window’), frequency (an important indicator of typicality of word association) and exclusivity” (Brezina, 2018). Other criteria that should be taken into account are directionality (which involves measuring the strength of attraction between collocates), dispersion (the distribution of the node and the two adjacent words in text corpora) and type-token distribution among collocates (viz. the strength of the collocational relationship and the level of competition for the slots around the node word from other collocate types) (Gries, 2013). Additionally, the developers of LancsBox take into account the connectivity between individual collocates (Brezina et al., 2015). Apart from working with the user's data, LancsBox grants access to built-in corpora of approximately 1,000,000 tokens each. Such corpora are exemplified by American and British English text samples (AmE06, BE06, BNC1940-baby, etc.) (Brezina et al., 2020). Non-European languages are represented by the Lancaster Corpus of Mandarin Chinese (LCMC), etc. The full list of corpora available for download is given in Figure 1.

Fig. 1. The list of available built-in corpora in LancsBox

Methodology

The two corpora selected for the pilot investigation and licensed with the LancsBox software are AmE06 (American English) and BE06 (British English), which belong to the Brown Family of corpora (Baker, 2009). These are a “carefully balanced set of samples with approximately the same number of words (1,000,000+) for each genre coming from a single period of time” (Potts & Baker, 2012), i.e. the year 2006. This allows words to be compared within the same time frame across different varieties of English (in the current study, the usage of the lexeme European has been examined). Each sample from the different genres in the corpora amounts to over 2,000 words. The allotment of samples per genre is as follows: press editorials (27); press reportage (44); press reviews (17); skills, biographies and essays (75); trades and hobbies (36); religion (17); popular lore (48); science (academic prose) (80); miscellaneous (reports, official documents) (30); mystery and detective fiction (24); general fiction (29); western and adventure fiction (29); romantic fiction (29); science fiction (6); humor (9) (Lawrence, 2019).
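As a quick consistency check, the per-genre allotment can be summed to confirm that it yields the 500 texts mentioned for each corpus. The sketch below copies the counts from the list above; the dictionary and function names are illustrative, and science (academic prose) and miscellaneous (reports, official documents) are treated as two separate genres.

```python
# Number of samples per genre, as listed in the text.
GENRE_SAMPLES = {
    "press editorials": 27,
    "press reportage": 44,
    "press reviews": 17,
    "skills, biographies and essays": 75,
    "trades and hobbies": 36,
    "religion": 17,
    "popular lore": 48,
    "science (academic prose)": 80,
    "miscellaneous (reports, official documents)": 30,
    "mystery and detective fiction": 24,
    "general fiction": 29,
    "western and adventure fiction": 29,
    "romantic fiction": 29,
    "science fiction": 6,
    "humor": 9,
}


def total_samples(allotment):
    """Return the total number of text samples across all genres."""
    return sum(allotment.values())


print(total_samples(GENRE_SAMPLES))  # 500
```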

To simplify the data search and visualize the results obtained, the following tools from the LancsBox package have been used. KWIC provides co-textual information about the token under scrutiny: it generates a list of all instances of a search term in a corpus in the form of a concordance (Andrushenko, 2023), and double-clicking on the node opens a pop-up window with a larger stretch of text, which allows the word to be investigated in a broader context. Words, whose main function is to seek words belonging to the same word class, provides an in-depth analysis of the frequencies of types, lemmas and part-of-speech categories and allows corpora to be compared using the keywords technique. GraphColl provides data on the collocational patterning of the node; it can visualize right and left collocates simultaneously or separately, depending on the settings chosen for the collocation network graph, and takes into account three parameters: strength, frequency and position (Brezina & Porizka, 2021). The Ngrams tool enables the analysis of the frequencies of different n-gram types, lemmas and part-of-speech categories, and it also facilitates the comparison of corpora using the key n-gram technique (Brezina, 2018; Brezina et al., 2020).
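The idea behind a KWIC concordance can be illustrated with a minimal sketch. This is not LancsBox code: the tokenizer, the function names and the sample sentence are assumptions made purely for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, span=5):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


# Invented sample text, not taken from AmE06 or BE06.
sample = ("The European Union and the European Parliament agreed on a treaty. "
          "Many European countries signed it.")

for left, node, right in kwic(tokenize(sample), "european", span=3):
    print(f"{left:>20} | {node} | {right}")
```

Each output line places the node in the middle column with its left and right co-text on either side, which is the layout a concordance view relies on.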

Results and discussion

The statistical analysis bar shows that the word European occurs 175 times (1.76 per 10K) in 72 texts out of 500 in the BE06 corpus, while its frequency in AmE06 is considerably lower, viz. 96 occurrences (0.96 per 10K) in 53 out of 500 texts. This may be explained by cultural differences between speakers in terms of their interest in current events. The comparative data for both varieties are presented in Figure 2.

Fig. 2. Frequency of the search term European in BE06 and AmE06 (LancsBox)
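The normalised frequencies and the share of texts containing the word can be reproduced with a short calculation. The sketch assumes a corpus size of exactly 1,000,000 tokens, whereas the text only gives an approximate figure; the function names are illustrative.

```python
def per_10k(hits, corpus_tokens):
    """Relative frequency normalised to 10,000 tokens."""
    return hits / corpus_tokens * 10_000


def text_range_percent(texts_with_hit, total_texts):
    """Percentage of corpus texts in which the word occurs at least once."""
    return texts_with_hit / total_texts * 100


# Assumption: exactly one million tokens per corpus (the text says "approximately").
ASSUMED_TOKENS = 1_000_000

for corpus, hits, texts in [("BE06", 175, 72), ("AmE06", 96, 53)]:
    print(f"{corpus}: {per_10k(hits, ASSUMED_TOKENS):.2f} per 10K, "
          f"found in {text_range_percent(texts, 500):.1f}% of texts")
```

Under this assumption the BE06 value comes out at 1.75 rather than the reported 1.76, presumably because the actual token count is not exactly one million.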

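The GraphColl tool has made it possible to single out the collocates of European in AmE06 and BE06 using raw collocation frequency as the statistic (01 - Freq (5.0), L5-R5, C: 5.0 - NC: 5.0), i.e. a frequency threshold of 5, a span of five words to the left and right of the node, and a minimum collocate and collocation frequency of 5 (Brezina, 2018). Word associations of the top 10 collocates in both corpora are presented in Tables 1-2.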

Table 1. Collocates of the search term European in AmE06

ID   Position   Collocate   Stat (Freq)   Freq coll   Freq corpus
1    L          the         56            56          59942
2    L          of          34            34          30270
3    L          and         28            28          28797
4    L          in          25            25          19813
5    L          a           24            24          23381
6    L          to          20            20          25899
7    L          with        11            11          6961
8    L          for         10            10          8884
9    L          on          9             9           6866
10   L          that        8             8           11842

Table 2. Collocates of the search term European in BE06

ID   Position   Collocate   Stat (Freq)   Freq coll   Freq corpus
1    L          the         169           169         58919
2    L          of          76            76          30653
3    R          and         48            48          27911
4    L          to          44            44          26189
5    L          a           37            37          22758
6    R          in          33            33          19264
7    R          union       31            31          101
8    L          for         19            19          9252
9    R          on          17            17          7382
10   M          that        16            16          10231
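A frequency-based collocate search over an L5-R5 window can be sketched as follows. This is not the LancsBox implementation: the function name is illustrative, and the rule for the position label (L or R for the side on which the collocate occurs more often, M for a tie) is an assumption about how such a label could be derived.

```python
from collections import Counter


def collocates(tokens, node, span=5, min_freq=5):
    """Count words within `span` tokens of each occurrence of `node`.

    Returns (collocate, position, frequency) rows sorted by descending
    frequency. Position is "L", "R" or "M" (assumed rule: the side where
    the collocate is more frequent, "M" when both sides are equal).
    """
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - span):i])
        right.update(tokens[i + 1:i + 1 + span])

    rows = []
    for word in set(left) | set(right):
        l, r = left[word], right[word]
        if l + r >= min_freq:
            position = "L" if l > r else "R" if r > l else "M"
            rows.append((word, position, l + r))
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Invented toy data: the same three-word phrase repeated five times.
toy_tokens = ["the", "european", "union"] * 5
print(collocates(toy_tokens, "european", span=1, min_freq=5))
```

With a span of one word, the toy data yield the as a left collocate and union as a right collocate, each with a frequency of 5.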

The comparison of the search term in the two corpora has shown that the three most frequent collocates are the same: the word European is most often used with the article the, the preposition of and the conjunction and. However, the corpora differ slightly in the position of the conjunction, which predominantly occurs to the left of the node in AmE06 and to the right in BE06. Moreover, the lexeme European collocates rather frequently with the noun union in British English (31 co-occurrences out of 175 occurrences, i.e. 17.71%), whereas further investigation of American English (AmE06) has shown that union ranks only 15th in frequency, with just 7 co-occurrences (7.29%). The collocation networks for both varieties are given in Figures 3-4.

Fig. 3. Collocation network: European in BE06

Fig. 4. Collocation network: European in AmE06

The collocation networks in Figures 3 and 4 indicate that the most typical collocates in British English are N+N structures: European Parliament, European Countries, European Court, European Treaty, European Commission, European Union. This may suggest that European studies receive significant coverage in this variety, while in American English the only frequent N+N structure is the single collocation European Union.

The t-test (t (782.19) = -2.16, p = 0.031) has revealed a statistically significant difference between the corpora with regard to the linguistic variable European, while Cohen's d (-0.14, 95% CI [-0.26, -0.01]) indicates a very small effect. The result is visualised in the error bar plot for both corpora in Figure 5.

Fig. 5. Error bars plot for European in AmE06 and BE06
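The text does not name the t-test variant, but the fractional degrees of freedom (782.19) are consistent with Welch's unequal-variance test applied to per-text frequencies. Under that assumption, the test statistic, the Welch-Satterthwaite degrees of freedom and Cohen's d can be sketched as below; the function names and the toy samples are invented for illustration, and the p-value is omitted because it requires a t-distribution function outside the standard library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d based on the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Invented per-text relative frequencies, not the AmE06/BE06 data.
sample_a = [0, 0, 1, 1]
sample_b = [1, 1, 2, 2]

t, df = welch_t(sample_a, sample_b)
print(f"t({df:.2f}) = {t:.2f}, d = {cohens_d(sample_a, sample_b):.2f}")
```

For two groups of equal size n, d equals t multiplied by the square root of 2/n. Assuming each of the 500 texts per corpus is one observation, -2.16 × √(2/500) ≈ -0.14, which agrees with the reported effect size.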

LancsBox also makes it possible to trace the frequency of a lexical unit in the corpus. Thus, the Words tool visualizes the most frequent words in the selected corpus, as illustrated in Figure 6 for AmE06.

Fig. 6. The most frequent words in AmE06

The study of the word European in AmE06 has indicated that it belongs to low-frequency vocabulary, with a dispersion value of 4.147510 (Figure 7). The same is true for the BE06 corpus (Figure 8), where the collocation the European has a slightly higher dispersion of 5.239762.

Fig. 7. The frequency of the lexical unit European in AmE06

Fig. 8. The frequency of the lexical unit European in BE06
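The text does not state which dispersion measure underlies these values. Assuming it is the coefficient of variation (CV) of the word's relative frequency across the texts of a corpus, the calculation can be sketched as follows; the function name and the toy frequencies are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev


def coefficient_of_variation(rel_freqs):
    """Population standard deviation of per-text frequencies divided by their mean."""
    m = mean(rel_freqs)
    if m == 0:
        raise ValueError("the word does not occur in any text")
    return pstdev(rel_freqs) / m


# Invented per-text relative frequencies for a corpus of four texts.
evenly_spread = [1, 1, 1, 1]
single_text = [0, 0, 0, 4]

print(coefficient_of_variation(evenly_spread))
print(coefficient_of_variation(single_text), sqrt(len(single_text) - 1))
```

Under this reading, a value of 0 means the word is spread perfectly evenly, while a word confined to a single text out of n reaches the maximum of √(n - 1); higher values therefore point to a more clustered distribution.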

Concluding remarks

The automated analysis of the word European has shown that, owing to the LancsBox software, a lexical unit can be analysed in parallel in two different corpora: BE06 and AmE06. The software tools not only visualize the most frequent collocates of the word according to their left, middle or right position but also make it possible to find the most regular collocation patterns for each variety. As the investigation has indicated, the N+N structure is frequently found in BE06, whereas this tendency is not characteristic of AmE06. Moreover, LancsBox provides the opportunity to trace the frequency of word usage in both corpora, representing and visualizing the peculiarities of both varieties that are related to cultural differences. This software can be applied to the analysis of users' own corpora, which will subsequently make up the Multilingual corpus for European studies.

REFERENCES

1. Andrushenko, O. (2021). Information-structural transformations of additive adverb EVEN (a case study of the English language written records and corpora of the XII-XVII c.). Messenger of Kyiv National Linguistic University. Series Philology. Volume 24, No. 1, pp. 16-32. DOI: 10.32589/2311-0821.24%20(1).2021.236109.

2. Andrushenko, O. (2022). The Scope of just: evidence from information-structure annotation in diachronic English Corpora. In N. Sharonova, V. Lytvyn, et al. (Eds.), Proceedings of the 6th international conference on computational linguistics and intelligent systems (COLINS2022), Vol. I: Main Conference, Gliwice, Poland, May 12-13, 2022 (pp. 677-696). Available online: https://ceur-ws.org/Vol-3171/paper51.pdf

3. Andrushenko, O. (2023). Particularizing focus markers in Old English: just the case of adverb polysemy? Lege Artis: Language yesterday, today, tomorrow. (Accepted for publication, date of publication: December 2023).

4. Anokhina, T. (2023). Newspaper subcorpus (subcorpus of the modern european media) in the structure of the multilingual corpus. Philological Treatises. Volume 15, No. 1, pp. 7-15. DOI: 10.21272/Ftrk.2023.15(1)-1.

5. Baker, P. (2009). The BE06 Corpus of British English and recent language change. International Journal of Corpus Linguistics, 14 (3), 312-337. DOI: 10.1075/ijcl.14.3.02.bak.

6. Baker, P. (2010). Corpus methods in linguistics. In L. Litosseliti (Ed.), Research methods in linguistics (pp. 93-113). London, New York: Continuum.

7. Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge: Cambridge University Press.

8. Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20 (2), 139-173.

9. Brezina, V., Porizka, P. (2021). Kolokacni grafy a site s pouzitim nastroje #LancsBox: aplikace v anglictine a cestine. Casopis pro moderni filologii, 103, C. 1, 36-59. DOI: 10.14712/23366591.2021.1.

10. Brezina, V., Timperley, M., & McEnery, T. (2018). #LancsBox 4.x [software]. Available online: http://corpora.lancs.ac.uk/lancsbox.

11. Brezina, V., Weill-Tessier, P., & McEnery, T. (2020). #LancsBox 5.x and 6.x [software]. Available online: http://corpora.lancs.ac.uk/lancsbox.

12. Collins, L. (2019). Corpus linguistics for online communication: A guide for research. New York: Routledge.

13. Davies, M. (2019). The best of both worlds: Multi-billion word “dynamic” corpora. In P. Banski et al. (Eds.), Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019 (pp. 23-28). Mannheim: Leibniz-Institut für Deutsche Sprache. DOI: 10.14618/ids.pub.8998.

14. Gries, S. (2013). 50-something years of work on collocations: What is or should be next... International Journal of Corpus Linguistics, 18 (1), 137-166. DOI: 10.1075/ijcl.18.1.09gri.

15. Johansson, S. (2009). Some aspects of the development of corpus linguistics in the 1970s and 1980s. In A. Ludeling & M. Kyto (Eds.), Corpus linguistics: An international handbook (pp. 33-53). Berlin: De Gruyter.

16. Kilgarriff, A., Baisa, V., Busta, J., Jakubicek, M., Kovar, V., Michelfeit, J., Rychly, P., & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1 (1), 7-36. DOI: 10.1007/s40607-014-0009-9.

17. Lange, C., & Leuckert, S. (2020). Corpus linguistics for World Englishes: A guide for research. New York: Routledge.

18. Lavidas, N. & Haugh, D.T.T. (2020). Postclassical Greek and treebanks for a diachronic analysis. In D. Rafiyenko & I. Serzant (Eds.), Postclassical Greek: contemporary approaches to philology and linguistics (pp. 163-202). Berlin: Walter de Gruyter.

19. Lawrence, S. (2019). A rite of the edge: The language of baptism and christening in the church of England. London: SCM Press.

20. Lopez-Couso, M. J., Mendez-Naya, A., Nunez-Pertejo, B. P., & Palacios-Martinez, I. M. (2016). Corpus linguistics on the move. Exploring and understanding English through corpora. Leiden, Boston: Brill Rodopi.

21. McEnery, T., & Hardie, A. (2015). Corpus Linguistics. New York: Routledge.

22. O'Keeffe, A., & McCarthy, M. (2021). The Routledge Handbook of Corpus Linguistics. New York: Routledge.

23. Potts, A., & Baker, P. (2012). Does semantic tagging identify cultural change in British and American English? International Journal of Corpus Linguistics, 17 (3), 295-324.

24. Rissanen, M. (2009). Corpus linguistics and historical linguistics. In A. Ludeling & M. Kyto (Eds.), Corpus linguistics: An international handbook (pp. 53-68). Berlin: De Gruyter.

25. Stefanowitsch, A. (2020). Corpus linguistics: a guide to the methodology. Berlin: Language Science.

26. Whitt, R. (2018). Using diachronic corpora to understand the connection between genre and language change. In R. Whitt (Ed.), Diachronic corpora, genre, and language change (pp. 1-18). Amsterdam, Philadelphia: John Benjamins Publ.
