Preconditions for the emergence of computer lexicography

Stages of the development of computer technologies for compilation of dictionaries. Determination of the prerequisites that led to the emergence of such a direction in linguistics as computer lexicography. The tasks of the Language and Information Fund.

Sikaliuk Anzhela Ivanivna, Ph.D., Associate Professor, Associate Professor of Foreign Philology Department, Lytvyn Svitlana Volodymyrivna, Ph.D., Associate Professor, Head of Foreign Philology Department, Shevchenko Yulia Viktorivna, Senior Lecturer of Foreign Philology Department, Chernihiv Polytechnic National University


Computer lexicography is one of the important directions of modern domestic linguistics and translation studies. Researchers today face important questions concerning the theoretical and practical aspects of compiling computer dictionaries, and the scientific significance of these questions is beyond doubt.

A necessary stage in answering these questions is to understand how this branch of linguistics took shape: its preconditions, methodological base, and directions of scientific research. The article highlights some aspects of the historical development of domestic and foreign lexicography.

The task of the article is to consider the main stages of the development of computer technologies for compilation of dictionaries and to determine the prerequisites that led to the emergence of such a direction in linguistics as computer lexicography.

The advent of computers had an active influence on the development of lexicography. Initially, computers were used to prepare paper dictionaries; in other words, they served as typewriters. Later it turned out that computers could also edit and store any lexicographic information, and so computer corpora of texts appeared, followed by machine-readable dictionaries.

Thus, at the initial stages of development, computers in lexicography were used as typewriters that could replace and delete letters, words, and even whole parts of texts. With the emergence of operating systems such as Windows, the computer gave the lexicographer tools not only for editing but also for formatting and for creating the camera-ready layout of a dictionary. The advent of flatbed scanners and recognition programs greatly accelerated the typing and editing of text. At the same time, software was created for searching and indexing texts. This led to the emergence of linguistic databases, electronic libraries and card indexes. Automated lexicographic databases in the form of electronic dictionaries are now an integral part of systems for machine translation, information retrieval, and text editing and correction, as well as for processing large text arrays and storing them, which is a separate task in creating electronic libraries. Computer dictionaries on optical media enabled translators and researchers to quickly find any information about a word (translation, interpretation, etc.).

Keywords: computer lexicography, machine translation, electronic dictionary, computer translation, lexicographic information, computer text corpus.


Problem statement

Computer lexicography is one of the important directions of modern domestic linguistics. Today, researchers face important questions concerning the theoretical and practical aspects of compiling computer dictionaries, the scientific significance of which is beyond doubt. A necessary stage in answering these questions is to understand how this branch of linguistics took shape: its preconditions, methodological base, and directions of scientific research. It is also important to consider some aspects of the historical development of domestic and foreign lexicography.

Analysis of recent research and publications

Research in Ukrainian computer lexicography is a promising direction of scientific inquiry. M. Komova and I. Kochan rightly point out that the study of national terminology is an important direction of scientific activity that provides substantiated information about the actual state of functioning of the state language, and that the computerization of terminology unfolds in line with the global processes of informatization of all spheres of social life. Among the main problems of the modern theory of terminology, V. Dubichinsky singles out the need to use computerization in the creation of terminological dictionaries. R. Mysak proposed a classification of electronic and computer dictionaries by information carrier and by basic technical and operational characteristics, and analyzed approaches to compiling dictionaries, from a paper version to an electronic one and vice versa. On the basis of the theory of semantic states, V. Shyrokov substantiated and developed the conceptual and system-technical principles of building multilingual dictionary systems and virtual systems of professional interaction in linguistics. As V. Chumak and R. Tymoschuk note, information technologies make it possible to create linguistic, in particular lexicographic, databases and knowledge bases for use in research and technological modes.

The research objective of the article is to consider the main stages of the development of computer technologies for compilation of dictionaries and to determine the preconditions that led to the emergence of such a direction in linguistics as computer lexicography.

Presentation of the main material

The advent of computers had an active influence on the development of lexicography. Initially, computers were used to prepare paper dictionaries; in other words, they served as typewriters. Later it turned out that computers could also edit and store any lexicographic information, and so computer corpora of texts appeared, followed by machine-readable dictionaries [1].

One of the first developments in this field was the Brown University Standard Corpus of Present-Day American English, or simply the Brown Corpus, a computer corpus of modern American English compiled in the 1960s at Brown University by Henry Kučera and W. Nelson Francis.

The Brown Corpus was a carefully compiled sample of American English of about 1 million words, selected from many sources. H. Kučera and N. Francis ran various computer analyses on this material, producing a rich and diverse body of work that drew on linguistics, psychology, statistics, and sociology. The corpus has been widely used in computational linguistics and for many years was one of the most popular resources in the field.

Subsequently, several attempts were made to create larger corpora. In Great Britain, such projects included the Bank of English and the British National Corpus (BNC). Alongside this, computer lexicography also developed within natural language text processing. Research in this field was conducted by the Laboratory of Automatic Document Processing and Linguistics (Laboratoire d'Automatique Documentaire et Linguistique, LADL), which was created in 1966 at the University of Marne-la-Vallée (France) and grew into the center of the European RELEX network [2].

The LADL research program is aimed at developing fundamental tools for processing natural language texts. Its toolkit consists of:

- linguistic components: electronic dictionaries and grammars, mainly for English, French, Spanish and Korean;

- software implementing algorithms that apply the dictionaries and grammars to text corpora in order to identify meaningful passages of text and reduce them to deep forms; the main applications are automatic indexing of texts, full-text information retrieval, and translation support.

The electronic dictionaries developed by LADL include the DELAF inflected-form dictionaries, which exist for several languages. Such a dictionary contains about 600,000 word forms for simple words and about 150,000 word forms for compound words (DELACF). The entries in these dictionaries are represented as finite-state automata [3], which makes powerful indexing of texts possible. A simple word is represented as a finite-state automaton that carries morphological and syntactic information.
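To make the idea of a dictionary stored as a finite-state automaton more concrete, here is a minimal sketch in Python. It is not the LADL implementation: the `State` class, the function names, the toy entries and the tag strings are all assumptions made for illustration. Each letter of a word form is a transition, and an accepting state carries the lemma and grammatical tags, so one pass over a token both recognizes it and returns the information needed to index a text by lemma.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """One automaton state: outgoing transitions plus analyses if accepting."""
    transitions: dict = field(default_factory=dict)
    analyses: list = field(default_factory=list)  # non-empty => accepting state


def add_form(start, form, lemma, tags):
    """Add one inflected form, creating states along its letter path."""
    state = start
    for letter in form:
        state = state.transitions.setdefault(letter, State())
    state.analyses.append((lemma, tags))


def lookup(start, form):
    """Run the automaton over `form`; return its analyses or an empty list."""
    state = start
    for letter in form:
        state = state.transitions.get(letter)
        if state is None:
            return []
    return state.analyses


def index_text(start, text):
    """Map each lemma to the token positions where one of its forms occurs."""
    index = {}
    for position, token in enumerate(text.lower().split()):
        for lemma, _tags in lookup(start, token.strip(".,;:!?")):
            index.setdefault(lemma, []).append(position)
    return index


# Hypothetical toy entries; the tag strings are invented for illustration.
automaton = State()
add_form(automaton, "dictionary", "dictionary", "N:s")
add_form(automaton, "dictionaries", "dictionary", "N:p")
add_form(automaton, "compiled", "compile", "V:K")
add_form(automaton, "compiles", "compile", "V:P3s")

print(lookup(automaton, "dictionaries"))
print(index_text(automaton, "She compiles dictionaries. One dictionary was compiled."))
```

Because forms that share a prefix share states, lookup time depends on the length of the word rather than on the size of the dictionary, which is one reason such a representation suits the indexing of large text arrays.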

A separate area of computer lexicography was the creation of machine dictionaries for information retrieval systems (such as GAT, used since 1966 by the US Atomic Energy Commission and Euratom) and for machine translation systems (for example, the SYSTRAN system, which was created in the 1970s for the General Motors Corporation to speed up the translation of materials).

The first computer dictionaries, which were machine versions of paper dictionaries, appeared in the late 1970s and early 1980s. They served researchers as convenient material for lexicographic research.

In the UK, machine versions were produced of traditional English dictionaries such as the Oxford Advanced Learner's Dictionary (OALD), the Longman Dictionary of Contemporary English (LDOCE) and the Collins Cobuild English Dictionary (COBUILD).

The Oxford Advanced Learner's Dictionary (OALD) became available in machine-readable form in the late 1970s. The computer played no role in the lexicographic preparation of the dictionary: in essence, the finished text was transferred to punched cards, which made it the first machine-readable dictionary on punched cards.

In the early 1980s, the machine-readable Longman Dictionary of Contemporary English (LDOCE) appeared. During its preparation, the authors used computer tools to check the sequence of word definitions. The LDOCE dictionary was the first machine-readable dictionary created using a computer.

The COBUILD dictionary was the first machine-readable dictionary developed using a computer. The development of the dictionary consisted of four stages: data collection, selection of dictionary entries, comparison of definitions for dictionary entries, and ordering of dictionary entries [4]. The computer was also used to check the consistency and completeness of the dictionary entries.

The first computer-based dictionary was the 15-volume "Dictionary of the French Language" (8,000 dictionary entries). But this was only a part of an automatic index of words with examples, covering a text corpus of 90 million word usages. In other words, already at the initial stage of electronic dictionary development, lexicographers set themselves tasks broader than the routine transfer of a dictionary onto magnetic media.

Almost simultaneously with the automation of these tasks, computer lexicography moved on to problems of a different order. There was a real prospect of using the results of lexical analysis, which in many ways exceeded the possibilities of "manual" lexicography (that is, the work of individual researchers analyzing large linguistic arrays), to create dictionaries and standard software for the personal computers of millions of users [5].

In turn, the emergence of machine-readable dictionaries made it possible to publish them on CD-ROM, that is, on laser discs. The text of the first edition of the Oxford English Dictionary became available on CD-ROM in 1988. Subsequently, three electronic versions of the second edition appeared. The first version (1992) was identical in content to its paper counterpart, and the disc itself was not copy-protected. The second version (1999) had some additions, improved software, and more convenient search facilities, but its copy protection was flawed. The third version (2002) contained more words and more advanced software, although it retained the copy protection flaws of its predecessor. The online version of the Oxford English Dictionary became available on March 14, 2000.

As computer technologies improved, the capabilities and functions of electronic dictionaries expanded. They served not only as a means of preserving linguistic information and as components of language processing and machine translation systems, but also supported language learning and fast search.

Dictionaries acquired the ability to selectively display the information contained in a dictionary entry, to display several entries at the same time, and so on.

In our opinion, domestic computer lexicography dates back to the mid-1980s. Researchers from the department of software and electrical engineering of Lviv Polytechnic created a support system for multilingual terminological dictionaries called "SLOVO", in which the technological problems of preparing dictionaries for printing were worked out. In this way, a computer version of the English-Ukrainian-English dictionary of information technology terms was developed. Its volume was about 9,000 words.

The development of computer lexicography in Ukraine after independence was shaped by the country's integration into the global information community and by the state's language policy, in particular in the computer and information field, aimed at creating a national dictionary base [6]. This is evidenced by the resolution of the Cabinet of Ministers of Ukraine of September 8, 1997 "On the approval of Comprehensive measures for the all-round development and functioning of the Ukrainian language", the decree of the President of Ukraine "On the development of the national dictionary base" of 1999, the order of the Cabinet of Ministers "On the priority tasks of creating a national dictionary base" of November 22, 2000, and the resolution of the Verkhovna Rada of Ukraine "On the functioning of the Ukrainian language in Ukraine" of May 22, 2003.

In connection with these processes, there is a growing need for automated lexicographic systems: computer dictionaries, thesauri, and natural language text processing programs that could provide automated and machine editing, information search, recognition and correction of texts, compilation of new dictionaries, accumulation and maintenance of materials for electronic libraries, etc.

The phrase "national dictionary base" first appeared officially in the text of the Decree of the President of Ukraine of August 7, 1999, No. 967, "On the development of the national dictionary base". Although the document did not define this concept, it indicated certain key ideas associated with it. Namely, the term was associated with expanding the scope of functioning of the Ukrainian language and with creating a new generation of academic Ukrainian dictionaries and their electronic counterparts for computer information systems (the "Dictionaries of Ukraine" project).

The implementation of this project was entrusted to the National Academy of Sciences of Ukraine. In our opinion, the Ukrainian Language and Information Fund, founded in 1991, was the first to conduct active research and development aimed at forming a national dictionary base for the Ukrainian language. The basic models of computer lexicography were created here; they formed the basis for the development of the relevant technologies, and a series of new-generation lexicographic works, "Dictionaries of Ukraine", was launched.

Initially, this organization dealt with issues of lexicography, as well as the compilation of specialized dictionaries that were part of the "Dictionaries of Ukraine" series.

During the preparation of the publications, much attention was paid to automating the compilation process, for which a number of fundamental studies in computational linguistics were conducted. Thus, for the very first orthographic dictionary, a formal theory classifying the inflection types of the Ukrainian language was developed, together with a unified algorithmic base that made it possible to generate all word forms automatically from the base form. Initially, 280 paradigmatic classes were defined; more in-depth research has since made it possible to derive about 1,500 of them and thus to cover the entire vocabulary of literary Ukrainian.
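The principle of generating word forms from a base form and a paradigmatic class can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the Fund's algorithm: the class identifier `noun_f_a_hard`, the case labels, the `PARADIGM_CLASSES` and `LEXICON` tables and the function `generate_forms` are hypothetical, and the single class shown covers only the singular of feminine nouns such as "мова" and "школа".

```python
# Hypothetical paradigm table: class id -> (ending to strip, endings to attach).
# The class name and the case labels are illustrative, not the Fund's notation.
CASES = ("nom", "gen", "dat", "acc", "ins", "loc", "voc")

PARADIGM_CLASSES = {
    "noun_f_a_hard": ("а", ("а", "и", "і", "у", "ою", "і", "о")),
}

# Each base form in the register is assigned to exactly one paradigmatic class.
LEXICON = {"мова": "noun_f_a_hard", "школа": "noun_f_a_hard"}


def generate_forms(base_form):
    """Return {case: word form} for a base form listed in LEXICON."""
    strip, endings = PARADIGM_CLASSES[LEXICON[base_form]]
    if not base_form.endswith(strip):
        raise ValueError(f"{base_form!r} does not fit its paradigm class")
    stem = base_form[: len(base_form) - len(strip)]
    return {case: stem + ending for case, ending in zip(CASES, endings)}


print(generate_forms("мова"))
```

With this design, adding a word to the dictionary means assigning it to an existing class, so the number of classes (280 at first, about 1,500 later) rather than the number of words determines how much inflectional knowledge has to be described by hand.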

Among the tasks of the Language and Information Fund, the project to create the fundamental multi-volume academic lexicographic system "Dictionary of the Ukrainian Language" stands out for its scale. In 2001, the first full-scale Ukrainian dictionary was released on a laser disc as the integrated lexicographic system "Dictionaries of Ukraine", which offers a unique set of dictionary functions: for a register of 152,000 units it shows the full inflectional paradigm and the transcription according to the rules of Ukrainian orthography and orthoepy; the system also provides more than 56,000 phraseological units, about 9,000 synonym series, and more than 2,100 antonym pairs [7].

According to its linguistic and informational parameters, the "Dictionaries of Ukraine" system has no analogues in the world and is indispensable in computer administration, teaching the Ukrainian language, editorial and publishing activities and conducting linguistic research.

In order to popularize the achievements of domestic linguistics, a number of presentations of issues of the "Dictionaries of Ukraine" series were held, which received high praise and were widely covered by mass media. More than 20,000 copies of the publications of this series were given free of charge to educational and scientific institutions, libraries, ministries and agencies of Ukraine.

During 1998-2003, the Fund developed the system-technical principles for creating and maintaining linguistic corpora, on the basis of which computer lexicographic and text arrays of the Ukrainian language of national importance were formed. Among them, the following should be noted:

- the linguistic corpus of Ukrainian texts, intended for the ongoing collection, preservation and use of fiction, scientific and popular science, socio-political and journalistic literature (including translations);

- a fundamental electronic lexical index of more than 30 million word usages;

- lexicographic databases of more than 20 dictionaries (explanatory, orthographic, orthoepic, phraseological, synonym, antonym, grammatical, etc.);

- a system of natural language indexing of Ukrainian texts and databases;

- an automated system for converting dictionary texts into computer lexicographic databases;

- a network tool complex to support modern digital technology for the creation of fundamental lexicographic works.
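As an illustration of what an electronic lexical index of word usages records, the following Python sketch builds a small index that maps every word to the places where it occurs. It is a simplified sketch rather than a description of the Fund's system: the names `WORD`, `build_index` and `usage_count`, the text identifiers and the sample sentences are assumptions introduced here.

```python
import re
from collections import defaultdict

# A word is a run of letters, optionally joined by an apostrophe or hyphen,
# so that forms such as "комп'ютер" stay in one piece.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def build_index(texts):
    """Map each lowercased word to a list of (text id, token position) pairs."""
    index = defaultdict(list)
    for text_id, text in texts.items():
        for position, match in enumerate(WORD.finditer(text)):
            index[match.group().lower()].append((text_id, position))
    return index


def usage_count(index):
    """Total number of word usages (running words) recorded in the index."""
    return sum(len(postings) for postings in index.values())


# Hypothetical sample texts, used only to exercise the functions.
texts = {
    "t1": "Мова живе у слові.",
    "t2": "Слово до слова, і складається мова.",
}
index = build_index(texts)
print(index["мова"])
print(usage_count(index))
```

The sketch indexes surface word forms only; grouping the forms "слово", "слова" and "слові" under one headword would additionally require lemmatization.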

An equally important role in the development of computer lexicography was played by the Department of Computer Lexicography of the O.O. Potebnia Institute of the Ukrainian Language, which was founded in accordance with the resolution of the Cabinet of Ministers of Ukraine of September 8, 1997. Large-scale projects initiated by the department include:

- the national corpus of the Ukrainian language, with a planned initial volume of 2.5 million words: a systematized, structured, software-processed collection of model texts of the Ukrainian language in all variants and forms of its existence, designed for linguistic research and technological applications;

- an electronic lexical card index: a set of lexical cards with headwords, texts illustrating the use of these words in the relevant meaning, and an indication of the source of each illustrative text. It is similar to a traditional collection of lexical cards, but it is stored on electronic media, contains additional information fields, and a register is formed from it automatically by a lemmatization algorithm.

The Computer Linguistics Laboratory of the Institute of Philology of the Kyiv National University has developed the following lexicographic products:

- the RUTA auto-corrector, which can check spelling (that is, automatically find and correct errors in words), perform grammar, punctuation and stylistic checks, insert hyphens during text formatting, and consult a dictionary of synonyms.

The informational basis of all these dictionary systems was formed by record formats carefully worked out by philologists, which contained grammatical, morphological, phonetic, semantic, phraseological, etymological and other linguistic information about lexical units, systematized in a certain way. Such dictionary systems laid the necessary foundation and played a practical role in the creation and release of new generations of national dictionaries, in the formation of standard lexicographic arrays of national languages, and in the computer lexicographic and lexicological research carried out on this basis [8].
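A record format of this kind, and the way a register of headwords can be formed from a set of records, might be pictured roughly as in the Python sketch below. The field set of `LexicalCard`, the function `build_register` and the sample cards with their sources are hypothetical and are not taken from any of the systems described above; the sketch also assumes that each card already stores the lemma as its headword, so no lemmatization step is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalCard:
    """One card: a headword, its quoted usage, and where the quote comes from."""
    headword: str        # lemma under which the card is filed
    part_of_speech: str  # grammatical information
    quotation: str       # illustrative text
    source: str          # reference to the quoted text


def build_register(cards):
    """Group cards by headword and return the headwords in sorted order.

    Sorting uses plain code-point order; proper Ukrainian alphabetical
    order would need a locale-aware sort key.
    """
    register = {}
    for card in cards:
        register.setdefault(card.headword, []).append(card)
    return dict(sorted(register.items()))


# Hypothetical sample cards; the quotations and sources are invented.
cards = [
    LexicalCard("словник", "noun", "Словник лежить на столі.", "hypothetical source A"),
    LexicalCard("мова", "noun", "Мова живе у слові.", "hypothetical source B"),
    LexicalCard("мова", "noun", "Рідна мова звучить усюди.", "hypothetical source C"),
]
register = build_register(cards)
print(list(register))
```

Keeping each kind of linguistic information in its own named field is what allows a dictionary program to display only selected parts of an entry or to search by a single parameter.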


Conclusions

Thus, at the initial stages of development, computers in lexicography were used as typewriters that could replace and erase letters, words, and even whole parts of texts. With the emergence of operating systems such as Windows, the computer gave the lexicographer tools not only for editing but also for formatting and for creating the camera-ready layout of a dictionary. The advent of flatbed scanners and recognition programs greatly accelerated the typing and editing of text. At the same time, software was created for searching and indexing texts. This led to the emergence of linguistic databases, electronic libraries and card indexes.

Automated lexicographic databases in the form of electronic dictionaries are now an integral part of systems for machine translation, information retrieval, and text editing and correction, as well as for processing large text arrays and storing them, which is a separate task in creating electronic libraries. Computer dictionaries on optical media enabled translators and researchers to quickly find any information about a word (translation, interpretation, etc.).

References

1. Shchipitsina L.Yu. (2018). Kompjuterno-oposeredkovana komunikatsiya [Computer-mediated communication]. Zaporizhya: ZNU. 296 p. [in Ukrainian].

2. Asmus N.G. (2017). Lingvistychni osoblyvosti virtualnoho komunikatyvnoho prostoru [Linguistic features of the virtual communicative space]. Kyiv: TsUL. 266 p. [in Ukrainian].

3. Goroshko O.I. (2017). Psyholingvistyka [Psycholinguistics]. Zaporizhya: ZNU. 356 p. [in Ukrainian].

4. Crystal D. (2018). The scope of Internet linguistics. London, New York: Taylor & Francis Group. 241 p. [in English].

5. Herring S.C. (2016). A faceted classification scheme for computer mediated discourse. London: Taylor. 297 p. [in English].

6. Perebyjnis V.I. (2017). Leksykografichne zabezpechennya navchalnogo protsesu z inozemnoji movy [Lexicographic support of the foreign language learning process]. Kyiv: Apostrof. pp. 12-43 [in Ukrainian].

7. Perebyjnis V.I. (2019). Teoriya i praktyka ukladannya navchalnyh slovnykiv [Theory and practice of compiling educational dictionaries]. Zhytomyr: Akademiya. pp. 73-98 [in Ukrainian].

8. Perebyjnis V.I., Rukina E.P., Hidekel S.S. (2015). Anglo-ukrajinskyi navchalnyi slovnyk-minimum [English-Ukrainian educational dictionary-minimum]. Kyiv: Naukova dumka. 432 p. [in Ukrainian].

