Stratification of lexical content of the frequency dictionary "Acoustics"

Lexical stratification analysis of verbs functioning in the text corpus of one of the branches of technical discourse. Stratification of vocabulary of the dictionary "Acoustics", lexical layers - commonly used, general scientific and terminological.

National university «Odessa Polytechnic»

Stratification of lexical content of the frequency dictionary «Acoustics»

Galina Diachenko,

candidate of philological sciences, PhD, associate professor at the department of foreign languages

Natalia Koval,

candidate of philological sciences, PhD, head of the department of foreign languages



The article presents the results of a lexical stratification analysis of verbs functioning in the text corpus of one of the areas of scientific and technical discourse, «Acoustics». The text corpus underlying the frequency dictionary was compiled from scientific articles of the relevant specialty taken from the journals of Great Britain and the USA: IEEE International Conference on Acoustics, Speech, and Signal Processing; The Journal of the Acoustical Society of America; Acoustics Letters; Journal of the Audio Engineering Society; Acustica. The size of the text corpus is 200 thousand tokens. In the process of stratifying the vocabulary of the frequency dictionary «Acoustics», three lexical layers were identified: common, general scientific and terminological. The units of the common and terminological layers were identified without the use of additional methods. To determine the general scientific lexemes, two methods were used: a comparative analysis of several probabilistic-statistical models of various fields of technology, and a statistical method of rank correlation of words in the frequency dictionary «Acoustics» and a general literary frequency dictionary. The entire list of verbal lexemes includes 465 units. The common layer includes 204 words, the general scientific layer 171 verbs, and the terminological layer only 49 words. The corresponding numbers of tokens in these layers are 24324, 7003 and 3360. The quantitative data obtained confirm the results of lexical descriptions reported by other researchers who used statistical calculations to study the lexical features of units of scientific and engineering text corpora.
An examination of the location of the lexemes of these three layers in the frequency dictionary shows that common lexemes are concentrated mainly in the high-frequency zone of the model, lexemes of the general scientific layer have lower frequency values and lie lower in the frequency list than common units, and the verbs of the terminological layer are spread over almost the entire area of the model (frequency dictionary).

Key words: probabilistic-statistical model, lexeme, lexical layer, rank correlation, expert assessment.


Galina Diachenko,

candidate of philological sciences, associate professor, associate professor at the department of foreign languages, National University «Odessa Polytechnic» (Odesa, Ukraine)

Natalia Koval, candidate of philological sciences, associate professor, head of the department of foreign languages, National University «Odessa Polytechnic» (Odesa, Ukraine)

Stratification of lexical content of the frequency dictionary «Acoustics»

The article presents the data of a lexical stratification analysis of verbs functioning in the text corpus of one of the branches of scientific and technical discourse, «Acoustics». The text corpus underlying the frequency dictionary consisted of scientific articles of the relevant specialty taken from the journals of Great Britain and the USA: IEEE International Conference on Acoustics, Speech, and Signal Processing; The Journal of the Acoustical Society of America; Acoustics Letters; Journal of the Audio Engineering Society; Acustica. The size of the text corpus was 200 thousand tokens. In the process of stratifying the vocabulary of the frequency dictionary «Acoustics», three lexical layers were distinguished: common, general scientific and terminological. The units of the common and terminological layers were identified without the use of additional methods. To determine the general scientific lexemes, the following methods were used: a comparative analysis of several probabilistic-statistical models of various fields of technology, and a statistical method of rank correlation of words in the frequency dictionary «Acoustics» and a general literary frequency dictionary. The entire list of verbal lexemes numbers 465 units. The common layer numbers 204 words, the general scientific vocabulary 171 verbs, and the terminological layer only 49 words. The corresponding numbers of tokens in each lexical layer are 24324, 7003 and 3360. The quantitative data obtained confirm the results of lexical descriptions presented by other researchers in the process of statistical calculations when considering the lexical features of units of corpora of scientific and technical texts. Examining the location of the lexemes of these three layers in the frequency dictionary, one can notice that common lexemes are concentrated mainly in the high-frequency zone of the model, lexemes of the general scientific layer have lower frequency values and lie lower in the frequency list than common units, and, finally, the verbs of the terminological layer are located over almost the entire area of the model (frequency dictionary).

Key words: probabilistic-statistical model, lexeme, lexical layer, rank correlation, expert assessment.

Main part

Statement of problem. Analysis of research.

Corpus linguistics is currently emerging and developing as a promising direction within theoretical linguistics. In addition to the creation of corpora of national languages (including Ukrainian) (Лесная, 2012), the compilation of parallel corpora, etc., corpus linguistics also deals with discourse studies, i.e. the analysis of the features of texts at all levels. A comparative analysis of the statistical, grammatical, lexical and semantic characteristics of such texts makes it possible to draw general conclusions about the nature of the content of text corpora and to foresee their further development.

It should be noted that the scientific and technical type of discourse has been, and continues to be, studied in detail. The components of scientific texts have been considered at a wide variety of levels: syntactic (Трофимова та ін., 2014), morphological (Неврева, 1984; Неврева та ін., 2014), paradigmatic (voice-and-tense aspects of verbs) (Tsapenko et al., 2015), etc.

Works devoted to the description of the statistical characteristics of scientific text corpora and their units are of particular importance. In recent years, considerable results have been achieved in this area, as evidenced by the emergence of a number of branch frequency dictionaries (Томасевич, 1983; Дьяченко, 1985; Неврева, 1984; Шапа, 1991). The formation of probabilistic-statistical models (frequency dictionaries) makes it possible to form at least a preliminary idea of how various fragments of scientific and technical texts function. Therefore, the compilation of new, previously unanalyzed models helps to confirm hypotheses about how different units function in the texts of scientific discourse.

Thus we can argue that, both in theoretical aspects and in practical research, the scientific type of discourse has already been given enough attention (Аннотированная справка, 2012; Антошинцева, 2011; Захаров, 2005; Глушко, 1974; Tsinova, 2014). However, the results of the analysis of verb units and their lexical stratification have not yet been considered and described in the linguistic research literature. A certain novelty of the proposed work lies in the object of study itself: a probabilistic-statistical model (frequency dictionary) compiled for the first time on the basis of the text corpus «Acoustics».

These considerations determine the goal of the paper: to present the data of the lexical stratification analysis of verbs functioning in the text corpus of one of the fields of scientific and technical discourse, «Acoustics».

The basic material. The text corpus underlying our frequency dictionary was compiled from scientific articles of the relevant specialty, taken from the journals of Great Britain and the USA: IEEE International Conference on Acoustics, Speech, and Signal Processing; The Journal of the Acoustical Society of America; Acoustics Letters; Journal of the Audio Engineering Society; Acustica. The size of the text corpus is 200 thousand tokens.

The main methods used in the work were the method of expert assessment, a statistical method of rank correlation, methods for comparing text corpora, the quantitative method, etc.

Although all researchers admit that the lexis of the language of science is heterogeneous (Береснев, 1961), there is no consensus among linguists on how to classify the vocabulary of scientific and technical texts, and especially on the number of stratification layers into which the entire probabilistic-statistical model should be divided. Before the idea of forming text corpora emerged and developed, theoretical linguists singled out only two stratification layers: common lexemes and terminology. However, the availability of a number of corpora relating to different areas of scientific and technical discourse makes it possible to assert that, in addition to the two layers already mentioned, one more can be distinguished: the general scientific layer (Андреев, 1967; Береснев, 1961).

The units of the common and terminological layers of vocabulary can be identified with little risk of error. The lexemes of the common layer are those used in everyday language, and their meanings differ noticeably from those of specialized units. Terms, in turn, usually belong to the system of scientific concepts of a particular field of knowledge (in our case, Acoustics) and can be readily determined by expert assessment, i.e. a survey of specialists. The units of the general scientific layer are much harder to distinguish, since the degree of their terminologization has to be substantiated.

The procedure for determining the lexemes of the general scientific layer is as follows. First, since these lexemes are usually common to many sublanguages of engineering, one can compare their usage in other fields of knowledge, preferably fields unrelated in their scientific concepts, and single out the shared units (Пиотровский та ін., 1970: 213). Accordingly, the list of lexemes of the general scientific layer of the field «Acoustics» was compared with the corresponding frequency lists (dictionaries) extracted from the text corpora of «Automotive» (Томасевич, 1983), «Chemical engineering» (Неврева, 1984) and «Electrical engineering» (Шапа, 1991). Secondly, the general scientific layer of vocabulary can be formed in a formal way using statistical methods, i.e. by comparing the ranked lists of verb units extracted from the above frequency dictionaries and from the general literary frequency dictionary by Thorndike and Lorge (Thorndike E., et al., 1998). In this case, the degree of terminologization of a verb in the «Acoustics» texts relative to, for example, the «Automotive» texts and the Thorndike and Lorge dictionary is measured by the difference in the ranks of that verb in the text corpora under comparison. The rank correlation was calculated using the formula $r_s = 1 - \frac{6 \sum d^2}{N(N^2 - 1)}$, where $d$ is the difference between the ranks of a verb in the two lists and $N$ is the number of verbs compared.
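The rank-correlation step can be illustrated with a minimal sketch. It assumes that the two lists share the same verbs and contain no tied frequencies, which is the case the formula above covers. The «Acoustics» frequencies are taken from the lists in this article; the frequencies of the second list are invented purely for illustration and do not come from the Thorndike and Lorge dictionary. The function names are hypothetical.

```python
def ranks(freqs):
    """Map each verb to its rank (1 = most frequent)."""
    ordered = sorted(freqs, key=lambda verb: (-freqs[verb], verb))
    return {verb: i for i, verb in enumerate(ordered, start=1)}


def spearman_rs(freq_a, freq_b):
    """Return (r_s, d) for the verbs present in both frequency lists.

    r_s = 1 - 6 * sum(d^2) / (N * (N^2 - 1)); valid when there are no ties.
    d maps each verb to the difference of its ranks in the two lists.
    """
    shared = set(freq_a) & set(freq_b)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared verbs")
    rank_a = ranks({v: freq_a[v] for v in shared})
    rank_b = ranks({v: freq_b[v] for v in shared})
    d = {v: rank_a[v] - rank_b[v] for v in shared}
    sum_d2 = sum(diff * diff for diff in d.values())
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), d


# Frequencies from the «Acoustics» lists in this article.
acoustics = {"be": 9956, "use": 905, "measure": 227, "damp": 114, "radiate": 98}
# Invented frequencies standing in for a general literary dictionary.
general = {"be": 5000, "use": 800, "damp": 60, "measure": 20, "radiate": 5}

rs, d = spearman_rs(acoustics, general)
print(round(rs, 3), d["measure"], d["damp"])
```

A verb with a large absolute rank difference between a branch list and the general literary list is a candidate for a higher degree of terminologization; real frequency lists contain many ties, for which a tie-corrected coefficient would be needed.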

The statistical picture of the location of units of the stratification layers throughout the area of the probabilistic-statistical model (frequency dictionary, frequency list) of the specialty «Acoustics» is as follows: common lexemes are concentrated mainly in the high-frequency zone of the model, lexemes of the general scientific layer have lower frequency values and lie lower in the frequency list than common units, and, finally, units of the terminological layer are located over almost the entire area of the model (frequency dictionary). This can be seen from the data presented in the relevant lists below.

First of all, note that the entire list of verbal lexemes includes 465 units. The common layer includes 204 words. As an example, we give the most frequently used verbs of this lexical layer in descending order of their frequencies: be (F=9956), can (F=1115), have (F=1040), use (F=905), show (F=620), give (F=536), receive (F=405), obtain (F=365), will (F=345), may (F=292), make (F=261), follow (F=254), require (F=223), would (F=199), do (F=192), consider (F=191), find (F=186), see (F=183), correspond (F=178), study (F=159), describe (F=152), represent (F=152), must (F=151), present (F=138), support (F=137), note (F=131), take (F=131), become (F=130), express (F=116), know (F=102), shall (F=99), develop (F=95), write (F=90), apply (F=88), desire (F=87), include (F=85), result (F=84), achieve (F=82), say (F=73), associate (F=71), appear (F=70), observe (F=70), occur (F=70), should (F=69), match (F=69), form (F=68), choose (F=67), employ (F=67), satisfy (F=67), illustrate (F=65), vary (F=64), need (F=62), depend (F=59), lot (F=58), discuss (F=56), display (F=55), perform (F=54), consist (F=53), predict (F=53), allow (F=52), utilize (F=52), resolve (F=49), place (F=47), cause (F=46), concern (F=46), report (F=46), exist (F=44), improve (F=44), refer (F=44), remain (F=44), demonstrate (F=43), avoid (F=42), expect (F=42), involve (F=42), arise (F=41), generalize (F=39), introduce (F=39), arrive (F=36), differ (F=36), position (F=35), examine (F=34), lie (F=34), accomplish (F=33), effect (F=32), change (F=32), state (F=32), approach (F=31), regard (F=31), reach (F=30), tune (F=17).

The group of verbs belonging to the general scientific layer is next in terms of the number of units: 171 verbs. General scientific verbs occupy an intermediate position between common and terminological verbal lexemes. A study of the lexical meanings of the units of this layer showed that a significant part of them are verbs that have passed from the common layer and acquired a different lexical status in the text corpus «Acoustics». General scientific verbs form the basis of a scientific text because they are used to describe and characterize various phenomena, actions and processes in different specialties of science and engineering. Comparing the lists of verbal units of the above text corpora («Automotive», «Chemical engineering» and «Electrical engineering») with respect to the assignment of these units to different lexical layers, we found that, as sciences interact and certain areas of some sciences penetrate into others, the layer of general scientific verbs tends to expand.
For example: provide (F=253), measure (F=227), process (F=224), determine (F=222), increase (F=208), assume (F=190), transform (F=175), define (F=169), set (F=159), compare (F=157), evaluate (F=127), operate (F=124), compute (F=119), space (F=117), denote (F=113), reduce (F=107), calculate (F=103), indicate (F=99), delay (F=89), estimate (F=81), normalize (F=73), coordinate (F=72), specify (F=66), contain (F=65), design (F=65), substitute (F=65), yield (F=63), record (F=62), generate (F=61), lead (F=59), relate (F=58), decrease (F=57), couple (F=55), separate (F=54), connect (F=53), toot (F=53), limit (F=51), locate (F=51), drive (F=50), fix (F=50), approximate (F=49), odd (F=48), investigate (F=48), average (F=47), minimize (F=47), move (F=47), multiply (F=47), solve (F=47), combine (F=46), construct (F=46), simplify (F=42), adjust (F=40), comprise (F=40), replace (F=38), sum (F=38), carry (F=37), complicate (F=37), extend (F=37), integrate (F=37), depict (F=36), hold (F=36), select (F=36), mismatch (F=35), modify (F=35), mount (F=35), switch (F=34), bound (F=33), divide (F=33), maximize (F=33), distribute (F=31), equal (F=31), implement (F=31), adapt (F=30), expend (F=30), center (F=26), photograph (F=10).
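The comparison of verb lists across sublanguages can be sketched as a simple count of how many corpora each verb occurs in. The verb sets below are tiny invented samples, not the actual contents of the cited frequency dictionaries, and the threshold `min_corpora` as well as the function name are assumptions made for illustration.

```python
from collections import Counter


def general_scientific_candidates(verb_lists, min_corpora):
    """Return verbs that occur in at least `min_corpora` of the given lists."""
    counts = Counter()
    for verbs in verb_lists.values():
        counts.update(set(verbs))  # count each verb once per corpus
    return sorted(verb for verb, n in counts.items() if n >= min_corpora)


# Invented samples standing in for the four branch frequency dictionaries.
samples = {
    "Acoustics": {"measure", "determine", "calculate", "radiate"},
    "Automotive": {"measure", "determine", "calculate", "brake"},
    "Chemical engineering": {"measure", "determine", "calculate", "distil"},
    "Electrical engineering": {"measure", "determine", "ground"},
}

in_all_four = general_scientific_candidates(samples, min_corpora=4)
in_three_or_more = general_scientific_candidates(samples, min_corpora=3)
print(in_all_four)
print(in_three_or_more)
```

Verbs found in every sample are the strongest candidates for the general scientific layer, while verbs confined to one corpus are more likely to be terms of that field. Verbs shared with a general literary dictionary would still have to be separated out as common vocabulary.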

The terminological layer of vocabulary contains the smallest number of units: only 49 words. Like noun terms, verb terms express concepts of objects and phenomena of the surrounding reality, but at a different level, that of movement, dynamics and process, which follows from the function of the verb to denote a process. This is especially noticeable in the functioning of terminological verbs in their specific implementations. The following verbal terminological units were identified in the frequency dictionary «Acoustics»; they are also presented in descending order of absolute frequencies: process (F=224), plot (F=119), damp (F=114), radiate (F=98), transmit (F=97), sample (F=73), steer (F=72), weight (F=71), derive (F=70), shade (F=65), control (F=64), close (F=62), focus (F=60), cancel (F=57), back (F=53), range (F=53), water (F=52), scan (F=48), echo - send (F=46), reflect (F=41), transfer (F=41), suppress (F=37), scatter (F=36), filter (F=35), illuminate (F=35), simulate (F=34), excite (F=33), start (F=33), maintain (F=32), constrain (F=30), load (F=30), shield (F=29), spread (F=29), analyze (F=28), detect (F=28), phase (F=28), rear (F=28), vibrate (F=28), absorb (F=27), align (F=27), propagate (F=26), truncate (F=26), cuff (F=25), stagger (F=24), vanish (F=24), bear (F=23), correlate (F=22), sense (F=22), aim (F=21), dash (F=21), decorrelate (F=21), restrict (F=20), segment (F=20), jam (F=19), read (F=19), taper (F=19), degenerate (F=18), monitor (F=18), overlap (F=18), perfect-focus (F=18), refract (F=18), corrupt (F=17), pulse (F=17), rank (F=17), screw (F=17), synthesize (F=17), beam steer (F=16), cut (F=16), entail (F=16), aerate (F=15), decay (F=15), encode (F=15), extract (F=14), modulate (F=14), slot (F=13), fasten (F=12), program (F=12), seal (F=12), attenuate (F=11), sandwich (F=11), assemble (F=10), converge (F=10), deviate (F=10), emit (F=10), fade (F=10), fold (F=10), insonify (F=10), isolate (F=10), plane (F=10), strike (F=10).
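The position of each layer in the frequency list can be summarized by a simple frequency profile. The sketch below uses only a handful of verbs and frequencies taken from the three lists above, so the numbers illustrate the calculation and do not reproduce the distribution over the full dictionary. The function name and the choice of maximum, median and minimum as summary values are assumptions.

```python
from statistics import median

# A few (verb, F) pairs copied from the three lists in this article.
sample = {
    "common": {"be": 9956, "can": 1115, "use": 905, "tune": 17},
    "general scientific": {
        "provide": 253, "measure": 227, "process": 224, "photograph": 10,
    },
    "terminological": {"process": 224, "plot": 119, "damp": 114, "strike": 10},
}


def frequency_profile(verbs):
    """Summarize one layer by its highest, median and lowest frequency."""
    freqs = sorted(verbs.values(), reverse=True)
    return {"max": freqs[0], "median": median(freqs), "min": freqs[-1]}


profiles = {layer: frequency_profile(verbs) for layer, verbs in sample.items()}
for layer, profile in profiles.items():
    print(layer, profile)
```

Applied to the complete frequency dictionary, the same profile would show which frequency zone each layer occupies.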

For clarity, all the data obtained on the stratification layers of the verbs of the scientific field «Acoustics» are summarized in the table. It gives the number of lexemes included in each lexical layer of the Acoustics technical area, which makes it possible to calculate their percentage.

Statistical data of verb units functioning in the text corpus «Acoustics»

Lexical layers | Number of verbs | Number of tokens | Percentage of verbs, %
Common verbal lexis | 204 | 24324 |
General scientific verbal lexis | 171 | 7003 |
Terminological verbal lexis | 49 | 3360 |

The table shows that most verbs of the modeled text corpus «Acoustics» belong not to the terminological vocabulary, which is characteristic only of this field of engineering and serves only as a means of professional communication, but to the common and general scientific units, which are a means of describing phenomena of a communicative orientation. This can be explained by the fact that a distinctive feature of texts on acoustics, as well as of most scientific and technical texts in other fields of knowledge, is the focus on solving special issues through an overview description of a generalizing nature.
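The shares of the three layers can be computed from the counts reported in the abstract. The sketch below takes the sum of the three layers as the denominator, which is an assumption: the article reports 465 verbal lexemes in total, while the three layer counts add up to 424, so percentages relative to 465 would be lower. The function name is hypothetical.

```python
# Verb and token counts per layer as reported in the abstract.
layers = {
    "common": {"verbs": 204, "tokens": 24324},
    "general scientific": {"verbs": 171, "tokens": 7003},
    "terminological": {"verbs": 49, "tokens": 3360},
}


def shares(layers, key):
    """Percentage of each layer in the total of `key` over the three layers."""
    total = sum(layer[key] for layer in layers.values())
    return {
        name: round(100 * layer[key] / total, 1)
        for name, layer in layers.items()
    }


verb_shares = shares(layers, "verbs")    # denominator: 424 verbs (assumed)
token_shares = shares(layers, "tokens")  # denominator: 34687 tokens
print(verb_shares)
print(token_shares)
```

Under this assumption the common and general scientific layers together account for about 88 % of the verbs and about 90 % of the verb tokens, in line with the observation that terminological verbs form the smallest group.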

Conclusions. Based on the foregoing, the following conclusions can be drawn. Verbal units can be divided into three lexical layers: common, general scientific and terminological. The layers differ in size: the largest group is made up of verbs of the common vocabulary, verbs of the general scientific vocabulary come second, and terms form the smallest group. The quantitative data confirm the results of lexical descriptions obtained by other researchers who used statistical calculations to study the lexical features of units of scientific text corpora (Шапа, 1991; Tsapenko L.E., et al., 2015; Tsinova M.V., 2014).

In the course of stratification, a comparison of the lists of verbal units of the above text corpora with respect to their assignment to different lexical layers showed that the layer of general scientific verbs is constantly growing. This can be explained by the real interaction of various areas of technical knowledge, as a result of which units from other sublanguages penetrate into an integral lexical system.


