Russian Research Institute of Artificial Intelligence

The frequency dictionary for Russian
Serge Sharoff

The second version of the frequency list

From this page you can access the frequency list for modern Russian. Until now, Chastotnyj slovarj russkogo jazyka (Zasorina, 1977) has provided the most widely used frequency list for Russian. However, the corpus used by Zasorina is relatively small by modern standards (about 1 million words). It is also outdated: it mostly covers usage from the 1920s to the 1960s and includes a high proportion of ideological sources, such as texts by Lenin and Khrushchev and Soviet newspapers. As a result, its word frequencies are severely biased: for example, Soviet and comrade are among the first hundred Russian words, on a par with function words. Finally, the list of Zasorina (1977) is not available electronically.

The list accessible from this page includes about 32000 words with a frequency greater than 1 ipm (one instance per million words). A shorter selection of the 5000 most frequent words is also available. The lists use UTF-8 encoding for Cyrillic and are compressed with zip:

  • lemma.al.zip - lemmas sorted in alphabetical order
  • lemma.num.zip - lemmas sorted by their frequency
  • words.num.zip - word forms sorted by their frequency
  • Lists of 5000 most frequent words

    Some data about uses of words in modern Russian

    • The average word length is 5.28 characters.
    • The average sentence length is 10.38 words.
    • 1000 most frequent lemmas cover 64.0708% of word forms in texts.
    • 2000 most frequent lemmas cover 71.9521% of word forms in texts.
    • 3000 most frequent lemmas cover 76.6824% of word forms in texts.
    • 5000 most frequent lemmas cover 82.0604% of word forms in texts.

    The exact information on the mapping of frequency to coverage is available from here.
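The coverage figures above can be reproduced from any lemma count table. The following is a minimal sketch, assuming the counts are held in a dictionary that maps each lemma to its number of occurrences; the function name `coverage_of_top` and the toy data are hypothetical and are not part of the published lists.

```python
from collections import Counter

def coverage_of_top(lemma_counts, n):
    """Percentage of all tokens covered by the n most frequent lemmas."""
    total = sum(lemma_counts.values())
    if total == 0:
        return 0.0
    # Take the n largest counts and express their sum as a share of all tokens.
    top = sorted(lemma_counts.values(), reverse=True)[:n]
    return 100.0 * sum(top) / total

# Toy counts, invented for illustration only (not taken from the corpus).
toy_counts = Counter({"и": 50, "в": 30, "не": 15, "хоббит": 5})
print(coverage_of_top(toy_counts, 2))  # 80.0
```

Note that the total must include every lemma in the corpus, not only those above the 1 ipm threshold; otherwise the coverage is overestimated.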

    The list is compiled on the basis of a corpus of modern Russian. The corpus contains a selection of modern fiction, political texts, newspapers, and popular science, about 40 million words (MW) in total, with fiction accounting for about half. All texts were originally written in Russian between 1970 and 2002, the majority of them between 1980 and 1995; the newspaper corpus is from 1997-1999.

    It is widely known that large texts present a problem for frequency lists, since a large text that contains many instances of a rare word can boost its frequency. If the corpus is based on fiction, large texts are quite common. As an example, the corpus contains a huge sequel to Tolkien's "The Lord of the Rings" written by a Russian author (Nick Perumov). Although the sequel is only about 250 kW (thousand words) long, less than one percent of the whole corpus, the word hobbit is used in it so often that, if no precautions against large texts are taken, it ends up among the first thousand most frequent Russian words. For this reason, the frequency list is calculated under the condition that no single text from the corpus contributes more than 10 kW and no author contributes more than 100 kW to the count. Thus, the subset of the whole corpus used for the frequency count is about 16 MW.
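The per-text and per-author limits can be sketched as follows. This is only an illustration under stated assumptions: the page does not say how the 10 kW sample of a text is chosen, so the sketch simply takes the first tokens of each text, and the names `capped_counts`, `TEXT_CAP` and `AUTHOR_CAP` are hypothetical.

```python
from collections import Counter

TEXT_CAP = 10_000     # at most 10 kW counted from any single text
AUTHOR_CAP = 100_000  # at most 100 kW counted from any single author

def capped_counts(texts, text_cap=TEXT_CAP, author_cap=AUTHOR_CAP):
    """Count lemmas while limiting the contribution of each text and author.

    `texts` is an iterable of (author, lemmas) pairs, where `lemmas` is a
    list of already lemmatised tokens. Taking the *first* tokens of a text
    is an assumption; the source does not describe the sampling method.
    """
    counts = Counter()
    used_by_author = Counter()
    for author, lemmas in texts:
        # Remaining allowance: the smaller of the text cap and what is
        # left of this author's cap.
        budget = min(text_cap, author_cap - used_by_author[author])
        if budget <= 0:
            continue
        sample = lemmas[:budget]
        counts.update(sample)
        used_by_author[author] += len(sample)
    return counts

# Toy example with tiny caps so that the effect is visible.
toy_texts = [
    ("A", ["hobbit"] * 8 + ["and"] * 2),
    ("A", ["hobbit"] * 5),
    ("B", ["and"] * 4),
]
toy_result = capped_counts(toy_texts, text_cap=5, author_cap=7)
print(toy_result["hobbit"], toy_result["and"])  # 7 4
```

With caps of this kind, a single long novel can no longer push a rare word such as hobbit into the top of the list, at the cost of discarding part of the corpus.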

    Words are not uniformly distributed across texts. Some of them (like prepositions) occur in many texts at a predictable rate; some (like pronouns or mental verbs) are significantly more frequent in certain writers or genres; and some are "contagious": if a word (e.g. a proper name, a title of nobility or a technical term) occurs once in a text, it tends to be repeated, thus boosting its frequency in that document. The variation can be measured in a variety of ways (Church, K. and Gale, W., 1995). Data on the variation of lemmas are available from here. The structure:
    lemma, mean frequency (ipm), number of texts in which the lemma occurs, standard deviation of frequency counted for all texts, coefficient of variation, variance.
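The fields of such a record can be computed from the per-text frequencies of a lemma. The sketch below is a hedged illustration, not the procedure used for the published data: it assumes that the mean is taken over all texts (with 0 for texts where the lemma is absent) and that population rather than sample statistics are used. The function name `variation_stats` and the toy values are hypothetical.

```python
from statistics import mean, pstdev, pvariance

def variation_stats(per_text_ipm):
    """Return the variation fields for one lemma.

    `per_text_ipm` holds the lemma's frequency (ipm) in every text of the
    corpus, with 0.0 for texts in which it does not occur.
    """
    m = mean(per_text_ipm)
    sd = pstdev(per_text_ipm)  # population standard deviation (assumption)
    return {
        "mean_ipm": m,
        "n_texts": sum(1 for f in per_text_ipm if f > 0),
        "std_dev": sd,
        # Coefficient of variation: standard deviation relative to the mean.
        "coeff_variation": sd / m if m else 0.0,
        "variance": pvariance(per_text_ipm),
    }

# Two toy lemmas with the same mean frequency but different distributions.
even = variation_stats([100.0, 100.0, 100.0, 100.0])   # spread evenly
bursty = variation_stats([0.0, 0.0, 0.0, 400.0])       # "contagious" word
print(even["coeff_variation"], bursty["n_texts"])
```

A word that is spread evenly has a coefficient of variation near zero, while a "contagious" word concentrated in a few texts has a high one even when the mean frequency is the same.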

    The corpus, tools for working with it, as well as an aligned parallel English-Russian corpus are discussed in the following publication:

    Sharoff, Serge, (2002). Meaning as use: exploitation of aligned corpora for the contrastive study of lexical semantics. Proc. of Language Resources and Evaluation Conference (LREC02). May, 2002, Las Palmas, Spain. PDF file.

    Three frequency lists for word classes are also available:

    The compilation of the corpora and the development of the respective tools and frequency lists were made possible by the Fellowship awarded to the author from the

      All rights reserved. © 2001 RRIAI