Hungarian National Corpus

 

This is the old version of the Hungarian National Corpus. The Hungarian Gigaword Corpus will be available shortly.

Work on the Hungarian National Corpus (HNC) started in 1998 at the Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences (HAS) under the supervision of Tamás Váradi. The objective was to create a 100-million-word balanced reference corpus of present-day Hungarian. In 2002 a new effort began to extend data collection to Hungarian language use across the whole Carpathian Basin, in the Hungarian Language Corpus of the Carpathian Basin project. Its aim was to create a 15-million-word corpus of the Hungarian language used beyond the borders of Hungary. The truly national Hungarian National Corpus, which also contains language variants from Slovakia, Subcarpathia, Transylvania and Vojvodina, was introduced in November 2005. This first Hungarian corpus to also cover language variants from beyond the borders of Hungary was completed as the result of joint work by the Hungarian Language Offices and the Department of Corpus Linguistics.

What is a corpus?

A corpus is a collection of written or spoken linguistic data. The texts are selected and classified according to certain criteria. A corpus does not necessarily contain whole texts, and it is more than a repository of texts: it also contains their bibliographical data and marks their structural units (paragraphs, sentences). The HNC aims to be a representative general-purpose corpus of present-day standard Hungarian.

Automatic analysis

A key characteristic of the HNC is its detailed morphosyntactic annotation. Every wordform is annotated with its stem, part of speech and inflectional information. The analysis is produced by automatic methods with an overall precision of about 97.5%, i.e. about 2.5% of all wordforms have an erroneous analysis. Higher precision could only have been achieved by manual annotation, which was not feasible for such a large amount of data.
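To give a sense of scale, the error rate can be turned into an absolute figure. The sketch below is only a back-of-the-envelope estimate: it assumes that the 187.6 million words quoted for the corpus elsewhere on this page can be treated as the number of annotated wordforms, and that the error rate is uniform across the corpus. The function name is illustrative.

```python
# Rough estimate of how many wordforms carry an erroneous analysis.
# Assumptions: the corpus size (187.6 million words, as stated on this page)
# equals the number of annotated wordforms, and errors are spread uniformly.

CORPUS_WORDS = 187_600_000  # corpus size stated on this page
PRECISION = 0.975           # approximate precision stated above


def expected_errors(n_words: int, precision: float) -> int:
    """Return the expected number of wordforms with a wrong analysis."""
    return round(n_words * (1.0 - precision))


if __name__ == "__main__":
    print(expected_errors(CORPUS_WORDS, PRECISION))
```

Under these assumptions the estimate comes to several million wordforms, which shows why manual correction of the whole corpus was not feasible.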

How is it built up?

The HNC currently contains 187.6 million words. It is divided into five subcorpora by regional language variant and, independently, into five subcorpora by text genre. The subcorpus to be studied can be chosen as any combination of these. This makes the HNC an appropriate tool for studying differences not only between text genres but also between language variants.

The HNC consists of the following subcorpora (sizes given in million words, rounded to the nearest 100,000 words):

            Hungary  Slovakia  Subcarpathia  Transylvania  Vojvodina  total
press          71.0       5.7           0.7           5.5        1.5   84.5
literature     35.5       1.4           0.4           0.8        0.2   38.2
science        20.5       2.3           0.7           1.6        0.3   25.5
official       19.9       0.2           0.3           0.6        0.1   20.9
personal       17.8         -           0.4           0.4        0.1   18.6
total         164.7       9.5           2.5           8.9        2.0  187.6

press: Texts from the news media make up almost half of the corpus, presenting a broad range of dialects, both vertically and horizontally.

literature: The material of the Digital Literary Academy (Digitális Irodalmi Akadémia) was fully incorporated in the autumn of 2005. It makes up the literature subcorpus for Hungary.

science: The source of science texts for Hungary is the Hungarian Electronic Library (Magyar Elektronikus Könyvtár).

official: Regulations, laws, by-laws and parliamentary debates.

personal: This subcorpus contains discussions from internet forums (the forums of the biggest and oldest Hungarian internet portal, index.hu, and several forums from Subcarpathia). This language variant is particularly interesting because it stands closest to spontaneous linguistic communication. In certain cases it is very similar to spoken communication.
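Since a subcorpus can be any combination of genres and regions, its approximate size can be read off the table above by summing the relevant cells. The following is a minimal sketch of that arithmetic, not part of any HNC tool; the names `SIZES` and `subcorpus_size` are illustrative. The personal row lists only four regional values, and the sketch assumes the empty cell is Slovakia, which is the placement that agrees with the column totals.

```python
# Sizes in million words, copied from the table above.
REGIONS = ["Hungary", "Slovakia", "Subcarpathia", "Transylvania", "Vojvodina"]

SIZES = {
    "press":      [71.0, 5.7, 0.7, 5.5, 1.5],
    "literature": [35.5, 1.4, 0.4, 0.8, 0.2],
    "science":    [20.5, 2.3, 0.7, 1.6, 0.3],
    "official":   [19.9, 0.2, 0.3, 0.6, 0.1],
    # Assumption: the empty cell in this row is Slovakia (treated as 0.0).
    "personal":   [17.8, 0.0, 0.4, 0.4, 0.1],
}


def subcorpus_size(genres, regions):
    """Return the combined size (million words) of the chosen cells."""
    total = sum(
        SIZES[genre][REGIONS.index(region)]
        for genre in genres
        for region in regions
    )
    return round(total, 1)


if __name__ == "__main__":
    # Press texts from Hungary only.
    print(subcorpus_size(["press"], ["Hungary"]))
    # Press and science texts from Slovakia and Vojvodina.
    print(subcorpus_size(["press", "science"], ["Slovakia", "Vojvodina"]))
    # Share of the press genre in the whole corpus, using the printed totals.
    print(round(100 * 84.5 / 187.6, 1), "percent")
```

Because every cell is already rounded to one decimal, sums computed this way can differ from the printed row and column totals by 0.1 or 0.2.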

Who can use this corpus?

Anyone who fills out the registration form and agrees to the conditions laid down there can use the Hungarian National Corpus.

Frequency data

rank stem      POS      count  count / 1000 words
 1.  a         Det   11128421  72.40
 2.  az        Det    3716414  24.18
 3.  és        Con    2544751  16.56
 4.  hogy      Con    2166004  14.09
 5.  A         Det    2103970  13.69
 6.  az        Pro    1803814  11.74
 7.  nem       Adv    1693748  11.02
 8.  is        Con    1677108  10.91
 9.  van       V      1418113   9.23
10.  ez        Pro    1204269   7.84
11.  egy       Num     899832   5.85
12.  Az        Det     730287   4.75
13.  meg       Pre     592986   3.86
14.  kell      V       499659   3.25
15.  csak      Adv     477956   3.11
16.  lesz      V       469189   3.05
17.  de        Con     462508   3.01
18.  már       Adv     452814   2.95
19.  Ez        Pro     447310   2.91
20.  amely     Pro     417945   2.72
21.  ha        Con     402593   2.62
22.  még       Adv     396207   2.58
23.  vagy      Con     381098   2.48
24.  mint      Con     370507   2.41
25.  szerint   NU      369481   2.40
26.  el        Pre     362004   2.36
27.  tud       V       356833   2.32
28.  s         Con     356453   2.32
29.  aki       Pro     350819   2.28
30.  év        N       338213   2.20
31.  sem       Adv     329570   2.14
32.  lehet     V       310500   2.02
33.  ő         Pro     306621   1.99
34.  ki        Pre     305480   1.99
35.  ami       Pro     287999   1.87
36.  nagy      A       281134   1.83
37.  mond      V       276868   1.80
38.  mi        Pro     275076   1.79
39.  maga      Pro     263983   1.72
40.  mert      Con     258962   1.68
41.  én        Pro     245386   1.60
42.  -e        Clit    237612   1.55
43.  olyan     Pro     232947   1.52
44.  A                 232826   1.51
45.  több      Num     232803   1.51
46.  magyar    A       229934   1.50
47.  minden    Pro     225130   1.46
48.  úgy       Adv     221524   1.44
49.  pedig     Con     216513   1.41
50.  új        A       215765   1.40
51.  tesz      V       211798   1.38
52.  két       Num     211077   1.37
53.  00        DIG     205993   1.34
54.  ember     N       198039   1.29
55.  Az        Pro     194263   1.26
56.  után      NU      190805   1.24
57.  Nem       Adv     185338   1.21
58.  idő       N       178374   1.16
59.  majd      Adv     177497   1.15
60.  be        Pre     175615   1.14
61.  tart      V       173048   1.13
62.  rész      N       170894   1.11
63.  most      Adv     168334   1.10
64.  fel       Pre     164467   1.07
65.  szó       N       162929   1.06
66.  1         DIG     162486   1.06
67.  között    NU      159583   1.04
68.  első      Num     158569   1.03
69.  nap       N       157310   1.02
70.  ad        V       154537   1.01
71.  99        DIG     154526   1.01
72.  azonban   Con     154150   1.00
73.  sok       Num     152907   0.99
74.  ők        Pro     151718   0.99
75.  más       Pro     151698   0.99
76.  kérdés    N       151477   0.99
77.  hanem     Con     150702   0.98
78.  Ha        Con     147117   0.96
79.  eset      N       146803   0.96
80.  elnök     N       146500   0.95
81.  forint    N       144629   0.94
82.  egyik     Pro     143627   0.93
83.  kormány   N       139493   0.91
84.  akar      V       138696   0.90
85.  ország    N       137225   0.89
86.  kerül     V       135554   0.88
87.  De        Con     135062   0.88
88.  százalék  N       132780   0.86
89.  lát       V       131866   0.86
90.  törvény   N       129485   0.84
91.  98        DIG     128540   0.84
92.  sor       N       128311   0.83
93.  kap       V       127841   0.83
94.  fog       V       127768   0.83
95.  alap      N       127632   0.83
96.  2         DIG     127461   0.83
97.  itt       Adv     127399   0.83
98.  hely      N       124262   0.81
99.  vesz      V       123583   0.80
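The last column of the frequency list is a normalised frequency: the raw count divided by the number of tokens the list was computed over, times 1000. The page does not state that token total, so the sketch below back-computes it from the first entry, reading it as stem a, POS Det, count 11128421 and 72.40 per 1000 words (that split of the digits is itself an assumption), and then checks the second entry against it. The function names are illustrative.

```python
# Relationship between raw counts and the "count / 1000 words" column.
# Assumption: the first entry splits as count 11128421 and rate 72.40,
# the second as count 3716414 and rate 24.18.


def per_thousand(count: int, corpus_tokens: float) -> float:
    """Normalised frequency: occurrences per 1000 tokens."""
    return 1000.0 * count / corpus_tokens


def implied_corpus_size(count: int, rate_per_thousand: float) -> float:
    """Token total implied by a count and its per-1000 rate."""
    return 1000.0 * count / rate_per_thousand


if __name__ == "__main__":
    size = implied_corpus_size(11_128_421, 72.40)
    print(round(size / 1e6, 1), "million tokens implied by entry 1")
    # Entry 2 should reproduce its listed rate under the same total.
    print(round(per_thousand(3_716_414, size), 2))
```

The implied total is lower than the 187.6 million words quoted for the full corpus. The page does not say which version of the corpus or which counting unit the list is based on, so the figure should be read only as a consistency check on the table.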

Partners

Morphological analysis is performed by Humor from MorphoLogic Ltd., disambiguation is based on Thorsten Brants' TnT tagger, and the corpus processing tool used is the IMS Corpus Workbench.

Supporters

Corpus creation - tender T 026091 of OTKA, browsable version - tender SZT-IS-7 of IHM, Hungarian Language Corpus of the Carpathian Basin project - tender NKFP/044/2002.

If you have any comments, please let us know.
Research Institute for Linguistics, HAS 1998-2006.