Hungarian National Corpus

The new, expanded version of the Hungarian National Corpus, with new features, has been launched.
Please use the new interface from now on. Old HNC registrations are still valid. Please refer to our LREC 2014 publication presenting this new version.

This is the site of the old version of the Hungarian National Corpus.

Work on the Hungarian National Corpus (HNC) started in 1998 at the Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences (HAS) under the supervision of Tamás Váradi. The objective was to create a 100-million-word balanced reference corpus of present-day Hungarian. In 2002 a new effort began to extend data collection to Hungarian language use across the whole Carpathian Basin, in the Hungarian Language Corpus of the Carpathian Basin project. The aim was to create a 15-million-word corpus of Hungarian as used beyond the borders of Hungary. The truly national Hungarian National Corpus, which also contains language variants from Slovakia, Subcarpathia, Transylvania and Vojvodina, was introduced in November 2005. This first Hungarian corpus to cover language variants from beyond the borders of Hungary was completed as the result of joint work by the Hungarian Language Offices and the Department of Corpus Linguistics.

What is a corpus?

A corpus is a collection of written or spoken linguistic data. The texts are selected and classified according to certain criteria. A corpus does not necessarily contain whole texts, and it is more than a repository of texts: it also records their bibliographical data and marks their structural units (paragraphs, sentences). The HNC aims to be a representative general-purpose corpus of present-day standard Hungarian.

Automatic analysis

A key characteristic of the HNC is its detailed morphosyntactic annotation. Every wordform is annotated with its stem, part of speech and inflectional information. The analysis is produced by automatic methods with an overall precision of about 97.5%, i.e. about 2.5% of all wordforms have an erroneous analysis. Higher precision could only have been achieved by manual annotation, which was not feasible for such a large amount of data.
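
To put the error rate in absolute terms, here is a back-of-the-envelope sketch. It assumes that the 2.5% figure applies uniformly to the 187.6 million words reported for the current corpus and that every word counts as one wordform; neither assumption is stated on this page, and the function name is only illustrative.

```python
# Rough estimate of how many wordforms carry an erroneous analysis.
# Assumption: the 2.5% error rate applies uniformly to the whole corpus.
CORPUS_SIZE = 187_600_000   # words, as reported for the current HNC
ERRORS_PER_THOUSAND = 25    # 2.5% expressed per 1000 to stay in integers


def expected_errors(tokens: int, per_thousand: int = ERRORS_PER_THOUSAND) -> int:
    """Return the expected number of misanalysed wordforms."""
    return tokens * per_thousand // 1000


print(expected_errors(CORPUS_SIZE))  # 4690000
```

Under these assumptions, roughly 4.7 million wordforms would have an incorrect analysis, which is worth keeping in mind when interpreting counts of rare forms.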

How is it built up?

The HNC currently contains 187.6 million words. It is divided into five subcorpora by regional language variant and, independently, into five subcorpora by text genre. The subcorpus to be studied can be chosen as any combination of these. This makes the HNC an appropriate tool for studying differences not just between text genres but also between language variants.

The HNC consists of the following subcorpora (sizes given in million words, rounded to the nearest 100,000 words):

genre	Hungary	Slovakia	Subcarpathia	Transylvania	Vojvodina	total	description
press	71.0	5.7	0.7	5.5	1.5	84.5	Texts from the news media make up almost half of the corpus, presenting a broad range of dialects, both vertically and horizontally.
literature	35.5	1.4	0.4	0.8	0.2	38.2	The material of the Digital Literary Academy (Digitális Irodalmi Akadémia) was fully incorporated in the autumn of 2005. It makes up the literature subcorpus for Hungary.
science	20.5	2.3	0.7	1.6	0.3	25.5	The source of science texts for Hungary is the Hungarian Electronic Library (Magyar Elektronikus Könyvtár).
official	19.9	0.2	0.3	0.6	0.1	20.9	Regulations, laws, by-laws and parliamentary debates.
personal	17.8	—	0.4	0.4	0.1	18.6	This subcorpus contains discussions from internet forums (the forums of index.hu, the biggest and oldest Hungarian internet portal, and several forums from Subcarpathia). This language variant is particularly interesting because it is the closest to spontaneous linguistic communication; in certain cases it is very similar to spoken communication.
total	164.7	9.5	2.5	8.9	2.0	187.6
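
The region-by-genre selection described above can be illustrated with a small sketch that adds up the table cells for a chosen combination. The figures are copied from the table; the names `SIZES` and `subcorpus_size` are hypothetical, and the missing personal/Slovakia cell is treated as zero, which is an assumption. Because every cell is rounded, sums of cells may differ slightly from the published totals.

```python
# Sizes in units of 100,000 words (tenths of a million), copied from the table.
# Assumption: the empty personal/Slovakia cell is treated as 0.
SIZES = {
    "press":      {"Hungary": 710, "Slovakia": 57, "Subcarpathia": 7, "Transylvania": 55, "Vojvodina": 15},
    "literature": {"Hungary": 355, "Slovakia": 14, "Subcarpathia": 4, "Transylvania": 8,  "Vojvodina": 2},
    "science":    {"Hungary": 205, "Slovakia": 23, "Subcarpathia": 7, "Transylvania": 16, "Vojvodina": 3},
    "official":   {"Hungary": 199, "Slovakia": 2,  "Subcarpathia": 3, "Transylvania": 6,  "Vojvodina": 1},
    "personal":   {"Hungary": 178, "Slovakia": 0,  "Subcarpathia": 4, "Transylvania": 4,  "Vojvodina": 1},
}


def subcorpus_size(regions, genres):
    """Return the size, in million words, of the selected region/genre cells."""
    tenths = sum(SIZES[genre][region] for genre in genres for region in regions)
    return tenths / 10


# Press and science texts from Slovakia: 5.7 + 2.3 million words.
print(subcorpus_size(["Slovakia"], ["press", "science"]))  # 8.0
```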

Who can use this corpus?

Anyone who fills out the registration form and agrees to the conditions laid down there can use the Hungarian National Corpus.

Frequency data

Browsable frequency data (in Hungarian).
Full HNC frequency list in the META-SHARE repository.
Excerpt from HNC frequency list:

rank	stem	POS	count	count / 1000 words	rank	stem	POS	count	count / 1000 words	rank	stem	POS	count	count / 1000 words
1.	a	Det	11128421	72.40	34.	ki	Pre	305480	1.99	67.	között	NU	159583	1.04
2.	az	Det	3716414	24.18	35.	ami	Pro	287999	1.87	68.	első	Num	158569	1.03
3.	és	Con	2544751	16.56	36.	nagy	A	281134	1.83	69.	nap	N	157310	1.02
4.	hogy	Con	2166004	14.09	37.	mond	V	276868	1.80	70.	ad	V	154537	1.01
5.	A	Det	2103970	13.69	38.	mi	Pro	275076	1.79	71.	99	DIG	154526	1.01
6.	az	Pro	1803814	11.74	39.	maga	Pro	263983	1.72	72.	azonban	Con	154150	1.00
7.	nem	Adv	1693748	11.02	40.	mert	Con	258962	1.68	73.	sok	Num	152907	0.99
8.	is	Con	1677108	10.91	41.	én	Pro	245386	1.60	74.	ők	Pro	151718	0.99
9.	van	V	1418113	9.23	42.	-e	Clit	237612	1.55	75.	más	Pro	151698	0.99
10.	ez	Pro	1204269	7.84	43.	olyan	Pro	232947	1.52	76.	kérdés	N	151477	0.99
11.	egy	Num	899832	5.85	44.	jó	A	232826	1.51	77.	hanem	Con	150702	0.98
12.	Az	Det	730287	4.75	45.	több	Num	232803	1.51	78.	Ha	Con	147117	0.96
13.	meg	Pre	592986	3.86	46.	magyar	A	229934	1.50	79.	eset	N	146803	0.96
14.	kell	V	499659	3.25	47.	minden	Pro	225130	1.46	80.	elnök	N	146500	0.95
15.	csak	Adv	477956	3.11	48.	úgy	Adv	221524	1.44	81.	forint	N	144629	0.94
16.	lesz	V	469189	3.05	49.	pedig	Con	216513	1.41	82.	egyik	Pro	143627	0.93
17.	de	Con	462508	3.01	50.	új	A	215765	1.40	83.	kormány	N	139493	0.91
18.	már	Adv	452814	2.95	51.	tesz	V	211798	1.38	84.	akar	V	138696	0.90
19.	Ez	Pro	447310	2.91	52.	két	Num	211077	1.37	85.	ország	N	137225	0.89
20.	amely	Pro	417945	2.72	53.	00	DIG	205993	1.34	86.	kerül	V	135554	0.88
21.	ha	Con	402593	2.62	54.	ember	N	198039	1.29	87.	De	Con	135062	0.88
22.	még	Adv	396207	2.58	55.	Az	Pro	194263	1.26	88.	százalék	N	132780	0.86
23.	vagy	Con	381098	2.48	56.	után	NU	190805	1.24	89.	lát	V	131866	0.86
24.	mint	Con	370507	2.41	57.	Nem	Adv	185338	1.21	90.	törvény	N	129485	0.84
25.	szerint	NU	369481	2.40	58.	idő	N	178374	1.16	91.	98	DIG	128540	0.84
26.	el	Pre	362004	2.36	59.	majd	Adv	177497	1.15	92.	sor	N	128311	0.83
27.	tud	V	356833	2.32	60.	be	Pre	175615	1.14	93.	kap	V	127841	0.83
28.	s	Con	356453	2.32	61.	tart	V	173048	1.13	94.	fog	V	127768	0.83
29.	aki	Pro	350819	2.28	62.	rész	N	170894	1.11	95.	alap	N	127632	0.83
30.	év	N	338213	2.20	63.	most	Adv	168334	1.10	96.	2	DIG	127461	0.83
31.	sem	Adv	329570	2.14	64.	fel	Pre	164467	1.07	97.	itt	Adv	127399	0.83
32.	lehet	V	310500	2.02	65.	szó	N	162929	1.06	98.	hely	N	124262	0.81
33.	ő	Pro	306621	1.99	66.	1	DIG	162486	1.06	99.	vesz	V	123583	0.80
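
The "count / 1000 words" column is a relative frequency: the raw count divided by the total number of words, multiplied by 1000. The sketch below shows that calculation and its inverse. The page does not state which total the list was computed from, so the value used here is back-derived from the first row and is only an estimate; the function names are illustrative.

```python
def per_thousand(count: int, total_words: int) -> float:
    """Relative frequency: occurrences per 1000 words, rounded to 2 decimals."""
    return round(count / total_words * 1000, 2)


def implied_total(count: int, rate_per_thousand: float) -> int:
    """Back-derive the corpus size a published count/rate pair implies."""
    return round(count / rate_per_thousand * 1000)


# First row of the excerpt: stem "a", count 11128421, rate 72.40.
total = implied_total(11128421, 72.40)
print(total)  # about 153.7 million

# Cross-check with the second row (stem "az", count 3716414, published rate 24.18).
print(per_thousand(3716414, total))
```

The implied total of about 153.7 million words is lower than the 187.6 million reported for the whole corpus, so the list may have been compiled from a different version or counting basis; this page does not say which.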

Partners

Morphological analysis is performed by Humor from MorphoLogic Ltd.; disambiguation is based on Thorsten Brants' TnT tagger; the corpus processing tool is the IMS Corpus Workbench.
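
The IMS Corpus Workbench reads "verticalized" text: one token per line with tab-separated attributes, and XML-style tags for structural units such as paragraphs and sentences. As a rough illustration of how annotated tokens might be serialized for such a tool, consider the sketch below. The `Token` class, the attribute order and the `_` placeholder for inflectional information are assumptions made for this example; they do not reproduce the actual HNC encoding or tagset.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str   # surface wordform
    stem: str   # stem, as produced by morphological analysis
    pos: str    # part-of-speech tag chosen by the disambiguator
    infl: str   # inflectional information ("_" is a placeholder here)


def to_vertical(sentences):
    """Serialize one paragraph of sentences as one-token-per-line text."""
    lines = ["<p>"]
    for sentence in sentences:
        lines.append("<s>")
        for tok in sentence:
            lines.append("\t".join((tok.word, tok.stem, tok.pos, tok.infl)))
        lines.append("</s>")
    lines.append("</p>")
    return "\n".join(lines)


# Stems and POS tags follow the frequency list above; "_" is a placeholder.
sentence = [
    Token("Az", "Az", "Det", "_"),
    Token("ember", "ember", "N", "_"),
    Token("lát", "lát", "V", "_"),
]
print(to_vertical([sentence]))
```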

Supporters

Corpus creation: grant T 026091 of OTKA. Browsable version: grant SZT-IS-7 of IHM. Hungarian Language Corpus of the Carpathian Basin project: grant NKFP/044/2002.

If you have any comments, please let us know.
Research Institute for Linguistics, HAS 1998-2006.