Hungarian National Corpus

The new, expanded version of the Hungarian National Corpus with new features is now open.
(an old HNC registration is needed)
The full English language site of the new corpus will be available shortly.
In the meantime, check the new English corpus query interface. Old HNC registrations are still valid.
Please try, use and test the new corpus. We hope you find it useful.
Do not hesitate to contact us if you have something to say about it.
We plan to replace the old version with the new one here after a test period of some months.
Please inform us when you use the HNC, and refer to our new publication:
Csaba Oravecz, Tamás Váradi, Bálint Sass: The Hungarian Gigaword Corpus. In: Proceedings of LREC 2014, 2014.

This is the help site for the old version of the Hungarian National Corpus.

This page explains the searching features of the Hungarian National Corpus. While reading through this page it may be useful to open the query page, so you can try the examples marked by the [Pl.] icon. You can access the query page after registration.

Introduction
Searching for one word
Searching for a pair of words
Display options
The MSD codes
FAQ - Frequently Asked Questions

1. Introduction

Enter a word in the 1st word-form field, then press the [keresés indítása] (start search) button.

The results will be displayed in a separate window. In the header you will see the name(s) of the selected subcorpora, the source language of the query, the total number of hits and their frequency per million words, and the duration of the query. Then the list of results follows: the word in blue, the analysis in green. The following sections explain the usage of the query page in detail.

2. Searching for one word

You can define a word with any of these three features: stem or inflected form (word-form), POS and morphosyntactic features, MSD codes. These three features can be combined freely. The conditions are combined with the AND operator, so setting ... features will give no results. When no feature is given at all, the program responds with an error message.

2.1 Stem and word-form

Enter a stem/word-form in the first entry field. The search is case sensitive. From the pull-down menu below, choose word-form (e.g. szüleimé) if you are searching for one specific form of a word, or choose stem if you are searching for any form of the stem.

Entering ember as a stem, you will get the inflected forms embert, emberek, embereinkhez, etc., while as a word-form you only get ember.

Entering the inflected form szüleimé as a stem will give no result.
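The way the conditions combine can be pictured with a short Python sketch. Everything in it is hypothetical: the Token records, the search helper and the tiny token list are invented for illustration and are not part of the HNC software, and the MSD values are hand-written following the scheme in section 5. The POS menus are not modelled.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word_form: str
    stem: str
    msd: str


# Hand-written toy data (an assumption, not real HNC output).
TOKENS = [
    Token("ember", "ember", "N.NOM"),
    Token("embert", "ember", "N.ACC"),
    Token("emberek", "ember", "N.PL.NOM"),
    Token("szüleimé", "szülő", "N.PSe1i.POS.NOM"),
]


def search(tokens, word_form: Optional[str] = None,
           stem: Optional[str] = None, msd: Optional[str] = None):
    """Return the tokens that satisfy ALL of the given conditions (AND)."""
    if word_form is None and stem is None and msd is None:
        # Mirrors the error message shown when no feature is given.
        raise ValueError("at least one feature must be given")
    hits = []
    for tok in tokens:
        # Comparisons are exact, so the search is case sensitive.
        if word_form is not None and tok.word_form != word_form:
            continue
        if stem is not None and tok.stem != stem:
            continue
        if msd is not None and tok.msd != msd:
            continue
        hits.append(tok)
    return hits
```

With this toy data, search(TOKENS, stem="ember") returns all three forms of ember, search(TOKENS, word_form="ember") returns only the bare form, and search(TOKENS, stem="szüleimé") returns nothing, because szüleimé is an inflected form and not a stem.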

2.2 POS category and morphosyntactic features

It is also possible to specify the POS (part-of-speech) category of the word and other features (number, person, mood, case, etc.) depending on its POS category.

The ANY option in the pull-down menus means that any value of the current feature will satisfy the search.

Explanation of some of the POS categories:

Nominal is a general category which includes every POS category that can have a “nominal” feature (case, etc.). You can choose the particular POS (noun, adjective, etc.) inside this category. This makes it possible to search for words with a particular feature regardless of their POS.
To search for a word with arbitrary POS and particular case: choose nominal as POS, leave the actual POS ANY, and set the case in the last pull-down menu.
Digits and numerals are in different categories: numerals are inside nominal, digits are not.
The verb prefix category will give you both bare prefixes and infinitives and adverbial participles beginning with a verb prefix.
Infinitive and adverbial participle are among the main POS categories, while the future, past and present participles are inside the nominal category.
The prefix POS category includes words like mikro, kilo, bel, kül, etc.
Punctuation mark is a special “type of word”, hence they are among the POS categories. All punctuation marks are considered to be separate words, regardless of whether there is, or there is no space after the preceding or the following word.

When you choose a POS from the main pull-down menu, its additional features will be displayed. Such additional features are present in the nominal, verb, infinitive, adverbial participle and punctuation mark categories, which are followed by an ellipsis in the menu.

The submenus of the specific POS categories:

In the nominal POS category you can specify 6 additional features. The particular POS category, the type (whether it is in base, comparative or superlative form), the number and the case features are straightforward to use. In the case menu the specific suffixes are listed; a vowel that alternates is written as a capital letter (the suffix -val, -vel is marked by -vAl).

Leave the stem/word-form field empty, set POS to nominal, then adjective, type to comparative, number to plural, and case to -bA.

The possessive suffix has four submenus; the anaphoric possessive suffix (e.g. Péteré) has two.

The menus marked with a > are for specifying the current feature. They are visible only if the first menu of the group is set to yes.

Set the stem to apa. Set POS to nominal, then noun.

possessive = yes, set its number to plural, its person to first, set pluralizer to yes, anaphoric = no. This query would return nouns of the form apáink in any of the cases.
possessive = no, anaphoric = yes. This would return apáé, apáké in any of the cases.
possessive = yes, possessive pluralizer = no, anaphoric = yes, pluralizer = yes. This query returns the nouns in the form apáméi, apádéi, apjáéi, apánkéi, apátokéi, apjukéi.

If you choose verb, then it is possible to set five features: verb prefix, conjugation, tense/mood, number, person. In conjugation there is subjective, objective and the special form -lAk (as in láttalak, szeretnélek). As this form occurs only in the first person singular, these two options are automatically set when you choose -lAk.
Enter figyel as stem, set POS to verb.

Choose -lAk from the conjugation menu, and see the results.
Then choose ANY from the conjugation menu, set past tense, second person singular.
Tense and mood are in the same menu, because in one-word forms these two options cannot occur at the same time. If a verb is in the past tense it cannot be imperative or conditional and vice versa. If you look for a form in the future tense (e.g. fogsz menni) or past tense conditional (e.g. tudhattad volna), you need to search for a pair of words.
Setting POS to infinitive, you can specify the presence of the verb prefix, the number in the base/number submenu, and person. Setting base will hide the person submenu.
If you choose adverbial participle, you can specify the presence of the verb prefix.
There are two types of punctuation marks: general and sentence boundary.

2.3 MSD codes

The third type of definition for a word-form is an alternative for the previously described menu system. Those familiar with MSD codes can enter the code of the desired word directly in the appropriate field.

Enter vár as a stem, set POS to verb and type V.e1 in the MSD-code field.

3. Searching for a pair of words

Besides searching for one word, it is also possible to search for two words in a specific relation. The relation can be set in the menu below the features of the first word.

The relation can be one of these five: only, followed by, near, or, and not.

Choose only if you want to search for one word. In the other four cases a similar form will be displayed where you can set the features of the second word.

The most important feature of the Hungarian National Corpus is its ability to search for two consecutive words. This can be achieved by setting the followed by option. In the menu below the second word you can set the distance between the two words. The option within 1 word means that at most one word can be between the two specified words. A punctuation mark counts as one separate word.

Enter szép as 1st stem and choose followed by below. At the 2nd word set nominal and noun. From the menu below the second word choose directly, then try within 5 words as well.

Enter Budapesti as 1st word-form, and Vállalat as 2nd word-form. Set followed by and try within 1 word and within 2 words.
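The distance setting can be pictured with the following Python sketch. It is only an illustration: the followed_by helper and the token lists are hypothetical, and it assumes that directly corresponds to a distance of 0 and that within n words allows at most n tokens (words or punctuation marks) between the two words.

```python
def followed_by(tokens, first, second, within=0):
    """Return (i, j) index pairs where `second` follows `first`
    with at most `within` tokens between them.

    within=0 stands for the "directly" option (an assumption).
    Punctuation marks are ordinary tokens and count towards the distance.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # The second word may sit at positions i+1 .. i+within+1.
        last = min(i + within + 2, len(tokens))
        for j in range(i + 1, last):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


# Hypothetical, already tokenised example texts.
name = ["Budapesti", "Közlekedési", "Vállalat"]
phrase = ["szép", ",", "nagy", "ház"]
```

Here followed_by(name, "Budapesti", "Vállalat", within=0) finds nothing, while within=1 finds the pair. In the second list the comma counts as a word, so szép and ház are two tokens apart and are found only with within=2 or more.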

Searching for two-word verb forms:

Future tense (fogsz menni): POS of the 1st word is verb, POS of the 2nd word is infinitive.
Past tense conditional (tudhattad volna): the 1st word is a past tense verb, the second is a verb in conditional.

Searching for a pair of words can last for minutes if the first word occurs too many times in the corpus. If a query is longer than 5-6 minutes, the server might close the connection before sending the data to your machine. In this case no answer is given. Try a less general query or change the order of words.

If you search for the word nagy followed by some noun, the search takes about 10 seconds. However, in reverse order you could encounter the problem described above.

The option near is the same as above, only the order of the two words is arbitrary.
As 1st stem enter szív, set POS to noun. Choose the option near. As 2nd stem enter dobog and choose within 5 words.
The search takes about twice as long as a simple followed by search.
The or option is pretty straightforward. It performs two one-word searches at the same time.
The and not option also searches for one word. On the 1st form specify what you want to be true for the word, on the second what you want to be false for it.
Enter víz on the 1st form, choose and not. On the second form set nominal and nominative case.

4. Display options

You can set the display options in the last four rows of the page.

In the first row set the size of the concordance: how many items should occur in what context.
Displaying too many items could take a considerable amount of time. The context can be 10 words or 5 sentences or 1 paragraph at most.

In the second row set the display options of the stem and the MSD code.

You can set whether you want to see either of these two attributes beside the word-form, and whether to show them for one or two words on each side of the query word. Furthermore, if you set the Attributes in small window option, the attributes of any word will be displayed in a separate window when you point at it with the mouse.

Enter the word kutyatár, set the context to 1 sentence. Set both the stem and the MSD-code to be displayed only on target word, and set the Attributes in small window option. In the display window point the mouse to the words megnyílik, then a.

In the 3rd row you can set two further display options.

The sorting option can be applied either to the word before or after the target word. When searching for a pair of words, the target word is the 1st word. When sorting, numbers will be at the beginning of the list, and punctuation marks at the end of the list as separate words.
The bibliographical data (subcorpus, date, type, column, author, title) can be displayed in two ways. You can set them to appear either in text, in which case the data will appear in a separate row before every match, or in small window, in which case you point with the mouse over the number of the match. At every match only the available bibliographical data is displayed, the subcorpus is always present.

Enter the word én, set the sample size to 200 items. Sort according to the word after the target word and bibliography in small window. After the results are displayed point with the mouse to the number of any match.

Finally the last two options can be set in the 4th row.
- You can choose which subcorpus or subcorpora you want to search in. Not setting any of them will mean the same as setting all of them, i.e. the search will be performed in the whole corpus.
  
  After choosing the subcorpora, you can set the option Distribution by subcorpora. Setting this option will display, below the matches, a table with the number of occurrences per subcorpus and the frequency per million words. A diagram will also appear which shows the relative frequency of words, i.e. the distribution of matches if all of the subcorpora were the same size. Calculating the distribution data takes a lot more time than the search itself, which means that if a search is already long you could encounter the connection timeout described in section 3. Setting only one subcorpus will disable the calculation of the distribution data.
  
  Experiment with the calculation of the distribution data by entering some subcorpora-specific words, e.g.:
  nyilatkozik, vágtat, sejthártya, jogszabály, csávó.
- The last option you can set is the name of the author. Entering a part of the name is sufficient.
  Set author to Kovács and run a query.
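The figures shown by the Distribution by subcorpora option can be illustrated with a small calculation. The Python sketch below is an assumption about how such numbers are derived, not the program's actual code; the subcorpus names A and B, their sizes and the hit counts are invented.

```python
def distribution(hits, sizes):
    """Return (per_million, share) for each subcorpus.

    hits  : number of matches per subcorpus
    sizes : number of words per subcorpus
    per_million is the frequency per million words; share is the
    distribution the matches would have if all subcorpora were the
    same size (per-million values scaled so that they add up to 1).
    """
    per_million = {
        name: hits.get(name, 0) * 1_000_000 / size
        for name, size in sizes.items()
    }
    total = sum(per_million.values())
    share = {
        name: (value / total if total else 0.0)
        for name, value in per_million.items()
    }
    return per_million, share


# Invented numbers: A is four times the size of B.
sizes = {"A": 2_000_000, "B": 500_000}
hits = {"A": 40, "B": 30}
per_million, share = distribution(hits, sizes)
```

Although A has more matches in absolute terms (40 against 30), the word is three times as frequent in B: 20 against 60 occurrences per million words, i.e. shares of 0.25 and 0.75.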

Experiment with the examples of the second and third chapter with various display options.

5. The MSD code system

The MSD code is a notation for coding the morpho-syntactic features of word-forms. Knowing it is useful for two purposes:

Interpreting the MSD codes in the result of a query,
Entering the MSD codes directly instead of using the menus.

An MSD code begins with the code of the POS category, followed by a period and the code of the morpho-syntactic features. There are two exceptions: superlative (FF) and verb prefix (Pre), whose code precedes that of the POS. The morpho-syntactic features of nominals, verbs, infinitives and adverbial participles are explained below; the remaining POS categories have no morpho-syntactic features, so their MSD code is the plain POS category.

Following are the codes of the POS (part-of-speech) categories (nominals are in the first column):

N	noun
A	adjective
Num	numeral
MIA	future participle
MIB	past participle
MIF	present participle
Pro	pronoun
Adv	adverbial
Int	interjection
S	sentence
Abb	abbreviation

DIG	digit
Det	determiner
NU	postposition
V	verb
Pre	verb prefix
V.INF	infinitive
V.HIN	adverbial participle
Con	conjunction
ELO	prefix
WPUNCT	punctuation mark
SPUNCT	sentence-ending punctuation mark

In some cases the program is not able to determine the morpho-syntactic features of a word-form; such words are marked with UNKNOWNTAG.

5.1 The morpho-syntactic features of nominals

The MSD codes of nominals are constructed as follows:
superlative (FF) [->] POS code [->] comparative (FOK) [->] plural (PL) [->] possessive marker [->] anaphoric possessive marker [->] case.
The features are optional, except the POS code and the case-marking. Beside each code follow some examples.

The codes of the possessive marker:

PSe1	-m, -am, -em, -om, -öm	(házam)
PSe2	-d, -ad, -od, -ed, -öd	(házad)
PSe3	-a, -e, -ja, -je, -á, -é, -já, -jé	(háza)
PSt1	-nk, -unk, -ünk	(házunk)
PSt2	-tok, -tek, -tök, -atok, -etek, -ötök	(házatok)
PSt3	-uk, -ük, -juk, -jük	(házuk)

PSe1i	-im, -aim, -eim	(házaim)
PSe2i	-id, -aid, -eid	(házaid)
PSe3i	-i, -ai, -jai, -ei, -jei	(házai)
PSt1i	-ink, -aink, -eink, -jaink, -jeink	(házaink)
PSt2i	-itok, -itek, -jaitok, -jeitek	(házaitok)
PSt3i	-ik, -aik, -eik, -jaik, -jeik	(házaik)

The codes of the anaphoric possessive marker:

POS	-é	(övé)
POSi	-éi	(övéi)

The codes of the cases:

NOM	nominativus	∅	(kutya)
ACC	accusativus	-t, -at, -et, -ot, -öt	(autót)
DAT	dativus	-nak, -nek	(vendégnek)
ILL	illativus	-ba, -be	(színházba)
INE	inessivus	-ban, -ben	(épületben)
ELA	elativus	-ból, -ből	(iskolából)
ALL	allativus	-hoz, -hez, -höz	(Jánoshoz)
ADE	adessivus	-nál, -nél	(mozinál)
ABL	ablativus	-tól, -től	(háztól)
SUB	sublativus	-ra, -re	(székre)

SUP	superessivus	-n, -on, -en, -ön	(falon)
DEL	delativus	-ról, -ről	(emberről)
INS	instrumentalis	-val, -vel	(villával)
FAC	factivus	-vá, -vé	(édessé)
FOR	formativus	-ként, -képp, -képpen	(tolmácsként)
TEM	temporalis	-kor	(ötkor)
CAU	causalis	-ért	(győzelemért)
TER	terminativus	-ig	(májusig)
SOC	sociativus	-stul, -stül	(kamatostul)
ESS	essivus formalis	-ul, -ül	(ráadásul)

Search for the word-form élményeimmel, display the MSD-code. The MSD-code N.PSe1i.INS means: noun, 1st person singular possessive marker with possessive pluralizer, -vAl suffix.

Search for the following word-forms and interpret the MSD-codes: embereknek, házainkban, anyámét, legrégebbieket.
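The construction described in this section can be turned into a small decoder. The following Python sketch is an illustration only: the parse_nominal function is hypothetical, and it assumes that every element of the code, including FF, is separated by a period, as in the example N.PSe1i.INS. It does not check whether the POS code is really a nominal one.

```python
# Possessive markers PSe1 .. PSt3, with or without the pluralizer "i".
POSSESSIVE = {f"PS{n}{p}{i}" for n in "et" for p in "123" for i in ("", "i")}
ANAPHORIC = {"POS", "POSi"}
CASES = {
    "NOM", "ACC", "DAT", "ILL", "INE", "ELA", "ALL", "ADE", "ABL", "SUB",
    "SUP", "DEL", "INS", "FAC", "FOR", "TEM", "CAU", "TER", "SOC", "ESS",
}


def parse_nominal(msd):
    """Split a nominal MSD code into its features, in the documented order."""
    out = {"superlative": False, "pos": None, "comparative": False,
           "plural": False, "possessive": None, "anaphoric": None,
           "case": None}
    parts = msd.split(".")
    if parts[0] == "FF":                    # superlative precedes the POS
        out["superlative"] = True
        parts = parts[1:]
    if not parts or not parts[0]:
        raise ValueError(f"missing POS code in {msd!r}")
    out["pos"] = parts[0]
    rest = parts[1:]
    if rest and rest[0] == "FOK":           # comparative
        out["comparative"] = True
        rest = rest[1:]
    if rest and rest[0] == "PL":            # plural
        out["plural"] = True
        rest = rest[1:]
    if rest and rest[0] in POSSESSIVE:      # possessive marker
        out["possessive"] = rest[0]
        rest = rest[1:]
    if rest and rest[0] in ANAPHORIC:       # anaphoric possessive marker
        out["anaphoric"] = rest[0]
        rest = rest[1:]
    if len(rest) != 1 or rest[0] not in CASES:
        raise ValueError(f"missing or unknown case in {msd!r}")
    out["case"] = rest[0]                   # the case is obligatory
    return out
```

For N.PSe1i.INS the decoder reports the POS N, the possessive marker PSe1i and the case INS, and leaves the optional features unset. A code without a case, such as N.PL, is rejected.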

5.2 The morpho-syntactic features of verbs

The MSD codes of verbs are constructed as follows:
verb prefix (Pre) [->] V [->] conjugation [->] tense and mood [->] pronominal marker.
The obligatory elements are: V and the pronominal marker.

The codes of conjugation:

∅	subjective	(szeretek)
T	objective	(szeretem)
I	-lak, -lek form	(szeretlek)

The codes of the pronominal marker:

e1	1st person singular	(olvasok)
e2	2nd person singular	(olvasol)
e3	3rd person singular	(olvas)
t1	1st person plural	(olvasunk)
t2	2nd person plural	(olvastok)
t3	3rd person plural	(olvasnak)

The codes of tense and mood:

∅	present, declarative	(olvasok)
M	past, declarative	(olvastam)
F	present, conditional	(olvasnék)
P	present, imperative	(olvasd)

Search for the word form megnéztük, display the MSD-code. The interpretation of Pre.V.TMt1: verb with prefix, objective conjugation, declarative, past tense, plural, 1st person.

The word-form mondjátok has two possible meanings, therefore it can get two different MSD-codes.

V.Tt2: objective conjugation, declarative, present, plural, 2nd person.
V.TPt2: objective conjugation, imperative, present, plural, 2nd person.

Search for any form of the stem vár, and try to interpret their MSD-codes.
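Verb codes can be decoded along the same lines. The Python sketch below is hypothetical (parse_verb is not part of the HNC software). It assumes, based on the examples Pre.V.TMt1 and V.TPt2, that the conjugation and tense/mood letters are written without separators directly before the pronominal marker.

```python
import re

# Optional prefix, then V., then optional conjugation and tense/mood
# letters, then the obligatory pronominal marker.
VERB_RE = re.compile(
    r"^(?P<prefix>Pre\.)?V\."
    r"(?P<conj>[TI])?(?P<tm>[MFP])?(?P<person>[et][123])$"
)

# An absent letter (None) stands for the empty code in the tables above.
CONJUGATION = {None: "subjective", "T": "objective", "I": "-lAk form"}
TENSE_MOOD = {
    None: "present declarative",
    "M": "past declarative",
    "F": "present conditional",
    "P": "present imperative",
}
NUMBER = {"e": "singular", "t": "plural"}


def parse_verb(msd):
    """Decode the MSD code of a finite verb into readable features."""
    m = VERB_RE.match(msd)
    if m is None:
        raise ValueError(f"not a finite verb code: {msd!r}")
    person = m.group("person")
    return {
        "prefix": m.group("prefix") is not None,
        "conjugation": CONJUGATION[m.group("conj")],
        "tense_mood": TENSE_MOOD[m.group("tm")],
        "number": NUMBER[person[0]],
        "person": int(person[1]),
    }
```

parse_verb("Pre.V.TMt1") gives a prefixed verb in objective conjugation, past declarative, 1st person plural, which matches the interpretation of megnéztük above. The two codes of mondjátok, V.Tt2 and V.TPt2, differ only in the tense_mood field.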

5.3 The morpho-syntactic features of infinitives

The structure of the MSD-codes of infinitives is:

base form:
verb prefix (Pre) V.INF.
conjugated form:
verb prefix (Pre) V.INR pronominal marker.

The codes of the pronominal marker of infinitives:

V.INRe1	-nom, -nem, -nöm, -anom, -enem	(látnom)
V.INRe2	-nod, -ned, -nöd, -anod, -ened	(látnod)
V.INRe3	-nia, -nie, -ania, -enie	(látnia)
V.INRt1	-nunk, -nünk, -anunk, -enünk	(látnunk)
V.INRt2	-notok, -netek, -nötök, -anotok, -enetek	(látnotok)
V.INRt3	-niuk, -niük, -aniuk, -eniük, -niok, -niök	(látniuk)

5.4 The morpho-syntactic features of adverbial participles

The single feature of an adverbial participle is whether it has a verb prefix or not. If it has, then the POS code V.HIN is preceded by the code Pre.
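The codes of infinitives and adverbial participles are simple enough to be recognised with one pattern. The Python sketch below is hypothetical and assumes that the verb prefix is written as Pre. in front of the code, by analogy with the verb example Pre.V.TMt1.

```python
import re

NONFINITE_RE = re.compile(
    r"^(?P<prefix>Pre\.)?V\."
    r"(?:(?P<inf>INF)|INR(?P<person>[et][123])|(?P<hin>HIN))$"
)


def parse_nonfinite(msd):
    """Decode the MSD code of an infinitive or an adverbial participle."""
    m = NONFINITE_RE.match(msd)
    if m is None:
        raise ValueError(f"not an infinitive or adverbial participle: {msd!r}")
    if m.group("hin"):
        kind = "adverbial participle"
    elif m.group("inf"):
        kind = "infinitive, base form"
    else:
        kind = "infinitive, conjugated form"
    return {
        "prefix": m.group("prefix") is not None,
        "kind": kind,
        # Pronominal marker (e1 .. t3); None unless the form is conjugated.
        "person": m.group("person"),
    }
```

For example, V.INRe1 (as in látnom) is reported as a conjugated infinitive with the pronominal marker e1, and Pre.V.HIN as an adverbial participle with a verb prefix.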

6. FAQ - Frequently Asked Questions

Why is the query interface not working?

Turn on Java and JavaScript.
If you use Internet Explorer, you need at least version 5.5.
On Windows XP you must download and install Java Virtual Machine (press 'Get it now').
Some users have reported that the query interface does not work on Macintosh at all. We do not yet know the reason for this problem. If you encounter this problem and would like to help in solving it, please contact us. As we do not have access to Macintosh, we would appreciate any help in resolving this problem.

	I see a square instead of long ő and ű letters.
	If you use Windows XP, do the following. Open 'Control Panel' (switch to classic view), then 'Java Plug-in'. Either go to the 'Advanced' tab and write -Dfile.encoding=iso-8859-1 into the parameters field at the bottom, OR go to the 'Java' tab, press the upper 'View...' button and write -Dfile.encoding=iso-8859-1 into the right-hand field. Then press 'Apply'.

	What can I do if I forgot my password?
	You have to register again.

	I don't get any result, the program doesn't answer.
	See above.

	I didn't get an answer for a long time, so I closed the window. When trying to run another query I get the message “You are currently running a query and the server does not have the resources to run multiple queries from one user at the same time”. What can I do?
	You have to wait. Closing the window does not stop the server process; it may still be running and using resources. When the process has stopped, you can run a new query. If an error occurs during the process, it is stopped automatically in 15 minutes. See also above.

	What is “nominal”?
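	Nominal is a general category covering every POS category that can have a “nominal” feature (case, etc.). The reason for creating the category and the included POS categories are described in section 2.2.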

	How can I search for “suffixes”?
	Set the POS to nominal, and select the desired case in the last pull-down menu. For more details see above.

If you have any comments, please let us know.
Research Institute for Linguistics, HAS 1998-2006.
