[nlp-infra-devel] Fwd: [LORELEI-morph] Morphology list

Andras Kornai kornai at math.bme.hu
Thu Oct 1 16:23:29 CEST 2015


Hi everyone, I'm attaching a message I mentioned; I'll post one or two more later...

> Begin forwarded message:
> 
> From: David Yarowsky <yarowsky at gmail.com>
> Date: October 1, 2015 at 3:43:21 PM GMT+2
> To: Charles Yang <charles.yang at babel.ling.upenn.edu>
> Cc: "lorelei-morph at ldc.upenn.edu" <lorelei-morph at ldc.upenn.edu>
> Subject: Re: [LORELEI-morph] Morphology list
> 
> Charles (and everyone), we're *not* proposing to do away with segmentation. For derivational morphology in particular, I'm proposing an annotation convention that respects/captures both surface morphemes and morpheme order. I absolutely believe that segmentation is an important upstream process in an ensemble of methods for determining the correct lemma+feature analysis of a word, and we should strive for representations of both segmentation and lemma+features that allow deterministic mapping between the different representations, depending on both consumer needs and ease of annotation/vetting.
> 
> A key open question facing this group, however, is what format and content we produce as an annotation/output convention for our downstream users, and what information and format those downstream users/tools would prefer. There are several, hopefully interoperable, formats we could use, but given the goals of the LORELEI program, downstream users should be able to extract the information they need from an analysis in a language-independent feature space, without having to worry about language-specific and word-specific details. In particular:
> 
>  (1) It is critically important to produce a normalized *lemma* with output
> 
>     carries    carri|es  =>   carry  +V;PRS;IND;3;SG/es
>     duermen    duerm|en  =>   dormir +V;PRS;IND;3;PL/en
>     dancer     danc|er   =>   dance  +V:N(AGT/er)
> 
>        (i.e. we should not require downstream processes to do the job of allomorphic stem normalization (deciding that carri->carry, duerm->dormir and danc->dance) that we would otherwise be punting on)
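> 
> (For concreteness: a merged analysis of this shape splits deterministically into its three parts. A minimal sketch, assuming only the '+' and '/' delimiters shown above; the function name is mine:)

```python
def parse_analysis(analysis):
    """Split a merged lemma+feature analysis like 'carry +V;PRS;IND;3;SG/es'
    into (lemma, feature_list, surface_affix). The affix after '/' is optional."""
    lemma, _, tag = analysis.partition(" +")
    feats, _, affix = tag.partition("/")
    return lemma, feats.split(";"), affix or None

print(parse_analysis("carry +V;PRS;IND;3;SG/es"))
# ('carry', ['V', 'PRS', 'IND', '3', 'SG'], 'es')
```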
> 
>    (2) It is highly desirable that we produce language-independent, normalized feature representations for our morphemes and not require that our downstream processes do this either:
> 
>    hablabas    habl|abas  =>  hablar +V;PST;IMP;IND;2;SG/abas
>    comías      com|ías    =>  comer  +V;PST;IMP;IND;2;SG/ías
> 
>    craziness   crazi|ness =>  crazy  +J:N(STAT/ness)
>    normality   normal|ity =>  normal +J:N(STAT/ity)
> 
>    Canadian    Canad|ian  =>  Canada +N:N(DEMN/ian)
>    Burmese     Burm|ese   =>  Burma  +N:N(DEMN/ese)
>    Brisbanite  Brisban|ite => Brisbane +N:N(DEMN/ite)
>    Michigander Michigand|er=> Michigan +N:N(DEMN/er)
> 
>    (i.e. our downstream processes would benefit from a language-independent annotation of +DEMN (demonym) and shouldn't have to worry about normalizing -ian, -ese, -ite, -er, or about distinguishing the demonymic -er from other uses (e.g. agentive nominalization, comparative adjective, etc.))
> 
>    (3) We should be able to support templatic, circumfixal and reduplicative processes. The proposed lemma+feature representation does so seamlessly. It is less clear how a segmentational approach by itself can do so as transparently. For example, consider the following example from Maltese (on the triconsonantal root gbr), which involves the same templatic-morphology issues as the Semitic languages, without the character-set issues.
> 
>      gbarna    gbar|na   =>   gabar  +V;PST;1;PL/na
>      gabret    gabr|et   =>   gabar  +V;PST;3;SG;F/et
>      gabru     gabr|u    =>   gabar  +V;PST;3;PL/u
>      nigbru    ni|gbru   =>   gabar  +V;PRS;1;PL/ni-
>      nigbor    ni|gbor   =>   gabar  +V;PRS;1;SG/ni-
> 
>     As shown in the last 2 lines above, it's not clear how a segmentation by itself would either capture the relevant semantics of the above inflection or serve as an anchor for the relevant features (as they are conveyed by both prefixes and infixes). One could imagine an annotated segment+feature representation as follows, but it's less clear that this is the most natural and directly useful output for our downstream consumers.
> 
>      nigbru ==>  ni/PRS;1;PL + gbru/V/gabar   ???
>      nigbor ==>  ni/PRS;1;SG + gbor/V/gabar   ???
> 
>    (4) Fortunately, this isn't an either/or forced choice. As shown in all my examples above, the affixal morphemes from a segmentational analysis can be appended to a lemma+feature analysis, directly yielding a segmentation if desired. If a downstream process only cares about lemma+features in a language-independent representation, it can strip the language-specific affix after the "/". In contrast, if the features are annotations on the morphemes, then extracting language-independent lemma+features without language-specific morphological detail is more complicated for downstream users, as the language-specific detail to be stripped forms the base layer of the annotation.
> 
>    (5) The format used for elicitation and review by annotators doesn't have to be the same one used by downstream consumers, especially if they are deterministically inter-convertible.
> 
>    (6) Our proposed format puts annotations in order of likely annotator confidence
>         e.g. applies  appli|es =>  apply +V;PRS;3;SG/es
> 
>     #1 = lemma (for which even non-linguists can usually agree) = apply
>     #2 = features (which are usually transparent) = +V;PRS;3;SG
>     #3 = segmentation (less clear) = appli|es?  appl|ies?  applie|s?
> 
> By putting the segmentation in as an optional additional feature rather than as the critical base layer of the annotation, the annotation is much more tolerant of inconsistencies, differing conventions, and/or the variable output of unvetted learned segmentations than if the segmentation layer needed to be fixed first, before lemma and features could be annotated as dependents of it. If a different annotator or tool were to segment this word differently, as applie|s or appl|ies, in the base annotation layer, this would likely create greater incompatibilities between tools assuming variant segmentations.
> 
>    applie/V/apply + s/PRS;3;SG      vs.  apply +V;PRS;3;SG/s
>    appli/V/apply  + es/PRS;3;SG     vs.  apply +V;PRS;3;SG/es
>    appl/V/apply   + ies/PRS;3;SG    vs.  apply +V;PRS;3;SG/ies
> 
>    (7) I agree that one of our major challenges is deciding whether or not to segment/analyze a word (e.g. 'absent' vs 'abnormal'), although having the large majority of analyses take place via highly efficient table lookup makes this less of a problem for the run-time analyzer (and at least makes clear to all stakeholders in advance what the relative treatment of these words will be). The treatment of the large majority of words is overtly transparent rather than buried in software.
> 
>   (8) BTW, the proposed annotation framework above works fine on OOVs, e.g. your example:
> 
>    tweet|abl|ity ==> tweet +V:J(ABIL/abl) +J:N(STAT/ity)
>    tweet|er      ==> tweet +V:N(AGT/er)
> 
>     The key issue here is that the above analysis captures, in a language-independent, normalized way, that the new word tweetability refers to the state/condition/property (STAT) of being able (ABIL) to be tweeted, and not just to some language-specific, unnormalized and often ambiguous morphemes +abl, +ity and +er. We are not fully supporting the goals of the LORELEI program if we don't normalize these further into a language-independent structured morphosemantic representation, not only for inflectional morphology but for derivational morphology as well.
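> 
> (To illustrate that this works mechanically on OOVs: a toy suffix-stripping analyzer emitting the proposed output format. The rule inventory below is mine and purely illustrative, not an actual LORELEI rule set.)

```python
# Ordered (suffix, stem_replacement, tags) rules; longest/most specific first.
# The rules shown are illustrative only, chosen to cover the thread's examples.
RULES = [
    ("ability", "",  ["+V:J(ABIL/abl)", "+J:N(STAT/ity)"]),  # tweetability -> tweet
    ("iness",   "y", ["+J:N(STAT/ness)"]),                   # craziness -> crazy
    ("er",      "",  ["+V:N(AGT/er)"]),                      # tweeter -> tweet
]

def analyze_oov(word):
    """Peel known derivational suffixes off an unseen word, undoing
    simple stem allomorphy (e.g. crazi- -> crazy) via the replacement."""
    for suffix, repl, tags in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + repl
            return stem + " " + " ".join(tags)
    return word  # no analysis found: leave the word unanalyzed

print(analyze_oov("tweetability"))  # tweet +V:J(ABIL/abl) +J:N(STAT/ity)
```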
> 
>      - David
> 
> On Wed, Sep 30, 2015 at 10:06 PM, Charles Yang <charles.yang at babel.ling.upenn.edu> wrote:
> Hi David/John,
> 
> Thanks for the clarification. I didn’t see mentions of derivational morphology in your papers/slides but good to hear it’s in the works. 
> 
> It seems, from David’s note and the Leipzig glossing convention you guys adopt, that the lemma+feature format and a morpheme-based analysis may ultimately be mappable onto each other. The Leipzig rules, for instance, insist on morpheme-by-morpheme correspondence. If the JHU group is going to use some kind of morphological analyzer anyway, why eliminate the morphemes and their alternations from the final annotation? 
> 
> First, morphemes yield semantics. It is clear that semantic features, rather than morphemes, are most likely more useful for downstream tasks. But without identifying the morphemes and their alternations, how do we figure out the semantic features for unannotated words? The morphological analyzer needs to know that the “ity” in “ab-norm-al-ity” is a real morpheme that contributes nominal semantics, so it can analyze unannotated/novel words such as “tweetabil-ity”. Without morpheme boundaries marked as such, how is the substring “ity” mapped to the right semantic features? 
> 
> Second, morphological analysis is more than, and harder than, segmentation, but doing it right requires segmentation in the first place. Much of the discussion so far has focused on choices of segmentation (e.g., “running” = runn+ing or run+ing, “dancer” = dance+er or danc+er). A much harder problem, in my view, is the decision of whether to segment at all. For instance, “ab” clearly contributes the semantics of negation in “abnormality”, but probably does not in “absolution” or “absent”. And “rubber” should not be segmented as “rub+ber” or analyzed as someone who rubs or something for rubbing. We encountered these problems in our earlier work (Constantine Lignos's MORSEL system), and we also see numerous instances of them in very small and simple spoken corpora such as child-directed English. Ultimately, the problem is about discovering which POTENTIAL morphemes are real and predictably contribute to meanings. But without the potential morphemes marked, it’s difficult to even approach this problem, i.e., to assess the validity of these potential morphemes. For annotation purposes, I would venture to guess that it might be easier to segment more aggressively in the analyzer: it may be easier for a native-speaker annotator to spot the ridiculousness of “ab-solution” and “let-ter” and reject them than to ask them to segment morphemes themselves, which appears to be very difficult, as reported by others earlier in this thread. 
> 
> Finally, I am wondering about the suitability of doing away with morpheme-based analysis for Hausa and similar languages. I am just beginning to read up on Hausa morphosyntax, but the language makes extensive use of morphological alternations, some of them quite transparent, to encode syntactic and semantic alternations, so it makes good sense to use them rather than going directly for semantic features. For instance, ventivity, passivity and causativity are all realized by simple alternations to the stem-final segment: e.g., saya (‘buy’) ~ sayar (‘sell’; causative) ~ sayu (‘be well sold, be attractive to buyers’; passive) ~ sayo (‘buy and bring’; ventive). It seems highly useful to encode the alternations a~ar~u~o in addition to saya, plus whatever semantic features one chooses to encode these derivational relations.
> 
> Best,
> 
> Charles
> 
>> On Sep 28, 2015, at 7:20 PM, John Sylak-Glassman <johnsylakglassman at gmail.com> wrote:
>> 
>> Hello, Everyone,
>> 
>> I'm a postdoc working with David Yarowsky. My Ph.D. is in traditional linguistics, so I thought I'd post a response both from that point of view and from the (very much biased, but hopefully reasonable) point of view of favoring my group's own lemma+feature representations.
>> 
>> One thing to note about our proposed Universal Morphological Feature Schema (UniMorph) is that it is based on the widely used Leipzig Glossing Rules for linguistic annotation in interlinear glossed text, so it is already grounded in a de facto standard in the linguistics community, and it is based on a detailed, broad-coverage survey of the descriptive-linguistics literature across the world's language families.
>> 
>> I've attached a preprint of a paper that will be published in the Proceedings of the 2015 Workshop on Systems and Frameworks for Computational Morphology (SFCM), which describes the proposed UniMorph Schema at a level of detail beyond what we were able to present at the LORELEI PI meeting. I've also attached the slides we presented. In addition, I am working on editing a larger document that describes the schema at an even greater level of detail, and I hope to distribute that through this list by late next week.
>> 
>> As other people have noted, the main argument for lemma+features, as opposed to segmentation, is that the lemma+features approach directly distills the core meanings encoded in a given inflectional or derivational form. A morpheme-by-morpheme segmentation, by contrast, tells us more about the underlying morphological processes and overt structure, which are of interest to us as linguists but don't directly support downstream information extraction or machine translation without a lot of normalization that shouldn't be the task of our downstream customers. The lemma should be directly recoverable from the analysis, and the normalized meanings/features encoded in the affixes should be directly clear to downstream users. In general, it is important for us to separate the goals of 1) discovering the form of morphemes and 2) discovering the meanings that morphemes convey. These pieces of information are very different and have different applications.
>> 
>> To address Stephanie's question #4, we don't currently have a morphological segmenter per se because we haven't been working on the segmentation task. However, we are producing morphological analyzers that generate lemma+feature analyses that could help derive a binary segmentation via semi-supervised learning. Also, other LORELEI contributors like Mitch and Constantine have successful morphological segmenters which we would like to use rather than invent our own. There will probably be interannotator disagreement on which lemma+feature analyses are correct, but if annotators can discuss points of disagreement, the evidence for which analysis is correct will come largely down to meaning rather than to questions of analysis, which may be easier for speakers to work with. In this sense, the annotators only really have to be taught what the features mean, not necessarily how to do segmentation analysis. Also, each lemma does not need to have a huge feature set - only the features that are marked contrastively in the language on each word. For example, in Spanish, normal indicative verbs need not have an explicit IND feature - instead, we can supply that later through a rule like "if the mood is not subjunctive or imperative, the mood is indicative." That way, we can achieve very rich representations after some post-processing, but annotators don't need to mark any more features than the language overtly marks and contrasts.
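>> 
>> That post-processing step can be sketched as a simple rule over the annotated feature set (a sketch only; the mood labels SBJV/IMPER here are my assumption, not necessarily UniMorph's actual tags):

```python
OVERT_MOODS = ("SBJV", "IMPER")  # subjunctive, imperative; labels assumed for this sketch

def fill_defaults(feats):
    """Supply the unmarked default mood after annotation: if a verb carries
    no overt mood feature, add IND. Implements just the one rule described
    above; a real system would have one such rule per defaultable category."""
    feats = list(feats)
    if "V" in feats and not any(m in feats for m in OVERT_MOODS + ("IND",)):
        feats.append("IND")
    return feats

print(fill_defaults(["V", "PRS", "3", "SG"]))  # ['V', 'PRS', '3', 'SG', 'IND']
```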
>> 
>> Another solution, which we certainly support and might be easier for the LDC, is to have annotators use whichever terms they find useful for their language, as long as they're used in a regular enough way that they can be mapped to a more universal feature set. For example, "pretérito" or "preterite" could be used as an alias for the UniMorph Schema features PST;PFV (tense=past;aspect=perfective) since we separate tense and aspect but most pedagogical grammatical terms conflate them. Similarly, the VBZ Penn Treebank tag could also function as an alias and would map unambiguously to V;PRS;3;SG. If we went this route, the annotators would have to agree on a set of features to use for their language, and we would still have to ensure that annotators are capturing all the features that the language makes contrasts to show. These annotations will then need to be mapped to whichever language-independent standard LORELEI decides to use.
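>> 
>> Such alias expansion is a simple lookup; a sketch holding just the two examples above (the table and function names are mine):

```python
# Tagset- or language-specific labels -> UniMorph-style feature lists.
# Only the two aliases discussed in this thread are included here.
ALIASES = {
    "preterite": ["PST", "PFV"],
    "pretérito": ["PST", "PFV"],
    "VBZ":       ["V", "PRS", "3", "SG"],
}

def expand(label):
    """Expand an annotator's alias into universal features (identity if unknown)."""
    return ALIASES.get(label, [label])

print(expand("VBZ"))        # ['V', 'PRS', '3', 'SG']
print(expand("preterite"))  # ['PST', 'PFV']
```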
>> 
>> Best,
>>     John
>> 
>> On Mon, Sep 28, 2015 at 4:02 PM, David Yarowsky <yarowsky at gmail.com> wrote:
>> First, I'd like to clarify (in response to Charles Yang's point), that we at JHU are developing a universalized lemma+feature analyzer for derivational morphology as well. Some examples follow (with only one of several possible segmentations given):
>>  
>> TARGET WRD   SEGMENTATION   MORPHOSEMANTIC    
>> -----------------------------------------------
>> crier        cr|i|er        cry +V:N(AGT)     
>> runner       run|n|er       run +V:N(AGT)     
>> observer     observ|er      observe +V:N(AGT)
>> observee     observ|ee      observe +V:N(PAT)
>> intensity    intens|ity     intense +J:N(STAT)
>> craziness    crazi|ness     crazy +J:N(STAT)
>> normality    normal|ity     normal +J:N(STAT) 
>> abnormal     ab|normal      J:J(NEG)+ normal  
>> irregular    ir|regular     J:J(NEG)+ regular   
>> unhappy      un|happy       J:J(NEG)+ happy  
>> movable      mov|able       move +V:J(ABIL)
>> cuttable     cut|t|able     cut  +V:J(ABIL)
>> buriable     buri|able      bury +V:J(ABIL)
>> admissible   admiss|ible    admit +V:J(ABIL)
>> defensible   defens|ible    defend +V:J(ABIL)
>> Hungarian    Hungar|ian     Hungary +N:N(DEMN)
>> Canadian     Canad|ian      Canada +N:N(DEMN)
>> Ukrainian    Ukrain|ian     Ukraine +N:N(DEMN)
>> Zambian      Zambi|an       Zambia +N:N(DEMN)
>> Panamanian   Panama|n|ian   Panama +N:N(DEMN)
>> -----------------------------------------------
>> Moscovita    Moscov|ita     Moscu  +N:N(DEMN)   
>> Baltimoriano Baltimor|iano  Baltimore +N:N(DEMN)
>> Michiguense  Michiguen|se   Míchigan +N:N(DEMN) 
>> -----------------------------------------------
>> Moskvan      Moskva|n       Moskva +N:N(DEMN)   
>> Moskevský    Moskev|ský     Moskva +N:J(DEMN)   
>> -----------------------------------------------
>>  
>> The goal is to convey both the semantic and part-of-speech transductions inherent in derivational affixes in a way that ports across languages and facilitates both machine translation and information extraction, and does so in a way that abstracts away from particular affix choices and allomorphies (e.g. un-, il-, ir-, in-, ab- all capture the derivational concept J:J(NEG), or adjective-to-adjective(negated), and have advantages in being normalized as such).
>>  
>> Sequences of normalized derivational affixes/features are also nested, yielding a parse:
>>  
>>   abnormality  ab|normal|ity     [[ J:J(NEG)+ normal ] +J:N(STAT)]
>>   inadmissible in|admiss|ible    [J:J(NEG)+ [admit +V:J(ABIL)]]
>>  
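>> Mechanically, such a nested parse falls out of folding the affix tags over the base, innermost first (a toy sketch; the helper name is mine):

```python
def nest(base, steps):
    """Compose derivational steps into a bracketed parse string.
    Each step is (tag, position): 'prefix' tags wrap on the left,
    anything else wraps on the right as a suffix."""
    out = base
    for tag, pos in steps:
        out = f"[{tag}+ {out}]" if pos == "prefix" else f"[{out} +{tag}]"
    return out

print(nest("normal", [("J:J(NEG)", "prefix"), ("J:N(STAT)", "suffix")]))
# [[J:J(NEG)+ normal] +J:N(STAT)]
```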
>> We are still fully fleshing out our proposed treatment of derivational morphology, and look forward to working with the LORELEI team to find a set of interoperable conventions that satisfy the various needs of the program and the existing (or desired) tools/components that team members wish to bring to the table.
>>  
>> (2) We recognize that there are multiple potential standards for representing a normalized feature set in a lemma+feature analysis, but the key goal in any sufficiently normalized standard is interoperability between such standards.
>>  
>> For example, the following 3 representations are functionally equivalent:
>>  
>>     * Our Leipzig-based UniMorph features:  pos=V;tns=PRS;per=3;num=SG;  
>>  
>>     * Prague positional-based features:     VV-S---3P------  
>>  
>>     * Penn Treebank Tag (English specific): VBZ
>>  
>> Hence we should be concerned less with the exact notation conventions chosen, and more with making them fully interoperable and mappable to/from other standards and sufficiently detailed to capture the feature nuances of the full range of world languages that LORELEI will need to deal with, as has been our focus.
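>> 
>> Interoperability of this kind can be demonstrated by mapping each notation into a common feature dictionary; the sketch below handles the UniMorph string and a one-entry Penn Treebank table (a Prague positional decoder would be a third such function). Function names are mine:

```python
def from_unimorph(s):
    """Parse 'pos=V;tns=PRS;per=3;num=SG;' into a feature dict."""
    return dict(kv.split("=") for kv in s.rstrip(";").split(";"))

def from_ptb(tag):
    """Expand a Penn Treebank tag via a lookup table (only VBZ shown here)."""
    table = {"VBZ": {"pos": "V", "tns": "PRS", "per": "3", "num": "SG"}}
    return table[tag]

# Functional equivalence of the notations is then just dict equality:
print(from_unimorph("pos=V;tns=PRS;per=3;num=SG;") == from_ptb("VBZ"))  # True
```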
>>  
>> (3) We believe that there is substantial value in morpheme segmentation as a component task of morphological analysis, and we use segmentational models as a component of the machine learning of lemma+feature output. Our residual concerns are:
>>  
>      (a) There is a substantial lack of consensus in both the linguistics and NLP communities regarding what a gold-standard analysis should look like:
>>  
>        should "cries" be segmented as:    crie|s  cri|es  cr|i|es ?
>>  
>>        should "running" be segmented as:  run|n|ing  runn|ing  run|ning ?
>>  
>>        should "dancer" be segmented as:   danc|er  dance|r ?
>>  
>>        should "hablamos" be segmented as: habl|amos  habl|a|mos habla|mos ?
>>  
>>      (b) As we've previously noted, the large majority of downstream LORELEI tasks more directly need to know the lemma plus normalized semantic features rather than have a segmentation into variable morphemes that they then have to normalize and map to such features themselves.  Furthermore, the primary application where a segmentational output would be most directly useful as an input feature space - namely ASR - would very likely want a different kind of segmentation than the other downstream tasks (MT/IE).
>>  
>> (4) There would seem to be value in having a standardized merged form of analysis output that captures the strengths of both the segmentational and lemma+feature analyses.
>>  
>>  
>> Towards this end, I note that most of the previous discussion on this list (and in the literature) leans toward one of the following goals:
>>  
>>    * maximal stem regularity: run|ning, dance|d, cri|es (similar to IR STEMMING)
>>  
>>    * maximal affix regularity: runn|ing, danc|ed, crie|s (useful for POS tagging)
>>  
>> Consensus on this list seems to favor the latter. If we were to adopt this segmentational objective, an additional advantage is that it supports a direct merger with a lemma+feature representation. For example:
>>  
>>   dancer       danc|er        dance +V:N(AGT/er)
>>   operator     operat|or      operate +V:N(AGT/or)
>>   intensity    intens|ity     intense +J:N(STAT/ity)
>>   craziness    crazi|ness     crazy +J:N(STAT/ness)
>>   abnormal     ab|normal      J:J(NEG/ab)+ normal
>>   irregular    ir|regular     J:J(NEG/ir)+ regular
>>   unhappy      un|happy       J:J(NEG/un)+ happy
>>  
>>   abnormality  ab|normal|ity     [[ J:J(NEG/ab)+ normal ] +J:N(STAT/ity)]
>>   inadmissible in|admiss|ible    [J:J(NEG/in)+ [admit +V:J(ABIL/ible)]]
>>  
>>   hablamos    habl|amos       hablar +V;IND;PRS;1;PL/amos
>>   dormimos    dorm|imos       dormir +V;IND;PRS;1;PL/imos
>>   hablan      habl|an         hablar +V;IND;PRS;3;PL/an
>>   duermen     duerm|en        dormir +V;IND;PRS;3;PL/en
>>  
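>> The "directly yields a segmentation" claim above is a one-line operation: the material after '/' is the surface affix, and the remainder of the surface word is the (possibly allomorphic) stem. A sketch (the function name is mine):

```python
def segment(surface, analysis):
    """Recover the maximal-affix segmentation from a merged analysis."""
    affix = analysis.rsplit("/", 1)[1]
    assert surface.endswith(affix), "affix must match the word's surface end"
    return surface[: -len(affix)] + "|" + affix

print(segment("hablan",  "hablar +V;IND;PRS;3;PL/an"))  # habl|an
print(segment("duermen", "dormir +V;IND;PRS;3;PL/en"))  # duerm|en
```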
>> This has the following advantageous properties:
>>  
>>    (a) the observed affixes serve as a language-specific subfeature to the universal categories (e.g. AGT/er, AGT/or,  NEG/un, NEG/ab, NEG/ir, etc.), which may capture additional semantic nuance (although not necessarily one that transfers crosslinguistically)
>>  
>>    (b) The lemma+feature tag with this additional affix detail directly yields a segmentation consistent with the maximal-affix-regularity segmentation above (e.g. intens|ity), while the lemma serves as an implicit normalized annotation of the residual stem.
>>  
>>    (c) The traditional morpheme-focused labelling that assigns semantics and normalization to morphemes:
>>  
>>      danc|er  ==>  danc/VERB/dance     +er/AGT
>>      duerm|en ==>  duerm/VERB/dormir   +en/IND;PRS;3;PL
>>  
>>   is essentially equivalent to this expanded lemma+feature structure
>>  
>>      danc|er  ==>  dance  +V:N(AGT/er)
>>      duerm|en ==>  dormir +V;IND;PRS;3;PL/en
>>  
>>    with the primary difference being that the lemma+feature version makes the key information that downstream applications care most about (lemma+features) more overtly accessible, with the language-specific information that the downstream applications may care less about (e.g. the actual surface morphemes) more readily stripped if not relevant to that application. If the surface morphemes themselves are the anchor layer, then they are harder to remove in a downstream process that doesn't care about this surface morpheme detail.
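>> 
>> For the inflectional case, that equivalence is a deterministic rewrite; a sketch (the POS abbreviation map is my assumption):

```python
def to_lemma_feature(morph):
    """Convert a morpheme-anchored inflectional analysis like
    'duerm/VERB/dormir +en/IND;PRS;3;PL' into the lemma+feature form
    'dormir +V;IND;PRS;3;PL/en'. Handles the inflectional case only;
    the derivational case additionally needs the POS transduction."""
    stem_part, affix_part = morph.split(" +")
    _, pos, lemma = stem_part.split("/")
    affix, feats = affix_part.split("/")
    pos_abbrev = {"VERB": "V", "NOUN": "N", "ADJ": "J"}[pos]  # assumed map
    return f"{lemma} +{pos_abbrev};{feats}/{affix}"

print(to_lemma_feature("duerm/VERB/dormir +en/IND;PRS;3;PL"))
# dormir +V;IND;PRS;3;PL/en
```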
>>  
>> In any event, the maximal-affix segmentation approach that appears to be the current consensus (danc|er and not dance|r) would seem to facilitate this interoperability between annotation conventions.
>>  
>> And, again, we believe that various unsupervised and semi-supervised segmentation models proposed for use in this program have substantial value towards these ends, and we look forward to integrating with them at multiple layers, including as features in a universalized lemma+feature learning stage.
>>  
>> John Sylak-Glassman (also from JHU) will follow up with some additional thoughts.
>>  
>>         - David
>> 
>> _______________________________________________
>> lorelei-morph mailing list
>> lorelei-morph at ldc.upenn.edu
>> http://newlists.ldc.upenn.edu/listinfo/lorelei-morph
>> 
>> 
>> 
>> 
>> -- 
>> John Sylak-Glassman
>> Postdoctoral Researcher
>> Center for Language and Speech Processing
>> Johns Hopkins University
>> johnsylakglassman at gmail.com
>> [Attachments: SFCM 2015 preprint "A Universal Feature Schema for Rich Morphological Annotation and Fine-Grained Cross-Lingual Part-of-Speech Tagging"; slides "JHU-Yarowsky-lg-universals-and-typology.pdf"]
