[nlp-infra-devel] Fwd: [LORELEI-morph] Morphology list
Andras Kornai
kornai at math.bme.hu
Thu Oct 1 16:28:13 CEST 2015
Here's another one...
> Begin forwarded message:
>
> From: David Mortensen <dmortens at scs.cmu.edu>
> Date: September 29, 2015 at 4:39:44 PM GMT+2
> To: Michael Maxwell <mmaxwell at casl.umd.edu>
> Cc: "lorelei-morph at ldc.upenn.edu" <lorelei-morph at ldc.upenn.edu>
> Subject: Re: [LORELEI-morph] Morphology list
>
> I just looked back over some of Michael Maxwell’s comments from a few days ago and wanted to second a few of his concerns and observations.
>
> (1) I have spent a certain amount of time working on extremely low resource (in some cases, previously undocumented) Tibeto-Burman languages from Northeastern India. These languages are transparently agglutinative with few morphophonemic alternations and a clear correspondence between syllables and morphemes. Despite this, speakers typically show very little ability to manipulate anything smaller than a word. Certainly, some speakers with a higher degree of metalinguistic awareness can perform operations on affixes, but these are not typical speakers and even they only develop this kind of ability with time and exposure to tasks that require skills of this kind. If the task we are demanding of annotators is unrealistic, the output they produce is likely to be inconsistent and ultimately less useful.
>
> (2) For many cases (counterintuitively), stem/lemma + features is actually likely to be easier for annotators than stem/lemma + affixes. While I would allow that there is cross-linguistic variation in the atomicity of words, and the degree to which speakers are aware of their internal composition, experience leads me to expect that speakers are better (though still rather poor) at identifying the inflectional properties of a word than at segmenting a word into morphemes or stem/lemma + affixes. Certainly, I think that the analyses exemplified in David Yarowsky’s last message would be wonderful (as an output from a morphological analyzer), but expecting this from annotators is setting the bar too high (without extensive scaffolding).
>
> (3) Requiring annotators to provide features (rather than “glosses”) is also likely to be too high a bar. Providing glossing conventions can at least allow annotators to do a consistent job, even if the results are not as informative as we might hope. It might then be possible to provide meta-annotations that describe possible mappings between morpheme glosses and inflectional features. Or not.
>
>
> Best,
> David R. Mortensen
>
>> On Sep 25, 2015, at 6:41 PM, Michael Maxwell <mmaxwell at casl.umd.edu> wrote:
>>
>> I'm going to throw in my two cents. My background: I've done morphology in extremely low density (previously unwritten) languages in SIL, annotation in LDC, and built some morph parsers.
>>
>> First, the bad news: echoing Martha Palmer's note below, getting native speakers to annotate morphology is really hard, and even harder if you want it done right. The breakdown of words into morphemes (regardless of whether those morphemes are described as strings or as features) is not accessible at most speakers' conscious level. We've had difficulty getting good annotations at an even simpler level where the annotator's task was to provide the dictionary citation form for each word (we gave them a dictionary), so they could throw away the affixes. Getting them to annotate the affixes as well will obviously make the task more difficult.
>>
>> Between the stem (or lemma) + affixes-as-strings approach and the stem (or lemma) + affixes-as features approach: paradoxical as it might seem, for purposes of annotation, I suspect the latter is easier, and it provides more useful information. It's easier because it avoids the cut ambiguity problems that have been described in earlier msgs in this thread:
>> running = run+ing with a missing 'n', vs. runn+ing vs. run+ning
>> ran = ??
>> Spanish tiene = tien + e, but where does 'tien' come from? (the lemma is tener, and the "normal" stem is 'ten')
>> The affixes-as-strings approach also doesn't provide as much information, e.g. runs = run+s, but is the suffix a noun plural or a verb 3sg.pres.? (I'm assuming that the affixes-as-strings approach would come with a dictionary of affix meanings; if not, a lot of potentially important information might be lost, as Boyan pointed out in his 23 Sep 3:04 PM email.)
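>>
>> To make the contrast concrete, here is a small Python sketch (purely illustrative; the analyses and feature labels are invented, not any analyzer's actual output):
>>
>>     token = "running"
>>
>>     # stem + affixes-as-strings: three defensible cuts, so someone
>>     # has to make an arbitrary choice among them
>>     string_cuts = [("run", "ing"),   # drops an 'n' from the surface form
>>                    ("runn", "ing"),  # keeps all characters, distorts the stem
>>                    ("run", "ning")]  # keeps all characters, distorts the suffix
>>
>>     # lemma + affixes-as-features: one record per reading, and "ran"
>>     # is no harder, since no cut is needed at all
>>     running = {"lemma": "run", "features": ["PROG"]}
>>     ran     = {"lemma": "run", "features": ["PAST"]}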
>>
>> There's also the question of lemma vs. stem, which Ann Bies mentions below. While finding the stem might seem easier (it's certainly closer to the surface), arbitrary decisions arise again (as with the three examples above). And checking for consistency of annotation is easier with lemmas (you can look in a dictionary). And finally, I would think a lemma would be much more useful to a downstream processor, since you can (hopefully) look it up in a dictionary; there's no guarantee you can do that with a stem.
>>
>> I think most morphological analyzers output the lemma + affixes-as-features format. E.g. that's the format used in Beesley and Karttunen's book on finite state transducers (specifically xfst and lexc). That would be important if the input to the annotators is automatically analyzed texts. It's also the format that linguists usually use in simpler interlinear texts (by "simpler" I mean the typical 3-line interlinears, with orthographic line, lemma + affix gloss line, and free translation line). However, it's not, as far as I know, the format that any machine learning program trained on an unannotated corpus could give, and it probably wouldn't be as useful a format to be used for training a morph parser.
>>
>> One caution: people have talked in terms of morphosyntactic "features." I suspect it would be easier for purposes of annotation to use glosses, which maintain some correlation between the surface words and the affixes-as-glosses. If you're annotating an agglutinating language, for instance, it's probably easier to be consistent by tagging the affixes than it is to produce a set of features. Otherwise you run the risk of leaving out features, particularly in languages that have both prefixing and suffixing. Also, features can be messier to work with than glosses, e.g. you can gloss the English verb "run" as something like "Pres.Non3Sg", but the features for that would look like:
>> [[Tense present] [Subject [[[Person 1 | 2] [Number singular]] | [Number plural]]]]
>> or marginally better:
>> [[Tense present] [Subject [NOT [[Person 3] [Number singular]]]]]
>> That said, if the glosses are chosen right, there's a mapping from glosses to features. It might also be possible to come up with a mapping from glosses to _canonical_ affix shapes, to answer Lane Schwartz's 23 Sep 10:41 AM email.
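>>
>> As a toy illustration of such a gloss-to-feature mapping (the gloss names and the table itself are invented for the example, reusing the bracket notation above):
>>
>>     GLOSS_TO_FEATURES = {
>>         "Pres.Non3Sg": "[[Tense present] [Subject [NOT [[Person 3] [Number singular]]]]]",
>>         "Pres.3Sg":    "[[Tense present] [Subject [[Person 3] [Number singular]]]]",
>>         "Past":        "[[Tense past]]",
>>     }
>>
>>     def expand(gloss):
>>         # unknown glosses pass through unexpanded
>>         return GLOSS_TO_FEATURES.get(gloss, gloss)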
>>
>> Mike Maxwell
>> Area Director, Technology Use
>> CASL/ University of Maryland
>> "Digital data lasts forever -- or five years, whichever comes first."
>> --Jeff Rothenberg, 1997
>>
>> -----Original Message-----
>> From: lorelei-morph [mailto:lorelei-morph-bounces at ldc.upenn.edu] On Behalf Of Ann Bies
>> Sent: Thursday, September 24, 2015 9:12 PM
>> To: lorelei-morph at ldc.upenn.edu
>> Subject: Re: [LORELEI-morph] Morphology list
>>
>> Yes, I think it's fair to say that LDC's morphological annotation task will be largely dependent on how and how well the automated tools work, and it is difficult for LDC to judge just what the tradeoffs will be in the absence of having seen any analyzer output or knowing what kind of performance we can expect on the RL languages.
>>
>> In the LRL packages, our approach to morphological annotation was tightly integrated with the morph analyzers we were building. Details are in the enclosed slides, but the basic approach included lemma, segmentation, and tags for each segment. Annotation also included a wildcard function (tied to the analyzer) in order to allow some treatment for out of vocabulary words. Multiple iterative rounds of annotation and analyzer improvements were required to get decent coverage. Since the development or adaptation of analyzers is outside of the scope of our tasking for LORELEI, we expect to modify our approach to morphological annotation along the following lines.
>>
>> 1. If we go with the lemma+features approach, we would plan to use an available universal analyzer to provide lemma+features solutions for each word as input to the annotation task. Non-linguist annotators would then choose from complete solutions *as given* by the analyzer.
>>
>> To serve as input to the annotation pipeline, analyzer output would need to have the following properties (natively, or via some post-processing step):
>> - a list of solutions short enough for a human to choose from
>> - human readable (and non-linguist understandable) set of features
>> - lemmas recognizable by non-linguist annotators as related to the surface form
>>
>> Annotators would accept the lemma as-is from the analyzer; we would assume no human editing of lemmas or features (it's not reasonable to expect non-linguist annotators to do this very well).
>>
>> Since we would not expect the provided analyzer to have been developed with the manual annotation task in mind, we assume that the analyzer would not have any kind of wildcard function. In this case, annotators would simply choose "unanalyzable" if the analyzer has no solution or no correct solution for the token. This could result in higher rates of unanalyzed tokens, compared to what we've seen for Turkish or Hausa. Of course, we'd gladly use a wildcard function if it's available.
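>>
>> A minimal Python sketch of that division of labor (all names hypothetical; analyze() stands in for whatever universal analyzer gets plugged into the pipeline):
>>
>>     UNANALYZABLE = "unanalyzable"
>>
>>     def annotate(token, analyze, annotator_picks):
>>         # Offer the analyzer's solutions exactly as given; the annotator
>>         # either selects one or marks the token unanalyzable. No human
>>         # editing of lemmas or features.
>>         solutions = analyze(token)
>>         if not solutions:
>>             return UNANALYZABLE
>>         choice = annotator_picks(token, solutions)
>>         return choice if choice is not None else UNANALYZABLE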
>>
>> With these assumptions in place, we think we could complete ~10,000 words of lemma+features morphological annotation for each RL as planned.
>> However, we haven't seen any analyzer output yet, so we can't really say anything definitive about how challenging the analyzer solutions will be for the human annotators.
>>
>> As an additional note, morphological annotation of this type would probably not be very good input for a morphological alignment task, so we would need to think about what to do there (redefine the task to, say, word alignment, or else drop it entirely and apply equivalent effort elsewhere).
>>
>> 2. If we go with stem+segments (or lemma+segments), we would plan to use an available universal analyzer to provide stem+segments solutions for each word as input to the annotation task. Non-linguist annotators would then choose from complete solutions *as given* by the analyzer.
>>
>> Here again, annotators would accept as the stem (or lemma) whatever the analyzer outputs, and the lemma/stem would need to be recognizable by non-linguist annotators as related to the surface form.
>>
>> Annotators would not edit the segments or stems, and assuming lack of a wildcard function annotators would choose "unanalyzable" if the analyzer has no solution or no correct solution for the token.
>>
>> Since the only analyzer we've seen along these lines so far outputs segmentation but not tags, in the simplest form, this annotation would not include morphological labels for the stem or segments. We could have annotators manually provide part-of-speech tags for the stem/lemma, but manually providing labels for the other segments from scratch would far exceed our available cycles, and would be extremely error prone.
>>
>> Segments generated via this process could potentially serve as input for a morphological alignment task, whether or not the segments are also tagged.
>>
>> Ann
>>
>>
>> On 9/24/2015 1:42 PM, Constantine Lignos wrote:
>>> This is very helpful and clear, thank you. The original question from
>>> Boyan was put as "stem and segments" vs. "lemma and features". Note
>>> that the Turkish example below might be best described as
>>> "feature-aligned segmentation with lemma and POS" (with the understood
>>> caveat that sometimes the lemma isn't there). I'm going to guess that
>>> most folks would be happy (I certainly am!) with the Turkish
>>> representation if it could be done for all languages, which I know is
>>> a *gigantic* if.
>>>
>>> All the information in that representation is potentially useful:
>>> lemmas, segments, features, and segment-feature alignment. It's hard
>>> to say at the start of the program which of these things one should
>>> prefer to give up given constrained annotation resources. Modest gains
>>> have certainly been shown in downstream tasks given some amount of
>>> morphological analysis and/or segmentation, but the uncharted waters
>>> (that this program is uniquely suited to explore) lie at what is
>>> useful given relatively small amounts of data and the types of
>>> potential projection scenarios. The downstream models will change, so
>>> it's not so clear whether the morphological information that is most
>>> useful to today's models will remain useful to the models that will
>>> develop over the course of the program.
>>>
>>> I think in order to come to some consensus about what annotation is
>>> most useful we might need to hear from LDC what the tradeoffs are and
>>> how they might pan out across languages. It seems to me that LDC's
>>> annotation task is largely dependent on how and how well the automated
>>> tools work (hence Boyan's question (2)). What does the availability of
>>> analyzers for other RL languages look like at the moment, and what
>>> would the LDC's plans look like for languages with much less mature or
>>> non-existent tools?
>>>
>>> -Constantine
>>>
>>> On 9/24/2015 11:07 AM, Jonathan Wright wrote:
>>>> To expand on Ann’s descriptions, and also give people more options:
>>>>
>>>> The current LRL packs do contain lemmas, with some exceptions. I
>>>> believe when an annotator had to type an analysis from scratch, they
>>>> would *not* enter a lemma, just segmentation and features. However,
>>>> when choosing from the analyses presented by the analyzer, there
>>>> *would* be a lemma. Consider the Turkish word for book:
>>>>
>>>> NW_ZAM_TUR_0000969_20140900
>>>> id="token-8-7"
>>>> pos="NOUN"
>>>> morph="kitab:kitap=NOUN ı:ı=POSS_3S nda:nda=CASE_LOC"
>>>> kitabında
>>>>
>>>> The stem-final /p/ becomes [b].
>>>>
>>>> The format here is an array of elements of the form
>>>> segment:lemma=tag, such that concatenating the segments gives the
>>>> original token, even with capitalization. Lemmas for inflections
>>>> were always just the original string, I believe, but you could imagine
>>>> something like “s” for an English “plural lemma” as in Constantine’s
>>>> examples. We thought this was a best of both worlds approach, in the
>>>> absence of a spec. But this does emphasize Ann’s point that when the
>>>> analyzer fails, annotators might not be giving as rich a
>>>> representation as the analyzer would.
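>>>>
>>>> A minimal Python reader for that format (a sketch, not LDC code) makes the concatenation invariant explicit:
>>>>
>>>>     def parse_morph(morph, token):
>>>>         parts = []
>>>>         for element in morph.split():
>>>>             seg_lemma, tag = element.rsplit("=", 1)
>>>>             segment, lemma = seg_lemma.split(":", 1)
>>>>             parts.append((segment, lemma, tag))
>>>>         # concatenating the segments must rebuild the original token
>>>>         assert "".join(seg for seg, _, _ in parts) == token
>>>>         return parts
>>>>
>>>>     parse_morph("kitab:kitap=NOUN ı:ı=POSS_3S nda:nda=CASE_LOC", "kitabında")
>>>>     # -> [('kitab', 'kitap', 'NOUN'), ('ı', 'ı', 'POSS_3S'), ('nda', 'nda', 'CASE_LOC')]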
>>>>
>>>> Note that Constantine's point about "run-ning" vs. "runn-ing"
>>>> still stands: the developer of the analyzer would probably
>>>> have to make an arbitrary decision to favor fidelity to stems or
>>>> to inflections.
>>>>
>>>> Jon
>>>>
>>>>> On Sep 23, 2015, at 7:03 PM, Martha Palmer
>>>>> <Martha.Palmer at colorado.edu> wrote:
>>>>>
>>>>>
>>>>> Just for the record, getting the right lemmatizations for the Arabic
>>>>> treebank, and especially for the Egyptian dialect, was a two-year
>>>>> nightmare.
>>>>>
>>>>> Martha
>>>>>
>>>>>> On Sep 23, 2015, at 2:44 PM, Ann Bies <bies at ldc.upenn.edu> wrote:
>>>>>>
>>>>>> For the LRL packages, stem+segments morphological annotation was a
>>>>>> division of the input string into segments, and the assignment of a
>>>>>> morphological tag to each segment. So, "running" in English was
>>>>>> divided into "runn" and "ing" (preserving input characters, and
>>>>>> keeping the inflection string as constant as possible), with the
>>>>>> tags for each segment. A connection to a lemma could potentially
>>>>>> be made using the lexicon (for Turkish and Hausa) along with the
>>>>>> morphological annotation, but lemma assignment along with the
>>>>>> stem+segments annotation was not done for those packages.
>>>>>>
>>>>>> In considering the feasibility of lemma+features morphological
>>>>>> annotation for LORELEI, two questions would certainly be how to
>>>>>> assign and how to define the lemmas, and how this will fit into the
>>>>>> manual morphological annotation task. One possibility would be to
>>>>>> use a morphological analyzer and define the lemma as whatever the
>>>>>> analyzer outputs in each solution. The question of normalization
>>>>>> such as in Constantine's (2) would immediately arise, however,
>>>>>> since that output may very well be something like an unnormalized
>>>>>> stem. Normalization of this kind would be an extensive manual
>>>>>> annotation task, if normalized lemmas aren't provided by the
>>>>>> analyzer. Note that even if an analyzer does provide lemmas, there
>>>>>> would presumably be tokens that do not have an analysis or a
>>>>>> correct analysis from the analyzer.
>>>>>>
>>>>>> Ann
>>>>>>
>>>>>>
>>>>>> On 9/23/2015 2:49 PM, Constantine Lignos wrote:
>>>>>>> I think it would be useful for the purposes of this discussion to
>>>>>>> have a clearer idea of what "stem + segments" means. Particularly,
>>>>>>> many different kinds of annotation could be called that, largely
>>>>>>> varying in whether they must be a strict segmentation of the
>>>>>>> original and how abstract the segmented units are. These choices
>>>>>>> are roughly parameterized by the following:
>>>>>>>
>>>>>>> 1. Must the stem + segments contain the exact same characters (in
>>>>>>> order) as the input form? For example, is segmenting the
>>>>>>> form "running" as "run ing" allowed (where an "n" has been
>>>>>>> removed), or must it be something like "run ning" or "runn ing"
>>>>>>> which preserve all input characters?
>>>>>>>
>>>>>>> 2. If the answer to (1) is no, how much normalization is done on
>>>>>>> the stem and segments? For example, in English, we might decide
>>>>>>> the stem + segments annotation of "babies" is "baby s" or "baby
>>>>>>> es" if we want to normalize the stem and/or suffix. The benefit of
>>>>>>> this normalization is that it is pretty clear that "baby" and
>>>>>>> "baby s" have something in common, less so for "baby" and "babie
>>>>>>> s". On the suffix front, one might decides that "kicks" and
>>>>>>> "mixes" should be made to "look alike" by segmenting as "kick s"
>>>>>>> and "mix s". On the stem front, one can also normalize a stem
>>>>>>> change: "frozen" as "freeze en", or in Spanish we have the
>>>>>>> diphthongization example of "cuento" as "cont o" (the infinitive
>>>>>>> is "contar", the stem is usually written as "cont-" but the vowel
>>>>>>> changes under certain conditions).
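>>>>>>>
>>>>>>> Laid out side by side in a quick sketch (all analyses illustrative only), the two ends of that spectrum look like:
>>>>>>>
>>>>>>>     # answer to (1) is "yes": every input character preserved
>>>>>>>     SURFACE_FAITHFUL = {"babies": ("babie", "s"),
>>>>>>>                         "mixes":  ("mix", "es")}
>>>>>>>
>>>>>>>     # answer to (1) is "no", with heavy normalization of stems and suffixes
>>>>>>>     NORMALIZED = {"babies": ("baby", "s"),
>>>>>>>                   "mixes":  ("mix", "s"),
>>>>>>>                   "frozen": ("freeze", "en"),
>>>>>>>                   "cuento": ("cont", "o")}  # diphthongization undone
>>>>>>>
>>>>>>>     assert NORMALIZED["babies"][0] == "baby"        # unifies with the bare form
>>>>>>>     assert SURFACE_FAITHFUL["babies"][0] != "baby"  # "babie" does not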
>>>>>>>
>>>>>>> Historically, (1) is "Yes" when you want morphological
>>>>>>> segmentation to be a binary classification task similar to the
>>>>>>> traditional word segmentation task, where between every pair of
>>>>>>> characters the system decides whether to insert a boundary or not.
>>>>>>> Some unsupervised systems that discover concatenative units (e.g.,
>>>>>>> Morfessor) operate under this model. Unless one is annotating data
>>>>>>> solely for the purpose of evaluating the binary classification
>>>>>>> task, I do not think saying "yes" to (1) is a good strategy,
>>>>>>> especially since it becomes very difficult to produce sensible
>>>>>>> segmentations (see Creutz and Lindén's "Hutmegs" for some
>>>>>>> discussion of this).
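>>>>>>>
>>>>>>> In that framing, a strict segmentation reduces to one binary decision per character gap; a small sketch (not Morfessor's actual interface):
>>>>>>>
>>>>>>>     def boundaries(token, segments):
>>>>>>>         # one 0/1 decision between each adjacent pair of characters;
>>>>>>>         # only well-defined when the segments rebuild the token exactly
>>>>>>>         assert "".join(segments) == token
>>>>>>>         cuts, i = set(), 0
>>>>>>>         for seg in segments[:-1]:
>>>>>>>             i += len(seg)
>>>>>>>             cuts.add(i)
>>>>>>>         return [1 if j in cuts else 0 for j in range(1, len(token))]
>>>>>>>
>>>>>>>     boundaries("running", ["runn", "ing"])  # -> [0, 0, 0, 1, 0, 0]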
>>>>>>>
>>>>>>> Where one lands on (2) depends on many factors, including what one
>>>>>>> believes stems and affixes should be and how many edge cases there
>>>>>>> are that make this hard to annotate. When (2) tends toward the
>>>>>>> most abstract (for example, "baby s" for "babies" and "mix s" for
>>>>>>> "mixes") this becomes similar to the lemma + feature annotation in
>>>>>>> that a normalized stem may be of equivalent usefulness to a lemma.
>>>>>>> However, the key difference is that the segments, even if heavily
>>>>>>> normalized, are still surface forms and not features.
>>>>>>>
>>>>>>> Stephanie/Boyan, can you say where you would imagine the stem +
>>>>>>> segmentation annotation might fall on this spectrum given the
>>>>>>> languages and budgeted time for this project?
>>>>>>>
>>>>>>> -Constantine
>>>>>>>
>>>>>>> On 9/23/2015 10:46 AM, Stephanie M. Strassel wrote:
>>>>>>>> A couple additional questions/observations:
>>>>>>>>
>>>>>>>> 4. Prior approaches to manual morph annotation have assumed that
>>>>>>>> a (non-linguist) human annotator will be able to review the
>>>>>>>> analyzer output and select the optimal solution for each token.
>>>>>>>> If we are to continue using this approach, LDC will need access
>>>>>>>> to a morph analyzer that can be plugged into the annotation
>>>>>>>> pipeline. There are a few other considerations in this case -
>>>>>>>> timing, but also the nature of the output itself (a huge list of
>>>>>>>> possible solutions and/or a huge feature set for each lemma is
>>>>>>>> incompatible with manual annotation.)
>>>>>>>>
>>>>>>>> 5. If we adopt the proposed approach, what is the impact on the
>>>>>>>> morph alignment task? Morph alignment assumes that we have
>>>>>>>> individually segmented/labeled morphemes, which won't exist in
>>>>>>>> the lemma+feature approach. In that case, would we want to pursue
>>>>>>>> word alignment instead of morph alignment? Would we instead want
>>>>>>>> to drop the alignment task and shift those resources to another
>>>>>>>> area (e.g. Additional QC on morph annotation? Richer topic and/or
>>>>>>>> sentiment labeling? Something else?)
>>>>>>>>
>>>>>>>> Stephanie
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/23/15 10:25 AM, Onyshkevych, Boyan wrote:
>>>>>>>>> Welcome to the LORELEI Morphology list,
>>>>>>>>>
>>>>>>>>> This list has been created to facilitate discussion of program
>>>>>>>>> topics and issues related to morphology. A core question about
>>>>>>>>> morphology came up at the kickoff meeting, which I'm reopening
>>>>>>>>> below, but everyone else should feel free to add questions or
>>>>>>>>> answers as they see fit.
>>>>>>>>>
>>>>>>>>> At the kickoff, David Yarowsky threw down the gauntlet with his
>>>>>>>>> claim (permit me some narrative license) that morphological
>>>>>>>>> marking of lemma + features instead of stem + segments is more
>>>>>>>>> general and more useful. LDC can either annotate one or the
>>>>>>>>> other for the limited morpho-annotated corpus they are
>>>>>>>>> producing, but not both.
>>>>>>>>>
>>>>>>>>> Furthermore, David proposed and briefly described a specific
>>>>>>>>> format and feature inventory for the lemma+feature annotation
>>>>>>>>> (further details can be circulated on this list as needed).
>>>>>>>>>
>>>>>>>>> So, the questions on the table are:
>>>>>>>>>
>>>>>>>>> 1. Is everybody ok with scrapping the stem+segment marking in
>>>>>>>>> favor of lemma+feature, as far as the LRLPs are concerned? (at
>>>>>>>>> the kickoff, Mitch M had voiced some objections, but withdrew
>>>>>>>>> them later)
>>>>>>>>> 2. Who is willing to share their morphological analyzers,
>>>>>>>>> generators (for lexicon expansion), or expanded lexicons with
>>>>>>>>> the rest of the community? We are expecting that one or more
>>>>>>>>> performers will produce run-time morphological analyzers (which
>>>>>>>>> output lemma+features) which will be shared with the LORELEI
>>>>>>>>> community, preferably by inclusion in the later versions of the
>>>>>>>>> LRLPs.
>>>>>>>>> 3. For those who are sharing any kind of morphological
>>>>>>>>> resources, is the YNF (Yarowsky Normal Form) acceptable?
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> Dr. Boyan Onyshkevych
>>>>>>>>> Program Manager, DARPA/I2O
>>>>>>>>> 675 North Randolph Street, Arlington, VA 22203-2114
>>>>>>>>> boyan.onyshkevych at darpa.mil
>>>>>>>>> 703-526-2789 (off) / 571-215-0561 (BB) / 703-248-4992 (fax)
>>>>>>>>> SETA: Caitlin Christianson <caitlin.christianson.ctr at darpa.mil>
>>>>>>>>> SETA: Paul Dietrich <paul.dietrich.ctr at darpa.mil>
>>>
>>
>
> _______________________________________________
> lorelei-morph mailing list
> lorelei-morph at ldc.upenn.edu
> http://newlists.ldc.upenn.edu/listinfo/lorelei-morph