The ability to deal with odd (i.e. ill-formed or simply partial) sentences is largely a human capability. The human interpretation process is tolerant of phenomena such as missing lexical information (e.g. foreign words), unknown words (e.g. proper nouns never encountered before), and odd grammatical constructions (e.g. gender disagreement). This form of tolerance is what has recently been called robustness in NLP. Modeling this phenomenon within computational frameworks is a relevant research area, both for better linguistic investigation and for the design of large-scale NLP systems.
In our research, a Natural Language Processing chain has been defined according to two principles: modularity, typical of software engineering practice, and the adaptability of components to different domains. These principles have been adopted in the development of CHAOS (Basili and Zanzotto, 2002), a pool of syntactic modules that has been used in real applications. The modules enable a large-scale validation of the notion of empirical robustness over different corpora and two different languages (English and Italian).
The main CHAOS modules are:
- Morphological Analyzer (which identifies the possible morphological interpretations of every token)
- Part of Speech Tagger
- Named Entity Recognizer
- Chunker (which groups one or more adjacent tokens into larger, unambiguous grammatical units called chunks)
- Verb Subcategorization Analyzer (for the recognition of the main verbal dependencies) and a Shallow Syntactic Analyzer (for the recognition of the remaining and possibly ambiguous dependencies)
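The modular design above can be illustrated with a minimal sketch, in which each module is a function that enriches a shared sentence representation and the chain is their composition. All module names, annotation keys, and the toy tagging heuristic here are hypothetical simplifications, not the actual CHAOS implementation.

```python
# Sketch of a modular NLP chain in the spirit of CHAOS: each module maps
# an annotated sentence to a further-annotated sentence, so individual
# modules can be swapped or adapted to a new domain independently.
from typing import Callable, Dict, List

Sentence = Dict[str, object]            # accumulating annotation layers
Module = Callable[[Sentence], Sentence]

def morphological_analyzer(s: Sentence) -> Sentence:
    # Toy stand-in: attach a (singleton) list of candidate analyses per token.
    s["morph"] = [[tok.lower()] for tok in s["tokens"]]
    return s

def pos_tagger(s: Sentence) -> Sentence:
    # Toy heuristic (purely illustrative): capitalized tokens become
    # proper nouns, everything else gets a generic tag.
    s["pos"] = ["NNP" if tok[0].isupper() else "X" for tok in s["tokens"]]
    return s

def run_chain(modules: List[Module], s: Sentence) -> Sentence:
    # The chain is just function composition over the module list.
    for m in modules:
        s = m(s)
    return s

sent = {"tokens": ["Rome", "is", "ancient"]}
out = run_chain([morphological_analyzer, pos_tagger], sent)
print(out["pos"])  # ['NNP', 'X', 'X']
```

Because every module consumes and produces the same sentence structure, a chunker or named entity recognizer trained on a different domain could be slotted into `run_chain` without touching the other components.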
All linguistic information produced across the processing chain is stored in the eXtended Dependency Graph (XDG) formalism, discussed in (Basili and Zanzotto, 2002). An XDG represents a sentence as a planar graph whose nodes are constituents and whose arcs are the grammatical relationships between them.
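The constituents-as-nodes, relations-as-arcs idea can be sketched as a small data structure. The class and field names below are assumptions for illustration only, not the actual XDG data model; the `ambiguous` flag merely hints at how arcs left unresolved by the shallow analyzer might be marked.

```python
# Sketch of an XDG-like structure: nodes are constituents (e.g. chunks),
# arcs are labeled grammatical relations between them. Ambiguous
# dependencies can coexist as alternative arcs over the same nodes.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Constituent:
    cid: int        # node identifier
    text: str       # surface string covered by the constituent
    ctype: str      # constituent type, e.g. "NP", "VP"

@dataclass
class Dependency:
    head: int       # cid of the head constituent
    dep: int        # cid of the dependent constituent
    label: str      # grammatical relation, e.g. "subj", "obj"
    ambiguous: bool = False  # hypothetical marker for unresolved arcs

@dataclass
class XDG:
    nodes: List[Constituent] = field(default_factory=list)
    arcs: List[Dependency] = field(default_factory=list)

g = XDG(
    nodes=[Constituent(0, "the cat", "NP"), Constituent(1, "sleeps", "VP")],
    arcs=[Dependency(head=1, dep=0, label="subj")],
)
print(len(g.nodes), g.arcs[0].label)  # 2 subj
```

Keeping alternative arcs in the graph rather than forcing an early decision is one way a representation like this supports the tolerant, robust interpretation discussed above.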