The ability to deal with odd (i.e. ill-formed or simply partial) sentences is largely a human capability. The human interpretation process is tolerant of phenomena such as missing lexical information (e.g. foreign words), unknown words (e.g. proper nouns never encountered before), and odd grammatical constructions (e.g. gender disagreement). This form of tolerance is what has recently been called robustness in NLP. Modeling this phenomenon within computational frameworks is a relevant research area, both for better linguistic investigation and for the design of large-scale NLP systems.
In our research, a Natural Language Processing chain has been defined according to two principles: modularity, typical of software engineering practice, and the adaptability of components to different domains. These principles have been adopted in the development of CHAOS (Basili and Zanzotto, 2002), a pool of syntactic modules that has been used in real applications. The modules enable a large-scale validation of the notion of empirical robustness over different corpora and two different languages (English and Italian).
The main CHAOS modules are:
- Morphological Analyzer (which identifies the possible morphological interpretations of every token)
- Part of Speech Tagger
- Named Entity Recognizer
- Chunker (which groups one or more adjacent tokens into larger, unambiguous grammatical units called chunks)
- Verb Subcategorization Analyzer (for the recognition of the main verbal dependencies) and a Shallow Syntactic Analyzer (for the recognition of the remaining and possibly ambiguous dependencies)
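The modular design above can be illustrated with a minimal sketch, in which each module is a function that enriches a shared sentence representation and the chain is their composition. All module names, annotation keys, and the toy tagging heuristic here are hypothetical simplifications, not the actual CHAOS implementation.

```python
# Sketch of a modular NLP chain in the spirit of CHAOS: each module maps
# an annotated sentence to a further-annotated sentence, so individual
# modules can be swapped or adapted to a new domain independently.
from typing import Callable, Dict, List

Sentence = Dict[str, object]            # accumulating annotation layers
Module = Callable[[Sentence], Sentence]

def morphological_analyzer(s: Sentence) -> Sentence:
    # Toy stand-in: attach a (singleton) list of candidate analyses per token.
    s["morph"] = [[tok.lower()] for tok in s["tokens"]]
    return s

def pos_tagger(s: Sentence) -> Sentence:
    # Toy heuristic (purely illustrative): capitalized tokens become
    # proper nouns, everything else gets a generic tag.
    s["pos"] = ["NNP" if tok[0].isupper() else "X" for tok in s["tokens"]]
    return s

def run_chain(modules: List[Module], s: Sentence) -> Sentence:
    # The chain is just function composition over the module list.
    for m in modules:
        s = m(s)
    return s

sent = {"tokens": ["Rome", "is", "ancient"]}
out = run_chain([morphological_analyzer, pos_tagger], sent)
print(out["pos"])  # ['NNP', 'X', 'X']
```

Because every module consumes and produces the same sentence structure, a chunker or named entity recognizer trained on a different domain could be slotted into `run_chain` without touching the other components.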
All linguistic information produced across the processing chain is stored in the eXtended Dependency Graph (XDG) formalism, discussed in (Basili and Zanzotto, 2002). An XDG represents a sentence as a planar graph whose nodes are constituents and whose arcs are the grammatical relationships between them.
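The constituents-as-nodes, relations-as-arcs idea can be sketched as a small data structure. The class and field names below are assumptions for illustration only, not the actual XDG data model; the `ambiguous` flag merely hints at how arcs left unresolved by the shallow analyzer might be marked.

```python
# Sketch of an XDG-like structure: nodes are constituents (e.g. chunks),
# arcs are labeled grammatical relations between them. Ambiguous
# dependencies can coexist as alternative arcs over the same nodes.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Constituent:
    cid: int        # node identifier
    text: str       # surface string covered by the constituent
    ctype: str      # constituent type, e.g. "NP", "VP"

@dataclass
class Dependency:
    head: int       # cid of the head constituent
    dep: int        # cid of the dependent constituent
    label: str      # grammatical relation, e.g. "subj", "obj"
    ambiguous: bool = False  # hypothetical marker for unresolved arcs

@dataclass
class XDG:
    nodes: List[Constituent] = field(default_factory=list)
    arcs: List[Dependency] = field(default_factory=list)

g = XDG(
    nodes=[Constituent(0, "the cat", "NP"), Constituent(1, "sleeps", "VP")],
    arcs=[Dependency(head=1, dep=0, label="subj")],
)
print(len(g.nodes), g.arcs[0].label)  # 2 subj
```

Keeping alternative arcs in the graph rather than forcing an early decision is one way a representation like this supports the tolerant, robust interpretation discussed above.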