Introduction
Ontologies are complex artifacts that define the nature of the data exchanged within enterprise applications, thus constraining their possible interpretations within the targeted domain(s). As definitional schemata for general knowledge properties, reusable across domains and applications, they are crucial resources for open-ended software systems, such as Web applications or Question Answering systems. Note, however, that whenever unstructured data are involved, ontologies do not by themselves solve the interpretation problem: public or internal sources of such information must always be mapped into the proper concepts, relations or properties made available by the ontology itself.
Ontology acquisition
In these terms the role of natural language is central. Natural language is at the crossroads of the development, acquisition and engineering of knowledge in every domain: before any agreement about a real-world truth is reached, that truth is expressed linguistically. As a consequence, every knowledge development process in which facts and properties of entities are to be recognized and synthesized is fully immersed in a stream of linguistic phenomena, ranging from draft documents to scientific papers, from messages to spoken dialogues, from personal communications to technical reports. Ontology acquisition is thus the process of carrying information from the huge realm of heterogeneous linguistic observations into the formalized world of a logical formalism able to axiomatize facts and determine truth-conditional models for the resulting Knowledge Base. A further role of natural language is to enable the use of ontologies within applications, as it provides the vocabulary by which truth is communicated to human users: in this case, agreement about the exchanged information is achieved when the proper linguistic translation of an ontological fact is made available.
Engineering the above process thus involves supporting every stage of the use of ontologies within software systems. Ontology engineering processes have been proposed, for example, as ways of eliciting knowledge from texts. In these systems three stages are usually carried out semi-automatically.
- First, domain terminology is extracted from the texts of an application domain, and filtered by applying advanced natural language processing techniques.
- Then, terms are semantically interpreted so that no ambiguity about their sense remains (e.g. words or terms are ordered according to taxonomic relations, generating domain concept hierarchies).
- Finally, the resulting sorts are mapped into existing ontologies (such as any available domain reference data model).
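As a rough illustration of the three stages above, the following Python sketch extracts frequent terms, orders them with a simple head-word heuristic and links them to an existing vocabulary. It is a toy under stated assumptions, not the workflow of any specific system: the stopword list, the frequency threshold, the head-word rule and the `ontology_labels` dictionary are all hypothetical stand-ins for the natural language processing techniques and reference data models mentioned in the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption, not from the source).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "for", "to", "on", "with", "by"}


def extract_terms(texts, min_count=2, max_len=2):
    """Stage 1: count stopword-free word n-grams and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    counts[" ".join(gram)] += 1
    return {term for term, count in counts.items() if count >= min_count}


def build_hierarchy(terms):
    """Stage 2: head-word heuristic, e.g. 'tree kernel' is placed under 'kernel'."""
    parents = {}
    for term in terms:
        words = term.split()
        if len(words) > 1 and words[-1] in terms:
            parents[term] = words[-1]
    return parents


def map_to_ontology(terms, ontology_labels):
    """Stage 3: link terms to existing concepts by exact label match."""
    return {term: ontology_labels[term] for term in terms if term in ontology_labels}


texts = [
    "The tree kernel is a kernel for parse trees.",
    "A tree kernel compares parse trees.",
]
terms = extract_terms(texts)
hierarchy = build_hierarchy(terms)
# 'ex:Kernel' is an invented concept identifier used only for this example.
links = map_to_ontology(terms, {"kernel": "ex:Kernel"})
```

A real system would replace each function with a far stronger component (for instance, statistical term weighting in stage 1 and word sense disambiguation in stage 2); the sketch only shows how the output of one stage feeds the next.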
The problems related to the above stages are at least:
- Language-aware Text Processing. The ability to recognize entities, sets or concepts, relations and properties characterizing a knowledge domain directly as they are expressed or mentioned in the text collections in use within the domain (e.g. specialized Web content, Wikipedia pages, documents of international standards such as ISO, or documents internal to an organisation).
- Information Extraction and Text Understanding. The ability to disambiguate the semantics of the above information and to locate it selectively within texts, independently of the variability and ambiguity of natural language. Subtasks of this problem include at least:
  - Named Entity Classification
  - Word Sense Disambiguation
  - Semantic Relation Extraction from texts
- Ontology Learning. The development of ontological notions from scratch, such as learning a suitable taxonomic structure of concepts in a domain through the hierarchical clustering of the most important underlying instances, as these are recognized in the previous steps.
- Ontology Population. The creation of new instances for a given ontology through the recognition of their mentions and properties in raw texts. This often implies mapping the recognized linguistic entities into the classes (i.e. the concepts) that represent their suitable host in the domain axiomatization embodied by the given ontology, as well as locating the relations mentioned in the texts (as in Semantic Relation Extraction).
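To make the last problem more concrete, the sketch below shows one very simple way to populate an ontology: each recognized entity is attached to the class whose typical context words overlap most with the words observed around the entity. This is only an illustrative baseline under loud assumptions. The class profiles, the entity names, the Jaccard overlap measure and the `threshold` value are all hypothetical, and the sketch is not the kernel-based approach described in the next paragraph.

```python
def jaccard(a, b):
    """Overlap between two sets of context words (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def populate(ontology, mentions, threshold=0.2):
    """Attach each mention to the class whose context profile it resembles most.

    ontology: class name -> set of context words typical of its instances
              (hypothetical seed profiles, assumed to be given)
    mentions: entity string -> set of words observed around it in the texts
    Returns class name -> sorted list of new instances; weak matches are left out.
    """
    instances = {cls: [] for cls in ontology}
    for entity, context in mentions.items():
        best_cls, best_score = None, 0.0
        for cls, profile in ontology.items():
            score = jaccard(context, profile)
            if score > best_score:
                best_cls, best_score = cls, score
        if best_cls is not None and best_score >= threshold:
            instances[best_cls].append(entity)
    return {cls: sorted(found) for cls, found in instances.items()}


# Invented toy data: two classes and three entity mentions with their contexts.
ontology = {
    "Organization": {"founded", "headquartered", "company", "employees"},
    "Person": {"born", "said", "professor", "author"},
}
mentions = {
    "Acme Corp": {"company", "headquartered", "publishes"},
    "Ada Rossi": {"author", "said", "wrote"},
    "Springfield": {"weather", "visited"},
}
populated = populate(ontology, mentions)
```

Here "Springfield" shares no context word with either profile, so it is left unassigned rather than forced into a class; in practice such cases are where a learned classifier or a manual validation step would be needed.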
The work of the SAG group has been to model all the above problems as operational software workflows, by integrating advanced statistical machine learning methods (e.g. supervised and semi-supervised kernel-based learning) with a strong representational linguistic bias (e.g. parse trees) and with language processing technologies and tools (e.g. Semantic Role Labeling or state-of-the-art NERC systems). As an example, corpus-based analysis for ontology learning based on WordNet or FrameNet is discussed in (Moschitti et al., 2006), (Basili et al., 2007) and (Basili et al., 2008).
The group has also investigated in greater depth the interface to ontological data based on questions in natural language, a topic that dates back to research on NL interfaces to databases. The task has recently been addressed by QALD (Question Answering over Linked Data), a series of evaluation campaigns on multilingual question answering over open linked data. The SAG research on this topic is discussed on a separate page.
References.
Gruber, Thomas R. (June 1993). “A translation approach to portable ontology specifications”. Knowledge Acquisition 5 (2): 199–220.
P. Buitelaar, P. Cimiano, and B. Magnini (Eds.). “Ontology Learning from Text: Methods, Evaluation and Applications”. Frontiers in Artificial Intelligence and Applications, IOS Press, 2005.
Asunción Gómez-Pérez, Mariano Fernández-López, Oscar Corcho. Ontological Engineering: With Examples from the Areas of Knowledge Management, E-commerce and the Semantic Web. Springer, 2004.
Botstein, David; Cherry, J. Michael; Ashburner, Michael; Ball, Catherine A.; Blake, Judith A.; Butler, Heather; Davis, Allan P.; Dolinski, Kara et al. (2000). “Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium”. Nature Genetics 25 (1): 25–9.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin, and Roberto Basili. “Semantic Tree Kernels to Classify Predicate Argument Structures”. ECAI, volume 141 of Frontiers in Artificial Intelligence and Applications, page 568-. IOS Press, 2006.
Roberto Basili, Cristina Giannone, Diego De Cao. “Learning domain-specific framenets from texts”. In Proceedings of 3rd Workshop on Ontology Learning and Population (OLP3). Patras, Greece. July 21/22, 2008.
Roberto Basili, Diego De Cao, Paolo Marocco and Marco Pennacchiotti. “Learning Selectional Preferences for Entailment or Paraphrasing Rules”. In Proceedings of RANLP 2007. Borovets, Bulgaria. September, 2007.