abstract: The work presented in this thesis combines statistics and natural language processing. Statistics is the branch of mathematics concerned with the collection, analysis, interpretation or explanation, and presentation of data. Researchers working with experimental data must perform a statistical analysis to interpret the results of a study correctly, and these results are usually presented in scientific publications. Because the knowledge in any particular domain grows quickly as new electronic publications appear, computer-based methods for knowledge identification, extraction, and exploration are needed to keep up with it. This is a well-known task in natural language processing, a research area of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
In this thesis, we propose an approach for exploring a given domain. The statistical part focuses on obtaining more robust statistical results for publication, and the natural language processing part focuses on extracting and normalizing relevant published scientific information in order to keep up with new knowledge in the domain.
In the area of statistics, we present a novel method for the statistical comparison of experimental data that is more robust to outliers and to small differences between data values. The main contribution of the method is a new ranking scheme, which uses the whole distribution of the data instead of a single statistic that summarizes the distribution, such as the mean or the median.
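As a rough illustration of ranking by whole distributions rather than by a single summary statistic, the Python sketch below compares samples with a two-sample Kolmogorov-Smirnov statistic and lets samples that are not significantly different share a rank. It is a minimal sketch under stated assumptions, not the ranking scheme proposed in the thesis: the choice of the Kolmogorov-Smirnov test, the asymptotic critical value at a 0.05 significance level, the grouping rule, the use of the pooled median to order groups, and all function names are illustrative assumptions.

```python
import math
from bisect import bisect_right
from statistics import median


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)  # empirical CDF of a at x
        fb = bisect_right(sb, x) / len(sb)  # empirical CDF of b at x
        d = max(d, abs(fa - fb))
    return d


def same_distribution(a, b, c_alpha=1.358):
    """Asymptotic KS decision; c_alpha = 1.358 corresponds to alpha = 0.05 (assumed level)."""
    n, m = len(a), len(b)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) <= critical


def distribution_ranks(samples):
    """Rank named samples (lower values are better); similar distributions share a rank.

    Assumption: each sample is compared only with the first member of every group,
    which presumes that "not significantly different" behaves transitively.
    """
    groups = []
    for name, values in samples.items():
        for group in groups:
            if same_distribution(samples[group[0]], values):
                group.append(name)
                break
        else:
            groups.append([name])

    # Order the groups by the median of their pooled values (illustrative choice).
    groups.sort(key=lambda g: median(v for name in g for v in samples[name]))

    ranks, next_rank = {}, 1
    for group in groups:
        shared = next_rank + (len(group) - 1) / 2  # average of the positions covered
        for name in group:
            ranks[name] = shared
        next_rank += len(group)
    return ranks


# Hypothetical results of three algorithms on one problem (lower is better).
results = {
    "a": [1.00, 1.01, 1.02, 1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09],
    "b": [1.005, 1.015, 1.025, 1.035, 1.045, 1.055, 1.065, 1.075, 1.085, 50.0],  # one outlier
    "c": [3.00, 3.01, 3.02, 3.03, 3.04, 3.05, 3.06, 3.07, 3.08, 3.09],
}
ranks = distribution_ranks(results)
print(ranks)  # a and b share rank 1.5; c gets rank 3.0
```

In this made-up example the single outlier gives "b" the largest mean of the three samples, so a ranking by averages would place it last, whereas the distribution-based ranking treats "a" and "b" as equivalent.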
In the area of natural language processing, we address the problem of information extraction from an untapped domain. We propose a new rule-based named-entity recognition method. The method does not require an annotated corpus, and its main difference from other rule-based named-entity recognition methods is that the rules are not associated with the characteristics of the entities. We also improve the string similarity of domain-specific short segments of text through probabilistic modeling of the domain, using the morphological information present in the text. The proposed method can serve as a basis for text normalization, that is, the automatic mapping of a concept (entity) to a concept that exists in a domain-specific terminological resource.
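To make the idea of combining morphological information with a probabilistic score more concrete, the Python sketch below scores two short phrases whose words are already labeled with part-of-speech tags. It is only an illustration under assumptions and not the model defined in the thesis: the tag names, the Laplace-smoothed overlap per word class, the product over classes, and the function names are all hypothetical choices, and the tagged phrases are made up.

```python
def class_overlap(words_a, words_b):
    """Laplace-smoothed share of words that two sets have in common."""
    return (len(words_a & words_b) + 1) / (len(words_a | words_b) + 2)


def phrase_similarity(tagged_a, tagged_b, classes=("NOUN", "ADJ", "VERB")):
    """Multiply the smoothed overlaps of each word class (assumed independent)."""
    score = 1.0
    for tag in classes:
        words_a = {word.lower() for word, t in tagged_a if t == tag}
        words_b = {word.lower() for word, t in tagged_b if t == tag}
        score *= class_overlap(words_a, words_b)
    return score


# Hypothetical, hand-tagged food phrases; a real pipeline would use a POS tagger.
skimmed_milk = [("skimmed", "ADJ"), ("milk", "NOUN")]
milk_powder = [("milk", "NOUN"), ("powder", "NOUN"), ("skimmed", "ADJ")]
white_bread = [("white", "ADJ"), ("bread", "NOUN")]

sim_close = phrase_similarity(skimmed_milk, milk_powder)
sim_far = phrase_similarity(skimmed_milk, white_bread)
print(sim_close > sim_far)
```

Because the score depends on sets of words per class rather than on character sequences, word order does not affect it, which is one reason such a score can behave differently from edit-distance style measures on short phrases.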
The methods are evaluated on real-life problems taken from computer science and nutrition science. The statistical method is tested on experimental data from optimization, and the evaluation results show that it gives more robust results than the commonly used statistical comparison approach, especially when the results are affected by outliers or by small differences between the data values. The rule-based named-entity recognition method is tested in the dietary domain, which is an untapped domain, where it achieves promising results. Finally, to collect information about the same entity when it is represented in different ways, using phrases with a variety of structures, the method for string similarity of domain-specific short segments of text is applied to food concepts, that is, to the food matching problem. The evaluation results show that it gives more promising results than commonly used string similarity measures.