Document Type

Master's Research

Degree Name

Master of Science

Department

Computer Science

Advisor(s)

Lubomir Stanchev

Date of Award

8-2011

Abstract

Word similarity is a semantic measure of how closely two words are related in meaning. The goal of this master's thesis is to implement a graph-based knowledge base and to compute the similarity between two word forms from their connections in the graph. The graph serves as the foundation for the similarity calculations. As primary resources for constructing the graph, we use the natural-text descriptions (senses and example uses) of each phrase. The main source from which we extract similarity knowledge is WordNet. The Google Books Ngram Viewer dataset is integrated to strengthen the connections between similar word forms. As a result, we build a data structure that connects not only words that have mutual relationships (synonym, hyponym, and hypernym), but also words that appear in the same context.

The phrase graph that we build has words from the English language as vertices. Vertices are connected by edges that carry a similarity weight. The advantage of this approach is that the search algorithm assigns high coefficient values to phrases that are similar in meaning. Similarity of meaning is calculated not only from the synonym, hyponym, and hypernym relationships, but also from the context in which the words appear. We assume that words that appear together in a sense, or that are mentioned in an example-use sentence, have a higher probability of being similar. As a result, words that differ in their basic meaning but are commonly related in everyday use are expected to have high similarity coefficients.
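
As an illustration only, the weighted phrase graph described above could be sketched as follows. The abstract does not state how weights are assigned or how they are combined along a path, so everything here is an assumption: the class name `PhraseGraph`, the edge weights, and the rule that a path scores the product of its edge weights (with the best-scoring path taken as the similarity) are hypothetical and are not the thesis's actual method.

```python
import heapq
from collections import defaultdict


class PhraseGraph:
    """Toy phrase graph: vertices are words, edge weights lie in (0, 1]."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def add_edge(self, a, b, weight):
        # Assumption: edges are undirected and the strongest weight wins.
        weight = max(weight, self.edges[a].get(b, 0.0))
        self.edges[a][b] = weight
        self.edges[b][a] = weight

    def similarity(self, source, target):
        """Score of the best path; a path scores the product of its weights."""
        if source == target:
            return 1.0
        best = {source: 1.0}
        heap = [(-1.0, source)]  # max-heap via negated scores
        while heap:
            neg_score, word = heapq.heappop(heap)
            score = -neg_score
            if word == target:
                return score
            if score < best.get(word, 0.0):
                continue  # stale heap entry
            for neighbor, weight in self.edges.get(word, {}).items():
                candidate = score * weight
                if candidate > best.get(neighbor, 0.0):
                    best[neighbor] = candidate
                    heapq.heappush(heap, (-candidate, neighbor))
        return 0.0  # no path between the two words


graph = PhraseGraph()
graph.add_edge("car", "automobile", 0.9)  # hypothetical synonym weight
graph.add_edge("car", "vehicle", 0.7)     # hypothetical hypernym weight
graph.add_edge("vehicle", "truck", 0.7)   # hypothetical hyponym weight
graph.add_edge("car", "road", 0.4)        # hypothetical context weight

print(graph.similarity("automobile", "truck"))
print(graph.similarity("car", "road"))
```

Because every weight is at most 1, extending a path can never raise its score, which is what lets a Dijkstra-style search find the best-scoring path. Under these assumptions, words linked only through shared context, such as "car" and "road", still receive a nonzero coefficient.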

Measuring word similarity is a technique that can be used to engineer ranking algorithms, which decide which results are most relevant and should be displayed first in a results list. The advantage of this approach is that it recognizes words with similar meanings and ranks them high in the results list, even when their natural-text descriptions differ.

Available for download on Saturday, January 01, 2050
