Semantic Similarity Between Words and Sentences Using Lexical Database and Word Embeddings

Abstract
Calculating the semantic similarity between sentences is a long-standing problem in natural language processing. Semantic analysis plays a crucial role in research related to text analytics. The meaning of a word in general English differs as the context changes; hence, semantic similarity varies significantly as the domain of operation differs. For this reason, it is crucial to consider the appropriate sense of a word when words are compared semantically.
We present an unsupervised method that can be applied across multiple domains by incorporating corpus-based statistics into a standardized semantic similarity algorithm. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. When tested on benchmark standards and mean human similarity datasets, the methodology achieves a high correlation for both word similarity (Pearson's correlation coefficient = 0.8753) and sentence similarity (PCC = 0.8793) against the Rubenstein and Goodenough standard, and on the SICK dataset (PCC = 0.8324), outperforming other unsupervised models.
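To make the evaluation setup concrete, the sketch below pairs a generic edge-based WordNet measure (NLTK's path similarity, not the corpus-weighted algorithm developed in this thesis) with SciPy's Pearson correlation. The word pairs and human ratings are illustrative placeholders rather than the Rubenstein and Goodenough data.

# A minimal sketch, assuming NLTK's WordNet data is installed.
from nltk.corpus import wordnet as wn
from scipy.stats import pearsonr

def edge_based_similarity(w1, w2):
    """Best shortest-path (edge-counting) similarity over all synset pairs; 0 if none."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)  # may be None for incompatible synsets
            if sim is not None and sim > best:
                best = sim
    return best

# Illustrative word pairs with made-up human ratings on a 0-4 scale (not real data).
pairs = [("car", "automobile", 3.9), ("coast", "shore", 3.6), ("noon", "string", 0.1)]
model_scores = [edge_based_similarity(a, b) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
r, _ = pearsonr(model_scores, human_scores)
print(f"Pearson correlation: {r:.4f}")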
We extend the semantic similarity algorithm to compare learning objectives from course outlines. The course description provided by instructors is an essential piece of information, as it defines what is expected of the instructor and what he or she will deliver during a particular course. One of the key components of a course description is the learning objectives section. The contents of this section are used by program managers who are tasked with comparing and matching two different courses during the development of transfer agreements between institutions. This research introduces semantic similarity algorithms to calculate the similarity between two learning objectives from the same domain. We present a methodology that addresses semantic similarity by using a previously established algorithm and integrating it with a domain corpus to utilize domain statistics. The disambiguated domain serves as supervised learning data for the algorithm. We also introduce the Bloom Index to calculate the similarity between action verbs in the learning objectives with reference to Bloom's taxonomy.
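As an illustration only, the following sketch scores two action verbs by their distance across the six levels of Bloom's taxonomy; the verb-to-level table and the distance formula are assumptions made for demonstration and do not reproduce the Bloom Index defined in this work.

# Hypothetical verb-to-level mapping for the six levels of Bloom's taxonomy.
BLOOM_LEVELS = {
    "define": 1, "list": 1,        # Remember
    "explain": 2, "summarize": 2,  # Understand
    "apply": 3, "demonstrate": 3,  # Apply
    "analyze": 4, "compare": 4,    # Analyze
    "evaluate": 5, "justify": 5,   # Evaluate
    "design": 6, "create": 6,      # Create
}

def bloom_similarity(verb1, verb2):
    """Return a score in [0, 1]; 1 means the verbs sit on the same Bloom level."""
    l1, l2 = BLOOM_LEVELS.get(verb1.lower()), BLOOM_LEVELS.get(verb2.lower())
    if l1 is None or l2 is None:
        return 0.0
    return 1.0 - abs(l1 - l2) / 5.0  # 5 is the maximum possible level gap

print(bloom_similarity("explain", "summarize"))  # 1.0 (same level)
print(bloom_similarity("define", "create"))      # 0.0 (opposite ends)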
We also study and present an approach to calculating the semantic similarity between words under the word2vec model for a specific domain. We present a methodology for compiling a domain-specific corpus from Wikipedia, and then present a case that shows how the semantic similarity between words varies across corpora. The core contributions of this thesis are a semantic similarity algorithm for words and sentences, and the compilation of a domain-specific corpus to train the word2vec model. We also describe practical uses of the algorithms and their implementation.
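The sketch below shows how such a domain corpus might be used to train a word2vec model with gensim; the file name domain_corpus.txt, the whitespace tokenization, and the training parameters are assumptions, not the exact pipeline described in the thesis.

# A minimal sketch, assuming gensim 4.x and a hypothetical plain-text dump of the
# collected Wikipedia articles ("domain_corpus.txt"), one sentence per line.
from gensim.models import Word2Vec

with open("domain_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=2,      # ignore rare tokens
    workers=4,
)

# Cosine similarity between two in-vocabulary words; the score depends on the
# training corpus, which is the variance the domain-specific comparison illustrates.
print(model.wv.similarity("algorithm", "method"))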