A Semantics-Based Method for Clustering of Chinese Web Search Results
Enterprise Information Systems
Taylor & Francis
Information explosion is a critical challenge to the development of modern information systems. In particular, when an information system operates over the Internet, it must cope with a volume of web information that is growing exponentially. Search engines, such as Google and Baidu, are essential tools for finding information on the Internet. Valuable information, however, is still likely to be submerged in the ocean of search results these tools return. By automatically clustering the results into groups based on subject, a search engine with a clustering feature allows users to select the most relevant results quickly. In this paper, we propose an online semantics-based method to cluster Chinese web search results. First, we employ a generalised suffix tree to extract the longest common substrings (LCSs) from search snippets. Second, we use HowNet to calculate the similarities of the words derived from the LCSs, and extract the most representative features by constructing a vocabulary chain. Third, we construct a vector of text features and calculate the snippets’ semantic similarities. Finally, we improve the Chameleon algorithm to cluster the snippets. Extensive experimental results show that the proposed algorithm outperforms the suffix tree clustering method and other traditional clustering methods.
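As a rough illustration of the first and third steps named in the abstract, the following Python sketch finds the longest common substring of two snippets and compares two feature vectors. It is a minimal sketch under stated assumptions, not the paper's implementation: it uses a pairwise dynamic-programming search in place of the generalised suffix tree, and plain cosine similarity over term counts in place of the HowNet-based semantic similarity. The function names `longest_common_substring` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest substring shared by a and b.

    Assumption: pairwise dynamic programming stands in for the
    generalised suffix tree mentioned in the abstract. On ties,
    the first longest match found while scanning a is returned.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse term-count vectors.

    Assumption: this is a simple stand-in for the semantic
    similarity the paper derives from HowNet.
    """
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Two made-up snippets used purely for illustration.
    print(longest_common_substring("搜索引擎聚类", "中文搜索引擎"))
    print(cosine_similarity(Counter("搜索引擎聚类"), Counter("中文搜索引擎")))
```

The pairwise search costs time proportional to the product of the two snippet lengths, which is acceptable for short snippets; a generalised suffix tree, as used in the paper, handles many snippets at once more efficiently.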
search engine, Chinese online semantic clustering, vocabulary chain, semantic similarity, Chameleon algorithm
Hui Zhang, Deqing Wang, Li Wang, Zhuming M. Bi, and Yong Chen (2014).
A Semantics-Based Method for Clustering of Chinese Web Search Results. Enterprise Information Systems, 8(1), 147-165. Taylor & Francis.