Automatic Measurement of Semantic Similarity among Arabic Short Texts

Fatma Elghannam. Published in Information Sciences.

Communications on Applied Electronics
Year of Publication: 2016
Publisher: Foundation of Computer Science (FCS), NY, USA
Authors: Fatma Elghannam
DOI: 10.5120/cae2016652430

Fatma Elghannam. Automatic Measurement of Semantic Similarity among Arabic Short Texts. Communications on Applied Electronics 6(2):16-21, November 2016.

BibTeX

@article{10.5120/cae2016652430,
	author = {Fatma Elghannam},
	title = {Automatic Measurement of Semantic Similarity among Arabic Short Texts},
	journal = {Communications on Applied Electronics},
	issue_date = {November 2016},
	volume = {6},
	number = {2},
	month = {Nov},
	year = {2016},
	issn = {2394-4714},
	pages = {16-21},
	numpages = {6},
	url = {http://www.caeaccess.org/archives/volume6/number2/675-2016652430},
	doi = {10.5120/cae2016652430},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

Documents that deal with the same topic normally share many identical words. Accordingly, similarity measures based on surface word co-occurrence have been applied successfully to measure the similarity between such documents. However, the task is not trivial for short texts that carry the same or a close meaning but use different vocabularies. To address this problem, researchers have been investigating methods for analyzing words at the semantic level. We introduce a new method for measuring the semantic similarity between short texts. In the proposed method, semantic distribution and lexical similarity measures are combined to determine the degree of similarity between two words. The similarity between two words is measured as the lexical similarity between their vectors of similar words, which are extracted from a corpus and serve as second-order word vectors. The proposed method was applied to measure the semantic similarity between Arabic short texts. In the experiments, the best accuracy achieved by the proposed method was 97%, compared with 93% for second-order distribution similarity.
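
The idea of comparing two words through the overlap of their similar-word vectors can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say which lexical measure or aggregation scheme is used, so Jaccard overlap and a symmetric best-match average are assumptions made here. The `SIMILAR_WORDS` table, the function names, and the English tokens are hypothetical stand-ins for similar-word lists that would be extracted from an Arabic corpus.

```python
from typing import Dict, Iterable, List

# Hypothetical toy data: each word maps to the words found to be
# distributionally similar to it. These lists are invented for illustration.
SIMILAR_WORDS: Dict[str, List[str]] = {
    "car": ["vehicle", "truck", "automobile", "bus"],
    "automobile": ["vehicle", "car", "truck", "motor"],
    "banana": ["fruit", "apple", "mango", "orange"],
}


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Lexical overlap of two word collections (assumed measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def word_similarity(w1: str, w2: str) -> float:
    """Compare two words via the overlap of their similar-word vectors."""
    if w1 == w2:
        return 1.0
    # Each word is added to its own vector so that direct neighbours count.
    v1 = set(SIMILAR_WORDS.get(w1, [])) | {w1}
    v2 = set(SIMILAR_WORDS.get(w2, [])) | {w2}
    return jaccard(v1, v2)


def text_similarity(t1: List[str], t2: List[str]) -> float:
    """Symmetric average of each word's best match in the other text."""
    if not t1 or not t2:
        return 0.0

    def directed(src: List[str], dst: List[str]) -> float:
        return sum(max(word_similarity(s, d) for d in dst) for s in src) / len(src)

    return (directed(t1, t2) + directed(t2, t1)) / 2
```

With this toy table, "car" and "automobile" share four of six vector entries, so they score well above zero even though the surface strings differ, whereas "car" and "banana" share none. A real system would also need the preprocessing steps (for example, lemmatization of Arabic tokens) that this sketch leaves out.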

Keywords

Semantic similarity of words, similarity of short texts, corpus-based similarity measure, semantic distribution, lexical similarity.