Semantic similarity

Semantic similarity is a metric defined over a set of documents or terms, in which the distance between items is based on the likeness of their meaning or semantic content, as opposed to similarity estimated from their syntactic representation (e.g. their string format). Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing the information that supports their meaning or describes their nature.[1] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.[2] For example, "car" is similar to "bus", but is also related to "road" and "driving".

Computationally, semantic similarity can be estimated by defining a topological similarity, using ontologies to define the distance between terms or concepts. For example, a naive metric for comparing concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy) would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated by statistical means, such as a vector space model that correlates words and textual contexts from a suitable text corpus.
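The naive shortest-path metric can be illustrated with a minimal sketch. The taxonomy below is a made-up toy example, and the function names and the conversion 1 / (1 + d) from path length to a score are assumptions chosen for illustration, not a measure prescribed by the sources cited here.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct parents ("is a" links).
TAXONOMY = {
    "car": ["motor vehicle"],
    "bus": ["motor vehicle"],
    "motor vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["entity"],
    "road": ["entity"],
    "entity": [],
}


def build_undirected(taxonomy):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {node: set() for node in taxonomy}
    for child, parents in taxonomy.items():
        for parent in parents:
            graph[child].add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(taxonomy, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    graph = build_undirected(taxonomy)
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(taxonomy, a, b):
    """Map a path length d to a score in (0, 1] using 1 / (1 + d) (an assumed choice)."""
    d = shortest_path_length(taxonomy, a, b)
    return None if d is None else 1.0 / (1.0 + d)
```

In this toy taxonomy "car" and "bus" are two edges apart, while "car" and "road" are four edges apart, so the path-based score ranks "bus" as more similar to "car" than "road" is.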

Terminology

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not.[3] However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all ask, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity.

Visualization

An intuitive way of visualizing the semantic similarity of terms is by grouping together terms which are closely related and spacing wider apart the ones which are distantly related. This is also common in practice for mind maps and concept maps and is sometimes subconscious.

A more direct way of visualizing the semantic similarity of two linguistic items can be seen with the Semantic Folding approach. In this approach a linguistic item such as a term or a text can be represented by generating a pixel for each of its active semantic features in e.g. a 128 x 128 grid. This allows for a direct visual comparison of the semantics of two items by comparing image representations of their respective feature sets.
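The grid comparison can be sketched as a set overlap. This is a minimal illustration under stated assumptions: the flat feature indices, the three fingerprints, and the overlap score (shared cells divided by the size of the smaller fingerprint) are all hypothetical and are not taken from any Semantic Folding implementation.

```python
GRID_SIDE = 128  # grid size mentioned in the text


def to_cells(active_positions):
    """Convert flat feature indices into (row, column) pixel coordinates."""
    return {divmod(p, GRID_SIDE) for p in active_positions}


def overlap_score(a, b):
    """Shared active cells divided by the size of the smaller fingerprint."""
    cells_a, cells_b = to_cells(a), to_cells(b)
    if not cells_a or not cells_b:
        return 0.0
    return len(cells_a & cells_b) / min(len(cells_a), len(cells_b))


# Made-up fingerprints: sets of active feature indices in a 128 x 128 grid.
car = {5, 130, 131, 4000, 9001}
bus = {5, 130, 777, 4000, 12000}
road = {131, 2048, 15000}

print(overlap_score(car, bus))   # three of five cells are shared
print(overlap_score(bus, road))  # no cells are shared
```

Drawing each set of cells as pixels gives the image representation described above; the overlap score is a numeric counterpart of comparing the two images by eye.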

Applications

Biomedical informatics

Semantic similarity measures have been applied and developed in biomedical ontologies,[4][5][6] namely, the Gene Ontology (GO).[7][8][9][10] They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds,[11] anatomical entities[12] and diseases.[13]

These comparisons can be done using tools freely available on the web:

GeoInformatics

Similarity is also applied to find similar geographic features or feature types:[17]

Computational linguistics

Several metrics use WordNet, a manually constructed lexical database of English words. Despite the advantages of having human supervision in constructing the database, the words are not learned automatically, so the database cannot measure relatedness between multi-word terms and its vocabulary does not grow incrementally.[3][22]

Natural language processing

Natural language processing (NLP) is a field of computer science and linguistics. Sentiment analysis, natural language understanding and machine translation (automatically translating text from one human language to another) are a few of the major areas where it is used. For example, given one information resource on the Internet, it is often of immediate interest to find similar resources. The Semantic Web provides semantic extensions to find similar data by content and not just by arbitrary descriptors.[23][24][25][26][27][28][29][30][31]

Measures

Topological similarity

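There are essentially two types of approaches that calculate topological similarity between ontological concepts: edge-based approaches, which use the edges of the ontology and their types as the data source, and node-based approaches, in which the main data sources are the nodes and their properties.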

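Other measures calculate the similarity between ontological instances: pairwise approaches, which measure the similarity between two instances by combining the semantic similarities of the concepts they represent, and groupwise approaches, which calculate the similarity directly, without combining the similarities of the individual concepts.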

Some examples:

Edge-based

Node-based

Node-and-Relation-Content-based

Pairwise

Groupwise

Statistical similarity

Statistical similarity approaches can be learned from data, or predefined. Similarity learning can often outperform predefined similarity measures. Broadly speaking, these approaches build a statistical model of documents, and use it to estimate similarity.
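As a minimal sketch of a predefined (not learned) statistical measure, the example below builds a bag-of-words vector for each text and compares the vectors with cosine similarity. The whitespace tokenisation and the sample sentences are assumptions for illustration; raw term counts capture only lexical overlap, so practical systems typically derive the vectors from word contexts in a large corpus instead.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lower-cased, whitespace-separated term counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = term_vector("the car drives on the road")
doc_b = term_vector("the bus drives on the road")
doc_c = term_vector("proteins fold into shapes")

print(cosine_similarity(doc_a, doc_b))  # high: most terms are shared
print(cosine_similarity(doc_a, doc_c))  # 0.0: no terms are shared
```

With non-negative counts the score lies between 0 and 1; vectors with negative components, such as those produced by some learned models, can yield values down to -1.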

Semantics-based similarity

Gold standards

Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. Compiled lists of such word similarity datasets are available.

See also

References

  1. Harispe S.; Ranwez S.; Janaqi S.; Montmain J. (2015). "Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8:1: 1–254. doi:10.2200/S00639ED1V01Y201504HLT027.
  2. A. Ballatore; M. Bertolotto; D.C. Wilson (2014). "An evaluative baseline for geo-semantic relatedness and similarity". GeoInformatica. 18:4: 747–767.
  3. Budanitsky, Alexander; Hirst, Graeme (2001). "Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures" (PDF). Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics. Pittsburgh.
  4. Pesquita, Catia; Faria, Daniel; Falcão, André O.; Lord, Phillip; Couto, Francisco M. (2009). Bourne, Philip E., ed. "Semantic Similarity in Biomedical Ontologies". PLoS Computational Biology. 5 (7): e1000443. doi:10.1371/journal.pcbi.1000443. PMC 2712090. PMID 19649320.
  5. Guzzi, Pietro Hiram; Mina, Marco; Cannataro, Mario; Guerra, Concettina (2012). "Semantic similarity analysis of protein data: assessment with biological features and issues". Briefings in Bioinformatics. 13 (5): 569–585. doi:10.1093/bib/bbr066. PMID 22138322.
  6. Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Dominique (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". BMC Bioinformatics. 11: 588. doi:10.1186/1471-2105-11-588. PMC 3098105. PMID 21122125.
  7. Couto, F., Silva, M., & Coutinho, P. (2003). Implementation of a functional semantic similarity measure between gene-products. DI/FCUL TR 03–29, University of Lisbon
  8. Pesquita, C., Faria, D., Falcão, A., Lord, P., & Couto, F. (2009). Semantic similarity in biomedical ontologies. PLoS Computational Biology, 5:e1000443
  9. Couto, F., Silva, M., & Coutinho, P. (2005). "Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors". Proc. of the ACM Conference in Information and Knowledge Management (CIKM): 343. doi:10.1145/1099554.1099658. ISBN 1595931406.
  10. Couto, F., Silva, M., & Coutinho, P. (2007). "Measuring semantic similarity between Gene Ontology terms". Data and Knowledge Engineering. 61: 137–152. doi:10.1016/j.datak.2006.05.003.
  11. Ferreira, João D.; Couto, Francisco M. (2010). Mitchell, John B. O., ed. "Semantic Similarity for Automatic Classification of Chemical Compounds". PLoS Computational Biology. 6 (9): e1000937. doi:10.1371/journal.pcbi.1000937. PMC 2944781. PMID 20885779.
  12. Ferreira, João D.; Couto, Francisco M. (2011). "Generic semantic relatedness measure for biomedical ontologies" (PDF). ICBO 2011 Proceedings.
  13. Köhler, S; Schulz, MH; Krawitz, P; Bauer, S; Dolken, S; Ott, CE; Mundlos, C; Horn, D; et al. (2009). "Clinical diagnostics in human genetics with semantic similarity searches in ontologies". American Journal of Human Genetics. 85 (4): 457–64. doi:10.1016/j.ajhg.2009.09.003. PMC 2756558. PMID 19800049.
  14. "ProteInOn".
  15. "CMPSim".
  16. "CESSM".
  17. Janowicz, K., Raubal, M. and Kuhn, W. (2011). "The semantics of similarity in geographic information retrieval". Journal of Spatial Information Science. 2: 29–57. doi:10.5311/josis.2011.2.3.
  18. "SIM-DL similarity server". CiteSeerX 10.1.1.172.5544.
  19. "Geo-Net-PT Similarity Calculator".
  20. "Geo-Net-PT".
  21. A. Ballatore; D.C. Wilson; M. Bertolotto. "Geographic Knowledge Extraction and Semantic Similarity in OpenStreetMap". Knowledge and Information Systems: 61–81.
  22. Kaur, I. & Hornof, A.J. (2005). "A Comparison of LSA, WordNet and PMI for Predicting User Click Behavior". Proceedings of the Conference on Human Factors in Computing, CHI 2005: 51–60. doi:10.1145/1054972.1054980. ISBN 1581139985.
  23. Similarity-based Learning Methods for the Semantic Web (C. d'Amato, PhD Thesis)
  24. Gracia, J. & Mena, E. (2008). "Web-Based Measure of Semantic Relatedness" (PDF). Proceedings of the 9th international conference on Web Information Systems Engineering (WISE '08). Springer-Verlag, Berlin, Heidelberg: 136–150.
  25. Raveendranathan, P. (2005). Identifying Sets of Related Words from the World Wide Web. Master of Science Thesis, University of Minnesota Duluth.
  26. Wubben, S. (2008). Using free link structure to calculate semantic relatedness. In ILK Research Group Technical Report Series, nr. 08-01, 2008.
  27. Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Tx: The Cognitive Science Society, Inc.
  28. Navigli, R., Lapata, M. (2007). Graph Connectivity Measures for Unsupervised Word Sense Disambiguation, Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), Hyderabad, India, January 6-12th, 2007, pp. 1683–1688.
  29. Pirolli, P. (2005). "Rational analyses of information foraging on the Web". Cognitive Science. 29 (3): 343–373. doi:10.1207/s15516709cog0000_20. PMID 21702778.
  30. Pirolli, P., & Fu, W.-T. (2003). "SNIF-ACT: A model of information foraging on the World Wide Web". Lecture Notes in Computer Science. 2702. pp. 45–54. doi:10.1007/3-540-44963-9_8.
  31. Turney, P. (2001). Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491–502). Freiburg, Germany.
  32. Pekar, Viktor; Staab, Steffen (2002). "Proceedings of the 19th international conference on Computational linguistics". 1: 1. doi:10.3115/1072228.1072318.
  33. Cheng, J; Cline, M; Martin, J; Finkelstein, D; Awad, T; Kulp, D; Siani-Rose, MA (2004). "A knowledge-based clustering algorithm driven by Gene Ontology". Journal of biopharmaceutical statistics. 14 (3): 687–700. doi:10.1081/BIP-200025659. PMID 15468759.
  34. Wu, H; Su, Z; Mao, F; Olman, V; Xu, Y (2005). "Prediction of functional modules based on comparative genome analysis and Gene Ontology application". Nucleic Acids Research. 33 (9): 2822–37. doi:10.1093/nar/gki573. PMC 1130488. PMID 15901854.
  35. Del Pozo, Angela; Pazos, Florencio; Valencia, Alfonso (2008). "Defining functional distances over Gene Ontology". BMC Bioinformatics. 9: 50. doi:10.1186/1471-2105-9-50. PMC 2375122. PMID 18221506.
  36. Philip Resnik (1995). Chris S. Mellish, ed. "Using information content to evaluate semantic similarity in a taxonomy". Proceedings of the 14th international joint conference on Artificial intelligence (IJCAI'95). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 1: 448–453.
  37. Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296-304
  38. Ana Gabriela Maguitman, Filippo Menczer, Heather Roinestad, Alessandro Vespignani: Algorithmic detection of semantic similarity. WWW 2005: 107-116
  39. J. J. Jiang and D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In International Conference on Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997
  40. Couto, F. & Silva, M. (2011), Disjunctive Shared Information between Ontology Concepts: application to Gene Ontology. Journal of Biomedical Semantics, 2:5
  41. Couto, F., Silva, M., & Coutinho, P. (2007). Measuring semantic similarity between Gene Ontology terms. Data and Knowledge Engineering, 61:137–152
  42. M. T. Pilehvar, D. Jurgens and R. Navigli. Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity.. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4–9, 2013, pp. 1341-1351.
  43. Dong, Hai (2009). "A Hybrid Concept Similarity Measure Model for Ontology Environment". Lecture Notes in Computer Science. 5872: 848–857.
  44. Dong, Hai (2011). "A context-aware semantic similarity model for ontology environments". Concurrency and Computation: Practice and Experience. 23 (2): 505–524.
  45. Catia Pesquita, Daniel Faria, Hugo Bastos, António Ferreira, Andre O Falcao, Francisco Couto 2008: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics Suppl 5(9), S4
  46. Landauer, T. K.; Dumais, S. T. (1997). "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge". Psychological Review. 104 (2): 211–240. doi:10.1037/0033-295x.104.2.211.
  47. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). "Introduction to Latent Semantic Analysis" (PDF). Discourse Processes. 25: 259–284. doi:10.1080/01638539809545028.
  48. "Google Similarity Distance".
  49. J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. NASARI: a Novel Approach to a Semantically-Aware Representation of Items. In Proceedings of the North American Chapter of the Association of Computational Linguistics (NAACL 2015), Denver, USA, pp. 567-577, 2015
  50. J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. A Unified Multilingual Semantic Representation of Concepts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, July 27–29, pp. 741-751, 2015
  51. C. d'Amato, S. Staab, and N. Fanizzi. On the influence of description logics ontologies on conceptual similarity. Knowledge Engineering: Practice and Patterns, pages 48-63, 2008 doi:10.1007/978-3-540-87696-0_7
  52. F. Couto and H. Pinto, The next generation of similarity measures that fully explore the semantics in biomedical ontologies, Journal of Bioinformatics and Computational Biology, vol. in press, 2013. preprint

External links

Software

Web services

  1. Rus, V., Lintean, M. C., Banjade, R., Niraula, N. B., & Stefanescu, D. (2013, August). SEMILAR: The Semantic Similarity Toolkit. In ACL (Conference System Demonstrations) (pp. 163-168).
This article is issued from Wikipedia - version of the 11/1/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.