Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets
datacite.rights | http://purl.org/coar/access_right/c_16ec | |
dc.creator | Rivera-Borroto O.M. | |
dc.creator | García-De La Vega J.M. | |
dc.creator | Marrero-Ponce Y. | |
dc.creator | Grau R. | |
dc.date.accessioned | 2020-03-26T16:32:45Z | |
dc.date.available | 2020-03-26T16:32:45Z | |
dc.date.issued | 2016 | |
dc.description.abstract | Research on similarity searching of cheminformatic data sets has focused on similarity measures using fingerprints. However, nominal scales are the least informative of all metric scales, which increases the number of tied similarity scores and decreases the effectiveness of the retrieval engines. Tanimoto's coefficient has been claimed to be the most prominent measure for this task. Nevertheless, this field is far from exhausted, since the computer science no free lunch theorem predicts that "no similarity measure has overall superiority over the population of data sets". We introduce 12 relational agreement (RA) coefficients for seven metric scales, which are integrated within a group fusion-based similarity searching algorithm. These similarity measures are compared to a reference panel of 21 proximity quantifiers over 17 benchmark data sets (MUV), by using informative descriptors, a feature selection stage, a suitable performance metric, and powerful comparison tests. In this stage, RA coefficients perform favourably with respect to the state-of-the-art proximity measures. Afterward, the RA-based method outperforms another four nearest neighbor searching algorithms over the same data domains. In a third validation stage, RA measures are successfully applied to the virtual screening of the NCI data set. Finally, we discuss a possible molecular interpretation for these similarity variants. © 2016 IEEE. | eng |
dc.format.medium | Electronic resource | |
dc.format.mimetype | application/pdf | |
dc.identifier.citation | IEEE/ACM Transactions on Computational Biology and Bioinformatics; Vol. 13, No. 1; pp. 158-167 | |
dc.identifier.doi | 10.1109/TCBB.2015.2424435 | |
dc.identifier.instname | Universidad Tecnológica de Bolívar | |
dc.identifier.issn | 15455963 | |
dc.identifier.orcid | 24436944800 | |
dc.identifier.orcid | 57188713140 | |
dc.identifier.orcid | 55665599200 | |
dc.identifier.orcid | 57193746355 | |
dc.identifier.reponame | Repositorio UTB | |
dc.identifier.uri | https://hdl.handle.net/20.500.12585/9004 | |
dc.language.iso | eng | |
dc.publisher | Institute of Electrical and Electronics Engineers Inc. | |
dc.rights.accessrights | info:eu-repo/semantics/restrictedAccess | |
dc.rights.cc | Attribution-NonCommercial 4.0 International | |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | |
dc.source | https://www.scopus.com/inward/record.uri?eid=2-s2.0-84962028690&doi=10.1109%2fTCBB.2015.2424435&partnerID=40&md5=fbef0edaa9b5080d13f6b2c9480cf72b | |
dc.subject.keywords | Chemistry | |
dc.subject.keywords | Reliability | |
dc.subject.keywords | Similarity measures | |
dc.subject.keywords | Sorting and searching | |
dc.subject.keywords | Benchmarking | |
dc.subject.keywords | Nearest neighbor search | |
dc.subject.keywords | Four-nearest-neighbors | |
dc.subject.keywords | Molecular interpretation | |
dc.subject.keywords | No free lunch theorem | |
dc.subject.keywords | Performance metrics | |
dc.subject.keywords | Proximity measure | |
dc.subject.keywords | Similarity measure | |
dc.subject.keywords | Similarity Searching | |
dc.subject.keywords | Population statistics | |
dc.subject.keywords | Algorithm | |
dc.subject.keywords | Chemical database | |
dc.subject.keywords | Data mining | |
dc.subject.keywords | Information science | |
dc.subject.keywords | Procedures | |
dc.subject.keywords | Algorithms | |
dc.subject.keywords | Databases, Chemical | |
dc.subject.keywords | Informatics | |
dc.title | Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets | |
dc.type.driver | info:eu-repo/semantics/article | |
dc.type.hasversion | info:eu-repo/semantics/publishedVersion | |
dc.type.spa | Artículo | |
dcterms.bibliographicCitation | Maggiora, G., Shanmugasundaram, V., Molecular similarity measures (2011) Chemoinformatics and Computational Chemical Biology, pp. 77-84. , Methods in Molecular Biology, J. Bajorath, ed. New York, NY, USA: Humana Press | |
dcterms.bibliographicCitation | Ágoston, V., Kaján, L., Carugo, O., Hegedüs, Z., Vlahovicek, K., Pongor, S., Concepts of similarity in bioinformatics (2005) Essays in Bioinformatics, pp. 11-31. , NATO Science Series, I: Life and Behavioural Sciences, D. S. Moss, S. Jelaska, and S. Pongor, Eds. Amsterdam, The Netherlands: IOS Press | |
dcterms.bibliographicCitation | Martin, Y.C., Kofron, J.L., Traphagen, L.M., Do structurally similar molecules have similar biological activity? (2002) J. Med. Chem., 45 (19), pp. 4350-4358. , Sep | |
dcterms.bibliographicCitation | Valencia, A., Automatic annotation of protein function (2005) Curr. Opin. Struct. Biol., 15 (3), pp. 267-274. , Jun | |
dcterms.bibliographicCitation | Medina-Franco, J.L., Scanning structure-activity relationships with structure-activity similarity and related maps: From consensus activity cliffs to selectivity switches (2012) J. Chem. Inf. Model, 52 (10), pp. 2485-2493. , Oct | |
dcterms.bibliographicCitation | Punta, M., Ofran, Y., The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function (2008) PLoS Comput. Biol., 4 (10), p. e1000160. , Oct | |
dcterms.bibliographicCitation | Gower, J.C., Legendre, P., Metric and Euclidean properties of dissimilarity coefficients (1986) J. Classification, 3 (1), pp. 5-48. , Mar | |
dcterms.bibliographicCitation | Deza, M.M., Deza, E., (2013) Encyclopedia of Distances, , 2nd ed. Berlin, Germany: Springer-Verlag | |
dcterms.bibliographicCitation | Tversky, A., Features of similarity (1977) Psychol. Rev., 84 (4), pp. 327-352. , Jul | |
dcterms.bibliographicCitation | Ucar, D., Altiparmak, F., Ferhatosmanoglu, H., Parthasarathy, S., Investigating the use of extrinsic similarity measures for microarray analysis (2007) Proc. 7th Int. Workshop Data Mining Bioinformat, pp. 10-18 | |
dcterms.bibliographicCitation | Dobson, C.M., Chemical space and biology (2004) Nature, 432 (7019), pp. 824-828. , Dec | |
dcterms.bibliographicCitation | Lee, D., Redfern, O., Orengo, C., Predicting protein function from sequence and structure (2007) Nat. Rev. Mol. Cell Biol., 8 (12), pp. 995-1005. , Aug | |
dcterms.bibliographicCitation | Bajorath, J., Integration of virtual and high-throughput screening (2002) Nat. Rev. Drug Discov., 1 (11), pp. 882-894. , Nov | |
dcterms.bibliographicCitation | Seifert, M.H.J., Wolf, K., Vitt, D., Virtual high-throughput in silico screening (2003) Biosilico, 1 (4), pp. 143-149. , Sep | |
dcterms.bibliographicCitation | Willett, P., Similarity-based virtual screening using 2D fingerprints (2006) Drug Discov. Today, 11 (23-24), pp. 1046-1053. , Dec | |
dcterms.bibliographicCitation | Wolpert, D.H., The supervised learning no-free-lunch theorems (2001) Proc. 6th Online World Conf. Soft Comput. Ind. Appl., pp. 1-20. , http://ti.arc.nasa.gov/profile/dhw/statistical/, [Online] | |
dcterms.bibliographicCitation | Holliday, J.D., Salim, N., Whittle, M., Willett, P., Analysis and display of the size dependence of chemical similarity coefficients (2003) J. Chem. Inf. Comput. Sci., 43 (3), pp. 819-828. , May | |
dcterms.bibliographicCitation | Vogt, M., Bajorath, J., Introduction of the conditional correlated Bernoulli model of similarity value distributions and its application to the prospective prediction of fingerprint search performance (2011) J. Chem. Inf. Model, 51 (10), pp. 2496-2506. , Oct | |
dcterms.bibliographicCitation | Zegers, F.E., Ten Berge, J.M.F., A family of association coefficients for metric scales (1985) Psychometrika, 50 (1), pp. 17-24. , Mar | |
dcterms.bibliographicCitation | Zegers, F.E., A family of chance-corrected association coefficients for metric scales (1986) Psychometrika, 51 (4), pp. 559-562. , Dec | |
dcterms.bibliographicCitation | Stine, W.W., Meaningful inference: The role of measurement in statistics (1989) Psychol. Bull., 105 (1), p. 147. , Jan | |
dcterms.bibliographicCitation | Conover, W.J., Iman, R.L., Rank transformations as a bridge between parametric and nonparametric statistics (1981) Amer. Stat., 35 (3), pp. 124-129. , Aug | |
dcterms.bibliographicCitation | Gower, J.C., Some distance properties of latent root and vector methods used in multivariate analysis (1966) Biometrika, 53 (3-4), pp. 325-338. , Dec | |
dcterms.bibliographicCitation | Rivera-Borroto, O.M., García-De La-Vega, J.M., Hernández-Díaz, Y., Theoretical advances on coefficients of relational agreement: Application to cheminformatics as k-way biomolecular similarity measures (2013) J. Chemometrics, 27 (11), pp. 420-430. , Nov | |
dcterms.bibliographicCitation | Al-Khalifa, A., Haranczyk, M., Holliday, J., Comparison of nonbinary similarity coefficients for similarity searching, clustering and compound selection (2009) J. Chem. Inf. Model, 49 (5), pp. 1193-1201. , May | |
dcterms.bibliographicCitation | Jobson, J., A coefficient of equality for questionnaire items with interval scales (1976) Educ. Psychol. Meas., 36 (2), pp. 271-274. , Jul | |
dcterms.bibliographicCitation | Lin, L.I.-K.L., A concordance correlation coefficient to evaluate reproducibility (1989) Biometrics, 45 (1), p. 255. , Mar | |
dcterms.bibliographicCitation | King, T.S., Chinchilli, V.M., A generalized concordance correlation coefficient for continuous and categorical data (2001) Stat. Med., 20 (14), pp. 2131-2147. , Jul | |
dcterms.bibliographicCitation | McDonald, R.P., Linear versus nonlinear models in item response theory (1982) Appl. Psychol. Meas., 6 (4), pp. 379-396. , Sep | |
dcterms.bibliographicCitation | Cureton, E.E., The definition and estimation of test reliability (1958) Educ. Psychol. Meas., 18 (4), pp. 715-738. , Dec | |
dcterms.bibliographicCitation | Mehta, J., Gurland, J., Some properties and an application of a statistic arising in testing correlation (1969) Ann. Math. Statist., 40 (5), pp. 1736-1745. , Oct | |
dcterms.bibliographicCitation | Kristof, W., On a statistic arising in testing correlation (1972) Psychometrika, 37 (4), pp. 377-384. , Dec | |
dcterms.bibliographicCitation | Burt, C., The factorial study of temperamental traits (1948) Brit. J. Psychol., 1 (3), pp. 178-203. , Nov | |
dcterms.bibliographicCitation | Tucker, L.R., (1951) A Method for Synthesis of Factor Analysis Studies, , Princeton, NJ, USA: Educational Testing Service | |
dcterms.bibliographicCitation | Sjöberg, L., Holley, J.W., A measure of similarity between individuals when scoring directions of variables are arbitrary (1967) Multivar. Behav. Res., 2 (3), pp. 377-384. , Sep | |
dcterms.bibliographicCitation | Kendall, M.G., Kendall, S.F.H., Smith, B.B., The distribution of Spearman's coefficient of rank correlation in a universe in which all rankings occur an equal number of times (1939) Biometrika, 30 (3-4), pp. 251-273. , Jan | |
dcterms.bibliographicCitation | Varin, T., Bureau, R., Mueller, C., Willett, P., Clustering files of chemical structures using the Székely-Rizzo generalization of Ward's method (2009) J. Mol. Graph. Modell., 28 (2), pp. 187-195. , Sep | |
dcterms.bibliographicCitation | Rohrer, S.G., Baumann, K., Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data (2009) J. Chem. Inf. Model, 49 (2), pp. 169-184. , Feb | |
dcterms.bibliographicCitation | National Cancer Institute, https://resresources.nci.nih.gov/resources/, Bethesda, MD, USA [Online] | |
dcterms.bibliographicCitation | (2014) JChem for Excel is A Microsoft Excel Integrated Tool Enabling Scientists to Manage and Analyze Chemical Structures and Their Data, , http://www.chemaxon.com, JChem for Excel v. 14.7.2100, Budapest, Hungary. ChemAxon Kft [Online] | |
dcterms.bibliographicCitation | Sadowski, J., Gasteiger, J., Klebe, G., Comparison of automatic three-dimensional model builders using 639 X-ray structures (1994) J. Chem. Inf. Comput. Sci., 34 (4), pp. 1000-1008. , Jul | |
dcterms.bibliographicCitation | (2007) The Software for Molecular Descriptors Calculations DRAGON is Available from Talete Srl, , http://www.talete.mi.it, DRAGON for Windows v. 5.5, Milano, Italy. [Online] | |
dcterms.bibliographicCitation | Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., The WEKA data mining software: An update (2009) SIGKDD Explor. Newsl., 11 (1), pp. 10-18. , Jun | |
dcterms.bibliographicCitation | Guyon, I., Elisseeff, A., An introduction to variable and feature selection (2003) J. Mach. Learn. Res., 3, pp. 1157-1182. , Mar | |
dcterms.bibliographicCitation | Bender, A., Mussa, H.Y., Glen, R.C., Molecular similarity searching using atom environments, information-based feature selection, and a naïve Bayesian classifier (2004) J. Chem. Inf. Comput. Sci., 44 (1), pp. 170-178. , Jan | |
dcterms.bibliographicCitation | Patterson, D.E., Cramer, R.D., Ferguson, A.M., Clark, R.D., Weinberger, L.E., Neighborhood behavior: A useful concept for validation of "molecular diversity" descriptors (1996) J. Med. Chem., 39 (16), pp. 3049-3059. , Aug | |
dcterms.bibliographicCitation | Nikolova, N., Jaworska, J., Approaches to measure chemical similarity-A review (2003) QSAR Comb. Sci., 22 (11), pp. 1006-1026. , Nov | |
dcterms.bibliographicCitation | Cruz-Monteagudo, M., Medina-Franco, J.L., Pérez-Castillo, Y., Nicolotti, O., Cordeiro, M.N., Borges, F., Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? (2014) Drug Discov. Today, 19 (8), pp. 1069-1080. , Aug | |
dcterms.bibliographicCitation | Nasr, R.J., Swamidass, S.J., Baldi, P.F., Large scale study of multiple-molecule queries (2009) J. Cheminf., 1 (7), pp. 1-19. , Jun | |
dcterms.bibliographicCitation | Hert, J., Willett, P., Wilton, D.J., Acklin, P., Azzaoui, K., Jacoby, E., Schuffenhauer, A., Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures (2004) J. Chem. Inf. Comput. Sci., 44 (3), pp. 1177-1185. , Mar | |
dcterms.bibliographicCitation | Swamidass, S.J., Azencott, C.-A., Daily, K., Baldi, P., A CROC stronger than ROC: Measuring, visualizing and optimizing early retrieval (2010) Bioinformatics, 26 (10), pp. 1348-1356. , May | |
dcterms.bibliographicCitation | Truchon, J., Bayly, C.I., Evaluating virtual screening methods: Good and bad metrics for the "early recognition" problem (2007) J. Chem. Inf. Model, 47 (2), pp. 488-508. , Mar | |
dcterms.bibliographicCitation | Apostol, T.M., (1974) Mathematical Analysis, , 2nd ed. Reading, MA, USA: Addison-Wesley | |
dcterms.bibliographicCitation | Bullen, P.S., A dictionary of inequalities (1998) Pitman Monographs and Surveys in Pure and Applied Mathematics 97, p. 296. , Reading, MA, USA: Addison Wesley Longman | |
dcterms.bibliographicCitation | Mitrinović, D.S., Vasić, P.M., (1970) Analytic Inequalities, , Berlin, Germany: Springer-Verlag | |
dcterms.bibliographicCitation | Iman, R.L., Davenport, J.M., Approximations of the critical region of the Friedman's statistic (1980) Commun. Stat. Theory, 9 (6), pp. 571-595. , Jan | |
dcterms.bibliographicCitation | Demšar, J., Statistical comparisons of classifiers over multiple data sets (2006) J. Mach. Learn. Res., 7, pp. 1-30. , Jan | |
dcterms.bibliographicCitation | García, S., Fernández, A., Luengo, J., Herrera, F., A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability (2009) Soft Comput., 13 (10), pp. 959-977. , Aug | |
dcterms.bibliographicCitation | Li, J., A two-step rejection procedure for testing multiple hypotheses (2008) J. Stat. Planning Inference, 138 (6), pp. 1521-1527. , Jul | |
dcterms.bibliographicCitation | Willett, P., The calculation of molecular structural similarity: Principles and practice (2014) Mol. Inf., 33 (6-7), pp. 403-413. , Apr | |
dcterms.bibliographicCitation | Tiikkainen, P., Markt, P., Wolber, G., Kirchmair, J., Distinto, S., Poso, A., Kallioniemi, O., Critical comparison of virtual screening methods against the MUV data set (2009) J. Chem. Inf. Model, 49 (10), pp. 2168-2178. , Oct | |
dcterms.bibliographicCitation | Rosenbaum, L., Hinselmann, G., Jahn, A., Zell, A., Interpreting linear support vector machine models with heat map molecule coloring (2011) J. Cheminf., 3 (1), p. 12. , Mar | |
dcterms.bibliographicCitation | Riniker, S., Landrum, G., Open-source platform to benchmark fingerprints for ligand-based virtual screening (2013) J. Cheminf., 5 (1), p. 17. , May | |
dcterms.bibliographicCitation | Hinselmann, G., Rosenbaum, L., Jahn, A., Fechner, N., Ostermann, C., Zell, A., Large-scale learning of structure-activity relationships using a linear support vector machine and problem-specific metrics (2011) J. Chem. Inf. Model, 51 (2), pp. 203-213. , Feb | |
dcterms.bibliographicCitation | Gardiner, E.J., Holliday, J.D., O'Dowd, C., Willett, P., Effectiveness of 2D fingerprints for scaffold hopping (2011) Future Med. Chem., 3 (4), pp. 405-414. , Mar | |
dcterms.bibliographicCitation | Ahmed, A., Saeed, F., Salim, N., Abdo, A., Condorcet and Borda count fusion method for ligand-based virtual screening (2014) J. Cheminf., 6 (1), p. 10 | |
dcterms.bibliographicCitation | Duesbury, E.V., Holliday, J., Willett, P., Maximum common substructure-based data fusion in similarity searching (2015) J. Chem. Inf. Model, 55 (2), pp. 222-230 | |
dcterms.bibliographicCitation | Hallgren, K.A., Computing inter-rater reliability for observational data: An overview and tutorial (2012) Quant. Meth. Psych., 8 (1), pp. 23-34. , Jan | |
dcterms.bibliographicCitation | Willett, P., Combination of similarity rankings using data fusion (2013) J. Chem. Inf. Model, 53 (1), pp. 1-10. , Jan | |
dcterms.bibliographicCitation | Cao, Y., Jiang, T., Girke, T., Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing (2010) Bioinformatics, 26 (7), pp. 953-959. , Apr | |
oaire.resourceType | http://purl.org/coar/resource_type/c_6501 | |
oaire.version | http://purl.org/coar/version/c_970fb48d4fbd8a85 |