Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets

dc.creator: Rivera-Borroto O.M.
dc.creator: García-De La Vega J.M.
dc.creator: Marrero-Ponce Y.
dc.creator: Grau R.
dc.date.accessioned: 2020-03-26T16:32:45Z
dc.date.available: 2020-03-26T16:32:45Z
dc.date.issued: 2016
dc.identifier.citation: IEEE/ACM Transactions on Computational Biology and Bioinformatics; Vol. 13, No. 1; pp. 158-167
dc.identifier.issn: 1545-5963
dc.identifier.uri: https://hdl.handle.net/20.500.12585/9004
dc.description.abstract: Research on similarity searching of cheminformatic data sets has focused on similarity measures that use fingerprints. However, nominal scales are the least informative of all metric scales, which increases the number of tied similarity scores and decreases the effectiveness of retrieval engines. Tanimoto's coefficient has been claimed to be the most prominent measure for this task. Nevertheless, the field is far from exhausted, since the no free lunch theorem of computer science predicts that "no similarity measure has overall superiority over the population of data sets". We introduce 12 relational agreement (RA) coefficients for seven metric scales, which are integrated within a group fusion-based similarity searching algorithm. These similarity measures are compared to a reference panel of 21 proximity quantifiers over 17 benchmark data sets (MUV), using informative descriptors, a feature selection stage, a suitable performance metric, and powerful comparison tests. In this stage, RA coefficients perform favourably with respect to the state-of-the-art proximity measures. Afterward, the RA-based method outperforms four other nearest neighbor searching algorithms over the same data domains. In a third validation stage, RA measures are successfully applied to the virtual screening of the NCI data set. Finally, we discuss a possible molecular interpretation for these similarity variants. © 2016 IEEE.
dc.format.medium: Electronic resource
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers Inc.
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84962028690&doi=10.1109%2fTCBB.2015.2424435&partnerID=40&md5=fbef0edaa9b5080d13f6b2c9480cf72b
dc.title: Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets
datacite.rights: http://purl.org/coar/access_right/c_16ec
oaire.resourceType: http://purl.org/coar/resource_type/c_6501
oaire.version: http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.driver: info:eu-repo/semantics/article
dc.type.hasVersion: info:eu-repo/semantics/publishedVersion
dc.identifier.doi: 10.1109/TCBB.2015.2424435
dc.subject.keywords: Chemistry
dc.subject.keywords: Reliability
dc.subject.keywords: Similarity measures
dc.subject.keywords: Sorting and searching
dc.subject.keywords: Benchmarking
dc.subject.keywords: Nearest neighbor search
dc.subject.keywords: Four-nearest-neighbors
dc.subject.keywords: Molecular interpretation
dc.subject.keywords: No free lunch theorem
dc.subject.keywords: Performance metrics
dc.subject.keywords: Proximity measure
dc.subject.keywords: Similarity measure
dc.subject.keywords: Similarity Searching
dc.subject.keywords: Population statistics
dc.subject.keywords: Algorithm
dc.subject.keywords: Chemical database
dc.subject.keywords: Data mining
dc.subject.keywords: Information science
dc.subject.keywords: Procedures
dc.subject.keywords: Algorithms
dc.subject.keywords: Databases, Chemical
dc.subject.keywords: Informatics
dc.rights.accessRights: info:eu-repo/semantics/restrictedAccess
dc.rights.cc: Attribution-NonCommercial 4.0 International
dc.identifier.instname: Universidad Tecnológica de Bolívar
dc.identifier.reponame: Repositorio UTB
dc.type.spa: Article
dc.identifier.orcid: 24436944800
dc.identifier.orcid: 57188713140
dc.identifier.orcid: 55665599200
dc.identifier.orcid: 57193746355



There are no files associated with this item.

Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by-nc-nd/4.0/