Publication:
An automatic approach to generate corpus in Spanish

datacite.rightshttp://purl.org/coar/access_right/c_16ec
dc.contributor.authorPuertas Del Castillo, Edwin Alexander
dc.contributor.editorMartínez Santos, Juan Carlos
dc.contributor.editorSerrano Castañeda, Jairo Enrique
dc.creatorAlvarado Valencia, Jorge Andrés
dc.creatorMoreno Sandoval, L.G.
dc.creatorPomares Quimbaya, A.
dc.date.accessioned2020-03-26T16:32:36Z
dc.date.available2020-03-26T16:32:36Z
dc.date.issued2018
dc.description.abstractA corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain, using a propagation algorithm to determine the categories associated with a domain region and a set of seeds to delimit the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction. © Springer Nature Switzerland AG 2018.
dc.description.notesAcknowledgements. The tool presented was carried out within the construction of research capabilities of the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), led by the Pontificia Universidad Javeriana, funded by the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC).
dc.description.sponsorshipPontificia Universidad Javeriana
dc.format.mediumElectronic resource
dc.format.mimetypeapplication/pdf
dc.identifier.citationCommunications in Computer and Information Science; Vol. 885, pp. 150-161
dc.identifier.doi10.1007/978-3-319-98998-3_12
dc.identifier.instnameUniversidad Tecnológica de Bolívar
dc.identifier.isbn9783319989976
dc.identifier.issn18650929
dc.identifier.orcid57202285682
dc.identifier.orcid8738428200
dc.identifier.orcid57194828933
dc.identifier.orcid57203852380
dc.identifier.reponameRepositorio UTB
dc.identifier.urihttps://hdl.handle.net/20.500.12585/8916
dc.language.isoeng
dc.publisherSpringer Verlag
dc.relation.conferencedate26 September 2018 through 28 September 2018
dc.rights.accessrightsinfo:eu-repo/semantics/restrictedAccess
dc.rights.ccAttribution-NonCommercial 4.0 International
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/
dc.sourcehttps://www.scopus.com/inward/record.uri?eid=2-s2.0-85054377708&doi=10.1007%2f978-3-319-98998-3_12&partnerID=40&md5=d8689ca7ab863965c5539711ded485c1
dc.source.event13th Colombian Conference on Computing, CCC 2018
dc.subject.keywordsCorpus
dc.subject.keywordsKnowledge extraction
dc.subject.keywordsLinguistic computational
dc.subject.keywordsNatural language processing
dc.subject.keywordsText mining
dc.subject.keywordsData mining
dc.subject.keywordsExtraction
dc.subject.keywordsNatural language processing systems
dc.subject.keywordsTellurium compounds
dc.subject.keywordsWebsites
dc.subject.keywordsAutomatic approaches
dc.subject.keywordsDigital information
dc.subject.keywordsLinguistic resources
dc.subject.keywordsPropagation algorithm
dc.subject.keywordsWikipedia
dc.subject.keywordsLinguistics
dc.titleAn automatic approach to generate corpus in Spanish
dc.typeConference
dc.type.driverinfo:eu-repo/semantics/conferenceObject
dc.type.hasversioninfo:eu-repo/semantics/publishedVersion
dspace.entity.typePublication
oaire.resourceTypehttp://purl.org/coar/resource_type/c_c94f
oaire.versionhttp://purl.org/coar/version/c_970fb48d4fbd8a85
relation.isAuthorOfPublication84e86005-e232-4d13-ab38-68f0f2b4aeb0
relation.isAuthorOfPublication.latestForDiscovery84e86005-e232-4d13-ab38-68f0f2b4aeb0
relation.isEditorOfPublication35de2f55-a620-47ac-97f2-9961adeac601
relation.isEditorOfPublicationdb6967a8-73d5-4623-92c5-5e62d5ad495c
relation.isEditorOfPublication.latestForDiscovery35de2f55-a620-47ac-97f2-9961adeac601

Files