An automatic approach to generate corpus in Spanish

Puertas E.; Alvarado‑Valencia, Jorge Andres; Moreno-Sandoval L.G.; Pomares-Quimbaya A.

dc.contributor.editor	Serrano C. J.E.
dc.contributor.editor	Martínez-Santos, Juan Carlos
dc.creator	Puertas E.
dc.creator	Alvarado‑Valencia, Jorge Andres
dc.creator	Moreno-Sandoval L.G.
dc.creator	Pomares-Quimbaya A.
dc.date.accessioned	2020-03-26T16:32:36Z
dc.date.available	2020-03-26T16:32:36Z
dc.date.issued	2018
dc.identifier.citation	Communications in Computer and Information Science; Vol. 885, pp. 150-161
dc.identifier.isbn	9783319989976
dc.identifier.issn	18650929
dc.identifier.uri	https://hdl.handle.net/20.500.12585/8916
dc.description.abstract	A corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction. © Springer Nature Switzerland AG 2018.	eng
dc.description.sponsorship	Pontificia Universidad Javeriana
dc.format.medium	Electronic resource
dc.format.mimetype	application/pdf
dc.language.iso	eng
dc.publisher	Springer Verlag
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85054377708&doi=10.1007%2f978-3-319-98998-3_12&partnerID=40&md5=d8689ca7ab863965c5539711ded485c1
dc.title	An automatic approach to generate corpus in Spanish
datacite.rights	http://purl.org/coar/access_right/c_16ec
oaire.resourceType	http://purl.org/coar/resource_type/c_c94f
oaire.version	http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.source.event	13th Colombian Conference on Computing, CCC 2018
dc.type.driver	info:eu-repo/semantics/conferenceObject
dc.type.hasversion	info:eu-repo/semantics/publishedVersion
dc.identifier.doi	10.1007/978-3-319-98998-3_12
dc.subject.keywords	Corpus
dc.subject.keywords	Knowledge extraction
dc.subject.keywords	Computational linguistics
dc.subject.keywords	Natural language processing
dc.subject.keywords	Text mining
dc.subject.keywords	Data mining
dc.subject.keywords	Extraction
dc.subject.keywords	Natural language processing systems
dc.subject.keywords	Tellurium compounds
dc.subject.keywords	Websites
dc.subject.keywords	Automatic approaches
dc.subject.keywords	Digital information
dc.subject.keywords	Linguistic resources
dc.subject.keywords	Propagation algorithm
dc.subject.keywords	Wikipedia
dc.subject.keywords	Linguistics
dc.rights.accessrights	info:eu-repo/semantics/restrictedAccess
dc.rights.cc	Attribution-NonCommercial 4.0 International
dc.identifier.instname	Universidad Tecnológica de Bolívar
dc.identifier.reponame	Repositorio UTB
dc.description.notes	Acknowledgements. The tool presented was developed as part of building the research capabilities of the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), led by the Pontificia Universidad Javeriana and funded by the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC).
dc.relation.conferencedate	26 September 2018 through 28 September 2018
dc.type.spa	Conference
dc.identifier.orcid	57202285682
dc.identifier.orcid	8738428200
dc.identifier.orcid	57194828933
dc.identifier.orcid	57203852380
