An automatic approach to generate corpus in Spanish

Puertas E.; Alvarado‑Valencia, Jorge Andres; Moreno-Sandoval L.G.; Pomares-Quimbaya A.

Date

2018

Author(s)

Puertas E.

Alvarado‑Valencia, Jorge Andres

Moreno-Sandoval L.G.

Pomares-Quimbaya A.

Abstract

A corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating corpora from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, in order to define the quality of the extraction. © Springer Nature Switzerland AG 2018.
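The abstract does not specify the propagation algorithm, so the following is only a minimal sketch of one way seed-based category propagation could work: a breadth-first walk over a category graph, starting from the seeds and stopping after a fixed number of hops. The function `propagate_categories`, the `max_depth` parameter, and the toy graph are all hypothetical and are not taken from the paper; a real system would read subcategory links from Wikipedia rather than from an in-memory dictionary.

```python
from collections import deque


def propagate_categories(graph, seeds, max_depth=2):
    """Return {category: depth} for categories reachable from the seeds.

    graph maps a category name to a list of its subcategories.
    Propagation stops max_depth hops away from any seed (an assumed
    way of delimiting the domain, not the paper's stated rule).
    """
    reached = {seed: 0 for seed in seeds}
    queue = deque(reached)
    while queue:
        category = queue.popleft()
        depth = reached[category]
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for sub in graph.get(category, ()):
            if sub not in reached:
                reached[sub] = depth + 1
                queue.append(sub)
    return reached


# Hypothetical toy category graph (category -> subcategories).
TOY_GRAPH = {
    "Medicina": ["Enfermedades", "Vacunas"],
    "Enfermedades": ["Enfermedades infecciosas"],
    "Vacunas": [],
    "Enfermedades infecciosas": ["Virus"],
    "Virus": [],
    "Deporte": ["Ciclismo"],
    "Ciclismo": [],
}

domain = propagate_categories(TOY_GRAPH, ["Medicina"], max_depth=2)
for name, depth in domain.items():
    print(depth, name)
```

In this sketch the seed "Medicina" pulls in its subcategories up to two hops away, while "Virus" (three hops) and the unrelated "Deporte" branch stay outside the domain. The depth limit is one simple way to keep the search bounded; a score-based cutoff would be another.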

Cite as

Communications in Computer and Information Science; Vol. 885, pp. 150-161

Use this address to cite:

https://hdl.handle.net/20.500.12585/8916

Collections

Research products [1460]

http://creativecommons.org/licenses/by-nc-nd/4.0/
