Daniverg’s Blog

mayo 16, 2009

What is a corpus?

Filed under: LR — Daniel Vergara @ 8:56 am

In linguistics, a corpus is a large and structured set of texts (now usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe. They have also multiple examples of each word in different contexts and each example is categorized by a certain code, specifying the date of release and the name of the magazine in which that sentence has been published.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.

If you want to find out more about English corpora or find links to famous English Corpora visit our wiki page: http://wiki.littera.deusto.es/en/index.php/Lr0809/I

Dejar un comentario »

Aún no hay comentarios.

RSS feed for comments on this post. TrackBack URI

Responder

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s

Crea un blog o un sitio web gratuitos con WordPress.com.

A %d blogueros les gusta esto: