Daniverg’s Blog

May 18, 2009

Online Dictionaries.

Filed under: LR — Daniel Vergara @ 10:54 am

We live in a fast-paced society, and technology develops more and more each day. Dictionaries are changing in the same way: people want to get information fast, and therefore they use online dictionaries.

Online dictionaries are the replacement for book dictionaries. You type the word you are searching for and you instantly find the definition of the word.

I found a webpage with links to different online dictionaries that might interest the reader. Retrieved 12:14, May 5, 2009 from http://math-www.uni-paderborn.de/HTML/Dictionaries.html

May 17, 2009

Conclusion after using both: The BNC and the CCAE.

Filed under: LR — Daniel Vergara @ 11:17 am

After using both systems, I must say that, although it seems a little more complex at first, I prefer the Corpus of Contemporary American English to the British National Corpus.

  • Aesthetically: It seems that the American National Corpus has a long way to go before it becomes something like the BNC. The latter is clearer in its presentation and easier to understand, whereas the former seems more technical (for expert users). In the BNC everything is well organized and it is easier to find our way around the page. In the ANC everything gets a little fuzzier and it is not difficult to get lost in the technical jargon it uses.
  • Technically (Search Engine): The first advantage of the British National Corpus is that its search engine is online, whereas the ANC's has to be downloaded and installed on your computer (approximately 4 GB). That is a great disadvantage of the ANC, because it is tedious to install that amount of data, which you might never need, on your computer. However, there is an alternative to the ANC: the Corpus of Contemporary American English, which seems to be a more sophisticated way of searching for words. The BNC is easier to use: within one click you can find whatever you are looking for, and you will be shown 50 examples of the word you searched for in different contexts. The sources of the phrases shown in the BNC are fully acknowledged, which is also one of the features of the ANC. As a matter of fact, the search engine used in the BNC is SARA and the one used in the ANC is Xaira.

All in all, we will be using both in our project, so we have to say that they both work quite well when dealing with words and phrases in different contexts. Personally, we think that, despite the registration procedure required by the CCAE, it is the best corpus.

May 16, 2009

How can I search a corpus?

Filed under: LR — Daniel Vergara @ 1:43 pm

I am going to explain how a corpus search engine works. For the example I am going to use the Corpus of Contemporary American English.

1.- We go to the corpus's webpage.

2.- After that, you register or log in; otherwise the webpage won't let you search for anything.

3.- We search for a word; it could be anything. In this case I am going to take the word analyze (American spelling). You type the word in the search box and press SEARCH. The word you have searched for will appear in the chart next to the search box.

4.- If you click on the word you searched for, you will be given numerous real examples of the word used in different contexts. The results will be displayed below the chart.

5.- If you want to compare two words, for example analyze and analyse, you go to ‘DISPLAY’ and press ‘COMPARE WORDS’ instead of ‘LIST’. The search form will change slightly, showing two boxes for entering words. You put the words you want to search for in those boxes and press ‘SEARCH’; you will be shown another chart with the two words and their examples.
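
What the corpus does in steps 3 to 5 can be imitated on a very small scale. The sketch below is not how the Corpus of Contemporary American English is implemented; it is only a minimal illustration, assuming a tiny made-up list of sentences, of a keyword-in-context listing and a two-word comparison. The names `SENTENCES`, `tokenize`, `concordance` and `compare_words` are hypothetical.

```python
import re

# A tiny made-up "corpus"; a real corpus holds millions of words.
SENTENCES = [
    "We analyze the data every week.",
    "They analyse the results in London.",
    "Scientists analyze samples from the river.",
    "You should analyze the text before you translate it.",
]


def tokenize(sentence):
    """Split a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())


def concordance(word, sentences, width=2):
    """Return each occurrence of `word` with up to `width` words of context per side."""
    lines = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == word:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{token}] {right}".strip())
    return lines


def compare_words(first, second, sentences):
    """Count how often each of two words occurs (COMPARE WORDS in miniature)."""
    counts = {first: 0, second: 0}
    for sentence in sentences:
        for token in tokenize(sentence):
            if token in counts:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    for line in concordance("analyze", SENTENCES):
        print(line)
    print(compare_words("analyze", "analyse", SENTENCES))
```

Running the script prints one line per occurrence of analyze, with the search word in square brackets, followed by the counts for the two spellings.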


Getting to know SARA and Xaira better…

Filed under: LR — Daniel Vergara @ 9:34 am
‘SARA (SGML Aware Retrieval Application) was developed specifically for access to the BNC in a Microsoft Windows environment. It is freely available to all BNC licensees and also for registered users of the BNC Subscription service hosted by the British Library. A copy of SARA is delivered with every copy of the BNC World corpus. You can also download the latest version here. The SARA webpage offers more information about SARA.

The Xaira program derives from SARA but has been developed further. It can be used on all well-formed corpora in XML. The BNC XML Edition, BNC Baby, and BNC Sampler corpora are delivered with a copy of Xaira. You can also download the latest version of XAIRA from SourceForge.net. More information about Xaira can be found on the Xaira webpage.’ Information retrieved at 10:43, May 16, 2009 from http://www.natcorp.ox.ac.uk/tools/index.xml.


SARA allows investigations on the content and structure of a corpus. The precise searches and enquiries possible on a given corpus will of course depend upon the nature and completeness of the markup applied to it. However, indicatively, SARA supports features such as the following:

- Searches on words, truncated words and phrases
- Searches on SGML tags, attributes
- Combinatorial Boolean operations
- Frequency counts
- Lexicon, to allow identification of similar words (e.g. gumboot, gum-boot, gum-boots etc.)
- Storing searches
- Limiting scope of queries
- Presentation of results:
  - With or without SGML markup
  - Page or concordance format
  - Optional use of colour to enhance display

By way of illustration, some of the markup in the BNC relates to the social class of the speaker (in the case of spoken words); markup is also used to signify parts of speech. Thus, in the case of the BNC, SARA can be used to formulate a query equivalent to: How often do speakers of social class C1 use the word “input” as a verb?
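
To make that kind of query more concrete, here is a minimal sketch in Python. It is not SARA's query language and it does not use the real BNC markup: the XML is an invented, simplified format in which each utterance `<u>` carries a hypothetical `class` attribute for the speaker's social class and each word `<w>` a hypothetical `pos` attribute for its part of speech. The function name `count_word` is also made up for the example.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup (NOT the real BNC schema):
# <u class="..."> is an utterance by a speaker of that social class,
# <w pos="..."> is a word together with its part of speech.
SAMPLE = """
<corpus>
  <u class="C1"><w pos="PRON">we</w> <w pos="VERB">input</w> <w pos="DET">the</w> <w pos="NOUN">figures</w></u>
  <u class="C1"><w pos="DET">the</w> <w pos="NOUN">input</w> <w pos="VERB">was</w> <w pos="ADJ">wrong</w></u>
  <u class="AB"><w pos="PRON">they</w> <w pos="VERB">input</w> <w pos="PRON">it</w></u>
  <u class="C1"><w pos="VERB">input</w> <w pos="DET">the</w> <w pos="NOUN">data</w></u>
</corpus>
"""


def count_word(root, word, pos, social_class):
    """Count occurrences of `word` tagged `pos` in utterances of `social_class`."""
    total = 0
    for utterance in root.iter("u"):
        if utterance.get("class") != social_class:
            continue  # skip speakers of other social classes
        for w in utterance.iter("w"):
            if w.get("pos") == pos and (w.text or "").lower() == word:
                total += 1
    return total


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    print(count_word(root, "input", "VERB", "C1"))
```

The point of the sketch is that the markup does the work: without the `class` and `pos` attributes, the program could only count the string "input", not tell who said it or whether it was a verb.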

Xaira is essentially SARA with more features, since it is a more recent, modified version of it.

Different kinds of Corpora.

Filed under: LR — Daniel Vergara @ 9:18 am

I am going to give a brief introduction to these different kinds of corpora. It is not going to be technical, because if we wanted technical examples we would go to the pages themselves. It is going to be a description from my own experience: how these systems work and what we can find in them.

The British National Corpus: A corpus of British English, both spoken and written. The search engine is simple: you type a word and the corpus will give about 50 examples of that word in different contexts.

The American National Corpus: A corpus of American English, both spoken and written. It has two releases: the first and the second. The problem is that the page is not up to date and therefore the search engine is not working. That's why we had to go to another webpage, the Corpus of Contemporary American English, which works wonders, I must say.

The Corpus of Contemporary American English: A very powerful system which combines the best of the BNC and the ANC. The result is a great search system in which you can not only find different examples of words in American English, but also compare two words and their different uses in context. A great webpage for students or teachers. The only problem is that it requires registration, but the process is very simple. I did the registration myself and it only took me 3 minutes.

The International Corpus of English: A corpus which has examples of the different varieties of English.

As a matter of fact, the search engine used in the BNC is SARA and the one used in the ANC is Xaira.

Check out our wiki page, the bibliography part, for links to corpora: http://wiki.littera.deusto.es/en/index.php/Lr0809/I

What is a corpus?

Filed under: LR — Daniel Vergara @ 8:56 am

In linguistics, a corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe. They also contain multiple examples of each word in different contexts, and each example is labelled with a code specifying the date of publication and the name of the source (for example, a magazine) in which that sentence was published.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
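
As a rough picture of what "aligned" means, a parallel corpus can be thought of as a list of sentence pairs. The sketch below uses a few made-up English and Spanish pairs and a hypothetical helper, `find_translations`; real aligned corpora are far larger and use dedicated formats.

```python
# Made-up sentence pairs: each English sentence is aligned with its Spanish counterpart.
ALIGNED = [
    ("The dictionary is online.", "El diccionario está en línea."),
    ("I use the dictionary every day.", "Uso el diccionario todos los días."),
    ("The corpus is very large.", "El corpus es muy grande."),
]


def find_translations(word, pairs):
    """Return the aligned pairs whose English side contains `word` (case-insensitive)."""
    word = word.lower()
    hits = []
    for english, spanish in pairs:
        # Strip basic punctuation so "dictionary." still matches "dictionary".
        tokens = [t.strip(".,;:!?").lower() for t in english.split()]
        if word in tokens:
            hits.append((english, spanish))
    return hits


if __name__ == "__main__":
    for english, spanish in find_translations("dictionary", ALIGNED):
        print(english, "|", spanish)
```

Because the sentences are stored side by side, looking up a word in one language immediately shows how its sentence was rendered in the other.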

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
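
A frequency list is simply a count of how often each word appears, sorted from most to least frequent. Here is a minimal sketch, assuming a short made-up text in place of a real corpus; `TEXT` and `frequency_list` are hypothetical names used only for this example.

```python
import re
from collections import Counter

# A short made-up text standing in for a corpus.
TEXT = (
    "The cat sat on the mat. The dog saw the cat. "
    "A dog and a cat can be friends."
)


def frequency_list(text, top=2):
    """Return the `top` most frequent words in `text` with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top)


if __name__ == "__main__":
    for word, count in frequency_list(TEXT):
        print(word, count)
```

A teacher could use a list like this, built from a real corpus, to decide which words students should learn first.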

If you want to find out more about English corpora or find links to well-known English corpora, visit our wiki page: http://wiki.littera.deusto.es/en/index.php/Lr0809/I
