Automatic detection of spelling variation in historical corpus: An application to build a Brazilian Portuguese spelling variants dictionary
| Content Provider | Semantic Scholar |
|---|---|
| Author | Giusti, Rafael; Candido, Arnaldo; Muniz, Marcelo; Cucatto, Livia; Aluísio, Sandra M. |
| Copyright Year | 2007 |
| Abstract | The Historical Dictionary of Brazilian Portuguese (HDBP), the first of its kind, is based on a corpus of Brazilian Portuguese (BP) texts from the sixteenth through the eighteenth centuries (and some texts from the beginning of the nineteenth century), and is being developed under the sponsorship of the Brazilian funding agency CNPq (Conselho Nacional de Desenvolvimento Cientifico e Tecnologico). It is a three-year project that started in 2006 to fill a gap in Brazilian culture with a dictionary describing the vocabulary of Brazilian Portuguese from the beginning of the country’s history. The corpus totals more than 3,000 texts with approximately 7.5 million words. Our working corpus, i.e. the corpus already processed by the corpus processing system UNITEX (http://www-igm.univ-mlv.fr/~unitex/), is coded in Unicode (UTF-16) and totals 1,733 texts, 57.1 MB, and 4.9 million words. A difficulty in dealing with historical corpora to carry out lexicographic tasks is the identification of all spelling variants of a specific word, since spelling variation distorts frequency counts, a usual criterion for selecting dictionary entries. In our project, another challenge is to select all variants of a dictionary entry that occur in the corpus, both to illustrate the absence of an orthographical system in the aforementioned centuries and to provide example sentences for them. This paper introduces both an approach based on transformation rules to cluster distinct spelling variations around a common form, which is not always the orthographic (or modern) form, and the choices made to build a dictionary of spelling variants of BP based on these clusters. Currently, we have forty-three manually developed rules, which generated 12,189 clusters of spelling variants, totalling 27,199 variants from our working corpus. After a careful analysis of these clusters, we adopted the DELA format to build our dictionary. The BP dictionary of spelling variants enables sophisticated searches in the historical corpus using UNITEX, supporting the construction of the main dictionary of the HDBP project. Moreover, the variants of a given word can be searched using an application named Dicionario, which we have developed to display dictionaries in DELA format. As we also use Philologic (http://philologic.uchicago.edu/index.php) to support the building of the HDBP, we carried out a comparative evaluation between our approach to clustering distinct spelling variants and AGREP (http://www.tgries.de/agrep/), which is used in Philologic to check for similar or alternative spellings. Author affiliation: University of Sao Paulo, NILC, CP 668, 13560-970, Sao Carlos/SP, Brazil. E-mail: rg@grad.icmc.usp.br, arnaldoc@icmc.usp.br, marcelo.muniz@gmail.com, liviacucatto@yahoo.com.br, sandra@icmc.usp.br |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://ucrel.lancs.ac.uk/publications/cl2007/paper/238_Paper.pdf |
| Alternate Webpage(s) | http://www.researchgate.net/profile/Sandra_Aluisio/publication/228527552_Automatic_detection_of_spelling_variation_in_historical_corpus_An_application_to_build_a_Brazilian_Portuguese_spelling_variants_dictionary/links/02bfe510834f764be1000000.pdf |
| Alternate Webpage(s) | http://www.nilc.icmc.usp.br/nilc/download/corpus_linguistics_2007.pdf |
| Alternate Webpage(s) | http://www.nilc.icmc.usp.br/nilc/projects/hpc/cl_2007/cl_2007_presentation.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
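
The abstract describes clustering spelling variants around a common form by applying transformation rules. A minimal Python sketch of that general idea follows. The three rules, the names `RULES`, `normalize`, and `cluster_variants`, and the sample words are illustrative assumptions; they are not the forty-three rules from the paper, and the output is not in DELA format.

```python
import re
from collections import defaultdict

# Hypothetical transformation rules (pattern, replacement), applied in order.
# These are illustrative assumptions, not the rules developed in the paper.
RULES = [
    (re.compile(r"ph"), "f"),              # e.g. "pharmacia" -> "farmacia"
    (re.compile(r"y"), "i"),               # e.g. "rey" -> "rei"
    (re.compile(r"([lntpf])\1"), r"\1"),   # collapse some doubled consonants
]


def normalize(word):
    """Map a word to its common form by applying every rule in sequence."""
    form = word.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def cluster_variants(words):
    """Group words that share a common form; keep clusters with 2+ variants."""
    clusters = defaultdict(set)
    for word in words:
        clusters[normalize(word)].add(word.lower())
    return {form: sorted(variants)
            for form, variants in clusters.items()
            if len(variants) > 1}


if __name__ == "__main__":
    sample = ["pharmacia", "farmacia", "elle", "ele", "rey", "rei", "casa"]
    for form, variants in cluster_variants(sample).items():
        print(form, variants)
```

Keying each cluster on the rule output rather than on a modern dictionary lookup matches the abstract's remark that the common form is not always the orthographic form. Words with no other variant in the input, such as "casa" above, are left out of the result.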