Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus (1998)
| Content Provider | CiteSeerX |
|---|---|
| Author | Yamamoto, Mikio; Church, Kenneth W. |
| Abstract | Mutual Information (MI) and similar measures are often used in corpus-based linguistics to find interesting ngrams. MI looks for bigrams whose term frequency (tf) is larger than chance. Residual Inverse Document Frequency (RIDF) is similar, but it looks for ngrams whose document frequency (df) is larger than chance. Previous studies have tended to focus on relatively short ngrams, typically bigrams and trigrams. In this paper, we will show that this approach can be extended to arbitrarily long ngrams. Using suffix arrays, we were able to compute tf, df and RIDF for all ngrams in two large corpora: an English corpus of 50 million words of Wall Street Journal news articles and a Japanese corpus of 216 million characters of Mainichi Shimbun news articles. |
| File Format | |
| Volume Number | 27 |
| Journal | Computational Linguistics |
| Language | English |
| Publisher Date | 1998-01-01 |
| Publisher Institution | Mikio Yamamoto University |
| Access Restriction | Open |
| Subject Keyword | Suffix Array; Document Frequency; Compute Term Frequency; English Corpus; Mutual Information; Previous Study; Interesting Ngrams; Similar Measure; Large Corpus; Corpus-based Linguistics; Residual Inverse Document Frequency; Mainichi Shimbun News Article; Short Ngrams; Term Frequency; Japanese Corpus; Wall Street Journal News Article |
| Content Type | Text |
| Resource Type | Article |
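
To make the quantities named in the abstract concrete, the following is a minimal sketch of how tf and df for a single substring can be read off a suffix array, and how RIDF follows from them. It is not the paper's algorithm: the construction here is a naive sort, and each query is answered by binary search rather than by any method for handling all substrings at once. The function names (`build_suffix_array`, `tf_df`, `ridf`), the `"\x00"` document separator, and the toy corpus are assumptions made for illustration. RIDF is taken in its usual form, the observed IDF minus the IDF predicted by a Poisson model: `-log2(df/D) + log2(1 - exp(-tf/D))`, where `D` is the number of documents.

```python
import math


def build_suffix_array(docs, sep="\x00"):
    """Naive suffix array over the concatenated corpus (fine for toy inputs only).

    `sep` is an assumed separator that must not occur in any document,
    so that no match can span a document boundary.
    """
    text = sep.join(docs) + sep
    # doc_of[i] is the index of the document that owns text position i.
    doc_of = []
    for d, doc in enumerate(docs):
        doc_of.extend([d] * (len(doc) + 1))  # +1 covers the separator
    # Sort suffix start positions by the suffix they point to.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def tf_df(text, sa, doc_of, term):
    """Return (tf, df) for `term` using two binary searches on the suffix array."""
    n = len(term)

    def prefix(k):
        return text[sa[k]:sa[k] + n]

    # First suffix whose length-n prefix is >= term.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < term:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-n prefix is > term.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= term:
            lo = mid + 1
        else:
            hi = mid
    end = lo

    tf = end - start  # every suffix in [start, end) is one occurrence
    df = len({doc_of[sa[k]] for k in range(start, end)})  # distinct documents
    return tf, df


def ridf(tf, df, num_docs):
    """Residual IDF: observed IDF minus the IDF expected under a Poisson model."""
    return -math.log2(df / num_docs) + math.log2(1 - math.exp(-tf / num_docs))


if __name__ == "__main__":
    docs = ["to be or not to be", "to do", "be"]  # toy corpus, not from the paper
    text, sa, doc_of = build_suffix_array(docs)
    for term in ["to", "be", "to be"]:
        tf, df = tf_df(text, sa, doc_of, term)
        print(term, tf, df, round(ridf(tf, df, len(docs)), 3))
```

All occurrences of a substring sit in one contiguous block of the suffix array, so tf is just the block length, while df requires counting the distinct documents inside that block. The naive sort and the per-query set construction would not scale to corpora of the size described in the abstract.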