NDLI: Short Text Classification based on feature extension using The N-Gram model

Content Provider	IEEE Xplore Digital Library
Author	Xinwei Zhang; Bin Wu
Copyright Year	2015
Description	Author affiliation: Beijing Key Lab. of Intell. Telecommun. Software & Multimedia, Beijing Univ. of Posts & Telecommun., Beijing, China (Xinwei Zhang; Bin Wu)
Abstract	With the rapid development of Web 2.0, more and more people share their lives and opinions on social media websites and forums such as Weibo, Twitter and Tianya, which produce masses of short texts. To manage these short texts effectively, Short Text Classification has become an important branch of Text Classification. However, because of the short text length, the lack of signals, and the sparseness of features, it is very difficult to achieve high-quality classification with conventional methods. This paper proposes a novel feature extension method based on the N-Gram model to address the problem of feature sparseness. From continuous word sequences in the training set, we extract n-grams as our feature extension pattern library. Then, using the features that appear in a short text, we compute the occurrence probability of other words that do not appear in the original text. We apply our extension method to a data set collected from Sina Weibo. After extending the features of the original short texts, we use the Naïve Bayes algorithm to train and evaluate a classifier. We use precision, recall and F1-score to evaluate our work. The results show that the extension method based on the N-Gram model noticeably improves classification performance.
Starting Page	710
Ending Page	716
File Size	517081
Page Count	7
File Format	PDF
e-ISBN	9781467376822
DOI	10.1109/FSKD.2015.7382029
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2015-08-15
Publisher Place	China
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Naïve Bayes; Feature Extension; Short Text; Computational modeling; Semantics; Text categorization; Classification; Feature extraction; Libraries; Classification algorithms; Internet; The N-Gram Model
Content Type	Text
Resource Type	Article
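
The pipeline outlined in the abstract (build an n-gram library from the training set, add likely but absent words to each short text, then classify with Naïve Bayes) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the use of bigrams only, the conditional-probability estimate, the `threshold` parameter, the toy English data, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_bigram_library(tokenized_texts):
    """Count bigrams (w1, w2) over continuous word sequences in the training set."""
    follow = defaultdict(Counter)
    for tokens in tokenized_texts:
        for w1, w2 in zip(tokens, tokens[1:]):
            follow[w1][w2] += 1
    return follow


def extend_features(tokens, follow, threshold=0.5):
    """Append words absent from the text whose estimated P(w2 | w1),
    for some word w1 present in the text, reaches the threshold.
    The threshold value is an illustrative assumption."""
    present = set(tokens)
    extended = list(tokens)
    for w1 in tokens:
        counts = follow.get(w1)
        if not counts:
            continue
        total = sum(counts.values())
        for w2, c in counts.items():
            if w2 not in present and c / total >= threshold:
                extended.append(w2)
                present.add(w2)
    return extended


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(tokens, model):
    """Return the class with the highest log posterior (add-one smoothing).
    Words outside the training vocabulary are ignored."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)
        for w in tokens:
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this sketch (not the Sina Weibo data set).
train_texts = [
    ["stock", "market", "falls"],
    ["stock", "market", "rises"],
    ["football", "match", "tonight"],
]
train_labels = ["finance", "finance", "sports"]

library = build_bigram_library(train_texts)
model = train_nb(train_texts, train_labels)

short_text = ["stock", "news"]
extended = extend_features(short_text, library)
print(extended)                     # ['stock', 'news', 'market']
print(predict_nb(extended, model))  # finance
```

In this sketch only the text being classified is extended; whether the training texts are extended as well, and how the probabilities are combined for n-grams longer than two words, is not stated in the abstract.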

Similar Documents

Chinese Text Classification Based on Extended Naïve Bayes Model with Weighed Positive Features

Research on short text classification for web forum

Short Text Feature Selection for Micro-Blog Mining

English and Taiwanese text categorization using N-gram based on Vector Space Model

A New Model for Chinese Short-text Classification Considering Feature Extension

Effect of different feature types on age based classification of short texts

Feature extraction based IP traffic classification using machine learning

A hybrid algorithm for text classification based on rough set

Internet news headlines classification method based on the N-Gram language model

Short Text Classification based on feature extension using The N-Gram model