NDLI: Structured entity identification and document categorization: two tasks with one joint model

Influence and correlation in social networks

Land cover change detection: a case study

Social networks: looking ahead

An inductive database prototype based on virtual mining views

Graph Mining and Graph Kernels

Internet advertising and optimal auction design

Efficient semi-streaming algorithms for local triangle counting in massive graphs

Identifying authoritative actors in question-answering forums: the case of Yahoo! answers

Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Mining Massive RFID, Trajectory, and Traffic Data Sets

Large scale data analysis and modelling in online services and advertising

Structured entity identification and document categorization: two tasks with one joint model

Context-aware query suggestion by mining click-through and session data

Using tagflake for condensing navigable tag hierarchies from tag clouds

Blogosphere: Research Issues, Applications, and Tools

Regularization paths and coordinate descent

Mining adaptively frequent closed unlabeled rooted trees in data streams

The persuasive phase of visualization

An integrated system for automatic customer satisfaction analysis in the services industry

Mining Uncertain and Probabilistic Data: problems, Challenges, Methods, and Applications

The future of image search

Effective label acquisition for collective classification

Detecting privacy leaks using corpus-based association rules

DiMaC: a disguised missing data cleaning tool

Genesis of postal address reading, current state and future prospects: thirty years of pattern recognition on duty of postal services

Topical query decomposition

Learning methods for lung tumor markerless gating in image-guided radiotherapy

Pattern-Miner: integrated management and mining over data mining models

Unsupervised feature selection for principal components analysis

Text classification, business intelligence, and interactivity: automating C-Sat analysis for services industry

CRO: a system for online review structurization

The cost of privacy: destruction of data-mining utility in anonymized data publishing

Data mining using high performance data clouds: experimental studies using sector and sphere

Morpheus: interactive exploration of subspace clustering

Generating succinct titles for web URLs

Automated cyclone discovery and tracking using knowledge sharing in multiple heterogeneous satellite data

A software system for buzz-based recommendations

Structured learning for non-smooth ranking losses

Spotting out emerging artists using geo-aware analysis of P2P query strings

Pictor: an interactive system for importing data from a website

Partitioned logistic regression for spam filtering

Customer targeting models using actively-selected web content

Learning subspace kernels for classification

Anticipating annotations and emerging trends in biomedical literature

Combinational collaborative filtering for personalized community recommendation

Temporal pattern discovery for trends and transient effects: its application to patient records

FAST: a roc-based feature selection metric for small samples and imbalanced data classification problems

Scalable and near real-time burst detection from eCommerce queries

Semi-supervised learning with data calibration for long-term time series forecasting

Identifying domain expertise of developers from source code

Reconstructing chemical reaction networks: data mining meets system identification

ArnetMiner: extraction and mining of academic social networks

Automatic record linkage using seeded nearest neighbour and support vector machine classification

Tagmark: reliable estimations of RFID tags for business processes

Feedback effects between similarity and social influence in online communities

Experimental comparison of scalable online ad serving

Anomaly pattern detection in categorical datasets

A visual-analytic toolkit for dynamic interaction graphs

Bypass rates: reducing query abandonment using negative inferences

Heterogeneous data fusion for alzheimer's disease study

De-duping URLs via rewrite rules

Privacy-preserving cox regression for survival analysis

Structured metric learning for high dimensional problems

Using predictive analysis to improve invoice-to-cash collection

Constraint programming for itemset mining

Learning from multi-topic web documents for contextual advertisement

Learning classifiers from only positive and unlabeled data

Locality sensitive hash functions based on concomitant rank order statistics

Direct mining of discriminative and essential frequent patterns via model-based search tree

Scaling up text classification for large file systems

SPIRAL: efficient and exact model identification for hidden Markov models

Using ghost edges for classification in sparsely labeled networks

Composition attacks and auxiliary information in data privacy

Entity categorization over large document collections

Knowledge transfer via multiple model local structure mapping

Banded structure in binary matrices

Quantitative evaluation of approximate frequent pattern mining algorithms

Unsupervised deduplication using cross-field dependencies

Permu-pattern: discovery of mutable permutation patterns with proximity constraint

Simultaneous tensor subspace selection and clustering: the equivalence of high order svd and k-means clustering

Bridging centrality: graph mining from element level to group level

Interpretable nonnegative matrix decompositions

Fast logistic regression for text categorization with variable-length n-grams

Probabilistic latent semantic visualization: topic model for visualizing documents

Automatic identification of quasi-experimental designs for discovering causal knowledge

Extracting shared subspace for multi-label classification

Mining preferences from superior and inferior examples

Effective and efficient itemset pattern summarization: regression-based approaches

A sequential dual method for large scale multi-class linear svms

Constructing comprehensive summaries of large event sequences

Factorization meets the neighborhood: a multifaceted collaborative filtering model

The structure of information pathways in a social communication network

Angle-based outlier detection in high-dimensional data

Stream prediction using a generative model based on frequent episodes in event sequences

Microscopic evolution of social networks

Cut-and-stitch: efficient parallel learning of linear dynamical systems on smps

Active learning with direct query construction

Spectral domain-transfer learning

Mining multi-faceted overviews of arbitrary topics in a text collection

Multi-class cost-sensitive boosting with p-norm loss functions

On updates that constrain the features' connections during learning

Weighted graphs and disconnected components: patterns and a generator

Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering

Joint latent topic models for text and citations

Classification with partial labels

Discrimination-aware data mining

Fast collapsed gibbs sampling for latent dirichlet allocation

Partial least squares regression for graph mining

Knowledge discovery of semantic relationships between words using nonparametric bayesian graph model

Mobile call graphs: beyond power-law and lognormal distributions

Efficient ticket routing by resolution sequence mining

Get another label? improving data quality and data mining using multiple, noisy labelers

iSAX: indexing and mining terabyte sized time series

Efficient computation of personal aggregate queries on blogs

Semi-supervised approach to rapid and reliable labeling of large data sets

Relational learning via collective matrix factorization

A bayesian mixture model with linear regression mixing proportions

Hypergraph spectral learning for multi-label classification

Community evolution in dynamic multi-mode networks

Colibri: fast mining of large static and dynamic graphs

Can complex network metrics predict the behavior of NBA teams?

Model-based document clustering with a collapsed gibbs sampler

Building semantic kernels for text classification using wikipedia

A unified approach for schema matching, coreference and canonicalization

Information extraction from Wikipedia: moving down the long tail

SAIL: summation-based incremental learning for information-theoretic clustering

Asymmetric support vector machines: low false-positive learning under the user tolerance

Succinct summarization of transactional databases: an overlapped hyperrectangle scheme

Anonymizing transaction databases for publication

Local peculiarity factor and its application in outlier detection

A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances

Training structural svms with kernels using sampled cuts

Stable feature selection via dense feature groups

Categorizing and mining concept drifting data streams

Fastanova: an efficient algorithm for genome-wide association study

Cuts3vm: a fast semi-supervised svm algorithm

Identifying biologically relevant genes via multiple heterogeneous data sources

Volatile correlation computation: a checkpoint view

Structured entity identification and document categorization: two tasks with one joint model

Content Provider	ACM Digital Library
Author	Godbole, Shantanu; Bhattacharya, Indrajit; Joshi, Sachindra
Abstract	Traditionally, research in identifying structured entities in documents has proceeded independently of document categorization research. In this paper, we observe that these two tasks have much to gain from each other. Apart from direct references to entities in a database, such as names of person entities, documents often also contain words that are correlated with discriminative entity attributes, such as age-group and income-level of persons. This happens naturally in many enterprise domains such as CRM, Banking, etc. Then, entity identification, which is typically vulnerable to noise and incompleteness in direct references to entities in documents, can benefit from document categorization with respect to such attributes. In return, entity identification enables documents to be categorized according to different label-sets arising from entity attributes without requiring any supervision. In this paper, we propose a probabilistic generative model for joint entity identification and document categorization. We show how the parameters of the model can be estimated using an EM algorithm in an unsupervised fashion. Using extensive experiments over real and semi-synthetic data, we demonstrate that the two tasks can benefit immensely from each other when performed jointly using the proposed model.
Starting Page	25
Ending Page	33
Page Count	9
File Format	PDF
ISBN	9781605581934
DOI	10.1145/1401890.1401899
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2008-08-24
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Probabilistic generative model; Entity identification; Document categorization
Content Type	Text
Resource Type	Article
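
The abstract describes a generative model fitted with EM in which entity identification and document categorization inform each other. The record gives no details of the authors' model, so the following is only a toy sketch of the general idea under simplifying assumptions that are not taken from the paper: each document has one latent entity, the entity's name matches the document's mention with a fixed probability `name_hit`, and words are drawn from a multinomial tied to the entity's attribute value. The function name `joint_em`, its parameters, and the example data are all hypothetical.

```python
import math

def joint_em(docs, entities, n_iter=20, name_hit=0.9, smooth=0.1):
    """Toy EM, not the paper's model.

    docs:     list of (mention, [words])
    entities: list of (name, attribute_value), e.g. rows of a customer table
    Returns (posterior over entities per document, word distribution per attribute).
    """
    vocab = sorted({w for _, words in docs for w in words})
    cats = sorted({c for _, c in entities})
    # Start with uniform word distributions; name evidence breaks the symmetry.
    word_prob = {c: {w: 1.0 / len(vocab) for w in vocab} for c in cats}
    post = []
    for _ in range(n_iter):
        # E-step: posterior over candidate entities for each document,
        # combining name evidence with attribute-word evidence.
        post = []
        for mention, words in docs:
            log_scores = []
            for name, cat in entities:
                s = math.log(name_hit if name == mention else 1.0 - name_hit)
                s += sum(math.log(word_prob[cat][w]) for w in words)
                log_scores.append(s)
            top = max(log_scores)
            weights = [math.exp(s - top) for s in log_scores]
            total = sum(weights)
            post.append([w / total for w in weights])
        # M-step: re-estimate word distributions from expected counts.
        counts = {c: {w: smooth for w in vocab} for c in cats}
        for (_, words), resp in zip(docs, post):
            for (_, cat), r in zip(entities, resp):
                for w in words:
                    counts[cat][w] += r
        for c in cats:
            norm = sum(counts[c].values())
            word_prob[c] = {w: counts[c][w] / norm for w in vocab}
    return post, word_prob

# Hypothetical data: two customers share a name but differ in age group.
entities = [("john smith", "young"), ("john smith", "senior"),
            ("mary jones", "young"), ("raj patel", "senior")]
docs = [("mary jones", ["student", "loan"]),
        ("raj patel", ["pension", "plan"]),
        ("john smith", ["pension", "plan"])]

post, word_prob = joint_em(docs, entities)
# Probability that the ambiguous third document belongs to the "senior" category.
senior_prob = sum(p for p, (_, c) in zip(post[2], entities) if c == "senior")
```

In this sketch the third document's mention is ambiguous between two database rows. The words learned from the unambiguous documents shift the posterior toward the senior customer, and the same posterior, summed by attribute value, yields a category for the document without any labeled training data.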
