NDLI: GeoHashViz: interactive analytics for mapping spatiotemporal diffusion of Twitter hashtags


Accelerating our understanding of supernova explosion mechanism via simulations and visualizations with GenASiS

Using Mozilla badges to certify XSEDE users and promote training

XSEDE value added, cost avoidance, and return on investment

In-core volume rendering for Cartesian grid fluid dynamics simulations

Paleoscape model of coastal South Africa during modern human origins: progress in scaling and coupling climate, vegetation, and agent-based models on XSEDE

Extending access to HPC skills through a blended online course

Overview of XSEDE-PRACE collaborative projects in 2014

Bring the NLACE model online using XSEDE and HUBzero

Cyberinfrastructure resources enabling creation of the loblolly pine reference transcriptome

Connecting the non-traditional user-community to the national CyberInfrastructure

Publishing and consuming GLUE v2.0 resource information in XSEDE

GeoHashViz: interactive analytics for mapping spatiotemporal diffusion of Twitter hashtags

NCBI-BLAST programs optimization on XSEDE resources for sustainable aquaculture

ISLET: an isolated, scalable, & lightweight environment for training

Market-based on demand scheduling (MBoDS) in co-operative grid environment

Autotuning OpenACC work distribution via direct search

FlowGate: towards extensible and scalable web-based flow cytometry data analysis

Science gateways for humanities, arts, and social science

A prototype sampling interface for PAPI

The CIPRES workbench: a flexible framework for creating science gateways

A scalable computational approach to political redistricting optimization

On fostering a culture of research cyberinfrastructure grant proposals within a community of service providers in an EPSCoR state

Optimizing codes on the Xeon Phi: a case-study with LAMMPS

Porting scientific libraries to PGAS in XSEDE resources: practice and experience

Discovering the influence of socioeconomic factors on online game behaviors

Incorporating interactive compute environments into web-based training materials using the Cornell job runner service

Jetstream: a self-provisioned, scalable science and engineering cloud environment

Enabling HPC simulation workflows for complex industrial flow problems

Performance examinations of multiple time-stepping algorithms on stampede supercomputer

TAS view of XSEDE users and usage

Bridges: a uniquely flexible HPC resource for new communities and data analytics

A performance predictor for UltraScan supercomputer calculations

Performance assessment of real-time estimation of continuous-time stochastic volatility of financial data on GPUs

Multidisciplinary research and education with open tools: metagenomic analysis of 16S rRNA using Arduino, Android, Mothur and XSEDE

Using data science to understand tape-based archive workloads

CDD: computational discovery desktop

Grouping game players using parallelized k-means on supercomputers

Storage utilization in the long tail of science

Leveraging DiaGrid hub for interactively generating and running parallel programs

The VAT: enhanced video analysis

Advanced user environment design and implementation on integrated multi-architecture supercomputers

Inversion of magnetotelluric data using integral equation approach with variable sensitivity domain: application to earthscope MT data

Integrating apache spark into PBS-Based HPC environments

A SIMD tabu search implementation for solving the quadratic assignment problem with GPU acceleration

GeoHashViz: interactive analytics for mapping spatiotemporal diffusion of Twitter hashtags

Content Provider	ACM Digital Library
Author	Parameswaran, Aditya; Wang, Shaowen; Soltani, Kiumars
Abstract	Since its birth in 2006, Twitter has evolved into a multi-purpose social media platform that attracts hundreds of millions of users who share their activities and ideas on a daily basis. The potential to capture fine-grained activity logs of users, combined with ever-increasing geographical information derived from GPS-enabled devices, has made Twitter data a valuable source for spatiotemporal analysis of human activities. One of Twitter's early innovations is the hashtag, a tagging mechanism that provides additional information about a user's post. Since their emergence in late 2007, hashtags have been used extensively to express ideas, group tweets, and report events among Twitter users. The increasing popularity of hashtags, together with their simple and concise structure, has inspired multiple recent studies to propose hashtags as a medium for assessing the diffusion of ideas in a virtual world. Studying the collective effort of users in making a hashtag go viral can shed light on the complex process of idea diffusion, which involves psychological, sociological, and geographical elements. Although most previous research on idea diffusion in the virtual world focuses purely on the users' social graph, recent studies have confirmed that the spatial relationship among users and regions also plays a crucial role in adoption patterns [1]. This echoes the First Law of Geography, formulated by Waldo Tobler more than 40 years ago: "everything is related to everything else, but near things are more related than distant things". However, existing interactive visual analytics frameworks for hashtag diffusion (http://keyhole.co/, http://hashtracking.com/, https://tagboard.com/) lack in-depth spatial analysis capabilities and are therefore not well suited to studying diffusion patterns. This research aims to fill this gap by providing an interactive framework that offers visual analytics on the geographical diffusion of hashtags over time.
Our framework, called GeoHashViz, provides both textual and visual analytics on the role of location in the adoption of hashtags and offers insights into diffusion patterns among different hashtags. GeoHashViz processes a large stream of incoming tweets using a Hadoop-based approach and calculates multiple measures that are used to generate visual analytics for the user. Furthermore, it integrates online maps with a live animation tool to visualize the spatial and temporal diffusion of hashtags at the same time.
Data Collection: We gather our data using the Twitter Streaming API (details in [3]). Since we are only interested in common hashtags that have reached a certain level of popularity, we keep only the hashtags with more than 1000 appearances. Our unit of spatial resolution is cities in the United States with a population larger than 60,000, which gives us 645 unique locations. These locations form our reference grid, and every geographical point is assigned to its nearest neighbor in the reference grid.
Analytics: To formulate the problem of spatiotemporal analysis of hashtag diffusion, we recognize two main categories: hashtag-based and location-based analytics. In hashtag-based analytics we focus on specific hashtags and their associated diffusion patterns. Location-based analytics, on the other hand, studies the similarity and closeness of locations in terms of their hashtag adoption. To evaluate the usability of the framework, we identify five core analytical features that cover a wide range of research questions; the framework can be easily extended to include more. The five visual analytical capabilities are listed in Table 1. Spread and focus points (locations with the highest occurrence of the hashtag [1]) provide users with a visual estimate of how the hashtag diffuses over time.
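The two data collection steps described above, the popularity filter and the assignment of each point to its nearest neighbor in the reference grid, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the three-city grid, its approximate coordinates, and the function names are hypothetical, and a linear scan with great-circle distance stands in for whatever nearest-neighbor search the framework actually uses.

```python
import math
from collections import Counter

# Hypothetical reference grid: (city name, latitude, longitude) in degrees.
# Coordinates are approximate and for illustration only; the real grid
# described in the abstract has 645 locations.
REFERENCE_GRID = [
    ("Los Angeles", 34.05, -118.24),
    ("Chicago", 41.88, -87.63),
    ("New York", 40.71, -74.01),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_grid_point(lat, lon, grid=REFERENCE_GRID):
    """Return the name of the grid city closest to (lat, lon)."""
    closest = min(grid, key=lambda city: haversine_km(lat, lon, city[1], city[2]))
    return closest[0]


def popular_hashtags(hashtags, min_count=1000):
    """Keep only hashtags that appear more than min_count times."""
    counts = Counter(hashtags)
    return {tag for tag, n in counts.items() if n > min_count}
```

A linear scan costs one distance computation per grid point for every tweet, which is acceptable for a sketch but is the kind of cost a spatial index is meant to avoid at scale.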
In addition, we provide four metrics that give the user a more concrete sense of the diffusion patterns: a) Entropy: measures the randomness of the hashtag distribution [1]; b) KL-divergence: compares the geographical distribution of the hashtag in consecutive time windows; c) Spatial Dispersion: measures how scattered the hashtag is around its geographical midpoint; d) Count: plots the cumulative count of the hashtag over time. For location-based analytics we include two functions. Top-k hashtags calculates the most popular hashtags in a region and visualizes them using a word cloud. However, by simply looking at the counts, we may miss some locally significant hashtags because of their relatively low counts. To reduce the dominance of globally popular hashtags, we introduce another analytic that visualizes the top-k locally significant hashtags. This analytic uses a TF-IDF-like metric [5] to measure the local popularity of a hashtag in a specific region, assigning a lower rank to hashtags that are popular in other places as well. We also provide two metrics for comparing two regions in terms of hashtag adoption: a) Jaccard Similarity: compares the sets of hashtags used in the two regions, with a higher value assigned to more similar regions; b) Adoption Lag: measures how long it takes for a hashtag to travel between the two regions, by averaging the time difference between the first appearances of hashtags in the two regions.
Architecture: The GeoHashViz framework follows a two-layer architecture: an offline-processing module and an interactive module. The offline-processing module, implemented entirely in Apache Hadoop and invoked periodically, processes the raw data and pre-computes measures related to the spatiotemporal diffusion of hashtags. The interactive module, on the other hand, is invoked on demand in response to user requests. The two modules communicate through a distributed MongoDB database.
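Four of the metrics named above (entropy, KL-divergence, Jaccard similarity, and adoption lag) can be sketched in a few lines of Python. The abstract does not give exact formulas, so the choices here are assumptions: base-2 logarithms, additive smoothing (`eps`) so that KL-divergence stays finite when a location is missing from one window, and the mean absolute gap between first appearances as the adoption lag. All function and parameter names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of a hashtag's counts over locations."""
    total = sum(counts.values())
    probs = [n / total for n in counts.values() if n > 0]
    return sum(-p * math.log2(p) for p in probs)


def kl_divergence(curr, prev, eps=1e-9):
    """KL(curr || prev) in bits between two location-count windows.

    eps is an assumed additive smoothing term; it keeps the ratio
    finite for locations that appear in only one of the windows.
    """
    locations = set(curr) | set(prev)
    curr_total = sum(curr.values()) + eps * len(locations)
    prev_total = sum(prev.values()) + eps * len(locations)
    kl = 0.0
    for loc in locations:
        p = (curr.get(loc, 0) + eps) / curr_total
        q = (prev.get(loc, 0) + eps) / prev_total
        kl += p * math.log2(p / q)
    return kl


def jaccard(tags_a, tags_b):
    """Jaccard similarity of the hashtag sets used in two regions."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adoption_lag(first_seen_a, first_seen_b):
    """Mean absolute gap between first appearances of shared hashtags.

    Each argument maps hashtag -> first-appearance timestamp in a region.
    Returns None when the regions share no hashtags.
    """
    shared = set(first_seen_a) & set(first_seen_b)
    if not shared:
        return None
    gaps = [abs(first_seen_a[t] - first_seen_b[t]) for t in shared]
    return sum(gaps) / len(gaps)
```

Under these definitions, a hashtag spread evenly over many locations has high entropy, while one concentrated in a single location has entropy zero; a large KL-divergence between consecutive windows signals that the geographical distribution shifted.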
The two-layer architecture keeps the final framework fast and interactive by reducing the data processing that the interactive module is required to do. In the offline-processing module, significant hashtags are extracted and the points are placed on the geographical mesh defined above. Then two MapReduce jobs are executed: one pre-computes measures for hashtag-based analytics and the other for location-based analytics. All Hadoop experiments were conducted on the XSEDE Gordon Hadoop cluster. The data-intensive nature of our problem, which requires aggregating a large number of tweets by both hashtag and location, makes Hadoop an ideal choice for the offline-processing module. Using Hadoop, we distribute the tweets across multiple nodes and then take advantage of the MapReduce model to aggregate them by their associated location on the mesh and the hashtags they include. In the reduce step, with access to all the tweets for a given location/hashtag, we can generate the analytics for different timestamps. In addition, since the nodes on the Gordon Hadoop cluster have relatively large memory, we are able to store the geographical mesh in memory and quickly map the location of each user to its closest point on the mesh (using a kd-tree). The same technique is employed in the interactive module to find the set of mesh points that lie within the user-defined bounding box. The interactive module includes a web application and a Java Servlet. The web application is integrated into the CyberGIS Gateway [2] to improve usability and ease integration with other CyberGIS applications. Figure 1 shows a view of the application visualizing the top 20 hashtags in the southern California region in September 2014.
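The aggregation pattern described above, grouping tweets by (location, hashtag) and then producing per-time-window results in the reduce step, can be illustrated with an in-memory Python sketch of the map, shuffle, and reduce phases. This is not Hadoop code and not the authors' job definition: the tweet record layout, the one-hour window, and all function names are assumptions, and the bounding-box lookup uses a plain linear scan where the framework uses a kd-tree.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map step: emit ((location, hashtag), timestamp) per hashtag."""
    for tag in tweet["hashtags"]:
        yield (tweet["location"], tag), tweet["timestamp"]


def shuffle(pairs):
    """Group emitted values by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(timestamps, window=3600):
    """Reduce step: count occurrences per time window (seconds)."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[ts // window] += 1
    return dict(counts)


def run_job(tweets, window=3600):
    """Run map, shuffle, and reduce over an in-memory list of tweets."""
    pairs = (kv for tweet in tweets for kv in map_tweet(tweet))
    return {key: reduce_counts(values, window)
            for key, values in shuffle(pairs).items()}


def points_in_bbox(mesh, south, west, north, east):
    """Return names of mesh points inside a bounding box.

    Linear scan for illustration; it does not handle boxes that
    cross the antimeridian.
    """
    return [name for name, lat, lon in mesh
            if south <= lat <= north and west <= lon <= east]
```

For example, `run_job` applied to tweets whose `location` field already holds a mesh point returns a dictionary keyed by `(location, hashtag)`, with each value mapping a window index to a count, which is the kind of pre-computed measure an interactive layer could read back without touching the raw tweets.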
Starting Page	1
Ending Page	2
Page Count	2
File Format	PDF
ISBN	9781450337205
DOI	10.1145/2792745.2792782
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2015-07-26
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Interactive visualization; Social media; Hadoop; CyberGIS; GeoHashViz
Content Type	Text
Resource Type	Article
