Infectious Disease Modelling and Surveillance through Unstructured Twitter Data
Jawaharlal Nehru University, New Delhi
More than 1.6 billion people use the internet worldwide, and 64 percent of them use social media services. Facebook and Twitter have high penetration rates of 33 percent and 24 percent respectively (fig. 1), and their large volume of user-generated content is valuable for behavioural studies. Smartphones have further generated huge volumes of unstructured data, producing an exponentially growing wealth of digital footprints that are replicable and transparent and that are revolutionizing disease surveillance, e-health applications and epidemiological studies. This study extracts unstructured medical tweets from Twitter to understand tweeting patterns associated with coronary disease, to develop hypotheses, and to understand medical causality and behaviour. Fifteen closely related keywords were identified (fig. 2). Keywords that generated noisy and irrelevant tweets, such as "Stroke", "Piles" and "Heart", were rejected. A sample of 101,815 tweets was streamed from the Twitter API using Python, and unbiased, positively correlated data were selected for the study. Of these tweets, 60.7 percent contained the keyword "Heart Attack", followed by 13.5 percent for "Heart Disease" (fig. 3). Interestingly, tweets from India showed a higher response for the keyword "Heart Disease", which accounted for 65.6% of inputs (fig. 4).
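The keyword shares reported above can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the keyword list here is a hypothetical subset (the abstract names only a few of its fifteen keywords), the function name `keyword_shares` is invented for illustration, and the rule of assigning each tweet to its first matching keyword is an assumption.

```python
import re
from collections import Counter

# Hypothetical subset of the study's keyword list (assumption: the full
# list of fifteen keywords is not given in the abstract).
KEYWORDS = ["heart attack", "heart disease", "sclerosis"]


def keyword_shares(tweets, keywords=KEYWORDS):
    """Return {keyword: percent of matched tweets}.

    Matching is case-insensitive and on whole phrases, so a bare
    "heart" (rejected as noisy in the study) does not count.
    Assumption: each tweet is attributed to its first matching keyword.
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    counts = Counter()
    for text in tweets:
        for kw, pattern in patterns.items():
            if pattern.search(text):
                counts[kw] += 1
                break  # count each tweet once
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kw: 100.0 * counts[kw] / total for kw in keywords}
```

Tweets that match no keyword are excluded from the denominator, so the shares describe only the retained sample.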
A comparative analysis of tweeting patterns in a developed country (the United States), a developing country (India) and the rest of the world showed that "Heart Attack" is the predominant keyword in the US with 59.1% of tweet responses, followed by "Heart Disease" with 11.3% (fig. 4). In the rest of the world, "Heart Attack" is also the predominant keyword with 64.8% of tweet responses, followed by "Heart Disease" with 8.84% (fig. 4). Comparing the three observations (fig. 5) shows that keyword usage varies when users address coronary heart ailments. The keyword "Sclerosis" is more popular in the US than in the rest of the world. A date-wise comparison of the third week of February and the third week of March showed a slight variation in Twitter trends: the share of "Heart Attack" tweets fell from 61% in February to 47% in March (fig. 2), indicating a shift in tweeting preference.
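A regional comparison of this kind can be sketched as a per-region tally of keyword shares. The sketch below assumes each tweet has already been labelled with a region and a matched keyword. The function names `shares_by_region` and `predominant_keyword` are hypothetical, and how the study assigned tweets to countries is not stated in the abstract.

```python
from collections import Counter, defaultdict


def shares_by_region(records):
    """Compute keyword shares within each region.

    records: iterable of (region, keyword) pairs, one per tweet
    (assumption: region and keyword labels are already available).
    Returns {region: {keyword: percent of that region's tweets}}.
    """
    counts = defaultdict(Counter)
    for region, keyword in records:
        counts[region][keyword] += 1
    result = {}
    for region, keyword_counts in counts.items():
        total = sum(keyword_counts.values())
        result[region] = {
            kw: round(100.0 * n / total, 1)
            for kw, n in keyword_counts.items()
        }
    return result


def predominant_keyword(shares):
    """Return {region: keyword with the largest share in that region}."""
    return {region: max(kws, key=kws.get) for region, kws in shares.items()}
```

Normalizing within each region, rather than over the whole sample, keeps regions with very different tweet volumes comparable.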
However, response patterns and response categorization need to be explored further. Validation, integration, data volatility and bias remain key issues that need to be addressed. Visual Basic and Python were used for data extraction and statistical analysis. Significant variation in tweeting patterns was identified between developed and developing countries. Gender-specific writing patterns and date-wise variation for unique tweets, multiple tweets and retweets need to be explored. Bigger datasets need to be studied to develop stronger correlations, accurate forecasts and valid hypotheses. Lastly, the step forward is to identify modifiable risk factors for various infectious diseases by generating data dictionaries and mapping them to the Metathesaurus of the Unified Medical Language System.
Tetyana I. Vasylyeva, Samuel R. Friedman. "Towards a socio-molecular era for public health". Infection, Genetics and Evolution 46 (2016) 248–255.
Jennifer L. Gardy, James C. Johnston. N Engl J Med 364 (2011) 730–739.
Young SD. "A 'big data' approach to HIV epidemiology and prevention". Preventive Medicine 70 (2015).