Use R Markdown to do the following tasks. (2) Please note that both the presentation of the document and the range of R Markdown features/functions used matter.

1. Describe a real-world application that uses topic modelling and explain how the topic model works. (4)

An online shopping recommendation system, such as the one at JD (a Chinese online retailer), could be built on topic modelling. A topic model such as LDA treats each document (here, an item description) as a mixture of topics and each topic as a distribution over words, so every item can be summarised by its topic distribution. The key task is then to calculate the similarity between items, which can be done in three steps.

  1. Calculate the topic distribution of each item.
  2. Calculate the similarity between the item you are viewing and the other items based on their topic distributions (Kullback-Leibler divergence could be used, where a smaller divergence means a higher similarity).
  3. Items with a higher similarity get a higher chance of being recommended.

Once the system has your viewing history, it can recommend similar goods to you.
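The similarity step can be sketched in base R as follows. This is only an illustration: the topic distributions in `topic_dist`, the item names, and the helper `kl_divergence()` are made up for the example and do not come from a fitted model or from any real recommendation system.

```r
# Hypothetical topic distributions (each row sums to 1) over three topics.
topic_dist <- rbind(
    viewed = c(0.70, 0.20, 0.10),
    item_a = c(0.65, 0.25, 0.10),
    item_b = c(0.10, 0.30, 0.60),
    item_c = c(0.30, 0.40, 0.30)
)

# Kullback-Leibler divergence KL(p || q); eps guards against log(0).
kl_divergence <- function(p, q, eps = 1e-12) {
    p <- (p + eps) / sum(p + eps)
    q <- (q + eps) / sum(q + eps)
    sum(p * log(p / q))
}

# Divergence from the viewed item to every candidate item.
candidates <- setdiff(rownames(topic_dist), "viewed")
divergence <- sapply(candidates, function(item) {
    kl_divergence(topic_dist["viewed", ], topic_dist[item, ])
})

# A smaller divergence means a more similar item, so sort in ascending order.
ranking <- sort(divergence)
ranking
```

With these made-up numbers, `item_a` comes first in `ranking` because its topic distribution is closest to that of the viewed item. Note that KL divergence is not symmetric, so a real system might prefer a symmetric measure such as Jensen-Shannon divergence.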

2. Download the Twitter dataset (rdmTweets-201306.RData) from the course website and do the following. (8)

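# Install any packages that are not yet available
if(!require(twitteR))
    install.packages("twitteR")
if(!require(tm))
    install.packages("tm")
if(!require(wordcloud))
    install.packages("wordcloud")

# Load the packages; twitteR defines the status class of the stored tweets
library(twitteR)
library(tm)
library(wordcloud)
library(tidyverse)  # also attaches dplyr and stringr

# Load the tweet data from the .RData file
load('rdmTweets-201306.RData')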

a. Text cleaning: remove URLs, convert to lower case, and remove anything other than English letters or spaces.

text_clean <- function(x) {
    text <- x$text
    # remove URLs that start with http://, https:// or www.
    text <- gsub("https?://\\S+|www\\.\\S+", "", text)
    # remove URLs whose host starts with www plus letters or digits (e.g. www2.)
    text <- gsub("\\bwww[0-9a-zA-Z]*\\.\\S+", "", text)
    # replace anything other than English letters or spaces with a space,
    # collapse repeated whitespace, and convert to lower case
    text <- tolower(gsub("\\s+", " ", gsub("[^a-zA-Z ]", " ", text)))
    # remove the spaces at the beginning and the end
    trimws(text)
}
tweet_texts <- unlist(lapply(tweets, text_clean))
tweet_texts
##   [1] "examples on calling java code from r"                                                                                                      
##   [2] "simulating map reduce in r for big data analysis using flights data via rbloggers"                                                         
##   [3] "job opportunity senior analyst big data at wesfarmers industrial amp safety sydney area australia jobs"                                    
##   [4] "clavin an open source software package for document geotagging and geoparsing"                                                             
##   [5] "an online book on natural language processing with python"                                                                                 
##   [6] "tips for r programming provided on biostat wiki"                                                                                           
##   [7] "introduction to r for data mining a one hour video by revolution analytics"                                                                
##   [8] "the geonames geographical database geocode for countries cities suburbs places and postcodes"                                              
##   [9] "result of kdnuggets annual software poll on software used in the past months for real projects via kdnuggets"                              
##  [10] "big geo data visualisations"                                                                                                               
##  [11] "mapping the gdelt data in r"                                                                                                               
##  [12] "airport airline and route data containing airports airlines and routes spanning the globe"                                                 
##  [13] "how to map connections with great circles"                                                                                                 
##  [14] "gdelt global data on events location and tone containing over million geolocated events for to present"                                    
##  [15] "nd cfp ausdm industry track submission due july"                                                                                           
##  [16] "poll hosted by kdnuggets predictive analytics big data data mining data science software used"                                             
##  [17] "govhack on saturday amp sunday june all over australia"                                                                                    
##  [18] "slides on large scale network analysis with packge igraph"                                                                                 
##  [19] "r notes"                                                                                                                                   
##  [20] "an r example on word clouds for document comparison using comparison cloud and commonality cloud"                                          
##  [21] "social network analysis labs in r and sonia heaps of r code for sna"                                                                       
##  [22] "microsoft looking for data scientists for a brand new initiative redmond seattle us"                                                       
##  [23] "five hour lecture videos on statistical aspects of data mining with r"                                                                     
##  [24] "church or beer an interesting example on twitter data analysis"                                                                            
##  [25] "introduction to data science a free online course on coursera already started on may st"                                                   
##  [26] "breaking bin laden visualizing the power of a single tweet"                                                                                
##  [27] "usamaf are the slides of your talk on big data in australia available online now and if yes could you please send me a link thanks"        
##  [28] "see a list of interesting software tools functions for high dimensional data and genomics developed by jeff leek at"                       
##  [29] "see rdatamining follower map at and see how to produce a twitter follower map at"                                                          
##  [30] "data and computing fundamentals with r"                                                                                                    
##  [31] "moving beyond the hype of big data oracle s big data events in australia mid may"                                                          
##  [32] "examples of error handling in r"                                                                                                           
##  [33] "director business and statistical analysis spartanburg sc us"                                                                              
##  [34] "data scientist agl melbourne australia"                                                                                                    
##  [35] "fore cast ing prin ci ples and prac tice a free online text book on fore casting with r"                                                   
##  [36] "graphical models with r a pdf document"                                                                                                    
##  [37] "a quick introduction to ggplot"                                                                                                            
##  [38] "usamaf thanks for your fantastic talk on big data in canberra are the slides available on web or could you send me a copy thanks"          
##  [39] "r v was released today rstudio needs updating to latest version as well to make them work together"                                        
##  [40] "top linkedin groups for analytics big data data mining and data science via kdnuggets"                                                     
##  [41] "cfp the th australasian data mining conference ausdm submission due july"                                                                  
##  [42] "big data canberra forum am pm april no charge"                                                                                             
##  [43] "job opportunity data scientist at civitas learning austin texas area jobs"                                                                 
##  [44] "lecture videos of course data analysis with r now on youtube"                                                                              
##  [45] "examples on parallel computing in r with the multicore package"                                                                            
##  [46] "join the ausdm group on linkedin at a group for the ausdm conference amp a forum for australia wide data miners"                           
##  [47] "call for participation dmapps a workshop on data mining applications in industry amp government see program at"                            
##  [48] "free online course on social network analysis starting now"                                                                                
##  [49] "free e book on data science with r"                                                                                                        
##  [50] "knitr a general purpose tool for dynamic report generation in r"                                                                           
##  [51] "documenting with knitr by graham williams"                                                                                                 
##  [52] "slides on rstudio s shiny framework to create interactive web applications in r"                                                           
##  [53] "vacancy of big data analytics researcher at adobe san jose usa"                                                                            
##  [54] "a funny story about modelling with a jar of ants have fun amp re think about modelling see follow the ants to richness"                    
##  [55] "anshumansaini you can find a list of online documents on r and data mining at"                                                             
##  [56] "researcher position in machine learning at nicta australia"                                                                                
##  [57] "presentation on using r with hadoop"                                                                                                       
##  [58] "slides for book data mining concepts and techniques rd edition by jiawei han micheline kamber amp jian pei"                                
##  [59] "vacancies at ibm research india"                                                                                                           
##  [60] "new book r and data mining examples and case studies elsevier dec pages r code and data provided"                                          
##  [61] "social network analysis a free online course starting on jan th"                                                                           
##  [62] "data analysis with r a free online course starting on jan nd"                                                                              
##  [63] "see sample pages of r and data mining book on google books"                                                                                
##  [64] "data used in book r and data mining examples and case studies can now be downloaded at"                                                    
##  [65] "r code for book r and data mining examples and case studies is now available at"                                                           
##  [66] "a simple example of topic modelling with lda using r packages rtexttools and topicmodels"                                                  
##  [67] "nd cfp dmapps to be held in conjunction with pakdd in gold coast australia submission due jan"                                             
##  [68] "a video from a talk on dynamic and correlated topic models applied to the journal science"                                                 
##  [69] "slides for a recent tutorial about topic modeling by david m blei at machine learning summer school"                                       
##  [70] "faculty position in data mining at mcgill university canada"                                                                               
##  [71] "network of recent world leader meetings via recordedfuture"                                                                                
##  [72] "could data scientist be your next job"                                                                                                     
##  [73] "postdoc in web data analytics delft"                                                                                                       
##  [74] "call for participation ausdm sydney dec see conference program at"                                                                         
##  [75] "mr prolixic do you mean to save all tweets into a txt file write them into csv write csv df a txt or write csv df text a txt"              
##  [76] "easy interactive web applications in r with the shiny package"                                                                             
##  [77] "see a list of chapters to be included in book data mining applications with r at to be published in mid"                                   
##  [78] "senior principal researcher positions with data insight nokia sunnyvale"                                                                   
##  [79] "dmapps to be held in conjunction with pakdd submission due by january"                                                                     
##  [80] "cfp dmapps workshop on data mining applications in industry and government submission due by dec"                                          
##  [81] "youtube playlist for course computing for data analysis with r week week"                                                                  
##  [82] "youtube playlist for course computing for data analysis with r week week"                                                                  
##  [83] "bioinformatics faculty position at the university of kansas"                                                                               
##  [84] "junior researcher in mobile data analytics create net italy"                                                                               
##  [85] "center director position on data analytics university of washington tacoma"                                                                
##  [86] "creating r packages a tutorial"                                                                                                            
##  [87] "slides on building r packages and"                                                                                                         
##  [88] "the economist nielsen data visualization challenge"                                                                                        
##  [89] "tenure track position in cis at the university of alabama at birmingham"                                                                   
##  [90] "postdoctoral fellow in biocomputing san francisco state university"                                                                        
##  [91] "tenure track research position in data and knowledge management at fbk trento italy"                                                       
##  [92] "tenure track faculty position at the emory university"                                                                                     
##  [93] "three year postdoc position in pervasive computing csiro australia"                                                                        
##  [94] "assistant professor at the eindhoven university of technology the netherlands"                                                             
##  [95] "research associate in smart harvesting for social science open access literature and research data project germany"                        
##  [96] "google s r style guide"                                                                                                                    
##  [97] "introduction to survival analysis"                                                                                                         
##  [98] "handling large data with package ff and ffbase"                                                                                            
##  [99] "join the rdatamining project to build a comprehensive r package for data mining"                                                           
## [100] "research scientist in text mining and graph analysis university of california santa barbara"                                               
## [101] "postdoc and phd positions in computational intelligence at disi univ of genoa italy"                                                       
## [102] "slides on big data in r"                                                                                                                   
## [103] "slides on getting started with r and hadoop"                                                                                               
## [104] "junior professorship in database systems the university of konstanz germany"                                                               
## [105] "data analytics position in melbourne"                                                                                                      
## [106] "postdoctoral fellowship in text mining modeling and prototyping university of alberta canada"                                              
## [107] "positions available in area of applied machine learning and data mining in xerox research webster"                                         
## [108] "postdoc position in healthcare analytics in ibm t j watson research center"                                                                
## [109] "a little book of r for time series"                                                                                                        
## [110] "slides and exercises on efficient r programming"                                                                                           
## [111] "cfp ausdm deadline extended to august"                                                                                                     
## [112] "example of profiling r code to find out the time consumption of your r codes"                                                              
## [113] "open postdoc research assistant positions at nanyang technological university singapore"                                                   
## [114] "phnk sorry i donot have a backup you can try to google for it"                                                                             
## [115] "which data mining technique do you use for prediction classification in business applications please vote at"                              
## [116] "r is reported as now being used by about half of all data miners in the data miners survey by rexer analytics"                             
## [117] "cookbook for r"                                                                                                                            
## [118] "slides for a couple of r short courses"                                                                                                    
## [119] "fast nearest neighbour search in o n log n time with the rann package in r"                                                                
## [120] "research staff member opening at ibm t j watson research center"                                                                           
## [121] "free online course on computing for data analysis with r to start on sept"                                                                 
## [122] "sentiment analysis and subjectivity a book chapter by bing liu"                                                                            
## [123] "a taste of sentiment analysis page slides in pdf format"                                                                                   
## [124] "post doctoral position in text mining and dynamic social media analysis university of southern california usc"                             
## [125] "examples and resources on association rule mining with r"                                                                                  
## [126] "package gwidgets provides a toolkit independent api for building interactive guis see an example at"                                       
## [127] "an example on using package ggmap to convert addresses to geo codes grab a map from google and then plot them on a map"                    
## [128] "deadline of ausdm paper submission is august midnight pst which is days to go"                                                             
## [129] "data mining in excel lecture notes and cases"                                                                                              
## [130] "postdoctoral research associate in cloud and big data at university of southern california"                                                
## [131] "nbclust an r package providing indices for cluster validation and determining the number of clusters"                                      
## [132] "an introduction of graphical user interfaces for r"                                                                                        
## [133] "research scientist position in data and knowledge management at inesc id portugal"                                                         
## [134] "open position big data full time research associate lecturer jacobs university germany"                                                    
## [135] "a tutorial on outlier detection techniques at acm sigkdd"                                                                                  
## [136] "an example of association rule mining with r incl association mining pruning redundant rules and rule visualization"                       
## [137] "an example of outlier detection with the lof local outlier factor algorithm in r"                                                          
## [138] "an example on sentiment analysis with r tracking us sentiments over time in wikileaks"                                                     
## [139] "r ranked in a kdnuggets poll on top analytics data mining software used see details at"                                                    
## [140] "postdoc position on social data management university of bologna italy"                                                                    
## [141] "postdoc research engineer positions on social network data management and processing korea"                                                
## [142] "position in social media analysis research at university of southern california"                                                           
## [143] "postdoctoral fellow position in biocomputing san francisco state university"                                                               
## [144] "pdf slides and r code examples on data mining and exploration"                                                                             
## [145] "job opportunity lecturer senior lecturer business amp social analytics at university of technology sydney"                                 
## [146] "full professor of data and knowledge engineering vienna university of economics and business"                                              
## [147] "see r code examples on association rules mining pruning redundant rules and visualizing association rules in the latest version of my book"
## [148] "a chapter on association rule mining with r has been added to my draft book r and data mining examples and studies"                        
## [149] "examples on outlier detection with r have been added to my draft book check it out at"                                                     
## [150] "cfp the th australasian data mining conference ausdm"                                                                                      
## [151] "will be the beginning of the end for sas and spss a demonstration with time series forecasting in r"                                       
## [152] "an example of social network analysis with r using package igraph"                                                                         
## [153] "job on artificial intelligence and big data mining at new research lab of huawei hong kong"                                                
## [154] "research positions on multimedia search and big data analysis microsoft research asia beijing china"                                       
## [155] "my book in draft titled r and data mining examples and case studies is now on cran check it out at"                                        
## [156] "a simple example of parallel computing on a windows and also mac machine"                                                                  
## [157] "a presentation on taking r to the limit part ii large datasets in r"                                                                       
## [158] "a presentation on taking r to the limit part i parallelization in r"                                                                       
## [159] "a predictive modeling contest maximizing loan potential details at"                                                                        
## [160] "post doc researcher position in spatial database and data mining ibm t j watson research center"                                           
## [161] "vacancy of data scientist data miner for npario a big data start up"                                                                       
## [162] "pairachchamp yes as long as it is a real world application on data mining with r each chapter is expected to be about pages thanks"        
## [163] "second round of call for chapter proposals for book data mining applications with r due by may details at"                                 
## [164] "slides on graph mining laws generators and tools"                                                                                          
## [165] "slides for tutorial on tools for large graph mining structure and diffusion at www"                                                        
## [166] "pdf slides titled r you ready which is a nice introduction of r"                                                                           
## [167] "r tips quick examples for r programming"                                                                                                   
## [168] "postdoc research scientist position on big data at mit"                                                                                    
## [169] "research scientist position for privacy preserving data publishing singapore"                                                              
## [170] "easier parallel computing in r with snowfall and sfcluster"                                                                                
## [171] "tutorial parallel computing using r package snowfall"                                                                                      
## [172] "handling big data interacting with data using the filehash package for r"                                                                  
## [173] "parallel computing with r using snow and snowfall"                                                                                         
## [174] "state of the art in parallel computing with r"                                                                                             
## [175] "slides on parallel computing in r"                                                                                                         
## [176] "r with high performance computing parallel processing and large memory"                                                                    
## [177] "high performance computing with r"                                                                                                         
## [178] "slides on massive data shared and distributed memory and concurrent programming bigmemory and foreach"                                     
## [179] "the r reference card for data mining is updated with functions packages for handling big data parallel computing"                          
## [180] "post doc on optimizing a cloud for data mining primitives inria france"                                                                    
## [181] "chief scientist data intensive analytics pacific northwest national laboratory pnnl us"                                                    
## [182] "top in data mining"                                                                                                                        
## [183] "a statnet tutorial for social network analysis"                                                                                            
## [184] "obama administration unveiled a big data research and development initiative with million"                                                 
## [185] "tutorials on using statnet for network analysis"                                                                                           
## [186] "slides on social network analysis with r"                                                                                                  
## [187] "a detailed introduction to social network analysis with sna"                                                                               
## [188] "social network analysis in r"                                                                                                              
## [189] "r for networks a short tutorial"                                                                                                           
## [190] "research fellow positions on graph spatial and temporal databases unsw sydney australia"                                                   
## [191] "post doc positions at ntu singapore on distributed systems and social network analysis"                                                    
## [192] "r code for community detection"                                                                                                            
## [193] "post doctoral research associate on social media analytics university of southern california"                                              
## [194] "postdoc position in social network analysis tartu estonia"                                                                                 
## [195] "a presentation on distributed data analysis with hadoop and r"                                                                             
## [196] "tutorial on network analysis with package igraph"                                                                                          
## [197] "examples on r for social network analysis"                                                                                                 
## [198] "an online textbook on introduction to social network"                                                                                      
## [199] "several positions of data analysts in australian public sector"                                                                            
## [200] "tutorial on mapreduce programming in r with package rmr"                                                                                   
## [201] "lecturer in statistics at the swinburne university of technology melbourne australia"                                                      
## [202] "call for reviewers data mining applications with r pls contact me if you have experience on the topic see details at"                      
## [203] "several functions for evaluating performance of classification models added to r reference card for data mining"                           
## [204] "examples on visualizing classifier performance with package rocr"                                                                          
## [205] "postdoc on data stream analysis university of washington tacoma"                                                                           
## [206] "senior lecturer position in informatics university of skovde sweden"                                                                       
## [207] "tenured faculty positions at imt institute for advanced studies lucca italy"                                                               
## [208] "a short recipe for fitting random forests and computing variable importance measures with r see the last page of"                          
## [209] "call for chapters data mining applications with r an edited book to be published by elsevier proposal due april"                           
## [210] "pdf slides on process modelling and mining of event logs"                                                                                  
## [211] "research positions at nec labs america"                                                                                                    
## [212] "postdoc position data mining in the cloud montpellier france"                                                                              
## [213] "join our discussions on predictive modelling in business applications and learn from other data miners experience"                         
## [214] "outstanding data visualization examples"                                                                                                   
## [215] "join our discussion on techniques and result evaluation for outlier detection"                                                             
## [216] "my edited book titled post mining of association rules techniques for effective knowledge extraction"                                      
## [217] "vacancy of data mining analyst at microsoft search technology center asia stca beijing china"                                              
## [218] "sub domains group group of two rdatamining groups have been swapped use group rdatamining com to visit the group on linkedin"              
## [219] "an introduction to random forest and using it for classification clustering and outlier detection"                                         
## [220] "post doctoral researcher at the university of glasgow scotland uk"                                                                         
## [221] "research engineer web and data applications yahoo research barcelona spain"                                                                
## [222] "some r functions and packages for outlier detection have been added to r reference card for data mining at"                                
## [223] "pranzini check quick r for advanced statistics for examples documents"                                                                     
## [224] "postdoctoral fellow in computer science with specialization in database technology"                                                        
## [225] "research intern postdoctoral researcher positions in data insight team nokia research center north america"                                
## [226] "postdoctoral researcher on data mining sponsered by us dod"                                                                                
## [227] "outlier detection with lofactor in package dmwr or dprep and with lof in package rlof for a parallel implementation"                       
## [228] "tutorial on data mining algorithms by ian witten"                                                                                          
## [229] "an introduction to text mining by ian witten"                                                                                              
## [230] "postdoctoral fellowship in data mining information retrieval and databases"                                                                
## [231] "videos of presentations at melbourne r user group on youtube"                                                                              
## [232] "data mining job opening toronto canada"                                                                                                    
## [233] "a prize of for a data mining competition to improve healthcare"                                                                            
## [234] "statistics with r a great r graphics stats website also available as a pdf file"                                                           
## [235] "a nice short article on memory in r"                                                                                                       
## [236] "vacancy of research scientist natural language processing social network analysis csiro australia"                                         
## [237] "data mining job opening toronto canada"                                                                                                    
## [238] "sa nkar user info can be get with getuser in package twitter and the numbers of followers and followees are followerscount friendscount"   
## [239] "a vacancy of bioinformatics analyst at csiro australia"                                                                                    
## [240] "an r example on using text mining to find out what rdatamining tweets are about"                                                           
## [241] "slides on tidy data tidy tools with r examples by hadley wickham"                                                                          
## [242] "help stemming and stem completion with package tm in r"                                                                                    
## [243] "text mining tutorial"                                                                                                                      
## [244] "r cookbook with examples"                                                                                                                  
## [245] "access large amounts of twitter data for data mining and other tasks within r via the twitter package"                                     
## [246] "information diffusion in social networks tutorial at vldb"                                                                                 
## [247] "visuwords online graphical dictionary any techniques for text mining or social network analysis are involved"                              
## [248] "mmiiina it is working which i tried one minute ago you might try addictedtor free fr graphiques"                                           
## [249] "r graphics gallery with source code"                                                                                                       
## [250] "data mining scientist at apple inc austin tx via kdnuggets"                                                                                
## [251] "interactive charts with googlevis package and r"                                                                                           
## [252] "opendata r google easy maps"                                                                                                               
## [253] "frequent itemset mining implementations repository"                                                                                        
## [254] "a c frequent itemset mining template library"                                                                                              
## [255] "arminer a client server data mining application specialized in finding association rules"                                                  
## [256] "barack obama is recruiting analysts for his re election campaign job details and application at view jobs of"                              
## [257] "an overview of data mining tools"                                                                                                          
## [258] "fastcluster fast hierarchical clustering routines for r and python"                                                                        
## [259] "r tutorial an r introduction to statistics many examples at the blog"                                                                      
## [260] "statistical data mining tutorials by andrew moore dozens of tutorial slides in pdf format"                                                 
## [261] "spatial regression analysis in r a workbook tutorials and worked examples using r package spdep"                                           
## [262] "resources to help you learn and use r"                                                                                                     
## [263] "an introduction to time series decompositiom with stl"                                                                                     
## [264] "geoda a free software for spatial data analysis"                                                                                           
## [265] "cluto a software package for clustering low and high dimensional datasets free for educational and research purposes"                      
## [266] "i created group rdatamining on linkedin"                                                                                                   
## [267] "slides of talks at r users groups"                                                                                                         
## [268] "slides on statisical analysis modelling of afl rapidminer r integration unfortunately need join to download"                               
## [269] "slides of talks at melburn melbourne users of r network accessible to anyone"                                                              
## [270] "slides of a talk on data table package"                                                                                                    
## [271] "datasets to practice your data mining incl large real world data and small research data"                                                  
## [272] "r examples on various clustering techniques with r codes and dataset provided see for exercises instructions"                              
## [273] "mahout mining large data sets it supports recommendation mining clustering classification frequent itemset mining"                         
## [274] "book and slides in pdf on introduction to information retrieval"                                                                           
## [275] "r code examples on time series analysis and mining extracted from my recent talk are available on rdatamining blog"                        
## [276] "r ranked no in a recent poll on top languages for data mining analytics"                                                                   
## [277] "slides on r graphics by paul murrell"                                                                                                      
## [278] "acm sigkdd innovation award goes to dr ross quinlan for his invention of id and c for decision tree learning"                              
## [279] "r code examples d bar charts matrix contours microarray heatmaps network graphs ramachandran plots linear models"                          
## [280] "learn r toolkit quickly learn how to make advanced charts with r"                                                                          
## [281] "pdf notes on data loading plotting time series and sweave in r"                                                                            
## [282] "emilopezcano thanks still in an initial stage a long way to go"                                                                            
## [283] "i sorry missed your message slides on my time series talk are available at"                                                                
## [284] "my document r and data mining examples and case studies is scheduled to be published by elsevier in mid"                                   
## [285] "vacancies on analytics and r at aps level at department of immigration citizenship australia close by aug"                                 
## [286] "lecture notes on data mining course at cmu some of which contain r code examples"                                                          
## [287] "distributed text mining in r"                                                                                                              
## [288] "using r for data analysis and graphics introduction examples and commentary by john maindonald"                                            
## [289] "experiences with using r in credit risk at anz bank ppt slides"                                                                            
## [290] "tutorial on discovering multiple clustering solutions grouping objects in different views of the data"                                     
## [291] "tutorial on spatial and spatio temporal data mining"                                                                                       
## [292] "an easy to understand brief introduction of haar wavelet transform a simple dwt discrete wavelet transform"                                
## [293] "slides of my talk on time series analysis and mining with r at canberra r users group are available at"                                    
## [294] "time series analysis for business forecasting"                                                                                             
## [295] "i will give a talk on time series analysis and mining with r at canberra r users group july"                                               
## [296] "text mining handbook in pdf format with r code examples"                                                                                   
## [297] "tuliplab hi i think i followed the instructions below to put the html code in google sites page"                                           
## [298] "r package data table extension of data frame for fast indexing fast ordered joins and fast grouping"                                       
## [299] "data mining lecture notes"                                                                                                                 
## [300] "a complete guide to nonlinear regression"                                                                                                  
## [301] "visits to rdatamining com from china increased from to a sign that china can now access googlesites clustrmaps"                            
## [302] "an r time series tutorial with example codes"                                                                                              
## [303] "r news and tutorials contributed by r bloggers"                                                                                            
## [304] "text data mining with twitter and r"                                                                                                       
## [305] "want to learn data mining lecture slides free chapters on introduction to data mining available at"                                        
## [306] "get stuck in r programming try r help mailing list and you can get answers from other r users"                                             
## [307] "which r documents and r packages to start with answers at or"                                                                              
## [308] "getting lost in over r packages which r packages to use cran task views is a good guidance see"                                            
## [309] "a recent poll shows that r is the nd popular tool used for data mining see poll data mining analytic tools used"                           
## [310] "what is clustering what clustering approaches and techniques are available find answers at"                                                
## [311] "melt and cast your data with reshape an r package for flexibly restructuring and aggregating data"                                         
## [312] "spatial data analysis with r presentations available at"                                                                                   
## [313] "rstudio a free ide for r programming"                                                                                                      
## [314] "comments are enabled at to enable comments at your website try html comment box lt"                                                        
## [315] "rdatamining group to share your experience on using r for data mining with other data miners"                                              
## [316] "there are more than visits to rdatamining com website up to april since its opening on march"                                              
## [317] "r reference card for data mining also available at mirrors in addition to"                                                                 
## [318] "traminer is an excellent r package for mining and visualizing sequence data its function seqefsub searches for frequent subsequences"      
## [319] "pajek a free tool for large network analysis"                                                                                              
## [320] "an r reference card for data mining is now available on cran it lists many useful r functions and packages for data mining applications"

b. Count the frequency of words “data” and “mining”.

Method 1: manually count the words ‘data’ and ‘mining’

freq.words <- data.frame(
  "data" = unlist(lapply(tweet_texts, function(x){
      words <- unlist(strsplit(x, "\\s+"))
      sum(words == "data")
    })) , 
  "mining" = unlist(lapply(tweet_texts, function(x){
      words <- unlist(strsplit(x, "\\s+"))
      sum(words == "mining")
    }))
)

freq <- freq.words %>% summarize(across(everything(), ~ sum(.x, na.rm = TRUE)))
freq

Method 2: count the words using the Corpus function

# concatenate all string into one
text_data <- paste(tweet_texts, collapse=" ")

# Create a corpus and perform text cleaning
text_corpus <- Corpus(VectorSource(text_data))
dtm <- DocumentTermMatrix(text_corpus)
dtm_matrix <- as.matrix(dtm)

freq <- gather(as.data.frame(dtm_matrix), key="col", value="c") %>% 
  filter(col %in% c("data", "mining"))
freq
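
As a cross-check on both methods, the same counts can be obtained with base R alone by tabulating every word in one pass. The sketch below uses three tweets copied from the cleaned output above as a stand-in for the full `tweet_texts` vector; `sample_tweets` and `word_counts` are illustrative names that are not part of the original analysis.

```r
# Three cleaned tweets copied from the output above (stand-in for tweet_texts)
sample_tweets <- c(
  "text mining tutorial",
  "data mining lecture notes",
  "data mining job opening toronto canada"
)

# Split every tweet into words and tabulate all words in one pass
all_words <- unlist(strsplit(sample_tweets, "\\s+"))
word_counts <- table(all_words)

# Extract the counts for the two words of interest
word_counts[c("data", "mining")]
```

Replacing `sample_tweets` with `tweet_texts` should give the totals for the whole dataset.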

c. Plot the word cloud.

The following code plots the word cloud. The work is split into 4 steps.

  1. Merge all sentences.
  2. Remove meaningless words.
  3. Calculate the word frequencies.
  4. Plot the word cloud.

# Concatenate all tweets into one string
text_data <- paste(tweet_texts, collapse=" ")

# Create a corpus and perform text cleaning
text_corpus <- Corpus(VectorSource(text_data))

# Remove the meaningless words
custom_stopwords <- c("the", "a", "an", "in", "on", "of", "to", "for", "with", "by", "and")
text_corpus <- tm_map(text_corpus, removeWords, custom_stopwords)

#text_corpus <- tm_map(text_corpus, content_transformer(tolower))
text_corpus <- tm_map(text_corpus, removePunctuation)

# Create a term-document matrix and calculate word frequencies
tdm <- TermDocumentMatrix(text_corpus)
word_freq <- rowSums(as.matrix(tdm))

# Create the word cloud
wordcloud(names(word_freq), 
          freq = word_freq, 
          scale = c(3, 0.5), 
          min.freq = 3, 
          colors = brewer.pal(8, "Dark2")
)

d. Use a topic modelling algorithm to fit the Twitter data to 8 topics. Find the top 6 frequent terms (words) in each topic.

if(!require(topicmodels))
    install.packages("topicmodels")
## Loading required package: topicmodels
library("topicmodels")
dtm <- DocumentTermMatrix(text_corpus)
# Specify the number of topics (k)
k <- 8

# Fit the LDA model
lda_model <- LDA(dtm, 
                 k = k, 
                 method = "Gibbs", 
                 control = list(seed = 9999, 
                                burnin = 1000, 
                                thin = 100, 
                                iter = 1000)
                 )

# Get the 6 most frequent terms in each topic
top_words <- terms(lda_model, 6)

lda_model
## A LDA_Gibbs topic model with 8 topics.
top_words 
##      Topic 1       Topic 2       Topic 3        Topic 4        Topic 5     
## [1,] "programming" "examples"    "free"         "introduction" "data"      
## [2,] "studies"     "online"      "text"         "slides"       "mining"    
## [3,] "lecture"     "your"        "group"        "see"          "analysis"  
## [4,] "see"         "ausdm"       "modelling"    "available"    "research"  
## [5,] "opening"     "parallel"    "postdoc"      "from"         "package"   
## [6,] "chapters"    "rdatamining" "postdoctoral" "researcher"   "university"
##      Topic 6    Topic 7     Topic 8       
## [1,] "software" "example"   "applications"
## [2,] "due"      "reference" "codes"       
## [3,] "map"      "canada"    "used"        
## [4,] "via"      "center"    "access"      
## [5,] "amp"      "cloud"     "business"    
## [6,] "have"     "course"    "fast"
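
LDA learns topics from how words co-occur across documents, so it is usually fitted to a document-term matrix with one row per tweet. The sketch below builds such a matrix in base R for three tweets taken from the cleaned output; `build_dtm` and `sample_tweets` are hypothetical names written for illustration. With tm, the analogous step would presumably be `DocumentTermMatrix(Corpus(VectorSource(tweet_texts)))`, which has not been run here.

```r
# Three cleaned tweets copied from the output above (stand-in for tweet_texts)
sample_tweets <- c(
  "text mining tutorial",
  "data mining lecture notes",
  "r cookbook with examples"
)

# Hypothetical helper: one row per document, one column per distinct word
build_dtm <- function(docs) {
  tokens <- strsplit(docs, "\\s+")
  vocab <- sort(unique(unlist(tokens)))
  # Count each vocabulary word in each document (one column per document)
  counts <- vapply(
    tokens,
    function(tok) as.integer(table(factor(tok, levels = vocab))),
    integer(length(vocab))
  )
  # Rearrange so that documents are rows and words are columns
  matrix(counts, nrow = length(docs), byrow = TRUE,
         dimnames = list(NULL, vocab))
}

dtm_per_tweet <- build_dtm(sample_tweets)
dim(dtm_per_tweet)
```

Each row of `dtm_per_tweet` is then a separate document for the topic model, in contrast to a matrix built from one concatenated string.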

3. Provide a real-world example of a system or an application that utilises stream-data. In your example, explain the challenges faced by algorithms in analysing stream data and suggest some ideas to address those challenges (6)

A real-world example of a stream-data application

StockSight is an application that has been used by many organizations. It analyses the sentiment of Twitter posts and news headlines about stocks. Analysing this kind of stream raises several challenges: Twitter and other news platforms hold a huge amount of data; new data is posted rapidly, especially on Twitter; some authors use slang, emojis and other informal language, which makes the text harder to analyse; and much of the data is unrelated to the stocks of interest. The items listed below are some of my ideas for dealing with these challenges.

  1. Use a stream sampling algorithm to cope with the huge volume of data and the speed at which it is posted.
  2. Use NLP techniques to handle informal text such as slang and emojis.
  3. Use a classification method to filter out unrelated data.
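
One common stream sampling algorithm is reservoir sampling, which keeps a fixed-size uniform sample without storing the whole stream. The sketch below is a minimal base R version of the classic algorithm, assuming the stream can be read as a plain vector; `reservoir_sample` is an illustrative helper and not a function from StockSight or any package.

```r
# Illustrative helper: keep a uniform random sample of k items from a stream
reservoir_sample <- function(stream_values, k) {
  n <- length(stream_values)
  # Fill the reservoir with the first k items
  reservoir <- stream_values[seq_len(min(k, n))]
  if (n > k) {
    for (i in (k + 1):n) {
      # Item i replaces a random slot with probability k / i
      j <- sample.int(i, 1)
      if (j <= k) reservoir[j] <- stream_values[i]
    }
  }
  reservoir
}

set.seed(42)
kept <- reservoir_sample(1:500, k = 200)
length(kept)
```

Only `k` items are held in memory at any time, which is what makes the approach suitable for fast, high-volume sources.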

References

https://github.com/shirosaidev/stocksight

4. Create a data stream of two dimensions data points. The data points will follow Gaussian distribution with 5% noise and belong to 4 clusters. Compare the performance of the following clustering methods in terms of precision, recall, and F1. (6)

if(!require(stream))
  install.packages("stream")  
library(stream)
stream <- DSD_Gaussians(k = 4, d = 2, noise = .05, p = c(0.9, .5, .3, .1))
stream
## Gaussian Mixture (d = 2, k = 4) 
## Class: DSD_Gaussians, DSD_R, DSD

a. Use Reservoir sampling to sample 200 data points from 500 data points of the stream. Use K-means to cluster the points in the reservoir into 5 groups, and use 100 points from the stream to evaluate the performance of K-means.

Reservoir_Kmeans <- DSC_TwoStage(
  micro = DSC_Sample(k = 200), 
  macro = DSC_Kmeans(k = 4)
)
update(Reservoir_Kmeans, stream, n=500)
plot(Reservoir_Kmeans)

evaluate_static(Reservoir_Kmeans, 
                stream, 
                measure =c("f1", "precision", "recall"), 
                n =100
)
## Evaluation results for macro-clusters.
## Points were assigned to micro-clusters.
## 
##        F1 precision    recall 
## 0.9453104 0.8976417 0.9983259 
## attr(,"type")
## [1] "macro"
## attr(,"assign")
## [1] "micro"

b. Use Windowing method to get 200 data points from 500 data points of the stream. Use K-means to cluster the points in the window into 5 groups, and use 100 points from the stream to evaluate the performance of K-means.

Window_Kmeans <- DSC_TwoStage(
  micro = DSC_Window(horizon = 200), 
  macro = DSC_Kmeans(k = 5)
)
update(Window_Kmeans, stream, n=500)
#Window_Kmeans

plot(Window_Kmeans, stream)

evaluate_static(
  Window_Kmeans, 
  stream,
  measure = c("f1","precision","recall"), 
  n =100
)
## Evaluation results for macro-clusters.
## Points were assigned to micro-clusters.
## 
##        F1 precision    recall 
## 0.9876278 0.9781566 0.9972841 
## attr(,"type")
## [1] "macro"
## attr(,"assign")
## [1] "micro"
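
Unlike reservoir sampling, a window keeps only the most recent points rather than a uniform sample of everything seen so far. A minimal base R sketch of that idea follows, assuming the stream is available as a plain vector; `sliding_window` is an illustrative helper and not a function of the stream package.

```r
# Illustrative helper: keep only the last `horizon` items of a stream
sliding_window <- function(stream_values, horizon) {
  window <- stream_values[0]  # empty vector of the same type
  for (x in stream_values) {
    window <- c(window, x)                 # newest item enters
    if (length(window) > horizon) {
      window <- window[-1]                 # oldest item leaves
    }
  }
  window
}

recent <- sliding_window(1:500, horizon = 200)
range(recent)
```

After 500 points with a horizon of 200, only points 301 to 500 remain, so the clustering reflects the latest state of the stream.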

c. Apply the D-Stream clustering method to 500 points from the stream with gridsize=0.1, and use 100 points from the stream to evaluate the performance of D-stream.

dstream <- DSC_DStream(gridsize = .1, Cm = 1.2)
update(dstream, stream, n = 500)
plot(dstream, stream)

evaluate_static(
  dstream, 
  stream,
  measure = c("f1","precision","recall"), 
  n =100
)
## Evaluation results for micro-clusters.
## Points were assigned to micro-clusters.
## 
##        F1 precision    recall 
## 0.5807339 0.8462567 0.4420391 
## attr(,"type")
## [1] "micro"
## attr(,"assign")
## [1] "micro"

Performance comparison

reservoir <- evaluate_static(Reservoir_Kmeans, 
                stream, 
                measure =c("f1", "precision", "recall"), 
                n =100
)

window <- evaluate_static(
  Window_Kmeans, 
  stream,
  measure = c("f1","precision","recall"), 
  n =100
)

dstream_p <- evaluate_static(
  dstream, 
  stream,
  measure = c("f1","precision","recall"), 
  n =100
)



performance <- rbind(t(reservoir[1:3]), t(window[1:3]))
performance <- rbind(performance, t(dstream_p[1:3]))

rownames(performance) <- c('reservoir', 'window', 'dstream')
performance
##                  F1 precision    recall
## reservoir 0.8908607 0.8066417 0.9947160
## window    0.9671362 0.9438717 0.9915764
## dstream   0.4672239 0.7855030 0.3324984

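According to the results, the windowing method achieves the highest F1 and precision, while reservoir sampling has a marginally higher recall (0.9947 versus 0.9916). D-Stream scores lowest on all three measures. Overall, the windowing method performs best in this case.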
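
The F1 column can be cross-checked against the other two, since F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the values in the comparison table; it uses the generic definition and makes no claim about how the stream package computes the measures internally.

```r
# Precision and recall copied from the comparison table above
precision <- c(reservoir = 0.8066417, window = 0.9438717, dstream = 0.7855030)
recall    <- c(reservoir = 0.9947160, window = 0.9915764, dstream = 0.3324984)

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 4)
```

The low recall of D-Stream pulls its F1 well below its precision, because the harmonic mean is dominated by the smaller of the two values.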

5. Explain a real-world application of geographical information system. (4)

QGIS (Quantum GIS) is a powerful and open-source Geographical Information System (GIS) software that provides a comprehensive set of tools for managing, analyzing, and visualizing geographic and spatial data.

It is widely used for a variety of applications, from mapping and cartography to spatial analysis and data management.

QGIS allows users to import and manage many kinds of data, and provides a wide range of geospatial analysis tools for spatial queries, buffer analysis, spatial joins, proximity analysis, etc. Users can also create high-quality maps with symbols, labels and styling options. In addition, QGIS is highly customizable, which allows users to adapt it to their specific needs.

Most importantly, QGIS is open-source, community-driven software that is freely available to everyone. It is also cross-platform, running on multiple operating systems.

References

QGIS: An Introduction to an Open-Source Geographic Information System

QGIS Github Repository

6. Use spatial data analysis packages in R to do the following tasks. (10)

a. Draw a map of Australia where each city is represented as a dot. Highlight cities with population more than one million people. Map should have only the borders at country and state levels.

if(!require(terra))
  install.packages("terra")
library(terra)
filename <- paste0(getwd(),"/SA2_2021_AUST_SHP_GDA2020/SA2_2021_AUST_GDA2020.shp")
#basename(filename)
ausvec <- vect(filename)
# filter out all empty geometry
ausvec <- ausvec[!is.na(ausvec$AREASQKM21)]

# dissolve the polygons to state level
ausvecAgg <- aggregate(ausvec, by="STE_NAME21")

#load population data
auspop <- vect(paste0(getwd(),"/Australia_population.shp"))

#plot Australia with borders at country and state levels
plot(ausvecAgg)

# plot cities with fewer than 1 million people as small dots
plot(auspop[as.integer(auspop$population) < 1000000], add=TRUE, cex = 0.1)
# plot cities with 1 million people or more, highlighted in red
plot(auspop[as.integer(auspop$population) >= 1000000], add=TRUE, col="red")

b. Use the shapefile provided in the course website to draw a map of “South Australia”. Keep all borders in the map. Use a colour palette to highlight the statistical areas level 4 (SA4).

# Plot South Australia

# select data of South Australia
sa <- ausvec[ausvec$STE_NAME21 == "South Australia"]

# dissolve based on SA4 level
sa4Agg <- aggregate(sa, by="SA4_NAME21")
plot(sa4Agg, c('SA4_NAME21'))
plot(sa, add=TRUE)

c. Create a spatial vector of “Greater Adelaide”. Aggregate the polygons to draw a map that shows only the borders for statistical areas level 3 (SA3).

greater_adelaide <- ausvec[ausvec$GCC_NAME21 == "Greater Adelaide"]
# dissolve based on SA3 level
greater_adelaide.sa3Agg <- aggregate(greater_adelaide, by="SA3_NAME21")

#plot only the level 3 border of Greater Adelaide
plot(greater_adelaide.sa3Agg)

d. For this point you need to check the data in “crimeCounts.csv” available in the course website.

  • Use the variable “SA3_NAME21” to obtain a spatial vector of “Salisbury”.
salibury <- ausvec[ausvec$SA3_NAME21 == "Salisbury"]
plot(salibury)

  • Create a new attribute with the name crimeCounts containing the offence count (July 2022 – June 2023) for the suburbs in Salisbury spatial vector.
# load crime data
crime_count <- read.csv('crimeCounts.csv')
# rename columns
names(crime_count) <- c('id', 'suburb', 'crimeCounts')
# convert to lower case for easier comparison
crime_count$lowerSub <- tolower(crime_count$suburb)

# create a suburb dataframe for merging
suburb.df <- data.frame(
  "lowerSub" = tolower(salibury$SA2_NAME21)
)
suburb.df <- left_join(suburb.df, crime_count, by="lowerSub")

#assign crimeCounts for the suburbs in Salisbury
salibury$crimeCounts <- suburb.df$crimeCounts

# show the count and suburbs
values(salibury) %>% select(SA2_NAME21, crimeCounts)
  • Create a spatial raster to display the crimeCounts in Salisbury. Select a colour palette so that high crimeCounts are represented in red colour.
#library(RColorBrewer)
#display.brewer.all()

salibury_raster <- rast(
  ncol=1000, nrow=1000, 
  xmin=138.53, xmax=138.71, 
  ymin=-34.85, ymax=-34.68
)

#Use a feature from the SpatVector as raster values 
salibury_raster <- rasterize(salibury, salibury_raster, "crimeCounts")
terra::plot(salibury_raster, col=brewer.pal(9, "YlOrRd"))
terra::lines(salibury)

#text(salibury, salibury$crimeCounts, inside=T, cex=0.6)
  • Show Salisbury suburb names and borders in the map.
salibury_raster <- rast(ncol=1000, nrow=1000, xmin=138.53, xmax=138.71, ymin=-34.85, ymax=-34.68)
#Use a feature from the SpatVector as raster values 
salibury_raster <- rasterize(salibury, salibury_raster, "crimeCounts")
terra::plot(salibury_raster, col=brewer.pal(11, "Spectral"))
terra::lines(salibury)
text(salibury, salibury$SA2_NAME21, inside=T, cex=0.6)

e. Create a html page with an interactive map containing the markers of your top 5 restaurants in Adelaide. Include in your report a screenshot of the interactive map. Upload the html as additional file in your submission.

The restaurant data were obtained from https://discover.data.vic.gov.au/dataset/cafe-restaurant-bistro-seats

if(!require(leaflet))
  install.packages("leaflet")
## Loading required package: leaflet
## Warning: package 'leaflet' was built under R version 4.2.3
if(!require(sf))
  install.packages("sf")
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(sf)
library(leaflet)

# Create a leaflet map
leaflet() %>%
  addTiles() %>%  # Add a basemap
  addMarkers(lng = 138.60956081250663,lat = -34.92074609311045,popup = "Schnithouse Rundle St") %>%
  addMarkers(lng = 138.59883250622383,lat = -34.924407169226825,popup = "GEORGES") %>%
  addMarkers(lng = 138.5941120067306,lat = -34.92806706283753,popup = "Nu Thai Restaurant") %>%
  addMarkers(lng = 138.6006364073467, lat = -34.932289053377616, popup = "La Trattoria Restaurant & Pizza Bar") %>%
  addMarkers(lng = 138.61282628115364, lat = -34.93249827609448, popup = "Ballaboosta")
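
To produce the HTML file for upload, the same markers can be held in a data frame and the map written out with `htmlwidgets::saveWidget()`. This is a sketch under assumptions: `restaurants` and `save_restaurant_map` are illustrative names, the output file name is a placeholder, and the helper needs the leaflet and htmlwidgets packages to be installed when it is called.

```r
# The five restaurants from the map above, as one data frame
restaurants <- data.frame(
  name = c("Schnithouse Rundle St", "GEORGES", "Nu Thai Restaurant",
           "La Trattoria Restaurant & Pizza Bar", "Ballaboosta"),
  lng = c(138.60956081250663, 138.59883250622383, 138.5941120067306,
          138.6006364073467, 138.61282628115364),
  lat = c(-34.92074609311045, -34.924407169226825, -34.92806706283753,
          -34.932289053377616, -34.93249827609448)
)

# Illustrative helper: build the map from a data frame and save it as HTML
save_restaurant_map <- function(places, file) {
  map <- leaflet::leaflet(places)
  map <- leaflet::addTiles(map)  # add a basemap
  map <- leaflet::addMarkers(map, lng = ~lng, lat = ~lat, popup = ~name)
  htmlwidgets::saveWidget(map, file = file, selfcontained = TRUE)
  invisible(map)
}
```

Calling `save_restaurant_map(restaurants, "top5_restaurants.html")` would then write the interactive page; keeping the coordinates in a data frame also avoids repeating `addMarkers()` for each restaurant.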