ARM and Networks with Twitter Data

Twitter data were collected for the hashtags #NHL and #NHLNews and analyzed using Association Rule Mining (ARM) to identify associations among the words in the tweets.

Link to the Code


Figure 1. Transaction Data of #NHL Tweets. #NHL tweets were collected from Twitter and their text content was converted into a transaction data set format. The data were then cleaned by removing stop words, words containing digits, and web links. The results are linked in the image above.
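
The cleaning step described in the caption can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis; the stop-word list, the regular expressions, the function name tweet_to_transaction, and the two sample tweets are all assumptions made up for the example.

```python
import re

# Hypothetical stop-word list; the list used in the original analysis is not given.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "are", "for", "on", "at"}

# Matches web links so they can be removed before tokenizing.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def tweet_to_transaction(text):
    """Turn one tweet into a sorted list of unique, cleaned word items."""
    text = URL_PATTERN.sub(" ", text.lower())  # drop web links first
    tokens = re.findall(r"[a-z0-9]+", text)    # split into word-like tokens
    items = set()
    for token in tokens:
        if token in STOP_WORDS:
            continue                           # drop stop words
        if any(ch.isdigit() for ch in token):
            continue                           # drop words containing a digit
        items.add(token)
    return sorted(items)


# Made-up sample tweets, for illustration only.
tweets = [
    "The Rangers are back at the Garden! #NHL https://example.com/abc",
    "ESPN has 3 games tonight #NHL #NYR",
]
transactions = [tweet_to_transaction(t) for t in tweets]
for row in transactions:
    print(",".join(row))  # one comma-separated transaction per tweet
```

Each tweet becomes one transaction, and each remaining word becomes one item, which is the shape that rule mining expects.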

Tables of Confidence, Support and Lift for #NHL Tweets

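Association Rule Mining uses the support, confidence, and lift of itemsets in transaction data to determine how the items relate to one another. Support is how frequently an itemset appears in the data set. Confidence is how often a rule holds: of the transactions that contain the left-hand side of the rule, it is the fraction that also contain the right-hand side. Lift is the ratio of the rule's observed support to the support that would be expected if the two sides of the rule were independent, so a lift above 1 indicates that the items occur together more often than chance would predict. The rules that are generated indicate associations between the items. Using these we can observe interesting trends and associations in the data set, create a network, and draw conclusions.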
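
To make the three measures concrete, the sketch below computes them by hand on a tiny toy data set. It is only an illustration under stated assumptions: the function names and the four toy transactions are made up, and the original analysis used apriori rather than this code.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence of lhs -> rhs divided by the support of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Made-up toy transactions, for illustration only.
toy = [
    {"nhl", "espn", "rangers"},
    {"nhl", "espn"},
    {"nhl", "rangers", "garden"},
    {"nhl", "goal"},
]

print(support(toy, {"espn", "nhl"}))         # 0.5
print(confidence(toy, {"espn"}, {"nhl"}))    # 1.0
print(lift(toy, {"espn"}, {"nhl"}))          # 1.0
print(lift(toy, {"garden"}, {"rangers"}))    # 2.0
```

In this toy data the rule espn -> nhl has a confidence of 1 but a lift of only 1, because nhl appears in every transaction. The rule garden -> rangers has a lift of 2, meaning the two words appear together twice as often as independence would predict.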

Table 1. #NHL Tweets Confidence. The transaction data for #NHL was analyzed using apriori and the confidence was measured. The thresholds were a minimum of .27 for support and .5 for confidence. The top 15 rules for confidence are shown below. The confidence equals one for all of the top 15 rules, which indicates that the rule holds every time its left-hand items show up; for example, every time espn shows up, nhl also shows up.

Table 2. #NHL Tweets Support. The transaction data for #NHL was analyzed using apriori and the support was measured. The thresholds were a minimum of .27 for support and .5 for confidence. The top 15 rules for support are shown below. The highest support is .62, which means that the items in this rule appear together in 62% of the tweets in the data set.

Table 3. #NHL Tweets Lift. The transaction data for #NHL was analyzed using apriori and the lift was measured. The thresholds were a minimum of .27 for support and .5 for confidence. The top 15 rules for lift are shown below.
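
The thresholding and ranking behind these three tables can be sketched as below. This is a simplified brute-force enumeration of one-word to one-word rules in plain Python, not the apriori algorithm and not the code used for the tables; the names pair_rules and top_rules and the toy transactions are assumptions. Only the thresholds (.27 and .5) and the top-15 cutoff come from the captions.

```python
from itertools import permutations

MIN_SUPPORT = 0.27     # minimum support quoted in the table captions
MIN_CONFIDENCE = 0.5   # minimum confidence quoted in the table captions


def pair_rules(transactions, min_support=MIN_SUPPORT, min_confidence=MIN_CONFIDENCE):
    """Enumerate one-item -> one-item rules that pass both thresholds."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    vocab = sorted(set().union(*sets))
    item_support = {i: sum(i in s for s in sets) / n for i in vocab}
    rules = []
    for lhs, rhs in permutations(vocab, 2):
        both = sum(lhs in s and rhs in s for s in sets) / n
        if both < min_support:
            continue                      # the pair is too rare
        conf = both / item_support[lhs]
        if conf < min_confidence:
            continue                      # the rule is too unreliable
        rules.append({
            "lhs": lhs,
            "rhs": rhs,
            "support": both,
            "confidence": conf,
            "lift": conf / item_support[rhs],
        })
    return rules


def top_rules(rules, by, k=15):
    """Return the k rules with the highest value of the chosen measure."""
    return sorted(rules, key=lambda r: r[by], reverse=True)[:k]


# Made-up toy transactions, for illustration only.
toy = [
    ["nhl", "rangers", "nyr"],
    ["nhl", "rangers", "nyr", "garden"],
    ["nhl", "espn"],
    ["nhl", "goal"],
]
rules = pair_rules(toy)
for rule in top_rules(rules, by="lift", k=2):
    print(rule["lhs"], "->", rule["rhs"], rule["lift"])
```

Sorting the same rule set by "confidence", "support", or "lift" gives three different rankings, which is why the three tables highlight different rules.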

Networks

Figure 2. Network of Support for #NHL Tweets. A network was created using the top 50 rules ranked by support. The results are seen above.

Figure 3. Network of Confidence for #NHL Tweets. A network was created using the top 50 rules ranked by confidence. The results are seen above.
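
One way to turn a ranked rule list into a network is to treat each rule as a directed edge from its left-hand word to its right-hand word. The sketch below shows that idea in plain Python; it is an assumption about the general approach, not the code behind the figures, and the rule values, the function names rules_to_edges and node_degrees, and the cutoff used in the example are made up.

```python
from collections import Counter


def rules_to_edges(rules, by="support", k=50):
    """Keep the top k rules by one measure and return (source, target, weight) edges."""
    ranked = sorted(rules, key=lambda r: r[by], reverse=True)[:k]
    return [(r["lhs"], r["rhs"], r[by]) for r in ranked]


def node_degrees(edges):
    """Count how many edges touch each word; the busiest word sits at the center."""
    degree = Counter()
    for source, target, _weight in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


# Made-up rules with made-up values, for illustration only.
rules = [
    {"lhs": "espn", "rhs": "nhl", "support": 0.30, "confidence": 1.0},
    {"lhs": "rangers", "rhs": "nhl", "support": 0.40, "confidence": 1.0},
    {"lhs": "goal", "rhs": "nhl", "support": 0.35, "confidence": 0.9},
    {"lhs": "nyr", "rhs": "rangers", "support": 0.28, "confidence": 0.95},
]

edges = rules_to_edges(rules, by="support", k=3)
degrees = node_degrees(edges)
print(edges)
print(degrees.most_common(1))  # the word with the most connections
```

Changing the `by` argument changes which rules survive the cutoff, so networks built from support and from confidence can center on different words.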


As seen above, the Support Network centers on NHL, with team names and hockey terms like goal associated with it. This was expected because support indicates how frequently an item appears in the data set, and each tweet contains #NHL, so NHL is the center of the network. What is interesting is that several teams, such as the Capitals, the Rangers, and Ottawa, are associated with NHL in the network, but Calgary is not.

The Confidence Network focuses on the New York Rangers. It contains many items that are about the NYR, and these are associated with other NYR items.

Figure 4. Network of Lift for #NHL Tweets with NetworkD3. A network was created using the rules ranked by lift. The network was formed using the R package NetworkD3 which creates an interactive network. The interactive network is linked in the image above.

Figure 5. Network of Lift for #NHL Tweets with VisNetwork. A network was created using the rules ranked by lift. The network was formed using the R package VisNetwork which creates an interactive network. The interactive network is linked in the image above.


The lift network is centered on the New York Rangers and their return to the rink. This is indicated first by the large number of Rangers-related words in the network, such as the garden, LGR, and NYR. These words are linked with the words finally, back, and missed, which suggests that the Rangers are returning and the fans are excited. This is demonstrated in both the NetworkD3 and VisNetwork figures.

#NHLNews Tweets

Figure 6. Transaction Data of #NHLNews Tweets. #NHLNews tweets were collected from Twitter and their text content was converted into a transaction data set format. The data were then cleaned by removing stop words, words containing digits, and web links. The results are linked in the image above.

Tables of Confidence, Support and Lift for #NHLNews Tweets

Table 4. #NHLNews Tweets Confidence. The transaction data for #NHLNews was analyzed using apriori and the confidence was measured. The thresholds were a minimum of .27 for support and .5 for confidence. The top 15 rules for confidence are shown below. The confidence equals one for all of the top 15 rules, which indicates that the rule holds every time its left-hand items show up; for example, every time options shows up, jokes shows up as well.

Table 5. #NHLNews Tweets Support. The transaction data for #NHLNews was analyzed using apriori and the support was measured. The thresholds were a minimum of .27 for support and .5 for confidence. The top 15 rules for support are shown below. The highest support is .55, which means that the items in this rule appear together in 55% of the tweets in the data set.

Table 6. #NHLNews Tweets Lift. The transaction data for #NHLNews was analyzed using apriori and the lift was measured. The thresholds were a minimum of .27 for support and .5 for confidence. The top 15 rules for lift are shown below.

Networks

Figure 7. Network of Support for #NHLNews Tweets. A network was created using the top 50 rules ranked by support. The results are seen above.

Figure 8. Network of Confidence for #NHLNews Tweets. A network was created using the top 50 rules ranked by confidence. The results are seen above.


The Support Network has NHL at its center; however, Montreal and McDavid are also associated with NHL and hockey news. This suggests that a recent news story involving McDavid and Montreal had gained traction on Twitter.
For the Confidence Network, NHL news is the center, with the majority of items associated with it. Furthermore, the Stanley Cup is associated with rumors and memes, showing its important role in the sport.

Figure 9. Network of Lift for #NHLNews Tweets with VisNetwork. A network was created using the rules ranked by lift. The network was formed using the R package VisNetwork which creates an interactive network. The interactive network is linked in the image above.


The network is centered on the Stanley Cup, with NHLNews, NHLRumors, and NHLHumor all closely associated with it. This result is not surprising because all news in the NHL is ultimately about who will win the Cup. Thus the majority of articles, rumors, and jokes focus on which team will or will not win.

Summary

The findings showed that Association Rule Mining can discover trends and what is popular in each hashtag. This is demonstrated in both #NHL and #NHLNews. In #NHL, it discovered associations between words related to the Rangers and words related to some form of waiting, anticipation, or return. These associations allow us to hypothesize that the Rangers are returning to the rink and the fans are excited about it. In #NHLNews, the Stanley Cup is associated with rumors, news, and jokes, indicating that there is a strong relation between them. It can be inferred that hockey news largely revolves around the Stanley Cup, which is to be expected given its significance in the NHL.

These findings demonstrate that ARM can be used to find the most important news about the NHL through Twitter word associations. This allows the general mood of the public about the NHL to be inferred from word usage. Future steps could include quantifying this data in terms of positive or negative mood and examining how that mood affects salary.