Creating a word cloud

Now let’s take a look at the dataset we collected using the keyword "tvdebates" and how the use of COSMOS tools can help us understand what people are saying and whom they are interacting with.

As before, we drag the dataset we wish to examine into the workspace where it is opened in table view.

Let’s also narrow down the attributes to focus on by deselecting those we are not interested in.

To use any of the tools in the right-hand pane, we simply select and drag the tool we wish to use into the workspace and place it over the dataset.

I’m first of all going to use the word cloud tool to visualize the text of the tweets. We see how the individual words in the set of tweets are displayed, scaled in proportion to their relative frequency.

In this case, we can see that words such as scared, cowardly, hypocritical etc. figure quite prominently, providing us with a feel for the tone of the discussion and people’s attitudes on this topic. Of course, any conclusions we might wish to draw from this summary view would need more careful study at the tweet level before they can be verified. We also see that several different accounts are mentioned in these tweets – harryslaststand, willblackwriter and nhaparty – giving us an indication of the more prominent participants.
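As a rough illustration of what the word cloud summarises, the following Python sketch counts how often each word appears across a set of tweets and converts the counts into relative frequencies. The function name `relative_word_frequencies`, the tokenising rule and the sample tweets are assumptions made for this example; this is not how COSMOS itself is implemented.

```python
import re
from collections import Counter


def relative_word_frequencies(tweets):
    """Return {word: share of all words} for a list of tweet texts."""
    words = []
    for text in tweets:
        # Lower-case, then keep runs of letters, digits, underscores and apostrophes.
        words.extend(re.findall(r"[a-z0-9_']+", text.lower()))
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Made-up sample tweets, for illustration only.
sample = [
    "scared of the tvdebates",
    "cowardly and scared",
]
freqs = relative_word_frequencies(sample)
# A word cloud would draw words with larger shares in larger type.
for word, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {share:.2f}")
```

A real word cloud would normally also drop very common words such as "of" and "the" before scaling; that step is left out here to keep the sketch short.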

Pie chart separated by gender

Let’s now see how we can create and then explore a subset of this dataset. To do this, we can use the pie chart tool. Let’s choose gender as our variable. Here is the pie chart showing the relative contributions of people posting tweets by gender. If we click on the female segment, then the chart tool responds by separating this segment from the rest of the chart. If we now press the select button on the bottom left of the chart view, we can create and store in the dataset repository a subset of tvdebates where all the posters are female. Do not forget that COSMOS assigns a gender to a post by examining the poster’s profile. (This is not one hundred per cent accurate.)
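The same two steps, computing the shares shown in the pie chart and then selecting one segment as a subset, can be sketched outside COSMOS in a few lines of Python. The record layout, the field name `gender` and the helper names `gender_shares` and `select_subset` are assumptions made for illustration only.

```python
from collections import Counter


def gender_shares(tweets):
    """Return {gender: share of tweets}, the values a pie chart would display."""
    counts = Counter(t.get("gender", "unknown") for t in tweets)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}


def select_subset(tweets, gender):
    """Return only the tweets whose assigned gender matches."""
    return [t for t in tweets if t.get("gender") == gender]


# Made-up records; "gender" stands in for the value inferred from the profile.
tweets = [
    {"user": "a", "text": "first", "gender": "female"},
    {"user": "b", "text": "second", "gender": "male"},
    {"user": "c", "text": "third", "gender": "female"},
    {"user": "d", "text": "fourth", "gender": "unknown"},
]
shares = gender_shares(tweets)
females = select_subset(tweets, "female")
print(shares)
print(len(females))
```

Keeping an explicit "unknown" category makes it visible how many tweets have no assigned gender, which matters when comparing subsets.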

Word cloud separated by gender

Let’s compare the word cloud for the subset of female posters with the word cloud for the subset of tweets posted by males. I have already created this latter subset using the same method as we used to create the female subset.

I will now create in the workspace a word cloud for the subset of tweets posted by males and place it alongside the word cloud for tweets posted by females.

A comparison of the two word clouds suggests a number of interesting features. First, female posters are proportionally more likely to mention the official davidcameron account in their tweets. This suggests either that females are more likely to retweet or reply to tweets posted by this account, or that they are more likely to mention it in their own tweets.

The other notable feature is that the relative frequency of words such as arrogant, calculating etc. is higher in the female subset, suggesting that female posters are displaying a more negative reaction to this topic. Again, these findings would need verification through more analysis of the tweets, not least because a significant number of tweets do not have an assigned gender, but what the findings do is suggest questions for further study.

Network visualisation tool - Mentions all tweets

Let’s now use the network tool to explore the ways in which posters are engaging with one another as they retweet or reply to each other’s tweets, and mention different accounts.

First, let’s examine the patterns of mentions for the whole tvdebates dataset. Applying the network tool and selecting the ‘mentions’ option, we can see the mentions of all posters visualized as a network, where the nodes represent the accounts and the directed (arrowed) links (or edges) from one node to another represent tweets posted by the account from which the edge originates that mention the account that it points to. The size of a node is in proportion to the number of mentions of the account to which it belongs. We click on and drag the nodes around in order to see more clearly the clusters of mentions and we can zoom in and out to see parts of the network in more detail.
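To make the structure of the mentions network concrete, here is a minimal Python sketch that turns tweets into directed edges and counts incoming mentions per account. The record fields `user` and `text`, the `@name` pattern, the account names and the helper names are all assumptions for this example, not a description of the COSMOS internals.

```python
import re
from collections import Counter

# Assumed pattern: an @ sign followed by letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Return directed (poster, mentioned_account) pairs, one per mention."""
    edges = []
    for t in tweets:
        for name in MENTION.findall(t["text"]):
            edges.append((t["user"].lower(), name.lower()))
    return edges


def mentions_received(edges):
    """Count incoming edges per account; this is what would drive node size."""
    return Counter(target for _source, target in edges)


# Made-up tweets with hypothetical account names.
tweets = [
    {"user": "poster_1", "text": "@account_a well said"},
    {"user": "poster_2", "text": "agree with @account_a and @account_b"},
    {"user": "poster_3", "text": "no mentions here"},
]
edges = mention_edges(tweets)
sizes = mentions_received(edges)
print(edges)
print(sizes.most_common())
```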

What is clear is that in this dataset the mentions tend to cluster around particular accounts, with harryslaststand getting the most mentions, closely followed by willblackwriter and nhaparty.

A search on Twitter reveals that harryslaststand describes himself as “Survivor of the Great Depression, RAF veteran & activist.” willblackwriter describes himself as “Author and journalist.” nhaparty describes itself as “National Health Action Party. UK political party fighting for a healthy NHS. Putting patients before profits and opposing #NHS privatisation.”

If we click on a node, then it and the accounts mentioning it are highlighted. This confirms that the main clusters are quite distinct from one another, suggesting that there are separate discussions going on between the accounts involved, though there are a small number of accounts which bridge between the main clusters.

Network visualisation tool - Retweets all tweets

Now, let’s examine the patterns of retweets for this dataset. Selecting the retweets option, we can see retweets visualized as a network, where the nodes represent the accounts and the directed (arrowed) links (or edges) from one node to another represent tweets posted by the account from which the edge originates that are retweets of tweets posted by the account that the edge points to.
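The retweet network can be sketched in the same way. The example below assumes that a retweet can be recognised by text beginning with `RT @name`, which is a common convention but only an assumption here; the field names, account names and the helper `retweet_edges` are likewise hypothetical.

```python
import re
from collections import Counter

# Assumed convention: a retweet's text starts with "RT @original_poster".
RETWEET = re.compile(r"^RT @(\w+)")


def retweet_edges(tweets):
    """Return directed (retweeter, original_poster) pairs."""
    edges = []
    for t in tweets:
        match = RETWEET.match(t["text"])
        if match:
            edges.append((t["user"].lower(), match.group(1).lower()))
    return edges


# Made-up tweets with hypothetical account names.
tweets = [
    {"user": "poster_1", "text": "RT @account_a: an original tweet"},
    {"user": "poster_2", "text": "@account_b I disagree"},
    {"user": "poster_3", "text": "RT @account_a: an original tweet"},
]
edges = retweet_edges(tweets)
retweets_received = Counter(target for _source, target in edges)
print(retweets_received["account_a"])
print(retweets_received["account_b"])
```

In this made-up data, account_b is mentioned but never retweeted, which is the kind of contrast between the two network views that prompts a closer look at the tweets themselves.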

What is immediately noticeable is that while willblackwriter was receiving a lot of mentions, very few of this account’s tweets are being retweeted. Retweets are sometimes interpreted as being endorsements of the original tweet, and so we might conclude that few people interacting with willblackwriter agree with him. However, this conclusion would need to be verified through more detailed tweet-level analysis. Again, therefore, these retweet results are suggestive of questions that might be worthy of more detailed tweet-level analysis, rather than being unambiguous conclusions.

Exporting analysis

As mentioned in the overview, we are now able to export our data. COSMOS tools may not provide all the analysis facilities that users may need, so the option to export data in various file formats enables users to carry on analysis outside of the COSMOS desktop.

The simplest export format is as a CSV file. Here, I am going to export the Females data subset in this format. I simply put the dataset in the workspace. After selecting which attributes I wish to include, I then select the export button at the bottom left of the table view. I can now choose a filename and initiate the export process.
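For readers who want to picture what such a file contains, the sketch below writes only a chosen set of attributes to CSV using Python's standard `csv` module. The attribute names, the sample records and the helper `export_csv` are assumptions for illustration; the columns COSMOS actually writes may differ.

```python
import csv
import io


def export_csv(records, attributes, out):
    """Write only the chosen attributes of each record as CSV rows."""
    # extrasaction="ignore" silently drops attributes that were not selected.
    writer = csv.DictWriter(out, fieldnames=attributes, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Made-up records with hypothetical attribute names.
records = [
    {"user": "account_a", "text": "first tweet", "gender": "female", "lang": "en"},
    {"user": "account_b", "text": "second, with a comma", "gender": "female", "lang": "en"},
]
buffer = io.StringIO()
export_csv(records, ["user", "text"], buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with `open(filename, "w", newline="")` instead would produce a file on disk.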

Other file export formats you may find useful are provided by the social network analysis tool. These are GraphML, GEXF and JSON. GraphML and GEXF are XML-based formats for social networks. Both of these are compatible with Gephi (which you are already familiar with from our last tutorial). This not only makes COSMOS a versatile tool for data collection and visualisation, but also a powerful companion for further visualisations with Gephi.

To export data in these formats, I simply select the export function and then choose the file format and filename as before.
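To give a sense of what a GraphML file looks like, here is a minimal Python sketch that serialises a list of directed edges using only the standard library. It produces a bare-bones document with nodes and edges and no extra attributes; the helper name `edges_to_graphml` and the account names are hypothetical, and the files COSMOS exports may well carry more detail.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def edges_to_graphml(edges):
    """Build a GraphML string for a list of directed (source, target) pairs."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "directed"})
    # Every account that appears at either end of an edge becomes a node.
    nodes = sorted({name for edge in edges for name in edge})
    for name in nodes:
        ET.SubElement(graph, "node", {"id": name})
    for i, (source, target) in enumerate(edges):
        ET.SubElement(
            graph, "edge", {"id": f"e{i}", "source": source, "target": target}
        )
    return ET.tostring(root, encoding="unicode")


# Made-up edges with hypothetical account names.
edges = [("poster_1", "account_a"), ("poster_2", "account_a")]
xml_text = edges_to_graphml(edges)
print(xml_text)
```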

Last modified: Monday, 16 March 2015, 7:49 PM