Twitter, is one of the biggest social media networks in the world. It provides users with a lot of information. The purpose of this paper is to crawl certain Twitter trends in India, to extract data and to check to see if there are common figures through those trends.
Definition of Troll
Trolls are a common occurrence on social media. This is the definition of a troll according to Urban Dictionary
“Being a prick on the internet because you can. Typically unleashing one or more cynical or sarcastic remarks on an innocent by-stander, because it’s the internet and, hey, you can.” (Definition according to Urban Dictionary http://www.urbandictionary.com/define.php?term=trolling)
Wikipedia has a more formal description for a troll
“”Someone who posts inflammatory, extraneous, or off-topic messages in an online community, such as a forum, chat room, or blog, with the primary intent of provoking readers into an emotional response or of otherwise disrupting normal on-topic discussion.” https://en.wikipedia.org/wiki/Internet_troll
Owen Jones summed it up well here “Political debate, a crucial element of any democracy, is becoming ever more poisoned. Social media has helped to democratise the political discourse, forcing journalists – who would otherwise simply dispense their alleged wisdom from on high – to face scrutiny.”
Why do people troll?
There is no set reason people like to troll on the internet. They could be depressed or they just feel the need for attention or they could be suffering from mental illness. Some critics are of the school that trolls have nothing better to do. Social networks have made it easier for people to troll safely behind their screens. They can do whatever they like without having to be worried. It is more of an act by a cowardly person so that they can feel stronger.
Twitter is one of the most important tools of our generation. It began in 2007 when it was used as a replacement for SMS. The creators regarded it as microblogging so that people would be able to express their views in less than 140 characters. Twitter came out to be a revolution of sorts when it stole the show at SSX 2007. It was known to be the best new thing since sliced bread. For the first time ever, there was a social network where people could express their views about everything. It was where they could talk about politics, themselves, general life, and everything else. At that time Twitter did not realize that they had a goldmine on their hands as data was the most important commodity out there. It did not stop researchers from mining all that data. When Twitter eventually found out about it, they could restrict them from using that data or from reselling it. Not only that but they imposed API limits so it would not be so easy.
What is Social Media?
The information age has had many good and troubled times. It has evolved and will continue evolving. It is also signified by what happened through the information age. The first one was where the internet revolution took place. That was when the internet started shaping up and was when Microsoft famously missed the boat by not jumping on it fast enough. The following one was the social media revolution which brought about social media networks such as Myspace (now defunct), Facebook, Twitter. It also led to other social sharing networks such as Instagram and Snapchat. However, it was the advent of the social networks that really started the real revolution. It led to what I believed is known as the data revolution where data was treated as a commodity. We have had social networks which failed and some which started off as social networks and ended up evolving to such a point that they can up with spinoffs of their own. There are several social media networks which Millennials do not remember as they are no longer functional. Orkut and Myspace are remnants of that past along with Flickr which still exists but has lost its importance. Facebook arrived in the social media sphere and made a substantial change when it arrived. Initially, it was a network which was reminiscent of Orkut where people could share posts on each other’s walls and stay in touch with each other. Now it is more like a person’s feed where one shares news, pictures, and personal information about him/herself.
Facebook was the first major milestone and inspired many others. Twitter was the second biggest change in the social network market. Evan Williams, Biz Stone, Jack Dorsey and Noah Glass created it. Even though twitter does not have as many active followers, it is still one of the most important networks of our time. It has played a huge role in revolutions, giving a platform to Donald Trump and has faced its own problems such as trolls etc.
Data is the biggest asset owned by twitter now. They have a lot of active users who generate a lot of data per minute. Twitter restricts access to this data and sells it to others through third parties. This data is quite expensive and is only accessible through a “Firehose” for which there are no limits.
Twitter was ripe for a takeover by the trolls and no one was ready. The troll issue is one reason no one was willing to acquire them. Of course, there were secondary reasons such as the price and the fact that users are a fraction of Facebook’s total. Trolls have made it a horrible place to be on. Twitter has a lot of issues with abuse and it has not been able to do much about it even though they introduced several changes to make it more suitable for everyone. It banned Milo Yiannopoulos from Twitter after he got his followers to harass SNL star, Leslie Jones. Unfortunately, this action was taken after a lot of public pressure. The fact that they claim they are an open network for all to promote free speech makes it worse. We can look at the example of Donald Trump whose tweets helped him foster hatred and win the elections. His “tweet storms” (give a link to tweet storm here), are what we have gotten used to as they happen quite often. Twitter has not once suspended or banned Trump for his tweets even though he has violated the Terms and Agreement, while lesser individuals have had to face more.
Twitter is a godsend as it has allowed people to express their views freely. However, this has proven to be a bit of an issue as some have taken that freedom too lightly.
It has always been a fascination of mine to understand how Cyber Troops function. So far no one has asked for credit for coining the term, but it shall be an important part of all our conversations.
Briefly, a Cyber Troop is one which can have many purposes. They are used by the government, military, or any political party to control social media with their own narratives. My fascination with Cyber Troops came about when I noticed a political party using Twitter to control conversations and trend hashtags on assorted topics. The modus operandi is such that they use influencers to spread the message which is amplified. Some of these users also have multiple ids which they use to amplify the message. Initially, I wanted to do research on them but there was not enough information. Therefore, I focused on India where the BJP is the ruling party and their cyber troop has been trolling their opponents and using it to keep their party in the limelight.
There was a time on twitter when people did not troll and conversations were fun. Parties and regimes now know that it is productive to keep a presence on twitter by putting people on a payroll. Some of them are volunteer based while others employ a lot of people. The end goal is to manipulate public opinion as well as targeting foreigners. Since there is a limited amount of time, I focus only on India and to see patterns in their harassment campaigns.
The BJP is known as the most right-wing party in India. Their leader is known as Narendra Modi who has faced accusations of ignoring riots in his province which took the lives of many Muslims. Under his government, the number of Muslim deaths has gone up. His support was bolstered by his cyber army which encourages all acts committed under the government. Not only were they effective in shaping policy, but they also encouraged acts of terrorism. They have since manipulated public opinion on social media and have been able to shake things up.
Trolls in India have used it to whip up a frenzy and to attack their opponents whenever they feel like it. This has led to the situation in India getting very dangerous. Social media is awash “with right-wing trolls who incite online communal tension and abuse and sexually harass journalists, opposition politicians and anyone who questions them. “ (Mashable link here)
This was proven in a book by Journalist-author Swati Chaturvedi, who did a massive investigation to find out how political trolls run. She mentioned her finding in her book where she makes the claim that the 2014 campaign for Narendra Modi used social media volunteers to harass figures who opposed the BJP.
It has gotten to a point where one cannot tweet anything negative about the BJP or the PM without being labelled a traitor or receiving a death threat. The scary part is that the PM follows some of these trolls. A number of trolls are well known in the Twittersphere, while some are anonymous but will do as they are told.
One of the biggest casualties of the cyber troops In India was Amir Khan who is one of the biggest superstars in India. He made comments about intolerance at a ceremony in India in November 2015. He stated that “Kiran (his wife Kiran Rao) and I have lived all our lives in India. For the first time, she said, should we move out of India? That’s a disastrous and big statement for Kiran to make to me. She fears for her child. She fears about what the atmosphere around us will be. She feels scared to open the newspapers everyday. That does indicate that there is a sense of growing disquiet.”
These comments created a storm on social media and a campaign was launched against him. They targeted Snapdeal and forced it to drop him as their Brand Ambassador. The BJP IT cell orchestrated the campaign.
There is rarely a trend which is not backed by the government or the BJP’s IT Cell. The aim of this paper is to observe and see who the main players are and to see if there are any common figures among randomly chosen trends.
Fader and Winer have said that “Text analysis of user-generated content (UGC) is a often used approach in the marketing literature for mining consumer perceptions from social media data” (Fader and Winer 2012).
Twitter data has become quite valuable and could replace oil as an important commodity. It is one of the most popular information sources for practical applications and academic research. Twitter data is used for stock forecasting (Arias et al.,2014; Feldman,2013), through real-time event and trend analysis using Machine learning algorithms(Dickey,2014), brand management (Malhotra et al.,2012), to crisis management(Wyatt,2013).
Chau and Xu are of the view that Social media data is primary text and should be considered unstructured. Content Analytics tends to help one analyze it better. CA such as NLP (Natural Language Processing) and text mining methods help to extract the information needed.
(Chau and Xu,2012).
Nicols et al have conducted their own study on Twitter to understand how Sporting events work. They have created datasets for each event from the Live Twitter Stream by using keywords or Hashtags. They did this by using keywords such as “World Cup” and “wc2010”.
Nichols J, Mahmud J, Drews C (2012) Summarizing sporting events using Twitter. In: Proceedings of the 2012 ACM international conference on intelligent user interfaces (IUI’12), pp 189–198
In contrast, Lanagan et al. have stated that there can be scenarios where incomplete tweet datasets can have a negative effect on the performance of their Search API as it only returns tweets within 7 days.
Wang et. Al have observed that the streaming API provides real-time services but only returns 1% of a total number of tweets. They also state that the full Firehose stream of tweets is available for a huge amount of money, e.g. PowerTrack costs $2,000 per month plus $0.10 per 1,000 tweets delivered.
Wang X, Tokarchuk L, Cuadrado F, Poslad S (2013) Exploiting hashtags for adaptive microblog crawling. In: IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM 2013)
Body and Ellison feel that Twitter is unique as it has a lot of the characteristics of a blog as well as elements of SNS (Boyd and Ellison, 2007),while Hughes et al add that it has the ability to create profiles (Hughes et al., 2011) and finally Goncalves is of the view that it also allows you to create a new relationship with other users (Gonçalves et al., 2011).
Weller states that posts are limited to 140 characters because of the original limits on short messages on mobile phones (Weller, 2011).
It is not difficult to crawl for an event online as it merely needs the input of keywords. This was the modus operandi in the early days of Twitter for tweet collection and analysis. It did not give any good results on a set of pre-defined keywords.
Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. “Earthquake shakes Twitter users: realtime event detection by social sensors”. In Proceedings of the 19th international conference on World Wide Web (WWW ’10). pp. 851-860. 3. Kate Starbird and LeysiaPalen. “(How) will the revolution be retweeted?: information diffusion and the 2011 Egyptian uprising”. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW ’12). pp. 7-16.
Twitter is like the Short Messaging Service of the Internet. This is because it is signified by the well-known @ mention, RT retweet and # hashtag. Tsur and Rappoport have observed that the hashtag annotation is quite important as users will understand what the tweet is about.
Oren Tsur and Ari Rappoport. “What’s in a hashtag?: content based prediction of the spread of ideas in microblogging communities”. In Proceedings of the Fifth ACM international conference on Web search and data mining (WSDM ’12). pp. 643-652.
Twitter Crawling Model
Compared with the simple (baseline) Twitter crawling model, the adaptive Twitter crawling model enables the adaptive crawling algorithm to leverage keyword adaptation in real-time.
(Xinyue Wang, Laurissa Tokarchuk, Felix Cuadrado, and Stefan Poslad)
Hyde writes that, “Online trolling is the practice of behaving in a deceptive, destructive, or disruptive manner in a social setting on the Internet with no apparent instrumental purpose. From a lay-perspective, Internet trolls share many characteristics of the classic Joker villain: a modern variant of the Trickster archetype from ancient folklore”
Hardaker states that “ Much like the Joker, trolls operate as agents of chaos on the Internet, exploiting ‘‘hot-button issues” to make users appear overly emotional or foolish in some manner. If an unfortunate person falls into their trap, trolling intensifies for further, merciless amusement. Therefore, novice Internet users are routinely admonished, ‘‘Do not feed the trolls!”.
Despite public awareness of the phenomenon, there is little empirical research on trolling. Existing literature are scattered and multidisciplinary in nature (Hardaker, 2010; Herring, Job-Sluder, & Scheckler, 2002; McCusker, in press; Shachaf & Hara, 2010).
Frequency of activity is an important correlate of antisocial uses of technology. For instance, cyberbullying is often perpetrated by heavy Internet users (Juvonen & Gross, 2008), and disagreeable persons use mobile technologies more than others – not for socializing, but for personal entertainment (Phillips & Butt, 2006).
There have been many well documented cases when it comes to Cyber troll armies. Solon has seen that the British Army set its 77th Brigade to “focus on non‐lethal psychological operations using social networks like Facebook and Twitter to fight enemies by gaining control of the narrative in the information age” (Solon, 2015).
Solon also goes on to state that government resources can be used to buoy pro-government supporters. An example is given of the Ecuadorian government which has launched a website named Somos+ to investigate (read: threaten) social media critics of the government. Whenever a critic says something, the pro-government supporters get an alert to target political dissidents.
Gorwa adds that Cyber Troops have this Individual targeting strategy where they select an individual or group to influence on social media. An example is given of Poland, where opinion leaders, including prominent bloggers, journalists and activists, are carefully selected and targeted with messages in order to convince them that their followers hold certain beliefs and values (Gorwa 2017).
Howard has conducted research where it was found that cyber troops can come up with fake accounts to hide their identities. This is known as “astroturfing” where everything appears as grassroot activism. (Howard, 2003).
Rueda and Morla state that Argentina and Ecuador have cyber troops who work from directly under the office of the President.
(Rueda, 2012; Morla, 2015a, 2015b).
Indian Political Trolling
“So, what is political trolling? According to a popular online dictionary, it is the act of using emotions, lies, false accusations and broken logic to undermine your opponent and win an argument in a political arena. This often involves twisting various sources, such as religion, to look like they say that your views are right. Motives include, but are not limited to: For money, power, fun and for the lulz.”
Swati Chaturvedi wrote a book on BJP Trolls. She is quoted to have said “During the investigation, I was shaken when I discovered the secret ‘IT shakhas’ and the depth of the rigorous planning of social media operations… I was also taken aback by the sheer ordinariness of the trolls. Here were people abusing, slandering and indulging in communal incitement, making rape and death threats, yet treating it like a 9-to-5 call center job.”
Her quote is given ample support by Shashi Tharoor who has also alleged that the BJP is the ringleader behind figures on Twitter who control the trends and narrative.
He said “Their organised band of tweeters, retweeters and trolls have dominated social media and outstrip other parties in sheer numbers and persistence. They mount campaigns against those they disagree with and do so quite effectively.”
It seems that misogyny is also part of their playbook. Chakraborty writes “They send us (Nirbhaya-like) rape threats, mutilation threats, abuses and profanities and watch us shriek and recoil in horror. They hurt our “refined sensibilities” with their poor English, and we never miss an opportunity to point that out in our pristine, Microsoft Word-aided Queen’s language.”
In an earlier post, she writes “As an astute commentator has observed, it represents a “democratisation of opinion-making”, even though, more often than not, trolling is associated with the worst forms of misogyny, communally-sensitive distortions against religious and caste minorities, and entrenched prejudice against difference – sexual, linguistic, regional, gastronomic, among others.”
“But 2016 may be best captured by its formatting: the first ALL-CAPS campaign, when American Internet trolls announced, anonymously, that they are no longer content to cower in the comments section.”
Background to Research
The project was done over a two month period where everything was done by trial and error. The first phase was to test the servers for durability and to test other trends before getting it done. Initially I tried mining the local trends in Pakistan but there was not much data to give me a better picture. So, I tried the US but there wasn’t much activity on the troll front there. Then after approval from my supervisor, I switched to India where I could get all the data I needed. The datasets from Pakistan were sufficient for me to work around so I could get everything right for India.
My focus was to make the task simple so I could simply mine the right tweets and get it all working smoothly.
This is the workflow of how I went by
Setting up Servers: I set up many Linux servers on the cloud to gather the data as power outages back home could affect how I gather tweets. Created five servers so that I could mine for other tweets in parallel and save time.
Collection of Tweets: This was a challenge as I had to register other twitter accounts with a phone number for each account so I could create a dataset.
It was not an issue building up the dataset because there was no API limit. Traditionally the API limit only allows 15 requests during a 15 minutes window. This means that you can only get about 60 requests in an hour. This was why I was able to gather a lot of tweets in a month’s time for the various tweets.
Parsing the data: This was the most difficult part as it was collected in a JSON file but there were some issues with the line break.
Processing the data: The data was processed with the help of a Python script which helped me get all the numbers I needed. Using a stop-words dictionary to remove familiar words. Using a script to find out common tweeters in the list, to check for the most RTed ones in the dataset, to extrapolate the tweets in a list, create a list of the IDs as well make a list of the most prolific ones.
Tokenize tweets: To use the NLTK library to tokenize the tweets so that certain words are recognized instead of being ignored.
Last step was to check and see if there were some common entities among the five trends and there were.
Twitter knows the power of its data and that is why they have restricted access to most who want it. Those who want it will pay a heavy price as they use it for marketing or to understand how some people will want to vote. The data can be used for absolutely anything but it is quite difficult to analyze that data manually. Therefore, I was not able to get as much as I needed.
I collected data by crawling for trends in the Indian Twittersphere. It took me a month to gather all the various datasets in total while the preparatory research took around a month as well. The data was collected by me seeing the trends on twitter as well as various trend websites.
I also had to do a lot of reading so I could understand my subject and understand what steps I needed to take. This was quite useful and helped me obtain my objectives.
I used a script to gather the tweets in real time. There were some issues where the script would not run but after a few tweaks I was able to make the changes I needed. It was a powerful script as it was able to utilize the Streaming API and save the tweets directly to the PostGreSQL dabatase. The format it chose was JSON.
The trends which I looked at were spread across many days. There was also some downtime as not much was happening. It needed me to be vigilant and to check the trends from time to time. They had to do with current affairs or any trends which were geared towards politics or any event occurring in India. The datasets were of diverse sizes and I gathered close to around 900k tweets. The method was that I chose to analyze the top five trends and see what they were all about.
I set up a crawler on a Linux server using Google Cloud. It used a PostgreSQL server and a python script to mine tweets in real time. The good thing about live tweets is that there was no API limit. I set up four more parallel servers which could crawl for the relevant hashtags I chose.
To analyze the tweets, I had to shortlist hashtags. The hashtags were chosen based on popularity and the audience it catered to. This had to be done manually and I had to be quick as I was not sure of which one to go for. Once I found which ones I needed, I entered the keyword on the crawler script and it started gathering the tweets.
This was done through the Twitter API and not through the search option. The difference between the two was that the former would give me the metadata while the latter would pick up old tweets but without the data I needed. The API let me scrape the tweets in real time without any limits. That was the best solution and came in handy for me.
They were mined till there was no more activity. On average, they were active for 4-5 days till they died down. There were a couple of times when the crawler shut down because the server could not handle the request so there was a bit of a loss of data. I could collect around roughly 900,000 tweets during the collection period.
My hypothesis is that a number of these users are trolls who make their services to make the trends go viral. There was no removal of users as I wanted to get an idea of who was tweeting. The other issue I faced was that there were many tweets which were in Hindi so it was an issue.
Twitter messages are very short and they tend to be easier to mine.
Easy to get millions of tweets for live events if you configure the crawler to mine tweets in real time. This is far easier than Facebook.
There is a lot of metadata for a twitter account. You can check to see the timestamp, user id, response to, geotag (if available), time zone, device used, language, date, etc.
A majority of twitter data is openly available so it’s easy to mine. Historical data is another story. The only exception is the minority which protects their tweets.
You can access Twitter data easily via their API. The caveat is that you must be using a registered account to obtain that data.
Anatomy of Twitter
Twitter is one of the most popular social networks out there, though it is dwarfed by Facebook. It hasn’t shown much growth since 2015 where they had 302 million active users. The current numbers for 2017 are estimated to be around 328 million users.
Twitter users send around 500 million tweets a day and the service can handle 18 quintillion user accounts and their usage. 80 percent of Twitter users access the service on their mobiles while 79 percent of its audience is outside the US.
A Twitter crawler or scraper is a program which collects tweets or users’ information through
The Twitter API where you put in your requirements.
I utilized the baseline crawling model as it uses a constant keywords set. They were already predefined and I chose the ones which I felt would add more to the study. They were chosen as they were and the stream returned the tweets to the dataset. This was what I used for my research.
I also used the Streaming API which gave me real time access to the tweets I required. The alternatives were the Search API and REST (Representational State Transfer) which would not have suited my purpose.
There is many APIs on the Twitter official website. As I am more at ease with Python, I used the API wrapper “Tweepy”. It was easy to use and did not need much of an effort. I had heard a lot of rave reviews about it. The ease of use was good but the problem was when I had to install it as there were some conflicts with the version of Python. After a bit of tinkering, I could get it to work properly. It is a very popular wrapper and is now in version 3.5.0.
Cleaning the Dataset
The anatomy of a tweet is such that messages are known as tweets. They are not more than 140 characters. Originally, the limit used to include the user handle and any web link. Now they are not counted within the limit. Twitter messages can be part of a thread or stand-alone messages or replies to people. They can also be quoted tweets which cite other tweets and add their own tweets. The handle of a person is the identity of a person and is found by an @ sign. An example of a twitter user is someone like @Jack, who is the CEO at Twitter. Twitter users choose a handle when they sign up for twitter and that is what sets them apart from others. One can change a twitter handle later, but if you are a verified user that would mean losing your verified status.
There can be a good number of people within a conversation. With the latest update, there is no limit to the number of handles. Users can also share links to other websites or include pictures, though there is a limit.
Tweets were signified by RT which showed that it was a Retweet. Now that choice has been taken away though you can still manually RT if you are using an app such as Tweetbot. Tweets can be taken as an endorsement of someone though some users claim that they are not. In this scenario, it does not matter how many RTs there are as we need to understand who the most prolific users are. Most of the noise which will be removed from this is the rest of the metadata. The only data which stays is the user id, the text within the tweet. The key is to see who the most prolific ones are and which ones get the most RTed.
Normally I would have worked to get rid of the bots within the tweets. In this scenario, they are important for the study. I would need to be able to see if they are playing a role in helping trends go the extra mile. They are used for the singular purpose to help propagate and amplify tweets. Twitter bots are created through scripting programs. They usually tend to tweet a number of times till Twitter finds out and suspends them. In some cases, that does not happen and they are free to do what they like. So it doesn’t matter if there are many duplicate tweets as we just need to how influential they are.
The cleaning process was initiated once all the tweets were retrieved and the analysis process started. The cleaning process was done with a number of scripts on Python. The final process was to merge the screen names for the five trends together and see who the common ones are.
Post the Parsing Process
The last step of the project is to run the data through Python to parse and get the results. We already have the datasets ready and they need to be cleaned so that we can get the desired results. The alternative was to stick with R but that was not giving me the right results. I worked on tweaking some scripts to extrapolate the values needed to get the results.
There were just two choices for me here: Python or R. Python is one of the most effective programming languages which works well for this purpose. I briefly considered using node.js and express.js to run the script for the crawler but decided against it. I did not have to think much and stuck with Python and Tweepy for the initial phase.
The second phase was to use Git Bash to run the commands on my Windows Computer as it was relatively easy. I was given the suggestion of using R to clean the data but it was quite a task. Python is one of the best languages to use to clean the data.
There is also a lot of documentation available and there are also a lot of tutorials available online. I am a novice when it comes to Python but was able to learn while studying.
I crawled the following trends on Twitter: BengalDebate, BengalExpose, BengalInvestigation, BengalTension, FakeIntolerance, HindusDontCount, HindusOnHateList, IndiaIsraelFriendship, MamataBlameGame, ModiInIsrael,MidnightGSTLaunch, NoMamataForHindus and SaveBengal.
These datasets varied as some were small while the rest were huge. Since I was advised to investigate larger datasets, I decided to go for GSTForNewIndia, ModiInIsrael, SaveBengal, IndiaIsraelFriendship, and NoMamataForHindus.
The results I found were quite interesting.
The prolific tweeters were generated by running a script to see how many tweets were sent within the trend. The python script was told to ignore the RTs so that only the ones who were the most prolific were listed. It brought down a list of names which were the most active ones within the trends. A quick look at some of these handles showed that a number of them had shown their political affiliation within their handles. There were some which looked like as if a bot had generated the handles.
Most Retweeted Handles
This was also quite interesting as it gave a list of the ones who were the most Retweeted ones. The numbers for the dataset of the highest volume were in the six figures. Upon manually looking up some of the twitter handles, I could see that they are quite active in the Twittersphere. They were a number of politically active people as well as some figures who were followed by the PM or high ranking members of his party.
The word cloud was to see what the main topic of contention was in most of those trends. This was set to only include words in English characters. They were all politically targeted.
Figure 1: #ModiInIsrael
#Figure 2: GSTForNewIndia
Figure 3: #SaveBengal
Figure 4: #IndiaIsraelFriendship
Figure 5: #NoMamataForHindus
The conclusive results were when I combined the data from the five datasets to find a list of common figures who were playing a role in all the trends I had randomly chosen. It brought up a list of three to four thousand figures who were active throughout all the trends. This does prove that Cyber troops have played a role in controlling the narrative and gaming trends in India.
Conclusions and Future Work
Twitter data mining/Text Classification is quite important. It is one of the biggest parts of the social media revolution. Experts believe that data will have a lot of value soon and that it can expect to fetch a premium. This research was quite useful as it allowed me to understand how trends work. It was not an easy task to mine and create the datasets. To start with gathering the data, parsing it, extrapolating the tweets, to isolate the ids, screennames, RTs, find the most prolific ones, creating a word cloud out of each and to sort out which ones are the common ones among the trends.
I have used a number of scripts and gone through a process to get all the information I needed. They have given me the results which I needed. Python was the simplest language I could use as it gave the results without me having to sit through the process of sifting everything.
There were some problems when I tried sorting it out as it would include data I did not need. After reconfiguring the parameters, I was able to get what I needed. The time limitations and the fact that I was only able to get one percent of the stream for each trend meant that it can’t be taken as an accurate interpretation of what the actual sentiment is. However, given the sample obtained and the data derived from it, I was able to conclude that my hypothesis was correct. The hypothesis was that people can be hired for a purpose or there is a group of volunteers who control the narrative on twitter.
I also noticed that a number of Twitter screennames were in a sequence and that the other visible pattern was that people RTed the most popular ones to amplify the message.
The accuracy is something which is pretty good based on what I derived. I reconfigured it and retested it a few times to make sure that there were no errors. The results were practically the same with some minor deviations which would not do much to change the result.
In future releases, I would like to create a system where I could create a front end where I can enter the values I require. It would have an easy to use UX where the user can input the search required. The backend would be the server which would do all the heavy lifting and save the stream to the database. Perhaps I would build a container and allow for multiple streams to be mined and accessed. A future iteration would also include means of generating the results easily and creating an output file without me having to do it all manually. Another addition would be to add a list of foreign words for the stop-words dictionary so that there will be no room for error. The current lists can be improved to get a better idea. One thing I regret was not being able to do a sentiment analysis of the tweets (Even though that was not the scope of the project). However, based on the output of the word cloud, it can be seen that the ones targeting the PM were positive while the ones against his rivals were full of hatred.