## Twitter Hashtag 94 Data

### Data Preprocessing Details

Scraper. The tweets are scraped from Twitter using the Twint software. When scraping, a specific hashtag is entered as a query to request tweets that contain that hashtag within a given time range. In this case, 94 hashtags were scraped over the time frame January 2013 to December 2020. Twint's purpose is to scrape tweets without using the Twitter API, which avoids most of the API's limits. It does this by using the search function on Twitter and scraping the search results. A diagram of the tweet text processing is shown below.
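As a rough illustration of what one such hashtag query could look like, here is a minimal sketch. It is not the script used to build this dataset. The helper names `build_search_settings` and `scrape_hashtag`, the output file naming, and the end date of 2021-01-01 (which assumes the upper bound of the search is exclusive) are all assumptions made for the example. `twint` is imported lazily so that the settings helper can be used without the package installed.

```python
from datetime import date


def build_search_settings(hashtag, since=date(2013, 1, 1), until=date(2021, 1, 1)):
    """Return the settings for one hashtag query as a plain dict.

    The keys mirror attribute names on twint.Config. The default dates are an
    assumption chosen to cover January 2013 through December 2020.
    """
    tag = hashtag if hashtag.startswith("#") else "#" + hashtag
    return {
        "Search": tag,                       # the hashtag used as the search query
        "Since": since.isoformat(),          # lower bound of the time frame
        "Until": until.isoformat(),          # upper bound (assumed exclusive)
        "Store_csv": True,                   # write results to a CSV file
        "Output": tag.lstrip("#") + ".csv",  # hypothetical output file name
    }


def scrape_hashtag(hashtag):
    """Run one Twint search for the given hashtag (needs twint and network access)."""
    import twint  # third-party package, imported lazily

    config = twint.Config()
    for name, value in build_search_settings(hashtag).items():
        setattr(config, name, value)
    twint.run.Search(config)
```

Calling `scrape_hashtag("example")` would then request public tweets containing `#example` in the configured window. Repeating the call for each of the 94 hashtags would give one output file per hashtag.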

Time Frame. The tweets were scraped in February 2021 using the time frame January 2013 to December 2020. This means that any tweet matching a specified hashtag query is scraped if it was posted within that time frame. Tweets and users deleted before February 2021, as well as private tweets, may be missing, since the scraper only collects publicly available tweets.

Tweet Text Processing. The diagram below illustrates the processing pipeline, where the green boxes indicate separate data save points. The Bag-of-Words save point is where only the processed text is saved. The original tables save point is where the text and other tweet information are saved (e.g. favorite counts, reply counts, etc.). The LIWC tables save point is where the text and LIWC metrics are saved. For all save points, the text is lowercased and hyperlinks are removed. The scraper can collect tweets in several languages, but in our case we keep only the English texts. Username tags (e.g. @username) are anonymized by replacing them with the token "[usn]", and trailing usernames are contracted into a single "[usn]" token. Hashtags (e.g. #hashtag) are not fully recognized by the LIWC vocabulary. Therefore, for the LIWC tables save point only, hashtags are replaced with the token "[htg]" and trailing hashtags are contracted in the same way as usernames. The processed text is then entered into the LIWC software, and the results are organized into a table where the rows are labeled with unique tweet ids and the columns contain the text and the LIWC metrics (see the LIWC manual for descriptions of the LIWC metrics/columns).
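The text steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the original pipeline code. The regular expressions, the function name `preprocess_tweet`, the `replace_hashtags` flag, and the reading of "contracted" as collapsing a run of consecutive tokens into one are all assumptions.

```python
import re

# Assumed patterns; the original pipeline may have used different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def _collapse_runs(text, token):
    """Collapse runs such as '[usn] [usn] [usn]' into a single '[usn]'."""
    escaped = re.escape(token)
    return re.sub(escaped + r"(?:\s+" + escaped + r")+", token, text)


def preprocess_tweet(text, replace_hashtags=False):
    """Lowercase, strip hyperlinks, and anonymize usernames.

    Set replace_hashtags=True to mimic the LIWC tables save point, where
    hashtags are also replaced by '[htg]'.
    """
    text = text.lower()
    text = URL_RE.sub("", text)            # remove hyperlinks
    text = MENTION_RE.sub("[usn]", text)   # anonymize @username tags
    text = _collapse_runs(text, "[usn]")
    if replace_hashtags:
        text = HASHTAG_RE.sub("[htg]", text)
        text = _collapse_runs(text, "[htg]")
    return " ".join(text.split())          # normalize leftover whitespace


raw = "@Alice @Bob Check THIS https://t.co/abc #Fun #Data"
print(preprocess_tweet(raw))                         # [usn] check this #fun #data
print(preprocess_tweet(raw, replace_hashtags=True))  # [usn] check this [htg]
```

Keeping the hashtag replacement behind a flag reflects the description above, where only the LIWC tables save point replaces hashtags while the other save points keep them.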

There are three data folders, one for each type of data: Bag-of-Words, Original Tables, and LIWC Tables. All relevant files can be downloaded from Google Drive. The links to the files are listed below.

1. Bag-of-Words. This link directs you to a Google Drive folder that contains three subfolders:
   a. "combined" - contains compressed text data of all the tweets, separated by subsets.
   b. "group" - contains compressed text data grouped by hashtag and separated by subsets.
   c. "temporal" - contains two subfolders:
      i. "combined" - contains compressed text data separated by month and by subsets.
      ii. "group" - contains compressed text data separated by month, by hashtag, and by subsets.

2. Original Tables. This link directs you to a Google Drive folder that contains ".pkl" files of the original tables, separated by hashtag. Each file corresponds to one hashtag and contains tweet data where the row labels are the tweet ids. A separate file containing duplicates is also in this folder.

3. LIWC Tables. This link directs you to a Google Drive folder that contains ".csv" files of the LIWC tables, separated by hashtag. Each file corresponds to one hashtag and contains tweet data and the LIWC metrics. The row labels of the tables are the tweet ids.
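
Once a file has been downloaded, a LIWC table could be read along the following lines. This is a sketch using only the standard library. The function name `load_liwc_table`, the file name in the usage note, and the id column name `"id"` are hypothetical, since the actual column headers are not listed here.

```python
import csv


def load_liwc_table(path, id_column="id"):
    """Read one LIWC ".csv" file into a dict keyed by tweet id.

    id_column is an assumed header name for the tweet id column. Ids are kept
    as strings so that long tweet ids are never altered by numeric conversion.
    """
    table = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            tweet_id = row.pop(id_column)
            table[tweet_id] = row  # remaining columns: text and LIWC metrics
    return table
```

For example, `load_liwc_table("example_hashtag.csv")` would return a mapping from each tweet id to its text and LIWC metric values. With pandas installed, `pandas.read_csv(path, index_col=0)` would be a more usual route for the ".csv" files, and `pandas.read_pickle(path)` for the ".pkl" original tables, assuming those pickles hold pandas DataFrames.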