This website (Footnote 2) was used to retrieve tweet IDs (Footnote 3); it provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). [...] (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
Data cleaning and preprocessing
The JSON files (Footnote 4) were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity (e.g., users who created more than 2000 tweets within four weeks) because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed. This procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
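The three user-related criteria can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name `filter_tweets`, the `(user_id, text)` tuple layout, and the choice to rank activity over both periods combined are assumptions made for illustration, and ties at the cutoff are broken arbitrarily by user id.

```python
from collections import Counter

def filter_tweets(pre, post, top_fraction=0.005, early_access=frozenset()):
    """Apply three user-level exclusions to lists of (user_id, text) tuples.

    pre, post     -- tweets collected before and after the CLC (assumed layout)
    top_fraction  -- share of most active users to drop (0.005 = top 0.5%)
    early_access  -- user ids known to have had early 280-character access
    """
    # (1) Rank users by tweet count; ranking over both periods is an assumption.
    counts = Counter(user for user, _ in pre + post)
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    hyperactive = set(ranked[:int(len(ranked) * top_fraction)])

    # (3) Keep only users who appear in both periods (within-group design).
    in_both = {u for u, _ in pre} & {u for u, _ in post}

    # (2) Drop early-access users, then combine all three criteria.
    keep = in_both - hyperactive - set(early_access)
    return ([t for t in pre if t[0] in keep],
            [t for t in post if t[0] in keep])
```

Applying the intersection in step (3) last means a user removed by any criterion disappears from both periods, which is what keeps the sample identical over time.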
The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
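A rough Python sketch of this kind of text cleaning is shown below. It is an illustration under stated assumptions rather than the procedure actually used: the regular expressions and the names `has_url` and `clean_text` are hypothetical, non-ASCII characters are simply dropped (transliteration would be an alternative), and neither tweet headers nor the distinction between media and non-media URLs is handled, since the latter requires tweet metadata.

```python
import re

# Simplified patterns; real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def has_url(text):
    """Return True if the text contains something that looks like a URL."""
    return URL_RE.search(text) is not None

def clean_text(text):
    """Strip URLs, screen-name references, line breaks, and non-ASCII characters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("\r", " ").replace("\n", " ")
    # Assumption: characters outside ASCII are dropped, not transliterated.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

In a pipeline like the one described, `has_url` would be used to exclude tweets and `clean_text` would be applied to the tweets that remain.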
Token and bigram analysis
The R package ‘quanteda’ (Footnote 5) was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation [...]. In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre-CLC and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
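To make the sign convention concrete, the following Python sketch computes a T-score per token. It assumes one common form of the Church et al. style statistic, in which the relative frequency is P(token) = f(token) / N and the variance of each relative frequency is approximated by f(token) / N²; the exact form of Eq. (1) and (2) may differ. The function name `t_scores` and the flat token-list input are illustrative choices.

```python
import math
from collections import Counter

def t_scores(tokens_pre, tokens_post):
    """Return {token: T} where T > 0 means relatively more frequent post-CLC.

    Assumed form: T = (P_post - P_pre) / sqrt(f_post / N_post**2 + f_pre / N_pre**2)
    Both token lists must be non-empty.
    """
    f_pre, f_post = Counter(tokens_pre), Counter(tokens_post)
    n_pre, n_post = len(tokens_pre), len(tokens_post)
    scores = {}
    for tok in set(f_pre) | set(f_post):
        p_pre = f_pre[tok] / n_pre      # relative frequency pre-CLC
        p_post = f_post[tok] / n_post   # relative frequency post-CLC
        # Variance approximation; positive because the token occurs at least once.
        var = f_pre[tok] / n_pre**2 + f_post[tok] / n_post**2
        scores[tok] = (p_post - p_pre) / math.sqrt(var)
    return scores
```

Because the numerator is the post-CLC relative frequency minus the pre-CLC one, a token that becomes relatively rarer after the change receives a negative score, matching the interpretation given above.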
Part-of-speech (POS) analysis
The R package ‘openNLP’ (Footnote 6) was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and [...]). The openNLP POS model has been reported with an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, similar analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent over both datasets. Therefore, we assume there are no systematic confounds.