This website (Footnote 2) was used as a means to obtain tweet IDs (Footnote 3); it provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). … (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
Data cleaning and preprocessing
The JSON (Footnote 4) files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity because we considered them non-representative of the normal user population, such as users who created more than 2000 tweets within four weeks. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
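The user-related criteria (1) and (3) can be illustrated with a minimal Python sketch. It is not the R code used in the study: the function name `filter_users`, the representation of a tweet as a `(user_id, text)` pair, and the order in which the two criteria are applied are all assumptions made for illustration.

```python
from collections import Counter


def filter_users(pre_tweets, post_tweets, top_fraction=0.005):
    """Apply two user-based exclusions to lists of (user_id, text) pairs.

    Hypothetical helper: the data layout and the default top_fraction
    (0.5% of users) are assumptions, not taken from the study's code.
    """
    # Count tweets per user over both periods combined.
    counts = Counter(user for user, _ in pre_tweets + post_tweets)

    # Rank users from most to least active and mark the top fraction.
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    too_active = set(ranked[:n_drop])

    # Keep only users who appear in both periods (within-group design).
    in_both = {u for u, _ in pre_tweets} & {u for u, _ in post_tweets}
    keep = in_both - too_active

    pre_kept = [t for t in pre_tweets if t[0] in keep]
    post_kept = [t for t in post_tweets if t[0] in keep]
    return pre_kept, post_kept
```

Criterion (2), early access to the 280-character limit, is omitted here because it depends on account information that this sketch does not model.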
The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when they are located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
Token and bigram analysis
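The text-cleaning steps described above can be sketched as follows. This is an illustrative Python approximation under stated assumptions, not the study's R code: the regular expressions, the helper names `has_url` and `clean_tweet`, and the choice to transliterate accented characters before dropping non-ASCII ones are all hypothetical.

```python
import re
import unicodedata

# Simplified patterns (assumptions): any http(s) link, any @screen_name.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def has_url(text):
    """Return True if the tweet text contains a link."""
    return URL_RE.search(text) is not None


def clean_tweet(text):
    """Strip screen names and line breaks, then reduce the text to ASCII."""
    text = MENTION_RE.sub(" ", text)
    # Decompose accented characters so the base letter survives the
    # ASCII conversion (e.g. "é" becomes "e"); other non-ASCII is dropped.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(text.split())
```

Note that `has_url` flags every link. Telling media URLs apart from other URLs, as the exclusion rule requires, would need tweet metadata that this sketch does not model.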
The R package (Footnote 5) ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation …). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC, and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre-CLC and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
Part-of-speech (POS) analysis
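As an illustration of the sign convention, the following Python sketch computes a T-score per token. Because Eq. (1) and (2) are not reproduced in this section, the formula is an assumption in the style of Church et al. (1991): T = (P_post − P_pre) / sqrt(P_pre / N_pre + P_post / N_post), with P = f / N. The function name `t_scores` and the use of plain token lists are likewise hypothetical.

```python
import math
from collections import Counter


def t_scores(pre_tokens, post_tokens):
    """Return {token: T-score} for two non-empty lists of tokens.

    Assumed formula (not copied from the paper's Eq. 1 and 2):
    T = (P_post - P_pre) / sqrt(P_pre / N_pre + P_post / N_post)
    """
    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)

    scores = {}
    for token in set(f_pre) | set(f_post):
        # Relative frequencies; a Counter returns 0 for unseen tokens.
        p_pre = f_pre[token] / n_pre
        p_post = f_post[token] / n_post
        # Approximate variance of the difference between the two estimates.
        variance = p_pre / n_pre + p_post / n_post
        scores[token] = (p_post - p_pre) / math.sqrt(variance)
    return scores
```

Under this assumed formula, a token that is relatively more frequent post-CLC receives a positive score and one that is relatively more frequent pre-CLC receives a negative score, matching the interpretation given above.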
The R package (Footnote 6) ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger uses a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and …). The openNLP POS model has been reported to reach an accuracy of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, the same analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent across both datasets. Therefore, we assume there are no systematic confounds.