#! /usr/bin/env Rscript

# Usage -------------------------------------------------------------------
'Gather tweets from the command line

Usage:
  gathertweet search [--file=<file>] [options] [--] <terms>...
  gathertweet update [--file=<file> --token=<token> --backup --backup-dir=<dir> --polite --debug-args]
  gathertweet simplify [--file=<file> --output=<output> --debug-args --polite] [<fields>...]

Arguments
  <terms>   Search terms. Individual search terms are queried separately,
            but duplicated tweets are removed from the stored results.
            Each search term counts against the 15 minute rate limit of 180
            searches, which can be avoided by manually joining search terms
            into a single query. WARNING: Wrap queries with spaces in
            \'single quotes\': double quotes are allowed inside single quotes only.
  <fields>  Tweet fields that should be included. Default value will include
            `status_id`, `created_at`, `user_id`, `screen_name`, `text`,
            `favorite_count`, `retweet_count`, `is_quote`, `hashtags`,
            `mentions_screen_name`, `profile_url`, `profile_image_url`,
            `media_url`, `urls_url`, `urls_expanded_url`.

Options:
  -h --help             Show this screen.
  --file <file>         Name of RDS file where tweets are stored [default: tweets.rds]
  --no-parse            Disable parsing of the results
  --token <token>       See {rtweet} for more information
  --retryonratelimit    Wait and retry when rate limited (only relevant when n exceeds 18000 tweets)
  --quiet               Disable printing of {rtweet} processing/retrieval messages
  --polite              Only allow one process (search|update) to run at a time
  --backup              Create a backup of existing tweet file before writing any new files
  --backup-dir <dir>    Location for backups, use "" for current directory. [default: backups]
  --debug-args          Print values of the arguments only
  search:
    -n, --n <n>            Number of tweets to return [default: 18000]
    --type <type>          Type of search results: "recent", "mixed", or "popular". [default: recent]
    --include_rts          Logical indicating whether retweets should be included
    --geocode <geocode>    Geographical limiter of the template "latitude,longitude,radius"
    --max_id <max_id>      Return results with an ID less than (older than) or equal to max_id
    --since_id <since_id>  Return results with an ID greater than (newer than) or equal to since_id,
                           automatically extracted from the existing tweets <file>, if it exists, and
                           ignored when <max_id> is set. "none" for all available tweets. [default: last]
    --and-simplify         Create additional simplified tweet set with default values.
                           Run `gathertweet simplify` manually for more control.
  simplify:
    --output <output>      Output file, default is input file with `_simplified` appended to name.
' -> doc

library(docopt)
args <- docopt(doc, version = paste('gathertweet version', packageVersion("gathertweet")))

# Quit without saving the workspace; `value` becomes the process exit status
exit <- function(value = 0) q(save = "no", status = value)

if (args$`--debug-args`) {
  # Print the parsed arguments and store them for inspection, then stop
  str(args)
  saveRDS(args, "args.rds")
  exit()
}

library(gathertweet)

action <- names(Filter(isTRUE, args[c("search", "update", "simplify")]))

if (args$polite) {
  lockfile <- paste0(".gathertweet_",
                     digest::digest(args[c("file", "search", "update", "simplify")]),
                     ".lock")
  lck <- filelock::lock(lockfile, exclusive = TRUE, timeout = 0)
  gathertweet:::stopifnot_locked(lck, "Another gathertweet {action} process is currently running for {args$file}")
}

log_info("---- gathertweet {action} start ----")

# Search ------------------------------------------------------------------
if (isTRUE(args$search)) {
  if (args[["--and-simplify"]]) args$simplify <- TRUE
  log_info("Searching for \"{paste0(args$terms, collapse = '\", \"')}\"")

  max_id <- args[["max_id"]]
  since_id <- args[["since_id"]]
  # since_id is left NULL (ignored) whenever max_id is set
  since_id <- if (is.null(max_id)) {
    if (since_id == "last") {
      last_seen_tweet(file = args$file)
    } else if (since_id == "none") {
      NULL
    } else since_id
  }
  if (!is.null(since_id)) log_info("Tweets from {since_id}")
  if (!is.null(max_id)) log_info("Tweets up to {max_id}")

  # Each search term is queried separately
  tweets <- lapply(
    args$terms,
    function(term) rtweet::search_tweets(
      q = term,
      n = as.integer(args$n),
      type = args$type,
      include_rts = args$include_rts,
      geocode = args$geocode,
      max_id = max_id,
      parse = !args[["no-parse"]],
      token = args$token,
      retryonratelimit = args$retryonratelimit,
      verbose = !args$quiet,
      since_id = since_id
    )
  )
  tweets <- dplyr::bind_rows(tweets)

  if (nrow(tweets) == 0) {
    log_info("No new tweets.")
    exit()
  }

  # Drop tweets returned by more than one search term
  tweets <- tweets[order(tweets$status_id), ]
  tweets <- tweets[!duplicated(tweets$status_id), ]
  log_info("Gathered {nrow(tweets)} tweets")

  if (args$backup) backup_tweets(args$file, backup_dir = args[["backup-dir"]])
  tweets <- save_tweets(tweets, args$file)
  log_info("Total of {nrow(tweets)} tweets in {args$file}")

# Update ------------------------------------------------------------------
} else if (isTRUE(args$update)) {
  logger("Updating tweets in {args$file}")
  tweets <- update_tweets(
    file = args$file,
    # passed to rtweet::lookup_statuses()
    parse = !args[["no-parse"]],
    token = args$token
  )
  log_debug("Status lookup returned {nrow(tweets)} tweets")

  if (args$backup) backup_tweets(args$file, backup_dir = args[["backup-dir"]])
  tweets <- save_tweets(tweets, args$file)
  log_debug("Total of {nrow(tweets)} tweets in {args$file}")
}

# Simplify ----------------------------------------------------------------
if (isTRUE(args$simplify)) {
  logger("Simplifying tweets in {args$file}")
  tweets_simplified <- simplify_tweets(
    tweets = NULL,
    file = args$file,
    .fields = args$fields
  )
  log_debug("Simplified {nrow(tweets_simplified)} tweets")

  # Default output: the input file name with `_simplified` appended
  if (is.null(args$output)) {
    args$output <- gathertweet:::path_add(args$file, append = "_simplified")
  }
  log_info("Saving simplified tweets to {args$output}")
  tweets_simplified <- save_tweets(tweets_simplified, args$output)
}

if (args$polite) {
  filelock::unlock(lck)
  unlink(lockfile)
}

log_info("---- gathertweet {action} complete ----")
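
# Illustrative sketch only (not part of the gathertweet package): the rule the
# search step applies to `--since_id`, isolated as a standalone function so it
# can be tried on its own. `resolve_since_id()` and its `last_seen` argument
# are hypothetical names introduced for this example; `last_seen` stands in
# for a call such as last_seen_tweet(file = args$file).
resolve_since_id <- function(since_id, max_id = NULL, last_seen = function() NULL) {
  # When max_id is set, since_id is ignored
  if (!is.null(max_id)) return(NULL)
  # "last": resume from the newest tweet already stored
  if (identical(since_id, "last")) return(last_seen())
  # "none": no lower bound, all available tweets
  if (identical(since_id, "none")) return(NULL)
  # Anything else is taken as an explicit status ID and passed through
  since_id
}
# Example calls:
#   resolve_since_id("last", last_seen = function() "1100")
#   resolve_since_id("1234", max_id = "2000")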