class: center, middle, inverse, title-slide

# Using R and a Raspberry Pi to collect social media data
### Frie Preu
### RLadies Bucharest, 2020-11-24

---

# About me

- political scientist turned data scientist turned IT consultant / software developer... something else?
- useR since 2013/2015
- CorrelAid volunteer since 2015, full-time since 2020

---

# About CorrelAid

- German(/European) Data4Good network with over 1500 volunteers
- data4good projects with external partners
- education: e.g. meetups, tidytuesday, workshops, annual conference, internal projects, ...
- dialogue with society
- excellent opportunity to try out things

---

# About this project

2017: new website with 📈 ➡️ collect social media time series: facebook, twitter, mailchimp subscribers

![](index_files/figure-html/unnamed-chunk-1-1.png)

---

# Requirements for automated data collection

- 🤖 somewhere to run our code
- 🕐 automatically execute code at regular intervals
- 💾 store data for later, easy access
- 📬 notify us if something is wrong

---

# 🤖: A Raspberry Pi

.pull-left[
<a title="Gareth Halfacree from Bradford, UK / CC BY-SA (https://creativecommons.org/licenses/by-sa/2.0)" href="https://commons.wikimedia.org/wiki/File:Raspberry_Pi_3_B%2B_(39906369025).png"><img width="512" alt="Raspberry Pi 3 B+ (39906369025)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Raspberry_Pi_3_B%2B_%2839906369025%29.png/512px-Raspberry_Pi_3_B%2B_%2839906369025%29.png"></a>
]

.pull-right[
- tiny and affordable computer, originally used for teaching
- large open-source community, many different projects
- 💸: ~10-110 Euro (with accessories)
- Specs: 512MB - 8GB RAM, own OS (Raspbian)
]

---

# 🕐: Cron jobs

> Cron is one of the most useful utility that you can find in any Unix-like operating system. It is used to schedule commands at a specific time. These scheduled commands or tasks are known as "Cron Jobs".
([Source](https://ostechnix.com/a-beginners-guide-to-cron-jobs))

![https://ostechnix.com/a-beginners-guide-to-cron-jobs](https://ostechnix.com/wp-content/uploads/2018/05/cron-job-format-1.png)

--

```r
50 23 * * * /usr/lib/R/bin/Rscript '/home/frie/correlaid-utils/correlaid-analytics/run.R'
```

.footnote[Note: Slide adapted from Alex Kapps' presentation, see [here](https://docs.correlaid.org/correlcollection/open-online-data-meetup#how-to-store-thousands-of-shared-bike-locations-every-4-minutes-into-a-database). Image source: https://ostechnix.com/wp-content/uploads/2018/05/cron-job-format-1.png.]

---

# Project timeline & versions

.pull-left[
[mid 2017 - Oct. 2017](https://github.com/friep/correlaid-utils/tree/9f2506f90773e34f409be46f164bbbc16e8c7b9d)
<br>
<br>
[early 2018 - mid 2018 (?)](https://github.com/friep/correlaid-utils/tree/1ed5a5b4416beab950bcc1313ae6bc2f8fab1b22)
<br>
<br>
mid 2018 - late 2020

[late 2020](https://github.com/friep/correlaid-utils)
]

.pull-right[
Raspberry Pi + R + mlab, cf. [talk at OODM](https://youtu.be/tFRNBHqg_ZQ?t=2290)

AWS Lambda, Serverless & Python, cf. [talk at OODM](https://youtu.be/tFRNBHqg_ZQ?t=2413)

❌

Raspberry Pi + R + GitHub + GitHub Actions
]

---

class: center, middle, inverse

# R and Raspberry Pi - 2017 version

---

# 2017 version: diagram

<img src="img/r_v1.png" width="1003" />

---

# 2017 version: summary

- 🤖 Raspberry Pi
- 🕐 Cron
- 💾 mlab
- 📬 ❌

--

### Problems

- one big, messy R script
- authentication details in text files checked into (private) GitHub (⚠️)
- code quality ...

---

class: center, middle, inverse

# 2018: Python + AWS Lambda + Serverless

---

# December 2017

Frie 🧑‍💻 [https://www.codecentric.de](https://www.codecentric.de)

---

# 2018 version: diagram

<img src="img/correlaid-analytics_v2.png" width="983" />

---

# 2018 version: What is AWS Lambda?

> AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services.
> It is a computing service that runs code in response to events and automatically manages the computing resources required by that code. [...]

> The purpose of Lambda, as compared to AWS EC2, is to simplify building smaller, on-demand applications [...]

([Wikipedia](https://en.wikipedia.org/wiki/AWS_Lambda))

--

- *event-driven*: it only runs in response to an **event**
- the event can be a cronjob 🕐

--

- *serverless*: underlying servers are **automatically** started + stopped by AWS (-> RIP fripi)

--

- *smaller, on-demand applications*: those are called **functions**

--

- payment per execution -> free / very cheap!

---

# 2018: AWS Lambda + Python

```bash
correlaid-analytics
├── daily.py
├── deploy-analytics.sh
├── every_monday.py
├── package-lock.json
├── requirements.txt
├── serverless.yml
└── setup.sh
```

---

# 2018 version: serverless

The [serverless](https://serverless.com) framework lets you define Lambda *functions* in a yml file (`serverless.yml`) and makes deployment to AWS very easy.

```yml
functions:
  daily_correlaid_analytics:
    handler: daily.get_correlaid_data
    events:
      - schedule:
          rate: cron(56 22 * * ? *)
```

Deployment with:

```bash
serverless deploy -v
```

---

# 2018 summary

- 🤖 AWS Lambda (runs on AWS)
- 🕐 Cron
- 💾 hosted MySQL
- 📬 AWS Lambda alerts

---

class: center, middle, inverse

# R and Raspberry Pi - 2020 version

---

<img src="img/r_v2.png" width="1004" />

---

# 2020 version: Cron job

```r
## cronR job
## id:   daily_analytics
## tags:
## desc: Get daily CorrelAid Analytics
50 23 * * * cd '/home/frie/correlaid-utils/correlaid-analytics' && /usr/lib/R/bin/Rscript '/home/frie/correlaid-utils/correlaid-analytics/run.R' > '/home/frie/correlaid-utils/correlaid-analytics/run.log' 2>&1
```

set up with the very helpful {[cronR](https://github.com/bnosac/cronR)} 📦

--

### run.R

```r
library(here)
print("==============================")
print(Sys.time())
source(here::here("correlaid-analytics/01_get_daily_analytics.R"))
source(here::here("correlaid-analytics/02_git.R"))
```

---

# 2020 version: files

```r
correlaid-analytics/
├── 01_get_daily_analytics.R
├── 02_git.R
├── cron.R
├── data
│   └── all_daily.csv
├── run.log
└── run.R
```

[01_get_daily_analytics.R](https://github.com/friep/correlaid-utils/blob/main/correlaid-analytics/01_get_daily_analytics.R)

---

# 2020 version: smcounts 📦

```r
library(smcounts)
smcounts::collect_data
```

```
## function (slack = TRUE, facebook = TRUE, twitter = TRUE, mailchimp = TRUE)
## {
##     df <- tibble::tibble(date = c(), platform = c(), n = c())
##     if (slack) {
##         slack_df <- ca_slack()
##         df <- rbind(df, slack_df)
##     }
##     if (facebook) {
##         facebook_df <- ca_facebook()
##         df <- rbind(df, facebook_df)
##     }
##     if (twitter) {
##         twitter_df <- ca_twitter()
##         df <- rbind(df, twitter_df)
##     }
##     if (mailchimp) {
##         mailchimp_df <- ca_newsletter()
##         df <- rbind(df, mailchimp_df)
##     }
##     return(df)
## }
## <bytecode: 0x7ffd289e57e8>
## <environment: namespace:smcounts>
```

---

# smcounts 📦

- abstracts data collection functionality --> can be reused in other contexts
- define dependencies via DESCRIPTION file
- easy installation from
[GitHub](https://github.com/friep/smcounts)
- uses environment variables (standard way to store API keys etc.)

---

# 2020 version 💾: Git

### 02_git.R

```r
# gert (https://docs.ropensci.org/gert/index.html)
library(gert)
gert::git_pull()
print(gert::git_status())
gert::git_add("correlaid-analytics/data/all_daily.csv")
gert::git_commit(message = "🤖 CRON - update daily data", author = git_signature("raspi3", "raspi3@pr130.dev"))
gert::git_push()
```

```r
ca_counts <- readr::read_csv("https://raw.githubusercontent.com/friep/correlaid-utils/main/correlaid-analytics/data/all_daily.csv")
```

```
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   date = col_date(format = ""),
##   platform = col_character(),
##   n = col_double()
## )
```

---

# 2020 version: 📬 GitHub Action

- CI/CD tool (continuous integration, continuous deployment) to define *workflows* in yml files
- typical use case: run checks on R packages (e.g. [ggplot2](https://github.com/tidyverse/ggplot2/actions?query=workflow%3AR-CMD-check)), build websites
- different kinds of triggers: push, pull request, cron job (🕐)

--

### correlaid-utils workflow

- runs every morning to check whether a commit has been made to `all_daily.csv` in the last 24 hours
- if yes: ✅
- if no: ❌ -> workflow fails and GitHub sends email
- [yml file](https://github.com/friep/correlaid-utils/blob/main/.github/workflows/notify_on_failure.yml)
- [workflow runs](https://github.com/friep/correlaid-utils/actions)

---

# 2020 version: summary

- 🤖 Raspberry Pi
- 🕐 Cron
- 💾 GitHub
- 📬 GitHub Actions

--

### 2020 version vs. 2017 version

- better decoupling through smcounts package
- more stability
- git as storage option & GitHub Actions
- better error handling
- tests!!

---

class: center, inverse, middle

# Alternatives & Summary

---

# Alternatives

#### Server 🤖

- Virtual machines on AWS, Azure, Google Cloud
- specialized services from AWS, Azure, ...
- GitHub Actions or other CI/CD services (?!)

--

#### Storage 💾

- a proper database
  - local
  - in the cloud (e.g. [AWS RDS free tier](https://aws.amazon.com/rds/free/?nc1=h_ls), [ElephantSQL](https://www.elephantsql.com))
- file storage for csv file (e.g. free AWS S3)

--

#### Notifications 📬

- make built-in cron emailing functionality work on Raspberry Pi
- monitoring services on AWS etc. (e.g. AWS SNS)

---

# Summary

- Things you can learn: git, cron jobs, ssh, scp, basics of networking, command line, bash scripting, writing code that works not only on your machine...
- Buy a Raspberry Pi if...
  - ... you want to get more experience with virtual machines / "the cloud" etc. but you feel like you need something in between
  - ... you have a use case (and 2-3 other use cases once you "graduate" to the cloud)!

--

- Don't buy one if...
  - ... you'll have to work with cloud services soon anyway
  - ... you don't have the time / nerves to work without RStudio / non-interactively
  - ... you have project ideas that require complex architectures / more computing resources / new packages

---

# Thanks for coming!

### Links

- [Slides](https://talks.pr130.dev/2020-11-24_rladies_bucharest_raspberrypi/index.html)
- [correlaid-utils Repository](https://github.com/friep/correlaid-utils) with a (hopefully) helpful README
- [smcounts](https://github.com/friep/smcounts) R Package
- [talk at CorrelAid Open Online Data Meetup](https://youtu.be/tFRNBHqg_ZQ?t=1966)

### Follow me / Reach out

- ✉️ [frie.p@correlaid.org](mailto:frie.p@correlaid.org)
- [ameisen_strasse](https://twitter.com/ameisen_strasse)
- [https://pr130.dev](https://pr130.dev)
- [correlaid.org](https://correlaid.org)
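
---

# Backup: sketch of a freshness check

The GitHub Action slide describes a workflow that fails when `all_daily.csv` has not been committed in the last 24 hours. The following is a minimal shell sketch of such a check, not the contents of `notify_on_failure.yml`: the function names `is_fresh` and `last_commit_epoch` are hypothetical, and the 24-hour threshold is taken from the slide.

```bash
#!/usr/bin/env bash
# Hypothetical freshness check (illustration only, not the actual workflow).

# is_fresh LAST_COMMIT_EPOCH NOW_EPOCH [MAX_AGE_SECONDS]
# Succeeds (exit status 0) if the last commit is at most MAX_AGE_SECONDS old.
# MAX_AGE_SECONDS defaults to 86400, i.e. 24 hours.
is_fresh() {
  local last_commit="$1" now="$2" max_age="${3:-86400}"
  [ $(( now - last_commit )) -le "$max_age" ]
}

# last_commit_epoch FILE
# Prints the Unix timestamp of the newest commit touching FILE
# (prints nothing if the file has no commits).
last_commit_epoch() {
  git log -1 --format=%ct -- "$1"
}

# Example, to be run inside a clone of the repository:
#   last="$(last_commit_epoch correlaid-analytics/data/all_daily.csv)"
#   is_fresh "${last:-0}" "$(date +%s)" || { echo "no commit in last 24h" >&2; exit 1; }
```

A non-zero exit status is what makes a CI step fail, which in turn triggers the email from GitHub. This assumes the clone contains enough history; a shallow clone may give misleading results for `git log -- FILE`.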