Using R and a Raspberry Pi to collect social media data

Frie Preu

RLadies Bucharest, 2020-11-24

About me

  • political scientist turned data scientist turned IT consultant / software developer... something else?
  • useR since 2013/2015
  • CorrelAid volunteer since 2015, full-time since 2020

About CorrelAid

  • German(/European) Data4Good network with over 1500 volunteers
    • data4good projects with external partners
    • education: e.g. meetups, TidyTuesday, workshops, annual conference, internal projects, ...
    • dialogue with society
  • excellent opportunity to try out things

About this project

2017: new website with 📊

➡️ collect social media time series: Facebook, Twitter, Mailchimp subscribers

Requirements for automated data collection

  • 🤖 somewhere to run our code
  • 🕛 automatically execute code at regular intervals
  • 💾 store data for later, easy access
  • 💬 notify us if something is wrong

🤖: A Raspberry Pi

[Image: Raspberry Pi 3 B+]

  • tiny and affordable computer, originally used for teaching
  • large open-source community, many different projects
  • 💸: ~10-110 euros (with accessories)
  • Specs: 512MB - 8GB RAM, own OS (Raspbian)

🕛: Cron jobs

Cron is one of the most useful utility that you can find in any Unix-like operating system. It is used to schedule commands at a specific time. These scheduled commands or tasks are known as "Cron Jobs". (Source: https://ostechnix.com/a-beginners-guide-to-cron-jobs)

50 23 * * * /usr/lib/R/bin/Rscript '/home/frie/correlaid-utils/correlaid-analytics/run.R'

Note: Slide adapted from Alex Kapps presentation, see here. Image source: https://ostechnix.com/wp-content/uploads/2018/05/cron-job-format-1.png.
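
To make the format concrete, here is a small Python sketch (Python being the language of the 2018 version of this project) that splits the crontab line above into its five schedule fields and the command. The names `CRON_FIELDS` and `split_cron_line` are made up for this illustration; they are not part of cron or of the project.

```python
# Field names of a standard five-field crontab schedule, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_line(line: str) -> dict:
    """Split a crontab line into its five schedule fields and the command."""
    # Split on whitespace at most five times so the command stays in one piece.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


entry = split_cron_line(
    "50 23 * * * /usr/lib/R/bin/Rscript "
    "'/home/frie/correlaid-utils/correlaid-analytics/run.R'"
)
for name, value in entry.items():
    print(f"{name:>12}: {value}")
```

For this line the minute is 50, the hour is 23 and the remaining fields are `*`, so the script runs every day at 23:50.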

Project timeline & versions

2017: Raspberry Pi + R + mlab, cf. talk at OODM

2018: AWS Lambda, Serverless & Python, cf. talk at OODM

❌

2020: Raspberry Pi + R + GitHub + GitHub Actions

R and Raspberry Pi - 2017 version

2017 version: diagram

2017 version: summary

  • 🤖 Raspberry Pi
  • 🕛 Cron
  • 💾 mlab
  • 💬 ❌

Problems

  • one big, messy R script
  • authentication details in text files checked into (private) GitHub (⚠️)
  • code quality ...

2018: Python + AWS Lambda + Serverless

December 2017 Frie

🧑‍💻

https://www.codecentric.de

2018 version: diagram

2018 version: What is AWS Lambda?

AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code. [...] The purpose of Lambda, as compared to AWS EC2, is to simplify building smaller, on-demand applications [...] (Wikipedia)

  • event-driven: it only runs in response to an event; the event can be a cron job 👀
  • serverless: underlying servers are automatically started + stopped by AWS (-> RIP fripi)
  • smaller, on-demand applications: those are called functions
  • payment per execution -> free / very cheap!

2018: AWS Lambda + Python

correlaid-analytics
├── daily.py
├── deploy-analytics.sh
├── every_monday.py
├── package-lock.json
├── requirements.txt
├── serverless.yml
└── setup.sh

2018 version: serverless

The Serverless Framework lets you define Lambda functions in a YAML file (serverless.yml) and makes deployment to AWS very easy.

functions:
  daily_correlaid_analytics:
    handler: daily.get_correlaid_data
    events:
      - schedule:
          rate: cron(56 22 * * ? *)

Deployment with:

serverless deploy -v
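
The `handler: daily.get_correlaid_data` entry points at a Python function in daily.py that Lambda calls whenever the schedule fires. The project's actual function is not shown in these slides, so the following is only a minimal sketch of what such a handler could look like. `collect_counts` and the returned dictionary are hypothetical stand-ins, not the real implementation.

```python
import json
from datetime import datetime, timezone


def collect_counts():
    """Hypothetical placeholder for the real API calls (Facebook, Twitter, ...)."""
    return [{"platform": "example", "n": 0}]


def get_correlaid_data(event, context):
    """Entry point: Lambda passes the triggering event and a context object."""
    rows = collect_counts()
    today = datetime.now(timezone.utc).date().isoformat()
    for row in rows:
        row["date"] = today
    # Anything printed here ends up in the function's CloudWatch logs.
    print(json.dumps(rows))
    return {"collected": len(rows)}
```

Unlike a crontab entry, the AWS schedule expression `cron(56 22 * * ? *)` has six fields (the last one is the year) and is evaluated in UTC.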

2018 summary

  • 🤖 AWS Lambda (runs on AWS)
  • 🕛 Cron
  • 💾 hosted MySQL
  • 💬 AWS Lambda alerts

R and Raspberry Pi - 2020 version

2020 version: Cron job

## cronR job
## id: daily_analytics
## tags:
## desc: Get daily CorrelAid Analytics
50 23 * * * cd '/home/frie/correlaid-utils/correlaid-analytics' && /usr/lib/R/bin/Rscript '/home/frie/correlaid-utils/correlaid-analytics/run.R' > '/home/frie/correlaid-utils/correlaid-analytics/run.log' 2>&1

Set up with the very helpful {cronR} 📦.

run.R

library(here)
print("==============================")
print(Sys.time())
source(here::here("correlaid-analytics/01_get_daily_analytics.R"))
source(here::here("correlaid-analytics/02_git.R"))

2020 version: files

correlaid-analytics/
├── 01_get_daily_analytics.R
├── 02_git.R
├── cron.R
├── data
│   └── all_daily.csv
├── run.log
└── run.R

01_get_daily_analytics.R

2020 version: smcounts 📦

library(smcounts)
smcounts::collect_data
## function (slack = TRUE, facebook = TRUE, twitter = TRUE, mailchimp = TRUE)
## {
##     df <- tibble::tibble(date = c(), platform = c(), n = c())
##     if (slack) {
##         slack_df <- ca_slack()
##         df <- rbind(df, slack_df)
##     }
##     if (facebook) {
##         facebook_df <- ca_facebook()
##         df <- rbind(df, facebook_df)
##     }
##     if (twitter) {
##         twitter_df <- ca_twitter()
##         df <- rbind(df, twitter_df)
##     }
##     if (mailchimp) {
##         mailchimp_df <- ca_newsletter()
##         df <- rbind(df, mailchimp_df)
##     }
##     return(df)
## }
## <bytecode: 0x7ffd289e57e8>
## <environment: namespace:smcounts>

smcounts 📦

  • abstracts data collection functionality --> can be reused in other contexts
  • define dependencies via DESCRIPTION file
  • easy installation from GitHub (https://github.com/friep/smcounts)
  • uses environment variables (standard way to store API keys etc.)
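
smcounts itself is an R package, but the environment-variable pattern from the last bullet is language-independent. As a rough illustration, here it is in Python (the language of the 2018 version). The helper `require_env` and the variable name `EXAMPLE_API_TOKEN` are hypothetical and do not come from smcounts.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Hypothetical variable, set here only so that the example runs on its own.
# On a real machine it would be defined outside the code, never committed to git.
os.environ.setdefault("EXAMPLE_API_TOKEN", "dummy-token")
token = require_env("EXAMPLE_API_TOKEN")
```

Failing early with the variable's name in the message makes a missing key easy to spot in a log file when the script runs non-interactively.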

2020 version 💾: Git

02_git.R

# gert (https://docs.ropensci.org/gert/index.html)
library(gert)
# pull first to pick up any remote changes before committing
gert::git_pull()
print(gert::git_status())
# stage and commit only the data file
gert::git_add("correlaid-analytics/data/all_daily.csv")
gert::git_commit(message = "🤖 CRON - update daily data", author = gert::git_signature("raspi3", "raspi3@pr130.dev"))
gert::git_push()

# the data can then be read directly from GitHub
ca_counts <- readr::read_csv("https://raw.githubusercontent.com/friep/correlaid-utils/main/correlaid-analytics/data/all_daily.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   date = col_date(format = ""),
##   platform = col_character(),
##   n = col_double()
## )

2020 version: 💬 GitHub Actions

  • CI/CD tool (continuous integration, continuous deployment) to define workflows in YAML files
  • typical use cases: run checks on R packages (e.g. dplyr), build websites
  • different kinds of triggers: push, pull request, cron job (👀)

correlaid-utils workflow

  • runs every morning to check whether a commit has been made to all_daily.csv in the last 24 hours
    • if yes: ✅
    • if no: ❌ -> workflow fails and GitHub sends an email
  • yml file
  • workflow runs
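
The check described above can be sketched in a few lines of Python. This is not the project's actual workflow, only an illustration of the idea: ask git for the time of the last commit that touched the data file and return a non-zero exit code if it is older than 24 hours. The function names are made up; the `git log -1 --format=%ct` call is standard git.

```python
import subprocess
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # "last 24 hours"


def last_commit_time(path: str) -> int:
    """Unix timestamp of the most recent commit touching `path` (0 if none)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) if out else 0


def is_fresh(commit_time: int, now: float, max_age: int = MAX_AGE_SECONDS) -> bool:
    """True if the commit happened within `max_age` seconds before `now`."""
    return now - commit_time <= max_age


def main(path: str = "correlaid-analytics/data/all_daily.csv") -> int:
    """Exit code for a CI step: 0 if the data is fresh, 1 otherwise."""
    return 0 if is_fresh(last_commit_time(path), time.time()) else 1
```

A workflow step could end with `sys.exit(main())`: a non-zero exit code fails the step and therefore the workflow. The sketch assumes the checkout contains enough git history; in a shallow clone the timestamp of the latest commit overall may be reported instead.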

2020 version: summary

  • 🤖 Raspberry Pi
  • 🕛 Cron
  • 💾 GitHub
  • 💬 GitHub Actions

2020 version vs. 2017 version

  • ✅ better decoupling through smcounts package
  • ✅ more stability
  • 🤔 git as storage option & GitHub Action
  • 👎 better error handling
  • ❌ tests!!

Alternatives & Summary

Alternatives

Server 🤖

  • Virtual machines on AWS, Azure, Google Cloud
  • specialized services from AWS, Azure, ...
  • GitHub Actions or other CI/CD services (?!)

Storage 💾

Notifications 💬

  • make built-in cron emailing functionality work on Raspberry Pi
  • monitoring services on AWS etc. (e.g. AWS SNS)

Summary

  • Things you can learn: git, cron jobs, ssh, scp, basics of networking, command line, bash scripting, writing code that works not only on your machine, ...

  • Buy a Raspberry Pi if...

    • ... you want to get more experience with virtual machines / "the cloud" etc. but you feel like you need something in between
    • ... you have a use case (and 2-3 other use cases once you "graduate" to the cloud)!
  • Don't buy one if...

    • ... you'll have to work with cloud services soon anyway
    • ... you don't have the time / patience to work without RStudio / non-interactively
    • ... you have project ideas that require complex architectures / more computing resources / new packages

Thanks for coming!

Follow me / Reach out
