Being able to program makes you lazy - or rather it gives you the ability to be lazy by just automating everything. This is what I did in this post.
Recently, I went on holidays to the Vosges mountains in northeastern France. While one or two days were definitely too rainy to take electronics outside, I was able to take some pics with my Micro-Four-Thirds (MFT) camera of the beautiful autumn landscape, of our family dog (Team #rdogs!) and of the many, many fly agarics.
Back home and with a free weekend all to myself, I ventured to sort the photos and send the best ones to my family + friends who were with me on the trip. This is always my least favorite part because I take a lot of pictures and a lot of them are…well…not worth the time of looking at.
So I opened the photo viewer on my Linux laptop, went through the photos and deleted the ones I didn’t like. “Done”, you’d think. Well, no. Why? Because, some months ago, I decided I really needed to have RAW files - just in case I’d ever want to seriously edit something (spoiler: I’m too lazy for that). Soo, whenever I push the shutter button nowadays, two files with the same name are stored on my SD card: a normal JPG file and a RAW file with the RW2 extension. So, for example, P1120006.JPG and P1120006.RW2.
However, the Linux photo viewer only shows me the JPG files. So after an hour of deleting JPGs, I still needed to delete the corresponding RW2 files of the JPGs I had deleted. And my dislike for doing stuff in the explorer / finder was big enough that I decided to automate this. Because the offending files are already deleted, I set up a little test case for this post, but I’ll include some screenshots that show how much time - and nerves - this little R exercise saved me.
First up is actually getting the file paths. For this, I use the good old list.files function, which gives you all files in a given folder. I get both the bare file name and the full path to each file. [1]
# delete RAW files where the jpg is deleted
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)
library(here)
# FOLDER <- "/home/frie/Pictures/2019/2019-10_vogesen/"
FOLDER <- "data"
full_paths <- list.files(FOLDER, full.names = TRUE)
file_names <- list.files(FOLDER)
df <- tibble::tibble(full_path = full_paths, file_name = file_names)
df %>% select(-full_path)
# A tibble: 8 x 1
file_name
<chr>
1 P1120001.RW2
2 P1120002.JPG
3 P1120002.RW2
4 P1120003.JPG
5 P1120003.RW2
6 P1120006.JPG
7 P1120006.RW2
8 P1120008.RW2
There are 8 files in the folder. By manually looking at the data, I can easily see that I want to delete P1120001.RW2 and P1120008.RW2.
In the real case, there were 942 files 😱. No way to see that at a glance!
Fortunately, the RW2 and JPG versions have the same file name, except for the extension. I first extract this “common” element of the file name using tidyr::separate, which splits a character column at a certain pattern (the sep argument) and directly puts the pieces into new columns (hard to explain 😄, just see the result and compare with before!). This is honestly one of my favorite functions ever because it handles such a common task that would otherwise be really annoying. [2]
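A minimal sketch of what this step might look like. It rebuilds the toy data frame inline so it runs on its own; the column names are taken from the output below, while the exact sep pattern is an assumption.

```r
library(dplyr)   # for the %>% pipe
library(tibble)
library(tidyr)

# Rebuild the toy data frame from the file names listed earlier
file_names <- c(
  "P1120001.RW2", "P1120002.JPG", "P1120002.RW2", "P1120003.JPG",
  "P1120003.RW2", "P1120006.JPG", "P1120006.RW2", "P1120008.RW2"
)
df <- tibble(full_path = file.path("data", file_names), file_name = file_names)

# sep is treated as a regular expression, so the literal dot is escaped
# (assumed pattern; it only works if the file name contains a single dot)
df <- df %>%
  separate(file_name, into = c("file_name_without_ext", "ext"), sep = "\\.")
df
```

Note that separate drops the original file_name column by default, which is why only three columns remain.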
# A tibble: 8 x 3
full_path file_name_without_ext ext
<chr> <chr> <chr>
1 data/P1120001.RW2 P1120001 RW2
2 data/P1120002.JPG P1120002 JPG
3 data/P1120002.RW2 P1120002 RW2
4 data/P1120003.JPG P1120003 JPG
5 data/P1120003.RW2 P1120003 RW2
6 data/P1120006.JPG P1120006 JPG
7 data/P1120006.RW2 P1120006 RW2
8 data/P1120008.RW2 P1120008 RW2
Now I count how many files exist for each file_name_without_ext by grouping by that variable and counting the number of rows using the little magic n() function from dplyr. This is such a common pattern and I love that dplyr makes this so easy - I remember doing this for my Bachelor thesis without the tidyverse and it was soo difficult for me.
# could be replaced by shorthand: dplyr::add_count(file_name_without_ext)
df <- df %>%
  dplyr::group_by(file_name_without_ext) %>%
  dplyr::mutate(n = n())
df
# A tibble: 8 x 4
# Groups: file_name_without_ext [5]
full_path file_name_without_ext ext n
<chr> <chr> <chr> <int>
1 data/P1120001.RW2 P1120001 RW2 1
2 data/P1120002.JPG P1120002 JPG 2
3 data/P1120002.RW2 P1120002 RW2 2
4 data/P1120003.JPG P1120003 JPG 2
5 data/P1120003.RW2 P1120003 RW2 2
6 data/P1120006.JPG P1120006 JPG 2
7 data/P1120006.RW2 P1120006 RW2 2
8 data/P1120008.RW2 P1120008 RW2 1
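The shorthand mentioned in the code comment can be checked on a small example. This is only a sketch: the toy tibble and the names long_form and short_form are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one lonely RW2 file and one JPG/RW2 pair
toy <- tibble(
  file_name_without_ext = c("P1120001", "P1120002", "P1120002"),
  ext = c("RW2", "JPG", "RW2")
)

# Long form: group, count rows per group, then drop the grouping again
long_form <- toy %>%
  group_by(file_name_without_ext) %>%
  mutate(n = n()) %>%
  ungroup()

# Shorthand: add_count() adds the same n column in one step
short_form <- toy %>%
  add_count(file_name_without_ext)

# Compare the two count columns
identical(long_form$n, short_form$n)
```

One difference worth knowing: the long form used in this post leaves df grouped (see the Groups line in the output), while add_count on an ungrouped data frame returns an ungrouped one.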
Now I filter those rows where n == 1 - those are the RW2 files that are the leftover companions of the JPGs I deleted manually. Just to be sure, I also add the ext == "RW2" condition to the filter statement. [3]
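A minimal sketch of that filter, run on a cut-down version of the toy data so it works on its own. The name to_delete is made up for this example.

```r
library(dplyr)
library(tibble)

# Cut-down toy data with the count column already added
df <- tibble(
  full_path = file.path("data", c("P1120001.RW2", "P1120002.JPG",
                                  "P1120002.RW2", "P1120008.RW2")),
  file_name_without_ext = c("P1120001", "P1120002", "P1120002", "P1120008"),
  ext = c("RW2", "JPG", "RW2", "RW2")
) %>%
  add_count(file_name_without_ext)

# Keep only files without a partner, and only if they are RAW files
to_delete <- df %>%
  filter(n == 1, ext == "RW2")
to_delete
```

Conditions separated by a comma inside filter are combined with a logical AND, so a lonely JPG would not end up in the result.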
I use dplyr::pull to get the full_path variable from the data frame. [4] I also add a small check that I indeed have only RW2 files - all this “making sure” is getting a bit out of hand, but better safe than sorry. 😉
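Here is a sketch of the pull step plus one possible sanity check. The check via stringr::str_detect and stopifnot is an assumption about how such a guard could look, and the names to_delete and files_to_delete are made up for this example.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical result of the filter step
to_delete <- tibble(
  full_path = c("data/P1120001.RW2", "data/P1120008.RW2"),
  ext = c("RW2", "RW2")
)

# pull() extracts a single column as a plain character vector
files_to_delete <- to_delete %>% pull(full_path)

# Stop with an error if any path does not end in .RW2
stopifnot(all(str_detect(files_to_delete, "\\.RW2$")))
files_to_delete
```

If a JPG path slipped in, stopifnot would throw an error before anything gets deleted.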
And finally: delete, delete, delete that sh*t with file.remove!
[1] "data/P1120001.RW2" "data/P1120008.RW2"
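To try the deletion without risking real photos, here is a minimal sketch that runs against a throwaway folder inside tempdir(). The folder name raw_cleanup_demo and the variable names are made up for this example.

```r
# Set up a throwaway folder with two empty stand-in files
tmp_dir <- file.path(tempdir(), "raw_cleanup_demo")
dir.create(tmp_dir, showWarnings = FALSE)
files_to_delete <- file.path(tmp_dir, c("P1120001.RW2", "P1120008.RW2"))
file.create(files_to_delete)

# file.remove() is vectorised and returns one TRUE/FALSE per file
removed <- file.remove(files_to_delete)
removed

# Double-check that the files are really gone
file.exists(files_to_delete)
```

Keep in mind that file.remove does not move anything to the trash, so a check like the one before this step is cheap insurance.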
This deletes the two files that do not have a JPG companion. In the real case, my script successfully deleted 258 files, as can be seen by comparing the before (posted at the beginning of this post) and after screenshots of my explorer.
Hurray for the power of computers! 🎉
I don’t know whether this brought any considerable insight to anyone. 😄 After all, this is not the usual use case for R - a well-written shell command would’ve achieved the same. Or… actually deleting the files manually… But no, that was never an alternative.
The takeaway from this? Being able to program makes you lazy - or rather it gives you the ability to be lazy by just automating everything away. 😎 👅 And in my opinion, this is just another excellent reason to keep coding. ❤️
[1] The double call could be avoided by splitting the full path using something like tidyr::separate but I was lazy.
[2] Sidenote: There’s also tidyr::separate_rows which is even more awesome!
[3] If I did my manual deletion process how I described it, this should not be necessary as a JPG should always have a “partner” RAW file. But who knows? 🤷
[4] pull is just like $ - it just integrates better into pipe workflows. As I broke up the pipe for “educational” purposes, it does not really make sense here, but I thought I’d leave it in just in case someone did not know about it yet.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://gitlab.com/friep/blog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".