Shuffling Strings in R
Let’s say you need to share some data that contains potentially identifiable, sensitive information: people’s addresses, phone numbers and so on. Maybe these fields are not particularly important, but you don’t want to drop them entirely, and you don’t want to go through an encryption and decryption process either. One quick and useful option is the stri_rand_shuffle() function from the stringi package.
Imagine you have the following fake data:
library(stringi); library(dplyr)
df <- tibble(
  name = c("John Smyth", "Alan Pear", "Don Baker", "Bjarne Andersson"),
  client_id = c(243, 22, 441, 994),
  address_1 = c("2 Corner View Road", "106 Southfield Ave.", "213 North 25th Street",
                "11 Apple Boulevard"),
  address_2 = c("Dunkirk Springs", "Ballyvourney", "Oakland Heights", "Rintinville"),
  address_3 = c("New York", "Cork", "Essex", "Stockholm"),
  phone_number = c("99-2278-122", "088-766653221", "112341-991", "011-221-324"),
  # quoted throughout: mixing numbers and strings in c() coerces everything to character anyway
  zip_code = c("11517", "E45NN12", "WX133Y", "213337"),
  registration = c("Full", "Part-time", "Full", "Part-time"),
  profile = c("Advanced", "Advanced", "Beginner", "Intermediate")
)
head(df)
## # A tibble: 4 x 9
## name client_id address_1 address_2 address_3 phone_number zip_code
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 John… 243 2 Corner… Dunkirk … New York 99-2278-122 11517
## 2 Alan… 22 106 Sout… Ballyvou… Cork 088-7666532… E45NN12
## 3 Don … 441 213 Nort… Oakland … Essex 112341-991 WX133Y
## 4 Bjar… 994 11 Apple… Rintinvi… Stockholm 011-221-324 213337
## # … with 2 more variables: registration <chr>, profile <chr>
…and suppose we’re interested in the client ID, site/region, ZIP code, registration and profile. We can quickly scramble the identifying information in the other columns with stringi and dplyr:
df %>%
  mutate(phone_number = as.character(phone_number)) %>%
  mutate_at(vars(name, address_1, address_2, phone_number), stri_rand_shuffle)
## # A tibble: 4 x 9
## name client_id address_1 address_2 address_3 phone_number zip_code
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 "ntJ… 243 RV2 edwr… nnuriSsk… New York 1-928-22297 11517
## 2 " aA… 22 d6h itle… eBlnyyru… Cork 22610588676… E45NN12
## 3 e rk… 441 t 2 rthS… etdHnkaa… Essex 39-4121911 WX133Y
## 4 nBjn… 994 p1erAaud… vltiRlne… Stockholm 3421-2102-1 213337
## # … with 2 more variables: registration <chr>, profile <chr>
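The same idea can be sketched with across(), which superseded mutate_at() in dplyr 1.0.0. This is a minimal sketch, not a replacement for the code above: df_small is a hypothetical two-row stand-in for the data, and the set.seed() call is an assumption added only so that reruns give the same shuffle (stringi should draw on R's random number generator).

```r
library(stringi)
library(dplyr)

# Hypothetical two-row stand-in for the fake data used in this post
df_small <- tibble(
  name = c("John Smyth", "Alan Pear"),
  client_id = c(243, 22),
  address_1 = c("2 Corner View Road", "106 Southfield Ave."),
  phone_number = c("99-2278-122", "088-766653221")
)

# Assumption: fixed seed, only to make the shuffle repeatable between runs
set.seed(42)

# Shuffle the characters of each value in the selected columns;
# client_id is not selected, so it is left untouched
scrambled <- df_small %>%
  mutate(across(c(name, address_1, phone_number), stri_rand_shuffle))

scrambled
```

Listing the columns inside across() keeps the selection and the function in one call, and any column you leave out of that list passes through unchanged.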
If you want a closer look at, for example, the address_1 column:
df %>%
  mutate(phone_number = as.character(phone_number)) %>%
  mutate_at(vars(name, address_1, address_2, phone_number), stri_rand_shuffle) %>%
  pull(address_1)
## [1] "rdrn woa e VoiRCe2" ".it0e6SoAf h ld1veu" "S oe t2r3et215hNtrt h"
## [4] "u vA 1reoealBpp1dl"
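To see what the shuffle does and does not change, here is a small check on two of the values from the data. It is only a sketch: sorted_chars() is a hypothetical helper written for this example, not part of stringi.

```r
library(stringi)

original <- c("2 Corner View Road", "011-221-324")
shuffled <- stri_rand_shuffle(original)

# Hypothetical helper: the characters of one string, sorted
sorted_chars <- function(x) sort(strsplit(x, "")[[1]])

# Each string keeps its length
same_length <- nchar(original) == nchar(shuffled)

# Each string keeps exactly the same characters, only the order differs
same_chars <- mapply(
  function(a, b) identical(sorted_chars(a), sorted_chars(b)),
  original, shuffled
)

same_length
same_chars
```

Both checks should come back TRUE for every element, because the function only permutes characters within each string. That also means very short or distinctive values may still be guessable, so it is worth eyeballing the scrambled columns before sharing.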
Using this method, you can share the dataset with the identifying fields scrambled and everything else intact 😎.