| Field | Value |
|---|---|
| Title | Utility Functions to Execute Standard Extract/Transform/Load Operations (using Package 'ff') on Large Data |
| Description | Provides functions to facilitate the use of the 'ff' package in interaction with big data in 'SQL' databases (e.g. in 'Oracle', 'MySQL', 'PostgreSQL', 'Hive') by allowing easy importing directly into 'ffdf' objects using 'DBI', 'RODBC' and 'RJDBC'. Also contains some basic utility functions to do fast left outer join merging based on 'match', factorisation of data and a basic function for re-coding vectors. |
| Authors | Jan Wijffels |
| Maintainer | Jan Wijffels <[email protected]> |
| License | GPL-2 |
| Version | 1.5 |
| Built | 2024-11-12 03:09:42 UTC |
| Source | https://github.com/jwijffels/etlutils |
Provides functions to load big data (e.g. from Oracle) directly into ffdf objects using DBI, plus utility functions such as recoder for re-coding vectors and matchmerge, which does fast left outer join merging based on match.

Jan Wijffels [email protected]
```r
# See the specified functions in the package
```
Converts character vectors, columns of a data.frame, or list elements to factors if they are character strings, and optionally also if they are logicals.
```r
factorise(x, logicals = FALSE, ...)

## Default S3 method:
factorise(x, logicals = FALSE, ...)

## S3 method for class 'character'
factorise(x, logicals = FALSE, ...)

## S3 method for class 'data.frame'
factorise(x, logicals = FALSE, ...)

## S3 method for class 'list'
factorise(x, logicals = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | a character vector, a data.frame or a list |
| logicals | logical indicating whether logical vectors should also be converted to factors. Defaults to FALSE. |
| ... | optional arguments passed on to the methods |
The updated vector, data.frame or list in which the character elements (and optionally the logical elements) are converted to factors.
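For intuition, a minimal sketch of what the single-vector case plausibly does (a hypothetical illustration, not the package's actual source):

```r
## Hypothetical sketch of factorise() on one vector (not the package's actual code)
factorise_sketch <- function(x, logicals = FALSE) {
  if (is.character(x) || (logicals && is.logical(x))) factor(x) else x
}
factorise_sketch(c("a", "b", "a"))                 # characters become a factor
factorise_sketch(c(TRUE, FALSE), logicals = TRUE)  # logicals opt in via the flag
```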
```r
x <- data.frame(x = 1:4, y = LETTERS[1:4], b = c(TRUE, FALSE, NA, TRUE),
                stringsAsFactors = FALSE)
str(factorise(x))
str(factorise(x, logicals = TRUE))
str(factorise(list(a = LETTERS, b = 1:10, c = pi, d = list(x = x))))
```
Merge two data frames (fast) by common columns by performing a left (outer) join or an inner join. The data frames are merged on the columns given by by.x and by.y; columns can be specified only by name. This differs from the merge function in the base package in that merging is done based on a single column key only: if more than one column is supplied in by.x and by.y, these columns are concatenated together to form one key which is used to match (see the sketch below). Alternatively, by.x and by.y can be two vectors of length NROW(x) and NROW(y) which are used directly as keys.
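The multi-column key construction can be pictured as a simple paste of the key columns. A minimal sketch, assuming paste-based concatenation as the description suggests (make_key is a hypothetical helper, not part of the package):

```r
## Illustration: two by-columns collapsed into one match key (hypothetical helper)
make_key <- function(df, columns) do.call(paste, c(df[columns], sep = "."))
left <- data.frame(shop = c("A", "A", "B"), week = c(1, 2, 1))
make_key(left, c("shop", "week"))   # "A.1" "A.2" "B.1" -- one key per row
```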
```r
matchmerge(x, y, by.x, by.y, all.x = FALSE, by.iskey = FALSE, suffix = ".y",
           add.columns = colnames(y), check.duplicates = TRUE, trace = FALSE)
```
| Argument | Description |
|---|---|
| x | the left hand side data frame to merge |
| y | the right hand side data frame to merge |
| by.x | either the name of one column in x or a character vector of length NROW(x) which will be used as the key to merge the two data frames |
| by.y | either the name of one column in y or a character vector of length NROW(y) which will be used as the key to merge the two data frames. Duplicate values in by.y are not allowed. |
| all.x | logical; if TRUE, extra rows are added to the output, one for each row in x that has no matching row in y. These rows have NAs in the columns that are usually filled with values from y. The default FALSE includes only rows with data from both x and y, which corresponds to an inner join; TRUE corresponds to a left (outer) join. |
| by.iskey | logical, indicating that the by.x and by.y inputs are vectors of length NROW(x) and NROW(y) instead of column names in x and y. If FALSE, the input columns are pasted together to create a key to merge upon; otherwise the by.x and by.y vectors are used directly as the matching key. Defaults to FALSE. |
| suffix | a character string used for duplicate column names in x and y to make the y columns unique |
| add.columns | character vector of column names in y to merge into the x data frame. Defaults to all columns of y. |
| check.duplicates | checks if by.y contains duplicates, which is not allowed. Defaults to TRUE. |
| trace | logical, indicating whether to print some informative messages about the progress |
The rows in the right hand side data frame that match on the specific key are extracted and joined to the left hand side data frame. Merging is done based on the match function on the key value. This makes the function a lot faster than merge, especially for large data frames (see the example), and the memory consumption is a lot smaller. In SQL database terminology, the default value of all.x = FALSE gives a natural join, a special case of an inner join; specifying all.x = TRUE gives a left (outer) join. Right (outer) joins and (full) outer joins are not provided by this function.
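The speed comes from replacing merge's generic machinery with a single match lookup. A base-R sketch of the underlying idea (illustrative, not the package's exact implementation):

```r
## Left outer join via match(): unmatched keys yield NA indices,
## and NA indices pull NA values from the right-hand side
left  <- data.frame(id = c(1, 2, 3, 5), a = c("w", "x", "y", "z"))
right <- data.frame(id = c(1, 2, 3, 4), b = c("H", "I", "J", "K"))
idx <- match(left$id, right$id)   # one lookup per left row
cbind(left, b = right$b[idx])     # id 5 has no match and gets NA
```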
data frame with x joined with y based on the supplied columns. The output columns are the columns in x followed by the extra columns in y.
```r
left <- data.frame(idlhs = c(1:4, 3:5), a = LETTERS[1:7], stringsAsFactors = FALSE)
right <- data.frame(idrhs = c(1:4), b = LETTERS[8:11], stringsAsFactors = FALSE)
## Inner join
matchmerge(x = left, y = right, by.x = "idlhs", by.y = "idrhs")
## Left outer join in 2 ways
matchmerge(x = left, y = right, by.x = "idlhs", by.y = "idrhs", all.x = TRUE)
matchmerge(x = left, y = right, by.x = left$idlhs, by.y = right$idrhs,
           all.x = TRUE, by.iskey = TRUE)
## Show usage when y is just a vector instead of a data.frame
matchmerge(x = left, y = right$b, by.x = left$idlhs, by.y = right$idrhs,
           all.x = TRUE, by.iskey = TRUE, add.columns = "b.renamed")
## Show speedup difference with merge
## Not run:
size <- 100000
dimension <- seq(Sys.Date(), Sys.Date() + 10, by = "day")
left <- data.frame(date = rep(dimension, size), sales = rnorm(size))
right <- data.frame(date = dimension, feature = dimension - 7, feature = dimension - 14)
dim(left)
dim(right)
print(system.time(merge(left, right, by.x = "date", by.y = "date",
                        all.x = TRUE, all.y = FALSE)))
print(system.time(matchmerge(left, right, by.x = "date", by.y = "date",
                             all.x = TRUE, by.iskey = FALSE)))
## End(Not run)
## Show example usage
products <- expand.grid(product = c("Pepsi", "Coca Cola"), type = c("Can", "Bottle"),
                        size = c("6Ml", "8Ml"), distributor = c("Distri X", "Distri Y"),
                        salesperson = c("Mr X", "Mr Y"), stringsAsFactors = FALSE)
products <- products[!duplicated(products[, c("product", "type", "size")]), ]
products$key <- paste(products$product, products$type, products$size, sep = ".")
sales <- expand.grid(item = unique(products$key), sales = rnorm(10000, mean = 100))
str(products)
str(sales)
info <- matchmerge(x = sales, y = products, by.x = sales$item, by.y = products$key,
                   all.x = TRUE, by.iskey = TRUE,
                   add.columns = c("size", "distributor"), check.duplicates = FALSE)
str(info)
tapply(info$sales, info$distributor, FUN = sum)
```
Performs NA replacement by last observation carried forward (LOCF), but adds 1 to the carried-forward value.
```r
naLOCFPlusone(x)
```
| Argument | Description |
|---|---|
| x | a numeric vector |
a vector where NAs are replaced by the last observation carried forward plus 1
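A plausible implementation in terms of zoo::na.locf, sketched for illustration (an assumption about the internals, not the package source; only the replaced NAs get the +1):

```r
## Hypothetical sketch of naLOCFPlusone() using zoo::na.locf
library(zoo)
naLOCFPlusone_sketch <- function(x) {
  filled <- na.locf(x, na.rm = FALSE)  # last observation carried forward, leading NAs kept
  x[is.na(x)] <- filled[is.na(x)] + 1  # add 1 only where a value was carried forward
  x
}
naLOCFPlusone_sketch(c(2, NA, NA, 4, 5, 2, NA))  # 2 3 3 4 5 2 3
```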
```r
require(zoo)
x <- c(2, NA, NA, 4, 5, 2, NA)
naLOCFPlusone(x)
```
Read data from a DBI connection into an ffdf object. This can for example be used to import large datasets from Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases into R.
```r
read.dbi.ffdf(query = NULL,
              dbConnect.args = list(drv = NULL, dbname = NULL,
                                    username = "", password = ""),
              dbSendQuery.args = list(), dbFetch.args = list(), x = NULL,
              nrows = -1, first.rows = NULL, next.rows = NULL, levels = NULL,
              appendLevels = TRUE, asffdf_args = list(),
              BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
              colClasses = NULL, transFUN = NULL, ...)
```
| Argument | Description |
|---|---|
| query | the SQL query to execute over the DBI connection |
| dbConnect.args | a list of arguments to pass to DBI's dbConnect |
| dbSendQuery.args | a list of database-specific parameters which will be passed on to dbSendQuery |
| dbFetch.args | a list of optional database-specific parameters which will be passed on to dbFetch |
| x | NULL or an optional ffdf object to which the read records are appended. See the documentation of read.table.ffdf for more details and the example below. |
| nrows | number of rows to read from the query resultset. The default value of -1 reads in all rows. |
| first.rows | chunk size (in rows) to read for the first chunk from the query resultset |
| next.rows | chunk size (in rows) to read sequentially for subsequent chunks from the query resultset. Currently, this must be specified. |
| levels | optional specification of factor levels: a list, named after the columns of the data.frame fetched in the first chunk, containing the levels of the factors |
| appendLevels | logical: a vector of permissions to expand levels for factor columns. See the documentation of read.table.ffdf. |
| asffdf_args | further arguments passed on to as.ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| colClasses | see the documentation of read.table.ffdf |
| transFUN | function applied to the data.frame after each chunk is retrieved by dbFetch |
| ... | optional parameters passed on to transFUN |
Opens up the DBI connection using DBI::dbConnect, sends the query using DBI::dbSendQuery and fetches the results with DBI::dbFetch in batches of next.rows rows. Heavily borrowed from read.table.ffdf.
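For intuition, the batching pattern can be sketched with plain DBI calls. This is a minimal sketch assuming an RSQLite connection and a dbfile path as in the example below; the real function additionally materialises each chunk into the growing ffdf:

```r
## The batching pattern behind read.dbi.ffdf, sketched with plain DBI calls
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = dbfile)  # dbfile as in the example below
res <- dbSendQuery(con, "select * from testdata")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 1000)  # one next.rows-sized chunk as a data.frame
  ## ... convert/append the chunk to the growing ffdf here
}
dbClearResult(res)
dbDisconnect(con)
```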
An ffdf object, unless the query returns zero records, in which case the function returns the data.frame returned by dbFetch (possibly after applying transFUN).
See also: read.table.ffdf, read.odbc.ffdf
```r
require(ff)
##
## Example query using data in sqlite
##
require(RSQLite)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- dbDriver("SQLite")
query <- "select * from testdata limit 10000"
x <- read.dbi.ffdf(query = query, dbConnect.args = list(drv = drv, dbname = dbfile),
                   first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
class(x)
x[1:10, ]
## show it is the same as getting the data directly using RSQLite
## apart from characters which are factors in ffdf objects
directly <- dbGetQuery(dbConnect(drv = drv, dbname = dbfile), query)
directly <- as.data.frame(as.list(directly), stringsAsFactors = TRUE)
all.equal(x[,], directly)
## show how to use the transFUN argument to transform the data before saving
## into the ffdf, and show the use of the levels argument
query <- "select * from testdata limit 10"
x <- read.dbi.ffdf(query = query, dbConnect.args = list(drv = drv, dbname = dbfile),
                   first.rows = 100, next.rows = 1000, VERBOSE = TRUE,
                   levels = list(a = rev(LETTERS)),
                   transFUN = function(x, subtractdays){
                     x$b <- as.Date(x$b)
                     x$b.subtractdaysago <- x$b - subtractdays
                     x
                   }, subtractdays = 7)
class(x)
x[1:10, ]
## remark that the levels of column a are reversed due to the levels argument
levels(x$a)
## show how to append data to an existing ffdf object
transformexample <- function(x, subtractdays){
  x$b <- as.Date(x$b)
  x$b.subtractdaysago <- x$b - subtractdays
  x
}
dim(x)
x[,]
combined <- read.dbi.ffdf(query = query,
                          dbConnect.args = list(drv = drv, dbname = dbfile),
                          first.rows = 100, next.rows = 1000, x = x, VERBOSE = TRUE,
                          transFUN = transformexample, subtractdays = 1000)
dim(combined)
combined[,]
##
## Example query using ROracle. Do try this at home with some larger data :)
##
## Not run:
require(ROracle)
query <- "select OWNER, TABLE_NAME, TABLESPACE_NAME, NUM_ROWS, LAST_ANALYZED
          from all_all_tables"
x <- read.dbi.ffdf(query = query,
                   dbConnect.args = list(drv = dbDriver("Oracle"), user = "YourUser",
                                         password = "YourPassword", dbname = "Mydatabase"),
                   first.rows = 100, next.rows = 50000, nrows = -1, VERBOSE = TRUE)
## End(Not run)
```
Read data from a JDBC connection into an ffdf object. This can for example be used to import large datasets from Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases into R.
```r
read.jdbc.ffdf(query = NULL,
               dbConnect.args = list(drv = NULL, dbname = NULL,
                                     username = "", password = ""),
               dbSendQuery.args = list(), dbFetch.args = list(), x = NULL,
               nrows = -1, first.rows = NULL, next.rows = NULL, levels = NULL,
               appendLevels = TRUE, asffdf_args = list(),
               BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
               colClasses = NULL, transFUN = NULL, ...)
```
| Argument | Description |
|---|---|
| query | the SQL query to execute over the JDBC connection |
| dbConnect.args | a list of arguments to pass to JDBC's dbConnect |
| dbSendQuery.args | a list of database-specific parameters which will be passed on to dbSendQuery |
| dbFetch.args | a list of optional database-specific parameters which will be passed on to dbFetch |
| x | NULL or an optional ffdf object to which the read records are appended. See the documentation of read.table.ffdf for more details and the example below. |
| nrows | number of rows to read from the query resultset. The default value of -1 reads in all rows. |
| first.rows | chunk size (in rows) to read for the first chunk from the query resultset |
| next.rows | chunk size (in rows) to read sequentially for subsequent chunks from the query resultset. Currently, this must be specified. |
| levels | optional specification of factor levels: a list, named after the columns of the data.frame fetched in the first chunk, containing the levels of the factors |
| appendLevels | logical: a vector of permissions to expand levels for factor columns. See the documentation of read.table.ffdf. |
| asffdf_args | further arguments passed on to as.ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| colClasses | see the documentation of read.table.ffdf |
| transFUN | function applied to the data.frame after each chunk is retrieved by dbFetch |
| ... | optional parameters passed on to transFUN |
Opens up the JDBC connection using RJDBC::dbConnect, sends the query using RJDBC::dbSendQuery and fetches the results with RJDBC::dbFetch in batches of next.rows rows. Heavily borrowed from read.table.ffdf.
An ffdf object, unless the query returns zero records, in which case the function returns the data.frame returned by RJDBC::dbFetch (possibly after applying transFUN).
See also: read.table.ffdf, read.dbi.ffdf
```r
## Not run:
require(ff)
##
## Example query using data in sqlite
##
require(RJDBC)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- JDBC(driverClass = "org.sqlite.JDBC",
            classPath = "/usr/local/lib/sqlite-jdbc-3.7.2.jar")
query <- "select * from testdata limit 10000"
x <- read.jdbc.ffdf(query = query,
                    dbConnect.args = list(drv = drv,
                                          url = sprintf("jdbc:sqlite:%s", dbfile)),
                    first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
class(x)
x[1:10, ]
## End(Not run)
```
Read data from an ODBC connection into an ffdf object. This can for example be used to import large datasets from Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases into R.
```r
read.odbc.ffdf(query = NULL,
               odbcConnect.args = list(dsn = NULL, uid = "", pwd = ""),
               odbcDriverConnect.args = list(connection = ""),
               odbcQuery.args = list(), sqlGetResults.args = list(),
               x = NULL, nrows = -1, first.rows = NULL, next.rows = NULL,
               levels = NULL, appendLevels = TRUE, asffdf_args = list(),
               BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
               colClasses = NULL, transFUN = NULL, ...)
```
| Argument | Description |
|---|---|
| query | the SQL query to execute over the ODBC connection |
| odbcConnect.args | a list of arguments to pass to RODBC's odbcConnect |
| odbcDriverConnect.args | a list of arguments to pass to RODBC's odbcDriverConnect |
| odbcQuery.args | a list of arguments to pass to RODBC's odbcQuery |
| sqlGetResults.args | a list of optional parameters which will be passed on to sqlGetResults |
| x | NULL or an optional ffdf object to which the read records are appended. See the documentation of read.table.ffdf for more details and the example below. |
| nrows | number of rows to read from the query resultset. The default value of -1 reads in all rows. |
| first.rows | chunk size (in rows) to read for the first chunk from the query resultset |
| next.rows | chunk size (in rows) to read sequentially for subsequent chunks from the query resultset. Currently, this must be specified. |
| levels | optional specification of factor levels: a list, named after the columns of the data.frame fetched in the first chunk, containing the levels of the factors |
| appendLevels | logical: a vector of permissions to expand levels for factor columns. See the documentation of read.table.ffdf. |
| asffdf_args | further arguments passed on to as.ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| colClasses | see the documentation of read.table.ffdf |
| transFUN | function applied to the data.frame after each chunk is retrieved by sqlGetResults |
| ... | optional parameters passed on to transFUN |
Opens up the ODBC connection using RODBC::odbcConnect or RODBC::odbcDriverConnect, sends the query using RODBC::odbcQuery and retrieves the results in batches of next.rows rows using RODBC::sqlGetResults. Heavily borrowed from read.table.ffdf.
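The same chunked retrieval, sketched with RODBC primitives under the assumption of the 'smalltestsqlitedb' DSN used in the example below (illustrative only; the real function loops until the resultset is exhausted and appends each chunk to the ffdf):

```r
## Chunked retrieval with RODBC, mirroring what read.odbc.ffdf does per chunk
library(RODBC)
con <- odbcConnect("smalltestsqlitedb", believeNRows = FALSE)
odbcQuery(con, "select * from testdata")
first.chunk <- sqlGetResults(con, max = 100)   # the first.rows chunk
next.chunk  <- sqlGetResults(con, max = 1000)  # a subsequent next.rows-sized chunk
odbcClose(con)
```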
An ffdf object, unless the query returns zero records, in which case the function returns the data.frame returned by sqlGetResults (possibly after applying transFUN).
See also: read.table.ffdf, read.dbi.ffdf
```r
##
## Using the sqlite database (smalldb.sqlite3) in the /inst folder of the package:
## set up the sqlite ODBC driver (www.stats.ox.ac.uk/pub/bdr/RODBC-manual.pdf)
## and call it 'smalltestsqlitedb'
##
## Not run:
require(RODBC)
x <- read.odbc.ffdf(query = "select * from testdata limit 10000",
                    odbcConnect.args = list(dsn = "smalltestsqlitedb",
                                            uid = "", pwd = "",
                                            believeNRows = FALSE, rows_at_time = 1),
                    nrows = -1, first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
## End(Not run)
```
Recodes the values of a character vector
```r
recoder(x, from = c(), to = c())
```
| Argument | Description |
|---|---|
| x | character vector |
| from | character vector with the old values |
| to | character vector with the new values |
the vector x in which the from values are recoded into the supplied to values
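A match-based sketch of how such a recoding can work (a hypothetical implementation for illustration, not the package's source):

```r
## Hypothetical recoder(): values found in 'from' map to 'to', others stay untouched
recoder_sketch <- function(x, from = c(), to = c()) {
  idx <- match(x, from)
  x[!is.na(idx)] <- to[idx[!is.na(idx)]]
  x
}
recoder_sketch(c("A", "B", "C"), from = c("A", "B"), to = c("a.123", "b.123"))
```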
```r
recoder(x = append(LETTERS, NA, 5), from = c("A", "B"), to = c("a.123", "b.123"))
```
Renames variables in a data frame.
```r
renameColumns(x, from = "", to = "")
```
| Argument | Description |
|---|---|
| x | data frame to be modified |
| from | character vector containing the current names of each variable to be renamed |
| to | character vector containing the new names of each variable to be renamed |
The updated data frame x where the variables listed in from are renamed to the corresponding to column names.
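The renaming boils down to a names() replacement. A minimal sketch (hypothetical implementation for illustration, assuming all from names are present):

```r
## Hypothetical renameColumns(): remap only the columns listed in 'from'
renameColumns_sketch <- function(x, from = "", to = "") {
  names(x)[match(from, names(x))] <- to
  x
}
renameColumns_sketch(data.frame(x = 1:2, y = c("A", "B")),
                     from = c("x", "y"), to = c("digits", "letters"))
```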
```r
x <- data.frame(x = 1:4, y = LETTERS[1:4])
renameColumns(x, from = c("x", "y"), to = c("digits", "letters"))
```
Write ffdf data to a database table using a DBI connection. This can for example be used to store large ffdf datasets from R in Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases. Note that for very large datasets these SQL databases may provide bulk-loading tools that are faster; you might consider those as an alternative to this procedure.
```r
write.dbi.ffdf(x, name,
               dbConnect.args = list(drv = NULL, dbname = NULL,
                                     username = "", password = ""),
               RECORDBYTES = sum(.rambytes[vmode(x)]),
               BATCHBYTES = getOption("ffbatchbytes"),
               by = NULL, VERBOSE = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | the ffdf object to write to the database |
| name | character string with the name of the table to store the data in. Passed on to dbWriteTable. |
| dbConnect.args | a list of arguments to pass to DBI's dbConnect |
| RECORDBYTES | optional integer scalar representing the bytes needed to process a single row of the ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| by | integer passed on to chunk, indicating to write the data in chunks of by rows |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| ... | optional parameters passed on to dbWriteTable |
Opens up the DBI connection using DBI::dbConnect, then writes the data to the SQL table using DBI::dbWriteTable by extracting the data in batches from the ffdf and appending them to the table.
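For intuition, a minimal sketch of the chunked-append pattern, assuming ff's chunk() honours the by argument as documented above and reusing x and dbfile from the example below (not the package's exact code):

```r
## Chunked append pattern behind write.dbi.ffdf, sketched with ff and plain DBI
library(ff)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = dbfile)  # dbfile as in the example below
for (rows in chunk(x, by = 1000)) {                   # x is an ffdf; 'by' rows per slice
  dbWriteTable(con, "helloworld", x[rows, ],          # in-memory data.frame slice
               append = TRUE, row.names = FALSE)
}
dbDisconnect(con)
```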
invisible()
```r
require(ff)
##
## Example query using data in sqlite
##
require(RSQLite)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- dbDriver("SQLite")
query <- "select * from testdata limit 10000"
x <- read.dbi.ffdf(query = query, dbConnect.args = list(drv = drv, dbname = dbfile),
                   first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
## copy the db in the package folder to a temp folder
## as CRAN does not allow writing in package dirs
dbfile <- tempfile(fileext = ".sqlite3")
file.copy(from = system.file("smalldb.sqlite3", package = "ETLUtils"), to = dbfile)
Sys.chmod(dbfile, mode = "777")
write.dbi.ffdf(x = x, name = "helloworld", row.names = FALSE, overwrite = TRUE,
               dbConnect.args = list(drv = drv, dbname = dbfile),
               by = 1000, VERBOSE = TRUE)
test <- read.dbi.ffdf(query = "select * from helloworld",
                      dbConnect.args = list(drv = drv, dbname = dbfile))
## clean up for CRAN
file.remove(dbfile)
## Not run:
require(ROracle)
write.dbi.ffdf(x = x, name = "hellooracle", row.names = FALSE, overwrite = TRUE,
               dbConnect.args = list(drv = dbDriver("Oracle"), user = "YourUser",
                                     password = "YourPassword", dbname = "Mydatabase"),
               VERBOSE = TRUE)
## End(Not run)
```
Write ffdf data to a database table using a JDBC connection. This can for example be used to store large ffdf datasets from R in Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases. Note that for very large datasets these SQL databases may provide bulk-loading tools that are faster; you might consider those as an alternative to this procedure.
```r
write.jdbc.ffdf(x, name,
                dbConnect.args = list(drv = NULL, dbname = NULL,
                                      username = "", password = ""),
                RECORDBYTES = sum(.rambytes[vmode(x)]),
                BATCHBYTES = getOption("ffbatchbytes"),
                by = NULL, VERBOSE = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | the ffdf object to write to the database |
| name | character string with the name of the table to store the data in. Passed on to dbWriteTable. |
| dbConnect.args | a list of arguments to pass to JDBC's dbConnect |
| RECORDBYTES | optional integer scalar representing the bytes needed to process a single row of the ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| by | integer passed on to chunk, indicating to write the data in chunks of by rows |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| ... | optional parameters passed on to dbWriteTable |
Opens up the JDBC connection using RJDBC::dbConnect, then writes the data to the SQL table using RJDBC::dbWriteTable by extracting the data in batches from the ffdf and appending them to the table.
invisible()
```r
## Not run:
require(ff)
##
## Example query using data in sqlite
##
require(RJDBC)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- JDBC(driverClass = "org.sqlite.JDBC",
            classPath = "/usr/local/lib/sqlite-jdbc-3.7.2.jar")
query <- "select * from testdata limit 10000"
x <- read.jdbc.ffdf(query = query,
                    dbConnect.args = list(drv = drv,
                                          url = sprintf("jdbc:sqlite:%s", dbfile)),
                    first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
write.jdbc.ffdf(x = x, name = "helloworld", row.names = FALSE, overwrite = TRUE,
                dbConnect.args = list(drv = drv,
                                      url = sprintf("jdbc:sqlite:%s", dbfile)),
                by = 1000, VERBOSE = TRUE)
test <- read.jdbc.ffdf(query = "select * from helloworld",
                       dbConnect.args = list(drv = drv,
                                             url = sprintf("jdbc:sqlite:%s", dbfile)))
## End(Not run)
```
Write ffdf data to a database table using an ODBC connection. This can for example be used to store large ffdf datasets from R in Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases. Note that for very large datasets these SQL databases may provide bulk-loading tools that are faster; you might consider those as an alternative to this procedure.
```r
write.odbc.ffdf(x, tablename,
                odbcConnect.args = list(dsn = NULL, uid = "", pwd = ""),
                RECORDBYTES = sum(.rambytes[vmode(x)]),
                BATCHBYTES = getOption("ffbatchbytes"),
                by = NULL, VERBOSE = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | the ffdf object to write to the database |
| tablename | character string with the name of the table to store the data in. Passed on to sqlSave. |
| odbcConnect.args | a list of arguments to pass to RODBC's odbcConnect |
| RECORDBYTES | optional integer scalar representing the bytes needed to process a single row of the ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| by | integer passed on to chunk, indicating to write the data in chunks of by rows |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| ... | optional parameters passed on to sqlSave |
Opens up the ODBC connection using RODBC::odbcConnect, then writes the data to the SQL table using RODBC::sqlSave by extracting the data in batches from the ffdf and appending them to the table.
invisible()
```r
##
## Using the sqlite database (smalldb.sqlite3) in the /inst folder of the package:
## set up the sqlite ODBC driver (www.stats.ox.ac.uk/pub/bdr/RODBC-manual.pdf)
## and call it 'smalltestsqlitedb'
##
## Not run:
require(RODBC)
x <- read.odbc.ffdf(query = "select * from testdata limit 10000",
                    odbcConnect.args = list(dsn = "smalltestsqlitedb",
                                            uid = "", pwd = "",
                                            believeNRows = FALSE, rows_at_time = 1),
                    nrows = -1, first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
write.odbc.ffdf(x = x, tablename = "testdata", rownames = FALSE, append = TRUE,
                odbcConnect.args = list(dsn = "smalltestsqlitedb",
                                        uid = "", pwd = "",
                                        believeNRows = FALSE, rows_at_time = 1),
                by = 1000, VERBOSE = TRUE)
## End(Not run)
```