| Field | Value |
|---|---|
| Title | Utility Functions to Execute Standard Extract/Transform/Load Operations (using Package 'ff') on Large Data |
| Description | Provides functions to facilitate the use of the 'ff' package in interaction with big data in 'SQL' databases (e.g. in 'Oracle', 'MySQL', 'PostgreSQL', 'Hive') by allowing easy importing directly into 'ffdf' objects using 'DBI', 'RODBC' and 'RJDBC'. Also contains some basic utility functions to do fast left outer join merging based on 'match', factorisation of data and a basic function for re-coding vectors. |
| Authors | Jan Wijffels |
| Maintainer | Jan Wijffels <[email protected]> |
| License | GPL-2 |
| Version | 1.5 |
| Built | 2024-11-12 03:09:42 UTC |
| Source | https://github.com/jwijffels/etlutils |
Provides functions to load big data (e.g. from Oracle) directly into ffdf objects using DBI, plus utility functions such as recoder for re-coding vectors and matchmerge, which does fast left outer join merging based on match.

Jan Wijffels [email protected]
```r
# See the specified functions in the package
```
Converts character vectors, columns of a data.frame, or list elements to factors if they are character strings, and optionally also if they are logicals.
```r
factorise(x, logicals = FALSE, ...)

## Default S3 method:
factorise(x, logicals = FALSE, ...)

## S3 method for class 'character'
factorise(x, logicals = FALSE, ...)

## S3 method for class 'data.frame'
factorise(x, logicals = FALSE, ...)

## S3 method for class 'list'
factorise(x, logicals = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | a character vector, a data.frame or a list |
| logicals | logical indicating whether logical vectors should also be converted to factors. Defaults to FALSE. |
| ... | optional arguments passed on to the methods |
The updated vector, data.frame or list in which the character elements (and optionally the logical elements) are converted to factors.
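For intuition, a minimal sketch of what the single-vector case plausibly does (a hypothetical illustration, not the package's actual source):

```r
## Hypothetical sketch of factorise() on one vector (not the package's actual code)
factorise_sketch <- function(x, logicals = FALSE) {
  if (is.character(x) || (logicals && is.logical(x))) factor(x) else x
}
factorise_sketch(c("a", "b", "a"))                 # characters become a factor
factorise_sketch(c(TRUE, FALSE), logicals = TRUE)  # logicals opt in via the flag
```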
```r
x <- data.frame(x = 1:4, y = LETTERS[1:4], b = c(TRUE, FALSE, NA, TRUE),
                stringsAsFactors = FALSE)
str(factorise(x))
str(factorise(x, logicals = TRUE))
str(factorise(list(a = LETTERS, b = 1:10, c = pi, d = list(x = x))))
```
Merge two data frames (fast) by common columns by performing a left (outer) join or an inner join. The data frames are merged on the columns given by by.x and by.y; columns can be specified only by name. This differs from the merge function in the base package in that merging is done based on a single column key only: if more than one column is supplied in by.x and by.y, these columns are concatenated together to form one key which is used to match (see the sketch below). Alternatively, by.x and by.y can be two vectors of length NROW(x) and NROW(y) which are used directly as keys.
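The multi-column key construction can be pictured as a simple paste of the key columns. A minimal sketch, assuming paste-based concatenation as the description suggests (make_key is a hypothetical helper, not part of the package):

```r
## Illustration: two by-columns collapsed into one match key (hypothetical helper)
make_key <- function(df, columns) do.call(paste, c(df[columns], sep = "."))
left <- data.frame(shop = c("A", "A", "B"), week = c(1, 2, 1))
make_key(left, c("shop", "week"))   # "A.1" "A.2" "B.1" -- one key per row
```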
```r
matchmerge(x, y, by.x, by.y, all.x = FALSE, by.iskey = FALSE, suffix = ".y",
           add.columns = colnames(y), check.duplicates = TRUE, trace = FALSE)
```
| Argument | Description |
|---|---|
| x | the left hand side data frame to merge |
| y | the right hand side data frame to merge |
| by.x | either the name of one column in x or a character vector of length NROW(x) which will be used as the key to merge the two data frames |
| by.y | either the name of one column in y or a character vector of length NROW(y) which will be used as the key to merge the two data frames. Duplicate values in by.y are not allowed. |
| all.x | logical; if TRUE, extra rows are added to the output, one for each row in x that has no matching row in y. These rows have NAs in the columns that are usually filled with values from y. The default FALSE includes only rows with data from both x and y, which corresponds to an inner join; TRUE corresponds to a left (outer) join. |
| by.iskey | logical, indicating that the by.x and by.y inputs are vectors of length NROW(x) and NROW(y) instead of column names in x and y. If FALSE, the input columns are pasted together to create a key to merge upon; otherwise the by.x and by.y vectors are used directly as the matching key. Defaults to FALSE. |
| suffix | a character string used for duplicate column names in x and y to make the y columns unique |
| add.columns | character vector of column names in y to merge into the x data frame. Defaults to all columns of y. |
| check.duplicates | checks if by.y contains duplicates, which is not allowed. Defaults to TRUE. |
| trace | logical, indicating whether to print some informative messages about the progress |
The rows in the right hand side data frame that match on the specific key are extracted and joined to the left hand side data frame. Merging is done based on the match function on the key value. This makes the function a lot faster than merge, especially for large data frames (see the example), and the memory consumption is a lot smaller. In SQL database terminology, the default value of all.x = FALSE gives a natural join, a special case of an inner join; specifying all.x = TRUE gives a left (outer) join. Right (outer) joins and (full) outer joins are not provided by this function.
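The speed comes from replacing merge's generic machinery with a single match lookup. A base-R sketch of the underlying idea (illustrative, not the package's exact implementation):

```r
## Left outer join via match(): unmatched keys yield NA indices,
## and NA indices pull NA values from the right-hand side
left  <- data.frame(id = c(1, 2, 3, 5), a = c("w", "x", "y", "z"))
right <- data.frame(id = c(1, 2, 3, 4), b = c("H", "I", "J", "K"))
idx <- match(left$id, right$id)   # one lookup per left row
cbind(left, b = right$b[idx])     # id 5 has no match and gets NA
```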
data frame with x joined with y based on the supplied columns. The output columns are the columns in x followed by the extra columns in y.
```r
left <- data.frame(idlhs = c(1:4, 3:5), a = LETTERS[1:7], stringsAsFactors = FALSE)
right <- data.frame(idrhs = c(1:4), b = LETTERS[8:11], stringsAsFactors = FALSE)
## Inner join
matchmerge(x = left, y = right, by.x = "idlhs", by.y = "idrhs")
## Left outer join in 2 ways
matchmerge(x = left, y = right, by.x = "idlhs", by.y = "idrhs", all.x = TRUE)
matchmerge(x = left, y = right, by.x = left$idlhs, by.y = right$idrhs,
           all.x = TRUE, by.iskey = TRUE)
## Show usage when y is just a vector instead of a data.frame
matchmerge(x = left, y = right$b, by.x = left$idlhs, by.y = right$idrhs,
           all.x = TRUE, by.iskey = TRUE, add.columns = "b.renamed")
## Show speedup difference with merge
## Not run:
size <- 100000
dimension <- seq(Sys.Date(), Sys.Date() + 10, by = "day")
left <- data.frame(date = rep(dimension, size), sales = rnorm(size))
right <- data.frame(date = dimension, feature = dimension - 7, feature = dimension - 14)
dim(left)
dim(right)
print(system.time(merge(left, right, by.x = "date", by.y = "date",
                        all.x = TRUE, all.y = FALSE)))
print(system.time(matchmerge(left, right, by.x = "date", by.y = "date",
                             all.x = TRUE, by.iskey = FALSE)))
## End(Not run)
## Show example usage
products <- expand.grid(product = c("Pepsi", "Coca Cola"), type = c("Can", "Bottle"),
                        size = c("6Ml", "8Ml"), distributor = c("Distri X", "Distri Y"),
                        salesperson = c("Mr X", "Mr Y"), stringsAsFactors = FALSE)
products <- products[!duplicated(products[, c("product", "type", "size")]), ]
products$key <- paste(products$product, products$type, products$size, sep = ".")
sales <- expand.grid(item = unique(products$key), sales = rnorm(10000, mean = 100))
str(products)
str(sales)
info <- matchmerge(x = sales, y = products, by.x = sales$item, by.y = products$key,
                   all.x = TRUE, by.iskey = TRUE,
                   add.columns = c("size", "distributor"), check.duplicates = FALSE)
str(info)
tapply(info$sales, info$distributor, FUN = sum)
```
Performs NA replacement by last observation carried forward (LOCF), but adds 1 to the carried-forward value.
```r
naLOCFPlusone(x)
```
| Argument | Description |
|---|---|
| x | a numeric vector |
a vector where NAs are replaced by the last observation carried forward plus 1
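A plausible implementation in terms of zoo::na.locf, sketched for illustration (an assumption about the internals, not the package source; only the replaced NAs get the +1):

```r
## Hypothetical sketch of naLOCFPlusone() using zoo::na.locf
library(zoo)
naLOCFPlusone_sketch <- function(x) {
  filled <- na.locf(x, na.rm = FALSE)  # last observation carried forward, leading NAs kept
  x[is.na(x)] <- filled[is.na(x)] + 1  # add 1 only where a value was carried forward
  x
}
naLOCFPlusone_sketch(c(2, NA, NA, 4, 5, 2, NA))  # 2 3 3 4 5 2 3
```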
```r
require(zoo)
x <- c(2, NA, NA, 4, 5, 2, NA)
naLOCFPlusone(x)
```
Read data from a DBI connection into an ffdf object. This can for example be used to import large datasets from Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases into R.
```r
read.dbi.ffdf(query = NULL,
              dbConnect.args = list(drv = NULL, dbname = NULL,
                                    username = "", password = ""),
              dbSendQuery.args = list(), dbFetch.args = list(), x = NULL,
              nrows = -1, first.rows = NULL, next.rows = NULL, levels = NULL,
              appendLevels = TRUE, asffdf_args = list(),
              BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
              colClasses = NULL, transFUN = NULL, ...)
```
| Argument | Description |
|---|---|
| query | the SQL query to execute over the DBI connection |
| dbConnect.args | a list of arguments to pass to DBI's dbConnect |
| dbSendQuery.args | a list of database-specific parameters which will be passed on to dbSendQuery |
| dbFetch.args | a list of optional database-specific parameters which will be passed on to dbFetch |
| x | NULL or an optional ffdf object to which the read records are appended. See the documentation of read.table.ffdf for more details and the example below. |
| nrows | number of rows to read from the query resultset. The default value of -1 reads in all rows. |
| first.rows | chunk size (in rows) to read for the first chunk from the query resultset |
| next.rows | chunk size (in rows) to read sequentially for subsequent chunks from the query resultset. Currently, this must be specified. |
| levels | optional specification of factor levels: a list, named after the columns of the data.frame fetched in the first chunk, containing the levels of the factors |
| appendLevels | logical: a vector of permissions to expand levels for factor columns. See the documentation of read.table.ffdf. |
| asffdf_args | further arguments passed on to as.ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| colClasses | see the documentation of read.table.ffdf |
| transFUN | function applied to the data.frame after each chunk is retrieved by dbFetch |
| ... | optional parameters passed on to transFUN |
Opens up the DBI connection using DBI::dbConnect, sends the query using DBI::dbSendQuery and fetches the results with DBI::dbFetch in batches of next.rows rows. Heavily borrowed from read.table.ffdf.
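For intuition, the batching pattern can be sketched with plain DBI calls. This is a minimal sketch assuming an RSQLite connection and a dbfile path as in the example below; the real function additionally materialises each chunk into the growing ffdf:

```r
## The batching pattern behind read.dbi.ffdf, sketched with plain DBI calls
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = dbfile)  # dbfile as in the example below
res <- dbSendQuery(con, "select * from testdata")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 1000)  # one next.rows-sized chunk as a data.frame
  ## ... convert/append the chunk to the growing ffdf here
}
dbClearResult(res)
dbDisconnect(con)
```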
An ffdf object, unless the query returns zero records, in which case the function returns the data.frame returned by dbFetch (possibly after applying transFUN).
See also: read.table.ffdf, read.odbc.ffdf
```r
require(ff)
##
## Example query using data in sqlite
##
require(RSQLite)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- dbDriver("SQLite")
query <- "select * from testdata limit 10000"
x <- read.dbi.ffdf(query = query, dbConnect.args = list(drv = drv, dbname = dbfile),
                   first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
class(x)
x[1:10, ]
## show it is the same as getting the data directly using RSQLite
## apart from characters which are factors in ffdf objects
directly <- dbGetQuery(dbConnect(drv = drv, dbname = dbfile), query)
directly <- as.data.frame(as.list(directly), stringsAsFactors = TRUE)
all.equal(x[,], directly)
## show how to use the transFUN argument to transform the data before saving
## into the ffdf, and show the use of the levels argument
query <- "select * from testdata limit 10"
x <- read.dbi.ffdf(query = query, dbConnect.args = list(drv = drv, dbname = dbfile),
                   first.rows = 100, next.rows = 1000, VERBOSE = TRUE,
                   levels = list(a = rev(LETTERS)),
                   transFUN = function(x, subtractdays){
                     x$b <- as.Date(x$b)
                     x$b.subtractdaysago <- x$b - subtractdays
                     x
                   }, subtractdays = 7)
class(x)
x[1:10, ]
## remark that the levels of column a are reversed due to the levels argument
levels(x$a)
## show how to append data to an existing ffdf object
transformexample <- function(x, subtractdays){
  x$b <- as.Date(x$b)
  x$b.subtractdaysago <- x$b - subtractdays
  x
}
dim(x)
x[,]
combined <- read.dbi.ffdf(query = query,
                          dbConnect.args = list(drv = drv, dbname = dbfile),
                          first.rows = 100, next.rows = 1000, x = x, VERBOSE = TRUE,
                          transFUN = transformexample, subtractdays = 1000)
dim(combined)
combined[,]
##
## Example query using ROracle. Do try this at home with some larger data :)
##
## Not run:
require(ROracle)
query <- "select OWNER, TABLE_NAME, TABLESPACE_NAME, NUM_ROWS, LAST_ANALYZED
          from all_all_tables"
x <- read.dbi.ffdf(query = query,
                   dbConnect.args = list(drv = dbDriver("Oracle"), user = "YourUser",
                                         password = "YourPassword", dbname = "Mydatabase"),
                   first.rows = 100, next.rows = 50000, nrows = -1, VERBOSE = TRUE)
## End(Not run)
```
Read data from a JDBC connection into an ffdf object. This can for example be used to import large datasets from Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases into R.
```r
read.jdbc.ffdf(query = NULL,
               dbConnect.args = list(drv = NULL, dbname = NULL,
                                     username = "", password = ""),
               dbSendQuery.args = list(), dbFetch.args = list(), x = NULL,
               nrows = -1, first.rows = NULL, next.rows = NULL, levels = NULL,
               appendLevels = TRUE, asffdf_args = list(),
               BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
               colClasses = NULL, transFUN = NULL, ...)
```
| Argument | Description |
|---|---|
| query | the SQL query to execute over the JDBC connection |
| dbConnect.args | a list of arguments to pass to JDBC's dbConnect |
| dbSendQuery.args | a list of database-specific parameters which will be passed on to dbSendQuery |
| dbFetch.args | a list of optional database-specific parameters which will be passed on to dbFetch |
| x | NULL or an optional ffdf object to which the read records are appended. See the documentation of read.table.ffdf for more details and the example below. |
| nrows | number of rows to read from the query resultset. The default value of -1 reads in all rows. |
| first.rows | chunk size (in rows) to read for the first chunk from the query resultset |
| next.rows | chunk size (in rows) to read sequentially for subsequent chunks from the query resultset. Currently, this must be specified. |
| levels | optional specification of factor levels: a list, named after the columns of the data.frame fetched in the first chunk, containing the levels of the factors |
| appendLevels | logical: a vector of permissions to expand levels for factor columns. See the documentation of read.table.ffdf. |
| asffdf_args | further arguments passed on to as.ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| colClasses | see the documentation of read.table.ffdf |
| transFUN | function applied to the data.frame after each chunk is retrieved by dbFetch |
| ... | optional parameters passed on to transFUN |
Opens up the JDBC connection using RJDBC::dbConnect, sends the query using RJDBC::dbSendQuery and fetches the results with RJDBC::dbFetch in batches of next.rows rows. Heavily borrowed from read.table.ffdf.
An ffdf object, unless the query returns zero records, in which case the function returns the data.frame returned by RJDBC::dbFetch (possibly after applying transFUN).
See also: read.table.ffdf, read.dbi.ffdf
```r
## Not run:
require(ff)
##
## Example query using data in sqlite
##
require(RJDBC)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- JDBC(driverClass = "org.sqlite.JDBC",
            classPath = "/usr/local/lib/sqlite-jdbc-3.7.2.jar")
query <- "select * from testdata limit 10000"
x <- read.jdbc.ffdf(query = query,
                    dbConnect.args = list(drv = drv,
                                          url = sprintf("jdbc:sqlite:%s", dbfile)),
                    first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
class(x)
x[1:10, ]
## End(Not run)
```
Read data from an ODBC connection into an ffdf object. This can for example be used to import large datasets from Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases into R.
```r
read.odbc.ffdf(query = NULL,
               odbcConnect.args = list(dsn = NULL, uid = "", pwd = ""),
               odbcDriverConnect.args = list(connection = ""),
               odbcQuery.args = list(), sqlGetResults.args = list(),
               x = NULL, nrows = -1, first.rows = NULL, next.rows = NULL,
               levels = NULL, appendLevels = TRUE, asffdf_args = list(),
               BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
               colClasses = NULL, transFUN = NULL, ...)
```
| Argument | Description |
|---|---|
| query | the SQL query to execute over the ODBC connection |
| odbcConnect.args | a list of arguments to pass to RODBC's odbcConnect |
| odbcDriverConnect.args | a list of arguments to pass to RODBC's odbcDriverConnect |
| odbcQuery.args | a list of arguments to pass to RODBC's odbcQuery |
| sqlGetResults.args | a list of optional parameters which will be passed on to sqlGetResults |
| x | NULL or an optional ffdf object to which the read records are appended. See the documentation of read.table.ffdf for more details and the example below. |
| nrows | number of rows to read from the query resultset. The default value of -1 reads in all rows. |
| first.rows | chunk size (in rows) to read for the first chunk from the query resultset |
| next.rows | chunk size (in rows) to read sequentially for subsequent chunks from the query resultset. Currently, this must be specified. |
| levels | optional specification of factor levels: a list, named after the columns of the data.frame fetched in the first chunk, containing the levels of the factors |
| appendLevels | logical: a vector of permissions to expand levels for factor columns. See the documentation of read.table.ffdf. |
| asffdf_args | further arguments passed on to as.ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| colClasses | see the documentation of read.table.ffdf |
| transFUN | function applied to the data.frame after each chunk is retrieved by sqlGetResults |
| ... | optional parameters passed on to transFUN |
Opens up the ODBC connection using RODBC::odbcConnect or RODBC::odbcDriverConnect, sends the query using RODBC::odbcQuery and retrieves the results in batches of next.rows rows using RODBC::sqlGetResults. Heavily borrowed from read.table.ffdf.
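The same chunked retrieval, sketched with RODBC primitives under the assumption of the 'smalltestsqlitedb' DSN used in the example below (illustrative only; the real function loops until the resultset is exhausted and appends each chunk to the ffdf):

```r
## Chunked retrieval with RODBC, mirroring what read.odbc.ffdf does per chunk
library(RODBC)
con <- odbcConnect("smalltestsqlitedb", believeNRows = FALSE)
odbcQuery(con, "select * from testdata")
first.chunk <- sqlGetResults(con, max = 100)   # the first.rows chunk
next.chunk  <- sqlGetResults(con, max = 1000)  # a subsequent next.rows-sized chunk
odbcClose(con)
```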
An ffdf object, unless the query returns zero records, in which case the function returns the data.frame returned by sqlGetResults (possibly after applying transFUN).
See also: read.table.ffdf, read.dbi.ffdf
```r
##
## Using the sqlite database (smalldb.sqlite3) in the /inst folder of the package:
## set up the sqlite ODBC driver (www.stats.ox.ac.uk/pub/bdr/RODBC-manual.pdf)
## and call it 'smalltestsqlitedb'
##
## Not run:
require(RODBC)
x <- read.odbc.ffdf(query = "select * from testdata limit 10000",
                    odbcConnect.args = list(dsn = "smalltestsqlitedb",
                                            uid = "", pwd = "",
                                            believeNRows = FALSE, rows_at_time = 1),
                    nrows = -1, first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
## End(Not run)
```
Recodes the values of a character vector
```r
recoder(x, from = c(), to = c())
```
| Argument | Description |
|---|---|
| x | character vector |
| from | character vector with the old values |
| to | character vector with the new values |
the vector x in which the from values are recoded into the supplied to values
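A match-based sketch of how such a recoding can work (a hypothetical implementation for illustration, not the package's source):

```r
## Hypothetical recoder(): values found in 'from' map to 'to', others stay untouched
recoder_sketch <- function(x, from = c(), to = c()) {
  idx <- match(x, from)
  x[!is.na(idx)] <- to[idx[!is.na(idx)]]
  x
}
recoder_sketch(c("A", "B", "C"), from = c("A", "B"), to = c("a.123", "b.123"))
```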
```r
recoder(x = append(LETTERS, NA, 5), from = c("A", "B"), to = c("a.123", "b.123"))
```
Renames variables in a data frame.
```r
renameColumns(x, from = "", to = "")
```
| Argument | Description |
|---|---|
| x | data frame to be modified |
| from | character vector containing the current names of each variable to be renamed |
| to | character vector containing the new names of each variable to be renamed |
The updated data frame x where the variables listed in from are renamed to the corresponding to column names.
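The renaming boils down to a names() replacement. A minimal sketch (hypothetical implementation for illustration, assuming all from names are present):

```r
## Hypothetical renameColumns(): remap only the columns listed in 'from'
renameColumns_sketch <- function(x, from = "", to = "") {
  names(x)[match(from, names(x))] <- to
  x
}
renameColumns_sketch(data.frame(x = 1:2, y = c("A", "B")),
                     from = c("x", "y"), to = c("digits", "letters"))
```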
```r
x <- data.frame(x = 1:4, y = LETTERS[1:4])
renameColumns(x, from = c("x", "y"), to = c("digits", "letters"))
```
Write ffdf data to a database table using a DBI connection. This can for example be used to store large ffdf datasets from R in Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases. Note that for very large datasets these SQL databases may provide bulk-loading tools that are faster; you might consider those as an alternative to this procedure.
```r
write.dbi.ffdf(x, name,
               dbConnect.args = list(drv = NULL, dbname = NULL,
                                     username = "", password = ""),
               RECORDBYTES = sum(.rambytes[vmode(x)]),
               BATCHBYTES = getOption("ffbatchbytes"),
               by = NULL, VERBOSE = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | the ffdf object to write to the database |
| name | character string with the name of the table to store the data in. Passed on to dbWriteTable. |
| dbConnect.args | a list of arguments to pass to DBI's dbConnect |
| RECORDBYTES | optional integer scalar representing the bytes needed to process a single row of the ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| by | integer passed on to chunk, indicating to write the data in chunks of by rows |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| ... | optional parameters passed on to dbWriteTable |
Opens up the DBI connection using DBI::dbConnect, then writes the data to the SQL table using DBI::dbWriteTable by extracting the data in batches from the ffdf and appending them to the table.
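For intuition, a minimal sketch of the chunked-append pattern, assuming ff's chunk() honours the by argument as documented above and reusing x and dbfile from the example below (not the package's exact code):

```r
## Chunked append pattern behind write.dbi.ffdf, sketched with ff and plain DBI
library(ff)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = dbfile)  # dbfile as in the example below
for (rows in chunk(x, by = 1000)) {                   # x is an ffdf; 'by' rows per slice
  dbWriteTable(con, "helloworld", x[rows, ],          # in-memory data.frame slice
               append = TRUE, row.names = FALSE)
}
dbDisconnect(con)
```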
invisible()
```r
require(ff)
##
## Example query using data in sqlite
##
require(RSQLite)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- dbDriver("SQLite")
query <- "select * from testdata limit 10000"
x <- read.dbi.ffdf(query = query, dbConnect.args = list(drv = drv, dbname = dbfile),
                   first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
## copy the db in the package folder to a temp folder
## as CRAN does not allow writing in package dirs
dbfile <- tempfile(fileext = ".sqlite3")
file.copy(from = system.file("smalldb.sqlite3", package = "ETLUtils"), to = dbfile)
Sys.chmod(dbfile, mode = "777")
write.dbi.ffdf(x = x, name = "helloworld", row.names = FALSE, overwrite = TRUE,
               dbConnect.args = list(drv = drv, dbname = dbfile),
               by = 1000, VERBOSE = TRUE)
test <- read.dbi.ffdf(query = "select * from helloworld",
                      dbConnect.args = list(drv = drv, dbname = dbfile))
## clean up for CRAN
file.remove(dbfile)
## Not run:
require(ROracle)
write.dbi.ffdf(x = x, name = "hellooracle", row.names = FALSE, overwrite = TRUE,
               dbConnect.args = list(drv = dbDriver("Oracle"), user = "YourUser",
                                     password = "YourPassword", dbname = "Mydatabase"),
               VERBOSE = TRUE)
## End(Not run)
```
Write ffdf data to a database table using a JDBC connection. This can for example be used to store large ffdf datasets from R in Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases. Note that for very large datasets these SQL databases may provide bulk-loading tools that are faster; you might consider those as an alternative to this procedure.
```r
write.jdbc.ffdf(x, name,
                dbConnect.args = list(drv = NULL, dbname = NULL,
                                      username = "", password = ""),
                RECORDBYTES = sum(.rambytes[vmode(x)]),
                BATCHBYTES = getOption("ffbatchbytes"),
                by = NULL, VERBOSE = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | the ffdf object to write to the database |
| name | character string with the name of the table to store the data in. Passed on to dbWriteTable. |
| dbConnect.args | a list of arguments to pass to JDBC's dbConnect |
| RECORDBYTES | optional integer scalar representing the bytes needed to process a single row of the ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| by | integer passed on to chunk, indicating to write the data in chunks of by rows |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| ... | optional parameters passed on to dbWriteTable |
Opens up the JDBC connection using RJDBC::dbConnect, then writes the data to the SQL table using RJDBC::dbWriteTable by extracting the data in batches from the ffdf and appending them to the table.
invisible()
```r
## Not run:
require(ff)
##
## Example query using data in sqlite
##
require(RJDBC)
dbfile <- system.file("smalldb.sqlite3", package = "ETLUtils")
drv <- JDBC(driverClass = "org.sqlite.JDBC",
            classPath = "/usr/local/lib/sqlite-jdbc-3.7.2.jar")
query <- "select * from testdata limit 10000"
x <- read.jdbc.ffdf(query = query,
                    dbConnect.args = list(drv = drv,
                                          url = sprintf("jdbc:sqlite:%s", dbfile)),
                    first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
write.jdbc.ffdf(x = x, name = "helloworld", row.names = FALSE, overwrite = TRUE,
                dbConnect.args = list(drv = drv,
                                      url = sprintf("jdbc:sqlite:%s", dbfile)),
                by = 1000, VERBOSE = TRUE)
test <- read.jdbc.ffdf(query = "select * from helloworld",
                       dbConnect.args = list(drv = drv,
                                             url = sprintf("jdbc:sqlite:%s", dbfile)))
## End(Not run)
```
Write ffdf data to a database table using an ODBC connection. This can for example be used to store large ffdf datasets from R in Oracle, SQLite, MySQL, PostgreSQL, Hive or other SQL databases. Note that for very large datasets these SQL databases may provide bulk-loading tools that are faster; you might consider those as an alternative to this procedure.
```r
write.odbc.ffdf(x, tablename,
                odbcConnect.args = list(dsn = NULL, uid = "", pwd = ""),
                RECORDBYTES = sum(.rambytes[vmode(x)]),
                BATCHBYTES = getOption("ffbatchbytes"),
                by = NULL, VERBOSE = FALSE, ...)
```
| Argument | Description |
|---|---|
| x | the ffdf object to write to the database |
| tablename | character string with the name of the table to store the data in. Passed on to sqlSave. |
| odbcConnect.args | a list of arguments to pass to RODBC's odbcConnect |
| RECORDBYTES | optional integer scalar representing the bytes needed to process a single row of the ffdf |
| BATCHBYTES | integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. See the documentation of read.table.ffdf. |
| by | integer passed on to chunk, indicating to write the data in chunks of by rows |
| VERBOSE | logical: TRUE to report timings for each processed chunk (default FALSE) |
| ... | optional parameters passed on to sqlSave |
Opens up the ODBC connection using RODBC::odbcConnect, then writes the data to the SQL table using RODBC::sqlSave by extracting the data in batches from the ffdf and appending them to the table.
invisible()
```r
##
## Using the sqlite database (smalldb.sqlite3) in the /inst folder of the package:
## set up the sqlite ODBC driver (www.stats.ox.ac.uk/pub/bdr/RODBC-manual.pdf)
## and call it 'smalltestsqlitedb'
##
## Not run:
require(RODBC)
x <- read.odbc.ffdf(query = "select * from testdata limit 10000",
                    odbcConnect.args = list(dsn = "smalltestsqlitedb",
                                            uid = "", pwd = "",
                                            believeNRows = FALSE, rows_at_time = 1),
                    nrows = -1, first.rows = 100, next.rows = 1000, VERBOSE = TRUE)
write.odbc.ffdf(x = x, tablename = "testdata", rownames = FALSE, append = TRUE,
                odbcConnect.args = list(dsn = "smalltestsqlitedb",
                                        uid = "", pwd = "",
                                        believeNRows = FALSE, rows_at_time = 1),
                by = 1000, VERBOSE = TRUE)
## End(Not run)
```