- Installation
- Important Information
- Introduction
- Mat Objects
lmutils::Mat$new
lmutils::Mat$r
lmutils::Mat$col
lmutils::Mat$colnames
lmutils::Mat$save
lmutils::Mat$combine_columns
lmutils::Mat$combine_rows
lmutils::Mat$remove_columns
lmutils::Mat$remove_column
lmutils::Mat$remove_column_if_exists
lmutils::Mat$remove_rows
lmutils::Mat$transpose
lmutils::Mat$sort
lmutils::Mat$sort_by_name
lmutils::Mat$sort_by_order
lmutils::Mat$dedup
lmutils::Mat$dedup_by_name
lmutils::Mat$match_to
lmutils::Mat$match_to_by_name
lmutils::Mat$join
lmutils::Mat$join_by_name
lmutils::Mat$standardize_columns
lmutils::Mat$standardize_rows
lmutils::Mat$remove_na_rows
lmutils::Mat$remove_na_columns
lmutils::Mat$na_to_value
lmutils::Mat$na_to_column_mean
lmutils::Mat$na_to_row_mean
lmutils::Mat$min_column_sum
lmutils::Mat$max_column_sum
lmutils::Mat$min_row_sum
lmutils::Mat$max_row_sum
lmutils::Mat$rename_column
lmutils::Mat$rename_column_if_exists
lmutils::Mat$remove_duplicate_columns
lmutils::Mat$remove_identical_columns
- Matrix Functions
lmutils::save
lmutils::save_dir
lmutils::calculate_r2
lmutils::column_p_values
lmutils::linear_regression
lmutils::logistic_regression
lmutils::combine_vectors
lmutils::combine_rows
lmutils::remove_rows
lmutils::crossprod
lmutils::mul
lmutils::load
lmutils::match_rows
lmutils::match_rows_dir
lmutils::dedup
- Data Frame Functions
- Other Functions
- Configuration
lmutils is not currently available on CRAN, but it can be installed on Linux with the following command. This will also install the Rust programming language, which is required for lmutils.
curl https://raw.githubusercontent.com/GMELab/lmutils.r/refs/heads/master/install.sh | sh
- Matrix convertable object - a data frame, matrix, file name (to read from), a numeric column vector, or a Mat object.
- List of matrix convertable objects - a list of matrix convertable objects, a character vector of file names (to read from), or a single matrix convertable object.
- Standard output file - a character vector of file names matching the length of the inputs, or NULL to return the output. If a single input, not in a list, was provided, the output will not be in a list.
- Join - an inner join means only rows that match in both matrices are kept; a left join means all rows in the left matrix are kept; a right join means all rows in the right matrix are kept.
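The join types above follow the usual relational meaning. As a rough illustration, the sketch below uses base R's merge on two small data frames with made-up values; it does not call lmutils. Note that merge fills unmatched rows with NA, whereas the Mat methods documented later state that an unmatched row in a left or right join is an error.

```r
# Two small tables sharing an "eid" key (values are made up for illustration).
left  <- data.frame(eid = c(1, 2, 3), x = c(10, 20, 30))
right <- data.frame(eid = c(2, 3, 4), y = c(200, 300, 400))

inner_rows <- merge(left, right, by = "eid")                # eids present in both
left_rows  <- merge(left, right, by = "eid", all.x = TRUE)  # every row of left
right_rows <- merge(left, right, by = "eid", all.y = TRUE)  # every row of right
full_rows  <- merge(left, right, by = "eid", all = TRUE)    # every row of either

print(inner_rows$eid)
print(left_rows$eid)
print(right_rows$eid)
print(full_rows$eid)
```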
All files can optionally be compressed with gzip; rdata files are assumed to be compressed without looking for a .gz file extension (as is the standard in R).
- .mat (recommended, custom binary format designed for matrices)
- .csv (requires column headers)
- .tsv (requires column headers)
- .txt (requires column headers)
- .json
- .cbor
- .rkyv
- .rdata
lmutils is an R package that provides utilities for working with matrices and data frames. It is built on top of the Rust programming language for performance and safety. The package provides a way to store matrices in memory and perform operations on them, as well as functions for working with data frames.
lmutils is built primarily around the Mat object. These are designed to be used to perform operations on matrices without loading them into memory until necessary. This can be useful for working with lots of large matrices, like hundreds of gene blocks.
To get started with your first Mat object, you can use the following code:
mat <- lmutils::Mat$new("matrix1.csv")
This will create a new Mat object from a file. You can then perform operations on this object, like combining it with other matrices, removing columns, or standardizing the columns. If you want this matrix to be loaded into R, you can use the r method:
mat$combine_columns("matrix2.csv")
mat$remove_columns(c(1, 2, 3))
mat$standardize_columns()
m <- mat$r()
You can also pass the object directly into functions that accept a matrix convertable object. It will then be loaded automatically (with all the stored operations applied) only when needed.
lmutils::calculate_r2(
  mat,
  "outcomes1.RData"
)
outcomes <- lmutils::Mat$new("outcomes.RData")
geneBlocks <- lapply(c(
  "geneBlock1.csv",
  "geneBlock2.csv",
  "geneBlock3.csv",
  "geneBlock4.csv",
  "geneBlock5.csv"
), function(mat) {
  mat <- lmutils::Mat$new(mat)
  # Inner join (0): match rows on the IID column to the outcomes' eid values.
  mat$match_to_by_name(outcomes$col("eid"), "IID", 0)
  mat$remove_column("IID")
  # Drop columns whose sum is below 2, then impute and standardize.
  mat$min_column_sum(2)
  mat$na_to_column_mean()
  mat$standardize_columns()
  mat
})
outcomes$remove_column("eid")
results <- lmutils::calculate_r2(geneBlocks, outcomes)
lmutils::Mat objects are a way to store matrices in memory and perform operations on them. They can be used to store operations or chain operations together for later execution. This can be useful if, for example, you wish to load a hundred large matrices from files and standardize them all before using lmutils::calculate_r2. Using Mat objects, you can store the operations you wish to perform, and Mat will execute them only when the matrix is loaded.
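The idea of recording operations and running them only at load time can be sketched in plain R with closures. This is only an illustration of the deferred-execution pattern described above, not how Mat is implemented; lazy_matrix, add_op, and load are hypothetical names that are not part of lmutils.

```r
# Hypothetical helper (not part of lmutils): records operations and applies
# them, in order, only when the matrix is actually requested.
lazy_matrix <- function(loader) {
  ops <- list()
  list(
    add_op = function(op) {
      ops[[length(ops) + 1]] <<- op  # queue the operation, do not run it yet
      invisible(NULL)
    },
    load = function() {
      m <- loader()                  # the data is only materialized here
      for (op in ops) m <- op(m)
      m
    }
  )
}

lazy <- lazy_matrix(function() matrix(c(1, 2, 3, 4, 5, 6), nrow = 3))
lazy$add_op(function(m) m[, -1, drop = FALSE])  # queued: remove the first column
lazy$add_op(function(m) m * 2)                  # queued: double every value
result <- lazy$load()                           # both operations run now
print(result)
```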
Passing the same Mat object multiple times in a single function call may cause undefined behavior. For example, the following code may not work as expected:
mat <- lmutils::Mat$new("matrix1.csv")
lmutils::calculate_r2(list(mat, mat), mat)
Creates a new Mat object.
- data is a matrix convertable object.
mat <- lmutils::Mat$new("matrix1.csv")
Loads the matrix from the Mat object.
m <- mat$r()
Get a column by name or index.
col <- mat$col("eid")
col <- mat$col(1)
Get the column names of the matrix, or NULL if there are none.
colnames <- mat$colnames()
Saves the matrix to a file.
- file is the file name to write to.
mat$save("matrix1.mat.gz")
Combines this matrix with other matrices by columns (cbind).
- data is a list of matrix convertable objects.
mat$combine_columns("matrix2.csv")
Combines this matrix with other matrices by rows (rbind).
- data is a list of matrix convertable objects.
mat$combine_rows("matrix2.csv")
Removes columns from the matrix.
- columns is a vector of column indices (1-based) to remove.
mat$remove_columns(c(1, 2, 3))
Removes a column from the matrix by name.
- column is the column name to remove.
mat$remove_column("eid")
Removes a column from the matrix by name if it exists.
- column is the column name to remove.
mat$remove_column_if_exists("eid")
Removes rows from the matrix.
- rows is a vector of row indices (1-based) to remove.
mat$remove_rows(c(1, 2, 3))
Transposes the matrix.
mat$transpose()
Sort by the column at the given index.
- by is the column index (1-based) to sort by.
mat$sort(1)
Sort by the column with the given name.
- by is the column name to sort by.
mat$sort_by_name("eid")
Sort by the given order of rows.
- order is a vector of row indices (1-based) to sort by.
mat$sort_by_order(c(3, 2, 1))
Deduplicate the matrix by a column.
- by is the column index (1-based) to deduplicate by.
mat$dedup(1)
Deduplicate the matrix by a column name.
- by is the column name to deduplicate by.
mat$dedup_by_name("eid")
Match the rows of the matrix to the values in a vector by a column.
- with is a numeric vector to match the rows to.
- by is the column index (1-based) to match the rows by.
- join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$match_to(c(1, 2, 3), 1, 0)
Match the rows of the matrix to the values in a vector by a column name.
- with is a numeric vector to match the rows to.
- by is the column name to match the rows by.
- join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$match_to_by_name(c(1, 2, 3), "eid", 0)
Join the matrix with another matrix by a column.
- other is a matrix convertable object.
- by is the column index (1-based) to join by.
- join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$join("matrix2.csv", 1, 0)
Join the matrix with another matrix by a column name.
- other is a matrix convertable object.
- by is the column name to join by.
- join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$join_by_name("matrix2.csv", "eid", 0)
Standardize the columns of the matrix to have a mean of 0 and a standard deviation of 1.
mat$standardize_columns()
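For reference, column standardization can be reproduced in base R with scale. This is a sketch on a made-up matrix, not a call into lmutils; scale divides by the sample standard deviation (n - 1 denominator), and whether lmutils uses the same denominator is an assumption not confirmed here.

```r
# Made-up 3 x 2 matrix for illustration.
m <- matrix(c(1, 2, 3, 10, 20, 60), nrow = 3)

# scale() centers each column on its mean and divides by the sample sd.
standardized <- scale(m)

col_means <- colMeans(standardized)
col_sds <- apply(standardized, 2, sd)
print(round(col_means, 10))
print(col_sds)
```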
Standardize the rows of the matrix to have a mean of 0 and a standard deviation of 1.
mat$standardize_rows()
Remove rows with any NA values.
mat$remove_na_rows()
Remove columns with any NA values.
mat$remove_na_columns()
Replace all NA values with a given value.
mat$na_to_value(0)
Replace all NA values with the mean of the column.
mat$na_to_column_mean()
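Column-mean imputation as described above can be written out in base R as follows. This is a minimal sketch on a made-up matrix, assuming the mean is taken over the non-missing entries of each column; it does not call lmutils.

```r
# Made-up matrix with one missing value per column.
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)

filled <- m
for (j in seq_len(ncol(filled))) {
  is_missing <- is.na(filled[, j])
  # Mean of the observed entries in this column only.
  filled[is_missing, j] <- mean(filled[, j], na.rm = TRUE)
}
print(filled)
```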
Replace all NA values with the mean of the row.
mat$na_to_row_mean()
Remove columns with a sum less than a given value.
mat$min_column_sum(10)
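The filter described above corresponds to a colSums comparison in base R. The sketch below uses a made-up matrix and a threshold of 2, and assumes columns whose sum equals the threshold are kept; it does not call lmutils.

```r
# Made-up matrix: column sums are 1, 5, and 0.
m <- matrix(c(0, 1, 0, 2, 2, 1, 0, 0, 0), nrow = 3)

threshold <- 2
# Keep only the columns whose sum is at least the threshold.
kept <- m[, colSums(m) >= threshold, drop = FALSE]
print(kept)
```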
Remove columns with a sum greater than a given value.
mat$max_column_sum(10)
Remove rows with a sum less than a given value.
mat$min_row_sum(10)
Remove rows with a sum greater than a given value.
mat$max_row_sum(10)
Rename a column by name.
mat$rename_column("IID", "eid")
Rename a column by name if it exists.
mat$rename_column_if_exists("IID", "eid")
Remove columns that are duplicates of other columns. The first column is kept.
mat$remove_duplicate_columns()
Remove columns with all identical entries.
mat$remove_identical_columns()
Subset the matrix to only include the given columns (1-based indices or names).
mat$subset_columns(c(1, 2, 3))
Saves a list of matrix convertable objects to files.
- from is a list of matrix convertable objects.
- to is a character vector of file names to write to.
lmutils::save(
  list("file1.csv", matrix(1:9, nrow=3), 1:3, data.frame(a=1:3, b=4:6)),
  c("file1.json", "file2.mat.gz", "file3.csv", "file4.rdata")
)
Recursively converts a directory of files to the selected file type.
- from is a string directory name to read the files from.
- to is a string directory name to write the files to or NULL to write to from.
- file_type is a string file extension to write the files as.
lmutils::save_dir(
  "data",
  "converted_data", # or NULL
  "mat.gz"
)
Calculates the R^2 and adjusted R^2 values for blocks and outcomes.
- data is a list of matrix convertable objects.
- outcomes is a single matrix convertable object.
Returns a data frame with columns r2, adj_r2, data, outcome, n, m, and predicted.
results <- lmutils::calculate_r2(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)
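To make the r2 and adj_r2 columns concrete, the sketch below fits one simulated block against one simulated outcome with base R's lm and recomputes the adjusted value by hand. It does not call lmutils, and it assumes n and m mean the number of rows and the number of predictor columns, which the text above does not state explicitly.

```r
set.seed(1)
block <- matrix(rnorm(40), nrow = 20)  # simulated: 20 samples, 2 predictors
outcome <- 2 * block[, 1] - block[, 2] + rnorm(20, sd = 0.1)

fit <- summary(lm(outcome ~ block))
r2 <- fit$r.squared
adj_r2 <- fit$adj.r.squared

n <- nrow(block)  # assumed meaning of the n column
m <- ncol(block)  # assumed meaning of the m column
# Standard adjustment for the number of predictors.
manual_adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - m - 1)
print(c(r2 = r2, adj_r2 = adj_r2, manual_adj_r2 = manual_adj_r2))
```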
Compute the p value of a linear regression between each pair of columns in data and outcomes.
- data is a list of matrix convertable objects.
- outcomes is a single matrix convertable object.
The function returns a data frame with columns p_value, beta, intercept, data, data_column, and outcome.
results <- lmutils::column_p_values(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)
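A pairwise regression of this kind can be sketched in base R by fitting one single-predictor lm per (data column, outcome column) pair. The example uses simulated data and mirrors some of the documented column names, but it is an illustration rather than the lmutils implementation; data_mat and outcome_mat are made-up names.

```r
set.seed(2)
data_mat <- matrix(rnorm(60), nrow = 30, dimnames = list(NULL, c("x1", "x2")))
outcome_mat <- matrix(rnorm(30), nrow = 30, dimnames = list(NULL, "y1"))
# Make y1 depend on x1 so that one pair shows a clear association.
outcome_mat[, "y1"] <- outcome_mat[, "y1"] + 1.5 * data_mat[, "x1"]

results <- do.call(rbind, lapply(colnames(data_mat), function(dc) {
  do.call(rbind, lapply(colnames(outcome_mat), function(oc) {
    coefs <- summary(lm(outcome_mat[, oc] ~ data_mat[, dc]))$coefficients
    data.frame(
      p_value = coefs[2, "Pr(>|t|)"],   # p value of the slope
      beta = coefs[2, "Estimate"],      # slope
      intercept = coefs[1, "Estimate"],
      data_column = dc,
      outcome = oc
    )
  }))
}))
print(results)
```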
Perform a linear regression between each data element and each outcome column.
- data is a list of matrix convertable objects.
- outcomes is a single matrix convertable object.
The function returns a list of data frames with columns slopes, intercept, r2, adj_r2, data, outcome, n, m, and predicted (if enabled).
results <- lmutils::linear_regression(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)
Perform a logistic regression between each data element and each outcome column.
- data is a list of matrix convertable objects.
- outcomes is a single matrix convertable object.
The function returns a list of data frames with columns slopes, intercept, r2, adj_r2, data, outcome, n, m, and predicted (if enabled).
results <- lmutils::logistic_regression(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)
Combine a list of double vectors into a single matrix using the vectors as columns.
- data is a list of double vectors.
- out is an output file name or NULL to return the matrix.
lmutils::combine_vectors(
  list(1:3, 4:6),
  "combined_matrix.csv"
)
Combine a potentially nested list of rows (double vectors) into a matrix.
- data is a list of double vectors.
- out is an output file name or NULL to return the matrix.
lmutils::combine_rows(
  list(list(c(1, 2, 3)), c(4, 5, 6)),
  "combined_matrix.csv"
)
Removes rows from a matrix.
- data is a list of matrix convertable objects.
- rows is a vector of row indices (1-based) to remove.
- out is a standard output file.
lmutils::remove_rows(
  "matrix1.csv",
  c(1, 2, 3),
  "matrix1_removed_rows.csv"
)
Calculates the cross product of a matrix with itself. Equivalent to t(data) %*% data.
- data is a list of matrix convertable objects.
- out is a standard output file.
lmutils::crossprod(
  "matrix1.csv",
  "crossprod_matrix1.csv"
)
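The stated equivalence can be checked on a small made-up matrix in base R. Note that the crossprod called below is base R's function of the same name, used here only to confirm the identity; it is not lmutils::crossprod.

```r
# Made-up 3 x 2 matrix for illustration.
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)

explicit <- t(m) %*% m        # the expression given in the text
builtin <- base::crossprod(m) # base R shortcut for the same product
print(explicit)
```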
Multiplies two matrices. Equivalent to a %*% b.
- a is a list of matrix convertable objects.
- b is a list of matrix convertable objects.
- out is a standard output file.
lmutils::mul(
  "matrix1.csv",
  "matrix2.mat.gz",
  "mul_matrix1_matrix2.csv"
)
Loads a matrix convertable object into R.
- obj is a list of matrix convertable objects. If a single object is provided, the function will return the matrix directly; otherwise it will return a list of matrices.
lmutils::load("matrix1.csv")
Matches rows of a matrix by the values of a vector.
- data is a list of matrix convertable objects.
- with is a numeric vector.
- by is the column name to match the rows by.
- out is a standard output file.
lmutils::match_rows(
  "matrix1.csv",
  c(1, 2, 3),
  "eid",
  "matched_matrix1.csv"
)
Matches rows of all matrices in a directory to the values in a vector by a column.
- from is a string directory name to read the files from.
- to is a string directory name to write the files to or NULL to write to from.
- with is a numeric vector to match the rows to.
- by is the column name to match the rows by.
lmutils::match_rows_dir(
  "matrices",
  "matched_matrices",
  c(1, 2, 3),
  "eid"
)
Deduplicate a matrix by a column. The first occurrence of each value is kept.
- data is a list of matrix convertable objects.
- by is the column name to deduplicate by.
- out is a standard output file.
lmutils::dedup(
  "matrix1.csv",
  "eid",
  "matrix1_dedup.csv"
)
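The "first occurrence is kept" rule matches what base R's duplicated gives. The sketch below applies it to a made-up matrix with an eid column and does not call lmutils.

```r
# Made-up matrix: eid 1 and eid 2 each appear twice.
m <- cbind(eid = c(1, 2, 2, 3, 1), value = c(10, 20, 21, 30, 11))

# duplicated() flags every repeat after the first occurrence of a value.
deduped <- m[!duplicated(m[, "eid"]), , drop = FALSE]
print(deduped)
```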
Compute a new column for a data frame from a Rust-flavored regex and an existing column.
- df is a data frame.
- column is the column name to match.
- regex is the regex to match. The first capture group is used.
- new_column is the new column name.
lmutils::new_column_from_regex(
  data.frame(a=c("a1", "b2", "c3")),
  "a",
  "([a-z])",
  "b"
)
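Extracting the first capture group can be approximated in base R with regexec and regmatches. This is a sketch only: R's regex engine is not the Rust one, so more advanced patterns may behave differently, although the simple pattern used here is valid in both. Returning NA for rows without a match is an assumption of this sketch.

```r
df <- data.frame(a = c("a1", "b2", "c3"), stringsAsFactors = FALSE)

# Each list element holds the full match followed by the capture groups.
matches <- regmatches(df$a, regexec("([a-z])", df$a))

# Take the first capture group, or NA when the pattern did not match.
df$b <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
print(df)
```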
Converts two character vectors into a named list, where the first vector is the names and the second vector is the values. Only the first occurrence of each name is used, essentially creating a map.
- names is a character vector of names.
- values is a character vector of values.
lmutils::map_from_pairs(
  c("a", "b", "c"),
  c("1", "2", "3")
)
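The "first occurrence wins" behavior can be mimicked in base R as below. This is an illustration with made-up input containing a repeated name; pair_names and pair_values are hypothetical variable names, and the code does not call lmutils.

```r
pair_names <- c("a", "b", "a")   # "a" appears twice
pair_values <- c("1", "2", "3")

# Keep only the first occurrence of each name, then build the named list.
first <- !duplicated(pair_names)
map <- setNames(as.list(pair_values[first]), pair_names[first])
print(map)
```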
Compute a new column for a data frame from a list of values and an existing column, matching by the names of the values.
- df is a data frame.
- column is the column name to match.
- values is a named list of values.
- new_column is the new column name.
lmutils::new_column_from_map(
  data.frame(a=c("a", "b", "c")),
  "a",
  lmutils::map_from_pairs(
    c("a", "b", "c"),
    c("1", "2", "3")
  ),
  "b"
)
Compute a new column for a data frame from two character vectors of names and values, matching by the names.
- df is a data frame.
- column is the column name to match.
- names is a character vector of names.
- values is a character vector of values.
- new_column is the new column name.
lmutils::new_column_from_map_pairs(
  data.frame(a=c("a", "b", "c")),
  "a",
  c("a", "b", "c"),
  c("1", "2", "3"),
  "b"
)
Mutably sorts a data frame by multiple columns in ascending order. All columns must be numeric (double or integer), character, or logical vectors.
- df is a data frame.
- columns is a character vector of column names to sort by.
df <- data.frame(a=c(3, 3, 2, 2, 1, 1), b=c("b", "a", "b", "a", "b", "a"))
lmutils::df_sort_asc(
  df,
  c("a", "b")
)
Splits a data frame into multiple data frames by a column. This function will mutably sort the data frame by the column before splitting.
- df is a data frame.
- by is the column name to split by.
df <- data.frame(a=c(1, 2, 3), b=c("a", "b", "c"))
lmutils::df_split(
  df,
  "b"
)
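The sort-then-split behavior resembles base R's order followed by split. The sketch below uses made-up data and does not call lmutils; unlike the function described above, base R leaves the original data frame unmodified.

```r
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

sorted <- df[order(df$b), ]       # sort by the splitting column first
parts <- split(sorted, sorted$b)  # one data frame per distinct value of b
print(parts)
```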
Combines a potentially nested list of data frames into a single data frame. The data frames must have the same columns.
- data is a list of data frames.
lmutils::df_combine(
list(data.frame(a=1:3), data.frame(a=4:6))
)
Compute the R^2 value for given actual and predicted vectors.
lmutils::compute_r2(
  c(1, 2, 3),
  c(1, 2, 3)
)
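For intuition, the usual coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that textbook definition in base R; r2_sketch is a hypothetical helper, and it is an assumption that lmutils::compute_r2 uses this exact definition rather than, say, a squared correlation.

```r
# Hypothetical helper implementing the textbook definition of R^2.
r2_sketch <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)     # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - ss_res / ss_tot
}

perfect <- r2_sketch(c(1, 2, 3), c(1, 2, 3))   # predictions equal the data
baseline <- r2_sketch(c(1, 2, 3), c(2, 2, 2))  # always predicting the mean
print(c(perfect = perfect, baseline = baseline))
```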
Computes the mean of a vector.
lmutils::mean(
  c(1, 2, 3)
)
Computes the median of a vector.
lmutils::median(
  c(1, 2, 3)
)
Computes the standard deviation of a vector.
lmutils::sd(
  c(1, 2, 3)
)
Computes the variance of a vector.
lmutils::var(
  c(1, 2, 3)
)
lmutils exposes a number of global config options that can be set using environment variables or the lmutils package functions:
- LMUTILS_LOG / lmutils::set_log_level to set the log level (default: info). Available log levels in order of increasing verbosity are off, error, warn, info, debug, and trace.
- LMUTILS_CORE_PARALLELISM / lmutils::set_core_parallelism to set the core parallelism (default: 16). This is the number of primary operations to run in parallel.
- LMUTILS_NUM_WORKER_THREADS / lmutils::set_num_worker_threads to set the number of worker threads to use (default: num_cpus::get() / 2). This is the number of threads to use for parallel operations. Once an operation has been run, this value cannot be changed.
- LMUTILS_ENABLE_PREDICTED / lmutils::disable_predicted / lmutils::enable_predicted to enable the calculation of the predicted values in lmutils::calculate_r2.
- LMUTILS_IGNORE_CORE_PARALLEL_ERRORS / lmutils::ignore_core_parallel_errors / lmutils::dont_ignore_core_parallel_errors to ignore errors in core parallel operations. By default, if an error occurs in a core parallel operation, it will be retried; if it exhausts its allowed number of retries, the error will be logged and the next operation will be attempted. If this option is disabled, Rust will panic after the allowed number of retries and the operation will fail.
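As a sketch of the environment-variable route, the variables can be set from within R using Sys.setenv. The variable names come from the list above; the specific values are examples only, and it is an assumption that they should be set before lmutils runs its first operation, since the text only states this restriction for the worker thread count.

```r
# Example values only; set these before running any lmutils operation
# (assumed to be the safest point, see the note on worker threads above).
Sys.setenv(
  LMUTILS_LOG = "debug",
  LMUTILS_CORE_PARALLELISM = "8",
  LMUTILS_NUM_WORKER_THREADS = "4"
)

print(Sys.getenv(c("LMUTILS_LOG", "LMUTILS_CORE_PARALLELISM", "LMUTILS_NUM_WORKER_THREADS")))
```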