Quickstart
The Table class in LanceDB implements the contract for a PyTorch
Dataset.
This means you can pass a LanceDB table directly to a PyTorch DataLoader.
Although the Table class in LanceDB implements the torch.utils.data.Dataset interface, you may find that using
a table Permutation is more flexible.
Output Formats
By default, a data loader over a Table emits Arrow data. The collate_fn argument is PyTorch’s batching hook: PyTorch calls it to
turn the fetched items into one batch. PyTorch’s default collate function only knows how to combine tensors, NumPy
arrays, numbers, dicts, and lists, so it does not accept Arrow data directly. When using a Table directly, pass
LanceDB’s lancedb.util.tbl_to_tensor helper as PyTorch’s collate_fn; it converts numeric Arrow columns into a
column-major torch.Tensor with shape (columns, rows).
Permutation works differently: its default output is a list of Python dicts, which PyTorch’s default collate function
can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a
significant performance penalty for converting from Arrow, Lance’s internal representation, to this default format. Use a
direct Table with collate_fn when you want Arrow-to-tensor conversion, or a Permutation when you want the default
PyTorch dict-of-tensors behavior.
To reduce this conversion cost, the Permutation class provides a set of built-in transform functions that map
the Arrow data in different ways. The arrow and polars formats always avoid data copies. The numpy,
pandas, and torch_col formats also avoid data copies in most cases. The python, python_col, and
torch formats all require at least one full copy of the data and are the slowest options.
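To make the batching hook and the two batch layouts described above concrete, here is a minimal pure-Python sketch. It is not PyTorch's or LanceDB's implementation: iter_batches, collate_row_dicts, and collate_column_major are hypothetical names, and plain lists stand in for tensors and Arrow data.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence

Row = Dict[str, float]


def iter_batches(
    rows: Sequence[Row],
    batch_size: int,
    collate_fn: Callable[[List[Row]], Any],
) -> Iterator[Any]:
    """Mimic a loader loop: fetch a slice of items, then let collate_fn build the batch."""
    for start in range(0, len(rows), batch_size):
        yield collate_fn(list(rows[start:start + batch_size]))


def collate_row_dicts(items: List[Row]) -> Dict[str, List[float]]:
    """Turn a list of row dicts into one dict of columns (analogous to a dict of tensors)."""
    return {key: [item[key] for item in items] for key in items[0]}


def collate_column_major(items: List[Row]) -> List[List[float]]:
    """Build a column-major batch: C outer entries (columns), each holding R row values."""
    return [[item[key] for item in items] for key in items[0]]


rows = [{"a": float(i), "b": float(i * 10)} for i in range(5)]
first = next(iter_batches(rows, batch_size=3, collate_fn=collate_column_major))
print(len(first), len(first[0]))  # 2 columns, 3 rows
```

The same fetched items produce a different batch shape depending only on the collate function, which is why the choice of output format and collate_fn has to match.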
Using the torch_col format with a torch data loader
The torch_col format is the most efficient way to convert from Arrow to a torch.Tensor. It converts the
entire Arrow batch to a column-major torch.Tensor. In other words, given C columns and R rows, the resulting
Tensor has shape (C, R). However, this format generates an error if you use a
torch.utils.data.DataLoader with the default collation function.
One workaround is to use the torch format, but that format
requires a data copy. To avoid this error without copying data, provide a custom collation function
in addition to specifying the torch_col format.
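A minimal sketch of that combination follows. It rests on assumptions that are not stated in this page: that Permutation is importable from lancedb.permutation, that Permutation.identity(table) builds a permutation over an already opened table, and that with_format("torch_col") selects the format. Check these against your installed LanceDB version. The names passthrough_collate and make_torch_col_loader are hypothetical.

```python
def passthrough_collate(batch):
    """Return the batch untouched.

    Defined at module level rather than as a lambda so that it stays
    picklable if the loader later uses spawned worker processes.
    """
    return batch


def make_torch_col_loader(table, batch_size=32):
    """Build a DataLoader over `table` that skips PyTorch's default collation.

    Assumption: Permutation.identity and with_format behave as described
    in the lead-in; `table` is an already opened LanceDB table.
    """
    # Imported inside the function so the optional dependencies are only
    # needed where the loader is actually built.
    from lancedb.permutation import Permutation
    from torch.utils.data import DataLoader

    permutation = Permutation.identity(table).with_format("torch_col")
    return DataLoader(
        permutation,
        batch_size=batch_size,
        collate_fn=passthrough_collate,  # hand the batch through unchanged
    )
```

The collation function does no work here because the format has already produced the batch; its only job is to stop the default collation function from running.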
Selecting columns
By default, the Table class returns all columns in the table when used as input to PyTorch. If you only need
a subset of columns, you can significantly reduce your I/O requirements by selecting only those columns. The
Permutation class offers a select_columns method for this purpose.
Using multiple DataLoader workers
Set num_workers > 0 to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and Permutation objects are picklable, so each worker reopens the table after it starts.
Prefer the spawn start method when using multiple workers, because LanceDB uses internal threads and forking a multithreaded process is unsafe. See the performance guide for more multiprocessing guidance.
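As a sketch of the settings described above, the helper below builds a loader with several workers and the spawn start method. The multiprocessing_context and num_workers arguments are standard torch.utils.data.DataLoader parameters; make_worker_loader and START_METHOD are hypothetical names, and dataset stands for any picklable LanceDB table or Permutation.

```python
import multiprocessing as mp

# Start method used for the worker processes.
START_METHOD = "spawn"


def make_worker_loader(dataset, batch_size=32, num_workers=4):
    """Build a DataLoader whose workers are spawned rather than forked."""
    # Imported inside the function so PyTorch is only required where the
    # loader is actually built.
    from torch.utils.data import DataLoader

    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        multiprocessing_context=mp.get_context(START_METHOD),
    )
```

With spawn, each worker re-imports the main module, so a training script should create and iterate the loader under an `if __name__ == "__main__":` guard.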
Remote tables in DataLoader workers
Remote LanceDB Enterprise tables (db://...) work the same way: workers reopen the table from the pickled connection state.
This sends the connection state, including the API key, to each worker. Use a connection factory if credentials should be loaded inside the worker or your client_config contains a non-serializable header_provider.
Providing a custom connection factory
Permutation.with_connection_factory lets each worker reopen the base table with custom logic. The factory takes the table name, returns a LanceDB table, and must be picklable.
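One possible shape for such a factory is sketched below. It reads credentials from environment variables inside the worker, so the API key is never pickled. The variable names LANCEDB_URI and LANCEDB_REGION and the function names are assumptions made for illustration, as is the us-east-1 fallback; lancedb.connect and open_table are used with their documented uri, api_key, and region arguments.

```python
import os


def load_connection_settings():
    """Read connection settings from the environment of the current process."""
    return {
        "uri": os.environ["LANCEDB_URI"],  # hypothetical variable, e.g. db://...
        "api_key": os.environ["LANCEDB_API_KEY"],
        "region": os.environ.get("LANCEDB_REGION", "us-east-1"),  # assumed default
    }


def open_table_from_env(table_name):
    """Connection factory: takes the table name and returns a LanceDB table.

    A module-level function is pickled by reference, so only its name
    travels to the worker, not the credentials.
    """
    # Imported inside the function so the connection is made in the worker.
    import lancedb

    settings = load_connection_settings()
    db = lancedb.connect(
        settings["uri"],
        api_key=settings["api_key"],
        region=settings["region"],
    )
    return db.open_table(table_name)
```

The function would then be passed to Permutation.with_connection_factory. A lambda or a nested function would not work in its place, because neither can be pickled by the standard pickle module.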