SSD Embedding Operators
CUDA Operators
-
void cuda_callback_func(cudaStream_t stream, cudaError_t status, void *functor)
A callback function for cudaStreamAddCallback.
A common callback function for cudaStreamAddCallback, i.e., cudaStreamCallback_t callback. This function casts functor into a void function, invokes it, and then deletes it (the deletion occurs in another thread).
- Parameters:
stream – CUDA stream that cudaStreamAddCallback operates on
status – CUDA status
functor – A functor that will be called
- Returns:
None
-
Tensor masked_index_put_cuda(Tensor self, Tensor indices, Tensor values, Tensor count, const bool use_pipeline, const int64_t preferred_sms)
Similar to torch.Tensor.index_put but ignores indices < 0.
masked_index_put_cuda only supports 2D input values. It puts count rows from values into self using the row indices in indices that are >= 0.
    # Equivalent PyTorch Python code
    indices = indices[:count]
    filter_ = indices >= 0
    indices_ = indices[filter_]
    self[indices_] = values[filter_.nonzero().flatten()]
- Parameters:
self – The 2D output tensor (the tensor that is indexed)
indices – The 1D index tensor
values – The 2D input tensor
count – The tensor that contains the length of indices to process
use_pipeline – A flag that indicates that this kernel will overlap with other kernels. If it is true, the kernel uses only a fraction of the SMs to reduce resource competition
preferred_sms – The number of preferred SMs for the kernel to use when use_pipeline=true. This value is ignored when use_pipeline=false.
- Returns:
The self tensor
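The masking behaviour described above can be sketched in plain Python. This is only an illustration of the semantics, not the CUDA kernel: nested lists stand in for 2D tensors, count is a plain integer rather than a tensor, the name masked_index_put_reference is hypothetical, and use_pipeline and preferred_sms are left out because they affect scheduling only.

```python
def masked_index_put_reference(self_rows, indices, values, count):
    """Pure-Python stand-in for the documented semantics (not the real operator)."""
    for i in range(count):              # only the first `count` entries are processed
        idx = indices[i]
        if idx < 0:
            continue                    # masked entry: row i of values is skipped
        self_rows[idx] = list(values[i])  # row i of values goes to row idx of self
    return self_rows


put_out = [[0, 0], [0, 0], [0, 0], [0, 0]]
put_indices = [2, -1, 0, 3]
put_values = [[1, 1], [2, 2], [3, 3], [4, 4]]
masked_index_put_reference(put_out, put_indices, put_values, count=3)
```

With count=3 the last index (3) is never read, and the -1 entry leaves its row of values unused, so only rows 2 and 0 of the output are written.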
-
Tensor masked_index_select_cuda(Tensor self, Tensor indices, Tensor values, Tensor count, const bool use_pipeline, const int64_t preferred_sms)
Similar to torch.index_select but ignores indices < 0.
masked_index_select_cuda only supports 2D input values. It puts count rows that are specified in indices (where indices >= 0) from values into self.
    # Equivalent PyTorch Python code
    indices = indices[:count]
    filter_ = indices >= 0
    indices_ = indices[filter_]
    self[filter_.nonzero().flatten()] = values[indices_]
- Parameters:
self – The 2D output tensor
indices – The 1D index tensor
values – The 2D input tensor (the tensor that is indexed)
count – The tensor that contains the length of indices to process
use_pipeline – A flag that indicates that this kernel will overlap with other kernels. If it is true, the kernel uses only a fraction of the SMs to reduce resource competition
preferred_sms – The number of preferred SMs for the kernel to use when use_pipeline=true. This value is ignored when use_pipeline=false.
- Returns:
The self tensor
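The select direction can be sketched the same way. As before, this is a plain-Python illustration under stated assumptions: nested lists stand in for 2D tensors, count is a plain integer, the name masked_index_select_reference is hypothetical, and the scheduling parameters are omitted.

```python
def masked_index_select_reference(self_rows, indices, values, count):
    """Pure-Python stand-in for the documented semantics (not the real operator)."""
    for i in range(count):              # only the first `count` entries are processed
        idx = indices[i]
        if idx < 0:
            continue                    # masked entry: row i of self is left untouched
        self_rows[i] = list(values[idx])  # row idx of values goes to row i of self
    return self_rows


select_out = [[0, 0], [0, 0], [0, 0], [0, 0]]
select_indices = [2, -1, 0, 3]
select_values = [[10, 10], [20, 20], [30, 30], [40, 40]]
masked_index_select_reference(select_out, select_indices, select_values, count=3)
```

Compared with the put variant, the roles of the two row positions are swapped: here the index picks the source row in values, and the position in indices picks the destination row in self.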
-
std::tuple<Tensor, Tensor> ssd_generate_row_addrs_cuda(const Tensor &lxu_cache_locations, const Tensor &assigned_cache_slots, const Tensor &linear_index_inverse_indices, const Tensor &unique_indices_count_cumsum, const Tensor &cache_set_inverse_indices, const Tensor &lxu_cache_weights, const Tensor &inserted_ssd_weights, const Tensor &unique_indices_length, const Tensor &cache_set_sorted_unique_indices)
Generate memory addresses for SSD TBE data.
The data retrieved from SSD can be stored in either a scratch pad (HBM) or LXU cache (also HBM).
lxu_cache_locations is used to specify the location of the data. If the location is -1, the data for the associated index is in the scratch pad; otherwise, it is in the cache. To enable TBE kernels to access the data conveniently, this operator generates the memory address of the first byte of the data for each index. When accessing data, a TBE kernel only needs to convert addresses into pointers.
Moreover, this operator also generates the list of post-backward evicted indices, which are the indices whose data is in the scratch pad.
- Parameters:
lxu_cache_locations – The tensor that contains cache slots where data is stored for the full list of indices. -1 is a sentinel value that indicates that data is not in cache.
assigned_cache_slots – The tensor that contains cache slots for the unique list of indices. -1 indicates that data is not in cache
linear_index_inverse_indices – The tensor that contains the original position of linear indices before being sorted
unique_indices_count_cumsum – The tensor that contains the exclusive prefix sum of the counts of unique indices
cache_set_inverse_indices – The tensor that contains the original positions of cache sets before being sorted
lxu_cache_weights – The LXU cache tensor
inserted_ssd_weights – The scratch pad tensor
unique_indices_length – The tensor that contains the number of unique indices (GPU tensor)
cache_set_sorted_unique_indices – The tensor that contains associated unique indices for the sorted unique cache sets
- Returns:
A tuple of tensors (the SSD row address tensor and the post backward evicted index tensor)
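The address selection described above (cache when the location is not -1, scratch pad otherwise) can be illustrated with a simplified sketch. Everything below is an assumption made for illustration: the function name, the base addresses, the fixed row_bytes, and in particular scratch_rows, a hypothetical per-index scratch pad row that the real operator presumably derives from the inverse-index and prefix-sum inputs. It is not the operator's actual algorithm.

```python
def generate_row_addrs_sketch(indices, cache_locations, scratch_rows,
                              cache_base, scratch_base, row_bytes):
    """Return (row_addrs, evicted_indices) under the simplifying assumptions above."""
    row_addrs = []
    evicted_indices = []
    for i, slot in enumerate(cache_locations):
        if slot == -1:
            # Data lives in the scratch pad: address of the first byte of its row.
            row_addrs.append(scratch_base + scratch_rows[i] * row_bytes)
            evicted_indices.append(indices[i])
        else:
            # Data lives in the LXU cache: address of the first byte of its slot.
            row_addrs.append(cache_base + slot * row_bytes)
    return row_addrs, evicted_indices


addrs, evicted = generate_row_addrs_sketch(
    indices=[42, 7, 13, 99],
    cache_locations=[5, -1, 0, -1],
    scratch_rows=[None, 0, None, 1],   # hypothetical; only read where slot == -1
    cache_base=0x1000,
    scratch_base=0x8000,
    row_bytes=64,
)
```

A consumer then only has to reinterpret each address as a pointer, without checking which buffer the row came from.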
-
void ssd_update_row_addrs_cuda(const Tensor &ssd_row_addrs_curr, const Tensor &inserted_ssd_weights_curr_next_map, const Tensor &lxu_cache_locations_curr, const Tensor &linear_index_inverse_indices_curr, const Tensor &unique_indices_count_cumsum_curr, const Tensor &cache_set_inverse_indices_curr, const Tensor &lxu_cache_weights, const Tensor &inserted_ssd_weights_next, const Tensor &unique_indices_length_curr)
Update memory addresses for SSD TBE data.
When pipeline prefetching is enabled, data in a scratch pad of the current iteration can be moved to L1 or a scratch pad of the next iteration during the prefetch step. This operator updates the memory addresses of the relocated data so that they point to the correct location.
- Parameters:
ssd_row_addrs_curr – The tensor that contains the row address of the current iteration
inserted_ssd_weights_curr_next_map – The tensor that contains the mapping from the location of each index in the current iteration to its location in the scratch pad of the next iteration (-1 = the data has not been moved). inserted_ssd_weights_curr_next_map[i] is the location of index i in the scratch pad of the next iteration
lxu_cache_locations_curr – The tensor that contains cache slots where data is stored for the full list of indices for the current iteration. -1 is a sentinel value that indicates that data is not in cache.
linear_index_inverse_indices_curr – The tensor that contains the original position of linear indices before being sorted for the current iteration
unique_indices_count_cumsum_curr – The tensor that contains the exclusive prefix sum of the counts of unique indices for the current iteration
cache_set_inverse_indices_curr – The tensor that contains the original positions of cache sets before being sorted for the current iteration
lxu_cache_weights – The LXU cache tensor
inserted_ssd_weights_next – The scratch pad tensor for the next iteration
unique_indices_length_curr – The tensor that contains the number of unique indices (GPU tensor) for the current iteration
- Returns:
None
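Both address operators take an exclusive prefix sum of the counts of unique indices. For readers unfamiliar with the term, the following plain-Python sketch shows what such a sum looks like; the function name and the sample data are illustrative only, and whether the operator's tensor also carries a trailing total is not stated here.

```python
def exclusive_prefix_sum(counts):
    """Entry k is the sum of counts[0..k-1], so the first entry is always 0."""
    offsets = []
    running = 0
    for c in counts:
        offsets.append(running)  # total of everything before this entry
        running += c
    return offsets


# Sorted indices [3, 3, 7, 9, 9, 9] have unique values [3, 7, 9]
# that occur 2, 1, and 3 times respectively.
unique_counts = [2, 1, 3]
group_offsets = exclusive_prefix_sum(unique_counts)
```

Each offset is the position in the sorted index list where the occurrences of the corresponding unique index begin.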