PyTorch

29 Jan 2025 - cohlem

`torch.stack(tensors, dim)`

Stacks a sequence of tensors along a new dimension `dim`. All tensors must have the same shape.

# usage
# data has to be a tensor, and every slice must have the same length
torch.stack([data[i:i+some_number] for i in range(10)])
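To make the usage line above concrete, here is a minimal runnable sketch. The toy `data` tensor and the name `block_size` (standing in for `some_number`) are assumptions chosen only for illustration.

```python
import torch

# Hypothetical toy data: a 1-D tensor of 20 values.
data = torch.arange(20)
block_size = 4  # plays the role of `some_number`

# Each slice has shape (block_size,); stacking 10 of them
# creates a new leading dimension of size 10.
batch = torch.stack([data[i:i + block_size] for i in range(10)])

print(batch.shape)  # torch.Size([10, 4])
print(batch[1])     # tensor([1, 2, 3, 4])
```

If the slices had different lengths (for example near the end of `data`), `torch.stack` would raise an error, which is why every slice has to be the same size.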

`torch.from_numpy(numpy_array)`

Returns a tensor that shares memory with `numpy_array`, so a change made to one is visible in the other.

import numpy as np
import torch

a = np.array([1,2,3])
b = torch.tensor(a) # creates copy
c = torch.from_numpy(a) # shares memory

a[0] = 11
c
# outputs: tensor([11,  2,  3])

b
# outputs: tensor([1, 2, 3]) (the copy is unaffected)

`torch.flatten(input, start_dim=0, end_dim=-1)`

Flattens `input` from `start_dim` (0 by default) through `end_dim` (-1 by default) into a single dimension.

t = torch.tensor([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]])
torch.flatten(t)
# outputs: tensor([1, 2, 3, 4, 5, 6, 7, 8])

torch.flatten(t, start_dim=1) # (2,2,2) --> (2,2*2)
# outputs: tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
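The effect of `end_dim` is easier to see on shapes alone. A small sketch, using an arbitrary 4-D tensor picked just for this illustration:

```python
import torch

# Arbitrary example shape, chosen only to make the arithmetic visible.
t = torch.zeros(2, 3, 4, 5)

flat_all = torch.flatten(t)                            # all dims merged
flat_from_1 = torch.flatten(t, start_dim=1)            # dims 1..3 merged
flat_1_to_2 = torch.flatten(t, start_dim=1, end_dim=2) # only dims 1..2 merged

print(flat_all.shape)     # torch.Size([120])
print(flat_from_1.shape)  # torch.Size([2, 60])
print(flat_1_to_2.shape)  # torch.Size([2, 12, 5])
```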

#### torch.stack and torch.cat((tensors), dim)

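`torch.stack` joins tensors along a new dimension, so all inputs must have the same shape, whereas
`torch.cat` concatenates along an existing dimension, so the inputs only need to match in every other dimension.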

example:

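```python
import torch

a = torch.randn(2,5,8,32)
b = torch.randn(2,1,8,32)

# cat: dim 1 grows from 5 to 5 + 1, no new dimension is added
torch.cat((a,b), dim=1).shape
# outputs: torch.Size([2, 6, 8, 32])

a = torch.randn(3,5,8,32)
b = torch.randn(3,5,8,32)

# stack: a new dimension of size 2 is inserted at index 1
torch.stack((a,b), dim=1).shape
# outputs: torch.Size([3, 2, 5, 8, 32])
```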

For the past 2 years I’ve been involved in training and experimenting with machine learning systems, mostly using third-party packages such as sklearn, huggingface and so on. Sometimes the experiments become too specific, and the abstractions provided by these packages become a bottleneck for performance optimization. My research goal is to understand these bottlenecks in depth and write my own hardware-specific optimized code that enables resource-efficient training or inference.

While training, make sure to take care of these things:

- Are the models using the same precision? Verify this explicitly.
- Models will output different logits with parallelism than without it, even at the same precision. Maybe the difference comes from the partial results produced when applying `RowwiseParallel`, which are then summed. Example: without parallelism: [ 3.7969, 7.7500, 3.3125, -1.0938, 6.9688]; with parallelism: [ 3.9062, 7.8750, 3.2656, -0.8867, 6.6562].
- Actions must be sliced with `[1:]` and states with `[:-1]`. Why? We pass the states to the model and it outputs logits, which are next-token predictions. For `state[i]` the actual next token is `state[i+1]`, so we construct the actions from the states shifted by one (`[1:]`). Since there is no actual next token for the last token in the states, we drop it with `[:-1]`: its logits would have no target token to compute a logprob against.
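The second point in the list can be illustrated on a single device. The sketch below is not a real tensor-parallel setup: it only emulates the idea of row-wise sharding by splitting a matmul along its inner dimension and summing the two partial products, under the assumption that this summation order is where the discrepancy comes from. The shapes and the seed are arbitrary.

```python
import torch

torch.manual_seed(0)

# Arbitrary toy activation and weight in bfloat16.
x = torch.randn(4, 64, dtype=torch.bfloat16)
w = torch.randn(64, 8, dtype=torch.bfloat16)

# "Without parallelism": one matmul over the full inner dimension.
full = x @ w

# Emulated row-wise sharding: each hypothetical rank holds half of the
# inner dimension and computes a partial product; the partials are then
# summed, which is the role an all-reduce would play.
x0, x1 = x.chunk(2, dim=1)
w0, w1 = w.chunk(2, dim=0)
sharded = (x0 @ w0) + (x1 @ w1)

# Each partial product is rounded to bfloat16 before the sum, so the
# result is not guaranteed to match the single matmul bit for bit.
max_diff = (full.float() - sharded.float()).abs().max().item()
print(full[0])
print(sharded[0])
print("max abs difference:", max_diff)
```

Both results are mathematically the same quantity, so any difference printed here is rounding noise. Comparing with a tolerance rather than exact equality is the safer check when validating a parallel run against a single-device one.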

If you forget the shifting step, it might haunt you for hours or days. If you’re fine-tuning, it should be fairly easy to debug: when you print the logprobs for the tokens, you should see something like the output below, where most tokens are highly probable because the model is already trained. (These are logprobs, not probabilities, so don’t confuse them: 0.00 means a probability of about 1.) If you’re pretraining from scratch, it may be hard to debug.

tensor([-11.7500, -10.8125, -8.6250, -11.2500, -11.3125, -12.5000, -7.3125, -14.2500, -13.6875, -11.8125, -15.5000, -11.2500, -17.5000, -15.5000, -12.7500, -14.5000, -8.5625, -12.8750, -17.7500, -20.0000, -13.3750, -18.5000, -22.2500, -12.3750, -13.8125, -13.3750, -13.3125, -15.6250, -11.5625, -14.8125, -10.1875, -14.1250, -18.6250, -17.0000, -13.6250, -21.7500, -14.2500, -9.7500, -3.7500, -10.1250, -6.6875, -12.7500, -10.9375, -13.4375, -10.9375, -12.9375, -14.7500, -13.5000, -0.5000, -12.5625, -10.3750, -8.9375, -14.8750, -6.5000, -15.3125, -13.5000, -8.9375, -15.0625, -31.0000, -39.2500, -21.1250, -18.5000, 0.0000, -1.1250, -0.2500, 0.0000, -0.2500, 0.0000, -0.1250, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.2500, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.5000, -1.1250, 0.0000, -0.6250, 0.0000, -1.0000, -0.3750, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.0000, -0.8750, 0.0000, 0.0000, -0.1250, -0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.8750, 0.0000, -0.1250, 0.0000, -0.1250, -0.3750, 0.0000, -1.2500, -1.7500, -0.8750, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, -0.7500, 0.0000, 0.0000, -0.1250, 0.0000, 0.0000], device='cuda:0', dtype=torch.bfloat16)
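A per-token logprob vector like the one above can be produced roughly as follows. This is a minimal sketch of the `[:-1]` / `[1:]` shift described earlier, not the actual training code: `toy_model`, `vocab_size`, and the random `states` are stand-ins assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, seq_len = 50, 6

# Hypothetical stand-in for a language model: token ids -> logits.
toy_model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))

states = torch.randint(0, vocab_size, (1, seq_len))  # (batch, T) token ids

inputs = states[:, :-1]   # drop the last token: it has no next token
actions = states[:, 1:]   # target for position i is the token at i + 1

with torch.no_grad():
    logits = toy_model(inputs)                # (batch, T-1, vocab)
    logprobs = F.log_softmax(logits, dim=-1)  # normalize over the vocab
    # Pick the logprob assigned to each actual next token.
    token_logprobs = logprobs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

print(token_logprobs.shape)  # torch.Size([1, 5])
print(token_logprobs)
```

With an untrained model like this one, the values are all well below zero. With a trained model and a correct shift, most of them should sit close to 0.00; if they do not, a misaligned `actions` tensor is one of the first things to check.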