PyTorch

29 Jan 2025 - cohlem

torch.stack(tensors, dim=0)

Stacks the tensors along a new dimension dim; all tensors must have the same shape.

```python
# usage: data has to be a tensor
data = torch.arange(100)
n = 8  # slice length
torch.stack([data[i:i+n] for i in range(10)])  # shape: (10, 8)
```

torch.from_numpy(numpy_array)

Returns a tensor that shares memory with numpy_array, so changes to one are visible in the other.

```python
a = np.array([1, 2, 3])
b = torch.tensor(a)      # creates a copy
c = torch.from_numpy(a)  # shares memory

a[0] = 11
c
# outputs: tensor([11,  2,  3])
```
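The sharing also works in the other direction: calling .numpy() on a CPU tensor returns an array backed by the same memory. A quick check (values here are arbitrary):

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])
n = t.numpy()   # shares memory with t (CPU tensors only)
t[0] = 11.0
print(n)        # the change shows up in the array: [11.  2.  3.]
```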

torch.flatten(input, start_dim=0, end_dim=-1)

Flattens input from dimension start_dim through end_dim (the last dimension by default).

```python
t = torch.tensor([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]])
torch.flatten(t)               # tensor([1, 2, 3, 4, 5, 6, 7, 8])
torch.flatten(t, start_dim=1)  # (2,2,2) --> (2,2*2)
# outputs: tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
```
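end_dim lets you flatten only a middle range of dimensions; a quick shape check (tensor sizes here are arbitrary):

```python
import torch

t = torch.randn(2, 3, 4, 5)
# flatten dims 1 and 2 only: (2, 3, 4, 5) --> (2, 12, 5)
print(torch.flatten(t, start_dim=1, end_dim=2).shape)  # torch.Size([2, 12, 5])
```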


#### torch.stack and torch.cat(tensors, dim)

torch.stack stacks tensors along a new dimension, whereas
torch.cat concatenates them along an existing dimension.

example:

```python
a = torch.randn(2, 5, 8, 32)
b = torch.randn(2, 1, 8, 32)

torch.cat((a, b), dim=1).shape
# outputs: torch.Size([2, 6, 8, 32])

a = torch.randn(3, 5, 8, 32)
b = torch.randn(3, 5, 8, 32)

torch.stack((a, b), dim=1).shape
# outputs: torch.Size([3, 2, 5, 8, 32])
```
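One way to see the relationship between the two: torch.stack is equivalent to unsqueezing each tensor at dim and then calling torch.cat. A small check:

```python
import torch

a = torch.randn(3, 5)
b = torch.randn(3, 5)

s = torch.stack((a, b), dim=1)                          # (3, 2, 5)
c = torch.cat((a.unsqueeze(1), b.unsqueeze(1)), dim=1)  # (3, 2, 5)
print(torch.equal(s, c))  # True
```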

For the past two years I've been involved in training and experimenting with machine learning systems, mostly using third-party packages such as sklearn and huggingface. Sometimes the experiments become too specific, and the abstraction provided by these packages becomes a bottleneck for performance optimization. My research goal is to understand these bottlenecks in depth and to write my own optimized, hardware-specific code that enables resource-efficient training or inference.

While training, make sure to take care of these things.

If you forget the step above, it might haunt you for hours or days. If you're fine-tuning, it should be fairly easy to debug: when you print the log-probs for the tokens, you should see something like the output below (these are log-probabilities, not probabilities, so don't be confused; a value of 0.00 means the token is highly probable). Most tokens are probable because the model is already trained. If you're pretraining from scratch, it may be much harder to debug.

```
tensor([-11.7500, -10.8125, -8.6250, -11.2500, -11.3125, -12.5000, -7.3125, -14.2500, -13.6875, -11.8125, -15.5000, -11.2500, -17.5000, -15.5000, -12.7500, -14.5000, -8.5625, -12.8750, -17.7500, -20.0000, -13.3750, -18.5000, -22.2500, -12.3750, -13.8125, -13.3750, -13.3125, -15.6250, -11.5625, -14.8125, -10.1875, -14.1250, -18.6250, -17.0000, -13.6250, -21.7500, -14.2500, -9.7500, -3.7500, -10.1250, -6.6875, -12.7500, -10.9375, -13.4375, -10.9375, -12.9375, -14.7500, -13.5000, -0.5000, -12.5625, -10.3750, -8.9375, -14.8750, -6.5000, -15.3125, -13.5000, -8.9375, -15.0625, -31.0000, -39.2500, -21.1250, -18.5000, 0.0000, -1.1250, -0.2500, 0.0000, -0.2500, 0.0000, -0.1250, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.2500, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.5000, -1.1250, 0.0000, -0.6250, 0.0000, -1.0000, -0.3750, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.0000, -0.8750, 0.0000, 0.0000, -0.1250, -0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -0.1250, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, -1.8750, 0.0000, -0.1250, 0.0000, -0.1250, -0.3750, 0.0000, -1.2500, -1.7500, -0.8750, 0.0000, -0.1250, 0.0000, 0.0000, 0.0000, -0.7500, 0.0000, 0.0000, -0.1250, 0.0000, 0.0000], device='cuda:0', dtype=torch.bfloat16)
```
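A minimal sketch of how such per-token log-probs can be computed from a model's logits (the shapes, logits, and target tokens here are made up for illustration; with a real model the logits would come from a forward pass):

```python
import torch
import torch.nn.functional as F

# toy logits: batch of 1, sequence of 4 tokens, vocab of 10 (made-up shapes)
logits = torch.randn(1, 4, 10)
targets = torch.tensor([[2, 7, 1, 3]])  # the tokens actually in the sequence

# log-softmax over the vocab, then gather each target token's log-prob
logprobs = F.log_softmax(logits, dim=-1)
token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
print(token_logprobs)  # values near 0.0 mean the model finds the token likely
```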