
The only difference is that one of the parameters passed to DataLoader is of type numpy.array and the other is of type list, yet the DataLoader gives totally different results.

You can use the following code to reproduce it:

from torch.utils.data import DataLoader, Dataset
import numpy as np

class my_dataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

train_data = [[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]]
train_label = [-1, -2, -11, -12]

########################### Look at here:

# Same data, passed as a numpy array
test = DataLoader(dataset=my_dataset(np.array(train_data), train_label), batch_size=2)
for i in test:
    print("numpy data:")
    print(i)
    break

# Same data, passed as a plain Python list
test = DataLoader(dataset=my_dataset(train_data, train_label), batch_size=2)
for i in test:
    print("list data:")
    print(i)
    break

The result is:

numpy data:
[tensor([[1, 2, 3],
        [5, 6, 7]]), tensor([-1, -2])]
list data:
[[tensor([1, 5]), tensor([2, 6]), tensor([3, 7])], tensor([-1, -2])]  

1 Answer


This is because of how batching is handled in torch.utils.data.DataLoader. The collate_fn argument decides how samples from the dataset are merged into a single batch. The default for this argument is the undocumented torch.utils.data.default_collate.

This function handles batching by assuming that numbers/tensors/ndarrays are primitive data to be batched, while lists/tuples/dicts containing these primitives are structure to be (recursively) preserved. This allows you to have semantic batching like this:

  1. (input_tensor, label_tensor) -> (batched_input_tensor, batched_label_tensor)
  2. ([input_tensor_1, input_tensor_2], label_tensor) -> ([batched_input_tensor_1, batched_input_tensor_2], batched_label_tensor)
  3. {'input': input_tensor, 'target': target_tensor} -> {'input': batched_input_tensor, 'target': batched_target_tensor}

(The left-hand side of -> is the output of dataset[i], while the right-hand side is the batched sample from torch.utils.data.DataLoader.)

Your example code corresponds to example 2 above: the list structure is preserved while the ints are batched.
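
To see this behaviour without a DataLoader in the way, you can call the collate function directly on a list of samples. This is only an illustrative sketch; it assumes a recent PyTorch where default_collate is importable from torch.utils.data (older versions only expose it as torch.utils.data._utils.collate.default_collate):

import numpy as np
from torch.utils.data import default_collate

# Two (features, label) samples in each case
samples_np = [(np.array([1, 2, 3]), -1), (np.array([5, 6, 7]), -2)]
samples_list = [([1, 2, 3], -1), ([5, 6, 7], -2)]

# ndarrays are treated as primitives: the two arrays are stacked into one (2, 3) tensor
print(default_collate(samples_np))

# lists are treated as structure to preserve: each position is batched separately,
# giving a list of three length-2 tensors
print(default_collate(samples_list))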


5 Comments

Source: I added support for dicts to DataLoader in this PR.
May I ask what a "mini-batch" is in the official doc: "collate_fn (callable, optional) – merges a list of samples to form a mini-batch"?
May I understand it like this: when it is a numpy array, the "if" here is true, and when it is a list, the "if" here is true?
Is a mini-batch just a batch?
Yep, a mini-batch is just a batch. As explained in my answer, numpy arrays are batched but the list structure is preserved.
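
If the goal is to get the same stacked (batch, 3) tensor from the plain-list data, one option (a sketch of my own, not taken from the answer above) is to convert each sample to a tensor inside __getitem__, so that default_collate sees a primitive it can stack:

import torch
from torch.utils.data import DataLoader, Dataset

class my_tensor_dataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        # Converting here makes each sample a tensor, which default_collate stacks
        return torch.tensor(self.data[index]), self.label[index]

    def __len__(self):
        return len(self.data)

train_data = [[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]]
train_label = [-1, -2, -11, -12]

loader = DataLoader(my_tensor_dataset(train_data, train_label), batch_size=2)
print(next(iter(loader)))  # [tensor([[1, 2, 3], [5, 6, 7]]), tensor([-1, -2])]

A custom collate_fn that calls torch.tensor on the zipped batch would achieve the same result without touching the dataset.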
