Perfoming joins without having all the data in memory

time to read 1 min | 143 words

Probably the easier way to perform a join is by a nested loop, given dataset A and dataset B, joining all the rows to dataset C is simple:

for row_a in A:
	for row_b in B:
		if condition(row_a, row_b):
			add join(row_a, row_b), C

Supporting left/right/cross joins is simple matter from here, but this has the issue of having to have both datasets in memory. I run into it while processing big files, I don't want to have to hold them in memory, especially if I need several level of joins in order to correctly process them.

I thought about bypassing the entire issue by simply writing the data down to a sqlite DB, and doing the join there. While this is possible, I would rather avoid having to do this.

Any suggestions on where to look to solve this issue?