0

I have very large data set it is almost 450000 lines and two rows, i want to compute adjacency matrix using python, because previously i have tried to do it in matlab, and it shows memory error because of large data values. my data values also start from 100 and goes upto 450000,

Anyone can help me in this issue, as i am new to python.

I have to first import the file into python using excel sheet or notepad and then compute the adjacency matrix

7
  • 1
    Are the pairs ("two rows" although you probable meant columns) edge descriptions and you want to produce an adjacency matrix from the list of edges? How many actual vertices are there in the graph? If its 450,000 vertices then you are talking about a matrix with over 200 billion cells! Commented Apr 14, 2010 at 13:29
  • @Andreas An adjacency matrix with 450k vertices would occupy closer to 1.5 TB using doubles. It would be more efficient to use a single bit per edge, but that would still take around 24 GB. Commented Apr 14, 2010 at 13:30
  • @Judge: Yes, I thought that A A had problems storing a 450000 × 2 matrix of doubles, which would only occupy 7.2 MB. Commented Apr 14, 2010 at 15:30
  • 1
    Please provide an example of how to process a small section of your data. The format of the adjacency list is not clear from your description. Commented Apr 14, 2010 at 18:25
  • If you don't mind me asking, why do you need a full adjacency matrix in memory as opposed to the sparse representation of the source data? Is it the CAIDA file itself that you are having trouble reading into MatLab? Commented Apr 16, 2010 at 19:23

3 Answers 3

1

If I understand your question correctly, then you require more memory than is available in RAM. Even with virtual memory, you probably can't allocate a block that big. Therefore, your solution is to write the adjacency matrix to a file as you build it. This method would work in MatLab or Python.


I am assuming you are processing CAIDA's Router-Level Topology Measurements since the format seems to match your description. Each line of this file contains an edge of the graph from one IP router (column 1) to another (column 2). A full adjacency matrix of the 192244 nodes would require 4.3 GB assuming you only use a single bit for each node. I would still suggest writing the matrix directly to a file instead of building it in memory.

Sign up to request clarification or add additional context in comments.

2 Comments

You still haven't explained what the data means. What are those numbers?
0

The simplest way? Well, if you have over 10,000 nodes, but only 45000 edges, use SciPy's sparse matrix:

http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

SciPy provides various compression methods to keep the actual in-memory size of the matrix down (since the matrix values will largely be 0's). I'm sure MatLab provides a space-conscious sparse matrix data structure as well.

If you want to just know how to read in the file, I'd suggest you save it as a CSV or a text file (there is no real benefit in storing the data in an Excel file). Python comes w/ a library for reading/writing CSV files:

http://docs.python.org/library/csv.html

If you really want to use XLS files, then you can use either pyExcelerator (I've never used this) - http://sourceforge.net/projects/pyexcelerator/ - or you can use OpenOffice.org + PyUNO or MS Office + COM.

2 Comments

If my understanding is correct, the file is just pairs of nodes (labeled with integers). You just create a sparse matrix, read the pairs in, and fill out the appropriate cell in the matrix with 1 for each pair.
pyExcelerator is not maintained; use xlrd/xlwt for reading/writing XLS files.
0

I would use defaultdict - it is simple to use, and just a few lines of code. I'm assuming your file looks like

a b
c d

First, put it into a list (http://docs.python.org/2/library/fileinput.html) so that the format is [(a, b),(c,d)].

Then, use defaultdict:

from collections import defaultdict

adjmat = defaultdict(int)
for edge in list:
    adjmat[edge] = 1

adjmat[a, b] will return 1 if the edge exists, 0 otherwise. If you can have multiple edges between nodes, you only need to change that to adjmat[edge] += 1, and adjmat[a, b] will return the number of edges connects a and b

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.