
I have a large file that needs to be loaded into memory, after which some operations are performed on it based on user input. But I don't want to load the file into memory again and again every time there is a user input.

A solution might be to load the data file in one process that acts as the "server", and have a separate client process query that server on the user's behalf.

I am wondering what the best client-server implementation for this is. I know that I could implement an HTTP server, but querying it means following the HTTP protocol, which has too much overhead (in my specific case, the client only needs to send a string to the server, so all the HTTP headers are unnecessary). A lighter solution is preferred. Also, both the client and the server are supposed to run on the same machine, so wouldn't sharing memory be faster than going over the network to pass information between the client and the server?

Actually, the server could just load the data into memory as Python objects; if there is a way to access these Python objects from the client, that would also be fine.

Could anybody offer some advice on the best solution to solve this problem? Thanks.

  • What sort of data is the file? Commented Apr 11, 2019 at 15:17
  • It doesn't matter, as the solution I am looking for should be independent of the data. For example, if the data were just a table, I could simply use sqlite3. That is not what I am looking for. Commented Apr 11, 2019 at 15:20
  • Well, it does matter slightly, so that we can try to figure out a solution that's as optimal as possible for your use case. How do you need to access the data in the client, for instance? A slice of a binary array? All at once? Etc. Commented Apr 11, 2019 at 15:24
  • How about using mmap? docs.python.org/2/library/mmap.html Commented Apr 11, 2019 at 15:26
  • The strings will not be sent all at once; each time only one string is sent. The returned result is a dictionary or list, which can be large, so I don't want the server to compute the result and send it back, because that would result in an extra copy. I'd think directly accessing the objects on the server from the client might be better. Commented Apr 11, 2019 at 15:27

1 Answer


Okay, so based on the comments, the data is keyed by string and values are lists or dictionaries, and the client requests an object by string.

Unfortunately there's no safe, sane way to directly access that sort of data cross-process without some intermediate serialization/deserialization step. An obvious choice, safety concerns aside, is pickling them; msgpack is reasonable too.
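To make the serialization step concrete, here is a small round-trip sketch with both libraries; the sample dict is made up, and the msgpack defaults assumed are those of msgpack-python 1.0 or later:

import pickle

import msgpack

value = {"name": "example", "scores": [1, 2, 3]}  # stand-in for one of your objects

# pickle handles arbitrary Python objects, but never unpickle data from an
# untrusted source -- it can execute arbitrary code.
blob = pickle.dumps(value)
assert pickle.loads(blob) == value

# msgpack only covers plain types (dicts, lists, strings, numbers, bytes),
# but it is compact, fast, and safe to decode.
# (With msgpack >= 1.0 strings round-trip as str; older versions return bytes.)
blob = msgpack.packb(value)
assert msgpack.unpackb(blob) == value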

As for the protocol: if tried-and-tested HTTP is too slow for you, then for a simple request-response cycle like this you could just have the client send the key to retrieve (terminated by a null character, a newline, or whatnot), have the server reply directly with the serialized object, and then close the connection.

You might also want to consider simply storing the serialized data in a database, be it SQLite or something else.
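If that route ever looks attractive, a minimal sketch could be the following; the table name and schema are just placeholders, and note that sqlite3 can also run entirely in memory via ":memory:":

import sqlite3

import msgpack

# ":memory:" keeps the whole database in RAM; use a filename to persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, value BLOB)")


def put(key, obj):
    conn.execute(
        "INSERT OR REPLACE INTO objects (key, value) VALUES (?, ?)",
        (key, msgpack.packb(obj)),
    )
    conn.commit()


def get(key):
    row = conn.execute("SELECT value FROM objects WHERE key = ?", (key,)).fetchone()
    return msgpack.unpackb(row[0]) if row else None


put("42", {"nested": [1, 2, 3]})
print(get("42"))  # {'nested': [1, 2, 3]}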


EDIT: I decided to experiment a little. Here's a small, pretty naive asyncio + msgpack based server + client that does the trick:

server.py

import asyncio
import random
import msgpack
import time
from functools import lru_cache


# Build a randomly sized nested dict to stand in for a large in-memory dataset.
def generate_dict(depth=6, min_keys=1, max_keys=10):
    d = {}
    for x in range(random.randint(min_keys, max_keys)):
        d[x] = (
            generate_dict(
                depth=depth - 1, min_keys=min_keys, max_keys=max_keys
            )
            if depth
            else "foo" * (x + 1)
        )
    return d


# Ten top-level objects keyed by string; these are what clients will request.
DATA = {f"{x}": generate_dict() for x in range(10)}


@lru_cache(maxsize=64)
def get_encoded_data(key):
    # TODO: this does not clear the cache upon DATA being mutated
    return msgpack.packb(DATA.get(key))


# Handle one request: read the key, send back the packed value, then close.
async def handle_message(reader, writer):
    t0 = time.time()
    data = await reader.read(256)
    key = data.decode()
    addr = writer.get_extra_info("peername")
    print(f"Sending key {key!r} to {addr!r}...", end="")
    value = get_encoded_data(key)
    print(f"{len(value)} bytes...", end="")
    writer.write(value)
    await writer.drain()
    writer.close()
    t1 = time.time()
    print(f"{t1 - t0} seconds.")


async def main():
    server = await asyncio.start_server(handle_message, "127.0.0.1", 8888)

    addr = server.sockets[0].getsockname()
    print(f"Serving on {addr}")

    async with server:
        await server.serve_forever()


asyncio.run(main())

client.py

import socket
import msgpack
import time


# Connect, send the key, read the reply until the server closes the socket, decode it.
def get_key(key):
    t0 = time.time()
    s = socket.socket()
    s.connect(("127.0.0.1", 8888))
    s.sendall(str(key).encode())
    buf = []
    while True:
        chunk = s.recv(65535)
        if not chunk:
            break
        buf.append(chunk)
    # strict_map_key=False is needed with msgpack >= 1.0 because the demo data uses int keys
    val = msgpack.unpackb(b"".join(buf), strict_map_key=False)
    t1 = time.time()
    print(key, (t1 - t0))
    return val


t0 = time.time()
n = 0
for i in range(10):
    for x in range(10):
        assert get_key(x)
        n += 1
t1 = time.time()
print("total", (t1 - t0), "/", n, ":", (t1 - t0) / n)

On my Mac,

  • it takes about 0.02814 seconds per message on the receiving end, for a single-consumer throughput of 35 requests per second.
  • it takes about 0.00241 seconds per message on the serving end, for a throughput of 413 requests per second.

(And as you can see from how the DATA is generated, the payloads can be quite large.)

Hope this helps.


4 Comments

SQLite is definitely not the solution, as it involves the disk, and depending on the disk the performance can be slow. Everything must be in memory. There will be computation on the data, and it is more efficient to do the computation on the server and send back the result (which is small). If the client fetched the data directly and did the computation itself, the data received by the client would be large. So any database solution is not appropriate for my problem.
The connection approach is also a concern. I think that in C there is a way to make a piece of data persistent in memory (ipcs and ipcrm can show info about shared memory segments and remove them). Is there a way to make a Python object persistent in memory?
You can use ctypes to call mlockall, but I think you're prematurely optimizing things here. Have you tried SQLite? It can be surprisingly performant.
@user1424739 Please see my edit – I added an example server/client thing.
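Regarding the shared-memory question in the comments above: on Python 3.8+ the multiprocessing.shared_memory module exposes named shared-memory segments (the kind of thing ipcs lists) from pure Python, but it only shares raw bytes, so the objects would still have to be serialized into it. A rough sketch, with a made-up segment name and both sides shown in one script for brevity:

from multiprocessing import shared_memory

import msgpack

payload = msgpack.packb({"1": ["some", "large", "value"]})

# "Server" side: create a named segment and copy the serialized bytes into it.
shm = shared_memory.SharedMemory(create=True, size=len(payload), name="demo_data")
shm.buf[: len(payload)] = payload

# "Client" side: attach to the segment by name and decode the bytes.
other = shared_memory.SharedMemory(name="demo_data")
print(msgpack.unpackb(bytes(other.buf[: len(payload)])))
other.close()

shm.close()
shm.unlink()  # analogous to ipcrm: frees the segment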
