EDIT
> when I imagined the constant lookups with JOINs ... I quickly realized this is simply not viable due to performance issues.
There's no such thing as a performance "issue" if you haven't
clicked a stopwatch and written down your observation.
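For the stopwatch, `time.perf_counter` is enough; a minimal helper sketch (the name and interface here are mine, not something from the OP):

```python
from time import perf_counter


def timed(label, fn, *args, **kwargs):
    """Run fn once, print elapsed wall-clock time, and return its result."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {perf_counter() - start:.3f} s")
    return result
```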
I enclose a PoC, based on your requirements, which
creates a 6 GiB database in less than half an hour
on an ancient laptop with 8 GiB of RAM.
```python
from array import array
from hashlib import sha3_224
from pathlib import Path
from random import shuffle

import pandas as pd
import sqlalchemy as sa
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.schema import PrimaryKeyConstraint
from tqdm import tqdm

Base = declarative_base()


class WorldFact(Base):
    __tablename__ = "world_fact"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    details = Column(String)


class Fact(Base):
    __tablename__ = "fact"

    user_id = Column(Integer)
    fact_id = Column(Integer, ForeignKey("world_fact.id"))

    # Composite primary key, declared at the table level.
    __table_args__ = (PrimaryKeyConstraint("user_id", "fact_id"),)


def create_engine():
    DB_FILE = Path("/tmp/article.db")
    DB_URL = f"sqlite:///{DB_FILE}"
    engine = sa.create_engine(DB_URL)
    return engine


# It takes 1 second to INSERT 188_000 rows into world_facts.
NUM_FACTS = 40_000
NUM_USERS = 10_000


def get_fact_df():
    return pd.DataFrame(
        {
            "name": [
                sha3_224(f"Earth {i}".encode()).hexdigest() for i in range(NUM_FACTS)
            ],
            "details": [f"Loxodonta africana {i}" for i in range(NUM_FACTS)],
        }
    )


def insert_world_facts(fact_df: pd.DataFrame) -> None:
    fact_df.to_sql("world_fact", engine, index=False, if_exists="append")


def insert_user_facts(user_df: pd.DataFrame) -> None:
    user_df.to_sql("fact", engine, index=False, if_exists="append")


def main():
    fact_df = get_fact_df()
    insert_world_facts(fact_df)

    with Session(engine) as session:
        fact_ids = array("I", [row.id for row in session.query(WorldFact.id)])
        for userid in tqdm(range(NUM_USERS)):
            shuffle(fact_ids)
            user_df = pd.DataFrame({"fact_id": fact_ids[: NUM_FACTS // 2]})
            user_df["user_id"] = userid
            insert_user_facts(user_df)


if __name__ == "__main__":
    engine = create_engine()
    Base.metadata.create_all(engine)
    main()
```
There were certain inconsistencies between the OP
and the specification that emerged in the comments,
which I resolved by giving each user a random
50% subset of the global knowledge base.
The PoC INSERTs 200 M rows with throughput greater
than 114 K rows/second, using the most basic RDBMS, SQLite,
and the slowest language, Python.
No tuning.
Go substitute a more mature NoSQL or RDBMS offering
by changing the connection string,
and measure the speedup.
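For example, assuming a reachable PostgreSQL instance and the psycopg2 driver, the only change to the PoC is the URL handed to create_engine; the host, database name, and credentials below are placeholders:

```python
import sqlalchemy as sa

# SQLite baseline used by the PoC.
engine = sa.create_engine("sqlite:////tmp/article.db")

# Hypothetical PostgreSQL target: same PoC code, different connection string.
# engine = sa.create_engine("postgresql+psycopg2://user:password@localhost/article")
```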
Querying a random user is performant:
```python
# Continues from the script above; re-uses engine, Fact, NUM_FACTS, NUM_USERS.
from random import randrange

from sqlalchemy import text


def num_facts_for(user_id: int):
    select = "count(*)"  # or "sum(fact_id)"
    with Session(engine) as session:
        yield from session.query(text(select)).filter(Fact.user_id == user_id)


def main(num_queries=1_000):
    for _ in tqdm(range(num_queries)):
        user_id = randrange(NUM_USERS)
        n, = next(num_facts_for(user_id))
        assert NUM_FACTS / 2 == n
```
This will COUNT() or SUM() all of a random user's
20 K fact IDs in 20 msec (fifty queries per second),
due to sequential I/O against the PK.
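Rather than take my word for the access path, you can ask SQLite for its plan; a sketch (the exact output varies by SQLite version, but it should report a search on the fact table's automatic PK index rather than a full scan):

```python
from sqlalchemy import text

# Print SQLite's query plan for the per-user COUNT.
with engine.connect() as conn:
    plan = conn.execute(
        text("EXPLAIN QUERY PLAN SELECT count(*) FROM fact WHERE user_id = :uid"),
        {"uid": 42},
    )
    for row in plan:
        print(row)
```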
When each random user pulls in world_fact details via a JOIN,
it takes just a little longer, 200 msec, due to random reads
from external storage.
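The post does not show the exact JOIN being timed; one plausible shape against the PoC schema is:

```python
def facts_with_details(user_id: int):
    """Pull every (name, details) pair a given user knows, via a JOIN."""
    with Session(engine) as session:
        return (
            session.query(WorldFact.name, WorldFact.details)
            .join(Fact, Fact.fact_id == WorldFact.id)
            .filter(Fact.user_id == user_id)
            .all()
        )
```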
Such random I/O is a good match for several web queries
executing simultaneously, each returning sub-second
interactive response latencies.
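To observe that behavior under concurrency, a thread pool can stand in for simultaneous web requests; a minimal sketch, assuming the num_facts_for helper above (the pool size and query count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
from random import randrange
from time import perf_counter


def one_request(_):
    """Issue one per-user COUNT query and return its latency in seconds."""
    user_id = randrange(NUM_USERS)
    start = perf_counter()
    next(num_facts_for(user_id))
    return perf_counter() - start


with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(one_request, range(100)))
print(f"worst observed latency: {max(latencies):.3f} s")
```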
If there is a "performance issue" with hosting the proposed
app atop an RDBMS, we have not yet seen such an issue revealed.