EDIT
> when I imagined the constant lookups with JOINs ... I quickly realized this is simply not viable due to performance issues.
There's no such thing as a performance "issue" if you haven't
clicked a stopwatch and written down your observation.
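For the stopwatch, `time.perf_counter` is enough; a minimal helper sketch (the name and interface here are mine, not something from the OP):

```python
from time import perf_counter


def timed(label, fn, *args, **kwargs):
    """Run fn once, print elapsed wall-clock time, and return its result."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {perf_counter() - start:.3f} s")
    return result
```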
I enclose a PoC, based on your requirements, which
creates a 6 GiB database in less than half an hour
on an ancient laptop with 8 GiB of RAM.
```python
from array import array
from hashlib import sha3_224
from pathlib import Path
from random import shuffle

import pandas as pd
import sqlalchemy as sa
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.schema import PrimaryKeyConstraint
from tqdm import tqdm

Base = declarative_base()


class WorldFact(Base):
    __tablename__ = "world_fact"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    details = Column(String)


class Fact(Base):
    __tablename__ = "fact"

    user_id = Column(Integer)
    fact_id = Column(Integer, ForeignKey("world_fact.id"))

    # Composite primary key, declared at the table level.
    __table_args__ = (PrimaryKeyConstraint("user_id", "fact_id"),)


def create_engine():
    DB_FILE = Path("/tmp/article.db")
    DB_URL = f"sqlite:///{DB_FILE}"
    engine = sa.create_engine(DB_URL)
    return engine


# It takes 1 second to INSERT 188_000 rows into world_facts.
NUM_FACTS = 40_000
NUM_USERS = 10_000


def get_fact_df():
    return pd.DataFrame(
        {
            "name": [
                sha3_224(f"Earth {i}".encode()).hexdigest() for i in range(NUM_FACTS)
            ],
            "details": [f"Loxodonta africana {i}" for i in range(NUM_FACTS)],
        }
    )


def insert_world_facts(fact_df: pd.DataFrame) -> None:
    fact_df.to_sql("world_fact", engine, index=False, if_exists="append")


def insert_user_facts(user_df: pd.DataFrame) -> None:
    user_df.to_sql("fact", engine, index=False, if_exists="append")


def main():
    fact_df = get_fact_df()
    insert_world_facts(fact_df)

    with Session(engine) as session:
        fact_ids = array("I", [row.id for row in session.query(WorldFact.id)])
        for userid in tqdm(range(NUM_USERS)):
            shuffle(fact_ids)
            user_df = pd.DataFrame({"fact_id": fact_ids[: NUM_FACTS // 2]})
            user_df["user_id"] = userid
            insert_user_facts(user_df)


if __name__ == "__main__":
    engine = create_engine()
    Base.metadata.create_all(engine)
    main()
```
There were certain inconsistencies between the OP
and the specification that emerged in the comments,
which I resolved by giving each user a random
50% subset of the global knowledge base.
The PoC INSERTs 200 M rows with throughput greater
than 114 K rows/second, using the most basic RDBMS, SQLite,
and the slowest language, Python.
No tuning.
Go substitute a more mature NoSQL or RDBMS offering
by changing the connection string,
and measure the speedup.
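For example, assuming a reachable PostgreSQL instance and the psycopg2 driver, the only change to the PoC is the URL handed to create_engine; the host, database name, and credentials below are placeholders:

```python
import sqlalchemy as sa

# SQLite baseline used by the PoC.
engine = sa.create_engine("sqlite:////tmp/article.db")

# Hypothetical PostgreSQL target: same PoC code, different connection string.
# engine = sa.create_engine("postgresql+psycopg2://user:password@localhost/article")
```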
Querying a random user is performant:
```python
# Continues from the script above; re-uses engine, Fact, NUM_FACTS, NUM_USERS.
from random import randrange

from sqlalchemy import text


def num_facts_for(user_id: int):
    select = "count(*)"  # or "sum(fact_id)"
    with Session(engine) as session:
        yield from session.query(text(select)).filter(Fact.user_id == user_id)


def main(num_queries=1_000):
    for _ in tqdm(range(num_queries)):
        user_id = randrange(NUM_USERS)
        n, = next(num_facts_for(user_id))
        assert NUM_FACTS / 2 == n
```
This will COUNT() or SUM() all of a random user's
20 K fact IDs in 20 msec (fifty queries per second),
due to sequential I/O against the PK.
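Rather than take my word for the access path, you can ask SQLite for its plan; a sketch (the exact output varies by SQLite version, but it should report a search on the fact table's automatic PK index rather than a full scan):

```python
from sqlalchemy import text

# Print SQLite's query plan for the per-user COUNT.
with engine.connect() as conn:
    plan = conn.execute(
        text("EXPLAIN QUERY PLAN SELECT count(*) FROM fact WHERE user_id = :uid"),
        {"uid": 42},
    )
    for row in plan:
        print(row)
```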
When each random user pulls in world_fact details via a JOIN,
it takes just a little longer, 200 msec, due to random reads
from external storage.
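The post does not show the exact JOIN being timed; one plausible shape against the PoC schema is:

```python
def facts_with_details(user_id: int):
    """Pull every (name, details) pair a given user knows, via a JOIN."""
    with Session(engine) as session:
        return (
            session.query(WorldFact.name, WorldFact.details)
            .join(Fact, Fact.fact_id == WorldFact.id)
            .filter(Fact.user_id == user_id)
            .all()
        )
```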
Such random I/O is a good match for several web queries
executing simultaneously, each returning sub-second
interactive response latencies.
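To observe that behavior under concurrency, a thread pool can stand in for simultaneous web requests; a minimal sketch, assuming the num_facts_for helper above (the pool size and query count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
from random import randrange
from time import perf_counter


def one_request(_):
    """Issue one per-user COUNT query and return its latency in seconds."""
    user_id = randrange(NUM_USERS)
    start = perf_counter()
    next(num_facts_for(user_id))
    return perf_counter() - start


with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(one_request, range(100)))
print(f"worst observed latency: {max(latencies):.3f} s")
```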
If there is a "performance issue" with hosting the proposed
app atop an RDBMS, we have not yet seen such an issue revealed.