Environment Setup
I'm working with a distributed JanusGraph architecture deployed on Azure Kubernetes Service (AKS):
Infrastructure:
- AKS Cluster: 2 nodes (16 vCPU, 64 GB RAM each)
- Cassandra: 2 replicas with sharding enabled (Kubernetes pods)
- Elasticsearch: 2 replicas with sharding enabled (Kubernetes pods)
- JanusGraph: Single replica connected to both backends (Kubernetes pod)
- Mixed index: created on the `title` and `nt` property keys (built roughly as sketched after this list)
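For reference, the mixed index was created along these lines. This is a sketch only: the index name, server URL, and the use of a Groovy management script submitted through the Python driver are assumptions, not the exact commands used.

from gremlin_python.driver.client import Client

# Hypothetical: URL, traversal source, and index name are placeholders.
client = Client('ws://janusgraph:8182/gremlin', 'g')
client.submit("""
    mgmt = graph.openManagement()
    title = mgmt.getPropertyKey('title')
    nt = mgmt.getPropertyKey('nt')
    mgmt.buildIndex('titleNtMixed', Vertex.class)
        .addKey(title).addKey(nt)
        .buildMixedIndex('search')   // 'search' = configured ES backend name (assumed)
    mgmt.commit()
""").all().result()
client.close()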
Connection Pool Implementation:
I've implemented thread-safe connection pooling in which each keyspace gets its own cached traversal:
import logging
import threading

from django.conf import settings
from django.views import View
from gremlin_python.driver import serializer
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import Order

logger = logging.getLogger(__name__)


class BaseGremlinClass(View):
    # Class-level cache shared across instances: one connection per keyspace.
    _connections = {}
    _lock = threading.Lock()

    def get_traversal(self, keyspace_name):
        """Get or create a cached traversal for the given keyspace."""
        if keyspace_name not in settings.JANUSGRAPH_KEYSPACES:
            raise ValueError(f"Keyspace {keyspace_name} not found in settings")
        with self._lock:
            if keyspace_name not in self._connections:
                self._create_connection(keyspace_name)
            logger.debug("Getting connection from pool")
            return self._connections[keyspace_name]['traversal']

    def _create_connection(self, keyspace_name):
        """Create a new connection and traversal (called with the lock held)."""
        try:
            config = settings.JANUSGRAPH_KEYSPACES[keyspace_name]
            connection = DriverRemoteConnection(
                config['url'],
                config['graph'],
                message_serializer=serializer.GraphSONSerializersV3d0(),
                timeout=30,
                pool_size=10,
                max_workers=4,
            )
            traversal_g = traversal().withRemote(connection)
            self._connections[keyspace_name] = {
                'connection': connection,
                'traversal': traversal_g,
            }
            logger.info(f"Created connection for keyspace {keyspace_name}")
        except Exception as e:
            logger.error(f"Error creating connection to {keyspace_name}: {e}")
            raise


class GremlinQueries:
    def __init__(self, keyspace_name='main'):
        base = BaseGremlinClass()  # renamed from `traversal` to avoid shadowing the imported factory
        # evaluationTimeout of 0 disables the server-side evaluation timeout
        self.g = base.get_traversal(keyspace_name).with_('evaluationTimeout', 0)
        self.keyspace_name = keyspace_name

    def get_all_nodes_label(self):
        """Return the distinct node types, sorted ascending."""
        return self.g.V().has('nt').values('nt').dedup().order().by(Order.asc).to_list()
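For context, the pool is consumed from ordinary Django views. A minimal usage sketch follows; the view name and response shape are hypothetical:

from django.http import JsonResponse

class NodeLabelsView(View):
    """Hypothetical endpoint returning the distinct node types."""
    def get(self, request):
        queries = GremlinQueries(keyspace_name='main')
        return JsonResponse({'labels': queries.get_all_nodes_label()})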
Performance Issue
Despite having connection pooling, indexing, and sharding implemented, I'm observing:
- First query execution: takes significantly longer (~45 seconds)
- Second query: runs in roughly half the time (~20-25 seconds)
- Subsequent queries: hold steady at the second run's ~20-25 seconds (timed roughly as sketched after this list)
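These timings were observed with a rough harness along the following lines (a sketch; the keyspace name is a placeholder):

import time

# Run the same query back to back; the first run carries any cold-start cost.
queries = GremlinQueries(keyspace_name='main')
for run in range(1, 4):
    start = time.perf_counter()
    queries.get_all_nodes_label()
    print(f"Run {run}: {time.perf_counter() - start:.1f}s")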
Questions
- Why does the first Gremlin query take significantly longer than subsequent runs in a Kubernetes environment, even with connection pooling and indexing?
- What Kubernetes-specific factors might be contributing to this cold start behavior?
- What optimisations can be implemented to reduce the first-time latency in a containerised distributed setup?
- Are there specific considerations for sharded Cassandra/Elasticsearch deployments on Kubernetes that could impact initial query performance?
What I've Tried
- Verified that connection pooling is working and connections are reused (see the sketch after this list)
- Confirmed mixed indexes are properly created and being used
- Checked that subsequent queries with the same or different parameters show improved performance
- Monitored that the connection pool prevents reconnection overhead
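Reuse was verified roughly as below (assumption: object identity of the cached traversal is a sufficient proxy for connection reuse):

base = BaseGremlinClass()
t1 = base.get_traversal('main')
t2 = base.get_traversal('main')
# The pool should return the same cached traversal, so no new websocket is opened.
assert t1 is t2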