Environment Setup
I'm working with a distributed JanusGraph architecture deployed on Azure Kubernetes Service (AKS):
Infrastructure:
- AKS Cluster: 2 nodes (16 vCPU, 64 GB RAM each)
- Cassandra: 2 replicas with sharding enabled (Kubernetes pods)
- Elasticsearch: 2 replicas with sharding enabled (Kubernetes pods)
- JanusGraph: Single replica connected to both backends (Kubernetes pod)
- Mixed index: created on the `title` and `nt` property keys (built roughly as sketched after this list)
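For reference, the mixed index was created along these lines. This is a sketch only: the index name, server URL, and the use of a Groovy management script submitted through the Python driver are assumptions, not the exact commands used.

from gremlin_python.driver.client import Client

# Hypothetical: URL, traversal source, and index name are placeholders.
client = Client('ws://janusgraph:8182/gremlin', 'g')
client.submit("""
    mgmt = graph.openManagement()
    title = mgmt.getPropertyKey('title')
    nt = mgmt.getPropertyKey('nt')
    mgmt.buildIndex('titleNtMixed', Vertex.class)
        .addKey(title).addKey(nt)
        .buildMixedIndex('search')   // 'search' = configured ES backend name (assumed)
    mgmt.commit()
""").all().result()
client.close()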
Connection Pool Implementation:
I've implemented thread-safe connection pooling in which each keyspace gets its own cached traversal:
import logging
import threading

from django.conf import settings
from django.views import View
from gremlin_python.driver import serializer
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import Order

logger = logging.getLogger(__name__)


class BaseGremlinClass(View):
    # Class-level cache shared across instances: one connection per keyspace.
    _connections = {}
    _lock = threading.Lock()

    def get_traversal(self, keyspace_name):
        """Get or create a cached traversal for the given keyspace."""
        if keyspace_name not in settings.JANUSGRAPH_KEYSPACES:
            raise ValueError(f"Keyspace {keyspace_name} not found in settings")
        with self._lock:
            if keyspace_name not in self._connections:
                self._create_connection(keyspace_name)
            logger.debug("Getting connection from pool")
            return self._connections[keyspace_name]['traversal']

    def _create_connection(self, keyspace_name):
        """Create a new connection and traversal (called with the lock held)."""
        try:
            config = settings.JANUSGRAPH_KEYSPACES[keyspace_name]
            connection = DriverRemoteConnection(
                config['url'],
                config['graph'],
                message_serializer=serializer.GraphSONSerializersV3d0(),
                timeout=30,
                pool_size=10,
                max_workers=4,
            )
            traversal_g = traversal().withRemote(connection)
            self._connections[keyspace_name] = {
                'connection': connection,
                'traversal': traversal_g,
            }
            logger.info(f"Created connection for keyspace {keyspace_name}")
        except Exception as e:
            logger.error(f"Error creating connection to {keyspace_name}: {e}")
            raise


class GremlinQueries:
    def __init__(self, keyspace_name='main'):
        base = BaseGremlinClass()  # renamed from `traversal` to avoid shadowing the imported factory
        # evaluationTimeout of 0 disables the server-side evaluation timeout
        self.g = base.get_traversal(keyspace_name).with_('evaluationTimeout', 0)
        self.keyspace_name = keyspace_name

    def get_all_nodes_label(self):
        """Return the distinct node types, sorted ascending."""
        return self.g.V().has('nt').values('nt').dedup().order().by(Order.asc).to_list()
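For context, the pool is consumed from ordinary Django views. A minimal usage sketch follows; the view name and response shape are hypothetical:

from django.http import JsonResponse

class NodeLabelsView(View):
    """Hypothetical endpoint returning the distinct node types."""
    def get(self, request):
        queries = GremlinQueries(keyspace_name='main')
        return JsonResponse({'labels': queries.get_all_nodes_label()})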
Performance Issue
Despite having connection pooling, indexing, and sharding implemented, I'm observing:
- First query execution: takes significantly longer (~45 seconds)
- Second query: runs in roughly half the time (~20-25 seconds)
- Subsequent queries: hold steady at the second run's ~20-25 seconds (timed roughly as sketched after this list)
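These timings were observed with a rough harness along the following lines (a sketch; the keyspace name is a placeholder):

import time

# Run the same query back to back; the first run carries any cold-start cost.
queries = GremlinQueries(keyspace_name='main')
for run in range(1, 4):
    start = time.perf_counter()
    queries.get_all_nodes_label()
    print(f"Run {run}: {time.perf_counter() - start:.1f}s")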
Questions
- Why does the first Gremlin query take significantly longer than subsequent runs in a Kubernetes environment, even with connection pooling and indexing?
- What Kubernetes-specific factors might be contributing to this cold start behavior?
- What optimisations can be implemented to reduce the first-time latency in a containerised distributed setup?
- Are there specific considerations for sharded Cassandra/Elasticsearch deployments on Kubernetes that could impact initial query performance?
What I've Tried
- Verified that connection pooling is working and connections are reused (see the sketch after this list)
- Confirmed mixed indexes are properly created and being used
- Checked that subsequent queries with the same or different parameters show improved performance
- Monitored that the connection pool prevents reconnection overhead
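Reuse was verified roughly as below (assumption: object identity of the cached traversal is a sufficient proxy for connection reuse):

base = BaseGremlinClass()
t1 = base.get_traversal('main')
t2 = base.get_traversal('main')
# The pool should return the same cached traversal, so no new websocket is opened.
assert t1 is t2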