Questions tagged [cluster]
Discussion related to cluster mechanisms.
219 questions
1
vote
0
answers
27
views
OCFS2: link between cluster and device?
I have 2 servers (Debian 12) that use a storage disk (SD).
Both see this SD as a device via fdisk. I have no details about the storage-device itself or the connection type - for me it is just a ...
0
votes
0
answers
54
views
Adding a New Server to Existing Proxmox Cluster - Network Configuration and VM Communication
I’m looking for some guidance on expanding my Proxmox setup. Here’s my current setup and what I’m trying to achieve:
Current Setup
I have a dedicated OVH server running Proxmox.
On this server, I ...
0
votes
0
answers
34
views
Disable read-ahead caching for GFS2 Logical Volume
I have a 10-node deployment that uses the Red Hat clustering software (Pacemaker/Corosync) to mount GFS2 and ensure high availability. The nodes are mail servers and use GFS2 to store users' data ...
0
votes
0
answers
31
views
How to set new features to N during kernel compilation from an old .config file?
I am compiling a custom Linux kernel for a compute cluster. The cluster has been running kernel version 4.4.47 for the last 5 years. I need to upgrade the kernel to a more recent version. I've ...
1
vote
1
answer
424
views
Need a method for managing systemd services across multiple hosts
I have six Linux servers running RHEL 8.6 and need to ensure that one specific service is running on at least one and at most one of those six servers.
Does systemd support something like this?
If not, ...
1
vote
0
answers
23
views
Script/Daemon to kill specific resource-consuming tools?
I'm working on an SGE Linux cluster and beginners often run memory/resource-consuming tools on the login node instead of using qsub or qlogin ( https://gridscheduler.sourceforge.net/htmlman/htmlman1/...
1
vote
1
answer
135
views
qsub-like behavior for a Slurm cluster
I recently switched to Slurm and am looking for a job submission tool that behaves similarly to qsub:
It takes input through a pipe
It prints the output to stdout
Example:
for n in `seq 1 10`; do
...
0
votes
2
answers
560
views
Unable to install Slurm on PC
I am trying to install Slurm on an Ubuntu PC, so I followed the instructions given here.
I did the following:
sudo apt update -y
sudo apt install slurmd slurmctld -y
mkdir sudo /etc/slurm-...
1
vote
0
answers
65
views
Shell script looking for a missing module
I want to run a shell script on a compute cluster but I get an error because at some point it is looking for a module that does not exist since a major update on the cluster a few months ago. This ...
0
votes
1
answer
60
views
Running arbitrary binary program with cluster computers
I have 3 VPS. Let's say master, slave1, slave2.
Their specifications are identical.
Processor: 1CPU
Memory: 1GB
Disk: 10GB
Network: all connected to each other over a LAN
I expect any arbitrary binary program (...
1
vote
1
answer
177
views
Can I fully utilize HDR Infiniband network throughput between servers and NFS volume?
I'm working on a project building a CPU cluster, and those servers and NFS storage (not a parallel file system) are going to be connected through HDR InfiniBand cables. In this architecture, can I get ...
1
vote
1
answer
56
views
Unable to run linpack on head node of cluster
I recently set up my own home cluster of 4 Raspberry Pi units, but I am having problems trying to benchmark all 4 units using Linpack.
One node is the head node called rpislave1, it connects to the ...
2
votes
1
answer
368
views
How to set up a bunch of Linux servers with a shared file system without using a job scheduler?
I am managing multiple GPU servers in our lab, which are mainly used for deep learning tasks. We would like these machines to share the same file system, so it is easier to switch between them.
...
0
votes
0
answers
83
views
Proper way to design filesystems structure for a cluster of diskless nodes
I'm trying to learn the basics of Linux clustering so I started designing a really humble cluster:
6 worker nodes (Libre Computer La Frite | Cortex-A53 @ 1.2 GHz | 1GB RAM)
1 master node (Raspberry ...
0
votes
1
answer
282
views
Remove internet access without losing LAN
I have a small cluster (all nodes run Debian 10) and need to remove the internet connections of all slave nodes. The internet cable connection connects to a computer that acts as a firewall, then, ...
1
vote
1
answer
601
views
Transferring very large dataset from cluster to a storage server
We have to move a very large dataset (petabytes in size) from an HPC cluster to a storage server. We have a high-capacity communication link between the devices. However, the bottleneck seems to be a fast ...
-1
votes
1
answer
782
views
IBM AIX - Method to identify Cluster or HA services
I would like to find out whether existing IBM AIX servers at different locations have clustering/HA features. Kindly let me know the steps to check. Thanks.
2
votes
1
answer
5k
views
Running multiple SLURM jobs on the same GPU
So, I am by no means a sysadmin, but I need to use an existing SLURM installation to launch a sizable number of jobs (around 5000).
The cluster is composed of 1 node with 10 GPUs (with 8GB of memory ...
0
votes
1
answer
421
views
How to tell if a VG is clustered?
I have a CentOS 7 Pacemaker cluster with GFS2 filesystems mounted. I'm fairly certain that vgchange -cy vg_name was NOT run during setup. I tried running vgchange --test -cy vg_name and it tells me ...
0
votes
0
answers
64
views
NFS shared file system that uses all PCs' hard drives
I'm building a small cluster with desktop computers that run Debian 11.
I want a shared /home directory where all users' files are located. I know that the ideal way to do this is to have a master ...
0
votes
0
answers
335
views
How to update computing nodes in a cluster?
I am in the process of assembling a small computing cluster. It will run Ubuntu with Slurm for job scheduling. Only the head node will be connected to the Internet; all others will be accessible from the local ...
1
vote
0
answers
121
views
Will execution of many parallel instances of an application from NFS drive affect performance?
I am assembling a simple computing cluster under Ubuntu 20.04 and Slurm as a job scheduler.
The cluster will be primarily used for quantum-chemical calculations, so, as a rule, each job will run its ...
0
votes
1
answer
172
views
Does the size of a file system cluster have to be an even number of bytes?
Basically, could we have a file system whose clusters are an odd number of bytes in size? Why is everything even? Thanks
0
votes
0
answers
1k
views
What is "quorum-manager" under Designation in GPFS Cluster Information?
I get this block of information under my GPFS cluster information when I execute /usr/lpp/mmfs/bin/mmlscluster and I can't find documentation on what the Designation actually means. Does quorum-...
0
votes
0
answers
529
views
Sharing ssh keys between cluster nodes
I have a cluster with several login nodes and many compute nodes (call it the cluster). Then I have another server with a large shared storage (call it the storage). I need to be able to rsync (i.e. ...
-1
votes
1
answer
52
views
Sync sudo authority to all nodes
I want to submit a task that is interpreted by /bin/csh, which only exists on the master node. I have no root permission, only sudo, which is limited to the master node. So I can't use sudo apt install ...
1
vote
0
answers
108
views
Jetson Nano Picocluster not accessible from network
I recently received a Picocluster with five Jetson Nano boards running MicroK8s. The cluster has a built-in switch, which I know works as I can route my own network traffic through it just fine. All ...
1
vote
0
answers
182
views
Is it possible to mount an ATA over Ethernet (AoE) block device on multiple clients? If so, how?
I have a lab consisting of 3 machines connected with two 10GbE links on 2 segregated networks. Each device has 100 TB of block storage connected to it. I want to use ATA over Ethernet to create a storage ...
0
votes
1
answer
1k
views
How to create an OCFS2 filesystem in Ubuntu?
Can someone please walk me through the step-by-step process of configuring an OCFS2 filesystem, starting from splitting an existing partition? When I tried, I saw the error below:
mount.ocfs2: ...
0
votes
0
answers
47
views
Why can a non-root Python installation work across the whole cluster?
I recently installed Anaconda (which includes Python 3) locally in my account folder on a cluster with a dozen nodes (each node with several cores). I use it to install some package P that is used ...
1
vote
1
answer
2k
views
Debian 10 Pacemaker-Cluster: GFS2 Mount fails because of "Global lock failed: check that global lockspace is started."
I'm trying to set up a new Debian 10 cluster with three instances. My stack is based on pacemaker, corosync, dlm, and lvmlockd with a GFS2 volume. All servers have access to the GFS2 volume but I can't ...
1
vote
1
answer
2k
views
How can I submit multiple R jobs at once?
I have an R script which processes multiple files, say file=1 to 50.
I usually submit repeated jobs, say 5 times with 10 files each time, by changing the number in the R script.
So, how can I submit the 5 jobs at ...
0
votes
1
answer
27
views
What are the existing open source tools to develop an on-premise organizational app store on Linux?
We have a Linux cluster in our organization and my data science team is developing a number of ML projects to be utilized by teams across the organization. To enable the teams to access the ML models, ...
1
vote
1
answer
523
views
Where are libvirt's VM definitions "originals" stored, and how to sync them across multiple nodes?
Migrating from Xen's xm to Xen's xl under control of libvirt, I wonder:
Where does libvirt store the "originals" of VM configurations?
I found that my PVM configurations are stored in /etc/...
2
votes
1
answer
1k
views
Python error only when I run script on Linux cluster: _tkinter.TclError: no display name and no $DISPLAY environment variable
My question is related to a Python error, but I suspect that it is more a Linux question than a Python one, so I am posting it here first.
I am running a Python script which does a calculation and then ...
0
votes
0
answers
3k
views
How can I cancel all waiting jobs with qsub?
I am running a lot of jobs with qsub: some are running, some are waiting. Is there a way to cancel all the jobs for a given user which are queued/waiting without giving the individual job IDs?
2
votes
1
answer
2k
views
How to set up passwordless authentication in a cluster where the users' /home directory from the headnode is mounted on /home of all machines in the cluster
First of all, thank you in advance for your help.
I hope the title makes sense. Basically, on the headnode the users' home directories (e.g. headnode:/home/eric) are NFS-shared and mounted to all the ...
0
votes
1
answer
2k
views
Pacemaker apache resource reports "Failed to access httpd status page" after change to HTTPS
I get this error from Pacemaker after I changed Apache from HTTP to HTTPS.
Now my ocf::heartbeat:apache resource cannot find the status page.
I generated SSL certificates separately for the 3 servers.
Everything ...
2
votes
1
answer
2k
views
Simple cp command not working (inside shell script and submitted using sbatch on a cluster)
I am working on a cluster running RHEL and I submit jobs using the following command.
sbatch MyScript.sh
The contents of MyScript.sh are as follows.
#!/bin/sh
# ....
# Other SBATCH related commands ...
1
vote
1
answer
303
views
Compiling HPL-2.0_FERMIv15
When I compile xhpl I always get the error message:
./xhpl: error while loading shared libraries: libdgemm.so.1: cannot open shared object file: No such file or directory
When I type ldd xhpl:
linux-...
1
vote
0
answers
612
views
Cannot seem to start pcs cluster (NFS Cluster) disk_fencing trouble
For the life of me, I can't find a clear answer on how to start my NFS active / passive cluster. I have two nodes, node1 and node2 and followed the guide here: https://www.linuxtechi.com/configure-nfs-...
0
votes
1
answer
4k
views
DRBD: "Couldn't mount device [/dev/drbd0] as /mydata" when failing over or rebooting node
I'm creating a cluster system using two ESXi hosts, with a CentOS 7 server on each.
Going through I created the filesystem, and it mounts on node1.
When I perform a standby or reboot from node01 to ...
0
votes
2
answers
4k
views
How to Remove caavg_private Properly on AIX?
I am trying to clean up a server which had a PowerHA configuration. I have stopped the cluster (smitty clstop) and removed the resource groups. How do I remove caavg_private properly?
hdisk5 ...
0
votes
1
answer
131
views
RHEL 6.9 HA clustering issue with one node completely down
I have two server nodes with SAN storage.
Each node runs RHEL 6.9 with HA, and the partitions are mapped from the storage using fiber cables with clustered resources.
The thing is, when the two nodes ...
0
votes
1
answer
360
views
Trying to create an AFM relationship using the GPFS protocol: error on the cache-side cluster
I am trying to create an AFM relationship using the GPFS protocol, but I am getting an error on the cache-side cluster.
Steps on the home cluster:
1) Create a home cluster (cluster name - gpfs01).
2) Create a file ...
1
vote
0
answers
320
views
OpenLDAP Cluster
Trying to implement an OpenLDAP cluster, I already managed to set up the two backend LDAP servers in mirroring mode.
The application (iRedMail) using the LDAP service is running on the same systems ...
0
votes
0
answers
986
views
How does htop calculate times? (Impossible times shown)
I was running what was supposed to be a small make job on a small node in our cluster, but a sorting process seemed to be overwhelming the allocated RAM, so I killed it after 20 hours (same jobs on ...
0
votes
1
answer
106
views
How to move an fs resource into a service group in a ccs cluster
I have a ccs cluster running on RHEL 6.4 where there is no luci service. I have added a filesystem resource to the cluster with the command below, but I need to move the resource inside a service group....
0
votes
1
answer
623
views
How to set up a compute cluster in a local network on top of Linux
I have 3 machines in my local network running Manjaro. I am running Python scripts using dask, pandas, etc. which max out the CPU on the first machine and I usually need to wait more than 30 min until ...
0
votes
2
answers
208
views
Why are worker nodes not fault-tolerant?
Swarm manager nodes handle cluster management tasks such as:
1) Maintaining cluster state
2) Scheduling services
3) Serving swarm mode HTTP API endpoints
You may execute any of the - docker ...