Newest 'cluster' Questions

1 vote

0 answers

27 views

Ocfs2: link between cluster and device?

I have 2 servers (Debian 12) that use a storage disk (SD). Both see this SD as a device via fdisk. I have no details about the storage device itself or the connection type - for me it is just a ...

chris01

1,039

asked Jun 5 at 3:28

0 votes

0 answers

54 views

Adding a New Server to Existing Proxmox Cluster - Network Configuration and VM Communication

I’m looking for some guidance on expanding my Proxmox setup. Here’s my current setup and what I’m trying to achieve: Current Setup I have a dedicated OVH server running Proxmox. On this server, I ...

Zakaria Ait Yakoub

21

asked Feb 18 at 20:06

0 votes

0 answers

34 views

Disable read-ahead caching for GFS2 Logical Volume

I have a 10-node deployment that uses Red Hat clustering software (pacemaker/corosync) to mount gfs2 and ensure high availability. The nodes are mail servers and use gfs2 to store users' data ...

brchelli26

1

asked Feb 17 at 8:29

0 votes

0 answers

31 views

How to set new features to N during kernel compilation from an old .config file?

I am compiling a custom Linux kernel for a compute cluster. The cluster has been running kernel version 4.4.47 for the last 5 years. I need to upgrade the kernel to a more recent version. I've ...

Sâu

101

asked Feb 10 at 12:37

1 vote

1 answer

424 views

Need a method for managing systemd services across multiple hosts

I have six Linux servers running RHEL 8.6 and need to ensure that one specific service is running on at least one and at most one of those six servers. Does systemd support something like this? If not, ...

The Programmer

13

asked Sep 18, 2024 at 17:34

1 vote

0 answers

23 views

Script/Daemon to kill specific resource-consuming tools?

I'm working on an SGE Linux cluster, and beginners often run memory/resource-consuming tools on the login node instead of using qsub or qlogin ( https://gridscheduler.sourceforge.net/htmlman/htmlman1/...

Pierre

1,793

asked Mar 28, 2024 at 10:40

1 vote

1 answer

135 views

qsub-like behavior for a slurm cluster

I recently switched to Slurm and am looking for a job submission tool that behaves similarly to qsub: it takes input through a pipe, and it prints the output to stdout. Example: for n in `seq 1 10`; do ...

LazyCat

188

asked Mar 6, 2024 at 2:01

0 votes

2 answers

560 views

Unable to install Slurm on PC

I am trying to install Slurm on an Ubuntu PC, so I followed the instructions given over here. I did the following - sudo apt update -y sudo apt install slurmd slurmctld -y mkdir sudo /etc/slurm-...

desert_ranger

103

asked Mar 5, 2024 at 1:51

1 vote

0 answers

65 views

Shell script looking for a missing module

I want to run a shell script on a compute cluster but I get an error because at some point it is looking for a module that does not exist since a major update on the cluster a few months ago. This ...

Seb

11

asked Feb 5, 2024 at 10:53

0 votes

1 answer

60 views

Running arbitrary binary program with cluster computers

I have 3 VPSes; let's call them master, slave1, and slave2. Their specifications are identical. Processor: 1 CPU, Memory: 1GB, Disk: 10GB, Network: all on the same LAN. I expect any arbitrary binary program (...

Muhammad Ikhwan Perwira

339

asked Nov 2, 2023 at 19:12

1 vote

1 answer

177 views

Can I fully utilize HDR Infiniband network throughput between servers and NFS volume?

I'm working on a project building a CPU cluster, and those servers and NFS storage (not a parallel file system) are going to be connected through HDR InfiniBand cables. In this architecture, can I get ...

Antenna_

35

asked Aug 17, 2023 at 10:20

1 vote

1 answer

56 views

Unable to run linpack on head node of cluster

I recently set up my own home cluster of 4 Raspberry Pi units, but I am having problems trying to benchmark all 4 units using Linpack. One node is the head node, called rpislave1; it connects to the ...

AlexChan

21

asked Jul 26, 2023 at 10:37

2 votes

1 answer

368 views

How to set up a bunch of Linux servers with a shared file system without using a job scheduler?

I am managing multiple GPU servers in our lab, which are mainly used for deep learning tasks. We would like these machines to share the same file system, so it is easier to switch between them. ...

x.y.z liu

43

asked Apr 13, 2023 at 21:18

0 votes

0 answers

83 views

Proper way to design filesystems structure for a cluster of diskless nodes

I'm trying to learn the basics of Linux clustering so I started designing a really humble cluster: 6 worker nodes (Libre Computer La Frite | Cortex-A53 @ 1.2 GHz | 1GB RAM) 1 master node (Raspberry ...

phreq

1

asked Mar 31, 2023 at 16:58

0 votes

1 answer

282 views

Remove internet access without losing LAN

I have a small cluster (all nodes run Debian 10) and need to remove the internet connections of all slave nodes. The internet cable connection connects to a computer that acts as a firewall, then, ...

Carlos Andrés del Valle

111

asked Mar 28, 2023 at 20:23

1 vote

1 answer

601 views

Transferring very large dataset from cluster to a storage server

We have to move a very large data set (petabytes) from an HPC cluster to a storage server. We have a high-capacity communication link between the devices. However, the bottleneck seems to be a fast ...

Ikram Ullah

113

asked Mar 8, 2023 at 14:45

-1 votes

1 answer

782 views

IBM AIX - Method to identify Cluster or HA services

I would like to find out whether existing IBM AIX servers at different locations have clustering/HA features. Please let me know the steps to check. Thanks.

Nick eric adelee

49

asked Dec 5, 2022 at 9:01

2 votes

1 answer

5k views

Running multiple SLURM jobs on the same GPU

So, I am by no means a sysadmin, but I need to use an existing SLURM installation to launch a sizable number of jobs (around 5000). The cluster is composed of 1 node with 10 GPUs (with 8GB of memory ...

jacky la mouette

123

asked Nov 25, 2022 at 20:02

0 votes

1 answer

421 views

How to tell if a VG is clustered?

I have a CentOS 7 Pacemaker cluster with GFS2 filesystems mounted. I'm fairly certain that vgchange -cy vg_name was NOT run during setup. I tried running vgchange --test -cy vg_name and it tells me ...

ex_submariner

1

asked Sep 22, 2022 at 17:56

0 votes

0 answers

64 views

NFS shared file system that uses all PCs' hard drives

I'm building a small cluster with desktop computers that run Debian 11. I want a shared /home directory where all users' files are located. I know that the ideal way to do this is to have a master ...

Carlos Andrés del Valle

111

asked Jul 31, 2022 at 0:09

0 votes

0 answers

335 views

How to update computing nodes in a cluster?

I am in the process of assembling a small computing cluster. It will run Ubuntu with Slurm for job scheduling. Only the head node will be connected to the Internet; all others will be accessible from the local ...

FNS

11

asked Mar 21, 2022 at 1:48

1 vote

0 answers

121 views

Will execution of many parallel instances of an application from NFS drive affect performance?

I am assembling a simple computing cluster under Ubuntu 20.04 and Slurm as a job scheduler. The cluster will be primarily used for quantum-chemical calculations, so, as a rule, each job will run its ...

FNS

11

asked Mar 21, 2022 at 1:23

0 votes

1 answer

172 views

Does the size of a file system cluster have to be even bytes?

Basically, could we have a file system with odd byte size clusters? Why is everything even? Thanks

pushandpop

1,446

asked Feb 4, 2022 at 8:37

0 votes

0 answers

1k views

What is "quorum-manager" under Designation in GPFS Cluster Information?

I get this block of information under my GPFS cluster information when I execute /usr/lpp/mmfs/bin/mmlscluster and I can't find documentation on what the Designation actually means. Does quorum-...

IceTea

121

asked Nov 17, 2021 at 3:25

0 votes

0 answers

529 views

Sharing ssh keys between cluster nodes

I have a cluster with several login nodes and many compute nodes (call it the cluster). Then I have another server with a large shared storage (call it the storage). I need to be able to rsync (i.e. ...

Botond

135

asked Oct 1, 2021 at 15:56

-1 votes

1 answer

52 views

Sync sudo authority to all nodes

I want to submit a task that is interpreted by /bin/csh, which only exists on the master node. I have no root permission, only sudo, which is limited to the master node. So I can't use sudo apt install ...

Zhihui

1

asked Sep 24, 2021 at 11:18

1 vote

0 answers

108 views

Jetson Nano Picocluster not accessible from network

I recently received a Picocluster with five Jetson Nano boards running MicroK8S. The cluster has a built-in switch, which I know works as I can route my own network traffic through it just fine. All ...

Thijs van der Heijden

111

asked Sep 7, 2021 at 9:41

1 vote

0 answers

182 views

Is it possible to mount an ATA over Ethernet (AOE) block device on multiple clients, if so how?

I have a lab consisting of 3 machines connected with 2 10GbE links on 2 segregated networks. Each device has 100TB of block storage connected to it. I want to use ATA over Ethernet to create a storage ...

Tim

111

asked Sep 6, 2021 at 19:22

0 votes

1 answer

1k views

How to create an ocfs2 filesystem in Ubuntu?

Can someone please walk me through the step-by-step process of configuring an ocfs2 filesystem, starting from splitting an existing partition? When I tried, I saw the error below: mount.ocfs2: ...

Divija Gogineni

21

asked Sep 6, 2021 at 2:08

0 votes

0 answers

47 views

Why can a non-root Python installation work across the whole cluster?

I recently installed anaconda (which includes a python3) locally in my account folder on a cluster with a dozen nodes (each node with several cores). I use it to install some package P that is used ...

xiaohuamao

121

asked Aug 24, 2021 at 14:36

1 vote

1 answer

2k views

Debian 10 Pacemaker-Cluster: GFS2 Mount fails because of "Global lock failed: check that global lockspace is started."

I'm trying to set up a new Debian 10 cluster with three instances. My stack is based on pacemaker, corosync, dlm, and lvmlockd with a GFS2 volume. All servers have access to the GFS2 volume but I can't ...

Me7e0r

11

asked Jun 23, 2021 at 10:53

1 vote

1 answer

2k views

How can I submit multiple R jobs at once?

I have an R script which processes multiple files, say file=1 to 50. I usually submit repeated jobs, say 5 times with 10 files each, by changing the number in the R script. So, how can I submit the 5 jobs at ...

b_takhel

21

asked May 22, 2021 at 19:54

0 votes

1 answer

27 views

What are the existing open source tools to develop an on-premise organizational app store on Linux?

We have a Linux cluster in our organization and my data science team is developing a number of ML projects to be utilized by teams across the organization. To enable the teams to access the ML models, ...

kosmos

101

asked Apr 10, 2021 at 8:53

1 vote

1 answer

523 views

Where are libvirt's VM definitions "originals" stored, and how to sync them across multiple nodes?

Migrating from Xen's xm to Xen's xl under control of libvirt, I wonder: Where does libvirt store the "originals" of VM configurations? I found that my PVM configurations are stored in /etc/...

U. Windl

1,775

asked Feb 17, 2021 at 12:08

2 votes

1 answer

1k views

Python error only when I run script on Linux cluster: _tkinter.TclError: no display name and no $DISPLAY environment variable

My question is related to a python error, but I suspect that it is more a Linux question than a python one. Thus I post it first here. I am running a python script which does a calculation and then ...

Britzel

165

asked Feb 2, 2021 at 23:04

0 votes

0 answers

3k views

How can I cancel all waiting jobs with qsub?

I am running a lot of jobs with qsub: some are running, some are waiting. Is there a way to cancel all the jobs for a given user which are queued/waiting without giving the individual job IDs?

user443699

133

asked Dec 13, 2020 at 21:12

2 votes

1 answer

2k views

How to set up passwordless authentication in a cluster where users' /home directories from the headnode are mounted on /home of all machines in the cluster

First of all thank you in advance for your help. I hope the title makes sense. Basically, on the headnode the users' home directories (e.g. headnode:/home/eric) are NFS-shared and mounted to all the ...

Eric Alemany

33

asked Sep 29, 2020 at 15:03

0 votes

1 answer

2k views

Pacemaker apache resource reports "Failed to access httpd status page" after change to HTTPS

I get this error from Pacemaker after I changed Apache from HTTP to HTTPS. Now my ocf::heartbeat:apache resource cannot find the status page. I generated SSL certificates separately for the 3 servers. Everything ...

Karippery

1

asked Sep 21, 2020 at 15:04

2 votes

1 answer

2k views

Simple cp command not working (inside shell script and submitted using sbatch on a cluster)

I am working on a cluster running RHEL and I submit jobs using the following command. sbatch MyScript.sh The contents of MyScript.sh are as below. #!/bin/sh # .... # Other SBATCH related commands ...

Amit

123

asked Sep 14, 2020 at 7:37

1 vote

1 answer

303 views

Compiling HPL-2.0_FERMIv15

When I compile xhpl I always get the error message: ./xhpl: error while loading shared libraries: libdgemm.so.1: cannot open shared object file: No such file or directory. When I type ldd xhpl: linux-...

Tim Tonic

11

asked Jun 21, 2020 at 22:08

1 vote

0 answers

612 views

Cannot seem to start pcs cluster (NFS Cluster) disk_fencing trouble

For the life of me, I can't find a clear answer on how to start my NFS active / passive cluster. I have two nodes, node1 and node2 and followed the guide here: https://www.linuxtechi.com/configure-nfs-...

jasontt33

11

asked May 25, 2020 at 10:49

0 votes

1 answer

4k views

DRBD: "Couldn't mount device [/dev/drbd0] as /mydata" when failing over or rebooting node

I'm creating a cluster system using two ESXi hosts, with a CentOS 7 server on each. Going through I created the filesystem, and it mounts on node1. When I perform a standby or reboot from node01 to ...

markb

143

asked May 7, 2020 at 6:34

0 votes

2 answers

4k views

How to Remove caavg_private Properly on AIX?

I am trying to clean up a server which had a PowerHA configuration. I have stopped the cluster (smitty clstop) and removed the resource groups. How do I remove the caavg_private properly? hdisk5 ...

RJ Gellangarin

11

asked Apr 1, 2020 at 7:56

0 votes

1 answer

131 views

RHEL 6.9 HA clustering issue with one node completely down

I have two server nodes with SAN storage. Each node has RHEL 6.9 with HA, and the partitions are mapped from the storage using fiber cables with clustered resources. The thing is, when the two nodes ...

user65285

1

asked Feb 21, 2020 at 14:18

0 votes

1 answer

360 views

Trying to Create an AFM relationship by using GPFS protocol. Having error in cache side cluster

I am trying to create an AFM relationship using the GPFS protocol, but I get an error on the cache-side cluster. Steps on the home cluster: 1) Create a home cluster (cluster name - gpfs01). 2) Create a file ...

pratiksha chavan

11

asked Jan 27, 2020 at 8:52

1 vote

0 answers

320 views

OpenLDAP Cluster

Trying to implement an OpenLDAP cluster, I already managed to set up the two backend LDAP servers in mirroring mode. The application (iRedMail) using the LDAP service is running on the same systems ...

arminV

11

asked Nov 8, 2019 at 10:45

0 votes

0 answers

986 views

How does htop calculate times? (Impossible times shown)

I was running what was supposed to be a small make job on a small node in our cluster, but a sorting process seemed to be overwhelming the allocated RAM, so I killed it after 20 hours (same jobs on ...

GenesRus

101

asked Oct 13, 2019 at 21:37

0 votes

1 answer

106 views

How to move an fs resource into a service group in a ccs cluster

I have a ccs cluster running on RHEL 6.4 where there is no luci service. I have added a filesystem resource to the cluster with the command below, but I need to move the resource inside a service group....

Ritesh Vishwakarma

7

asked Sep 27, 2019 at 10:29

0 votes

1 answer

623 views

How to set up a compute cluster in a local network on top of Linux

I have 3 machines in my local network running Manjaro. I am running Python scripts using dask, pandas, etc. which max out the CPU on the first machine and I usually need to wait more than 30 min until ...

cmosig

123

asked Sep 11, 2019 at 6:50

0 votes

2 answers

208 views

Why are worker nodes not fault-tolerant?

Swarm manager nodes handle cluster management tasks such as: 1) Maintaining cluster state 2) Scheduling services 3) Serving swarm mode HTTP API endpoints You may execute any of the - docker ...

overexchange

1,606

asked Sep 8, 2019 at 19:49
