
I wrote a wrapper script around a Python project. Here are the links: https://github.com/daniellerch/aletheia https://daniellerch.me/stego/aletheia/intro-en/#aletheia-commands The essence of the script is that the user selects one of four methods and provides the path to the images (there are many tens of thousands of them). The result for each photo is written to a CSV file. I have a computing server with 80 cores. So the question is: how can I process the photos not one by one, but, say, 80 at a time, then the next 80, and so on until all the photos are done?

Here is my bash script:

#!/bin/bash

echo ___________________________________________
echo "[I] Select the structural type of steganalysis:"
echo ___________________________________________
echo [1] ws
echo [2] rs
echo [3] triples
echo [4] spa
struct=""
read -p "Enter the steganalysis structure type number: " struct_tmp
if ((struct_tmp == 1)); then
  struct="ws"
elif ((struct_tmp == 2)); then
  struct="rs"
elif ((struct_tmp == 3)); then
  struct="triples"
elif ((struct_tmp == 4)); then
  struct="spa"
else
  echo "Run the script again and enter the structure method number from 1 to 4!"
  exit 1
fi
echo "Structural steganalysis method $struct selected!"
echo _________________________________________________
echo "[II] Enter the path to the directory with images:"
echo _________________________________________________
read in_path
echo "Path to the directory with images ($in_path)"
#mkdir csv_result
awk -v s="$struct" 'BEGIN {OFS=","; print "FILE-NAME", s"-channel-R", s"-channel-G", s"-channel-B"}' >"out_$struct.csv"

for img in "$in_path"/*
do
  echo "treatment $(basename "$img") ..."
  # run the selected structural method and keep only the per-channel results
  ./aletheia.py "$struct" "$img" | awk -v img_awk="$(basename "$img")" '
    /Hidden data found in channel R/ { R = $7 }
    /Hidden data found in channel G/ { G = $7 }
    /Hidden data found in channel B/ { B = $7 }
    END {OFS=","; print img_awk, R, G, B}' >>"out_$struct.csv"
  echo
done
echo _______________________________________________________________
echo "[III] CSV file created <out_$struct.csv>"
echo "with the results of the structural method-$struct steganalysis report!"
echo _______________________________________________________________

I assumed this could be done using functions and "&", but how do I iterate over the directory and write the data to one file from many parallel jobs?
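
Here is roughly what I had in mind with "&" (a hypothetical, untested sketch; process_one would wrap the aletheia.py | awk pipeline from the loop above), but I don't see how to append to out_$struct.csv safely from 80 background jobs:

batch=80
count=0
for img in "$in_path"/*
do
  process_one "$img" &            # hypothetical function wrapping the aletheia.py | awk pipeline
  count=$((count + 1))
  if ((count % batch == 0)); then
    wait                          # block until the current batch of 80 jobs has finished
  fi
done
wait                              # catch the final, possibly partial batch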

  •
    Have you tried a web search on bash multithreading? There are several methods available ... parallel, xargs, custom looping solutions (run 80, wait for all 80 to finish, repeat; or run 80 and start a new job as each one finishes), or spawn 80 child processes that are fed a list of files to process (use a locking mechanism - flock? - to ensure 2 or more processes don't work on the same file), ...
    – markp-fuso
    Commented Jul 4 at 14:06
  •
    Assuming the 'parent' needs to process return codes/messages from the subprocesses, you can then research bash interprocess communication, of which there are multiple approaches ... sockets, named pipes, (80) tmp/scratch files, a database (e.g. SQLite).
    – markp-fuso
    Commented Jul 4 at 14:10
  •
    A couple of other items to keep in mind: 1) if each process is disk-IO intensive, you may find that 80 concurrent sets of disk IO operations saturate the disk(s) and actually slow down the overall process; 2) the final objective appears to be a single csv file, but you'll need to address how 80 concurrent processes write to that single file without scrambling/corrupting (csv) rows ... having all 80 concurrent processes append to the same file will likely lead to issues with the contents of the file (one way to serialize the appends is sketched after these comments).
    – markp-fuso
    Commented Jul 4 at 16:01
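
A minimal sketch of the flock idea from the comments, serializing appends to the shared CSV (assumes util-linux flock is installed; row is a hypothetical variable holding one finished CSV line):

(
  flock -x 200                                  # take an exclusive lock before touching the CSV
  printf '%s\n' "$row" >> "out_$struct.csv"     # append exactly one row while holding the lock
) 200>"out_$struct.csv.lock"                    # fd 200 is tied to a separate lock file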

1 Answer


xargs -P n is going to be your friend. It is like regular xargs, but parallelized using n processes.

You can write your per-image processing in a separate script, let's say, process-single-image.sh, then:

find "$in_path" -maxdepth 1 -type f -print0 | xargs -0 -n1 -P 32 process-single-image.sh >> out.csv

Note: if process-single-image.sh outputs multiple lines (in several calls to print or echo), you may want to prefix them with a unique prefix (e.g. the input filename) so even if you have lines shuffled, you can put them back together. While I suppose it is possible to have output garbled across lines of output, I have never encountered that in decades of using these kinds of commands. In any case, if this proves to be an actual issue for you, then have each process write to its own output file, and cat all those in one final csv after the operation.
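
If garbled rows ever do show up, the per-process-file fallback could look like this (a sketch; tmp_out/ is an assumed scratch directory):

mkdir -p tmp_out
find "$in_path" -maxdepth 1 -type f -print0 \
  | xargs -0 -n1 -P 32 sh -c 'process-single-image.sh "$1" > "tmp_out/$(basename "$1").csv"' _
cat tmp_out/*.csv >> out.csv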

I use this construct all the time, often with regular shell commands. For example, to gunzip a whole bunch of files fast:

find . -type f -name '*.gz' | xargs -P 64 gunzip
  •
    Good point on -n1. For files with spaces in the name, I also use -print0 and -0 to make sure.
    – Pierre D
    Commented Jul 4 at 16:33
