This question may or may not be truly Haskell-specific, but it concerns a slight annoyance that I am facing with a certain programming task.
I have written a program in Haskell which is mostly general for the type of problem I am trying to solve, but includes two problem-specific components: a run-time estimation function, derived from trial runs on a particular benchmark machine, and a file-name conversion function tailored to the naming scheme of the files I am working with. Naturally, if I want to run the script on hardware whose performance differs from the benchmark machine, or I find that the estimates are too conservative, I would like to change the run-time estimation function; likewise, I would like to be able to modify the file-name conversion function whenever I need to work with files that follow a different naming scheme.
However, the (remote) computer that I am running my scripts on does not have GHC or runhaskell installed, so I am having to modify, compile, and re-upload the code from my local machine, which is a bit of a hassle. My question is: is there an easy way to change some components of my code, and have the changes take effect at call-time, without recompiling?
I apologize if my description is vague; the gory details are included below, since I did not want to clutter the question from the outset with specifics that may prove unnecessary.
I am writing this code in Haskell mainly because that is the language I best know how to implement the methods in; while I understand that other languages might be more portable, I am not familiar enough with any other language to implement this without reading a lot of documentation and going through multiple revisions to get it to work. If achieving the flexibility I want in Haskell is impractical, I can appreciate that, but I would rather know that Haskell cannot do it than receive suggestions of other languages that can.
Specific Details
I am writing code to run independent jobs on a load-sharing cluster, and I therefore want to estimate the time required for a particular job as closely as possible: under-shooting causes the job to be terminated, while over-shooting lowers the priority of my jobs. I am basing my time estimate on the size of the inputs to the job program, and I have gathered enough real-world data to derive an approximate quadratic relation between size and time.
I currently assign time estimates (and thereby establish a job order) by parsing the output of du with a Haskell script, performing the size-to-time computation, and writing the results to a new file, which is then read in a loop by the job-assignment script.
The job is being run for paired files, which share a common name up to a certain point, where the last common element I wish to retain is an 's', with no further 's' characters in either name from then on. Therefore, I am traversing the names backwards and dropping until I reach an 's'. My code is included below. It is liberal with comments, which might help or might confuse. Some of them are highly specific to the task I am working with.
-- size2time.hs
-- A Haskell script to convert file sizes into job-times, based on observed job-times for
-- various file sizes
--
--
-- This file may be compiled via the following command:
-- > ghc size2time.hs
--
-- Should any edits be made, ensure that the compiled executable is updated accordingly
--
-- The executable is to be run with the following usage
--
-- > ./size2time inputfile outputfile
--
-- where inputfile is the name of a file whose first column contains the sizes, in MB, of each fq.gz
-- (including both paired-end reads), and whose second column contains the corresponding file names, as
-- generated by
--
-- > du -m $( ls DIR/*.fq.gz ) >inputfile
--
-- where DIR is the directory containing the fq.gz files
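-- For example (the numbers here are made up), a line of inputfile might look like
--
--   1024    DIR/foo--Hs--R1.fq.gz
--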
--
-- outputfile is the name of a file that will be created by the execution of this script, whose first
-- column will contain the run-time, in minutes, of the corresponding job (the times are based on
-- jobs run on Intel CPUs with 12 cores and 2GB of RAM, and therefore will potentially be
-- inapplicable to jobs run on CPUs of different manufacturers, with different numbers of cores,
-- and/or with different allocated RAM), and whose second column contains the scrubbed names of
-- the jobs to be run. The greater time-value for any given pair is used, with only one member of
-- each pair retained, as the file-names of each member of a pair are identical after scrubbing
--
-- import modules for command line arguments, list operations, map operations
import System.Environment
import Data.List
import qualified Data.Map as Map
main :: IO ()
main = do
    args <- getArgs                                       -- parse command line arguments: inputfile, outputfile, <ignored>
    let infile  = head args
        outfile = head . tail $ args
    contents <- readFile infile                           -- read the inputfile
    let sf     = lines contents                           -- split into lines
        tf     = map size2time sf                         -- perform size2time mapping
        st     = map sample tf                            -- scrub filenames
        stu    = Map.toList . Map.fromListWith max $ st   -- take only the longer of the two times of the paired reads
        tsu    = map flip2 stu                            -- put time first
        stsu   = sort tsu                                 -- sort by time, ascending
        tsustr = map (\(t,f) -> unwords [show t, f]) stsu -- convert back to strings
        tsulns = unlines tsustr                           -- join individual lines
    writeFile outfile tsulns                              -- write to the outputfile
{- given a string, with the size of a file and the name of the file,
- returns a tuple with the estimated job-time and the unmodified name
- of the file.
-
- The size-time conversion is extrapolated from experimental data,
- with only the upper extremes considered in order to prevent timeout,
- rounding in the quadratic term, and a linear-degree time padding added
- to allow for upper extremes. If modifications are to be made to any
- coefficients, it is recommended that only linear and constant terms be increased,
- and decreases should only be made after performing sufficient alignments to collect
- enough (file size)--(actual computation time) pairs to verify that the padding is excessive,
- and to determine coefficients that more closely follow the trend of the actual data, with
- the conditions that no data point must exceed the approximation curve, and that sufficient padding
- must be provided to allow for potential inconsistency in the time required for any given size of alignment.
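-
- As a worked example, a 1000 MB input yields
- floor (0.000025 * 1000^2 + 0.03 * 1000 + 10) = floor 65.0 = 65 minutes.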
-}
size2time :: String -> (Int,String)
size2time sfstring = let [size, file] = words sfstring                            -- parse out size and filename
                         x    = fromIntegral (read size :: Int)                   -- floating point from numeric string
                         time = floor $ 0.000025 * x ^ 2 + 0.03 * x + 10         -- apply floored conversion
                     in (time, file)
{-
- removes all characters in the file-name after 's', which properly scrubs files of the format
- *--Hs--R?.fq.gz, where the ? is either 1 or 2. For filenames formatted in different ways,
- or for alternative naming of the BAM file to be generated, this function must be modified
- to suit the scenario.
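-
- For example, both foo--Hs--R1.fq.gz and foo--Hs--R2.fq.gz scrub to foo--Hs,
- so the two members of a pair collapse to the same key.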
-}
sample :: (a,String) -> (String,a)
sample (x,f) = let s = reverse . dropWhile (/= 's') . reverse $ f
               in (s,x)
{-
- Reverses the order of a tuple, e.g. so that a Map may be made with a key to be found in the
- current second position of the tuple.
-}
flip2 :: (a,b) -> (b,a)
flip2 (x,y) = (y,x)
flip2 (x,y) = (y,x)
(As a side note, flip2 is just Data.Tuple.swap from the standard library, so you could import that instead of defining it yourself.)
Answers
One suggestion is to add the hint library to your project, then use it to load Haskell modules at runtime and interpret them as scripts; the modules you load that way can be edited without recompiling the main executable.
Another observation is that the only part that really needs to change is the expression 0.000025 * x ^ 2 + 0.03 * x + 10. There are a lot of expression-parsing tutorials out there, and probably a well-built existing one, so you could pass the name of a file containing such an expression as a command-line argument, parse it at runtime, and apply it to the data. You would only need to recompile if something other than this function were changed.
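As an illustration of the hint approach, here is a minimal sketch; the file Estimate.hs, the module name Estimate, and the binding estimate :: Double -> Int are all placeholders for whatever you put in the module you edit on the remote machine, not part of the original script:
-- hint-sketch.hs  (requires the hint package)
import Language.Haskell.Interpreter
main :: IO ()
main = do
  result <- runInterpreter $ do
    loadModules ["Estimate.hs"]         -- interpret the source file found on disk
    setTopLevelModules ["Estimate"]     -- bring its top-level bindings into scope
    interpret "estimate" (as :: Double -> Int)
  case result of
    Left err -> print err               -- report interpretation/compilation errors
    Right f  -> print (f 1000)          -- apply the freshly loaded function
One caveat: hint links the GHC API into your executable and still resolves the interpreted module's imports at runtime, so whether this works on the remote machine depends on what Estimate.hs imports.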
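Short of a full expression parser, a minimal version of the same idea is to read just the three quadratic coefficients from a file at runtime; the coefficients file and its one-line "a b c" format are assumptions for illustration:
-- coeffs-sketch.hs
import System.Environment (getArgs)
-- Read quadratic coefficients a, b, c from a one-line file, so the
-- size-to-time curve can be retuned without recompiling.
readCoeffs :: FilePath -> IO (Double, Double, Double)
readCoeffs path = do
  [a, b, c] <- map read . words <$> readFile path
  return (a, b, c)
-- size2time, parameterised by the coefficients instead of hard-coding them
size2time' :: (Double, Double, Double) -> String -> (Int, String)
size2time' (a, b, c) sfstring =
  let [size, file] = words sfstring
      x = fromIntegral (read size :: Int)
  in (floor $ a * x ^ 2 + b * x + c, file)
main :: IO ()
main = do
  coeffFile : _ <- getArgs
  coeffs <- readCoeffs coeffFile
  print (size2time' coeffs "1000 foo--Hs--R1.fq.gz")
  -- with a file containing "0.000025 0.03 10" this prints (65,"foo--Hs--R1.fq.gz")
With this approach the coefficients file can be edited directly on the remote machine, and only structural changes to the script require a recompile on the local machine.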