The task is to write a function that computes the number of outliers in an array of measurements and discards them.

The function that computes the median is already given.

If a measurement lies outside the range [0.5*median, 1.5*median], it is an outlier and should be discarded. I have gotten as far as I can on my own. What I don't know is how to build the array without the outliers from the original array. My idea is to allocate a new array that stores the values within the range and to return that array.

task1_main.c

#include<stdio.h>
#include<stdlib.h>
#include "task1.c"

int main()
{
int i, size1, size2;

// reading the number of measurements in group1 
scanf("%d", &size1);        
float *measurements1 = malloc(size1*sizeof(float));
// reading the measurements in group1   
for(i=0; i<size1; i++)
scanf("%f", measurements1+i);

// reading the number of measurements in group2 
scanf("%d", &size2);        
float *measurements2 = malloc(size2*sizeof(float));
// reading the measurements in group2
for(i=0; i<size2; i++)
scanf("%f", measurements2+i);



float median1 = sort_and_find_median(measurements1, size1);
int new_size1;
float *measurements1_wo_outliers = discard_outliers(measurements1, size1, median1, &new_size1);

float median2 = sort_and_find_median(measurements2, size2);
int new_size2;
float *measurements2_wo_outliers = discard_outliers(measurements2, size2, median2, &new_size2);

// writing measurements for group1 after discarding the outliers
printf("%d\n", new_size1);
for(i=0; i<new_size1; i++)
printf("%.2f\n", measurements1_wo_outliers[i]);

printf("\n");
// writing measurements for group2 after discarding the outliers
printf("%d\n", new_size2);
for(i=0; i<new_size2; i++)
printf("%.2f\n", measurements2_wo_outliers[i]);


free(measurements1);
free(measurements2);
free(measurements1_wo_outliers);
free(measurements2_wo_outliers);
return 0;
}

task1.c

// function to sort the array in ascending order
float sort_and_find_median(float *measurements , int size)
{
  int i=0 , j=0;
  float temp=0;

  for(i=0 ; i<size ; i++)
    {
      for(j=0 ; j<size-1 ; j++)
    {
      if(measurements[j]>measurements[j+1])
        {
          temp        = measurements[j];
          measurements[j]    = measurements[j+1];
          measurements[j+1]  = temp;
        }
    }
    }

  return measurements[size/2];
}

float *discard_outliers(float *measurements, int size, float median, int *new_size)
{

  //float number_of_outliers[0];
  int i= 0;
  for(i = 0; i<size; i++){
    if((measurements[i] < (0.5*median)) && (measurements[i] > (1.5*median))){
      number_of_outliers[i] = measurements[i];
    }

  }


  *new_size = size - number_of_outliers;
  //to creates a new array of length *newsize using malloc 
  *measurements_wo_outliers = malloc( (*new_size) * sizeof(float) );

}

Let us assume that the group1 and group2 have 3 and 4 patients respectively. Let the measurements be {45.0, 23.15, 11.98} and {2.45, 11.0, 12.98, 77.80} for group1 and group2 respectively.
The contents of measurements.txt will be:

3

45.0

23.15

11.98

4

2.45

11.0

12.98

77.80

measurements.txt is

25 23.0 21.5 27.6 2.5 19.23 21.0 23.5 24.6 19.5 19.23 26.01 22.5 24.6 20.15 18.23 19.73 22.25 26.6 45.5 5.23 18.0 24.5 23.26 22.5 18.93

20 11.12 10.32 9.91 14.32 12.32 20.37 13.32 11.57 2.32 13.32 11.22 12.32 10.91 8.32 14.56 10.16 35.32 12.91 12.58 13.32

and expected_measurements is below as:

22 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15 21.00 21.50 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60 24.60 26.01 26.60 27.60

17 8.32 9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32 12.32 12.58 12.91 13.32 13.32 13.32 14.32 14.56

  • Take a look at "How to remove certain elements from an array using a conditional test in C?". If that approach won't work, there are a number of other questions related to outliers. For your array, when you identify an outlier, the easiest removal is to verify you are not at the end and then call memmove (&array[i], &array[i+1], (n - i - 1) * sizeof *array); followed by n--; (memmove rather than memcpy, because the source and destination overlap). If it is the last element, just decrement n. Commented Feb 12, 2019 at 21:35
  • Use realloc to resize an existing allocation. There is no need to have two arrays. Simply condense the array down by shuffling values to remove the outliers, and realloc to resize it (if you want to -- there is often less need for this, as you can simply leave the extra space available for future array growth). Commented Feb 12, 2019 at 21:39

2 Answers 2

Your code has a number of problems, but the problem with your outlier identification is that you are using '&&' instead of '||'. That prevents any outlier from being found, because (for a non-negative median) your test condition always evaluates false, e.g.

if((measurements[i] < (0.5*median)) && (measurements[i] > (1.5*median))){

(the array element can never be both less than (0.5*median) and greater than (1.5*median) at the same time)
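To make the corrected test concrete, here is a minimal sketch of the condition written with '||'. The helper name is_outlier is hypothetical (it is not part of the question's code), and the sketch assumes a non-negative median, as in the question's data:

```c
#include <stdbool.h>

/* Hypothetical helper, not from the question: returns true when 'value'
 * lies outside the range [0.5*median, 1.5*median].
 * Assumes median >= 0; for a negative median the two bounds swap. */
static bool is_outlier (float value, float median)
{
    /* a value is an outlier if it is too small OR too large */
    return value < 0.5f * median || value > 1.5f * median;
}
```

With the median 22.25 from the first data set, the range is [11.125, 33.375], so 2.5 and 45.5 are reported as outliers while 18.0 and 27.6 are kept.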

Beyond your identification of the outliers, as noted in the comments and in @paddy's answer, you don't need to copy or allocate in your outlier removal function. Instead, remove each outlier by using memmove to shift all elements above it down by one. Before returning from the function, if outliers were removed, you can (optionally) realloc once at the end to trim the allocation size.

(which really isn't needed unless you are working on a memory limited embedded system or have millions of elements you are dealing with)

Tidying up your removal function and passing the address of your array from main() to allow reallocation in the function without having to assign the return, you could do something like:

/* remove outliers from array 'a' given 'median'.
 * takes address of array 'a', address of number of elements 'n',
 * and median 'median' to remove outliers. a is reallocated following
 * removal and n is updated to reflect the number of elements that
 * remain. returns pointer to reallocated array on success, NULL otherwise.
 */
double *rmoutliers (double **a, size_t *n, double median)
{
    size_t i = 0, nelem = *n;   /* index, save initial number of elements */

    while (i < *n)  /* loop over all elements identifying outliers */
        if ((*a)[i] < 0.5 * median || (*a)[i] > 1.5 * median) {
            if (i < *n - 1)     /* if not end, use memmove to remove */
                memmove (&(*a)[i], &(*a)[i+1],
                        (*n - i - 1) * sizeof **a);  /* elements after i */
            (*n)--; /* decrement number of elements */
        }
        else        /* otherwise, increment index */
            i++;

    if (*n < nelem) {   /* if outliers removed */
        void *dbltmp = realloc (*a, *n * sizeof **a);   /* realloc */
        if (!dbltmp) {  /* validate reallocation */
            perror ("realloc-a");
            return NULL;
        }
        *a = dbltmp;    /* assign reallocated block to array */
    }

    return *a;      /* return array */
}

Next, do not roll your own sort function. The C library provides qsort, which is far less likely to contain errors than your own (not to mention much faster than a bubble sort on large arrays). All you need to do is write a qsort compare function that receives pointers to two elements of your array and returns a negative value if the first sorts before the second, 0 if the elements are equal, and a positive value if the second sorts before the first. For numeric comparisons, you can return the difference of the two inequalities to avoid potential overflow, e.g.

    /* qsort compare to sort numbers in ascending order without overflow */
    return (a > b) - (a < b);

Noting that a and b will be pointers to double (or float) in your case, to compare doubles, the proper casts before dereference would be:

/* qsort compare function for doubles (ascending) */
int cmpdbl (const void *a, const void *b)
{
    /* cast to pointer-to-const so the const qualifier is not discarded */
    return (*(const double *)a > *(const double *)b) -
            (*(const double *)a < *(const double *)b);
}

That's the only challenge in using qsort. After that, sorting your array in ascending order takes nothing more than:

        qsort (array, n, sizeof *array, cmpdbl);    /* use qsort to sort */

(done...)
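Since the question's arrays hold float rather than double, the same idea can be sketched for float while keeping the question's sort_and_find_median signature. The comparator name cmpflt and the choice to return 0 for an empty array are assumptions made for this sketch:

```c
#include <stdlib.h>

/* qsort compare function for floats (ascending); hypothetical name */
static int cmpflt (const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;

    return (x > y) - (x < y);
}

/* Same signature as the question's function: sorts the array in place
 * and returns the element at index size/2, which is how the question
 * defines the median. Returning 0 for size <= 0 is an assumption. */
float sort_and_find_median (float *measurements, int size)
{
    if (size <= 0)
        return 0.0f;

    qsort (measurements, (size_t)size, sizeof *measurements, cmpflt);

    return measurements[size / 2];
}
```

For the question's first small group {45.0, 23.15, 11.98}, this sorts the array to {11.98, 23.15, 45.0} and returns 23.15.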

Putting it all together, a short example could be written as follows. It reads each array as a line of input (1024 chars max), converts each value to a double using sscanf, and stores any number of values in a dynamically sized array. It then sorts the array, grabs the median, and calls the removal function.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXC 1024   /* max characters to read per-line (per-array) */
#define MAXD 8      /* initial number of doubles to allocate */

/* qsort compare function for doubles (ascending) */
int cmpdbl (const void *a, const void *b)
{
    /* cast to pointer-to-const so the const qualifier is not discarded */
    return (*(const double *)a > *(const double *)b) -
            (*(const double *)a < *(const double *)b);
}

/* remove outliers from array 'a' given 'median'.
 * takes address of array 'a', address of number of elements 'n',
 * and median 'median' to remove outliers. a is reallocated following
 * removal and n is updated to reflect the number of elements that
 * remain. returns pointer to reallocated array on success, NULL otherwise.
 */
double *rmoutliers (double **a, size_t *n, double median)
{
    size_t i = 0, nelem = *n;   /* index, save initial number of elements */

    while (i < *n)  /* loop over all elements identifying outliers */
        if ((*a)[i] < 0.5 * median || (*a)[i] > 1.5 * median) {
            if (i < *n - 1)     /* if not end, use memmove to remove */
                memmove (&(*a)[i], &(*a)[i+1],
                        (*n - i - 1) * sizeof **a);  /* elements after i */
            (*n)--; /* decrement number of elements */
        }
        else        /* otherwise, increment index */
            i++;

    if (*n < nelem) {   /* if outliers removed */
        void *dbltmp = realloc (*a, *n * sizeof **a);   /* realloc */
        if (!dbltmp) {  /* validate reallocation */
            perror ("realloc-a");
            return NULL;
        }
        *a = dbltmp;    /* assign reallocated block to array */
    }

    return *a;      /* return array */
}

int main (void) {

    char buf[MAXC];
    int arrcnt = 1;

    while (fgets (buf, MAXC, stdin)) {  /* read line of data into buf */
        int offset = 0, nchr = 0;
        size_t  n = 0, ndbl = MAXD, size;
        double  *array = malloc (ndbl * sizeof *array), /* allocate */
                dbltmp, median;

        if (!array) {   /* validate initial allocation */
            perror ("malloc-array");
            return 1;
        }
        /* parse into doubles, store in dbltmp (should use strtod) */
        while (sscanf (buf + offset, "%lf%n", &dbltmp, &nchr) == 1) {
            if (n == ndbl) {    /* check if reallocation required */
                void *tmp = realloc (array, 2 * ndbl * sizeof *array);
                if (!tmp) {     /* validate */
                    perror ("realloc-array");
                    break;
                }
                array = tmp;    /* assign reallocated block */
                ndbl *= 2;      /* update allocated number of doubles */
            }
            array[n++] = dbltmp;    /* assign to array, increment index */
            offset += nchr;     /* update offset in buffer */
        }

        qsort (array, n, sizeof *array, cmpdbl);    /* use qsort to sort */
        median = array[n / 2];                      /* get median */

        /* output original array and number of values */
        printf ("\narray[%d] - %zu values\n\n", arrcnt++, n);
        for (size_t i = 0; i < n; i++) {
            if (i && i % 10 == 0)
                putchar ('\n');
            printf (" %5.2f", array[i]);
        }
        printf ("\n\nmedian: %5.2f\n\n", median);

        size = n;   /* save original number of doubles in array in size */
        if (!rmoutliers (&array, &n, median))   /* remove outliers */
            return 1;

        if (n < size) { /* check if outliers removed */
            printf ("%zu outliers removed - %zu values\n\n", size - n, n);
            for (size_t i = 0; i < n; i++) {
                if (i && i % 10 == 0)
                    putchar ('\n');
                printf (" %5.2f", array[i]);
            }
            printf ("\n\n");
        }
        else    /* otherwise warn no outliers removed */
            fputs ("warning: no outliers found.\n\n", stderr);

        free (array);   /* don't forget to free what you allocate */
    }
}

(note: you should really use strtod as sscanf provides no error handling beyond reporting success/failure of the conversion, but that is for another day or left to you as an exercise)
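As a rough idea of what the strtod version of the parsing loop could look like, here is a minimal sketch. The function name parse_doubles and its fixed-capacity output buffer are assumptions made to keep the example short; the program above grows its array with realloc instead:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse up to 'max' doubles from 'line' into 'out'.
 * Returns the number of values stored. Parsing stops at the first token
 * that is not a number, or when strtod reports a range error. */
size_t parse_doubles (const char *line, double *out, size_t max)
{
    size_t n = 0;
    const char *p = line;

    while (n < max) {
        char *end;
        double v;

        errno = 0;                  /* reset so ERANGE can be detected */
        v = strtod (p, &end);       /* skips leading whitespace itself */

        if (end == p)               /* no digits converted: end or junk */
            break;
        if (errno == ERANGE)        /* overflow or underflow reported */
            break;

        out[n++] = v;               /* store value */
        p = end;                    /* continue after the parsed number */
    }

    return n;
}
```

Unlike the sscanf loop, this distinguishes "nothing left to convert" from "value out of range", and the end pointer shows exactly where parsing stopped, so a caller could report the offending text.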

Example Input File

Note: I didn't use the size: X information in my data file. It wasn't needed. I just used a dynamic allocation scheme to size the arrays as needed. The format of the input file I used contained the measurement values for each array on a separate line, e.g.

23.0 21.5 27.6 2.5 19.23 21.0 23.5 24.6 19.5 19.23 26.01 22.5 24.6 20.15 ... 18.93
11.12 10.32 9.91 14.32 12.32 20.37 13.32 11.57 2.32 13.32 11.22 12.32 ... 13.32

Example Use/Output

$ ./bin/rmoutliers <dat/outlierdata.txt

array[1] - 25 values

  2.50  5.23 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15
 21.00 21.50 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60
 24.60 26.01 26.60 27.60 45.50

median: 22.25

3 outliers removed - 22 values

 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15 21.00 21.50
 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60 24.60 26.01
 26.60 27.60


array[2] - 20 values

  2.32  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32
 12.32 12.58 12.91 13.32 13.32 13.32 14.32 14.56 20.37 35.32

median: 12.32

3 outliers removed - 17 values

  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32 12.32
 12.58 12.91 13.32 13.32 13.32 14.32 14.56

(note: in any code that dynamically allocates memory, you should run the program through a memory error checking program like valgrind for Linux; other OSes have similar tools. It's simple: just add valgrind to the start of your command, e.g. valgrind ./bin/rmoutliers <dat/outlierdata.txt, and confirm you have freed all the memory you have allocated and that there are no memory errors.)

Look things over and let me know if you have questions.

Memory Use/Error Check

In your comment you seem concerned that what I do above may leak memory -- that's not the case. As mentioned in the note above, you can verify the memory use and check for any memory errors with tools such as valgrind, e.g.

$ valgrind ./bin/rmoutliers <dat/outlierdata.txt
==28383== Memcheck, a memory error detector
==28383== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28383== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==28383== Command: ./bin/rmoutliers
==28383==

array[1] - 25 values

  2.50  5.23 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15
 21.00 21.50 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60
 24.60 26.01 26.60 27.60 45.50

median: 22.25

3 outliers removed - 22 values

 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15 21.00 21.50
 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60 24.60 26.01
 26.60 27.60


array[2] - 20 values

  2.32  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32
 12.32 12.58 12.91 13.32 13.32 13.32 14.32 14.56 20.37 35.32

median: 12.32

3 outliers removed - 17 values

  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32 12.32
 12.58 12.91 13.32 13.32 13.32 14.32 14.56

==28383==
==28383== HEAP SUMMARY:
==28383==     in use at exit: 0 bytes in 0 blocks
==28383==   total heap usage: 8 allocs, 8 frees, 1,208 bytes allocated
==28383==
==28383== All heap blocks were freed -- no leaks are possible
==28383==
==28383== For counts of detected and suppressed errors, rerun with: -v
==28383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If you note above, there were "8 allocations and 8 frees" associated with the memory used above, e.g.:

==28383==   total heap usage: 8 allocs, 8 frees, 1,208 bytes allocated

You can also confirm that all memory was freed and there were no leaks in the next line:

==28383== All heap blocks were freed -- no leaks are possible

And finally you can confirm there were no memory errors associated with the use of memory during the program execution:

==28383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If there is a part of the code that you are having trouble following where the memory is freed, let me know and I'm happy to help further.


6 Comments

I posted my method. My main function also prevents memory leaks. It includes the task1.c that I showed above, is compiled with gcc -Werror -Wall task1_main.c -o task1, and is run as ./task1 <measurements.txt > measurements_wo_outliers.txt. I have to implement all of the outlier computation inside float *discard_outliers(), but I got lots of information and ideas from your answer, thanks.
@DavidJackson thanks for your comment. All memory used was freed and no leaks are possible. I updated the answer summarizing the memory use with the valgrind output. Let me know if you still have questions about how the memory was handled and I'm happy to explain further. If you need to store both arrays in memory at the same time, let me know (it's simple to do) and I'm happy to add another example showing an array of pointers.
Very thorough answer, although I'm not a fan of the way you are using memmove to shift the entire remainder of the array every time an outlier is encountered. If using that function, it would make more sense to identify whole chunks of values between outliers and use memmove on them.
Oh... that's exactly what I wanted. You just gave me the answer right away, LOL.
But the median and the measurements must be of type float. Can I just change the type to float? I need to integrate task1_main.c and task1.c.

Here is the basic approach to condense your array, removing outliers, and then resize it.

First I noticed your logic for testing outliers is wrong. The measurement can't be less than 0.5*median AND greater than 1.5*median... Unless median is negative. Let's clean that up by allowing both:

// Choose stable lower and upper bounds
const float low =  (median < 0.f ? 1.5f : 0.5f) * median;
const float high = (median < 0.f ? 0.5f : 1.5f) * median;

This ensures that low <= high always holds (except if low or high ends up being NaN).

Now you need to remove the outliers. The simplest way to do this is to keep a second index that records how many non-outliers you have seen so far. Walk through the array, and if any outlier has been found, you will also shuffle values as you go.

// Remove outliers
int num_clean = 0;
for(int i = 0; i < size; i++)
{
    float value = measurements[i];
    if(value >= low && value <= high)
    {
        if (i != num_clean)
            measurements[num_clean] = value;  // write to the next free slot
        ++num_clean;                          // then count the kept value
    }
}

At the end of this, num_clean represents the number of values that remain. It's up to you whether to resize the array or not. You could use the following logic:

// Resize array
if (num_clean < size)
{
    float *new_measurements = realloc(measurements, num_clean * sizeof(float));
    if (new_measurements)
        measurements = new_measurements;
}
*new_size = num_clean;  // always report the count, even if nothing was removed

Note that you may need some extra handling in the case that num_clean ends up as 0. You must decide whether to free your array or not. In the above, there's also a silent handling of the case where realloc fails -- we'll keep the original array pointer but update new_size.

If you are not too concerned about a little bit of extra memory, it should be fine to avoid reallocation completely. Simply return the number of clean samples, and leave any remaining memory at the end of the array unused.
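One caveat if the question's main() has to stay unchanged: it frees the original array and the returned array separately, so the function must hand back a distinct allocation rather than the resized original. A minimal sketch under that assumption follows. It keeps the question's discard_outliers signature and the bounds logic above, but copies the kept values into a new buffer instead of condensing in place, which is a different technique from the one described in this answer:

```c
#include <stdlib.h>

/* Sketch matching the question's signature. Returns a newly allocated
 * array holding only the values within [low, high] and stores their
 * count in *new_size. The caller frees both the original array and the
 * returned one. Returns NULL if the allocation fails. */
float *discard_outliers (float *measurements, int size, float median,
                         int *new_size)
{
    /* stable bounds, also for a negative median */
    const float low  = (median < 0.f ? 1.5f : 0.5f) * median;
    const float high = (median < 0.f ? 0.5f : 1.5f) * median;
    int num_clean = 0;
    float *clean;

    *new_size = 0;

    /* allocate room for the worst case (no outliers); request at least
     * one element so that size 0 does not result in malloc(0) */
    clean = malloc ((size > 0 ? (size_t)size : 1) * sizeof *clean);
    if (!clean)
        return NULL;

    for (int i = 0; i < size; i++) {
        float value = measurements[i];
        if (value >= low && value <= high)
            clean[num_clean++] = value;     /* keep values in range */
    }

    *new_size = num_clean;
    return clean;
}
```

The returned buffer is sized for the worst case rather than trimmed, in line with the remark above that the extra memory is usually not worth a reallocation.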

2 Comments

what i_clean means?
Typo. It should be num_clean, obviously.
