I've got a bunch of altitude data: some values are plain numbers, some have "meters" or ' at the end. I also have a few ranges such as 1200-1300 (I guess that second problem would have to be solved a different way). I tried experimenting with regexp_replace, but [^a-z] doesn't seem to work. Does anyone have a good idea for getting rid of everything that's not a digit? Also, if you could recommend a good website/book/course on cleaning data, I'd much appreciate it. Thanks!

1 Answer

Let's set the ranges (like 1200-1300) aside, since, regardless of any programming, it is not clear what you would want to "extract" from them. You may also have problems with things like '5 ft 10 in' or similar, if they are possible in your data. (And it is not clear what the result means if the altitudes don't all use the same unit of measurement: some are in meters, some in feet, and that information disappears when you keep just the number.)

To remove all the non-digits from a string and keep the digits, you do NOT need regular expressions, which can be much slower (an order of magnitude slower!) than standard string functions.

One way to remove all non-digit characters uses the TRANSLATE function. Like so:

translate(input_string, '0123456789' || input_string, '0123456789')

The function "translates" (replaces) 0 with 0, 1 with 1, etc., and any character in the input string that hasn't already appeared earlier in the second argument (which in this case means "non-digit") to nothing (null, zip, disappears, is removed).
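To see the mapping on a single literal, here is a minimal sketch (Oracle syntax; the literal '2100 m' is just an illustration). The second argument becomes '01234567892100 m': the digits map to themselves, the repeated digits are ignored because only a character's first occurrence counts, and the space and 'm' have no counterpart in the third argument, so they are dropped. The digits are listed in the third argument because Oracle treats an empty string as null, and TRANSLATE returns null if any argument is null.

```sql
-- The space and 'm' sit at positions 15 and 16 of the second argument,
-- beyond the 10 characters of the third argument, so they are removed.
select translate('2100 m', '0123456789' || '2100 m', '0123456789') as digits_only
from   dual;
```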

Example (note the use of TO_NUMBER to also convert to actual numbers):

with
  data (input_string) as (
    select '1500'   from dual union all
    select '2100 m' from dual union all
    select '535 ft' from dual
  )
select input_string,
       to_number(translate(input_string, '0123456789' || input_string, 
                                         '0123456789')) as extracted_number
from   data;

INPUT_STRING EXTRACTED_NUMBER
------------ ----------------
1500                     1500
2100 m                   2100
535 ft                    535
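Since ranges were set aside above, one possible way to keep them from being silently squashed into one number is to flag them instead of converting them. This is only a sketch under an assumption of mine: that any value containing a hyphen is a range. That assumption would also catch a negative altitude such as '-15', so adjust it to your data.

```sql
with
  data (input_string) as (
    select '1500'      from dual union all
    select '2100 m'    from dual union all
    select '1200-1300' from dual
  )
select input_string,
       case
         -- Assumption: a hyphen anywhere marks a range; return null for manual review
         when instr(input_string, '-') > 0 then null
         else to_number(translate(input_string, '0123456789' || input_string,
                                                '0123456789'))
       end as extracted_number
from   data;
```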
7 Comments

Can I still run it on an entire column? Or do I have to type in every single string?
You can run it on an entire column. The WITH clause only simulates the test data; it is not part of the solution. Replace data with your table name and input_string with your column name. Note, though, that the query will return 12001300 if the input is 1200-1300, etc. - it does not distinguish between the different reasons for non-digits in the input.
Thank you so much for your help. I feel helpless when it comes to cleansing the data.
@Coolkidscandie - is this a one-time job? Have you (your group, organization etc.) fixed the reason the data was stored with non-digits in the first place? If you don't fix that first, cleansing the data will be an ongoing job, which is very inefficient. If it is a one-time job, then performance shouldn't be the main issue; correctness should be. In that case, first get all the values that are not already numbers and see what kinds of situations you must handle. Then you can write a simple function that handles all of them in one go. We can help here; you just need to ask.
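A rough sketch of that survey step, assuming a hypothetical table sites with a text column altitude (both names are made up here, and the WITH clause only stands in for the real table). It strips the digits instead of keeping them, so what remains is the non-digit "residue" of each value; grouping by it shows which situations exist and how often. The leading 'x' is only there so that the third argument is not empty.

```sql
with
  sites (altitude) as (          -- hypothetical stand-in for the real table
    select '1500'      from dual union all
    select '2100 m'    from dual union all
    select '535 ft'    from dual union all
    select '1200-1300' from dual
  )
select translate(altitude, 'x0123456789', 'x') as non_digit_part,
       count(*)                                as how_many
from   sites
-- A null residue means the value consisted of digits only, so skip it
where  translate(altitude, 'x0123456789', 'x') is not null
group  by translate(altitude, 'x0123456789', 'x')
order  by how_many desc;
```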
It's a one-time job for now, as I'm learning SQL on my own so I can maybe work with data professionally ;) So it's just a bunch of data I found online that I'm now trying to make use of. But thank you.