I have some foreign-language names in my query. The problem is that I don't know where all the special characters are, so using the REPLACE function is not practical: there are over 500,000 rows. Some foreign names appear like this, for example:

(image: a name containing accented characters, such as COLLÈGE BORÉAL D’ARTS APPLIQUÉS ET DE TECHNOLOGIE)

I want the name to appear like this instead: "COLLEGE BOREAL DARTS APPLIQUES ET DE TECHNOLOGIE"

Is there a way to achieve this without using the REPLACE function, so that it works on the other names in the list as well?

I tried something like this that I saw in another post:

SELECT
CTE.COLLEGE_NAME COLLATE Cyrillic_General_CI_AI
FROM SCHOOLS cte

But it did not work. If someone could please help me solve this, that would be great. Thank you!

  • See this OTN Forums discussion (community.oracle.com/tech/developers/discussion/4146087/…). It suggests TRANSLATE function and contains several examples which might help (unless you have characters that aren't covered there). Commented Jul 6, 2021 at 20:31
  • I have characters that aren't covered here, such as "@" and the trademark symbol. Commented Jul 6, 2021 at 20:52
  • So include them; shouldn't be too difficult, I presume. Commented Jul 6, 2021 at 20:53
  • I don't want to include the symbols and other characters, though. As I said, I only want the letters. That article uses the translate function, which is not what I am looking for. Commented Jul 6, 2021 at 21:30
  • I think Littlefoot was referring to including those in the translate, not in the result. Why don't you want to use replace or translate - because there are too many possibilities? And what fundamental rules apply to the result - it sounds like maybe you only want ASCII characters, minus symbols; is that right? Except you said only letters; but you retained spaces, and what about numbers? Does the last query here work for your real data? Commented Jul 6, 2021 at 23:12

2 Answers

You seem to be talking about both accented and special characters. As @Sayan showed, you can use nlssort to remove the accents, but as well as having to deal with the case change, it doesn't handle things like the trademark symbol (which you mentioned) as you might expect or want: the '™' is converted to 'tm', which is clever but unhelpful here, and it throws out the translate too (as shown here, adding examples to Sayan's code).

Another approach that might work for you is to use convert (which Oracle recommends against) or the utl_raw/utl_i18n functions to convert your values to plain ASCII. That takes care of the accents (hopefully all of them; I haven't tested extensively, and the discussion @Littlefoot linked to shows a lot of variations) and replaces any other non-ASCII values with a ?, which you can then conveniently remove along with other punctuation and symbols:

select college_name,
  regexp_replace(
    utl_i18n.raw_to_char(utl_i18n.string_to_raw(college_name, 'US7ASCII'), 'US7ASCII'),
    '[[:punct:]]',
    null) as result
from schools

which with your example and another with a trademark symbol gives:

COLLEGE_NAME                                       RESULT
-------------------------------------------------  ------------------------------------------------
COLLÈGE BORÉAL D’ARTS APPLIQUÉS ET DE TECHNOLOGIE  COLLEGE BOREAL DARTS APPLIQUES ET DE TECHNOLOGIE
Collectives™ on Stack Overflow                     Collectives on Stack Overflow

db<>fiddle including some variations; but don't use the convert ones *8-)
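
The comments on the question also asked what should happen to digits and spaces. As a hedged variation on the query above (a sketch, not tested against other database character sets), the intermediate ASCII value can be exposed and the final filter tightened so that only letters, digits and spaces survive. The column aliases ascii_only and letters_digits_spaces are illustrative names, not part of the original query:

```sql
-- Sketch only: same utl_i18n conversion as above, with the intermediate
-- value shown in its own column. The aliases are hypothetical.
select college_name,
       -- step 1: convert to 7-bit ASCII; accents are dropped and anything
       -- with no ASCII equivalent becomes a replacement character
       utl_i18n.raw_to_char(
         utl_i18n.string_to_raw(college_name, 'US7ASCII'),
         'US7ASCII') as ascii_only,
       -- step 2: remove everything that is not a letter, a digit or a space
       regexp_replace(
         utl_i18n.raw_to_char(
           utl_i18n.string_to_raw(college_name, 'US7ASCII'),
           'US7ASCII'),
         '[^[:alnum:] ]') as letters_digits_spaces
from schools;
```

Unlike '[[:punct:]]', the negated class removes anything outside the allowed set, so whether that is preferable depends on which characters you want to keep.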

3 Comments

Thank you, this one works. It's interesting that CONVERT is not recommended by Oracle, because for me these two snippets both worked perfectly fine: ``` select regexp_replace( utl_i18n.raw_to_char(utl_i18n.string_to_raw(' Collège Boréal D’Arts Appliqués Et De Technologie?', 'US7ASCII'), 'US7ASCII'), '[[:punct:]]', null) as result from dual; ``` and ``` regexp_replace(CONVERT(' Collège Boréal D’Arts Appliqués ÄNí et de Technologie?', 'US7ASCII'),'[[:punct:]]') ``` Just wondering if you could tell me why adding NULL at the end of the first snippet is important?
@comp_user - the null isn't important, it is just showing explicitly that you're removing the punctuation, not replacing it with anything; but it's an optional argument and will make no difference if you omit it. And yes, convert works in my fiddle too, but it might not in a database with a different character set, and it's generally better to avoid things Oracle discourage *8-)
thank you so much @Alex Poole! Really appreciate the explanation
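
To illustrate the point about the optional third argument: in the small sketch below, both calls should strip the punctuation in the same way, because regexp_replace removes the matches when no replacement string is supplied. The literal and the column aliases are made up for the example:

```sql
-- Sketch: the explicit null and the omitted argument behave the same way.
select regexp_replace('D''ARTS, APPLIQUES!', '[[:punct:]]', null) as with_null,
       regexp_replace('D''ARTS, APPLIQUES!', '[[:punct:]]')       as without_null
from dual;
```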
Of course, you can remove accents/umlauts from characters.

First of all, look at this example:

with t(n,name) as (
select 1, 'Löwenbrauerei' from dual union all
select 2, 'LÖwenbrauerei' from dual union all
select 3, 'Lowenbrauerei' from dual union all
select 4, 'LOwenbrauerei' from dual 
)
select
   n
  ,name
  ,utl_raw.cast_to_varchar2(nlssort(name, 'NLS_SORT=BINARY_AI')) name_AI
from t;

Results:

         N NAME           NAME_AI
---------- -------------- --------------------
         1 Löwenbrauerei  lowenbrauerei
         2 LÖwenbrauerei  lowenbrauerei
         3 Lowenbrauerei  lowenbrauerei
         4 LOwenbrauerei  lowenbrauerei

As you can see, NLSSORT(..., 'NLS_SORT=BINARY_AI') removes all accents and converts everything to lower case, so you just need to restore the original upper/lower case of each character. For example, you can combine it with translate:

with t(n,name) as (
select 1, 'Löwenbrauerei' from dual union all
select 2, 'LÖwenbrauerei' from dual union all
select 3, 'Lowenbrauerei' from dual union all
select 4, 'LOwenbrauerei' from dual 
)
select
  n
  ,name 
  ,upper(name)
  ,lower(utl_raw.cast_to_varchar2(nlssort(name, 'NLS_SORT=BINARY_AI'))) name_AI_lower
  ,upper(utl_raw.cast_to_varchar2(nlssort(name, 'NLS_SORT=BINARY_AI'))) name_AI_upper
  ,translate(
      translate(
           name
          ,upper(name)
          ,upper(utl_raw.cast_to_varchar2(nlssort(name, 'NLS_SORT=BINARY_AI')))
      )
      ,lower(name)
      ,utl_raw.cast_to_varchar2(nlssort(name, 'NLS_SORT=BINARY_AI'))
  ) as name_ascent_removed
from t;

Results:

         N NAME           UPPER(NAME)    NAME_AI_LOWER        NAME_AI_UPPER        NAME_ASCENT_REMOVED
---------- -------------- -------------- -------------------- -------------------- --------------------------------------------------------
         1 Löwenbrauerei  LÖWENBRAUEREI  lowenbrauerei        LOWENBRAUEREI        Lowenbrauerei
         2 LÖwenbrauerei  LÖWENBRAUEREI  lowenbrauerei        LOWENBRAUEREI        LOwenbrauerei
         3 Lowenbrauerei  LOWENBRAUEREI  lowenbrauerei        LOWENBRAUEREI        Lowenbrauerei
         4 LOwenbrauerei  LOWENBRAUEREI  lowenbrauerei        LOWENBRAUEREI        LOwenbrauerei

PS: You can probably just set a code page/font on the client that ignores them...

1 Comment

AFAIK there is no magic wand here; you have to code it up manually like this. A function would be a better approach, though.
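
Following up on the suggestion of a function, here is a minimal sketch that wraps the nested translate expression from the answer above. The function name remove_accents, the parameter p_name and the local variable l_ai are hypothetical. It inherits the limitations of that expression, including the '™' behaviour noted in the other answer:

```sql
-- Hypothetical helper: remove_accents is an illustrative name, not from the answer.
create or replace function remove_accents(p_name in varchar2)
  return varchar2
  deterministic
is
  -- accent-insensitive, lower-cased form of the input, as in the queries above
  l_ai varchar2(32767);
begin
  l_ai := utl_raw.cast_to_varchar2(nlssort(p_name, 'NLS_SORT=BINARY_AI'));

  -- inner translate maps the upper-case characters, outer translate the lower-case ones
  return translate(
           translate(p_name, upper(p_name), upper(l_ai)),
           lower(p_name),
           l_ai);
end remove_accents;
/
```

It could then be called as, for example, select remove_accents(college_name) from schools, which keeps the query itself short and puts the logic in one place.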
