Dashes with(out) spaces with python's regex

Question

What I've managed to do

I'm new to both Python and regex. Using Python's re.compile on a massive number of text files, I wanted to find all kinds of dashes surrounded by spaces. I used:

search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')

(Yeah, I know about the regex module on PyPI, but I'm trying to use what I know better.) It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.

What I'd like to do now

Now I'm trying to do the opposite: find all the dash-like characters that are not surrounded by spaces (that is, with a space to the left, or a space to the right, or no spaces around them at all).

What I've tried

So I tried to use the same regex as above and just swap the \s at the beginning, then the \s at the end, and then both \s-es, with \S (to find all characters that are not whitespace characters). And now the regex suddenly seems to have gone mad and is finding all kinds of words rather than dashes and their neighbouring letters, which is what I expected it to find. I've no idea what's going on.

What went wrong?

"Yeah, I know PyPy's regex exists": what does PyPy have to do with anything?
– Chris, Commented Feb 27, 2022 at 19:01
I mean, I know I could use the regex module from PyPI and use \p{Pd}
– gttfld, Commented Feb 27, 2022 at 19:08
If the hyphen regex is \p{Pd}, then a pattern that matches any hyphen not surrounded with spaces is (?<!\s)\p{Pd}|\p{Pd}(?!\s). See regex101.com/r/yEnukG/1. There is also another way, described in my YT video: \p{Pd}(?!(?<=\s.)\s)
– Wiktor Stribiżew, Commented Feb 27, 2022 at 19:20

Wiktor Stribiżew · Accepted Answer · 2022-02-27 20:02:12Z

To match a specific single-char pattern that is not in between two given chars, you can use either of the following patterns:

b(?!(?<=a.)c)
(?<!a)b|b(?!c)

where a and c can be the same chars.

The b(?!(?<=a.)c) pattern matches any b that is not immediately followed by a c which is, in its turn, immediately preceded by a and any one char. In other words, the match is rejected only when b sits between a and c. (Here, . is fine to use because all the lookbehind needs to do is step back over the b that was just matched.)

Here, if you wanted to match a plain hyphen that is not in between two whitespace characters, you could use -(?!(?<=\s.)\s).
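
As a quick check, both pattern shapes can be run side by side with Python's built-in re module. This is only a minimal sketch: the sample string and the helper name `spans` are made up for illustration.

```python
import re

# Two ways to match a hyphen that is NOT enclosed by whitespace on both sides.
LOOKAROUND_ALTERNATION = re.compile(r'(?<!\s)-|-(?!\s)')
NESTED_LOOKBEHIND = re.compile(r'-(?!(?<=\s.)\s)')

def spans(pattern, text):
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

# Hypothetical sample: only the first hyphen has whitespace on both sides.
sample = "a - b, c-d, e -f, g- h"

print(spans(LOOKAROUND_ALTERNATION, sample))
print(spans(NESTED_LOOKBEHIND, sample))
```

Both calls should report the same three offsets (the hyphens in "c-d", "e -f" and "g- h"), while the hyphen in "a - b" is skipped.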

If you put the character class of your choice into the pattern, it will look like

(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!\s)

Or

[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD](?!(?<=\s.)\s)

See the regex demo #1 and regex demo #2. The second is more efficient.
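
For use from Python, the second pattern can be assembled from a class string and compiled once. The sketch below makes a few assumptions: the class is deliberately shortened, the names `DASHES`, `NOT_SPACED` and `find_unspaced_dashes` are hypothetical, and the last class member is written with the eight-digit \U escape because U+10EAD lies outside the BMP.

```python
import re

# A shortened dash class for illustration only; substitute the full class as needed.
# \U00010EAD uses eight hex digits because the character is outside the BMP.
DASHES = r'[\u002D\u2010-\u2015\u2212\U00010EAD]'

# A dash that is not enclosed by whitespace on both sides (second pattern shape).
NOT_SPACED = re.compile(DASHES + r'(?!(?<=\s.)\s)')

def find_unspaced_dashes(text):
    """Return (offset, character) pairs for dashes lacking a space on at least one side."""
    return [(m.start(), m.group()) for m in NOT_SPACED.finditer(text)]

# Hypothetical sample: an em dash inside a word, a spaced en dash, a minus before a digit.
text = "well\u2014known \u2013 fact, 3 \u22124"
print(find_unspaced_dashes(text))
```

The en dash with a space on each side is left out, while the em dash and the minus sign are reported.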

This technique is also described in the "Matching dots or commas as (not) part of numbers" YT video of mine.

sln · Accepted Answer · 2022-02-28 17:23:01Z

To find all the dash-like characters that are not surrounded by spaces:

Edited:

Notes: the classes in the OP's question, and everything copied and pasted
from it, use the \u syntax for \u10EAD. This is invalid, because
\u takes exactly four hex digits and so only covers the BMP.
The actual U+10EAD character is in the SMP and has to be written
with the eight-digit \UXXXXXXXX syntax.
In the regex, \u10EAD is interpreted as \u10EA plus D.
Suggested change: \u10EAD = \U00010EAD
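
The difference is easy to confirm in Python itself. A small sketch (the variable names are illustrative only) showing that the four-digit form yields two characters and puts a plain "D" into the class:

```python
import re

# In a regular (non-raw) string literal, \u consumes exactly four hex digits.
four_digit = '\u10EAD'       # U+10EA followed by the letter "D"
eight_digit = '\U00010EAD'   # the single code point U+10EAD

print(len(four_digit), len(eight_digit))

# The re module parses \u and \U escapes inside raw-string patterns the same way.
wrong_class = re.compile(r'[\u10EAD]')
right_class = re.compile(r'[\U00010EAD]')

print(bool(wrong_class.search('D')))           # the class accidentally contains "D"
print(bool(right_class.search('D')))           # the corrected class does not
print(bool(right_class.search(eight_digit)))   # but it does contain U+10EAD
```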

The question is about the Unicode 14 dashes.
(I don't think the list has changed in many years, but I'm not sure.)
There are 30 of them; they can be queried from the UCD with the property \p{Dash}.
The General Category Dash Punctuation, aka \p{Pd}, only returns
26 characters. The OP added an extra dash, making it 27.
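
The \p{Pd} side of this can be inspected with the standard library. As far as I know, the unicodedata module exposes the General Category but not the binary Dash property, so the sketch below only lists Pd characters. The count it prints depends on the Unicode version bundled with the interpreter, so it may differ from 26.

```python
import sys
import unicodedata

# Collect every code point whose General Category is Pd (Dash Punctuation).
pd_chars = [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == 'Pd']

# Show which Unicode version this interpreter ships and how many Pd characters it knows.
print(unicodedata.unidata_version, len(pd_chars))

# U+2212 MINUS SIGN is a math symbol (Sm), so \p{Pd} alone would miss it.
print(unicodedata.category('\u2212'))
```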

The only flavor on Regex101 that supports the \p{Dash} property
is the updated JavaScript one. Perl probably does by now (I don't know for sure),
and it seems PCRE does not, though I'm not sure.
So for now it's better to use a class with the 30 individual dashes
instead of the \p{Pd} property.

These sample regexes all use literal Unicode characters.
If you need help converting them into \UXXXXXXXX syntax, let me know.
I do it back and forth via software.

Python:
(?<!\s)[-֊־᐀᠆‐‑‒–—―⁓⁻₋−⸗⸚⸺⸻⹀⹝〜〰゠︱︲﹘﹣－𐺭](?!\s)
https://regex101.com/r/cnUiL6/1

JavaScript:
(?<!\s)[\p{Dash}](?!\s)
https://regex101.com/r/LW9R9J/1
(?<!\s)[-֊־᐀᠆‐‑‒–—―⁓⁻₋−⸗⸚⸺⸻⹀⹝〜〰゠︱︲﹘﹣－𐺭](?!\s)
https://regex101.com/r/jDc6DO/1

The class of dashes will probably not change in the next 20 years or so.

00002D    -    HYPHEN-MINUS
00058A    ֊    ARMENIAN HYPHEN
0005BE    ־    HEBREW PUNCTUATION MAQAF
001400    ᐀    CANADIAN SYLLABICS HYPHEN
001806    ᠆    MONGOLIAN TODO SOFT HYPHEN
002010    ‐    HYPHEN
002011    ‑    NON-BREAKING HYPHEN
002012    ‒    FIGURE DASH
002013    –    EN DASH
002014    —    EM DASH
002015    ―    HORIZONTAL BAR
002053    ⁓    SWUNG DASH
00207B    ⁻    SUPERSCRIPT MINUS
00208B    ₋    SUBSCRIPT MINUS
002212    −    MINUS SIGN
002E17    ⸗    DOUBLE OBLIQUE HYPHEN
002E1A    ⸚    HYPHEN WITH DIAERESIS
002E3A    ⸺    TWO-EM DASH
002E3B    ⸻    THREE-EM DASH
002E40    ⹀    DOUBLE HYPHEN
002E5D    ⹝    OBLIQUE HYPHEN
00301C    〜    WAVE DASH
003030    〰    WAVY DASH
0030A0    ゠    KATAKANA-HIRAGANA DOUBLE HYPHEN
00FE31    ︱    PRESENTATION FORM FOR VERTICAL EM DASH
00FE32    ︲    PRESENTATION FORM FOR VERTICAL EN DASH
00FE58    ﹘    SMALL EM DASH
00FE63    ﹣    SMALL HYPHEN-MINUS
00FF0D    －    FULLWIDTH HYPHEN-MINUS
010EAD    𐺭    YEZIDI HYPHENATION MARK
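
If pasting literal characters into source code is inconvenient, the same class can be built from the code points in the table above. This is a sketch under stated assumptions: the names `DASH_CODE_POINTS`, `DASH_CLASS` and `NO_SPACE_EITHER_SIDE` are hypothetical, and the sample string is made up. Note that this pattern shape only matches a dash with no whitespace on either side, so a dash with whitespace on just one side is skipped.

```python
import re

# Code points copied from the table above (the 30 characters with the Dash property).
DASH_CODE_POINTS = [
    0x002D, 0x058A, 0x05BE, 0x1400, 0x1806, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014,
    0x2015, 0x2053, 0x207B, 0x208B, 0x2212, 0x2E17, 0x2E1A, 0x2E3A, 0x2E3B, 0x2E40,
    0x2E5D, 0x301C, 0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58, 0xFE63, 0xFF0D, 0x10EAD,
]

# re.escape keeps "-" from being read as a range operator inside the class.
DASH_CLASS = '[' + ''.join(re.escape(chr(cp)) for cp in DASH_CODE_POINTS) + ']'

# Same shape as the Python pattern above: no whitespace on either side of the dash.
NO_SPACE_EITHER_SIDE = re.compile(r'(?<!\s)' + DASH_CLASS + r'(?!\s)')

# Hypothetical sample: a hyphen inside a word, a spaced em dash, a superscript minus.
sample = "self-made \u2014 x\u207By"
print([(m.start(), m.group()) for m in NO_SPACE_EITHER_SIDE.finditer(sample)])
```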
