How to use parsec to get sub-strings of specific pattern in a string

Question

I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,

"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."

the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.

With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:

hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
           ; many space
           ; c1 <- digit
           ; cs <- many hao
           ; return (c:c1:cs)
           }
parse tuhao "" str -- can parse the str and get one sub-string.

However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.

So I appreciate any tip! ^_^

What's your input string? I'd like to take a look at this but though you give your expected output (["F12", "F12", "F1(a)", "F2a", "F5-A", "F34-5"]) but not the input — Adam Smith
– Adam Smith, Commented Apr 17, 2018 at 5:43
@AdamSmith apparently the problem statement itself is the input: "I want to choose F12 or F 12 from F1(a)…" — Zeta
– Zeta, Commented Apr 17, 2018 at 5:49
@AdamSmith As Zeta said, the problem description is the input. — Z-Y.L
– Z-Y.L, Commented Apr 17, 2018 at 6:11
With [m| m<- subsequences "F1(a)F2aF5-AF34-5", elem m ["F12","F 12"]] I got one ["F12"] and that is probably right. subsequence sequences by character so meaningless characters can be removes. I took out only space and comma. — fp_mora
– fp_mora, Commented Apr 18, 2018 at 2:40

sshine · Accepted Answer · 2018-04-17 07:49:58Z

2

F12, F 12, F1(a), F2a, F5-A, F34-5

This is an incomplete description, so I'll make some guesses.

I would start by defining a type that can contain the logical parts of these expressions. E.g.
```
newtype F = F (Int, Maybe String) deriving Show
```
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.

Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]

I would then proceed to write parsers for each of these sub-parts and compose them:

module Main where

import Text.Parsec
import Data.Maybe (catMaybes)

symbol :: String -> Parser String
symbol s = string s <* spaces

parens :: Parser a -> Parser a
parens = between (string "(") (string ")")

digits :: Parser Int
digits = read <$> many1 digit

parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
  where
    firstPart :: Parser Int
    firstPart = symbol "F" >> digits

    secondPart :: Parser (Maybe String)
    secondPart = optionMaybe $ choice
      [ many1 letter
      , parens (many1 letter)
      , string "-" >> many1 alphaNum
      ]

(As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,

extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
        <|> do anyChar >> extract p
        <|> do eof >> return []

readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s

main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")

This prints:

Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]

Takeaways:

You can parse optional whitespace using token parsing (symbol).
You can parse optional parts with option, optionMaybe or optional.
You can alternate between combinators using a <|> b <|> c or choice [a, b, c].
When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)

[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:

newtype F = F (Int, Maybe String, Maybe String)

edited Apr 17, 2018 at 7:49

answered Apr 17, 2018 at 6:24

sshine

16.2k1 gold badge45 silver badges71 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Jon Purdy Over a year ago

This doesn’t address the main question of how to extract all of these from the input. You could do that with something like catMaybes <$> many (try (Just <$> parseF) <|> (Nothing <$ anyChar)) (untested)

sshine Over a year ago

@JonPurdy: Thanks; that does make the parser a lot more useful.

Z-Y.L Over a year ago

@SimonShine & Jon Purdy Thank both of you! The suggestions are really helpful for me, esp. the usage of many and try, and it is the key to handle the unwanted character with <$, with which I can solve my real problem.

James Brock · Accepted Answer · 2019-08-31 04:14:50Z

We can get sub-strings of specific pattern in a string with the findAll combinator from replace-megaparsec.

Notice that this tuhao parser doesn't actually return anything. The findAll combinator just checks for success of the parser to find sub-strings which match the pattern.

import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either

let tuhao :: Parsec Void String ()
    tuhao = do
        void $ single 'F'
        void $ space
        void $ digitChar
        void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"

input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."

rights $ fromJust $ parseMaybe (findAll tuhao) input

["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

Collectives™ on Stack Overflow

How to use parsec to get sub-strings of specific pattern in a string

2 Answers 2

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related