4

I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,

"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."

the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.

With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:

hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
           ; many space
           ; c1 <- digit
           ; cs <- many hao
           ; return (c:c1:cs)
           }
parse tuhao "" str -- can parse the str and get one sub-string.

However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.

So I appreciate any tip! ^_^

4
  • What's your input string? I'd like to take a look at this but though you give your expected output (["F12", "F12", "F1(a)", "F2a", "F5-A", "F34-5"]) but not the input Commented Apr 17, 2018 at 5:43
  • @AdamSmith apparently the problem statement itself is the input: "I want to choose F12 or F 12 from F1(a)…" Commented Apr 17, 2018 at 5:49
  • @AdamSmith As Zeta said, the problem description is the input. Commented Apr 17, 2018 at 6:11
  • With [m| m<- subsequences "F1(a)F2aF5-AF34-5", elem m ["F12","F 12"]] I got one ["F12"] and that is probably right. subsequence sequences by character so meaningless characters can be removes. I took out only space and comma. Commented Apr 18, 2018 at 2:40

2 Answers 2

2

F12, F 12, F1(a), F2a, F5-A, F34-5

This is an incomplete description, so I'll make some guesses.

  1. I would start by defining a type that can contain the logical parts of these expressions. E.g.

    newtype F = F (Int, Maybe String) deriving Show
    

    That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.

    Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]

  2. I would then proceed to write parsers for each of these sub-parts and compose them:

    module Main where
    
    import Text.Parsec
    import Data.Maybe (catMaybes)
    
    symbol :: String -> Parser String
    symbol s = string s <* spaces
    
    parens :: Parser a -> Parser a
    parens = between (string "(") (string ")")
    
    digits :: Parser Int
    digits = read <$> many1 digit
    
    parseF :: Parser F
    parseF = curry F <$> firstPart <*> secondPart
      where
        firstPart :: Parser Int
        firstPart = symbol "F" >> digits
    
        secondPart :: Parser (Maybe String)
        secondPart = optionMaybe $ choice
          [ many1 letter
          , parens (many1 letter)
          , string "-" >> many1 alphaNum
          ]
    
  3. (As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,

    extract :: Parser a -> Parser [a]
    extract p = do (:) <$> try p <*> extract p
            <|> do anyChar >> extract p
            <|> do eof >> return []
    
    readFs :: String -> Either ParseError [F]
    readFs s = parse (extract parseF) "" s
    
    main :: IO ()
    main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
    

    This prints:

    Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
    

Takeaways:

  • You can parse optional whitespace using token parsing (symbol).

  • You can parse optional parts with option, optionMaybe or optional.

  • You can alternate between combinators using a <|> b <|> c or choice [a, b, c].

  • When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)

[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:

newtype F = F (Int, Maybe String, Maybe String)
Sign up to request clarification or add additional context in comments.

3 Comments

This doesn’t address the main question of how to extract all of these from the input. You could do that with something like catMaybes <$> many (try (Just <$> parseF) <|> (Nothing <$ anyChar)) (untested)
@JonPurdy: Thanks; that does make the parser a lot more useful.
@SimonShine & Jon Purdy Thank both of you! The suggestions are really helpful for me, esp. the usage of many and try, and it is the key to handle the unwanted character with <$, with which I can solve my real problem.
1

We can get sub-strings of specific pattern in a string with the findAll combinator from replace-megaparsec.

Notice that this tuhao parser doesn't actually return anything. The findAll combinator just checks for success of the parser to find sub-strings which match the pattern.

import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either

let tuhao :: Parsec Void String ()
    tuhao = do
        void $ single 'F'
        void $ space
        void $ digitChar
        void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"

input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."

rights $ fromJust $ parseMaybe (findAll tuhao) input
["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.