25

I have a huge NSString containing HTML text. The string is more than 3,500,000 characters long. How can I convert this HTML to an NSString containing plain text? I was using NSScanner, but it is too slow. Any ideas?

1

7 Answers

69

It depends on which iOS version you are targeting. Since iOS 7 there is a built-in method that not only strips the HTML tags but also applies the formatting to the string:

Xcode 9/Swift 4

if let htmlStringData = htmlString.data(using: .utf8), let attributedString = try? NSAttributedString(data: htmlStringData, options: [.documentType : NSAttributedString.DocumentType.html], documentAttributes: nil) {
    print(attributedString)
}

You can even create an extension like this:

extension String {
    var htmlToAttributedString: NSAttributedString? {
        guard let data = self.data(using: .utf8) else {
            return nil
        }

        do {
            return try NSAttributedString(data: data, options: [.documentType : NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
        } catch {
            print("Cannot convert html string to attributed string: \(error)")
            return nil
        }
    }
}

Note that this sample code uses UTF-8 encoding. You could also write a function instead of a computed property and take the encoding as a parameter.

Swift 3

let attributedString = try NSAttributedString(data: htmlString.data(using: .utf8)!,
                                              options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType],
                                              documentAttributes: nil)

Objective-C

[[NSAttributedString alloc] initWithData:[htmlString dataUsingEncoding:NSUTF8StringEncoding] options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]} documentAttributes:nil error:nil];

If you just need to remove everything between < and > (a quick-and-dirty approach that can be problematic if those characters appear in the text itself), use this:

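// Category method on NSString; call it as [htmlString stringByStrippingHTML].
- (NSString *)stringByStrippingHTML {
   NSRange r;
   NSString *s = [self copy]; // under manual reference counting, append autorelease
   // Delete one tag-like match at a time until none remain.
   while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
     s = [s stringByReplacingCharactersInRange:r withString:@""];
   return s;
}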
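
The loop above restarts its search from the beginning of the string and builds a new copy on every match, which is likely to be costly on multi-megabyte input. As a minimal Swift sketch of the same quick-and-dirty idea done in a single regex pass (the name `strippingHTMLTags` is hypothetical, not a Foundation API):

```swift
import Foundation

extension String {
    /// Removes every substring that looks like an HTML tag in one regex pass.
    /// `strippingHTMLTags` is a hypothetical helper name for this sketch.
    func strippingHTMLTags() -> String {
        // Same pattern as the loop version; Foundation performs all the
        // replacements in one call instead of one search per tag.
        return replacingOccurrences(of: "<[^>]+>",
                                    with: "",
                                    options: .regularExpression)
    }
}

let sample = "<p>Hello, <b>world</b>!</p>"
print(sample.strippingHTMLTags())
```

The same caveat applies: any literal text that happens to sit between a < and a > is removed as well.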

5 Comments

I got an out-of-memory exception on the simulator =(
How do I replace HTML entities like &amp; with their plain-text equivalents, i.e. &?
@ThEuSeFuL check this answer: stackoverflow.com/questions/1105169/…
Perfect answer for me. +1 for you.
Keep in mind that using NSHTMLTextDocumentType requires running synchronously on the main thread, which blocks it for the duration of the conversion.
16

I solved my problem with a scanner, but I do not run it over the whole text at once. I run it on each 10,000-character chunk and then concatenate all the parts. My code is below:

-(NSString *)convertHTML:(NSString *)html {

    NSScanner *myScanner;
    NSString *text = nil;
    myScanner = [NSScanner scannerWithString:html];

    while ([myScanner isAtEnd] == NO) {

        [myScanner scanUpToString:@"<" intoString:NULL] ;

        [myScanner scanUpToString:@">" intoString:&text] ;

        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@""];
    }
    //
    html = [html stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    return html;
}

Swift 4:

func htmlToString(html: String) -> String {
    var htmlStr = html
    let scanner = Scanner(string: htmlStr)
    var text: NSString? = nil
    while !scanner.isAtEnd {
        // Skip plain text up to the next "<".
        scanner.scanUpTo("<", into: nil)
        // Strip only when a "<..." run was actually scanned.
        if scanner.scanUpTo(">", into: &text), let tag = text {
            htmlStr = htmlStr.replacingOccurrences(of: "\(tag)>", with: "")
        }
    }
    htmlStr = htmlStr.trimmingCharacters(in: .whitespacesAndNewlines)
    return htmlStr
}

2 Comments

Add an @autoreleasepool inside the while loop to keep memory usage down.
Note: this will also replace anything between tags, so if you have an email address like "Some Name <[email protected]>" it'll strip out <[email protected]>. That's probably not what you want. It needs to possibly look up against a map of known HTML tags.
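
One way to act on that last suggestion is to strip a `<...>` run only when its name appears in a set of known HTML tags. The following is a rough sketch under stated assumptions: `knownTags` and `stripKnownTags` are hypothetical names, the tag list is deliberately incomplete, and `user@example.com` is only a placeholder address. It walks the input once and appends to an output buffer instead of rewriting the whole string for every tag.

```swift
import Foundation

// A small, deliberately incomplete set of tag names (an assumption of this sketch).
let knownTags: Set<String> = ["a", "b", "br", "div", "em", "h1", "h2", "h3", "i",
                              "li", "ol", "p", "span", "strong", "ul"]

/// Strips only those `<...>` runs whose name is in `knownTags`.
func stripKnownTags(_ html: String) -> String {
    var output = ""
    output.reserveCapacity(html.utf8.count)
    var index = html.startIndex

    while index < html.endIndex {
        // Copy everything up to the next "<" unchanged.
        guard let open = html[index...].firstIndex(of: "<") else {
            output.append(contentsOf: html[index...])
            break
        }
        output.append(contentsOf: html[index..<open])

        // No closing ">" at all: the rest is plain text.
        guard let close = html[open...].firstIndex(of: ">") else {
            output.append(contentsOf: html[open...])
            break
        }

        // Tag name: skip "<" and an optional "/", then read letters and digits.
        var inner = html[html.index(after: open)..<close]
        if inner.hasPrefix("/") { inner = inner.dropFirst() }
        let name = inner.prefix(while: { $0.isLetter || $0.isNumber })

        // The name must end at ">", "/" or whitespace, so "<a@b.com>" is not a tag.
        let terminator = inner[name.endIndex...].first
        let endsCleanly = terminator.map { $0 == "/" || $0.isWhitespace } ?? true

        if endsCleanly && knownTags.contains(name.lowercased()) {
            index = html.index(after: close)   // drop the whole tag
        } else {
            output.append("<")                 // keep the bracket as literal text
            index = html.index(after: open)
        }
    }
    return output
}

print(stripKnownTags("Some Name <user@example.com> wrote: <b>hi</b>"))
```

This still is not an HTML parser: comments, script and style bodies, and attribute values that contain > are not handled.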
2

Objective C

+ (NSString*)textToHtml:(NSString*)htmlString
{
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&quot;" withString:@"\""];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&apos;" withString:@"'"];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&lt;" withString:@"<"];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&gt;" withString:@">"];
    // Decode &amp; last, so that text such as "&amp;lt;" becomes "&lt;" and is not decoded twice.
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&amp;" withString:@"&"];
    return htmlString;
}
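
Chained replacements rescan the whole string once per entity, and their order matters: decoding &amp; before &lt; would turn the literal text &amp;lt; into <. A possible single-pass Swift sketch for the same five entities follows; `namedEntities` and `decodeBasicEntities` are hypothetical names, and numeric entities such as &#39; are not handled.

```swift
import Foundation

// The five named entities covered by this sketch.
let namedEntities: [String: Character] = [
    "quot": "\"", "apos": "'", "amp": "&", "lt": "<", "gt": ">"
]

/// Decodes the entities in `namedEntities` in one left-to-right pass.
func decodeBasicEntities(_ text: String) -> String {
    var output = ""
    output.reserveCapacity(text.utf8.count)
    var index = text.startIndex

    while index < text.endIndex {
        // Copy everything up to the next "&" unchanged.
        guard let amp = text[index...].firstIndex(of: "&") else {
            output.append(contentsOf: text[index...])
            break
        }
        output.append(contentsOf: text[index..<amp])

        // The longest name here has four characters, so look for ";"
        // within the five characters that follow the "&".
        let nameStart = text.index(after: amp)
        let limit = text.index(nameStart, offsetBy: 5, limitedBy: text.endIndex)
            ?? text.endIndex
        if let semicolon = text[nameStart..<limit].firstIndex(of: ";"),
           let decoded = namedEntities[String(text[nameStart..<semicolon])] {
            output.append(decoded)               // emit the decoded character once
            index = text.index(after: semicolon)
        } else {
            output.append("&")                   // not a known entity: keep the "&"
            index = nameStart
        }
    }
    return output
}

print(decodeBasicEntities("Tom &amp; Jerry &lt;3"))
```

Because each decoded character goes straight to the output and is never rescanned, double decoding cannot occur regardless of the order of the dictionary entries.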

Hope this helps!

1 Comment

And why not htmlToText ?
1

For Swift:

NSAttributedString(data:(htmlString as! String).dataUsingEncoding(NSUTF8StringEncoding, allowLossyConversion: true
            )!, options:[NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: NSNumber(unsignedLong: NSUTF8StringEncoding)], documentAttributes: nil, error: nil)!


1
- (NSString *)stringByStrippingHTML:(NSString *)inputString
{
    NSMutableString *outString = nil; // stays nil when inputString is nil

    if (inputString)
    {
        outString = [[NSMutableString alloc] initWithString:inputString];

        if ([inputString length] > 0)
        {
            NSRange r;

            while ((r = [outString rangeOfString:@"<[^>]+>|&nbsp;" options:NSRegularExpressionSearch]).location != NSNotFound)
            {
                [outString deleteCharactersInRange:r];
            }      
        }
    }

    return outString; 
}


0

Swift 4:

do {
    // Swift 4 renamed the option keys: use .documentType and DocumentType.html.
    let cleanString = try NSAttributedString(data: htmlContent.data(using: .utf8)!,
                                             options: [.documentType: NSAttributedString.DocumentType.html],
                                             documentAttributes: nil)
    print(cleanString.string)
} catch {
    print("Something went wrong: \(error)")
}


0

It can be made more generic by passing the encoding as a parameter, but as an example, here is a category:

@implementation NSString (CSExtension)

    - (NSString *)htmlToText {
        return [NSAttributedString.alloc
                initWithData:[self dataUsingEncoding:NSUnicodeStringEncoding]
                     options:@{NSDocumentTypeDocumentOption: NSHTMLTextDocumentType}
          documentAttributes:nil error:nil].string;
    }

@end

6 Comments

In this method, where are you passing the string? Maybe on self...?
@Raviteja_DevObal Ah sorry, this was a category. I could have been clearer, will edit...
But I don't believe this answer is correct anymore, as there is a requirement for large HTML and this is terribly slow. I ended up using DTCoreText with some additional modifications to show images correctly; my solution is public on GitHub though.
This method is not converting dynamic HTML text from a service. I mean, I don't know which HTML content is coming from the service. But replacing with custom methods
Sorry, that was a typo: But I don't believe this answer is NOT correct anymore, as there is a requirement for large HTML and this is terribly slow. I ended up using DTCoreText with some additional modifications to show images correctly; my solution is public on GitHub though.