
I want to process a text file using a Spark RDD. The file contains data like this:

----------------------------*-----------------------

   state:xx             sub:z    |Basic info

company:abc        rate:123      |

----------------------------*------------------------

                     Date: 12-03-2019

I expect the output in the following format:

State:XX
Sub:z
Company:abc
rate:123
Date:12-03-2019

When I tried to remove the special character '-' using data1 = data.ReplaceAll('-', ""), it also removed the hyphens from the date, giving 12032019, but the date should stay as 12-03-2019. I also cannot work out how to move sub:z, company:abc, and rate:123 onto new lines. Please help.

  • With more details, people can help you more. What does the whole file look like? How many records might it have? Commented Aug 9, 2019 at 7:02
  • zhang-yuan Thanks for your response. It is a large file of around 600 pages, and it also contains data in other formats. The data shown above is the opening portion of the file, and I am looking for an initial solution for it. Commented Aug 9, 2019 at 7:11

1 Answer

Without further details, here are my suggestions:

  1. Remove the lines that start with -. You should get something like this:
state:xx sub:z |Basic info
company:abc rate:123 |
Date: 12-03-2019
  2. Then remove everything after the |:
state:xx sub:z
company:abc rate:123
Date: 12-03-2019
  3. Replace each blank space with a newline (\n).

    I am not sure whether Date: is followed by a blank space.

    If it is, replace 'Date: ' with 'Date:' first.

state:xx
sub:z
company:abc
rate:123
Date:12-03-2019
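
The three steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the helper name clean_line is hypothetical, the input is assumed to look like the sample in the question, and fields are assumed to be separated by one or more spaces. It uses plain string handling so it runs without Spark.

```python
import re

def clean_line(line):
    """Return the key:value fields found on one raw line (possibly none)."""
    text = line.strip()
    # Step 1: drop separator lines (they start with '-') and blank lines.
    if not text or text.startswith("-"):
        return []
    # Step 2: discard everything after the first '|'.
    text = text.split("|", 1)[0]
    # Remove whitespace after a colon so 'Date: 12-03-2019' becomes 'Date:12-03-2019'.
    text = re.sub(r":\s+", ":", text)
    # Step 3: split on the remaining whitespace; each token is one field.
    return text.split()

# Sample lines copied from the question.
sample = [
    "----------------------------*-----------------------",
    "   state:xx             sub:z    |Basic info",
    "company:abc        rate:123      |",
    "----------------------------*------------------------",
    "                     Date: 12-03-2019",
]

fields = [field for line in sample for field in clean_line(line)]
print("\n".join(fields))
```

Because the hyphens are never replaced globally, the date keeps its 12-03-2019 form. With PySpark, the same function could presumably be applied with something like sc.textFile(path).flatMap(clean_line), where sc and path are placeholders for your own SparkContext and file location.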

Hope this helps.
