I used XPath Helper to help me scrape a table from a website that requires login.

Code:

g=driver.find_element_by_xpath("//table[@id='DataGrid']/tbody").text
print(g)

The result looks like this (its data type is a string):

#@5@#*&(
&*(%#IO
!@%&*(O)
2018/02/02 206 MAZDA MAZDA 5 5660-ES 2006 01 1999 70000 white A
2018/02/02 210 BMW 330 9378-W6 2006 01 2996 80000 black C
2018/02/02 211 MITSUBISHI FORTIS ALK-3501 2015 04 1798 100000 white C+

I want to write this string to a CSV file without the first three lines, using commas to separate the fields; otherwise all the values end up combined in a single cell.

Code here:

if "#@5@#*&(" in g and "&*(%#IO" in g and "!@%&*(O)" in g:
    g=g.replace("#@5@#*&(", "")
    g=g.replace("&*(%#IO", "")
    g=g.replace("!@%&*(O)", "")
    g=g.replace(' ', ',')  
print(g)
file_name="C:/Test.csv"
with open(file_name,'a') as file:
    file.write(g+'\n')

What bothers me is that I don't know how to delete the first three lines. I replaced them with empty strings, but the lines themselves are still there: every time I write to the CSV, they still take up three (now blank) rows. The second problem is that replacing every space with a comma introduces errors. For example, "MAZDA 5" should not be split. Is there a good way to solve this, or should I just correct it in the CSV file?

HTML source of one table row:

<tr align="left" style="height:40px;">
  <td>2018/02/02</td>
  <td>206</td>
  <td>MAZDA</td>
  <td>MAZDA 5</td>
  <td>5660-ES</td>
  <td>2006</td>
  <td>01</td>
  <td>1999</td>
  <td>70000</td>
  <td>white</td>
  <td align="center" valign="middle"></td>
  <td>A</td>
</tr>

2 Answers

1

To remove the first three lines, you could either:

  • include the newline character in the string you replace (for example, replace "#@5@#*&(\n" rather than "#@5@#*&("), so that the line break disappears together with the text; or
  • split the original string into lines, drop the first three, and join the rest again: "\n".join(g.split("\n")[3:])
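
As a minimal sketch of the second option, using a shortened copy of the string from the question (the helper name drop_first_lines is made up for illustration):

```python
# Shortened sample of the scraped text from the question.
g = (
    "#@5@#*&(\n"
    "&*(%#IO\n"
    "!@%&*(O)\n"
    "2018/02/02 206 MAZDA MAZDA 5 5660-ES 2006 01 1999 70000 white A\n"
    "2018/02/02 210 BMW 330 9378-W6 2006 01 2996 80000 black C"
)


def drop_first_lines(text, n=3):
    """Return text without its first n lines (hypothetical helper)."""
    return "\n".join(text.split("\n")[n:])


cleaned = drop_first_lines(g)
print(cleaned)
```

Unlike replacing each unwanted line with an empty string, this also removes the line breaks, so no blank rows are left at the top of the CSV.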

The second issue is much harder: by saving the entire content of tbody into one variable, you have effectively lost the information about separators. You can no longer tell whether a space was part of the original cell text or was added automatically as a separator between cells. I'd suggest scraping each td cell individually.
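
A rough sketch of what scraping cell by cell could look like, assuming the table id DataGrid from the question and a Selenium driver that is already logged in. The function names scrape_rows and append_rows are made up for illustration, and find_elements("xpath", ...) is used here instead of the question's find_element_by_xpath, which newer Selenium releases no longer provide:

```python
import csv


def scrape_rows(driver, table_id="DataGrid"):
    """Return one list of cell texts per table row (hypothetical helper)."""
    rows = []
    row_xpath = f"//table[@id='{table_id}']/tbody/tr"
    for tr in driver.find_elements("xpath", row_xpath):
        # Read every <td> on its own, so "MAZDA 5" stays a single field.
        cells = [td.text.strip() for td in tr.find_elements("xpath", "./td")]
        if cells:  # skip rows that contain no <td> cells at all
            rows.append(cells)
    return rows


def append_rows(rows, file_name):
    """Append the rows to a CSV file; csv.writer handles commas and quoting."""
    with open(file_name, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Assumed usage, with `driver` being the logged-in WebDriver from the question:
# append_rows(scrape_rows(driver), "C:/Test.csv")
```

The empty td in the sample row becomes an empty CSV column. If the three unwanted lines turn out to be ordinary table rows, slicing the result (for example rows[3:]) would drop them; that depends on the page's markup, which the question does not show.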

1

To remove the first few lines from a string, just figure out the position of the first relevant piece of info.

temp = "adknsad"

temp[2:] evaluates to "knsad"

It should be the same for the piece of string you have.
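
One possible way to find that position, sketched under the assumption that the first relevant piece of info is always a date in YYYY/MM/DD form (as in the sample output from the question):

```python
import re

# Shortened sample of the scraped text from the question.
g = (
    "#@5@#*&(\n"
    "&*(%#IO\n"
    "!@%&*(O)\n"
    "2018/02/02 206 MAZDA MAZDA 5 5660-ES 2006 01 1999 70000 white A"
)

# Assumption: the data starts at the first thing that looks like a date.
match = re.search(r"\d{4}/\d{2}/\d{2}", g)

# Slice from the start of the match; keep the text unchanged if no date is found.
data = g[match.start():] if match else g
print(data)
```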

I don't think there is any simple way to solve the Mazda 5 thing.
