I am trying to read multiple XML files and extract data from them. I want to extract two sets of data and save them to two separate CSV files.

The extractData function gives me a list of data from a single file. In the createCSV function, I keep only the data I need.

I want to save the extracted data from all the files I read into one CSV file. Currently, only the data from the last file is saved.

import json
import os
import pandas as pd
import numpy as np
import bs4
import glob
import csv

def extractData(path):
    for filename in glob.glob(os.path.join(path, '*.xml')):
        genre = bs4.BeautifulSoup(open(filename, 'r', encoding="utf8"), features="lxml")
        #print(genre)
        if genre.find_all("name") == []:
            print('Not Available')
        else:
            tags = genre.find_all("name")
            genre_list = []
            for name in tags:
                genres = name.text.strip()
                genre_list.append(genres)
            #print(genre_list)
    return genre_list

def createCSV(list_genre):
    new_artist_list = []
    new_genre_list = []
    complete_list = pd.DataFrame(list_genre)
    new_col = len(complete_list)
    #print(complete_list)
    #print(new_col)
    if new_col == 2:
        #for complete_list in complete_lists:
        column_names_1 = ["Song", "Artist"]
        final_list_1 = complete_list.T
        final_list_1.columns = column_names_1
        #print(final_list_1)
        new_artist_list.append(final_list_1)
        #print(new_artist_list)
    elif new_col > 2:
        #for complete_list in complete_lists:
        column_names_2 = ["Song", "Artist", "Genre"]
        #print(complete_list)
        final_list_2 = complete_list.T
        final_list_3 = np.array(final_list_2)
        #print(final_list_3)
        song_list = final_list_3[:, 0]
        artist_list = final_list_3[:, 1]
        final_list_2['Genre'] = final_list_2[final_list_2.columns[2:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
        #print(final_list_2['Genre'])
        combined = np.concatenate((song_list, artist_list))
        combined_list = np.concatenate((combined, final_list_2['Genre']))
        #print(combined_list)
        complete_list_final = pd.DataFrame(combined_list)
        #print(complete_list_final)
        complete_list_final_1 = complete_list_final.T
        complete_list_final_1.columns = column_names_2
        new_genre_list.append(complete_list_final_1)
        #print(new_genre_list)
    print(new_artist_list)
    print(new_genre_list)
    return new_artist_list, new_genre_list

def writeCSV(partial_list, complete_list):
    partial_lists = partial_list
    complete_lists = complete_list
    artist_init = []
    complete_init = []
    for i, j in zip(partial_lists, complete_lists):
        with open('data_only_artist_name.csv', 'w', encoding="utf8") as data_partial:
            artist_csv = csv.writer(data_partial)
            artist_csv.writerow(i)
            artist_init.append(artist_csv)

        with open('data_complete.csv', 'w', encoding="utf8") as data_complete:
            complete_csv = csv.writer(data_complete)
            complete_csv.writerow(j)
            complete_init.append(complete_csv)
            #print(new_genre_list)
    a = artist_init
    b = complete_init
    return a, b

path = 'XML Data'
data = extractData(path)
print(data)
partial, complete = createCSV(data)
list_partial, list_complete = writeCSV(partial, complete)
  • In extractData, did you intentionally initialize genre_list inside the for loop? That is why you only get the data of the last file found. Edit: you should use a dict with the filename as key and lists as values. Commented Feb 17, 2020 at 12:50
  • Yes, I did intentionally initialize genre_list in the for loop. Should I initialize it outside the for loop? Commented Feb 17, 2020 at 12:53
  • Well, in the current state, the list is cleared each time a file is found, so you only get the data of the last file found. I'll try to provide an example using a dict. Commented Feb 17, 2020 at 12:55
  • I tried out your suggestion and I now get a complete list with all the data from the extractData function. But after that, the CSV file doesn't get saved. Commented Feb 17, 2020 at 13:00
  • Can you update your question with your latest version? Don't forget to mention that you edited it. Commented Feb 17, 2020 at 13:03
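
The dict suggested in the comments above could look roughly like the sketch below. It is only an illustration: extract_all is a hypothetical name, and the standard library's xml.etree.ElementTree is swapped in for BeautifulSoup so the example has no third-party dependencies. It also assumes the name elements are not in an XML namespace.

```python
import glob
import os
import xml.etree.ElementTree as ET


def extract_all(path):
    """Return {file name: [text of every <name> element]} for each XML file in path.

    Hypothetical helper; uses ElementTree instead of the question's BeautifulSoup.
    """
    results = {}  # created once, outside the loop, so earlier files are kept
    for filename in sorted(glob.glob(os.path.join(path, '*.xml'))):
        root = ET.parse(filename).getroot()
        # Collect the stripped text of each <name> element, skipping empty ones.
        names = [el.text.strip() for el in root.iter('name')
                 if el.text and el.text.strip()]
        results[os.path.basename(filename)] = names
    return results
```

Because results is created before the loop starts, every file adds an entry instead of replacing the previous one. Files without any name element end up with an empty list.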

2 Answers


In writeCSV you open the files in 'w' mode, so they are truncated to zero length on each pass through the loop, and at the end you only have the values from the last pass.

Use 'a' (append) mode instead (with open(..., 'a', encoding="utf8") as ...) and new data will be added after the existing data.
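
A minimal, self-contained sketch of the difference between the two modes. The helper names write_rows and read_rows and the sample rows are made up for illustration, and the files are written to a temporary directory.

```python
import csv
import os
import tempfile


def write_rows(path, rows, mode):
    """Reopen the file for every row, as the loop in the question does."""
    for row in rows:
        with open(path, mode, encoding='utf8', newline='') as f:
            csv.writer(f).writerow(row)


def read_rows(path):
    """Read a CSV file back as a list of rows."""
    with open(path, encoding='utf8', newline='') as f:
        return list(csv.reader(f))


rows = [['Song 1', 'Artist 1'], ['Song 2', 'Artist 2']]

with tempfile.TemporaryDirectory() as tmp:
    w_path = os.path.join(tmp, 'w_mode.csv')
    a_path = os.path.join(tmp, 'a_mode.csv')
    write_rows(w_path, rows, 'w')  # each open truncates the file
    write_rows(a_path, rows, 'a')  # each open continues at the end
    overwritten = read_rows(w_path)  # only the last row survives
    appended = read_rows(a_path)     # both rows are present
```

Note that append mode also keeps rows from earlier runs of the script, so the output file has to be cleared or deleted when a fresh result is wanted.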

1 Comment

Hello Serge, the suggestion which you gave doesn't seem to work.

As mentioned by Serge, open your files in append mode to avoid overwriting them in the for loop:

def writeCSV(partial_list, complete_list):
    artist_init = []
    complete_init = []
    # Create empty files (truncate any previous output) and close them right away.
    # Comment these two lines out to keep previously written files.
    open('data_only_artist_name.csv', 'w').close()
    open('data_complete.csv', 'w').close()
    for i, j in zip(partial_list, complete_list):
        with open('data_only_artist_name.csv', 'a', encoding="utf8") as data_partial:
            artist_csv = csv.writer(data_partial)
            artist_csv.writerow(i)
            artist_init.append(artist_csv)

        with open('data_complete.csv', 'a', encoding="utf8") as data_complete:
            complete_csv = csv.writer(data_complete)
            complete_csv.writerow(j)
            complete_init.append(complete_csv)
            #print(new_genre_list)
    return artist_init, complete_init

2 Comments

It doesn't work. I keep getting an empty csv file. Any suggestions?
Maybe check the input lists for writeCSV? I tried with a, b = writeCSV([['a', 'b', 'c'], ], [[1, 2, 3], ]) and obtained the two expected files.
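
One likely reason for the empty files, judging from the code in the question: createCSV appends to either new_artist_list or new_genre_list depending on new_col, never both, so one of the two lists passed to writeCSV is always empty. zip stops at the shorter input, so the loop body never runs. The sketch below shows that behaviour and one way around it, writing each list independently. write_rows and write_both are hypothetical helpers, and the rows are assumed to be plain lists of strings rather than DataFrames.

```python
import csv


def write_rows(path, rows):
    """Write every row to path in a single pass, opening the file once."""
    with open(path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows(rows)


def write_both(partial_rows, complete_rows, partial_path, complete_path):
    """Write each list to its own file without pairing them via zip."""
    write_rows(partial_path, partial_rows)
    write_rows(complete_path, complete_rows)


# zip stops at the shorter input, so an empty list means no iterations at all.
partial_rows = [['Song 1', 'Artist 1'], ['Song 2', 'Artist 2']]
complete_rows = []  # assumed case: no file had genre information
pairs = list(zip(partial_rows, complete_rows))  # empty list
```

Writing each file once, after all rows have been collected, also sidesteps the 'w' versus 'a' question, because the file is only opened a single time.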
