I have two files:

  • One with 'filename' and 'value_count' columns (ValueCounts.csv)
  • Another with 'filename' and 'latitude' and 'longitude' columns (GeoData.xlsx)

I started by creating a dataframe for each file, keeping only the columns I intend to use. My code for this is as follows:

Xeno_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')
img_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')

df_values = pd.DataFrame(Xeno_values, columns = ['A','B'])
df_coords = pd.DataFrame(img_coords, columns = ['L','M','W'])

However, when I print() each dataframe, all the column values are returned as 'NaN'.

How do I correct this? And how do I then write an if statement that iterates over the data and says:

if 'filename' (col 'A') in df_values == 'filename' (col 'W') in df_coords, append 'latitude' (col 'L') and 'longitude' (col 'M') to df_values

If any clarification is needed please do ask.

Thanks, R

1 Answer

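Check out the documentation for pandas read_csv (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and read_excel. These functions already return the data in a dataframe. Your code is trying to create a dataframe from a dataframe, which is fine if you don't specify columns. If you do specify columns, pandas looks for columns with exactly those labels, and any label that doesn't exist in the loaded data (presumably 'A' and 'B' here, if the real header is 'filename') comes back as a column of NaN values.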
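
A small sketch of that behaviour, using a made-up two-row frame in place of the real CSV. The header names 'filename' and 'value_count' are assumed from the question's description, and the data is invented:

```python
import pandas as pd

# Stand-in for what pd.read_csv already returns (hypothetical data).
loaded = pd.DataFrame({"filename": ["a.jpg", "b.jpg"], "value_count": [3, 5]})

# The labels 'A' and 'B' do not exist in `loaded`, so both columns come back empty.
wrong = pd.DataFrame(loaded, columns=["A", "B"])

# Using the real header names keeps the data.
right = pd.DataFrame(loaded, columns=["filename", "value_count"])

print(wrong)
print(right)
```

Printing df_values.columns right after loading is a quick way to see which labels are actually available.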

So if you want to load the dataframes:

df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')

Will do the trick. And if you just want specific columns:

df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv', usecols=['A','B'])
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx', usecols=['L','M','W'])

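Make sure that those column names actually exist in the header row of your files. Note that for read_excel a list of strings is treated as header names; if you mean the Excel column letters, pass them as a single string instead, e.g. usecols='L,M,W'.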
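
As a rough illustration of usecols with read_csv, here is a sketch that reads from an in-memory string rather than a file. The CSV content and the extra 'notes' column are invented for the example:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for ValueCounts.csv.
csv_text = "filename,value_count,notes\na.jpg,3,x\nb.jpg,5,y\n"

# Select columns by header name...
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["filename", "value_count"])

# ...or by zero-based position if you don't know the header names yet.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])

print(by_name)
```

Both calls load the same two columns here and skip 'notes'.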

If you want to rename columns (make sure you're doing all columns here):

df_values.columns = ['Filename', 'Date'] 
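
If you only want to rename some of the columns, DataFrame.rename with a mapping may be more convenient, since it leaves the other labels alone. A minimal sketch, where the starting headers 'A' and 'B' are hypothetical:

```python
import pandas as pd

# Hypothetical frame whose headers really are 'A' and 'B'.
df_values = pd.DataFrame({"A": ["a.jpg", "b.jpg"], "B": [3, 5]})

# Rename just one column; 'B' is left untouched.
df_values = df_values.rename(columns={"A": "filename"})

print(df_values.columns.tolist())
```

Whatever name you choose for the filename column needs to be the same in both dataframes if you then merge on it.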

For adding lat/long to df_values you could try:

df = pd.merge(df_values, df_coords[['filename', 'LAT', 'LONG']], on='filename', how='inner')

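This assumes that both the values and coords dataframes have a 'filename' column, and that the coords dataframe has 'LAT' and 'LONG' columns. With how='inner', only filenames that appear in both dataframes are kept, and each row gets the lat/long that belongs to its own filename.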
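
To see that merge matches rows on the filename rather than pasting columns side by side, here is a small sketch with invented data in which the two frames list different filenames in a different order:

```python
import pandas as pd

# Hypothetical data: 'b.jpg' has no coordinates, 'z.jpg' has no value count.
df_values = pd.DataFrame(
    {"filename": ["a.jpg", "b.jpg", "c.jpg"], "value_count": [3, 5, 2]}
)
df_coords = pd.DataFrame(
    {
        "filename": ["c.jpg", "a.jpg", "z.jpg"],
        "LAT": [51.5, 50.1, 0.0],
        "LONG": [-0.1, -5.5, 0.0],
    }
)

coords = df_coords[["filename", "LAT", "LONG"]]

# Inner join: keep only filenames found in both frames.
inner = pd.merge(df_values, coords, on="filename", how="inner")

# Left join: keep every row of df_values; unmatched rows get NaN lat/long.
left = pd.merge(df_values, coords, on="filename", how="left")

print(inner)
print(left)
```

In the inner result 'a.jpg' is paired with its own coordinates even though it sits in a different row position in df_coords. The left join is an alternative if you would rather keep filenames that have no coordinates.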

Lastly, check out a tutorial on pandas (https://www.tutorialspoint.com/python_pandas/index.htm). Becoming more familiar with it will help you wrangle data better.

4 Comments

Hi @eNc. Thanks for the quick response! This is great in theory; however, doesn't 'pd.merge' essentially just paste the selected columns alongside each other? If so, this is not appropriate, as not all the filenames in one file are included in the other, so as soon as one filename is dropped the entire dataset will be offset. Hence the need for an if statement to ensure that the lat/long data is only appended where the filenames are equal. Or does merge do this anyway?
@CephaloRhod I've updated the solution to use the intersection of filenames from both dataframes. Take a look at pandas merge (pandas.pydata.org/pandas-docs/stable/reference/api/…). The last line of code there takes the lat and long from coords and merges them with df_values where the filename column has the same value in both dataframes. It returns a new dataframe, df.
Fantastic, this is now working as desired. Thank you very much, you've been a massive help. I'm just an ecologist drowning in a programmer's world, but I'm desperately trying to learn! I'll make sure I check out some of those resources you linked me to more thoroughly when I have some more time. Thanks again mate @eNc :)
@CephaloRhod Glad it's working for you. Good luck with your research, and please accept my solution (check mark next to the arrows) if it's answered your question.
