
I have a Python script that calls an API which returns JSON. It takes the JSON string and saves it as a file on the local file system, and I then move that file into HDFS manually. I would like to change this so the script saves directly to HDFS instead of hitting the local file system first. I am currently trying to save the file using the hdfs dfs command, but I don't think the copy command is the right tool, because what I have at that point is a JSON string rather than a file.

Current Code

import urllib2
import json
import os

f = urllib2.urlopen('RESTful_API_URL.json')
json_string = json.loads(f.read().decode('utf-8'))
with open('/home/user/filename.json', 'w') as outfile:
    json.dump(json_string, outfile)

New Code

f = urllib2.urlopen('RESTful_API_URL.json')
json_string = json.loads(f.read().decode('utf-8'))
os.environ['json_string'] = json.dumps(json_string)
os.system('hdfs dfs -cp -f $json_string hdfs/user/test')

3 Answers


I think the problem is the same as in this thread: Stream data into hdfs directly without copying.

First, this command can redirect stdin to an HDFS file:

hadoop fs -put - /path/to/file/in/hdfs.txt

Then, you can do this in Python:

os.system('echo "%s" | hadoop fs -put - /path/to/file/in/hdfs.txt' % json.dumps(json_string))
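
One caveat: interpolating the JSON into echo "%s" breaks as soon as the JSON itself contains double quotes, which real JSON almost always does (several comments below run into exactly this). A minimal sketch of a workaround, assuming Python 2 to match the question's urllib2 and that hadoop is on the PATH:

import json
import pipes  # pipes.quote on Python 2; shlex.quote on Python 3

payload = json.dumps(json_string)  # serialize the parsed object back to JSON text
# pipes.quote escapes the payload so the shell passes it through unmangled
os.system('echo %s | hadoop fs -put - /path/to/file/in/hdfs.txt' % pipes.quote(payload))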

4 Comments

The file is created but no contents have been written to it in HDFS... what could be the issue?
If you use echo "%s", your JSON file will be stored without its double quotes.
I used the same technique, but the JSON string is not copied to HDFS. Any idea why?
@eval, when I use os.system('echo "%s" | hadoop fs -put - /path/to/file/in/hdfs.txt' % (json.dump(json_string))) I get a u prefix and single quotes, e.g. {u'manager': u'robert', u'name': u'kim'}. How can I remove the u and the quotes for every key and value in my JSON?

Take a look at the HDFS put command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#put

You can put to HDFS from the command line using standard input, with syntax like the following (-put - means read from stdin).

hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile

If you can start this command as a subprocess within your Python code, you should be able to pipe your JSON string to the subprocess.
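
For example, here is a minimal sketch of that idea (hedged: it assumes hadoop is on the PATH, reuses json_string, the parsed object from the question, and the example HDFS URI from above):

import json
import subprocess

# Start `hadoop fs -put -` and stream the JSON straight to its stdin;
# no shell is involved, so quotes inside the JSON need no escaping.
put = subprocess.Popen(
    ['hadoop', 'fs', '-put', '-', 'hdfs://nn.example.com/hadoop/hadoopfile'],
    stdin=subprocess.PIPE)
put.communicate(json.dumps(json_string))
if put.returncode != 0:
    raise IOError('hadoop fs -put exited with code %d' % put.returncode)

Because nothing touches the local disk, this also avoids the temporary file from the original script.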



This worked in my case:

import os
import requests

r = requests.get(url=url, headers=headers)
json_string = r.text  # raw JSON body; r.json() would give a Python dict, not a JSON string
os.system('echo "%s" | hadoop fs -put - /<your_hdfs_path>/json_name.json' % json_string)
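
A hedged footnote to this: os.system reports failure only through its return value, and -put fails if the target file already exists, so it is worth checking the status:

status = os.system('echo "%s" | hadoop fs -put - /<your_hdfs_path>/json_name.json' % json_string)
if status != 0:
    raise IOError('hadoop fs -put exited with status %d' % status)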

