I have 1,000 files; the start of each file looks like this:
!dataset_description = Analysis of POF D119 mutation.
!dataset_type = Expression profiling by array
!dataset_pubmed_id = 17318176
!dataset_platform = GPL1322
The aim: I want to transform this information into a list so I can build an Excel spreadsheet covering all the files; i.e. I want the list to look like this:
[Analysis_of_POF_D119_mutation,Expression_profiling_by_array,17318176,GPL1322]
I have this code. It only extracts the first variable, "!dataset_description", but I would subsequently run it on each variable of interest (i.e. !dataset_type, !dataset_pubmed_id, !dataset_platform):
import sys

OpenDataset = open(sys.argv[1], 'r')
Dataset = OpenDataset.readlines()
ListOfInformation = []
# Note: the lambda reads the loop variable "line" from the enclosing scope.
formatted_line = lambda x: "_".join(line.strip().split("=")[x].split())
for line in Dataset:
    if line.startswith("!dataset_description"):
        description = formatted_line(1)
        print(description)
The code works. However, I now understand the Python basics and want to start coding more "pythonically". I have two questions.
- It seems silly to use the lambda expression that I am using. "x" in the lambda expression will always be 1, since I will always want what comes after the "=" sign. Therefore x isn't really a "variable", but then I can't have a lambda expression without a variable.
I tried to change the variable to being what the line starts with, which is the true variable, doing something like this:
formatted_line = lambda x: "_".join(line.strip().split("=")[1].split()) if line.startswith(x)
However, this code returns a syntax error.
Would someone know how to make the above lambda expression work?
- These files have the potential to be really big. However, the information that I need is at the start of each file, and every line of it starts with the "!" symbol. So it seems silly to read in the whole file when I only need the first X lines, all of which start with "!" (the exact number of lines varies per file). Is there a way to read in just the lines starting with "!", or is it quicker just to use file.readlines()?
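One possible approach to the second question, as a rough sketch rather than a definitive answer: a file object can be iterated line by line, so looping over it directly and stopping at the first line that does not start with "!" avoids loading the rest of the file. This assumes the "!" lines form one contiguous block at the very top of each file; read_header_lines is a hypothetical helper name, not something from the code above.

```python
def read_header_lines(path):
    """Return the leading lines of the file at `path` that start with '!'.

    Assumes the '!' lines form one contiguous block at the top of the file.
    """
    header = []
    with open(path) as handle:
        # Iterating over the file object yields one line at a time,
        # unlike readlines(), which builds a list of every line in the file.
        for line in handle:
            if not line.startswith("!"):
                break  # first non-header line: stop reading here
            header.append(line.rstrip("\n"))
    return header
```

It could then be called as read_header_lines(sys.argv[1]) in place of the readlines() call. If "!" lines can also appear after other content, the break would need to become a continue, at the cost of scanning the whole file.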
1 always? Pass the line instead.
In the second lambda version, what will be the result of the expression if the line doesn't start with x? That is why it produces a syntax error.
Using lambda instead of def for a named function is generally considered bad style in Python, although that rule is sometimes bent, e.g. when creating a key function that's used as an arg to sort or sorted and then immediately re-used as an arg to itertools.groupby. Apart from brevity, lambdas have no advantage over full function definitions, but they have several disadvantages. So you should only use them when a simple anonymous function is appropriate.
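To illustrate the points raised in the comments, here is a minimal sketch using def. A conditional expression in Python needs both branches (value_if_true if condition else value_if_false), which is why the "if" without an "else" inside the lambda is rejected. A named function can take the line as an argument and return None explicitly when the line does not match. The names extract_field, build_row and KEYS are hypothetical, and the sketch compares the whole key before the "=" rather than using startswith, so that a key which is merely a prefix of a longer key would not match by accident.

```python
def extract_field(line, key):
    """Return the value of `key` from a '!key = value' line, or None.

    Runs of whitespace in the value are replaced by underscores.
    """
    name, sep, value = line.partition("=")  # split on the first "=" only
    if not sep or name.strip() != key:
        return None  # explicit result for lines that do not match
    return "_".join(value.split())


# Hypothetical list of the fields of interest, in output order.
KEYS = ["!dataset_description", "!dataset_type",
        "!dataset_pubmed_id", "!dataset_platform"]


def build_row(lines, keys=KEYS):
    """Collect the first value found for each key, in the order of `keys`.

    `lines` must be a list (it is scanned once per key); a missing key
    yields an empty string.
    """
    row = []
    for key in keys:
        values = (extract_field(line, key) for line in lines)
        row.append(next((v for v in values if v is not None), ""))
    return row
```

Calling build_row on the header lines of one file gives one spreadsheet row. Punctuation in the values is left untouched, so the description in the example file keeps its trailing period.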