4

I want to append strings to a list only when they meet a condition. For example, strings starting with http should not be appended, but every other string that has a length of 40 should be.

    words = []
    store1 = []
    disregard = ["http","gen"]

    for all in glob.glob(r'MYDIR'):
        with open(all, "r",encoding="utf-16") as f:
            text = f.read()
        lines = text.split("\n")

        for each in lines:
            words += each.split()
        for each in words:
            if len(each) == 40 and each not in disregard:
                store1.append(each)

Update:

if disregard[0] not in each: 

works, but how can I compare each against every entry in my list? Using disregard on its own doesn't work. Here is my input text file:

http://1234ashajkhdajkhdajkhdjkaaaaaaad1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
genp://1234ashajkhdajkhdajkhdjkaaaaaaad1
a\a

The only thing that will append will be "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

  • I didn't find anything wrong in your code. Try changing each not in disregard to all([word not in each for word in disregard]). I think that when you split the text into words, "http" does not stand on its own but appears inside something like "http://blablabla.com", because there is no space after it, so each not in disregard returns True. Commented May 23, 2018 at 3:44
  • I get TypeError: 'str' object is not callable when I try that replacement. Commented May 23, 2018 at 4:36
  • 1
    Ah... that's because you are using all as a variable name in for all in glob.glob(r'MYDIR'). Better rename it, because all is a built-in Python function. Commented May 23, 2018 at 4:39
  • You could add an example of data, the output you currently get from it, and the output you want from it. This would make your question clearer. Commented May 23, 2018 at 4:57
  • Updated. Commented May 23, 2018 at 5:05
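
The two points raised in these comments can be reproduced in a few lines. This is a minimal sketch, not the asker's code: it uses the sample lines from the question as an in-memory list, and the names `lines`, `kept` and `shadowed_check` are made up for illustration.

```python
disregard = ["http", "gen"]
lines = [
    "http://1234ashajkhdajkhdajkhdjkaaaaaaad1",
    "a" * 40,
    "genp://1234ashajkhdajkhdajkhdjkaaaaaaad1",
    "a\\a",
]

# Substring check against every entry in disregard, using the built-in all().
kept = [s for s in lines if len(s) == 40 and all(d not in s for d in disregard)]
print(kept)


def shadowed_check():
    """Rebind the name all to a string, as `for all in glob.glob(...)` does."""
    all = "some_file.txt"  # hypothetical file name; shadows the built-in
    try:
        all(d not in "x" for d in disregard)
    except TypeError as exc:
        return str(exc)


print(shadowed_check())
```

Renaming the loop variable (for example to fname) leaves the built-in all() available again.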

2 Answers

1

I think the answer depends on how many words you want to disregard. It's also important to define what a word means: if a word ends with spaces, should they be stripped? One solution is to build a regular expression from all your words and match each line against it.

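import glob
import re

disregard = ["http","gen"]
# One alternation of the escaped prefixes, here "http|gen"
pattern = "|".join([re.escape(w) for w in disregard])
# fname rather than all, so the built-in all() is not shadowed
for fname in glob.glob(r'MYDIR/*'):
    with open(fname, "r", encoding="utf-16") as f:
        matched_words = []
        for line in f:
            line = line.rstrip("\n")
            # re.match only matches at the start of the line
            if len(line) == 40 and not re.match(pattern, line):
                matched_words.append(line)

    print(matched_words)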

1 Comment

This solution has an edge case where the last line in the file (which won't end with \n) could be 41 characters and erroneously pass these conditions. Or it could be 40 characters and fail.
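
To make the trailing-newline question concrete, here is a small sketch that uses in-memory strings rather than files (the sample values are made up for illustration). It relies only on the documented behaviour of str.rstrip and str.splitlines: rstrip("\n") removes trailing newline characters when they are present and otherwise returns the string unchanged, whereas unconditionally slicing off the last character would shorten a final line that has no newline.

```python
forty = "a" * 40

# A line read from the middle of a file ends with "\n"; the last one may not.
with_newline = forty + "\n"
without_newline = forty

# rstrip("\n") strips the newline when present and is a no-op otherwise.
stripped_lengths = [len(with_newline.rstrip("\n")), len(without_newline.rstrip("\n"))]

# Unconditionally dropping the last character truncates the final line.
sliced_length = len(without_newline[:-1])

# splitlines() never includes the line terminator in the returned lines.
text = forty + "\n" + forty
split_lengths = [len(line) for line in text.splitlines()]

print(stripped_lengths, sliced_length, split_lengths)
```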
0

The basic structure looks OK; the problem is in the conditionals. You say you want to check whether each line starts with one of the supplied strings, but you then split each line and check whether the resulting word is one of those strings. Use .startswith() instead. That way there doesn't have to be a space after "http" for the string to be caught.

Also, either the conditional testing should be placed after the loop that builds the words list, or else the words list should be reset at the start of each loop so you're not re-testing words you've already checked.

# adjusted some variable names for clarity
import glob

words = []
output = []
disregard = ["http","gen"]

for fname in glob.glob(r'MYDIR'):
    with open(fname, "r", encoding="utf-16") as f:
        text = f.read()
    lines = text.split("\n")

    for line in lines:
        words += line.split()

for word in words:
    # keep 40-character words that start with none of the disregarded prefixes
    if len(word) == 40 and not any(word.startswith(dis) for dis in disregard):
        output.append(word)

2 Comments

What if I want to store all the strings that start with an entry in disregard?
Just use any(...) instead of not any(...).
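
Both variants can be sketched side by side. The example below is an illustration rather than the answerer's code: it uses the sample lines from the question as an in-memory list, stores the prefixes in a tuple because str.startswith accepts a tuple of prefixes directly, and the names `kept` and `flagged` are made up here.

```python
disregard = ("http", "gen")  # a tuple, so it can be passed to str.startswith

words = [
    "http://1234ashajkhdajkhdajkhdjkaaaaaaad1",
    "a" * 40,
    "genp://1234ashajkhdajkhdajkhdjkaaaaaaad1",
    "a\\a",
]

# 40-character words that start with none of the prefixes.
kept = [w for w in words if len(w) == 40 and not w.startswith(disregard)]

# The inverse condition: words that start with any of the prefixes.
flagged = [w for w in words if w.startswith(disregard)]

print(kept)
print(flagged)
```

Passing the tuple is equivalent to any(w.startswith(dis) for dis in disregard), just shorter.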
