I am trying to extract all the references from part of a paper as a list. For now I've just got a paragraph and set it as a string.

Is it possible to do this using regex in Python? I want to extract multiple words from the string, but so far I have only managed to extract years, single words, or characters, not an entire reference at once. There are also quite a few conditions, as the references vary in format, for example:

text="As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003)."

So some have the number within a bracket, some are entirely encompassed by brackets, some have multiple capitalised words, some have "et al" and so on. Is it possible to define all of these requirements within one search, and then print these all out together?

I know there are websites or programs I can put the paper into to extract all the references for me, but I would like to know how to do it myself.

Thanks

NB: Edited to clarify how the references would be embedded in the string

  • Tell us what your expected output is. Commented Oct 14, 2018 at 4:29
  • I have edited the string to better show the expected input. My expected output would just be a list of references (ideally without brackets), so "Macelroy et al. 1967, Podar & Reysenbach 2006, Valdes et al. 2008, Edwards, Bartlett & Stirling 2003" Commented Oct 14, 2018 at 8:29
  • Try this: f = ["".join(result).replace("(","") for result in re.findall("([A-Z])([^A-Z)]+|[^.,]+)([0-9]{4})",t)]. I don't know whether this works for your whole article. Commented Oct 14, 2018 at 9:00
  • Ah brilliant, it worked, thanks kcorlidy Commented Oct 14, 2018 at 10:46

1 Answer

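import re

t = """
As shown by Macelroy et al. (1967), bla bla. Podar
 & Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003).
"""
# Groups: a capital letter, the reference body, a four-digit year.
pattern = r"([A-Z])([^A-Z)]+|[^.,]+)([0-9]{4})"
# Join each match's groups and drop the "(" in front of the year.
f = ["".join(result).replace("(", "") for result in re.findall(pattern, t)]
print(f)
# Caveat: the last reference comes back as 'Bartlett & Stirling 2003'. From "Edwards",
# the first alternative stops at the capital "B" and the second at the comma.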
  1. ([A-Z]) matches a single capital letter, the start of an author name.
  2. [^A-Z)]+|[^.,]+ covers two situations:

    • a run of characters that contains no capital letter and no ")"
    • a run of characters that contains no "," or "."; allowing those could let the match swallow a whole sentence
  3. [0-9]{4} matches the four-digit year that ends the reference.
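
With the sample paragraph, the pattern above loses the first surname of a comma-separated author list ("Edwards,"). A more explicit variant is sketched below. It assumes every author is a single capitalised surname, that lists use ", " and " & ", and that the year has four digits; the names NAME, CITATION and extract_citations are made up for this illustration. It will still produce false positives on text such as "However, Smith (2003)" or "In 2008", so treat it as a starting point, not a complete citation parser.

```python
import re

# Assumption: a surname is one capitalised word (letters, apostrophes, hyphens).
NAME = r"[A-Z][A-Za-z'-]+"

# Hypothetical pattern for author-year citations, written with re.VERBOSE
# so each part can carry its own comment.
CITATION = re.compile(
    rf"""
    (                      # group 1: the author part
      {NAME}               # first surname, e.g. Edwards
      (?:,\s+{NAME})*      # further comma-separated surnames
      (?:\s+&\s+{NAME})?   # optional ampersand plus a final surname
      (?:\s+et\s+al\.)?    # optional et al.
    )
    \s*\(?                 # optional opening bracket before the year
    (\d\d\d\d)             # group 2: four-digit year
    """,
    re.VERBOSE,
)


def extract_citations(text):
    """Return 'Authors Year' strings with whitespace collapsed to single spaces."""
    return [
        " ".join(f"{authors} {year}".split())
        for authors, year in CITATION.findall(text)
    ]


text = (
    "As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) "
    "also researched ... Another example is ... (Valdes et al. 2008). "
    "Most notably .... Edwards, Bartlett & Stirling (2003)."
)
print(extract_citations(text))
# ['Macelroy et al. 1967', 'Podar & Reysenbach 2006', 'Valdes et al. 2008', 'Edwards, Bartlett & Stirling 2003']
```

Spelling out the author list instead of relying on negated character classes keeps the comma-separated surnames together, and collapsing whitespace afterwards means a citation that wraps across a line break still comes out on one line.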
