Converting a text file into a data frame in R

Question

In R, I have a .txt file that I would like to extract data from as character strings. The file is formatted as a numbered list: 1. [text1] 2. [text2] 3. [text3], and so on up to 400. I want each entry extracted as a separate observation, so that the first observation is [text1], the second is [text2], and so on. How would I do this? An actual example from the file, followed by the code I have so far, is shown below.

1. 
Readability and Quality of Online Information on Sickle Cell Retinopathy for 
Patients [embase.com]
Gbedemah Z.E.E., Fuseini M.-S.N., Fordjuor S.K.E.J., Baisie-Nkrumah E.J., Beecham R.-
M.E.M., Amissah-Arthur K.N. 
[In Process] Am. J. Ophthalmol. 2024 259: (45-52) 
Embase, MEDLINE
Go to publisher for the full text [dx.doi.org]
Embase Open URL: redirect to full text [embase.com]
Abstract
PURPOSE: This study aims to evaluate the readability and quality of Internet-based health 
information on sickle cell retinopathy. DESIGN: Retrospective cross-sectional website analysis. 
METHODS: To simulate a patient's online search, the terms “sickle cell retinopathy” and “sickle 
cell disease in the eye” were entered into the top 3 search engines (Google, Bing and Yahoo). The 
first 20 results of each search were retrieved and screened for analysis. The DISCERN 
questionnaire, the Journal of the American Medical Association (JAMA) standards, and the Health 
on the Net (HON) criteria were used to evaluate the quality of the information. The Flesch–Kincaid 
Grade Level (FKGL), the Flesch Reading Ease (FRES), and the Automated Readability Index 
(ARI) were used to assess the readability of each website. RESULTS: Of 16 online sources, 12 
(75%) scored moderately on the DISCERN tool. The mean DISCERN score was 40.91 (SD, 
10.39; maximum possible, 80). None of the sites met all of the JAMA benchmarks, and only 3 
(18.75%) of the websites had HONcode certification. All of the websites had scores above the 
target American Medical Association grade level of 6 on both the FKGL and ARI. The mean FRES 
was 57.76 (±4.61), below the recommended FRES of 80 to 90. CONCLUSION: There is limited 
online information available on sickle cell retinopathy. Most included websites were fairly difficult to 
read and of substandard quality. The quality and readability of Internet-based, patient-focused 
information on sickle cell retinopathy needs to be improved.


 
2. 
Multicomponent Strategy Improves Human Papillomavirus Vaccination Rates 
Among Adolescents with Sickle Cell Disease [embase.com]
Aurora T., Cole A., Rai P., Lavoie P., McIvor C., Klesges L.M., Kang G., Liyanage J.S.S., 
Brandt H.M., Hankins J.S. 
J. Pediatr. 2024 265: 
Embase, MEDLINE, NURSING
Go to publisher for the full text [dx.doi.org]
Embase Open URL: redirect to full text [embase.com]
Abstract
Objective: To evaluate the effectiveness of a vaccine strategy bundle to increase human 
papillomavirus (HPV) vaccine initiation and completion in a specialty clinic setting. Study design: 
Our Hematology clinic utilized an implementation framework from October 1, 2018, to December 
31, 2019, involving nurses, nursing coordinators, and clinicians in administering the HPV 
vaccination series to our adolescent sickle cell sample of nearly 500 patients. The bundle included 
education for staff on the need for HPV vaccine administration, provider incentives, vaccines 
offered to patients in SCD clinics, and verification of patients' charts of vaccine completion. 
Results: Following the implementation of the bundle, the cumulative incidence of HPV vaccination 
initiation and completion improved from 28% to 46% and 7% to 49%, respectively. Both rates 
remained higher postimplementation as well. HPV vaccination series completion was associated 
with a decreased distance to the health care facility, lower state deprivation rank, and increased 
hospitalizations. Conclusion: Our clinic's implementation strategy successfully improved vaccine 
completion rates among adolescents with sickle cell disease (SCD) while continuing to educate 
staff, patients, and families on the importance of cancer prevention among people living with SCD.

# Sample character string
cleaned_text <- "1. Readability and Quality of Online Information on Sickle Cell Retinopathy for Patients [embase.com] Gbedemah 2. Another text here 3. Yet another text 4. More text 5. Even more text"

# Split the text based on the numbering pattern
text_split <- strsplit(cleaned_text, "\\d+\\.\\s+")[[1]]

# Remove the first empty element
text_split <- text_split[-1]

# Convert the result into a data frame
df <- data.frame(Text = text_split)

shaun_m · Accepted Answer · 2024-05-13 20:00:39Z

Your code is almost complete. The only issue I saw was that the regex was finding too many matches: for example, the "90." at the end of a sentence was being matched as an observation delimiter. Restricting the regex to numbers that sit on a line by themselves seems to give the right result.
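To make the over-matching concrete, here is a small sketch comparing the two patterns on a made-up two-record string. The `sample_text` value is a hypothetical stand-in for the real file, not its actual contents.

```r
# Hypothetical miniature of the file: two records, the first with "90." mid-abstract
sample_text <- paste0(
  "1. \nTitle A\nbelow the recommended FRES of 80 to 90. CONCLUSION: limited.\n",
  "2. \nTitle B\nAbstract text.\n"
)

# Pattern from the question: digits + period + whitespace, anywhere in the text
loose <- strsplit(sample_text, "\\d+\\.\\s+")[[1]][-1]

# Anchored pattern: the number must sit on a line by itself
strict <- strsplit(sample_text, "(^|\\n)\\d+\\.\\s*(\\n|$)")[[1]][-1]

length(loose)   # 3: the first record is cut in two at "90. "
length(strict)  # 2: one piece per numbered record
```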

I saved your sample text in ~/Rwork/sample.txt and the code is as follows:

# Read the file into a single character string
cleaned_text <- readr::read_file("~/Rwork/sample.txt")

# Split the text based on the numbering pattern
# The numbering must be on a line by itself (i.e. starts with ^ or \\n and 
# ends with \\n or $)
text_split <- strsplit(cleaned_text, "(^|\\n)\\d+\\.\\s*(\\n|$)")[[1]]

# Remove the first empty element
text_split <- text_split[-1]

# Convert the result into a data frame
df <- data.frame(Text = text_split)

Created on 2024-05-13 with reprex v2.1.0
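If the pieces need tidying before analysis, one possible follow-up is sketched below. It is an assumption about what might be wanted rather than part of the original answer: `records_to_df` and the `id` column are hypothetical names, and `id` is simply the position of each piece, not the number parsed from the file.

```r
# Hypothetical helper: split numbered records and tidy them into a data frame
records_to_df <- function(raw_text) {
  # Normalise Windows line endings so only "\n" has to be handled
  raw_text <- gsub("\r\n", "\n", raw_text, fixed = TRUE)

  # Same delimiter as above: a number and a period alone on a line
  pieces <- strsplit(raw_text, "(^|\\n)\\d+\\.\\s*(\\n|$)")[[1]]

  # Drop whatever precedes the first number (normally an empty string)
  pieces <- pieces[-1]

  # Trim surrounding whitespace, then join wrapped lines with single spaces
  pieces <- trimws(pieces)
  pieces <- gsub("\\s*\\n\\s*", " ", pieces)

  data.frame(id = seq_along(pieces), Text = pieces, stringsAsFactors = FALSE)
}

# Made-up input with the same layout as the sample file
demo <- "1. \nFirst title\nwrapped line\n\n \n2. \nSecond title\n"
df <- records_to_df(demo)
df$Text
```

Joining wrapped lines with a space will leave a stray space wherever the source hyphenated a word or name across a line break, so that step may need adjusting for the real data.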
