
I want to parse a SQL file and print only the create table statements.

Example SQL file:

--
-- Name: film_actor; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.film_actor (
    actor_id smallint NOT NULL,
    film_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);


ALTER TABLE public.film_actor OWNER TO postgres;

--
-- Name: film_category; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.film_category (
    film_id smallint NOT NULL,
    category_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);


ALTER TABLE public.film_category OWNER TO postgres;

Here, I just want to extract the complete CREATE TABLE statement for the first table, print it, and then move on to the next table.

I tried the DDLparse and SQLparse tools, but neither parsed the complete SQL file the way I needed. Basically, once I can grep out the CREATE TABLE statements, I can use SQLparse to do the rest.

Could someone help me with this?

2 Answers


I'm not sure about parsers or parsing tools, but you could work around it with a regex. What I did is basically take all the text between "CREATE" and ";", add each match to a list, and then manually add "CREATE" and ";" back to complete the SQL queries.

Take a look at this:

import re

Test = """
--
-- Name: film_actor; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.film_actor (
    actor_id smallint NOT NULL,
    film_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);


ALTER TABLE public.film_actor OWNER TO postgres;

--
-- Name: film_category; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.film_category (
    film_id smallint NOT NULL,
    category_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);


ALTER TABLE public.film_category OWNER TO postgres;"""

# Non-greedy match: capture everything between "CREATE" and the first ";".
# This assumes no ";" appears inside a statement body (e.g. in a comment).
results = re.findall(r'CREATE(.*?);', Test, re.DOTALL)

newresults = []

# The capture group drops the "CREATE" keyword and the trailing ";", so add
# them back to rebuild complete statements. The captured text already starts
# with a space, hence "CREATE" without a trailing space.
for x in results:
    newresults.append("CREATE" + x + ";")

for y in newresults:
    print(y)

2 Comments

If I have a large SQL file and I use readlines, will it eat more memory?
It depends on how big your file is and how much memory your machine has (also what other programs are using your memory). You could also process each result by itself inside the loop, without appending it to the newresults list, to avoid using more memory; see the sketch below.
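A minimal streaming sketch along those lines (iter_create_tables and dump.sql are made-up names, not from the answer above): read the file line by line, keep only the current statement in memory, and yield it once its closing ";" is seen. Like the regex, it assumes no ";" appears inside a statement body.

import re

def iter_create_tables(path):
    """Yield complete CREATE TABLE statements without loading the whole file."""
    buffer = []
    collecting = False
    with open(path) as f:
        for line in f:
            # Start buffering when a CREATE TABLE line appears.
            if not collecting and re.match(r'\s*CREATE\s+TABLE\b', line, re.IGNORECASE):
                collecting = True
            if collecting:
                buffer.append(line)
                # The first ";" is taken as the end of the statement.
                if ';' in line:
                    yield ''.join(buffer)
                    buffer = []
                    collecting = False

for statement in iter_create_tables('dump.sql'):
    print(statement)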

You can use a library like sqlparse:

import sqlparse

# Split the file into individual statements, then keep only CREATE TABLE.
with open('test.sql') as f:  # "f" avoids shadowing the built-in input()
    statements = sqlparse.split(f.read())

for statement in statements:
    if 'create table' in statement.lower():
        print(sqlparse.format(statement, strip_comments=True))

2 Comments

If I have 5 GB of this SQL file, will it read and hold it all in memory, or read it line by line and process on the fly?
It will read it all into memory and then process from there. The questions you ask imply you are working on a DB migration and you do not want to migrate data, just the schema. Is that right? In that case you might want to rethink how you approach this migration. Working with raw SQL is often asking for security trouble. I'd recommend migrating to an established ORM like SQLAlchemy. You can generate a declarative schema from the existing DB using a tool like sqlacodegen and then manage migrations with alembic.
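If memory is the main concern, sqlparse also provides parsestream(), which yields parsed statements one at a time from a file object instead of splitting the whole text up front. A short sketch (dump.sql is a placeholder file name, not from the answer above):

import sqlparse

with open('dump.sql') as f:
    for stmt in sqlparse.parsestream(f):
        # get_type() returns 'CREATE' for every CREATE statement
        # (tables, indexes, ...), so additionally check for TABLE.
        if stmt.get_type() == 'CREATE' and 'table' in str(stmt).lower():
            print(sqlparse.format(str(stmt), strip_comments=True))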
