Using Lua pattern matching, I would like to parse a string and find URLs in the following forms:

http://www.test.com/
www.test.com/
test.com/
test-test.test.com/

The slashes are optional, but when they are present the pattern must also match nested folders such as:

test.com/test/

That way I can use a single pattern match to find the URL. The problem is that every example I have tried either doesn't work or causes World of Warcraft to hang on the loading screen because of some error I can't resolve on my own.

I no longer have the pattern I used in my code, so I am looking for one that works and does not match improper URLs. I will come up with some later if needed.

  • Apparently the answer below works for that too. It has to match all of the above and any other form of valid URL. I can parse the #:#:#:# ones myself because they follow a simple formula, but I was having issues with the URLs themselves. Commented May 11, 2014 at 15:20

1 Answer

-- The character classes in the patterns below contain the characters that
-- RFC 3986 allows inside a URL, but without comma, semicolon, apostrophe,
-- brackets and parentheses, because those are frequently used as URL
-- separators. The equals sign is accepted only in the path part.
local text_with_URLs = [[
   <a href="http://www.lua.org:80/manual/5.2/contents.html">L.ua 5.2</a>
   [url=127.0.0.1:8080]forum link[/url]
   intranet links: http://test, http://retracker.local/announce
   [markdown link](https://74.125.143.101/search?q=Who+are+the+Lua+People%3F)
   long subdomain chain: very.long.name.of.my.site.co.uk
   auth link: ftp://user:[email protected]/path - not recognized yet :(
]]

local domains = [[.ac.ad.ae.aero.af.ag.ai.al.am.an.ao.aq.ar.arpa.as.asia.at.au
   .aw.ax.az.ba.bb.bd.be.bf.bg.bh.bi.biz.bj.bm.bn.bo.br.bs.bt.bv.bw.by.bz.ca
   .cat.cc.cd.cf.cg.ch.ci.ck.cl.cm.cn.co.com.coop.cr.cs.cu.cv.cx.cy.cz.dd.de
   .dj.dk.dm.do.dz.ec.edu.ee.eg.eh.er.es.et.eu.fi.firm.fj.fk.fm.fo.fr.fx.ga
   .gb.gd.ge.gf.gh.gi.gl.gm.gn.gov.gp.gq.gr.gs.gt.gu.gw.gy.hk.hm.hn.hr.ht.hu
   .id.ie.il.im.in.info.int.io.iq.ir.is.it.je.jm.jo.jobs.jp.ke.kg.kh.ki.km.kn
   .kp.kr.kw.ky.kz.la.lb.lc.li.lk.lr.ls.lt.lu.lv.ly.ma.mc.md.me.mg.mh.mil.mk
   .ml.mm.mn.mo.mobi.mp.mq.mr.ms.mt.mu.museum.mv.mw.mx.my.mz.na.name.nato.nc
   .ne.net.nf.ng.ni.nl.no.nom.np.nr.nt.nu.nz.om.org.pa.pe.pf.pg.ph.pk.pl.pm
   .pn.post.pr.pro.ps.pt.pw.py.qa.re.ro.ru.rw.sa.sb.sc.sd.se.sg.sh.si.sj.sk
   .sl.sm.sn.so.sr.ss.st.store.su.sv.sy.sz.tc.td.tel.tf.tg.th.tj.tk.tl.tm.tn
   .to.tp.tr.travel.tt.tv.tw.tz.ua.ug.uk.um.us.uy.va.vc.ve.vg.vi.vn.vu.web.wf
   .ws.xxx.ye.yt.yu.za.zm.zr.zw]]
local tlds = {}
for tld in domains:gmatch'%w+' do
   tlds[tld] = true
end
local function max4(a,b,c,d) return math.max(a+0, b+0, c+0, d+0) end
local protocols = {[''] = 0, ['http://'] = 0, ['https://'] = 0, ['ftp://'] = 0}
local finished = {}

for pos_start, url, prot, subd, tld, colon, port, slash, path in
   text_with_URLs:gmatch'()(([%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w+)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))'
do
   if protocols[prot:lower()] == (1 - #slash) * #path and not subd:find'%W%W'
      and (colon == '' or port ~= '' and port + 0 < 65536)
      and (tlds[tld:lower()] or tld:find'^%d+$' and subd:find'^%d+%.%d+%.%d+%.$'
      and max4(tld, subd:match'^(%d+)%.(%d+)%.(%d+)%.$') < 256)
   then
      finished[pos_start] = true
      print(pos_start, url)
   end
end

for pos_start, url, prot, dom, colon, port, slash, path in
   text_with_URLs:gmatch'()((%f[%w]%a+://)(%w[-.%w]*)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))'
do
   if not finished[pos_start] and not (dom..'.'):find'%W%W'
      and protocols[prot:lower()] == (1 - #slash) * #path
      and (colon == '' or port ~= '' and port + 0 < 65536)
   then
      print(pos_start, url)
   end
end

Output:

13    http://www.lua.org:80/manual/5.2/contents.html
61    L.ua
82    127.0.0.1:8080
197   https://74.125.143.101/search?q=Who+are+the+Lua+People%3F
281   very.long.name.of.my.site.co.uk
133   http://test
146   http://retracker.local/announce

The second line was printed because "L.ua" looks like some Ukrainian site (.ua is in the list of top-level domains) :-)
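
For anyone who wants to adapt the patterns, here is a minimal sketch of the building blocks they rely on, run on a tiny sample string. The names `sample`, `port_ok` and `prefix_and_path_ok` are illustrative and not part of the code above, and `tonumber(port)` stands in for the `port + 0` coercion.

```lua
local sample = "see www.lua.org:80/manual or ask at test.com"

-- "()" is a position capture: it yields the index where the match starts.
-- "(:?)" and "(%d*)" always capture a string, which may be empty.
local pos, host, colon, port = sample:match("()(%w[-.%w]*%.%w+)(:?)(%d*)")
print(pos, host, colon, port)   --> 5  www.lua.org  :  80

-- Port rule from the answer: no colon at all, or a colon followed by a
-- number below 65536.
local port_ok = colon == "" or (port ~= "" and tonumber(port) < 65536)

-- protocols[prefix] is 0 for an accepted prefix and nil for anything else.
-- (1 - #slash) * #path is 0 only if the path is empty or follows a slash,
-- so a single comparison checks both conditions.
local protocols = {[""] = 0, ["http://"] = 0, ["https://"] = 0, ["ftp://"] = 0}
local function prefix_and_path_ok(prot, slash, path)
   return protocols[prot:lower()] == (1 - #slash) * #path
end

print(prefix_and_path_ok("http://", "/", "manual"))  --> true
print(prefix_and_path_ok("", "", "#frag"))           --> false (path without slash)
print(prefix_and_path_ok("see ", "/", "manual"))     --> false (unknown prefix)

-- find("%W%W") detects two adjacent non-alphanumeric characters, which
-- rejects host names such as "a..b." or "a-.b."
print(("www.lua."):find("%W%W") == nil)  --> true
print(("a..b."):find("%W%W") == nil)     --> false
```

The lazy prefix capture in the first loop works the same way as `(:?)`: it may match nothing, and the `protocols` lookup then decides whether whatever it did match is an acceptable prefix.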

6 Comments

That's awesome. Now how would I also grab the location, or swap the caught text while adding stuff to its beginning and end? Basically I need the start point and the text. I will then use a swap to add custom text that hyperlinks it inside World of Warcraft.
Also, could you help me understand how this code works and explain it to me? I would like to know how to come up with something like this on my own in case I want to handle a different format later.
Answer updated. The position of the found text is now displayed.
Thanks, man. I sure do appreciate it. You should have seen my code without a method like this, lol. I will be listing you in the code and the addon for credit for this. Thanks again.
I see you updated it again... you're too much, man... I should have come here first, lol. I wish I could upvote it, but I don't have the rep.
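
On the follow-up about adding text before and after each caught URL: one possible approach, sketched here under assumptions, is to collect the printed position and text of every match into a table and then splice the wrapper text in. The helper `wrap_matches`, the `{pos, url}` table layout and the `[` / `]` wrappers are hypothetical and not part of the answer. Sorting is included because the two loops report positions out of order.

```lua
-- Hypothetical helper: wraps each match in `before` .. url .. `after`.
-- matches: array of {pos = start index, url = matched text}, in any order.
-- Note that table.sort reorders the caller's table in place.
local function wrap_matches(text, matches, before, after)
   table.sort(matches, function(a, b) return a.pos < b.pos end)
   local out, last = {}, 1
   for _, m in ipairs(matches) do
      if m.pos >= last then                         -- skip overlapping matches
         out[#out + 1] = text:sub(last, m.pos - 1)  -- text between matches
         out[#out + 1] = before .. m.url .. after   -- wrapped URL
         last = m.pos + #m.url
      end
   end
   out[#out + 1] = text:sub(last)                   -- remaining tail
   return table.concat(out)
end

local text = "docs at http://www.lua.org:80/manual and test.com/test/ too"
local matches = {
   {pos = 42, url = "test.com/test/"},
   {pos = 9,  url = "http://www.lua.org:80/manual"},
}
print(wrap_matches(text, matches, "[", "]"))
--> docs at [http://www.lua.org:80/manual] and [test.com/test/] too
```

In the loops above, the pairs could be collected with something like `matches[#matches + 1] = {pos = pos_start, url = url}` in place of the `print` calls.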