Ruby Robots

Post on 26-May-2015

Category:

Technology


DESCRIPTION

A talk about creating web robots in the Ruby programming language, using rest-client, nokogiri, and mechanize, given at the rsonrails event.

TRANSCRIPT

Ruby Robots

http://www.flickr.com/photos/flysi/183272970

Daniel Cukier — @danicuki

Relatives

• spiders

• crawlers

• bots

Why robot?

require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

http://www.cantora.mus.br/
http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en

XPath

<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>

html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> legal

XPath

<table id="super">
  <tr>
    <td>L1C1</td>
    <td>L1C2</td>
  </tr>
  <tr>
    <td>L2C1</td>
    <td>L2C2</td>
  </tr>
  <tr>
    <td>L3C1</td>
    <td>L3C2</td>
  </tr>
</table>

>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3

>> info[0].xpath("td").size
=> 2

>> info[2].xpath("td")[1].text
=> "L3C2"
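For readers without the nokogiri gem, the same XPath queries can be reproduced with REXML, which ships with Ruby. A minimal sketch; the HTML string below just mirrors the table from the slide:

```ruby
require 'rexml/document'

# A small well-formed fragment mirroring the table on the slide.
html = <<~HTML
  <table id="super">
    <tr><td>L1C1</td><td>L1C2</td></tr>
    <tr><td>L2C1</td><td>L2C2</td></tr>
    <tr><td>L3C1</td><td>L3C2</td></tr>
  </table>
HTML

doc  = REXML::Document.new(html)
rows = REXML::XPath.match(doc, "//table[@id='super']/tr")

puts rows.size                                  # number of <tr> rows: 3
puts REXML::XPath.match(rows[2], "td")[1].text  # last row, second cell: L3C2
```

REXML only handles well-formed XML; nokogiri remains the better choice for real-world, tag-soup HTML.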

rest-client

http://www.flickr.com/photos/amortize/766738216

GET

Good bot

/robots.txt

User-agent: *

Disallow:
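A permissive robots.txt like the one above (an empty Disallow) allows everything. As a sketch of the check a good bot makes, here is a hand-rolled parser covering only the User-agent/Disallow directives shown on the slide; real robots.txt files carry more directives, so production code should use a proper parser:

```ruby
# Minimal robots.txt check: true if `path` is allowed for the "*" agent.
# Only handles the User-agent/Disallow directives from the slide.
def allowed?(robots_txt, path)
  disallows = []
  active = false
  robots_txt.each_line do |line|
    key, value = line.split(":", 2).map { |s| s.to_s.strip }
    case key.to_s.downcase
    when "user-agent" then active = (value == "*")
    when "disallow"   then disallows << value if active && !value.empty?
    end
  end
  disallows.none? { |prefix| path.start_with?(prefix) }
end

# An empty Disallow means everything is allowed.
puts allowed?("User-agent: *\nDisallow:\n", "/fotos")          # true

# A Disallow prefix blocks everything underneath it.
puts allowed?("User-agent: *\nDisallow: /private\n", "/private/x")  # false
```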

http://www.flickr.com/photos/temily/5645585162


maxRowsList=16

WTF?

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16

AHA!!!

http://.../artistas?maxRowsList=1600&filter=Recentes

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes

>> content.size
=> 9154

Bingo!!!

>> b["Content"].map {|c| c["ProfileUrl"]}
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo", "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia", "tiagorosa", "outprofile", "lucianokoscky", "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial", "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette", "alinedelima", "thelio", "grupodomdesamba", "ladoz", "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa", "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys", "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao", "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock", "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul", "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo", "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior", "mariafreitas", "jessejames", "vagnerrockxe", "stageo3", "lemoneight", "innocence", "dinda", "marcelocapela", "paulocamoeseoslusiadas", "magnussrock", "bandatheburk", "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga", "thiagoponde", "centeio", "grupodeubranco", "bocadeleao", "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod", "dreemonphc", "chicobrant", "osz", "bandalightspeed", "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]

email? phone?

>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "robo-macaco@hotmail.comRIO DE JANEIRO - RJ - Brasil21 9675-0199"

cookies

cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
cook = c.split(";").map {|i| i.strip.split("=")}
cook.each {|u| cookies[u[0]] = u[1]}

RestClient.get(url, :cookies => cookies)

Proxies

>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row

>> lines[1].text=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

Text

IP WTF?

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

JAVASCRIPT = RUBY

http://www.flickr.com/photos/drics/4266471776/

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8

>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55document.write(\":\"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"

>> lines[1].text=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

Voilà

RestClient.proxy = "http://#{server}:#{port}"

>> server = lines[1].text.split[0]
=> "208.52.144.55"

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.forms.first
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "danicuki@gmail.com"
page = agent.submit(form)
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end

mechanize

protection techniques

javascript

text as image

captcha

don’t be naive

captcha

YES you can!

prove you are not a robot

3 steps

1. download the image
2. filter the image
3. run OCR software

Good Luck!

scaling

http://www.flickr.com/photos/liquene/3330714590

clouds

$ knife ec2 server create

threads+

queues
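The threads + queues idea can be sketched with Ruby's built-in Thread and Queue. The URLs and worker count below are made up, and the fetch step is stubbed with a lambda so the sketch runs offline; in the talk's setting it would be a RestClient.get call:

```ruby
require 'thread'  # Queue is in core on modern Rubies; the require is harmless

urls  = (1..10).map { |i| "http://example.com/page#{i}" }  # hypothetical URLs
queue = Queue.new
urls.each { |u| queue << u }

results = Queue.new  # Queue is thread-safe, so workers can push freely

# A stub standing in for RestClient.get, so the sketch runs offline.
fetch = ->(url) { "<html>#{url}</html>" }

workers = 4.times.map do
  Thread.new do
    # pop(true) raises ThreadError when the queue is empty: time to stop.
    while (url = queue.pop(true) rescue nil)
      results << [url, fetch.call(url)]
    end
  end
end

workers.each(&:join)
puts results.size  # all pages fetched, in whatever order the threads finished
```

With a real fetch, each worker spends most of its time blocked on network I/O, which is exactly where Ruby threads pay off despite the GVL.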

In this crazy programmer’s life
All kinds of situations come my way
Suddenly a client, a brutal proposal
To grab information from a site
“You’re crazy, I don’t do that kind of crime
If you want, I have some friends down south”
“Do it for me and I’ll pay you with this cool jewel”

I’ll give you a ruby
For you to steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot

http://www.flickr.com/photos/jobafunky/5572503988

Thank you

Daniel Cukier — @danicuki
