Ruby Robots

Post on 26-May-2015

Category:

Technology


DESCRIPTION

A talk about creating web robots in the Ruby programming language, using rest-client, nokogiri, and mechanize, given at the rsonrails event.

TRANSCRIPT

Ruby Robots

http://www.flickr.com/photos/flysi/183272970

Daniel Cukier — @danicuki

Relatives

• spiders

• crawlers

• bots

Why robot?

require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

http://www.cantora.mus.br/
http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en

XPath

<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>

html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> legal

XPath

<table id="super">
  <tr>
    <td>L1C1</td>
    <td>L1C2</td>
  </tr>
  <tr>
    <td>L2C1</td>
    <td>L2C2</td>
  </tr>
  <tr>
    <td>L3C1</td>
    <td>L3C2</td>
  </tr>
</table>

>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3

>> info[0].xpath("td").size
=> 2

>> info[2].xpath("td")[1].text
=> "L3C2"
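For readers without the nokogiri gem, the same XPath queries can be reproduced with REXML, which ships with Ruby. A minimal sketch; the HTML string below just mirrors the table from the slide:

```ruby
require 'rexml/document'

# A small well-formed fragment mirroring the table on the slide.
html = <<~HTML
  <table id="super">
    <tr><td>L1C1</td><td>L1C2</td></tr>
    <tr><td>L2C1</td><td>L2C2</td></tr>
    <tr><td>L3C1</td><td>L3C2</td></tr>
  </table>
HTML

doc  = REXML::Document.new(html)
rows = REXML::XPath.match(doc, "//table[@id='super']/tr")

puts rows.size                                  # number of <tr> rows: 3
puts REXML::XPath.match(rows[2], "td")[1].text  # last row, second cell: L3C2
```

REXML only handles well-formed XML; nokogiri remains the better choice for real-world, tag-soup HTML.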

rest-client

http://www.flickr.com/photos/amortize/766738216

GET

Good bot

/robots.txt

User-agent: *

Disallow:
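A permissive robots.txt like the one above (an empty Disallow) allows everything. As a sketch of the check a good bot makes, here is a hand-rolled parser covering only the User-agent/Disallow directives shown on the slide; real robots.txt files carry more directives, so production code should use a proper parser:

```ruby
# Minimal robots.txt check: true if `path` is allowed for the "*" agent.
# Only handles the User-agent/Disallow directives from the slide.
def allowed?(robots_txt, path)
  disallows = []
  active = false
  robots_txt.each_line do |line|
    key, value = line.split(":", 2).map { |s| s.to_s.strip }
    case key.to_s.downcase
    when "user-agent" then active = (value == "*")
    when "disallow"   then disallows << value if active && !value.empty?
    end
  end
  disallows.none? { |prefix| path.start_with?(prefix) }
end

# An empty Disallow means everything is allowed.
puts allowed?("User-agent: *\nDisallow:\n", "/fotos")          # true

# A Disallow prefix blocks everything underneath it.
puts allowed?("User-agent: *\nDisallow: /private\n", "/private/x")  # false
```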

http://www.flickr.com/photos/temily/5645585162


maxRowsList=16

WTF?

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16

AHA!!!

http://.../artistas?maxRowsList=1600&filter=Recentes

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes

>> content.size
=> 9154

Bingo!!!

>> b["Content"].map {|c| c["ProfileUrl"]}
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo", "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia", "tiagorosa", "outprofile", "lucianokoscky", "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial", "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette", "alinedelima", "thelio", "grupodomdesamba", "ladoz", "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa", "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys", "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao", "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock", "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul", "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo", "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior", "mariafreitas", "jessejames", "vagnerrockxe", "stageo3", "lemoneight", "innocence", "dinda", "marcelocapela", "paulocamoeseoslusiadas", "magnussrock", "bandatheburk", "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga", "thiagoponde", "centeio", "grupodeubranco", "bocadeleao", "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod", "dreemonphc", "chicobrant", "osz", "bandalightspeed", "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]

email? phone?

>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "robo-macaco@hotmail.comRIO DE JANEIRO - RJ - Brasil21 9675-0199"

cookies

cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
cook = c.split(";").map {|i| i.strip.split("=")}
cook.each {|u| cookies[u[0]] = u[1]}

RestClient.get(url, :cookies => cookies)

Proxies

>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row

>> lines[1].text=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

Text

IP WTF?

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

JAVASCRIPT = RUBY

http://www.flickr.com/photos/drics/4266471776/

<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>

>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8

>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55document.write(\":\"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"

>> lines[1].text=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"

Voilà

RestClient.proxy = "http://#{server}:#{port}"

>> server = lines[1].text.split[0]
=> "208.52.144.55"

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.forms.first
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "danicuki@gmail.com"
page = agent.submit(form)
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end

mechanize

protection techniques

javascript

text as image

captcha

don’t be naive

captcha

YES you can!

prove you are not a robot

3 steps

1. download the image
2. filter the image
3. run OCR software

Good Luck!

scaling

http://www.flickr.com/photos/liquene/3330714590

clouds

$ knife ec2 server create

threads+

queues
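The threads + queues idea can be sketched with Ruby's built-in Thread and Queue. The URLs and worker count below are made up, and the fetch step is stubbed with a lambda so the sketch runs offline; in the talk's setting it would be a RestClient.get call:

```ruby
require 'thread'  # Queue is in core on modern Rubies; the require is harmless

urls  = (1..10).map { |i| "http://example.com/page#{i}" }  # hypothetical URLs
queue = Queue.new
urls.each { |u| queue << u }

results = Queue.new  # Queue is thread-safe, so workers can push freely

# A stub standing in for RestClient.get, so the sketch runs offline.
fetch = ->(url) { "<html>#{url}</html>" }

workers = 4.times.map do
  Thread.new do
    # pop(true) raises ThreadError when the queue is empty: time to stop.
    while (url = queue.pop(true) rescue nil)
      results << [url, fetch.call(url)]
    end
  end
end

workers.each(&:join)
puts results.size  # all pages fetched, in whatever order the threads finished
```

With a real fetch, each worker spends most of its time blocked on network I/O, which is exactly where Ruby threads pay off despite the GVL.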

In this crazy programmer’s life
All kinds of situations come my way
Suddenly a client, a brutal proposal
To grab information from a site
“You’re crazy, I don’t do that kind of crime
If you want, I have some friends down south”
“Do it for me and I’ll pay you with this cool jewel”

I’ll give you a ruby
For you to steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot

http://www.flickr.com/photos/jobafunky/5572503988

Thank you

Daniel Cukier — @danicuki
