Developing a web crawler/scraper with Python
GeekNight
Web Crawler
● Also called a spider or robot.
● Starts with a list of URLs to visit. For each URL visited, it identifies the hyperlinks and stores them to visit later, and also copies the page content.
● Googlebot, Yahoo Slurp, DuckDuckBot...
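The visit-and-queue loop described above can be sketched with the standard library alone. This is a minimal illustration, not how any production crawler works: the `crawl` function, the `LinkExtractor` class and the in-memory `FAKE_SITE` are hypothetical names made up for the example, and `fetch` is any callable that returns a page's HTML (or `None`).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for each visited page."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid revisits
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # page could not be fetched
            continue
        pages[url] = html         # keep a copy of the content
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page site held in memory, so no network is needed.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": "<p>no links here</p>",
}

pages = crawl(["http://example.com/"], FAKE_SITE.get)
print(sorted(pages))
```

A real crawler would replace `FAKE_SITE.get` with an HTTP fetch and would also need politeness rules (robots.txt, rate limits), which this sketch leaves out.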
Web Scraper
● Extracts information from a website.
● Related to web indexing.
● Data transformation.
What a crawler/scraper does
● Opens a link
● Copies and/or manipulates the data
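The two steps can be sketched as follows, using only the standard library. The function names `open_link` and `extract_title` are hypothetical, and a `data:` URL stands in for a real address so the example runs without network access. The regular expression is good enough for a sketch; real pages are better handled by an HTML parser.

```python
import re
from urllib.parse import quote
from urllib.request import urlopen


def open_link(url):
    """Step 1: open the link and return the decoded body."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def extract_title(html):
    """Step 2: pull one piece of data out of the copied page."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# A data: URL stands in for a real address so the sketch runs offline.
sample = "<html><head><title> Geek Night </title></head><body></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample)

html = open_link(url)
print(extract_title(html))
```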
Selectors
● XPath
● CSS selectors
Taken from http://ejohn.org/blog/xpath-css-selectors/
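To make the comparison concrete, here is a small sketch that pairs a few CSS selectors with their XPath counterparts. It uses the standard library's `xml.etree.ElementTree`, which only supports a subset of XPath and has no CSS support, so the CSS form appears in the comments only. The document snippet is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (ElementTree parses XML, not loose HTML).
DOC = """
<html>
  <body>
    <div id="news">
      <p class="headline">First story</p>
      <p class="headline">Second story</p>
      <p>Plain paragraph</p>
    </div>
    <div id="footer"><p>Contact</p></div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# CSS: p                XPath: //p
all_paragraphs = root.findall(".//p")

# CSS: div#news > p     XPath: //div[@id='news']/p
news_paragraphs = root.findall(".//div[@id='news']/p")

# CSS: p.headline       XPath: //p[@class='headline']
# (the XPath form matches the exact attribute value, while the CSS form
# also matches elements that carry additional classes)
headlines = [p.text for p in root.findall(".//p[@class='headline']")]

print(len(all_paragraphs), len(news_paragraphs), headlines)
```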
import requests
● "HTTP for Humans"
● An alternative to the standard library's urllib2
lxml
● A Python binding for the C libraries libxml2 and libxslt, used to parse XML and HTML.
● Supports CSS selectors and XPath.
lxml
BeautifulSoup
PySpider
Scrapy
● A powerful open source framework for crawling and scraping. Python 2 (newer Scrapy releases also support Python 3).
● Supports XPath and CSS selectors.
● Output formats: JSON, CSV, XML, JSON Lines.
● There are examples with database persistence.
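For readers unfamiliar with these output formats, the sketch below writes the same items as JSON, JSON Lines and CSV using only the standard library. It does not use Scrapy's own exporters, XML is left out for brevity, and the item fields are made up for the example.

```python
import csv
import io
import json

# Hypothetical scraped items; field names are illustrative only.
items = [
    {"title": "First story", "url": "http://example.com/1"},
    {"title": "Second story", "url": "http://example.com/2"},
]

# JSON: one array holding every item.
as_json = json.dumps(items)

# JSON Lines: one JSON object per line, easy to append to and stream.
as_jsonl = "\n".join(json.dumps(item) for item in items)

# CSV: a header row followed by one row per item.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(items)
as_csv = buffer.getvalue()

print(as_jsonl)
print(as_csv)
```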
Scrapy
scrapy crawl bbcnews --output results.json
Taken from http://scraping.pro/
Python libraries
● Goose
● PyQuery