DevTalks Cluj - Open-Source Technologies for Analyzing Text
![Page 2: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/2.jpg)
✓ Very good hotel!*
✓ Near city centre
“Close to the city center”
✓ Clean rooms
« Chambre impeccable » (French: “Spotless room”)
✓ Popular with solo travelers
“Remote doesnt work”
*) Ramada Cluj (Full summary)
![Page 3: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/3.jpg)
![Page 4: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/4.jpg)
TrustYou Architecture
DB
Crawling
Semantic Analysis
TrustYou Analytics
API
Google, Hotels.com …
200 million reqs/month
❤ Python
![Page 5: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/5.jpg)
Scrapy
● Build your own web crawlers
● Extract data via CSS selectors, XPath, regexes …
● Handles “tag soup”, queuing, request parallelism, cookies, throttling …
● Code sample on GitHub
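To make the extraction and “tag soup” bullets concrete, here is a minimal sketch that uses only Python's standard library rather than Scrapy itself. The class `ReviewTitleParser`, the `review-title` CSS class and the sample markup are hypothetical, chosen purely for illustration; Scrapy's selectors express the same idea far more compactly and add queuing, throttling and parallelism on top.

```python
from html.parser import HTMLParser


class ReviewTitleParser(HTMLParser):
    """Collect the text of every <span class="review-title"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "review-title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped text of all review-title spans in `html`."""
    parser = ReviewTitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# "Tag soup": the <li> and <b> tags are never closed, yet extraction still works.
SAMPLE = (
    '<ul><li><span class="review-title">Very good hotel!</span>'
    '<li><b><span class="review-title">Clean rooms</span></ul>'
)
print(extract_titles(SAMPLE))
```

The event-based parser never requires balanced tags, which is why the unclosed elements in `SAMPLE` do no harm.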
![Page 6: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/6.jpg)
NLP in Python
● NLTK
  ○ Word/sentence tokenization
  ○ POS tagging, parsing
● Great support for scientific computation: NumPy, SciPy, Pandas
● Scikit-learn
● TensorFlow!
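As a rough illustration of what word and sentence tokenization means, the sketch below uses plain regular expressions from the standard library. It is not NLTK's implementation: the function names `naive_sentences` and `naive_words` are made up here, and the rules are deliberately simplistic (for example, abbreviations such as "Dr." would be split incorrectly).

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_words(sentence):
    """Return words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


review = "Very good hotel! Close to the city center."
for sentence in naive_sentences(review):
    print(naive_words(sentence))
```

A trained tokenizer handles the many edge cases that such regexes miss, which is the main reason to reach for a library.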
![Page 7: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/7.jpg)
Gensim: Fun with Word2Vec
>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
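The queries above boil down to comparing word vectors by cosine similarity. The following sketch shows that idea in plain Python; it does not use gensim, and the three-dimensional vectors in `TOY_VECTORS` are hand-picked, hypothetical values rather than anything taken from the trained model. The `doesnt_match` helper follows the common approach of picking the word least similar to the mean of the normalized vectors, which is assumed here to approximate what the library does.

```python
import math

# Hypothetical toy vectors, chosen by hand for illustration only.
TOY_VECTORS = {
    "python": [0.9, 0.8, 0.1],
    "javascript": [0.8, 0.9, 0.2],
    "c++": [0.9, 0.1, 0.7],
    "berlin": [0.1, 0.2, 0.9],
}


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    target = unit(vectors[word])
    scored = [
        (other, dot(target, unit(vec)))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(words, vectors):
    """Return the word least similar to the mean of the normalized vectors."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return min(zip(words, units), key=lambda pair: dot(pair[1], mean))[0]


print(most_similar("python", TOY_VECTORS))
print(doesnt_match(["python", "c++", "javascript"], TOY_VECTORS))
```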
![Page 8: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/8.jpg)
Big Data & Open Source
2004: MapReduce, GFS
BigTable, Spanner, F1 …
Apache Beam …
![Page 9: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/9.jpg)
Spark
● User writes driver program which transparently schedules execution in a cluster
● Faster and more expressive than MapReduce
● Spark SQL: Interactive query of large datasets
● Spark Streaming: Spark is “batch first”, but fast enough to implement stream processing with “mini batches”
● Spark MLlib: Machine learning
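The “mini batches” idea can be sketched without Spark: cut the incoming stream into small chunks and run an ordinary batch job on each chunk. The code below is plain Python, not the Spark Streaming API, and all function names are hypothetical. One simplifying assumption: batches are cut by item count so the example is deterministic, whereas a real system would typically cut them by time interval.

```python
from collections import Counter
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield successive lists of up to `batch_size` items from `stream`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """An ordinary batch job: word counts over a list of lines."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def streaming_word_count(stream, batch_size=2):
    """Apply the batch job to each mini batch and yield the running totals."""
    running = Counter()
    for batch in mini_batches(stream, batch_size):
        running.update(count_words(batch))
        yield dict(running)


reviews = ["clean rooms", "clean hotel", "good hotel"]
for totals in streaming_word_count(reviews):
    print(totals)
```

The batch job `count_words` knows nothing about streaming; only the thin loop around it does, which is the sense in which a batch-first engine can serve streaming workloads.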
![Page 10: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/10.jpg)
Luigi
● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
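Dependency resolution and resuming can be sketched in a few lines of plain Python. This is not Luigi: the `Task` base class, the `build` function and the `Crawl`/`Analyze`/`Report` tasks are hypothetical stand-ins, and completion is tracked in an in-memory set, whereas a real pipeline tool would usually check for output files. Parallelism is left out to keep the sketch short.

```python
class Task:
    """Minimal stand-in for a pipeline task (not Luigi's base class)."""

    done = set()  # names of completed tasks, shared by all tasks

    def requires(self):
        """Tasks that must finish before this one runs."""
        return []

    def complete(self):
        return type(self).__name__ in Task.done

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    Task.done.add(type(task).__name__)
    log.append(type(task).__name__)


class Crawl(Task):
    def run(self):
        pass  # placeholder for real work


class Analyze(Task):
    def requires(self):
        return [Crawl()]

    def run(self):
        pass


class Report(Task):
    def requires(self):
        return [Analyze()]

    def run(self):
        pass


log = []
build(Report(), log)
print(log)  # order in which tasks actually ran
```

Because `build` checks `complete()` before doing anything, rerunning the pipeline after a failure only executes the tasks that did not finish.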
![Page 11: DevTalks Cluj - Open-Source Technologies for Analyzing Text](https://reader031.vdocuments.site/reader031/viewer/2022030207/58a98d981a28ab412d8b62c3/html5/thumbnails/11.jpg)
Try it out!
GitHub repo showcasing:
● Luigi
● Scrapy
● Word2Vec model training with gensim
@ https://github.com/trustyou/meetups