http://hadoop.apache.org/
2. What is the Hadoop Distributed File System (HDFS)?
Source: http://www-01.ibm.com/software/data/inf ... doop/hdfs/
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
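To make the block idea concrete, here is a toy, single-process Python sketch: a file's bytes are cut into fixed-size blocks and handed out round-robin to nodes. This is an illustration only; real HDFS uses much larger blocks (64/128 MB) plus replication, and the 16-byte block size and node names below are invented for the demo.

```python
# Illustration only: splitting data into fixed-size blocks and spreading
# them across nodes, loosely mimicking what HDFS does at a much larger scale.

def split_into_blocks(data: bytes, block_size: int):
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes):
    """Assign each block to a node round-robin; returns {node: [blocks]}."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

data = b"0123456789" * 5                      # 50 bytes of sample data
blocks = split_into_blocks(data, 16)          # toy 16-byte "blocks"
placement = distribute(blocks, ["node1", "node2", "node3"])
```

Because each block lives on its own node, a map function can run on every block in parallel, which is exactly the scalability point made above.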
3. What is MapReduce?
Source: http://lintool.github.io/MapReduceAlgorithms/
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
See also: https://databricks.com/blog/2014/03/26/ ... ark-2.html
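To see the programming model itself, here is the classic word-count example written as a toy in plain Python, mimicking the map, shuffle, and reduce phases in one process. The function names are illustrative, not Hadoop's API; on a real cluster the framework runs these phases across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "data processing on clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The point of the abstraction is that you only write the map and reduce functions; scheduling, data movement, and fault tolerance are the framework's job.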
4. Screen Scraping
Web scraping techniques using PHP or Python
http://www.webbotsspidersscreenscrapers.com/
http://www.phparch.com/books/phparchite ... -with-php/
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/Beautifu ... ation.html
http://scrapy.org/
http://www.givegoodweb.com/post/210/html-parser-for-php
http://www.regexr.com/
http://regex101.com/
http://www.regular-expressions.info/
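As a minimal alternative to the BeautifulSoup and Scrapy tools linked above, Python's standard-library html.parser can already pull links out of a page. The HTML snippet below is made up for the demo; in practice you would feed in a page fetched over HTTP.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = ('<p>See <a href="http://scrapy.org/">Scrapy</a> and '
        '<a href="http://www.crummy.com/software/BeautifulSoup/">BS</a>.</p>')
parser = LinkExtractor()
parser.feed(html)
```

A real parser like this is usually more robust than the regex approaches linked above, since HTML attribute order, quoting, and nesting vary too much for simple patterns.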
5. Make your own search engine
Source: http://polyglotprogramming.com/papers/S ... eModel.pdf
The inverted index is a classic algorithm needed for building search engines. Before running MapReduce, crawl the web, find all the pages, and build a data set of URL -> document contents, written to flat files in HDFS or one of the more “sophisticated” formats.
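As a language-neutral sketch of the idea, here is the same map/reduce shape in plain Python: it turns a URL -> contents data set into word -> [URLs]. The data is a toy and everything runs in one process; the linked documents show the cluster versions.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: {url: contents}. The "map" step emits (word, url) pairs;
    the "shuffle/reduce" step groups them into word -> sorted list of URLs."""
    pairs = []
    for url, contents in pages.items():           # map over each document
        for word in set(contents.lower().split()):
            pairs.append((word, url))
    index = defaultdict(set)                      # group by word
    for word, url in pairs:
        index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

pages = {
    "http://a.example/": "hadoop stores blocks",
    "http://b.example/": "spark and hadoop",
}
index = build_inverted_index(pages)
```

Answering a query then reduces to looking up each query term in the index and intersecting the URL lists.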
See also:
http://press.princeton.edu/titles/8216.html
http://infolab.stanford.edu/~backrub/google.html
http://www.ams.org/samplings/feature-co ... c-pagerank
http://lintool.github.io/MapReduceAlgor ... -final.pdf
https://developer.yahoo.com/hadoop/tuto ... dule4.html
For an example of building an inverted index with MapReduce in Java, see:
http://polyglotprogramming.com/papers/S ... eModel.pdf
or
http://www.slideshare.net/deanwampler/s ... pute-model
pages 27-34.
To better understand the next example, see this thread:
http://www.oopschool.com/phpBB3/viewtop ... f=65&t=332
and read about Maxwell’s equations: https://www.google.no/search?q=Maxwell% ... s&safe=off
For an example of building an inverted index with MapReduce in Spark/Scala, see the same document, pages 36-43.
http://spark.apache.org/docs/0.9.0/stre ... ck-example
Other methods, such as starting a crawl from a hub, a site node, or a cluster of sites, can also be used.
http://www.skupot.com/
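A hub-seeded crawl like the one mentioned above is essentially a breadth-first search over the link graph. This sketch runs over a hard-coded in-memory graph instead of real HTTP fetches, so the URLs and the LINK_GRAPH table are invented for the demo; a real crawler would fetch each page and extract its links.

```python
from collections import deque

# Invented link graph standing in for fetched pages: url -> outgoing links.
LINK_GRAPH = {
    "http://hub.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://b.example/"],
    "http://b.example/": ["http://hub.example/"],
}

def crawl(seed, get_links, limit=100):
    """Breadth-first crawl from a seed URL, visiting each page at most once."""
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl("http://hub.example/", lambda u: LINK_GRAPH.get(u, []))
```

The `seen` set prevents revisiting pages, and the `limit` keeps a real crawl from running away; politeness (robots.txt, rate limiting) would also be needed in practice.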
6. Compiling online
http://codepad.org/
http://compileonline.com/
https://ideone.com/
Example of a general search term:
https://www.google.no/search?q=running+ ... b&safe=off
7. Setting up a cron job
http://www.a2hosting.com/kb/getting-sta ... cure-shell
http://www.a2hosting.com/kb/cpanel/adva ... /cron-jobs
http://www.a2hosting.com/kb/developer-c ... -cron-jobs
https://service.futurequest.net/index.p ... o-i-use-it
http://www.thesitewizard.com/general/set-cron-job.shtml
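The guides above walk through cPanel and shell-based setups. As a generic sketch, a crontab entry that runs a script every night at 02:30 and appends its output to a log looks like the following; the script and log paths are made up for the example.

```
# m  h  dom mon dow  command
30   2  *   *   *    /usr/bin/python3 /home/user/crawler.py >> /home/user/crawler.log 2>&1
```

On most hosts you edit this with `crontab -e`; the five fields are minute, hour, day of month, month, and day of week.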
8. Related links
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
http://polyglotprogramming.com/
http://hortonworks.com/hadoop/hdfs/
http://blog.cloudera.com/blog/category/hdfs/
https://developer.yahoo.com/hadoop/tuto ... dule2.html
http://www.cloudera.com/content/clouder ... educe.html
http://www.packtpub.com/article/display ... ids-ext-js
http://en.wikipedia.org/wiki/Apache_Hadoop