Hadoop Distributed File System, MapReduce, Screenscraping +


Post by KBleivik

1. The Hadoop home page

http://hadoop.apache.org/

2. What is the Hadoop Distributed File System (HDFS)?
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
Source: http://www-01.ibm.com/software/data/inf ... doop/hdfs/
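The block idea can be sketched in a few lines of Python. This is only a toy illustration, not how HDFS is implemented: the function names, the tiny block size and the node names are made up for the example, and placement is plain round-robin. Real HDFS uses much larger blocks and also keeps several replicas of each block.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes):
    """Assign each block index to a node, round-robin (hypothetical policy)."""
    return {b: nodes[b % len(nodes)] for b in range(num_blocks)}


# Hypothetical input: 100 bytes of data and a three-node cluster.
data = b"0123456789" * 10
nodes = ["node-a", "node-b", "node-c"]

blocks = split_into_blocks(data, block_size=32)
placement = place_blocks(len(blocks), nodes)

for index, block in enumerate(blocks):
    print(index, placement[index], len(block))
```

Each node can then run a map function on the blocks it holds, which is what lets the work scale with the number of nodes.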

3. What is MapReduce?
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
Source: http://lintool.github.io/MapReduceAlgorithms/

See also: https://databricks.com/blog/2014/03/26/ ... ark-2.html
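To make the programming model concrete, here is a minimal single-process sketch of the three steps (map, shuffle, reduce) applied to a word count. It runs on one machine and only imitates what the execution framework would do across a cluster. The function names and the two sample documents are assumptions for illustration, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def run_word_count(documents):
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input.
documents = {
    "d1": "big data needs big clusters",
    "d2": "data on clusters",
}
counts = run_word_count(documents)
print(counts)
```

The point of the model is that map_phase and reduce_phase only see one document or one key at a time, so the framework is free to run many copies of them in parallel.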

4. Screen Scraping

Webscraping Techniques using PHP or Python

http://www.webbotsspidersscreenscrapers.com/

http://www.phparch.com/books/phparchite ... -with-php/

http://www.crummy.com/software/BeautifulSoup/

http://www.crummy.com/software/Beautifu ... ation.html

http://scrapy.org/

http://www.givegoodweb.com/post/210/html-parser-for-php

http://www.regexr.com/

http://regex101.com/

http://www.regular-expressions.info/
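As a small taste of what the tools above do, the following sketch pulls the link targets out of a piece of HTML. It uses only Python's standard library (html.parser) so that it runs without installing anything. BeautifulSoup and Scrapy, linked above, offer far more convenient APIs for real work. The class name, function name and sample HTML are made up for the example, and no page is actually downloaded.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """<html><body>
<a href="http://hadoop.apache.org/">Hadoop</a>
<a name="top">anchor without a link</a>
<a href="http://scrapy.org/">Scrapy</a>
</body></html>"""

links = extract_links(SAMPLE_HTML)
print(links)
```

Using a parser instead of a regular expression tends to be more robust when the markup is irregular, although regular expressions remain handy for cleaning up the extracted text.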

5. Make your own search engine
The inverted index is a classic algorithm needed for building search engines. Before running MapReduce, crawl the web, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS or one of the more “sophisticated” formats.
Source: http://polyglotprogramming.com/papers/S ... eModel.pdf
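The idea can be tried out on a tiny scale before moving to Hadoop. The sketch below takes a "URL -> doc contents" data set held in a Python dict and turns it into a "word -> list of URLs" index, with a map step and a grouping step. It is a simplified, single-machine illustration and not the code from the slides: the function names and the two sample pages are assumptions, and per-document word counts are left out.

```python
import re
from collections import defaultdict


def map_doc(url, contents):
    """Map: emit (word, url) once for each distinct word in a document."""
    for word in set(re.findall(r"[a-z0-9]+", contents.lower())):
        yield word, url


def build_inverted_index(crawl):
    """Group the emitted pairs by word, then sort the URLs for each word."""
    postings = defaultdict(set)
    for url, contents in crawl.items():
        for word, doc_url in map_doc(url, contents):
            postings[word].add(doc_url)
    return {word: sorted(urls) for word, urls in postings.items()}


# Hypothetical crawl result: URL -> document contents.
crawl = {
    "http://example.com/a": "Hadoop stores data in HDFS",
    "http://example.com/b": "Spark reads data from HDFS",
}
index = build_inverted_index(crawl)
print(index["hdfs"])
```

A search for a word then becomes a lookup in the index instead of a scan through every page.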

See also:

http://press.princeton.edu/titles/8216.html

http://infolab.stanford.edu/~backrub/google.html

http://www.ams.org/samplings/feature-co ... c-pagerank

http://lintool.github.io/MapReduceAlgor ... -final.pdf

https://developer.yahoo.com/hadoop/tuto ... dule4.html

For an example of building an inverted index in MapReduce (Java), see:

http://polyglotprogramming.com/papers/S ... eModel.pdf

or

http://www.slideshare.net/deanwampler/s ... pute-model

pages 27-34.

To better understand the next example, see this thread:

http://www.oopschool.com/phpBB3/viewtop ... f=65&t=332

and read about Maxwell’s equations: https://www.google.no/search?q=Maxwell% ... s&safe=off

For an example of building an inverted index in Spark (Scala), see the same document, pages 36-43.

http://spark.apache.org/docs/0.9.0/stre ... ck-example

Other methods, such as starting a crawl from a hub, a site node, or a cluster of sites, can also be used.

http://www.skupot.com/

6. Compiling online

http://codepad.org/

http://compileonline.com/

https://ideone.com/

Example of a general search term:

https://www.google.no/search?q=running+ ... b&safe=off

7. Setting up a cron job

http://www.a2hosting.com/kb/getting-sta ... cure-shell

http://www.a2hosting.com/kb/cpanel/adva ... /cron-jobs

http://www.a2hosting.com/kb/developer-c ... -cron-jobs

https://service.futurequest.net/index.p ... o-i-use-it

http://www.thesitewizard.com/general/set-cron-job.shtml

8. Related links

http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

http://polyglotprogramming.com/

http://hortonworks.com/hadoop/hdfs/

http://blog.cloudera.com/blog/category/hdfs/

https://developer.yahoo.com/hadoop/tuto ... dule2.html

http://www.cloudera.com/content/clouder ... educe.html

http://www.packtpub.com/article/display ... ids-ext-js

http://en.wikipedia.org/wiki/Apache_Hadoop
