2. What is the Hadoop Distributed File System (HDFS)?
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
Source: http://www-01.ibm.com/software/data/inf ... doop/hdfs/
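To make the idea of blocks concrete, here is a minimal sketch in plain Python. It is not HDFS code: the block size, the node names, and the round-robin placement are all assumptions chosen for illustration, and the helpers `split_into_blocks` and `place_blocks` are hypothetical.

```python
# Toy illustration only; this does not use or imitate the real HDFS API.
BLOCK_SIZE = 16  # bytes; hypothetical value, far smaller than a real block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplifying assumption)."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


data = b"the quick brown fox jumps over the lazy dog"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["node-a", "node-b"])  # hypothetical nodes
```

Each node now holds only a subset of the data, so a map function could run on every node against its local blocks.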
3. What is MapReduce?
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
See also: https://databricks.com/blog/2014/03/26/ ... ark-2.html
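The programming model can be sketched without any cluster at all. The following word count in plain Python imitates the map, shuffle, and reduce steps in a single process; the function names and sample documents are assumptions for illustration, and a real framework would run these steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


# Hypothetical input documents.
documents = ["big data big clusters", "data on clusters"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The programmer writes only `map_phase` and `reduce_phase`; grouping by key is the part the execution framework would normally handle.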
4. Screen Scraping
Web scraping techniques using PHP or Python:
http://www.phparch.com/books/phparchite ... -with-php/
http://www.crummy.com/software/Beautifu ... ation.html
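As a rough idea of what scraping involves, here is a minimal sketch that extracts links from a page. It uses only the Python standard library (`html.parser`) rather than the Beautiful Soup library linked above, and the sample HTML and the `LinkExtractor` class are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <a href="http://example.com/one">One</a>
  <p>No link here</p>
  <a href="http://example.com/two">Two</a>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(SAMPLE_HTML)
links = extractor.links
```

The extracted links could then feed a crawl queue, which is the first step toward the search engine in the next section.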
5. Make your own search engine
The inverted index is a classic algorithm needed for building search engines. Before running MapReduce, crawl the web, find all the pages, and build a data set that maps each URL to its document contents, written to flat files in HDFS or one of the more “sophisticated” formats.
Source: http://polyglotprogramming.com/papers/S ... eModel.pdf
http://www.ams.org/samplings/feature-co ... c-pagerank
http://lintool.github.io/MapReduceAlgor ... -final.pdf
https://developer.yahoo.com/hadoop/tuto ... dule4.html
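The inverted index described above can be sketched in a few lines of plain Python. This is a single-process imitation of the map and reduce steps, not the Java or Scala code from the linked slides; the crawl data and the names `index_map` and `index_reduce` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical crawl output: URL -> document contents.
crawl = {
    "http://example.com/a": "hadoop stores data in blocks",
    "http://example.com/b": "spark and hadoop process data",
}


def index_map(url, contents):
    """Emit a (word, url) pair for each distinct word in a document."""
    for word in set(contents.lower().split()):
        yield word, url


def index_reduce(word, urls):
    """Collect the sorted list of URLs that contain the word."""
    return word, sorted(urls)


# Shuffle step: group the emitted URLs by word.
grouped = defaultdict(list)
for url, contents in crawl.items():
    for word, doc_url in index_map(url, contents):
        grouped[word].append(doc_url)

inverted_index = dict(index_reduce(w, u) for w, u in grouped.items())
```

Looking up a search term is then a dictionary access, for example `inverted_index["hadoop"]`, which returns every URL containing that word.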
For an example of building an inverted index with MapReduce in Java, see:
http://polyglotprogramming.com/papers/S ... eModel.pdf
http://www.slideshare.net/deanwampler/s ... pute-model
pages 27-34.
To better understand the next example, see this thread and read about Maxwell’s equations: https://www.google.no/search?q=Maxwell% ... s&safe=off
For an example of building an inverted index with Spark in Scala, see pages 36-43 of the same document.
http://spark.apache.org/docs/0.9.0/stre ... ck-example
Other methods, such as starting a crawl from a hub, a site node, or a cluster of sites, can also be used.
6. Compiling online
Example of a general search term:
https://www.google.no/search?q=running+ ... b&safe=off
7. Setting up a cron job
http://www.a2hosting.com/kb/getting-sta ... cure-shell
http://www.a2hosting.com/kb/cpanel/adva ... /cron-jobs
http://www.a2hosting.com/kb/developer-c ... -cron-jobs
https://service.futurequest.net/index.p ... o-i-use-it
8. Related links
https://developer.yahoo.com/hadoop/tuto ... dule2.html
http://www.cloudera.com/content/clouder ... educe.html
http://www.packtpub.com/article/display ... ids-ext-js