http://hadoop.apache.org/
2. What is the Hadoop Distributed File System (HDFS)?
Source: http://www-01.ibm.com/software/data/inf ... doop/hdfs/
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
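To make the block idea concrete, here is a toy, single-process Python sketch: a file's bytes are cut into fixed-size blocks and handed out round-robin to nodes. This is an illustration only; real HDFS uses much larger blocks (64/128 MB) plus replication, and the 16-byte block size and node names below are invented for the demo.

```python
# Illustration only: splitting data into fixed-size blocks and spreading
# them across nodes, loosely mimicking what HDFS does at a much larger scale.

def split_into_blocks(data: bytes, block_size: int):
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes):
    """Assign each block to a node round-robin; returns {node: [blocks]}."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

data = b"0123456789" * 5                      # 50 bytes of sample data
blocks = split_into_blocks(data, 16)          # toy 16-byte "blocks"
placement = distribute(blocks, ["node1", "node2", "node3"])
```

Because each block lives on its own node, a map function can run on every block in parallel, which is exactly the scalability point made above.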
3. What is MapReduce?
Source: http://lintool.github.io/MapReduceAlgorithms/
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
See also: https://databricks.com/blog/2014/03/26/ ... ark-2.html
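To see the programming model itself, here is the classic word-count example written as a toy in plain Python, mimicking the map, shuffle, and reduce phases in one process. The function names are illustrative, not Hadoop's API; on a real cluster the framework runs these phases across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "data processing on clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The point of the abstraction is that you only write the map and reduce functions; scheduling, data movement, and fault tolerance are the framework's job.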
4. Screen Scraping
Web scraping techniques using PHP or Python
http://www.webbotsspidersscreenscrapers.com/
http://www.phparch.com/books/phparchite ... -with-php/
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/Beautifu ... ation.html
http://scrapy.org/
http://www.givegoodweb.com/post/210/html-parser-for-php
http://www.regexr.com/
http://regex101.com/
http://www.regular-expressions.info/
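As a minimal alternative to the BeautifulSoup and Scrapy tools linked above, Python's standard-library html.parser can already pull links out of a page. The HTML snippet below is made up for the demo; in practice you would feed in a page fetched over HTTP.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = ('<p>See <a href="http://scrapy.org/">Scrapy</a> and '
        '<a href="http://www.crummy.com/software/BeautifulSoup/">BS</a>.</p>')
parser = LinkExtractor()
parser.feed(html)
```

A real parser like this is usually more robust than the regex approaches linked above, since HTML attribute order, quoting, and nesting vary too much for simple patterns.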
5. Make your own search engine
Source: http://polyglotprogramming.com/papers/S ... eModel.pdf
The inverted index is a classic algorithm needed for building search engines. Before running MapReduce, crawl the web, find all the pages, and build a data set of URL -> document contents, written to flat files in HDFS or one of the more “sophisticated” formats.
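As a language-neutral sketch of the idea, here is the same map/reduce shape in plain Python: it turns a URL -> contents data set into word -> [URLs]. The data is a toy and everything runs in one process; the linked documents show the cluster versions.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: {url: contents}. The "map" step emits (word, url) pairs;
    the "shuffle/reduce" step groups them into word -> sorted list of URLs."""
    pairs = []
    for url, contents in pages.items():           # map over each document
        for word in set(contents.lower().split()):
            pairs.append((word, url))
    index = defaultdict(set)                      # group by word
    for word, url in pairs:
        index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

pages = {
    "http://a.example/": "hadoop stores blocks",
    "http://b.example/": "spark and hadoop",
}
index = build_inverted_index(pages)
```

Answering a query then reduces to looking up each query term in the index and intersecting the URL lists.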
See also:
http://press.princeton.edu/titles/8216.html
http://infolab.stanford.edu/~backrub/google.html
http://www.ams.org/samplings/feature-co ... c-pagerank
http://lintool.github.io/MapReduceAlgor ... -final.pdf
https://developer.yahoo.com/hadoop/tuto ... dule4.html
For an example of building an inverted index with MapReduce in Java, see:
http://polyglotprogramming.com/papers/S ... eModel.pdf
or
http://www.slideshare.net/deanwampler/s ... pute-model
pages 27-34.
To better understand the next example, see this thread:
http://www.oopschool.com/phpBB3/viewtop ... f=65&t=332
and read about Maxwell’s equations: https://www.google.no/search?q=Maxwell% ... s&safe=off
For an example of building an inverted index with MapReduce in Spark/Scala, see the same document, pages 36-43.
http://spark.apache.org/docs/0.9.0/stre ... ck-example
Other methods, such as starting a crawl from a hub, a site node, or a cluster of sites, can also be used.
http://www.skupot.com/
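A hub-seeded crawl like the one mentioned above is essentially a breadth-first search over the link graph. This sketch runs over a hard-coded in-memory graph instead of real HTTP fetches, so the URLs and the LINK_GRAPH table are invented for the demo; a real crawler would fetch each page and extract its links.

```python
from collections import deque

# Invented link graph standing in for fetched pages: url -> outgoing links.
LINK_GRAPH = {
    "http://hub.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://b.example/"],
    "http://b.example/": ["http://hub.example/"],
}

def crawl(seed, get_links, limit=100):
    """Breadth-first crawl from a seed URL, visiting each page at most once."""
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl("http://hub.example/", lambda u: LINK_GRAPH.get(u, []))
```

The `seen` set prevents revisiting pages, and the `limit` keeps a real crawl from running away; politeness (robots.txt, rate limiting) would also be needed in practice.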
6. Compiling online
http://codepad.org/
http://compileonline.com/
https://ideone.com/
Example of a general search term:
https://www.google.no/search?q=running+ ... b&safe=off
7. Setting up a cron job
http://www.a2hosting.com/kb/getting-sta ... cure-shell
http://www.a2hosting.com/kb/cpanel/adva ... /cron-jobs
http://www.a2hosting.com/kb/developer-c ... -cron-jobs
https://service.futurequest.net/index.p ... o-i-use-it
http://www.thesitewizard.com/general/set-cron-job.shtml
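The guides above walk through cPanel and shell-based setups. As a generic sketch, a crontab entry that runs a script every night at 02:30 and appends its output to a log looks like the following; the script and log paths are made up for the example.

```
# m  h  dom mon dow  command
30   2  *   *   *    /usr/bin/python3 /home/user/crawler.py >> /home/user/crawler.log 2>&1
```

On most hosts you edit this with `crontab -e`; the five fields are minute, hour, day of month, month, and day of week.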
8. Related links
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
http://polyglotprogramming.com/
http://hortonworks.com/hadoop/hdfs/
http://blog.cloudera.com/blog/category/hdfs/
https://developer.yahoo.com/hadoop/tuto ... dule2.html
http://www.cloudera.com/content/clouder ... educe.html
http://www.packtpub.com/article/display ... ids-ext-js
http://en.wikipedia.org/wiki/Apache_Hadoop