This practical uses Hadoop to process data: you set up a Hadoop platform, analyse the data, and write the corresponding Map and Reduce functions. Be warned that the data set is awkward: it is completely raw data, and even the character encodings are inconsistent.
Key Competency
Using MapReduce to process data
Necessary Skills
- expressing an algorithm in the MapReduce style
- choosing appropriate classes and methods from the MapReduce API
- testing and debugging
- writing clear, tidy, consistent and understandable code
Requirements
The practical involves manipulating fairly large data files using the Hadoop implementation of MapReduce.
When working in the lab, it is highly recommended that you copy a small subset of these files to your local machine under /cs/scratch/username, and use them to develop and test your program. Do not use the input files directly from studres or your home folder, as this would place excessive load on the network.
Your program should perform the following operations:
- Obtain the absolute paths of the input and output directories from the user. The input must be read from files in the input directory, and the output must be written to files in the output directory.
- Find all character-level or word-level n-grams, depending on the user's input, for text fragments written in a given language contained within files in the input directory.
- Print the list of n-grams and their frequencies to a file in the output directory, in alphabetical order.
One possible interaction with the program could be as follows (assuming your username is mucs1):
Enter the input directory: /cs/scratch/mucs1/p5/data
Enter the output directory: /cs/scratch/mucs1/p5/output
Type of n-gram (C)haracter or (W)ord: W
Value of N for n-grams: 2
Language: EN-GB
The first few lines of output will then look similar to the following:
a basis 7
a border 1
a central 1
a coating 1
You must use Hadoop MapReduce, rather than conventional methods, wherever applicable to compute the above. You may reuse any code from your previous practicals, so long as you clearly identify it.
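As an illustration, a minimal job driver matching the interaction above might look like the sketch below. The class names NGramDriver, NGramMapper and NGramReducer and the configuration property names (ngram.type, ngram.size, target.language) are illustrative assumptions, not part of the specification; sketches of the mapper and reducer appear after the Hints section.

import java.util.Scanner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class NGramDriver {

    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(System.in);
        System.out.print("Enter the input directory: ");
        String inputDir = in.nextLine().trim();
        System.out.print("Enter the output directory: ");
        String outputDir = in.nextLine().trim();
        System.out.print("Type of n-gram (C)haracter or (W)ord: ");
        String type = in.nextLine().trim();
        System.out.print("Value of N for n-grams: ");
        int n = Integer.parseInt(in.nextLine().trim());
        System.out.print("Language: ");
        String language = in.nextLine().trim();

        JobConf conf = new JobConf(NGramDriver.class);
        conf.setJobName("ngram-count");

        // Pass the user's choices to the mapper via the job configuration.
        conf.set("ngram.type", type);
        conf.setInt("ngram.size", n);
        conf.set("target.language", language);

        FileInputFormat.setInputPaths(conf, new Path(inputDir));
        FileOutputFormat.setOutputPath(conf, new Path(outputDir));

        conf.setMapperClass(NGramMapper.class);
        conf.setReducerClass(NGramReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Text keys are sorted in byte order by default, which for plain
        // ASCII text gives the required alphabetical ordering of n-grams.
        JobClient.runJob(conf);
    }
}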
For your convenience, when finding n-grams (character- or word-level), skip all words that contain anything other than uppercase or lowercase letters (numbers, symbols, parentheses, etc.). For character-level n-grams, you do not need to indicate word boundaries with an underscore as you did in Practical 2.
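A sketch of one way to apply this rule, assuming words are obtained by splitting each line on whitespace (the method name keepLettersOnly is illustrative, and the enclosing class would need java.util.List and java.util.ArrayList imports):

    // Returns only the words consisting entirely of ASCII letters; words
    // containing digits, symbols, parentheses, etc. are skipped.
    private static List<String> keepLettersOnly(String line) {
        List<String> words = new ArrayList<String>();
        for (String word : line.split("\\s+")) {
            if (word.matches("[A-Za-z]+")) {
                words.add(word);
            }
        }
        return words;
    }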
In this practical you need only run your code on the local machine. However, it should be written so that, if you did have access to a large Hadoop cluster, it would work without needing to be adapted.
Your program should deal gracefully with possible errors, such as input files being unavailable or containing data in an unexpected format. The source code for your program should follow common style guidelines, including:
- formatting code neatly
- consistency in name formats for methods, fields, variables
- avoiding embedded “magic numbers” and string literals
- minimising code duplication
- avoiding long methods
- using comments and informative method/variable names to make the code clear to the reader
Deliverables
Hand in, via MMS, a zip file containing the following:
- Your Java source files
- A brief report (maximum 3 pages) explaining the decisions you made, how you tested your program, and how you solved any difficulties that you encountered. Include instructions on how to run your program and any dependencies that need resolving. You can use any software you like to write your report, but your submitted version must be in PDF format.
- Also within your report:
- Highlight one piece of feedback from your previous submissions, and explain how you used it to improve this submission
- If you had to carry out a similar large-scale data processing task in future, would you choose Hadoop or basic file manipulation as you did in earlier practicals? Write a brief comparison of the two methods, explaining the advantages and disadvantages of each, and justify your choice.
Extensions
If you wish to experiment further, you could try any or all of the following:
- Give the user an option to order the n-grams by occurrence frequency. Hint: you could use the 'Most Popular Word' example on studres as a starting point; a sketch of one possible approach follows this list.
- Perform additional queries on the data, such as:
a. Count all text fragments, in a given language, that contain a string given by the user
b. Find which word occurs in the highest number of different languages
c. Any other additional statistics/queries about the data you could generate using Hadoop
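For the frequency-ordering extension, one common pattern is a second MapReduce job over the first job's output that swaps key and value, so that the shuffle sorts by count. Below is a minimal sketch, assuming the first job writes lines of the form ngram<TAB>count (the default TextOutputFormat layout); the class names are illustrative.

// SwapMapper.java: emits (count, ngram) so the sort phase orders by frequency.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SwapMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            output.collect(new IntWritable(Integer.parseInt(parts[1])),
                           new Text(parts[0]));
        }
    }
}

// DescendingIntComparator.java: reverses the natural IntWritable order so
// the most frequent n-grams come first. Register it on the second job with
// conf.setOutputKeyComparatorClass(DescendingIntComparator.class).
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingIntComparator extends WritableComparator {

    public DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }
}

With this arrangement the second job can keep the default identity reducer, and its output lists (count, n-gram) pairs in frequency order.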
You are free to implement your own extensions, but clearly state these in your report. If you use any additional libraries to implement your extensions, ensure that you include these in your submission, with clear instructions to your tutor on how to resolve any dependencies and run your program.
Marking
- For a pass mark (7) it is necessary to show evidence of a serious attempt at both the programming task and the report.
- A mark of 13 can be achieved with a partial solution to the main problem.
- A mark of 17 can be achieved with a good and complete solution to the main problem and a well written report to match.
- For higher marks it is necessary to also attempt one or more of the mentioned extension activities, or suitable extension activities of your own.
Hints
Here is one possible sequence for developing your solution. It is recommended that you make a new method or class at each stage, so that you can easily go back to a previous stage if you run into problems. Please use demonstrator support in the labs whenever possible.
- You will need to examine the structure of the data files, to see how the text fragments and language specifications are represented.
- Tackle one problem at a time, beginning with selecting all text in a particular language, which requires a mapper class but no reducer. Initially, you can use a fixed search language String in your mapper class for testing purposes.
- To select only text in the required language, the difficulty is that the language is recorded in a different line from the text fragment, so it will be processed in a different call to map. This can be solved using a field in the mapper object to record the most recently encountered language. The map method can then either update this field, if the current line contains a language, or check the field's value, if the current line contains a text fragment (see the mapper sketch following these hints).
- To test, first make a new directory and copy 10 or 20 of the data files into it—the full data set will take inconveniently long to run.
- Once this works, refine your solution so that the search language is passed as a parameter. Recall that you can pass a parameter value to a mapper or reducer by setting it on the job configuration (for example with JobConf's set method) and reading it back in the mapper's configure method.
- To return text in the specified language as n-grams, you will also need to pass the user's specified n-gram type and size as parameters, using the same mechanism as in the previous step. Following this, it is recommended that you reuse your n-gram creation code from Practical 2 to split each String into its corresponding n-grams. Remember that, unlike Practical 2, you do not need to represent word boundaries with an underscore.
- In order to output the n-gram frequencies alongside the n-grams themselves, you will need to implement a reducer class that groups duplicate n-grams and sums their total frequency. For a reminder of how to do this, review the 'Word Count' example on studres; a matching reducer appears in the sketch following these hints.
- For ordering your output n-grams, recall that sorting order is specified with the setOutputKeyComparatorClass method of the JobConf class.
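To make the hints above concrete, here is one possible shape for the mapper, written against the older org.apache.hadoop.mapred API that the last hint refers to. This is a sketch under stated assumptions: the configuration property names match the driver sketch given earlier; isLanguageLine and extractLanguage are placeholders whose real logic depends on the data-file format, which you must work out by inspecting the files; and the n-gram definition shown is only one plausible reading, so substitute your own code from Practical 2.

// NGramMapper.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class NGramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    // Parameters supplied by the driver through the job configuration.
    private String targetLanguage;
    private boolean characterLevel;
    private int n;

    // The most recently encountered language line. Lines within a split
    // are processed in file order, so this field records the language
    // that applies to the text fragments that follow it.
    private String currentLanguage = "";

    public void configure(JobConf job) {
        targetLanguage = job.get("target.language");
        characterLevel = "C".equalsIgnoreCase(job.get("ngram.type", "W"));
        n = job.getInt("ngram.size", 2);
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        if (isLanguageLine(line)) {
            currentLanguage = extractLanguage(line);
        } else if (currentLanguage.equals(targetLanguage)) {
            for (String ngram : nGrams(line)) {
                output.collect(new Text(ngram), ONE);
            }
        }
    }

    // PLACEHOLDERS: both methods depend on the data-file format; these
    // are not real parsers and must be replaced with your own logic.
    private boolean isLanguageLine(String line) {
        throw new UnsupportedOperationException("format-specific");
    }

    private String extractLanguage(String line) {
        throw new UnsupportedOperationException("format-specific");
    }

    // One plausible n-gram definition; replace with your Practical 2 code.
    private List<String> nGrams(String line) {
        List<String> words = new ArrayList<String>();
        for (String w : line.split("\\s+")) {
            if (w.matches("[A-Za-z]+")) {   // letters only, as required
                words.add(w);
            }
        }
        List<String> result = new ArrayList<String>();
        if (characterLevel) {
            for (String w : words) {
                for (int i = 0; i + n <= w.length(); i++) {
                    result.add(w.substring(i, i + n));
                }
            }
        } else {
            for (int i = 0; i + n <= words.size(); i++) {
                StringBuilder sb = new StringBuilder(words.get(i));
                for (int j = 1; j < n; j++) {
                    sb.append(' ').append(words.get(i + j));
                }
                result.add(sb.toString());
            }
        }
        return result;
    }
}

The matching reducer is the standard summing reducer from the 'Word Count' example:

// NGramReducer.java (a separate source file)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class NGramReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}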