Nico Budi Darmawan Tan - Simple Outside, Complicated Inside.: October 2014

Tuesday, October 21, 2014

Positive Attitude

While the Vietnam War was raging, thousands of United States soldiers were captured by the Viet Cong.

News of horrific torture was spread as propaganda to break the morale of the US soldiers. Many felt overwhelming fear, and many of them suffered mental breakdowns and died of fear.

Unlike the other prisoners, John McCain had a different mental attitude. He always told himself, “I must be strong and hold on so that I can leave prison alive, because a great task awaits me”.

Holding on to this conviction, John endured torture for 5.5 years and left prison alive.

He became a United States senator. Time magazine named him one of the 25 most influential people in the United States.

From the story above, we can draw the lesson that a positive attitude is a choice. We may not be able to choose the circumstances that befall us, but we can still choose how we face them. Focus on positive things, create hope, and hold on to that hope with faith. Carry out every process with perseverance and patience. Believe that success is only a matter of time. You can do it!

“Choose the positive. You have a choice, and you are the one who decides.
Choose the positive, the constructive.
Optimism is the faith that leads us to success”.
-Bruce Lee-

Source: Agus Gunawan (Chief HCM - PT AGIT - via email blast, August 8, 2014)

Monday, October 20, 2014

Business Intelligence for Banking / BI For Banking

The banking industry currently faces very tough market challenges: the need for a highly secure transaction environment, uncertain global economic conditions, strict government regulation, and customers who always have high expectations. Banks need to develop strategies not only to retain existing customers but also to acquire new ones.

These demands also include identifying and supporting profitable customers, improving operations at the grassroots level, and responding quickly to portfolio performance.

With BI, decision makers at every level, from top-level management and middle management to operational staff, can make fast and accurate decisions, and every stakeholder gets comprehensive information suited to their business role. The company gains a ‘Single View of the Truth’ over all information at every level of the organization.

BI provides a new way of looking at performance at every level, from the enterprise down to individual staff, and from atomic transactions to summary transactions.

Some banking business areas that can use BI tools:

Asset and Liability Management

Interest Rate Sensitivity Analysis
Liquidity Analysis
Short term Funding Management
Financial Management Accounting
Capital Allocation Analysis
Capital Procurement
Credit Loss Provision
Funds Maturity Analysis
Income Analysis
Net Interest Margin Variance
Structured Finance Analysis
Equity Position Exposure

Relationship Marketing

Customer Interaction Analysis
Customer Investment Profile
Individual Customer Profile
Wallet Share Analysis
Customer Complaints Analysis
Customer Delinquency Analysis
Customer Loyalty
Market Analysis
Campaign Analysis
Cross Sell Analysis
Customer Attrition Analysis
Customer behavior

Profitability

Transaction Analysis
Activity based Costing Analysis
Insurance Product Analysis
Investment Arrangement Analysis
Profitability Analysis
Channel Profitability
Customer Life Time Value
Customer Profitability
Location Profitability
Product Profitability
Product Analysis
Organization Unit Profitability
Performance Measurement
Business Procedure Performance
Lead Analysis
Position Valuation Analysis

Risk

Interest Rate Risk Analysis
Credit Risk Profile
Credit Risk Assessment
Credit Risk Mitigation Assessment
Securitization Analysis
Operational Risk Assessment
Outstandings Analysis
Portfolio Credit Exposure
Security Analysis
Liquidity Risk
Collections Analysis
Insurance Risk Profile

From all the areas mentioned above, a few indicators can be selected for top-level management that represent the overall condition of the bank. These are called Banking Key Indicators.

These Banking Key Indicators include:

Total Assets
Deposits
Loans
Earning Assets
Net Interest Income
Loan to Deposit Ratio (LDR) (%)
Return on Assets (%)
Gross Non Performing Loans (%)
Non Performing Loan (NPL)
Net Non Performing Loans (%)
Capital Adequacy Ratio (CAR) (%)
Loans/Earning Assets (%)
Net Interest Margin (NIM) (%)
Liquid Assets/Total Assets (%)
Core Deposits/Total Assets (%)
Cost Efficiency Ratio (%)
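
Several of the ratios in this list follow directly from balance-sheet totals. The sketch below is only an illustration: the figures are hypothetical, the function and key names are invented for this example, and the formulas use simple year-end balances (real reporting often uses averages and regulator-specific definitions).

```python
def ratio_pct(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return 100.0 * numerator / denominator


# Hypothetical figures in one currency unit, for illustration only.
bank = {
    "total_assets": 1000.0,
    "deposits": 800.0,
    "loans": 600.0,
    "earning_assets": 900.0,
    "net_interest_income": 45.0,
    "net_income": 15.0,
    "non_performing_loans": 12.0,
}

# Simplified definitions (an assumption of this sketch).
indicators = {
    "LDR (%)": ratio_pct(bank["loans"], bank["deposits"]),
    "ROA (%)": ratio_pct(bank["net_income"], bank["total_assets"]),
    "Gross NPL (%)": ratio_pct(bank["non_performing_loans"], bank["loans"]),
    "NIM (%)": ratio_pct(bank["net_interest_income"], bank["earning_assets"]),
    "Loans/Earning Assets (%)": ratio_pct(bank["loans"], bank["earning_assets"]),
}

for name, value in indicators.items():
    print(f"{name}: {value:.2f}")
```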

To make multidimensional analysis possible on the indicators listed above, a dimensional data model needs to be built that defines each measure and the dimensions related to it.

This work is usually done by a data modeler, assisted by a banking business analyst or a subject matter expert (SME) for each subject area.

Once data has been populated into the cube, the information can be presented as a digital dashboard, balanced scorecard, or report, depending on user needs.
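
As a rough illustration of measures and dimensions, the sketch below keeps a handful of fact rows in memory and rolls one measure up along any combination of dimensions. The rows, the dimension names (branch, product, month) and the `roll_up` helper are all hypothetical. A real cube would live in an OLAP engine or data warehouse rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions plus one measure ("loans").
facts = [
    {"branch": "Jakarta", "product": "Mortgage", "month": "2014-09", "loans": 120.0},
    {"branch": "Jakarta", "product": "Auto", "month": "2014-09", "loans": 80.0},
    {"branch": "Bandung", "product": "Mortgage", "month": "2014-09", "loans": 50.0},
    {"branch": "Jakarta", "product": "Mortgage", "month": "2014-10", "loans": 130.0},
]


def roll_up(rows, dimensions, measure):
    """Sum one measure, grouped by the given list of dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_branch = roll_up(facts, ["branch"], "loans")
by_branch_month = roll_up(facts, ["branch", "month"], "loans")
print(by_branch)
print(by_branch_month)
```

The same fact rows answer different questions depending on which dimensions are chosen, which is the idea behind slicing a cube for a dashboard or report.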

Some benefits of using BI are:

Balance business risk and business growth while managing regulatory compliance.
Leverage investments in existing resources and infrastructure by collecting data from the various existing data sources.
Meet regulators' requirements quickly and with accurate information.
Provide BI reports for monitoring management performance at branches.
Provide intelligent customer information for promotional activities.
Provide a 360-degree view of the customer profile.
Make it easier to assess all aspects of organizational performance, such as income, profit, customer satisfaction, flexibility, movement, and growth.
Help perform profit analysis per branch and find the most profitable customers, services, or locations.
Manage credit risk, produce balance sheets with profit/loss reports, and standardize portfolios and credit analysis.
Provide a 360-degree view of financial and operational results.

Source: Yoyonb.Wordpress

With references:

IBM, Banking Data Warehouse General Information Manual
ElegantJ BI, Business Intelligence and KPI for Banking

Sunday, October 19, 2014

How to Handle the Expectations of Others - By Michelle (Considering Grace)

Have you ever wished you could pray like that person in church who prays with such authority? How about trying so hard to spend an hour per day in the Word, as your friend seems to be able to do? We are constantly comparing our walk with God with what others do. What’s worse is that we are also constantly trying to meet the expectations of those around us. We try to please people. We are forever trying to impress others and make people think good things about us. The scary thing is that we can easily get caught in a web of confusion by trying to meet those unrealistic expectations.

What Others Can Expect of Us:

People’s expectations of us can quite often be performance based. They have expectations about what kind of church involvement and volunteering there should be. A ‘good’ Christian would be a Sunday school teacher, say. Do you know how many people feel like they ‘should’ do something because of the expectations of others?
Some people in our lives may assume we should follow their advice on how we should or shouldn’t have a relationship with God. They sometimes expect that there is a certain type of devotion you should do in a week, or even for a specific amount of time each day! They base their knowledge on their own experience without leaving room for you to be who God designed you to be.

What God Expects You to Know About the Expectations of Others:

God expects us to follow His voice, and His voice only.

“The sheep that are My own hear and are listening to My voice; and I know them, and they follow Me.”
John 10:27

God said we are not defined by the works we do. I believe He expects us to use the gifts He has given to us, and apply them where needed. We need to be OK with saying no sometimes, and be sure our motivation for ministry is in the right place. Am I primarily doing it for God, or for others?

God sent us the Holy Spirit to guide us in our daily lives.

“If you love me, you will keep my commandments. And I will ask the Father, and he will give you another Helper, to be with you forever, even the Spirit of truth, whom the world cannot receive, because it neither sees him nor knows him. You know him, for he dwells with you and will be in you.”
John 14:15-17

God is not limited by the kind of devotions we do, how many days a week, or how long each day. I believe He wants our best effort as far as getting into the Bible, etc., but remember, it’s not about the expectations of others. Ask God to speak to you about what He expects of you each day.

“Come near to God and He will come near to you.”
James 4:8

WAIT!
Just in case some of you are thinking I’m making a blanket statement… I am not saying that God doesn’t use people in our lives to teach us and show us about having a closer walk with Him. He sure does. But we need to be extremely careful that we are living by God’s standards for ourselves, instead of trying to win the favour of others. They may very well have the best of intentions, however what God expects of them will not be the same as what He expects of you.

What to Expect:
It can just cause confusion and hurt when we try to live our Christian walks by other people’s standards. We are spiritual in our own ways, created individually by a God that surpasses ‘rules’. We are called to live holy, and be holy, and we are the ones that will one day be held responsible for our own walk with Him. As long as we surrender to the life that He calls us to, we don’t need to worry about what other people think. Just like Paul wanted to please God instead of people:

“Am I now trying to win human approval? Or am I trying to please people? If I were still trying to please people, I would not be a servant of Christ.”
Galatians 1:10

What have others expected of you in terms of your walk with the Lord? Compare that to what you feel God is saying to you. Is it the same? Ask God to help you hear His voice, and His alone.

Original Source : Michelle - ConsideringGrace.com

Thursday, October 16, 2014

Business Analyst - VS - Project Manager - Overview

Hi... Tonight I would like to post about the difference between a Business Analyst (BA) and a Project Manager (PM). In most projects and companies these roles are handled by different people, but some projects and companies merge them. From my perspective and experience, a merge can happen when the project scope is small to medium. In most cases it happens for efficiency (to reduce cost). In addition, some projects prefer a standalone PM but give the BA's tasks to a system analyst (SA). We can call that role a "Business System Analyst." You can find the posting here... Back to tonight's main topic, PM vs BA: I gathered some facts and information about it from several sources. Please read them below.

LinkedIn 1 :

Project Manager: (accountable for -HOW- solution will be developed)
PM is a supervisory role over the whole project
Responsible for project deliverables on Time, on Budget and on Scope (oToBoS).
Arranging team members (with different skills) is the PM’s responsibility
PM is responsible for controlling the solution scope
Monitors and controls the progress of the project

Business Analyst: (accountable for -WHAT- is required)
Needs to understand the problem or issue faced by the organization (or business)
BA needs to investigate and analyze the problem
BA recommends a solution that solves the organization’s problem
BA uses the following skills to find out what the organization needs
• Requirements documentation (Use-case, BPMN, Modeling techniques etc)
• Requirements Validations
• Managing different stakeholders
• Effective communication
• Reasonable Business knowledge

...

LinkedIn 2 :

Many BA's don't "enjoy" or "thrive" on conflict, PM is full of conflict.
BA's can balance detail with big picture. PM is big picture.
BA's like to do "real" work. PM is lots of meetings, reports, etc., and many BA's don't consider this to be "meaningful" work.
BA's listen to business, PM's are managing a budget / timeline.
BA's are inquisitive, PM's see scope creep.
BA's are non political. PM is full of politics.
BA's are pragmatic. PM is idealistic.
BA's seek to establish consensus. PM's can't wait to forge ahead.
BA's tend to be tactful. PM's tend to be direct.
BA's rarely deal with the difficult HR issues. PM's have to.

...

Project-Pro :

The Project Manager manages the project – “The application of knowledge, skills, tools, and techniques to project activities to meet the project requirements.”

The Business Analyst conducts business analysis – “The set of tasks and techniques used to work as a liaison among stakeholders to understand the structure, policies, and operations of an organization, and to recommend solutions that enable the organization to meet its goals.”

...

From the information above, we can conclude that the PM and the BA have very different roles and tasks. Each has an interesting point of view. So far, I have only worked as a BA and a TL (Team Leader). I wish I had a chance to gain experience as a PM, even on a small project. I think the PM and TL have similar roles, but they still differ in some ways. For now, I have moved to BD (Business Development), a new, exciting role with many challenges that are still unknown. If you would like to read about my journey, please read it here... OK, I think that's enough for tonight. Thank you for reading.

Wednesday, October 15, 2014

Thank You "Love"

Like a rainbow after the rain ...
Like a horizon after evening ...

A grace ...
A hope ...

That is you "Love" ...
Beautiful at the right time ...

The moment when you release me from my loneliness ...
I feel your presence real in your light ...

The moment when you give me hope in this doubt ...
I feel your aura touch my heart in your warmth ...

Thank you "Love" ...

Tuesday, October 14, 2014

High Availability - Overview & General Concepts

#Wikipedia

High availability is a characteristic of a system. The definition of availability is

Ao = up time / total time.

This equation is not practically useful, but if (total time - down time) is substituted for up time then you have

Ao = (total time - down time) / total time.

Determining tolerable down time is practical. From that, the required availability may be easily calculated.
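
As a small worked example of the formula above, the sketch below turns a tolerable amount of down time into the required availability. The function name, the four-hour figure and the 365-day year are assumptions chosen for illustration.

```python
def availability(total_time, down_time):
    """Ao = (total time - down time) / total time, with both in the same unit."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= down_time <= total_time:
        raise ValueError("down_time must be between 0 and total_time")
    return (total_time - down_time) / total_time


HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, continuous operation

# Hypothetical requirement: at most 4 hours of down time per year.
required = availability(HOURS_PER_YEAR, 4.0)
print(f"Required availability: {required:.4%}")
```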

High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

There are three principles of high availability engineering. They are

Elimination of single points of failure. This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure, but the maintenance activity must.

Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, or to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is, from the user's point of view, unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.

Percentage calculation
Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable per year, month, or week.

Uptime and availability are not synonymous. A system can be up, but not available, as in the case of a network outage.
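
The translation from an availability percentage to allowed downtime per year, month, or week can be sketched as below. The period lengths (a 365-day year and a 30-day month) and the function name are assumptions of this example. Other conventions give slightly different numbers.

```python
# Assumed period lengths in hours, for a system that must run continuously.
PERIOD_HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def allowed_downtime_hours(availability_pct):
    """Return the allowed downtime in hours for each period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return {
        period: hours * unavailable_fraction
        for period, hours in PERIOD_HOURS.items()
    }


for pct in (99.0, 99.9, 99.99, 99.999):
    downtime = allowed_downtime_hours(pct)
    print(
        f"{pct}%: {downtime['year']:.2f} h/year, "
        f"{downtime['month'] * 60:.1f} min/month, "
        f"{downtime['week'] * 60:.1f} min/week"
    )
```

For example, 99.9% availability allows about 8.76 hours of downtime per 365-day year.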

#Mirazon

In the world of IT, “high availability” is a term you often encounter. There are a few things that go into having high availability and assessing it. This week we will cover the general concepts behind maintaining highly available environments.

There are two major things to consider with high availability: redundancy and separation.

Redundancy

Redundancy involves providing excess capacity in the design in order to account for any failures without a performance decline.

An example of redundancy would be taking a server and plugging it into not just one, but two power circuits to protect against the power failure of one.

But what if the server itself fails? Add another server and put them together in a cluster.

However, if the circuit fails, the server or cluster would still go down. The key to protecting your availability is to double up (or triple even) on your equipment and power sources.
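
The effect of doubling up can be estimated with a simple probability sketch. It assumes component failures are independent, which real deployments only approximate, and the availability figures and function names are hypothetical.

```python
def parallel_availability(component_availability, copies):
    """Availability of redundant copies where any one copy is enough.

    Assumes the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(*availabilities):
    """Availability of a chain in which every part must be up."""
    result = 1.0
    for value in availabilities:
        result *= value
    return result


server = 0.99    # hypothetical availability of one server
circuit = 0.999  # hypothetical availability of one power circuit

cluster = parallel_availability(server, 2)
one_circuit = series_availability(cluster, circuit)
two_circuits = series_availability(cluster, parallel_availability(circuit, 2))

print(f"two-server cluster:        {cluster:.6f}")
print(f"cluster on one circuit:    {one_circuit:.6f}")
print(f"cluster on two circuits:   {two_circuits:.6f}")
```

With these numbers, a two-server cluster on a single circuit is limited by the circuit, which is why both the servers and the power sources need redundancy.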

Separation

Diversifying the power sources will help protect your servers from going down due to malfunction or power outages. A good way to improve your server’s or server cluster’s availability is to connect it to power sources from two different circuits. To maintain availability even during a widespread power outage, we recommend using at least one uninterruptible power supply (UPS). These UPSs are powerful for protecting availability because they take very little time to assume the power burden if there has been a mains power failure. A UPS is able to quickly supply energy through batteries or a flywheel.

While redundancy and separation are two different elements to ensuring high availability, it’s important to note that you need both. A server failing is as likely as a power failure. A solid plan for high availability accounts for both redundancy and separation in order to ensure there is a plan B for any situation.

Monday, October 13, 2014

To Build Custom Solution - VS - To Buy Commercial Off The Shelf

What is COTS ?

An adjective that describes software or hardware products that are ready-made and available for sale to the general public. For example, Microsoft Office is a COTS product that is a packaged software solution for businesses. COTS products are designed to be implemented easily into existing systems without the need for customization. #Webopedia

Commercial off-the-shelf: products that are commercially available and can be bought "as is". #Wikipedia

COTS vs Custom Solution - By Bob Mango

InfoWorld wrote: “To build or to buy IT applications – It’s a question of Shakespearean proportions. Should you license a commercial enterprise application that will meet 75 percent of your needs, or would it be nobler to build your own application, one that will track as closely as possible to the task at hand?” I think this statement really sets the stage for any analysis a company should take when considering a custom software solution or a commercial, off-the-shelf (COTS) solution.

Many have used the guidelines of buy when you need to automate commodity business processes; build when you’re dealing with the core processes that differentiate your company. If only it were that simple!

I have assembled some observations I’ve made during past software implementations as well as from customers who have gone down both paths. This is generic to any build vs. buy software decision.

Other Considerations

Custom :

How many people do you have on staff that can focus 100% of time on software development and support?
What does the technology resource pool at my company look like? Is anyone close to retiring? Is there a risk of losing the knowledge base?
What platform will you build your application on today? Is it a common standard? What is the projected end-of-life? What resources do you have with experience on this platform?
How will you keep up to date on current technologies?
What software development methodology will you use? Rapid Application Development (RAD), Joint Application Development (JAD), Rational Unified Process (RUP), Spiral, Waterfall?
How will you charge-back cost of development and support for your software? “Who foots the bill?”
Are you reinventing the wheel? Does a COTS software exist that meets the majority of your needs? Can this software be customized to meet your needs?
When evaluating whether to buy or build, it’s critical to thoroughly understand total costs during the software lifecycle, typically seven or eight years. (Typically, 70 percent of software costs occur after implementation.) A rigorous lifecycle analysis that realistically estimates ongoing maintenance by in-house developers often tips the balance in favor of buying.

Off-the-Shelf :

When choosing a tools vendor, software development managers should consider the background and stability of the vendor as a key component of their selection criteria.
How much of the vendor’s business is made up of software and software related revenue vs. other revenue?
How rigid is the COTS software? Can it be configured to my unique needs?
How much custom coding/development do customers usually request from the vendor?
Can the vendor support (help-desk, updates, bug fixes, on-going customizations, future implementations) the software beyond the initial implementation?
Does the vendor offer a blended model of support where we can take on some of the support?
Will you be required to make significant changes in your business practice to fit the COTS software?
Does the vendor have similar market examples of customers who have successfully deployed their application? Does any single customer make up the majority of their annual revenue?

Cost :

Why reinvent the wheel? Using tools readily available in the commercial marketplace can provide significant savings in the time and effort required to develop and maintain the numerical analysis and visualization foundation for your advanced end-user applications. Here’s an example:

Let’s say, for instance, that an algorithm developer’s salary is $60,000 per year. The average cost of employment, including benefits, adds an additional 20%, or $12,000. Thus, the fully burdened cost for a developer is $72,000.

On average, it takes three weeks to develop one numerical algorithm. It takes one week to test the algorithm, and an additional three weeks to document, maintain and port it. The total cost to develop one numerical algorithm comes to just over $9,500, or about 1/8 of your developer’s annual compensation. Your developer could produce just eight algorithms in one year.
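
The arithmetic in this example can be checked with a short sketch. The 52-week working year is an assumption of this sketch, not a figure from the source. With it, the cost per algorithm comes out a little under $9,700 and the yearly output falls between seven and eight algorithms.

```python
salary = 60_000          # annual salary from the example above
overhead_rate = 0.20     # benefits and other employment costs
burdened_cost = salary * (1 + overhead_rate)

# Weeks to develop, test, and document/maintain/port one algorithm.
weeks_per_algorithm = 3 + 1 + 3

WEEKS_PER_YEAR = 52      # assumption: no allowance for holidays or leave

cost_per_algorithm = burdened_cost * weeks_per_algorithm / WEEKS_PER_YEAR
algorithms_per_year = WEEKS_PER_YEAR / weeks_per_algorithm

print(f"Fully burdened cost: ${burdened_cost:,.0f}")
print(f"Cost per algorithm:  ${cost_per_algorithm:,.0f}")
print(f"Algorithms per year: {algorithms_per_year:.1f}")
```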

The deciding factor in the Build vs. Buy argument is often how the decision will affect the bottom line. Software tools vendors help you realize significant savings because they have already done the work.

Conclusion :

Trends over the past five years indicate the days of large, self-contained software development organizations within corporations possessing the broad level of expertise and time required to adequately and efficiently produce and maintain every line of code in the company’s mission critical applications are far gone. While today’s engineer is skilled within his/her specialized discipline, the time-consuming process of developing the building blocks of an application can be alleviated by the use of commercially available software tools. In the end, however, everyone is being asked to do more work with fewer resources.

The late Dick Benson, a top direct marketing consultant and author, summarized it best: “you should do only what you do best in-house, and outsource the rest.” This ability to focus on core competencies can mean the difference between a software project that is bogged down by non-application specific details, and a project that successfully hits milestones and makes it to completion.

“Everybody knows that the more standardized you are and the more you buy off-the-shelf, the more cost effective it will be for both implementation and ongoing maintenance,” says Mark Lutchen, former global CIO of PricewaterhouseCoopers, now head of the firm’s IT Effectiveness practice.

The best solution may very well be a combination of COTS software that has the controls and foundation in place for rapid deployment, while allowing custom configuration to meet the specific needs of a company. A sharing of responsibilities between a software provider and a company who licenses that software can ensure a successful solution for years to come. The software provider takes on the responsibility of maintaining the core code development, testing, and training, while the company can focus on the configuration of the software that provides unique value to them.

Most IT execs say they evaluate commercial software first, particularly when time-to-market and money are top priorities. The rule of thumb is to buy applications to the maximum extent possible to cut costs — freeing up resources for whatever really needs to be built in-house.

Original Source : 3CSoftware.Com

Sunday, October 12, 2014

Apache Hadoop & HDFS Overview

#Wikipedia
Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.

#MapR
Apache Hadoop™ was born out of a need to process an avalanche of Big Data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. In order to cope, Google invented a new style of data processing known as MapReduce. A year after Google published a white paper describing the MapReduce framework, Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts to an open-source software framework to support distribution for the Nutch search engine project. Given the original case, Hadoop was designed with a simple write-once storage infrastructure.

Apache Hadoop includes a Distributed File System (HDFS), which breaks up input data and stores data on the compute nodes. This makes it possible for data to be processed in parallel using all of the machines in the cluster. The Apache Hadoop Distributed File System is written in Java and runs on different operating systems.

#IBM
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

#Cloudera
Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes of data.

Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where more and more data is being created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.

#Hortonworks

Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.

Enterprise Hadoop: The Ecosystem of Projects
Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop. Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles.

#Apache.Org
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
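
To give a feel for the MapReduce programming model mentioned above, here is a minimal single-process word count in plain Python. It is only a sketch of the map, shuffle and reduce steps. It does not use the Hadoop API, and all function names are invented for this example.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(shuffle(pairs))


counts = word_count(["Hadoop stores data", "Hadoop processes data in parallel"])
print(counts)
```

In a real cluster the map and reduce steps run on many machines at once, and the framework handles the shuffle between them.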

...

HDFS Architecture Guide

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.

NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
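To make the block and replica idea concrete, here is a minimal sketch in Python. It is a toy model, not HDFS code: the function names, the round-robin placement, and the sample sizes are assumptions made for illustration. Real HDFS chooses replica locations with its own placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file would occupy.

    All blocks are block_size bytes except possibly the last one.
    """
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Toy round-robin placement (an assumption, not the HDFS policy).

    Each block is assigned to `replication` distinct DataNodes.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 300-byte file with a 128-byte block size, replicated 3 times on 4 nodes.
blocks = split_into_blocks(300, 128)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)
```

The block-to-DataNode mapping built here is the kind of bookkeeping the NameNode keeps; the DataNodes only hold the block contents.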

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
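The Heartbeat and Blockreport bookkeeping can be sketched roughly as follows. The class name `ToyNameNode`, the timeout value, and the method names are hypothetical; this only illustrates how missing heartbeats could lead to blocks being flagged for re-replication, not how the real NameNode is implemented.

```python
class ToyNameNode:
    """Hypothetical, simplified model of Heartbeat/Blockreport tracking."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds without a heartbeat before a node counts as dead
        self.last_heartbeat = {}    # node name -> time of last heartbeat
        self.block_reports = {}     # node name -> set of block ids it reported

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def block_report(self, node, blocks):
        self.block_reports[node] = set(blocks)

    def live_nodes(self, now):
        return sorted(n for n, t in self.last_heartbeat.items()
                      if now - t <= self.timeout)

    def under_replicated(self, now, replication):
        """Blocks with fewer than `replication` copies on live nodes."""
        live = set(self.live_nodes(now))
        counts = {}
        for node, blocks in self.block_reports.items():
            for block in blocks:
                counts.setdefault(block, 0)
                if node in live:
                    counts[block] += 1
        return sorted(b for b, c in counts.items() if c < replication)


nn = ToyNameNode(timeout=10)
nn.heartbeat("dn1", now=0)
nn.heartbeat("dn2", now=0)
nn.block_report("dn1", ["a", "b"])
nn.block_report("dn2", ["a"])
```

Once a DataNode stops sending heartbeats, its copies no longer count, so blocks it held can drop below the target replication factor and need new replicas.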

Original Source : Apache.org

Thursday, October 9, 2014

Overview About Big Data and Hadoop - Part 2

This post continues the previous one. Still drawing on The Executive's Guide To Big Data & Apache Hadoop by Robert D. Schneider, it tells the story behind Hadoop and ends with things to look for when evaluating Hadoop technology.

Distributed Processing Methodologies

In the past, organizations that wanted to work with large information sets would have needed to:

Acquire very powerful servers, each sporting very fast processors and lots of memory
Stage massive amounts of high-end, often-proprietary storage
License an expensive operating system, a RDBMS, business intelligence, and other software
Hire highly skilled consultants to make all of this work
Budget lots of time and money

Fortunately, several distinct but interrelated technology industry trends have made it possible to apply fresh strategies to work with all this information:

Commodity hardware
Distributed file systems
Open source operating systems, databases, and other infrastructure
Significantly cheaper storage
Widespread adoption of interoperable Application Programming Interfaces (APIs)

Today, there’s an intriguing collection of powerful distributed processing methodologies to help derive value from Big Data.

In a nutshell, these distributed processing methodologies are constructed on the proven foundation of ‘Divide and Conquer’: it’s much faster to break a massive task into smaller chunks and process them in parallel. There’s a long history of this style of computing, dating all the way back to functional programming paradigms like LISP in the 1960s.

Given how much information it must manage, Google has long been heavily reliant on these tactics. In 2004, Google published a white paper that described their thinking on parallel processing of large quantities of data, which they labeled “MapReduce”. The white paper was conceptual in that it didn’t spell out the implementation technologies per se. Google summed up MapReduce as follows:

“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
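The map and reduce functions in that definition can be illustrated with the classic word-count example. This is a single-process sketch in Python, not Hadoop's or Google's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are names invented here, and the grouping step stands in for the shuffle a real framework performs across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: turn one input record into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run map, group intermediate values by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "big data big ideas"), (1, "data beats ideas")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Because each call to `map_fn` looks at only one record and each call to `reduce_fn` at only one key, a framework is free to run many of them in parallel on different machines.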

MapReduce was proven to be one of the most effective techniques for conducting batch-based analytics on the gargantuan amounts of raw data generated by web search and crawling before organizations expanded their use of MapReduce to additional scenarios.

Rather than referring to a single tactic, MapReduce is actually a collection of complementary processes and strategies that begins by pairing commoditized hardware and software with specialized underlying file systems. Computational tasks are then directly performed on the data wherever it happens to reside, rather than the previous practices of first copying and aggregating raw data into a single repository before processing it. These older practices simply won’t scale when the amount of data expands beyond terabytes. Instead, MapReduce’s innovative thinking means that rather than laboriously moving huge volumes of raw data across a network, only code is sent over the network.
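The "send the code to the data" idea can be pictured with a small Python sketch. The node names and the `run_where_data_lives` helper are hypothetical, and everything runs in one process here; the point is only that each node computes on its own partition and returns a small result instead of shipping its raw rows.

```python
def run_where_data_lives(partitions, task):
    """Apply `task` to each node's local rows; only the small results travel."""
    return {node: task(rows) for node, rows in partitions.items()}


# Each node holds its own slice of the raw data.
partitions = {"node-a": [3, 5, 8], "node-b": [1, 2], "node-c": [10]}

# The task (here, the built-in sum) is what gets shipped to every node.
partial_sums = run_where_data_lives(partitions, sum)
total = sum(partial_sums.values())
```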

MapReduce was, and continues to be, a superb strategy for the problem that it was originally designed to solve: how to conduct batch analysis on the massive quantities of data generated by users running searches and visiting web sites. The concepts behind MapReduce have also served as the inspiration for an ever-expanding collection of novel parallel processing computational frameworks aimed at a variety of use cases, such as streaming analysis, interactive querying, integrating SQL with machine learning, and so on. While not all of these new approaches will achieve the same level of traction as the popular and still-growing batch-oriented MapReduce, many are being used to solve interesting challenges and drive new applications.

Conveniently, each of these methodologies shields software developers from the thorny challenges of distributed, parallel processing. But as Robert D. Schneider described earlier, Google’s MapReduce paper didn’t dictate exactly what technologies should be used to implement its architecture. This means that unless you worked for Google, it’s unlikely that you had the time, money, or people to design, develop, and maintain your own, site-specific set of all of the necessary components for systems of this sophistication. After all, it’s doubtful that you built your own proprietary operating system, relational database management system, or Web server.

Thus, there was a need for a complete, standardized, end-to-end solution suitable for enterprises seeking to apply the full assortment of modern, distributed processing techniques to help extract value from reams of Big Data. This is where Hadoop comes in.

Hadoop

Around the same time that Google was publishing the MapReduce paper, two engineers - Doug Cutting and Mike Cafarella - were busily working on their own web crawling technology named Nutch. After reading Google’s research, they quickly adjusted their efforts and set out to create the foundations of what would later be known as Hadoop. Eventually, Cutting joined Yahoo! where the Hadoop technology was expanded further. As Hadoop grew in sophistication, Yahoo! extended its usage into additional internal applications. In early 2008, the Apache Software Foundation (ASF) promoted Hadoop into a top-level open source project.

Simply stated, Hadoop is a comprehensive software platform that executes distributed data processing techniques. It’s implemented in several distinct, specialized modules:

Storage, principally employing the Hadoop Distributed File System (HDFS), although other more robust alternatives are available as well
Resource management and scheduling for computational tasks
Distributed processing programming model based on MapReduce
Common utilities and software libraries necessary for the entire Hadoop platform

Hadoop has broad applicability across all industries.

Enterprises have responded enthusiastically to Hadoop. Table 3 below illustrates just a few examples of how Hadoop is being used in production today.

Selecting your Hadoop infrastructure is a vital IT decision that will affect the entire organization for years to come, in ways that you can’t visualize now. This is particularly true since we’re only at the dawn of Big Data in the enterprise. Hadoop is no longer an “esoteric”, lab-oriented technology; instead, it’s becoming mainline, it’s continually evolving, and it must be integrated into your enterprise. Selecting a Hadoop implementation requires the same level of attention and devotion as your organization expends when choosing other critical core technologies, such as application servers, storage, and databases. You can expect your Hadoop environment to be subject to the same requirements as the rest of your IT asset portfolio, including:

Service Level Agreements (SLAs)
Data protection
Security
Integration with other applications

Checklist: Ten Things to Look for When Evaluating Hadoop Technology

1. Look for solutions that support open source and ecosystem components that support Hadoop APIs. It’s wise to make sure APIs are open to avoid lock-in.

2. Interoperate with existing applications. One way to magnify the potential of your Big Data efforts is to enable your full portfolio of enterprise applications to work with all of the information you’re storing in Hadoop.

3. Examine the ease of migrating data into and out of Hadoop. By mounting your Hadoop cluster as an NFS volume, applications can load data directly into Hadoop and then gain real-time access to Hadoop’s results. This approach also increases usability by supporting multiple concurrent random access readers and writers.

4. Use the same hardware for OLTP and analytics. It’s rare for an organization to maintain duplicate hardware and storage environments for different tasks. This requires a high-performance, low-latency solution that doesn’t get bogged down with time-consuming tasks such as garbage collection or compactions. Reducing the overhead of the disk footprint and related I/O tasks helps speed things up and increases the likelihood of efficient execution of different types of processes on the same servers.

5. Focus on scalability. In its early days, Hadoop was primarily used for offline analysis. Although this was an important responsibility, instant responses weren’t generally viewed as essential. Since Hadoop is now driving many more types of use cases, today’s Hadoop workloads are highly variable. This means that your platform must be capable of gracefully and transparently allocating additional resources on an as-needed basis without imposing excessive administrative and operational burdens.

6. Ability to provide real-time insights on newly loaded data. Hadoop’s original use case was to crawl and index the Web. Today, when properly implemented, Hadoop can deliver instantaneous understanding of live data, but only if fresh information is immediately available for analysis.

7. A completely integrated solution. Your database architects, operations staff, and developers should focus on their primary tasks, instead of trying to install, configure, and maintain all of the components in the Hadoop ecosystem.

8. Safeguard data via multiple techniques. Your Hadoop platform should facilitate duplicating both data and metadata across multiple servers using practices such as replication and mirroring. In the event of an outage on a particular node you should be able to immediately recover data from where it has been replicated in the cluster. This not only fosters business continuity, it also presents the option of offering read-only access to information that’s been replicated to other nodes. Snapshots - which should be available for both files and tables - provide point-in-time recovery capabilities in the event of a user or application error.

9. Offer high availability. Hadoop is now a critical enterprise technology infrastructure. Like other enterprise-wide fundamental software assets, it should be possible to upgrade your Hadoop environment without shutting it down. Furthermore, your core Hadoop system should be isolated from user tasks so that runaway jobs can’t degrade or even bring down your entire cluster.

10. Complete administrative tooling and comprehensive security. It should be easy for your operational staff to maintain your Hadoop landscape, with minimal amounts of manual procedures. Self-tuning is an excellent way that a given Hadoop environment can reduce administrative overhead, and it should also be easy for you to incorporate your existing security infrastructure into Hadoop.

Wednesday, October 8, 2014

Overview About Big Data and Hadoop - Part 1

Big Data has come to Indonesia. One of my customers asked about it. I don't know whether the need is real or only a passing excitement from abroad. Still, I think I need to improve my knowledge of Big Data and Hadoop. I found the book The Executive's Guide To Big Data & Apache Hadoop by Robert D. Schneider very good and insightful. It contains everything you need to understand and get started with Big Data and Hadoop. You can find the eBook for free via Google, but if you want something more condensed, please read my summary below.

Introducing Big Data

Big Data has the potential to transform the way you run your organization. When used properly it will create new insights and more effective ways of doing business, such as:

How you design and deliver your products to the market
How your customers find and interact with you
Your competitive strengths and weaknesses
Procedures you can put to work to boost the bottom line

What Turns Plain Old Data into Big Data?

From Robert D. Schneider's perspective, organizations that are actively working with Big Data have each of the following five traits in comparison to those who don’t:

Larger amounts of information
More types of data
Data that’s generated by more sources
Data that’s retained for longer periods
Data that’s utilized by more types of applications

1. Larger Amounts of Information
Enterprises are capturing, storing, managing, and using more data than ever before. Generally, these events aren’t confined to a single organization; they’re happening everywhere:
On average over 500 million Tweets occur every day
World-wide there are over 1.1 million credit card transactions every second
There are almost 40,000 ad auctions per second on Google AdWords
On average 4.5 billion “likes” occur on Facebook every day
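These figures mix per-day and per-second units. As a rough aid, the short Python snippet below puts them on a common scale. It uses only the numbers quoted above and assumes an even spread over a day of 86,400 seconds; the variable names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Per-day figures converted to per-second averages.
tweets_per_second = round(500_000_000 / SECONDS_PER_DAY)
likes_per_second = round(4_500_000_000 / SECONDS_PER_DAY)

# Per-second figures converted to per-day totals.
card_transactions_per_day = 1_100_000 * SECONDS_PER_DAY
ad_auctions_per_day = 40_000 * SECONDS_PER_DAY
```

On that basis the Tweet figure works out to several thousand per second, and the credit card figure to tens of billions of transactions per day.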

Comparing Database Sizes

2. More Types of Data
Structured data – regularly generated by enterprise applications and amassed in relational databases – is usually clearly defined and straightforward to work with. On the other hand, enterprises are now interacting with enormous amounts of unstructured – or semi-structured – information, such as:

Clickstreams and logs from websites
Photos
Video
Audio
XML documents
Freeform blocks of text such as email messages, Tweets, and product reviews

3. Generated by More Sources
Enterprise applications continue to produce transactional and web data, but there are many new conduits for generating information, including:

Smartphones
Medical devices
Sensors
GPS location data
Machine-to-machine, streaming communication

4. Retained for Longer Periods
Government regulations, industry standards, company policies, and user expectations are all contributing to enterprises keeping their data for lengthier amounts of time. Many IT leaders also recognize that there are likely to be future use cases that will be able to profit from historical information, so carelessly throwing data away isn’t a sound business strategy. However, hoarding vast and continually growing amounts of information in core application storage is prohibitively expensive. Instead, migrating information to Hadoop is significantly less costly, plus Hadoop is capable of handling a much bigger variety of data.

5. Utilized by More Types of Applications
Faced with a flood of new information, many enterprises are following a “grab the data first, and then figure out what to do with it later” approach. This means that there are countless new applications being developed to work with all of this diverse information. Such new applications are widely varied, yet must satisfy requirements such as bigger transaction loads, faster speeds, and enormous workload variability.

Big Data is also shaking up the analytics landscape. Structured data analysis has historically been the prime player, since it works well with traditional relational database-hosted information. However, driven by Big Data, unstructured information analysis is quickly becoming equally important. Several new techniques work with data from manifold sources such as:

Blogs
Facebook
Twitter
Web traffic logs
Text messages
Yelp reviews
Support desk calls
Call center calls

Implications of Not Handling Big Data Properly

Failing to keep pace with the immense data volumes, mushrooming number of information sources and categories, longer data retention periods, and expanding suite of data-hungry applications has impeded many Big Data plans, and is resulting in:

Delayed or faulty insights
An inability to detect and manage risk
Diminished revenue
Increased cost
Opportunity costs of missing new applications along with operational use of data
A weakened competitive position

Checklist: How to Tell When Big Data Has Arrived

1. You’re getting overwhelmed with raw data from mobile or medical devices, sensors, and/or machine-to-machine communications. Additionally, it’s likely that you’re so busy simply capturing this data that you haven’t yet found a good use for it.

2. You belatedly discover that people are having conversations about your company on Twitter. Sadly, not all of this dialogue is positive.

3. You’re keeping track of a lot more valued information from many more sources, for longer periods of time. You realize that maintaining such extensive amounts of historical data might present new opportunities for deeper awareness into your business.

4. You have lots of silos of data, but can’t figure out how to use them together. You may already be deriving some advantages from limited, standalone analysis, but you know that the whole is greater than the sum of the parts.

5. Your internal users – such as data analysts – are clamoring for new solutions to interact with all this data. They may already be using one-off analysis tools such as spreadsheets, but these ad-hoc approaches don’t go nearly far enough.

6. Your organization seeks to make real-time business decisions based on newly acquired information. These determinations have the potential to significantly impact daily operations.

7. You’ve heard rumors (or read articles) about how your competitors are using Big Data to gain an edge, and you fear being left behind.

8. You’re buying lots of additional storage each year. These supplementary resources are expensive, yet you’re not putting all of this extra data to work.

9. You’ve implemented – either willingly or by necessity – new information management technologies, often from startups or other cutting-edge vendors. However, many of these new solutions are operating in isolation from the rest of your IT portfolio.

Click Here For Part 2

Tuesday, October 7, 2014

RAID Overview and Types

If you've ever looked into purchasing a NAS device or server, particularly for a small business, you've no doubt come across the term "RAID." RAID stands for Redundant Array of Inexpensive (or sometimes "Independent") Disks. In general, a RAID-enabled system uses two or more hard disks to improve the performance or provide some level of fault tolerance for a machine—typically a NAS or server. Fault tolerance simply means providing a safety net for failed hardware by ensuring that the machine with the failed component, usually a hard drive, can still operate. Fault tolerance lessens interruptions in productivity, and it also decreases the chance of data loss.

The way in which you configure that fault tolerance depends on the RAID level you set up. RAID levels depend on how many disks you have in a storage device, how critical drive failover and recovery is to your data needs, and how important it is to maximize performance. A business will generally find it more urgent to keep data intact in case of hardware failure than, for example, a home user will. Different RAID levels represent different configurations aimed at providing different balances between performance optimization and data protection.

RAID Overview
RAID is traditionally implemented in businesses and organizations where disk fault tolerance and optimized performance are must-haves, not luxuries. Servers and NASes in business datacenters typically have a RAID controller—a piece of hardware that controls the array of disks. These systems feature multiple SSD or SATA drives, depending on the RAID configuration. Because of the increased storage demands of consumers, home NAS devices also support RAID. Home, prosumer, and small business NASes are increasingly shipping with two or more disk drive bays so that users can leverage the power of RAID just like an enterprise can.

Software RAID means you can set up RAID without a dedicated hardware RAID controller. The RAID capability is built into the operating system. Windows 8's Storage Spaces feature and Windows 7 (Pro and Ultimate editions) have built-in support for RAID. You can set up a single disk with two partitions, one to boot from and the other for data storage, and have the data partition mirrored.

This type of RAID is available in other operating systems as well, including OS X Server, Linux, and Windows Servers. Since this type of RAID already comes as a feature in the OS, the price can't be beat. Software RAID can also comprise virtual RAID solutions offered by vendors such as Dot Hill to deliver powerful host-based virtual RAID adapters. That's a solution more tailored to enterprise networks, however.

Which RAID Is Right for Me?
As mentioned, there are several RAID levels, and the one you choose depends on whether you are using RAID for performance or fault tolerance (or both). It also matters whether you have hardware or software RAID, because software supports fewer levels than hardware-based RAID. In the case of hardware RAID, the type of controller you have matters, too. Different controllers support different levels of RAID and also dictate the kinds of disks you can use in an array: SAS, SATA or SSD.

Here's the rundown on popular RAID levels:

RAID 0 is used to boost a server's performance. It's also known as "disk striping." With RAID 0, data is written across multiple disks. This means the work that the computer is doing is handled by multiple disks rather than just one, increasing performance because multiple drives are reading and writing data, improving disk I/O. A minimum of two disks is required. Both software and hardware RAID support RAID 0, as do most controllers. The downside is that there is no fault tolerance. If one disk fails, the entire array is affected and the chances of data loss or corruption increase.
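Striping can be sketched in a few lines of Python. This is a toy model with byte strings standing in for disks. The `stripe` and `unstripe` helpers and the 2-byte stripe size are assumptions for illustration, not how any particular controller lays out data.

```python
def stripe(data, num_disks, stripe_size):
    """Distribute consecutive chunks of `data` round-robin across the disks."""
    disks = [bytearray() for _ in range(num_disks)]
    for start in range(0, len(data), stripe_size):
        chunk_index = start // stripe_size
        disks[chunk_index % num_disks] += data[start:start + stripe_size]
    return [bytes(d) for d in disks]


def unstripe(disks, stripe_size, total_len):
    """Reassemble the original data by reading the disks in round-robin order."""
    out = bytearray()
    offsets = [0] * len(disks)
    disk = 0
    while len(out) < total_len:
        out += disks[disk][offsets[disk]:offsets[disk] + stripe_size]
        offsets[disk] += stripe_size
        disk = (disk + 1) % len(disks)
    return bytes(out)


disks = stripe(b"ABCDEFGHIJ", num_disks=2, stripe_size=2)
restored = unstripe(disks, stripe_size=2, total_len=10)
```

Each disk ends up with only part of the file, which is why reads and writes can proceed in parallel and also why losing any one disk loses the file.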

RAID 1 is a fault-tolerance configuration known as "disk mirroring." With RAID 1, data is copied seamlessly and simultaneously, from one disk to another, creating a replica, or mirror. If one disk gets fried, the other can keep working. It's the simplest way to implement fault tolerance and it's relatively low cost.

The downside is that RAID 1 causes a slight drag on performance. RAID 1 can be implemented through either software or hardware. A minimum of two disks is required for RAID 1 hardware implementations. With software RAID 1, instead of two physical disks, data can be mirrored between volumes on a single disk. One additional point to remember is that RAID 1 cuts total disk capacity in half: If a server with two 1TB drives is configured with RAID 1, then total storage capacity will be 1TB not 2TB.

RAID 5 is by far the most common RAID configuration for business servers and enterprise NAS devices. This RAID level provides better performance than mirroring as well as fault tolerance. With RAID 5, data and parity (which is additional data used for recovery) are striped across three or more disks. If a disk gets an error or starts to fail, data is recreated from this distributed data and parity block, seamlessly and automatically. Essentially, the system is still operational even when one disk kicks the bucket and until you can replace the failed drive. Another benefit of RAID 5 is that it allows many NAS and server drives to be "hot-swappable": if a drive in the array fails, it can be swapped for a new drive without shutting down the server or NAS and without interrupting users who may be accessing it. It's a great solution for fault tolerance because as drives fail (and they eventually will), the data can be rebuilt to new disks as failing disks are replaced. The downside to RAID 5 is the performance hit to servers that perform a lot of write operations. For example, with RAID 5 on a server that has a database that many employees access in a workday, there could be noticeable lag.
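Parity in RAID 5 is typically computed with a bitwise XOR, which is what makes rebuilding a lost block possible. The Python sketch below shows the idea on three tiny data blocks. The block contents and the `xor_blocks` helper are made up for illustration, and the rotation of parity across disks is left out.

```python
def xor_blocks(*blocks):
    """XOR equally sized blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Three data blocks on three disks, with the parity block stored on a fourth.
d0, d1, d2 = b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"
parity = xor_blocks(d0, d1, d2)

# If the disk holding d1 fails, XOR the survivors with the parity to rebuild it.
rebuilt_d1 = xor_blocks(d0, d2, parity)
```

Every write has to update the parity as well as the data, which is where the write penalty mentioned above comes from.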

RAID 6 is also used frequently in enterprises. It's identical to RAID 5, except it's an even more robust solution because it uses one more parity block than RAID 5. You can have two disks die and still have a system be operational.

RAID 10 is a combination of RAID 1 and 0 and is often denoted as RAID 1+0. It combines the mirroring of RAID 1 with the striping of RAID 0. It's the RAID level that gives the best performance, but it is also costly, requiring twice as many disks as other RAID levels, for a minimum of four. This is the RAID level ideal for highly utilized database servers or any server that's performing many write operations. RAID 10 can be implemented as hardware or software, but the general consensus is that many of the performance advantages are lost when you use software RAID 10.
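The capacity trade-offs of these levels can be summarized with some simple arithmetic. The Python function below is a rough sketch assuming identical disks; `usable_capacity` is a name made up here, and the minimum of four disks for RAID 6 is the commonly cited figure rather than something stated above.

```python
def usable_capacity(level, num_disks, disk_tb):
    """Usable capacity in TB for an array of identical disks (simplified)."""
    minimum_disks = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimum_disks:
        raise ValueError(f"unsupported RAID level: {level}")
    if num_disks < minimum_disks[level]:
        raise ValueError(f"RAID {level} needs at least {minimum_disks[level]} disks")
    if level == 0:
        return num_disks * disk_tb          # striping only, no redundancy
    if level == 1:
        return disk_tb                      # every disk holds the same copy
    if level == 5:
        return (num_disks - 1) * disk_tb    # one disk's worth of parity
    if level == 6:
        return (num_disks - 2) * disk_tb    # two disks' worth of parity
    if num_disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    return num_disks // 2 * disk_tb         # mirrored pairs, then striped


# Two 1TB drives in RAID 1 give 1TB, matching the example above.
raid1_tb = usable_capacity(1, 2, 1)
```

For instance, four 2TB disks would give 6TB in RAID 5, 4TB in RAID 6, and 4TB in RAID 10 under these assumptions.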

Source 1 : PC Mag
Source 2 : Wikipedia