Monday, May 26, 2014

5 Steps to Avoid Drowning in Unstructured Data Analysis

On average, companies are reporting more than a 40% annual growth in the data they use for analysis, according to a recent research report from Aberdeen Group.

Much of this data explosion represents unstructured data that can be difficult to format and evaluate via data analysis.

This includes unstructured data such as social media posts, recorded call center interactions between customers and agents, health records, and the bodies of email messages.

However, there are steps that businesses can take to improve how they go about gathering data, integrating data from multiple sources, and using data analysis techniques to manage the data explosion sensibly, as Glenda Nevill notes in a recent blog post.

These tips will also help companies more effectively use customer data and other data streams to improve operations, optimize marketing efforts, and drive better business performance.

Here’s a step-by-step approach to help data scientists tame the data beast:

  1. Classify unstructured data. Most corporate data environments are pretty chaotic. Word documents, email, PDFs, spreadsheets, and other data files are scattered across the enterprise. The good news is that most unstructured data is plain text, so it can be read, indexed, compressed, and stored fairly easily. Classifying unstructured data is the first step toward identifying unstructured data sources so that they can eventually be parsed and explored with data visualization tools.
  2. Set enforceable storage policies. Most data has a shelf life. New data is frequently accessed during its first 90 days of life, and usage tends to taper off after that. Because of these usage trends, data should be regularly examined for creation dates and most recent usage, and then discarded or archived according to data retirement policies enforced by the IT organization.
  3. Evaluate your BI infrastructure and adjust as needed. Before organizations begin analyzing unstructured data, it’s helpful to evaluate the current business intelligence (BI) infrastructure that’s in place and how it all fits together. It’s not always easy to create structured definitions of data that’s stored within non-traditional data sources. As such, the data management team should identify the steps that are needed to integrate unstructured data into a structured BI environment.
  4. Don’t overlook metadata. Making effective use of unstructured data requires an approach to organizing and cataloging content. In order to use the content, it’s helpful to know what that content is. Some systems automatically capture process-related metadata, or attributes such as creation date, author, title, etc. However, applying metadata to actual content such as content summaries, companies or people mentioned, or topic keywords can be considerably more useful.
  5. Apply unstructured data analysis. BI tools can’t analyze unstructured data directly. However, specialized data analysis technology can be used to analyze unstructured data as well as to produce a data model that BI tools can work with. Unstructured data analysis can start by using a natural language engine to measure keyword density. This approach, along with the use of metadata, can help data scientists and decision makers get at the heart of what key stakeholders are looking for using data discovery tools and techniques (e.g., positive or negative comments about a company in social media).
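
As a rough illustration of step 5, here is a minimal keyword-density sketch in Python. It is not any particular vendor's natural language engine; the keyword lists and sample comments are invented for the example.

    import re
    from collections import Counter

    # Hypothetical unstructured social media comments.
    comments = [
        "Love the new app update, support was great!",
        "Terrible checkout experience, the site keeps crashing.",
        "Great product, terrible delivery time.",
    ]

    # Assumed keyword lists; a real engine would use a much richer lexicon.
    POSITIVE = {"love", "great", "excellent", "happy"}
    NEGATIVE = {"terrible", "crashing", "bad", "slow"}

    def keyword_density(texts):
        """Return each word's share of the total word count."""
        words = [w for t in texts for w in re.findall(r"[a-z']+", t.lower())]
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    density = keyword_density(comments)
    pos = sum(density.get(w, 0.0) for w in POSITIVE)
    neg = sum(density.get(w, 0.0) for w in NEGATIVE)
    print(f"positive keyword density: {pos:.3f}, negative keyword density: {neg:.3f}")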

Source : Tibco

Sunday, May 25, 2014

Receiving the Reward Through the Testing of Faith - From: Pelita Hidup


“My brothers and sisters, count it all joy whenever you fall into trials of many kinds, for you know that the testing of your faith produces perseverance. And let perseverance bear its ripe fruit, so that you may be perfect and complete, lacking in nothing.” James 1:2-4


Living as a follower of Christ is not a smooth road without obstacles. Every one of His people will surely experience situations in which their faith is tested.

It is through this testing of our faith that we are brought toward maturity of faith, so that we become more and more like Jesus in every step of our lives.

Not infrequently, some of God's people grow impatient in facing these tests. They feel despair and are unwilling to keep holding firmly to their faith.

Yet God will never allow a trial beyond the strength we have. Every trial has a way out. And when we are willing to keep hoping in Jesus, the way out is there.

In the Bible there are several figures who went through trials that tested their faith. Let us look at why they were able to stand the test and receive victory from God.

1. Abraham
God gave Abraham a child of promise, Isaac, through his wife Sarah, whose womb had been closed because of old age. This child was of course Abraham's pride and his hope of obtaining God's promise to make him the father of many nations. Abraham had received God's promise that his descendants would be as numerous as the dust of the earth.

Yet God asked Abraham to offer up his son as a burnt offering. Humanly speaking, this was a crushing blow, because Isaac was his hope of obtaining God's promises. How could he possibly have many descendants if the child of promise had to be sacrificed?

Here Abraham showed the faithfulness of his faith in God. And by faith he did what God had commanded him.

At the very moment he was about to plunge the knife, God stopped him. God saw that Abraham did not hesitate to follow His command. Then God provided a ram to be offered in place of the sacrifice.

Abraham was blessed abundantly, and God fulfilled His Word by making his descendants as numerous as the dust of the earth. And we know Abraham today as the father of all who believe.

2. Joseph
Joseph had a dream in which all his brothers and his parents came and bowed down to him. Instead of seeing that dream become reality, Joseph was hated by his brothers. They even intended to kill him. They threw Joseph into a dry well, and not long afterward they sold him to the Midianites, who in turn sold him to Potiphar.

Even in that situation, Joseph kept doing the work set before him. God was with Joseph and made everything he did succeed.

As if it would never end, Joseph faced trial again. He was propositioned by Potiphar's wife. Joseph refused to be seduced, and instead he was slandered: he was thrown into prison, accused of trying to seduce Potiphar's wife.

Once again God remained with Joseph in prison and made him a favorite of the prison warden. Joseph went through many more events until the moment he was given the opportunity to explain the meaning of Pharaoh's dreams.

Joseph found favor in Pharaoh's eyes, and Pharaoh then appointed him ruler over the land of Egypt.

Here we can see that in everything Joseph went through, not once did he abandon God. That is why God was with Joseph and made everything he did succeed.

Not only that, in the years of famine his brothers came to Egypt to buy food. And in the end, as we all know, the dream he had been given more than 20 years earlier came true. Faith brought Joseph to what he had dreamed.

...

There are many more stories in the Bible that show us that God wants His people to remain faithful no matter what happens.

What we go through in this life is indeed not easy. It may be hard for us to understand what God is planning for our lives. There may be things we must sacrifice, like Abraham who had to offer up his son. There may be things we must endure, like Joseph who was betrayed by his own brothers, slandered, thrown into prison, and more. But there is one thing God wants us to keep doing: to keep walking by faith.

Never let go of our faith in Jesus. Never surrender to the circumstances we face. Keep walking step by step through life while guarding our faith in Jesus. God has prepared something great for us, and He will surely fulfill every Word He has promised us. Hallelujah!


Source: Pelita Hidup

Saturday, May 24, 2014

A Story About You and Me

One life, so many stories.
The twists and turns of a journey hard to understand.

The history of the past seems to die, now that I have found you, who mean so much.

A logic hard to put into words,
A feeling not easily translated.

Your embrace is like the sun,
Warming the soul and calming the heart.

Your smile is like the moon,
Lighting my steps and adorning my dreams.

Being with you is beauty and a gift.
I want you always here, in reality or in shadow.
I will never forget my story of the times spent with you.
The story of you and me will be kept, beautiful, forever.

Monday, May 19, 2014

5 Secrets About Earwax

As long as humans have ears, earwax will never be anything unfamiliar. Bodily waste such as feces and urine is usually considered disgusting, but who would have guessed that earwax holds secrets not everyone knows about.

In the past, earwax was used as lip balm and as an ointment for puncture wounds. As science and technology have advanced, research has shown that earwax can also help diagnose certain conditions in the human body.

Here are 5 secrets of earwax that not everyone knows, as quoted from the BBC, Monday (28/4/2014):

1. Earwax Comes Out on Its Own


The body actually has self-cleaning mechanisms, and the ear is no exception. So how does the ear clean its inner parts? Inside the ear canal there are many glands that produce a wax-like substance called cerumen, commonly known as earwax. Even without a cotton bud, this earwax can actually come out on its own.

The skin lining of the ear canal migrates from the eardrum toward the outer opening of the ear, and that is when earwax is carried out. Old earwax is transported from the deeper parts of the ear canal toward the outside, where it ends up as dry flakes.

If there is too much earwax and it blocks the ear canal, it can interfere with hearing. That is why earwax needs to be cleaned. But don't carelessly clean your ears with a cotton bud, because a cotton bud can actually push the earwax back inside and cause it to build up.

According to Prof Shakeel Saeed of London's Royal National Throat, Nose and Ear Hospital, normal jaw movement from eating and talking helps earwax work its way out. In addition, earwax generally becomes darker with age. In men whose ears have more hair, earwax can sometimes be harder to expel because it gets trapped in the 'forest' of hair inside the ear.

Earwax actually helps protect the ear from damage and infection, so it does not need to be cleaned too often. If you do want it cleaned, it is best to visit an ear, nose and throat (ENT) doctor.

2. It Has Antimicrobial Properties

Sticky, unpleasant-smelling earwax turns out to contain protective antibacterial substances. This oily or waxy ear secretion consists largely of dead skin cells, sweat, and fat, along with dust and dirt.

Between 1,000 and 2,000 glands produce antimicrobial peptides. The ear also has sebaceous glands that produce sebum, an oily or waxy substance that lubricates the skin and hair inside the ear. Sebum consists mainly of triglycerides, cholesterol, and an oily substance called squalene.

Earwax also contains lysozyme, an antibacterial enzyme. The acidic nature of earwax can also inhibit bacterial growth.

For the record, earwax production does not differ much between men and women, or between the young and the old. However, one small study found that the triglyceride content decreased from November through July.

3. It Can Show Where You Come From


Scientists at the Monell Institute in Philadelphia found that, just like sweat, the chemicals contained in earwax differ from one ethnic group to another. Chromosome 16 is home to the gene that determines whether earwax is wet or dry, with the wet variant being dominant.

The molecules that produce the smelliest earwax tend to be found more in Caucasians than in East Asians. One variant of ABCC11, usually found in people of East Asian descent, results in dry, white earwax and less body odor. Another variant of the gene, found mostly among people of African and European descent, results in wet, yellowish-brown earwax and is also more likely to cause body odor.

In studies linking body odor to disease, ABCC11 has been the connection between the two. For example, a 2009 study published in The FASEB Journal found that the gene variant that causes smelly armpits and wet earwax is also associated with an increased risk of breast cancer.

Dr Kate Prigge of Monell says their analysis of earwax odor is a first step toward finding out whether it could eventually be used to detect disease. It has also been learned that the rare genetic disorder maple syrup urine disease can likely be diagnosed easily from the aroma of compounds in earwax.

4. How to Clean Your Ears

The ear has its own self-cleaning mechanism. If you want to clean your ears, it is best to clean only the outer part, using cotton or a tissue. Doctors advise against cleaning the inside of the ear canal.

Using a cotton bud to clean the ear canal can actually push the wax further in. To remove earwax, doctors usually use a metal hook or a cerumen spoon. If the earwax is soft, a vacuum pump is used to suction it out; the tool is essentially a vacuum cleaner, only very small.

Another way to clean the ears is by syringing warm water into the ear canal. Sometimes this does not work because the earwax is too hard. In such cases the doctor will ask the patient to apply ear drops for a few days to make the wax easier to remove.

5. A Pollution Monitor

Earwax, like many other bodily secretions, can reveal traces of certain toxins in the body, such as heavy metals. That said, it is no more reliable than a simple blood test.

On the subject of earwax, there is a rather interesting scientific finding. Human earwax is expelled by the ear on its own, but blue whales retain their earwax, so it becomes a kind of record of the events of their lives. Researchers liken it to the growth rings in a tree trunk, which can reveal certain information.

Blue whale earwax was analyzed by Sascha Usenko, an environmental scientist at Baylor University in Waco, Texas. He and his team found that over its lifetime, a 12-year-old male whale had come into contact with 16 different pollutants. Pollutant exposure was found to be quite high in the whale's first year of life, presumably transferred from its mother while it was still in the womb or through her milk.

From this earwax, researchers can learn what needs to be done to protect blue whales from stress, pollution, and other threats in the future, because the wax shows fluctuations in testosterone and the stress hormone cortisol over the animal's lifetime. The male whale's testosterone levels in the wax also suggested that these mammals reach sexual maturity at around 10 years of age.


Source: Detik Health

Wednesday, May 14, 2014

Pareto (Principle - Diagram - Analysis - How To)


#Indonesian Wikipedia
The Pareto Principle

The Pareto principle (also known as the 80-20 rule[1]) states that, for many events, roughly 80% of the effects come from 20% of the causes. The principle was put forward by business-management thinker Joseph M. Juran, who named it after the Italian economist Vilfredo Pareto (15 July 1848 – 19 August 1923), who in 1906 observed that 80% of the income in Italy was held by 20% of the population.

In practice, the 80/20 principle can be applied to almost anything (a small sketch for checking the split in your own data follows this list):

  • 80% of customer complaints come from 20% of products or services.
  • 80% of schedule delays arise from 20% of the possible causes of delay.
  • 20% of products or services account for 80% of profits.
  • 20% of the sales force produces 80% of company revenue.
  • 20% of system defects cause 80% of the problems.
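
As a quick way to check how closely your own numbers follow this pattern, the short Python sketch below (with invented complaint counts) finds the smallest share of categories needed to cover 80% of the total effect.

    # Hypothetical complaint counts per product category.
    complaints = {"A": 420, "B": 130, "C": 60, "D": 25, "E": 15}

    total = sum(complaints.values())
    running, causes = 0, 0
    for category, count in sorted(complaints.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        causes += 1
        if running / total >= 0.80:
            break

    print(f"{causes / len(complaints):.0%} of categories account for "
          f"{running / total:.0%} of complaints")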



#English Wikipedia
Pareto Chart

A Pareto chart, named after Vilfredo Pareto, is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.

The left vertical axis is the frequency of occurrence, but it can alternatively represent cost or another important unit of measure. The right vertical axis is the cumulative percentage of the total number of occurrences, total cost, or total of the particular unit of measure. Because the values are in decreasing order, the cumulative function is concave. In the worked example from the original article, resolving just the first three causes of late arrivals would be enough to reduce lateness by 78%.

The purpose of the Pareto chart is to highlight the most important among a (typically large) set of factors. In quality control, it often represents the most common sources of defects, the highest occurring type of defect, or the most frequent reasons for customer complaints, and so on. Wilkinson (2006) devised an algorithm for producing statistically based acceptance limits (similar to confidence intervals) for each bar in the Pareto chart.

These charts can be generated by simple spreadsheet programs such as OpenOffice.org Calc and Microsoft Excel, by specialized statistical software tools, and by online quality chart generators.

The Pareto chart is one of the seven basic tools of quality control.
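
For readers who prefer scripting to spreadsheets, a Pareto chart can also be produced with a few lines of Python. The sketch below uses pandas and matplotlib with invented defect counts, sorting the values in descending order and plotting the cumulative percentage as a line on a secondary axis.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical defect counts by cause.
    defects = pd.Series({"Scratches": 59, "Dents": 24, "Misalignment": 12,
                         "Discoloration": 8, "Other": 5}).sort_values(ascending=False)
    cumulative_pct = defects.cumsum() / defects.sum() * 100

    fig, ax = plt.subplots()
    ax.bar(defects.index, defects.values, color="steelblue")   # individual values as bars
    ax.set_ylabel("Frequency")

    ax2 = ax.twinx()                                           # secondary axis for the cumulative line
    ax2.plot(defects.index, cumulative_pct.values, color="red", marker="o")
    ax2.set_ylim(0, 100)
    ax2.set_ylabel("Cumulative %")

    ax.set_title("Pareto chart (illustrative data)")
    fig.tight_layout()
    plt.show()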


#IlmuSDM Wordpress

What is a Pareto diagram? A Pareto diagram is a series of bar charts that depicts the frequency or impact of processes/conditions/problems. The bars are arranged from highest to lowest, from left to right, so a bar on the left is relatively more important than the ones to its right. The name of the Pareto diagram is taken from the Pareto principle, which says that 80% of disruption comes from 20% of the existing problems.

The Pareto diagram has long been used among quality management tools as a way to investigate problem data by breaking it down into categories, so that the frequency of each event/process can be determined. With a Pareto, you can bring a body of data into a form that is better organized and easier to read, so that conclusions can be drawn and task priorities set.


#WikiHow

Pareto Analysis is a simple technique for prioritizing potential causes by identifying the problems. The article gives instructions on how to create a Pareto chart using MS Excel 2010.

Steps :


1. Identify and List Problems. Make a list of all of the data elements/work items that you need to prioritize using the Pareto principle.

If you don't have data to practice with, use the sample data shown in the original article's screenshots (hair-fall reasons and their frequencies) and see if you arrive at the same Pareto chart.

2. Arrange the categories in descending order, in our case "Hair Fall Reason" sorted by "Frequency".


3. Add a column for Cumulative Frequency, using formulae similar to those shown in the original article (each row's cumulative frequency is the previous cumulative value plus that row's frequency).

Your table should now have columns for the reason, its frequency, and the cumulative frequency.

4. Calculate the total of the numbers shown in the Frequency column and add a column for Percentage.

Ensure the Total is the same as the last value in the Cumulative Frequency column.

Now your data table is complete and ready to create the Pareto chart.

5. Go to Insert-->Column and select the 2-D Column chart.



6. A blank Chart area should now appear on the Excel sheet. Right Click in the Chart area and Select Data. 



7. Select cells B1 to C9, then put a comma (,) and select cells E1 to E9.

This is one of the most important steps; take extra care to ensure the correct data range is being selected for the Pareto.

8. Your Pareto chart should now show Frequency as blue bars and Percentage as red bars.


9. Select one of the Percentage bars and right-click. Click "Change Series Chart Type" and change the series type to "Line with Markers".

The chart type selection screen should appear.

10. Now your chart shows the Percentage series as a line chart instead of bars.

11. Select and right-click the red Percentage line and click "Format Data Series".

The Format Data Series pop-up will open; select "Secondary Axis" there.

12. A secondary "Y" axis will appear.

The only problem with this Pareto chart is that the secondary Y-axis may show 120%; this needs to be corrected. You may or may not face this issue.

13. Select the secondary Y-axis, right-click, and click the "Format Axis" option.

Go to Axis Options in the Format Axis dialog box and change the value for "Maximum" to 1.0.

14. Your Pareto chart is complete.

However, you can still add some final touches to make it more appealing. Go to Chart Tools --> Layout, where you can add a Chart Title, Axis Titles, a Legend, and Data Tables if you want.

Saturday, May 10, 2014

The 6 People Every Startup Needs

There’s no magic bullet for startup success, but your team can often make or break it, says entrepreneur Bernd Schoner.

Schoner, who has a Ph.D. from MIT and was co-founder of RFID technologies startup ThingMagic, sold his company to Trimble Navigation in 2010 for an undisclosed sum.

ThingMagic had an original team of five co-founders. But by the time the company was acquired, Schoner says, only two remained, leading him to think more closely about team dynamics.

“There are certain roles that people assume in a typical tech company or startup that make sense and I think if you are careful about that, then your odds of success go up,” says Schoner. He is the author of the upcoming book ‘The Tech Entrepreneur’s Survival Guide.’

While some companies start out with just one or two employees, Schoner says there are six key personality types he believes make for a great team. Here is the recipe for his dream lineup:

No. 1: The prima donna genius

“I think it’s commonly accepted in a tech startup that you better have someone with technical knowledge,” says Schoner. “You want to have someone be able to lead the technical agenda of the team.”

No. 2: The leader


Typically the CEO, Schoner says it’s important to have one person calling the shots.

“For larger founder teams … It can get very tricky if there are five opinions and all have equal weight. Democracy is great, but not in a startup,” says Schoner. “The leader or CEO doesn’t always need to be right, but if [he or she] is a leader figure that others can look up to, then that’s a good thing.”

No. 3: Industry veteran

Schoner says this person is often missing in younger startup teams, but he stresses the importance of having someone on staff who has been around the block.

“Someone who really knows the industry can be of extreme value. They’re not just going for what’s cool or new. They really have the experience to understand what’s needed in a particular industry,” says Schoner.

No. 4: The sales animal

“The sales animal is the person who knows not just the technology, but also knows how to sell it to a customer,” says Schoner, who says young technologists often miss the value of having a sales expert on-board. “When we are trying to have someone pay money, it’s not about the technology – it’s about the value we can provide to the customer.”

No. 5: The superstar

The superstar may also be the tech genius or the CEO, says Schoner, but he or she is the person who can rally people around the company.

“That’s the guy to build your marketing strategy around. He’s the guy you want to send to conferences and industry meetings,” says Schoner.

No. 6: The financial guru

Especially when you haven’t yet brought a product to market, it’s important to keep track of costs.

“Having the financial personality is important. You want someone who has enough ability to handle numbers,” says Schoner. However, if you’re not yet at the stage where you can support a staff of six, Schoner says you can leave this person off the list – as long as someone else on the team has a good sense of money management.


Original Source : Entrepreneur

Thursday, May 8, 2014

How Do I Build a Business Plan? (Infographic)

You have a powerful idea for the next big thing, but before you sell it to anyone, you have to get it all down on paper.  It’s time to make a business plan.

How do you know if you’re headed in the right direction? Washington State University created an infographic that provides 10 guidelines to help prospective entrepreneurs organize their thoughts and wow potential investors.

The infographic details some major questions that aspiring CEOs need to ask themselves, like: What problem is my business going to solve? What’s my company’s mission? And what do we do better than anyone else in the market?

But you aren’t quite done yet. A thorough business plan also identifies your target demographic, describes the conditions of the market you’re entering, and accounts for worst-case scenarios. And of course, there’s the money: how much you need to get going, and where it’s going to come from once your business is up and running.

For more information, like how much funding you’ll need before applying for a small business loan (that’s 30 percent), check out the infographic below.

(Infographic from Washington State University: 10 guidelines for building a business plan.)

Wednesday, May 7, 2014

ETL - Data Cleansing Overview & Definition

Wikipedia
Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data.

After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.

Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.

The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

Some data cleansing solutions will clean data by cross checking with a validated data set. Also data enhancement, where data is made more complete by adding related information, is a common data cleansing practice. For example, appending addresses with phone numbers related to that address.

Data cleansing may also involve activities such as harmonization of data and standardization of data. For example, harmonization of short codes (St, Rd, etc.) to actual words (Street, Road). Standardization of data is a means of changing a reference data set to a new standard, for example, the use of standard codes.
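
A minimal Python sketch of that kind of short-code harmonization; the lookup table is an assumption for illustration, and real cleansing tools rely on far richer reference data.

    import re

    # Assumed lookup table of common short codes and their standard words.
    SHORT_CODES = {"st": "Street", "rd": "Road", "ave": "Avenue", "blvd": "Boulevard"}

    def harmonize_address(address: str) -> str:
        """Replace known short codes with their full words, one token at a time."""
        tokens = []
        for token in address.split():
            key = re.sub(r"[.,]", "", token).lower()
            tokens.append(SHORT_CODES.get(key, token))
        return " ".join(tokens)

    print(harmonize_address("12 Main St."))   # 12 Main Street
    print(harmonize_address("7 Park rd"))     # 7 Park Road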


#Data Quality
High-quality data needs to pass a set of quality criteria. Those include:

  • Validity: The degree to which the measures conform to defined business rules or constraints (see also Validity (statistics)). When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used (e.g., spreadsheets, where it is very hard to limit what a user chooses to enter into a cell). Data constraints fall into the following categories (a small validation sketch follows this list):
  • Data-Type Constraints – e.g., values in a particular column must be of a particular datatype, e.g., Boolean, numeric (integer or real), date, etc.
  • Range Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.
  • Mandatory Constraints: Certain columns cannot be empty.
  • Unique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.
  • Set-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person's gender may be Female, Male or Unknown (not recorded).
  • Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the "state" column is required to belong to one of the US's defined states or territories: the set of permissible states/territories is recorded in a separate States table. The term foreign key is borrowed from relational database terminology.
  • Regular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.
  • Decleansing is detecting errors and syntactically removing them for better programming.
  • Cross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient's date of discharge from hospital cannot be earlier than the date of admission.
  • Accuracy: The degree of conformity of a measure to a standard or a true value - see also Accuracy and precision. Accuracy is very hard to achieve through data-cleansing in the general case, because it requires accessing an external source of data that contains the true value: such "gold standard" data is often unavailable. Accuracy has been achieved in some cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state), and also help verify that street addresses within these zip codes actually exist.
  • Completeness: The degree to which all required measures are known (see also Completeness). Incompleteness is almost impossible to fix with data cleansing methodology: one cannot infer facts that were not captured when the data in question was initially recorded. (In some contexts, e.g., interview data, it may be possible to fix incompleteness by going back to the original source of the data, i.e., re-interviewing the subject, but even this does not guarantee success because of problems of recall; in an interview to gather data on food consumption, for example, no one is likely to remember exactly what they ate six months ago.) In the case of systems that insist certain columns should not be empty, one may work around the problem by designating a value that indicates "unknown" or "missing", but supplying default values does not imply that the data has been made complete.
  • Consistency: The degree to which a set of measures are equivalent across systems (see also Consistency). Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies - e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).
  • Uniformity: The degree to which a set of data measures are specified using the same units of measure in all systems (see also Unit of measure). In datasets pooled from different locales, weight may be recorded either in pounds or kilos, and must be converted to a single measure using an arithmetic transformation.
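
As a small illustration of how several of the constraint types above can be checked in practice, the Python sketch below uses pandas on an invented customer table; the column names, rules, and phone pattern are assumptions for the example rather than part of the original article.

    import pandas as pd

    # Hypothetical customer records with deliberate quality problems.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "age":         [34, -5, 51, None],
        "gender":      ["Female", "Male", "Robot", "Unknown"],
        "phone":       ["(555) 123-4567", "5551234", "(555) 987-6543", None],
    })

    issues = {
        # Mandatory constraint: age must not be empty.
        "missing_age": df["age"].isna(),
        # Range constraint: age must fall between 0 and 120.
        "age_out_of_range": ~df["age"].between(0, 120),
        # Unique constraint: customer_id must not repeat.
        "duplicate_id": df["customer_id"].duplicated(keep=False),
        # Set-membership constraint: gender limited to known codes.
        "invalid_gender": ~df["gender"].isin({"Female", "Male", "Unknown"}),
        # Regular-expression pattern: phone must look like (999) 999-9999.
        "invalid_phone": ~df["phone"].fillna("").str.match(r"^\(\d{3}\) \d{3}-\d{4}$"),
    }

    for rule, mask in issues.items():
        print(rule, "-> rows", df.index[mask].tolist())
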
#Decleanse
  • Parsing: for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.
  • Data transformation: Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
  • Duplicate elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.
  • Statistical methods: By analyzing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values which can be replaced by one or more plausible values, which are usually obtained by extensive data augmentation algorithms.
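
A rough sketch of the last two techniques: records are sorted on a normalized key so that duplicates land next to each other, and amounts far from the mean are flagged as statistical outliers. The sample records and the two-standard-deviation threshold are invented for illustration.

    import statistics

    # Hypothetical records: (customer name, order amount).
    records = [
        ("Robert Smith", 120.0),
        ("robert smith ", 120.0),   # duplicate once the key is normalized
        ("Jane Doe", 95.0),
        ("Jane Doe", 110.0),
        ("Ann Lee", 105.0),
        ("Tom Ford", 98.0),
        ("Mia Chan", 4999.0),       # suspicious amount
    ]

    def normalize(name: str) -> str:
        """Lower-case and collapse whitespace so near-identical keys compare equal."""
        return " ".join(name.lower().split())

    # Duplicate elimination: sort by the key so duplicates are adjacent, then skip repeats.
    deduped, previous = [], None
    for name, amount in sorted(records, key=lambda r: (normalize(r[0]), r[1])):
        key = (normalize(name), amount)
        if key != previous:
            deduped.append((name, amount))
        previous = key

    # Statistical method: flag amounts more than two standard deviations from the mean.
    amounts = [amount for _, amount in deduped]
    mean, stdev = statistics.mean(amounts), statistics.stdev(amounts)
    outliers = [(n, a) for n, a in deduped if abs(a - mean) > 2 * stdev]

    print("after de-duplication:", deduped)
    print("possible outliers:", outliers)
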
#Challenges and problems
  • Error correction and loss of information: The most challenging problem within data cleansing remains the correction of values to remove duplicates and invalid entries. In many cases, the available information on such anomalies is limited and insufficient to determine the necessary transformations or corrections, leaving the deletion of such entries as a primary solution. The deletion of data, though, leads to loss of information; this loss can be particularly costly if there is a large amount of deleted data.
  • Maintenance of cleansed data: Data cleansing is an expensive and time-consuming process. So after having performed data cleansing and achieving a data collection free of errors, one would want to avoid the re-cleansing of data in its entirety after some values in data collection change. The process should only be repeated on values that have changed; this means that a cleansing lineage would need to be kept, which would require efficient data collection and management techniques.
  • Data cleansing in virtually integrated environments: In virtually integrated sources like IBM’s DiscoveryLink, the cleansing of data has to be performed every time the data is accessed, which considerably decreases the response time and efficiency.
  • Data-cleansing framework: In many cases, it will not be possible to derive a complete data-cleansing graph to guide the process in advance. This makes data cleansing an iterative process involving significant exploration and interaction, which may require a framework in the form of a collection of methods for error detection and elimination in addition to data auditing. This can be integrated with other data-processing stages like integration and maintenance.

Microsoft :
Data cleansing is the process of analyzing the quality of data in a data source, manually approving or rejecting the suggestions made by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyzes how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify the computer-assisted results to ensure that the data cleansing is exactly as they want it to be done.

The data steward can also perform data cleansing in the Integration Services packaging process. In this case, the data steward would use the DQS Cleansing component in Integration Services that automatically performs data cleansing using an existing knowledge base.



Tibco
Inappropriate, incorrect, duplicate, and missing data are prime examples of dirty data.

Dirty data contributes to inaccurate and unreliable results. If dirty data is used as the primary source for decision making, unforeseen critical errors can occur, predictive models become undependable, and calculations are less precise.

Once dirty data is detected, it has to be corrected. But while that’s taking place, managerial decisions are delayed, processes require re-evaluation and the work that’s contributed to generating the dirty data has to be reworked.

All this leads to wasted employee time, incorrect strategic decisions, and a decrease in the organization’s return on investment.


In the International Journal of Engineering Research and Applications (IJERA), author Sweety Patel identifies multiple ways data becomes dirty. Examples include:
  • Data’s been entered erroneously or data entry personnel are poorly trained.
  • System limitations or system configuration rules are applied inaccurately.
  • Scheduled data updates are neglected.
  • Duplicate records are not removed.
  • Lack of validation rules or rules are applied inconsistently.
  • Source to target mapping definitions are inaccurate.
Additionally, the IJERA article notes that when populating a data warehouse, the extraction, transformation and loading cycle (ETL) is the most important process to ensure that dirty data becomes clean.

During an interview, Milan Thakkar, a senior business intelligence engineer at Mindspark Interactive Inc., says he agrees with that sentiment. He reasons that all data is inherently prone to errors and suggests that during ETL data should be:

  1. Subjected to general statistical analysis. Evaluate new data against historical data for outliers. Mean, median, mode, standard deviation, range and other statistical methods can be applied. Confidence intervals should also be part of this analysis.
     
  2. Evaluated against a clustering algorithm. A clustering algorithm will also identify outliers and is usually significantly more complete than the general statistical analysis. Clustering can be used to evaluate an entire data set against itself by considering the Euclidean distance between records (see the sketch after this list).
     
  3. Validated. Data integrity tests should be applied and then the data should be vetted against business rules. Check the data type to ensure that the data is appropriate for the column.
     
  4. Standardized. Data transformation rules should be used to ensure that the data format is consistent and the business logic is dependable and based on user requirements.
     
  5. Tracked. A metadata repository should be established to track the entire process, including the data transformation, the process of vetting, and every method that's used to analyze the data. Calculation formulas, data transformation algorithms, and business logic rationale should be readily available.
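
A minimal sketch of steps 1 and 2 above, assuming a new batch arrives as numeric records: per-column z-scores stand in for the general statistical analysis, and distance from the batch centroid stands in for a fuller clustering pass. All values and thresholds are invented for illustration.

    import numpy as np

    # Hypothetical batch of (order_amount, items_per_order) records from an ETL load.
    batch = np.array([
        [120.0, 2], [95.0, 1], [110.0, 2], [105.0, 3],
        [98.0, 1], [130.0, 2], [4999.0, 1],
    ])

    # Step 1: general statistical analysis using per-column z-scores.
    z = np.abs((batch - batch.mean(axis=0)) / batch.std(axis=0))
    flagged_by_zscore = np.where((z > 2).any(axis=1))[0]

    # Step 2 (simplified): Euclidean distance of each record from the batch centroid.
    centroid = batch.mean(axis=0)
    distances = np.linalg.norm(batch - centroid, axis=1)
    flagged_by_distance = np.where(distances > distances.mean() + 2 * distances.std())[0]

    print("rows flagged by z-score:", flagged_by_zscore.tolist())
    print("rows flagged by centroid distance:", flagged_by_distance.tolist())
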
As with any computer process, an ETL process has to be “told what to do” or programmed correctly. To further protect your organization against dirty data, Drew Rockwell recommends:
  • Dedicating resources to maintaining data integrity.
  • Embedding your analytics.
  • Not forcing an overarching schema.
  • Providing visibility into the origin and history of the data.
  • Thinking beyond Excel.
In general, in order to truly be protected against dirty data you must first be proactive by building automated processes to cleanse data during ETL and then applying the steps suggested by Rockwell.

Tuesday, May 6, 2014

ETL - Data Quality Overview & Definition

IBM : What is Data Quality?
Data quality is an essential characteristic that determines the reliability of data for making decisions. High-quality data is:

  • Complete: All relevant data—such as accounts, addresses and relationships for a given customer—is linked.
  • Accurate: Common data problems like misspellings, typos, and random abbreviations have been cleaned up.
  • Available: Required data is accessible on demand; users do not need to search manually for the information.
  • Timely: Up-to-date information is readily available to support decisions.
Business leaders recognize the value of big data and are eager to analyze it to obtain actionable insights and improve the business outcomes. Unfortunately, the proliferation of data sources and exponential growth in data volumes can make it difficult to maintain high-quality data. To fully realize the benefits of big data, organizations need to lay a strong foundation for managing data quality with best-of-breed data quality tools and practices that can scale and be leveraged across the enterprise.

#Business value of data quality
Data quality-related problems cost companies millions of dollars annually because of lost revenue opportunities, failure to meet regulatory compliance or failure to address customer issues in a timely manner. Poor data quality is often cited as a reason for failure of critical information-intensive projects. By implementing a data quality program, organizations can:

  • Deliver high-quality data for a range of enterprise initiatives including business intelligence, applications consolidation and retirement, and master data management
  • Reduce time and cost to implement CRM, data warehouse/BI, data governance, and other strategic IT initiatives and maximize the return on investments
  • Construct consolidated customer and household views, enabling more effective cross-selling, up-selling, and customer retention
  • Help improve customer service and identify a company's most profitable customers
  • Provide business intelligence on individuals and organizations for research, fraud detection, and planning
  • Reduce the time required for data cleansing—saving on average 5 million hours, for an average company with 6.2 million records (Aberdeen Group research)

Wikipedia
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any external purpose; e.g., a person's age and birth date may conflict within different parts of a database. These first views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept as it relates to business data processing, although of course other data have various quality issues as well.

#Definitions
This list is taken from the online book "Data Quality: High-impact Strategies".

  • Degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
  • The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.
  • The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.
  • The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.
  • Complete, standards based, consistent, accurate and time stamped.
#Overview
Problems with data quality don't only arise from incorrect data. Inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.

Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.[10]

The market is going some way to providing data quality assurance. A number of vendors make tools for analysing and repairing poor quality data in situ, service providers can clean the data on a contract basis and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of tools for improving data, which may include some or all of the following:

  • Data profiling - initially assessing the data to understand its quality challenges
  • Data standardization - a business rules engine that ensures that data conforms to quality rules
  • Geocoding - for name and address data. Corrects data to US and Worldwide postal standards
  • Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual. It might be able to manage 'householding', or finding links between husband and wife at the same address, for example. Finally, it often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record. (A toy matching sketch follows this list.)
  • Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
  • Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into enterprise applications to keep it clean.
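
As a toy illustration of the matching step, the Python sketch below uses the standard library's difflib to score name similarity and report likely duplicates. A real matching engine would combine many fields, nickname tables ('Bob'/'Robert'), and address data, so treat the names and the threshold here as assumptions for the example.

    from difflib import SequenceMatcher

    # Hypothetical customer names pulled from two source systems.
    names = ["Robert Smith", "Bob Smith", "Roberta Smythe", "Jane Doe", "Jane  Doe"]

    def similarity(a: str, b: str) -> float:
        """Similarity ratio in [0, 1] after normalizing case and whitespace."""
        def norm(s: str) -> str:
            return " ".join(s.lower().split())
        return SequenceMatcher(None, norm(a), norm(b)).ratio()

    THRESHOLD = 0.8  # assumed cut-off for reporting a possible match
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = similarity(names[i], names[j])
            if score >= THRESHOLD:
                print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} (score {score:.2f})")
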
#Data Quality Control
Data quality control is the process of controlling the usage of data with known quality measurement—for an application or a process. This process is usually done after a Data Quality Assurance (QA) process, which consists of discovery of data inconsistency and correction.

Data QA process provides following information to Data Quality Control (QC):

  • Severity of inconsistency
  • Incompleteness
  • Accuracy
  • Precision
  • Missing / Unknown
The data QC process uses the information from the QA process to decide whether to use the data for analysis, in an application, or in a business process. For example, if a data QC process finds that the data contains too many errors or inconsistencies, it prevents that data from being processed. Using incorrect data can crucially impact output; for example, providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a data QC process protects how data is used and establishes safe information usage.
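
A toy illustration of that accept/reject decision in Python: a batch is rejected when its measured error rate exceeds an assumed tolerance. The report fields and the 5% threshold are invented for the example.

    # Hypothetical QA summary for an incoming batch of records.
    qa_report = {"records": 10_000, "invalid_records": 730, "inconsistent_records": 120}

    MAX_ERROR_RATE = 0.05  # assumed tolerance: at most 5% of records may be bad

    def accept_batch(report: dict) -> bool:
        """Return True if the batch is clean enough to pass on for processing."""
        bad = report["invalid_records"] + report["inconsistent_records"]
        return bad / report["records"] <= MAX_ERROR_RATE

    if accept_batch(qa_report):
        print("batch accepted for processing")
    else:
        print("batch rejected: error rate too high")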

#Data Quality Assurance
Data quality assurance is the process of profiling the data to discover inconsistencies and other anomalies in the data, and performing data cleansing activities (e.g. removing outliers, missing data interpolation) to improve the data quality.

These activities can be undertaken as part of data warehousing or as part of the database administration of an existing piece of applications software.

#Criticism of existing tools and processes
The main reasons cited are:

  • Project costs: costs typically in the hundreds of thousands of dollars
  • Time: lack of enough time to deal with large-scale data-cleansing software
  • Security: concerns over sharing information, giving an application access across systems, and effects on legacy systems

Gartner

#Data Quality Tools
The market for data quality tools has become highly visible in recent years as more organizations understand the impact of poor-quality data and seek solutions for improvement. Traditionally aligned with cleansing of customer data (names and addresses) in support of CRM-related activities, the tools have expanded well beyond such capabilities, and forward-thinking organizations are recognizing the relevance of these tools in other data domains. Product data — often driven by MDM initiatives — and financial data (driven by compliance pressures) are two such areas in which demand for the tools is quickly building.

Data quality tools are used to address various aspects of the data quality problem:

  • Parsing and standardization — Decomposition of text fields into component parts and formatting of values into consistent layouts based on industry standards, local standards (for example, postal authority standards for address data), user-defined business rules, and knowledge bases of values and patterns
  • Generalized “cleansing” — Modification of data values to meet domain restrictions, integrity constraints or other business rules that define sufficient data quality for the organization
  • Matching — Identification, linking or merging related entries within or across sets of data
  • Profiling — Analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  • Monitoring — Deployment of controls to ensure ongoing conformance of data to business rules that define data quality for the organization
  • Enrichment — Enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)

The tools provided by vendors in this market are generally consumed by technology users for internal deployment in their IT infrastructure, although hosted data quality solutions are continuing to emerge and grow in popularity. The tools are increasingly implemented in support of general data quality improvement initiatives, as well as within critical applications, such as ERP, CRM and BI. As data quality becomes increasingly pervasive, many data integration tools now include data quality management functionality.

Monday, May 5, 2014

ETL - Data Profiling Overview & Definition

Gartner : Data profiling is a technology for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy and completeness. This is accomplished by analyzing one or multiple data sources and collecting metadata that shows the condition of the data and enables the data steward to investigate the origin of data errors. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.

Techopedia : Data profiling is a technique used to examine data for different purposes like determining accuracy and completeness. This process examines a data source such as a database to uncover the erroneous areas in data organization. Deployment of this technique improves data quality.

Wikipedia

#Data Profiling
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:
  1. Find out whether existing data can easily be used for other purposes
  2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
  3. Give metrics on data quality including whether the data conforms to particular standards or patterns
  4. Assess the risk involved in integrating data for new applications, including the challenges of joins
  5. Assess whether metadata accurately describes the actual values in the source database
  6. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
  7. Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality.

#Data Profiling in Relation to DW/BI Development
 

Introduction
Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data. Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata. Thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not. The result of the analysis is used both strategically, to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design, and to level sponsors’ expectations.


How to do Data Profiling
Data profiling utilizes different kinds of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation as well as other aggregates such as count and sum. Additional metadata information obtained during data profiling could be data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. The metadata can then be used to discover problems such as illegal values, misspelling, missing values, varying value representation, and duplicates. Different analyses are performed for different structural levels. E.g. single columns could be profiled individually to get an understanding of frequency distribution of different values, type, and use of each column. Embedded value dependencies can be exposed in cross-columns analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis. Normally purpose-built tools are used for data profiling to ease the process. The computation complexity increases when going from single column, to single table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.
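
As a small illustration of single-column profiling, the pandas sketch below computes several of the statistics mentioned above for an invented source extract; the column names and values are assumptions for the example.

    import pandas as pd

    # Hypothetical extract from a candidate source system.
    df = pd.DataFrame({
        "customer_id": [1, 2, 3, 3, 5],
        "state":       ["NY", "ny", "CA", None, "TX"],
        "balance":     [120.5, 87.0, None, 310.2, 95.0],
    })

    # Column-level profile: type, null share, distinct values, uniqueness.
    profile = pd.DataFrame({
        "dtype":     df.dtypes.astype(str),
        "non_null":  df.notna().sum(),
        "null_pct":  df.isna().mean().round(3),
        "distinct":  df.nunique(),
        "is_unique": df.nunique() == df.notna().sum(),
    })
    print(profile)

    # Descriptive statistics for a numeric column (min, max, mean, percentiles, std).
    print(df["balance"].describe())

    # Frequency distribution of a single column, exposing casing inconsistencies.
    print(df["state"].value_counts(dropna=False))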


When to Conduct Data Profiling
According to Kimball, data profiling is performed several times and with varying intensity throughout the data warehouse developing process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified right after the acquisition of the business requirements for the DW/BI. The purpose is to clarify at an early stage if the right data is available at the right detail level and that anomalies can be handled subsequently. If this is not the case the project might have to be canceled. More detailed profiling is done prior to the dimensional modeling process in order to see what it will require to convert data into the dimensional model, and extends into the ETL system design process to establish what data to extract and which filters to apply. An additional time to conduct data profiling is during the data warehouse development process after data has been loaded into staging, the data marts, etc. Doing so at these points in time helps assure that data cleaning and transformations have been done correctly according to requirements.


Benefits of Data Profiling
The benefits of data profiling are to improve data quality, shorten the implementation cycle of major projects, and improve users' understanding of the data. Discovering the business knowledge embedded in the data itself is one of the most significant benefits of data profiling. Data profiling is one of the most effective technologies for improving data accuracy in corporate databases. Although data profiling is effective, remember to strike a suitable balance and not slip into "analysis paralysis".