Recent Works - bitstudio web

This article introduces data analysis with Python. Python can be used to analyze time series data such as the Nikkei average and the dollar-yen exchange rate, to extract data through the Twitter API and web scraping, and to build clustering and predictive models with machine learning. The sections below cover the basics of each kind of analysis.

f:id:hira03:20190811181422p:plain — recent works

Time Series Analysis
Twitter API
Web Scraping
- Generation Change in AKB48
- Satisfaction Level of Tabelog Curry
Machine Learning
- Clustering by Tabelog Comment
- Predictive Model with Titanic Dataset

Time Series Analysis

This section introduces the analysis of time series data. Time series data are obtained by continuously observing how a quantity changes over time. Typical examples are the population and household counts published in the census and the monetary base published by the Bank of Japan. Analyzing such data typically means plotting it as a line chart or finding the maximum and minimum values over a given period.
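
Finding the maximum and minimum over a period can be sketched as follows. This is a minimal illustration using only the standard library; the `series` values and the helper name `min_max_in_period` are hypothetical and are not taken from the statistics discussed in this article.

```python
from datetime import date

# Hypothetical monthly observations as (date, value) pairs; not real statistics.
series = [
    (date(2019, 1, 1), 102.0),
    (date(2019, 2, 1), 98.5),
    (date(2019, 3, 1), 105.2),
    (date(2019, 4, 1), 110.7),
    (date(2019, 5, 1), 107.3),
]


def min_max_in_period(observations, start, end):
    """Return the lowest and highest (date, value) pairs with start <= date <= end."""
    window = [(d, v) for d, v in observations if start <= d <= end]
    if not window:
        raise ValueError("no observations in the requested period")
    lowest = min(window, key=lambda pair: pair[1])
    highest = max(window, key=lambda pair: pair[1])
    return lowest, highest


lowest, highest = min_max_in_period(series, date(2019, 2, 1), date(2019, 4, 1))
print("min:", lowest)
print("max:", highest)
```

With real data, the same filtering would usually be done with a data-frame library, but the logic is the same: restrict to the period, then take the extremes.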

Monetary Base

The Bank of Japan publishes a variety of statistical data, and the monetary base is one of them. The graph below visualizes the monetary base data. When “quantitative and qualitative monetary easing” was decided in April 2013, the monetary base was about 149 trillion yen. As of May 2019 it was about 510 trillion yen. The monetary base has therefore grown to about 3.4 times its level at the introduction of “quantitative and qualitative monetary easing”.

f:id:hira03:20190812113607p:plain — Monetary Base

Nikkei Stock Average

The Nikkei average is known as an important economic indicator for understanding economic trends. Stock price data such as the Nikkei average are also time series data, so they are used in many kinds of analysis. The data can be downloaded in CSV format from the US Yahoo Finance site. This section analyzes the Nikkei average and the Dow average.

Trends in the Nikkei Stock Average

This graph shows the movement of the Nikkei average, using data from January 1965 to July 2019. The Nikkei average recorded its all-time high of 38,915 yen on December 29, 1989. After the Lehman shock, it fell to a low of 7,054 yen on March 10, 2009. Since the second Abe Cabinet took office, the Nikkei average has been on an upward trend.

f:id:hira03:20190813010946p:plain — Trends in the Nikkei Stock Average

Trends in the Dow Stock Average

This graph shows the movement of the Dow average, using data from January 1985 to July 2019. The Dow average plunged during the Lehman shock, but otherwise it has risen almost continuously up to the present. Since recording a low of $6,547 on March 9, 2009, after the Lehman shock, it has risen faster than the Nikkei average. Since President Trump took office on January 20, 2017, the Dow average has kept setting new all-time highs.

f:id:hira03:20190813011010p:plain — Trends in the Dow Stock Average

Comparison of Nikkei average and Dow average

This graph compares the movements of the Nikkei average and the Dow average. Because the two indices are quoted in different currencies, their levels cannot be compared directly. When the Nikkei average recorded its all-time high of 38,915 yen on December 29, 1989, the Dow average was $2,753. On July 1, 2019, the Nikkei average was 21,729 yen, while the Dow average was $26,717. Over that period, the Nikkei average ended at about 0.56 times its starting level and the Dow average at about 9.7 times.
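
The two ratios quoted above can be checked with a few lines of arithmetic. The helper name `growth_ratio` is introduced here only for illustration; the input figures are the ones given in the text.

```python
def growth_ratio(start_value, end_value):
    """Return how many times larger end_value is than start_value."""
    return end_value / start_value


# Index levels on December 29, 1989 and July 1, 2019, as quoted in the text.
nikkei_ratio = growth_ratio(38915, 21729)
dow_ratio = growth_ratio(2753, 26717)

print(round(nikkei_ratio, 2))  # about 0.56
print(round(dow_ratio, 1))     # about 9.7
```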

f:id:hira03:20190813011034p:plain — Comparison Nikkei average and Dow average

Exchange Rate

It is generally understood that stock prices fall when the yen strengthens and rise when the yen weakens. This section compares the dollar-yen exchange rate with the Nikkei average to analyze how the two are correlated.

Trends in dollar-yen exchange rate

This graph shows the dollar-yen exchange rate from January 2001 to June 2019. The yen was at its weakest in this period on February 11, 2002, at 134.95 yen to the dollar. It was at its strongest on October 31, 2011, at 75.75 yen to the dollar. This was due to the worldwide fall in stock prices caused by the Lehman shock. Since the decision on April 4, 2013 to introduce “quantitative and qualitative monetary easing”, the yen has tended to weaken against the dollar.

f:id:hira03:20190813174834p:plain — Trends in dollar-yen exchange rate

Comparison of exchange rate and Nikkei average

This graph compares the dollar-yen exchange rate with the Nikkei average. Both series are standardized to values between 0 and 1 (that is, min-max scaled). From 2004 to 2016, the yen tends to be weak when stock prices are high and strong when stock prices are low. In contrast, no such relationship is apparent from 2001 to 2002 or from 2017 to the present.
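
Rescaling a series so that its values run from 0 to 1 is commonly done with min-max scaling. The sketch below assumes that this is the transformation meant here. The sample values and the function name `min_max_scale` are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values, not actual market data.
usd_jpy = [80.0, 100.0, 120.0]
nikkei = [10000.0, 15000.0, 20000.0]

print(min_max_scale(usd_jpy))  # [0.0, 0.5, 1.0]
print(min_max_scale(nikkei))   # [0.0, 0.5, 1.0]
```

After scaling, two series with very different units can be drawn on the same vertical axis, which is what makes the visual comparison possible.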

f:id:hira03:20190813174858p:plain — Comparison of exchange rate and Nikkei average

Correlation of exchange rate and Nikkei average

This graph is a scatter plot of the standardized dollar-yen exchange rate against the Nikkei average. The correlation coefficient between the two is 0.4462, which indicates a modest correlation. The scatter plot also shows that the correlation is positive.
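
A correlation coefficient like the one above is usually the Pearson coefficient. The sketch below assumes that definition, since the article does not say which one was used. It computes the coefficient from scratch on hypothetical sample lists; the function name `pearson` is introduced for illustration.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: a perfectly increasing and a perfectly decreasing pair.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one, so 0.4462 sits in between.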

f:id:hira03:20190813174923p:plain — Correlation of exchange rate and Nikkei average

Twitter API

The Twitter API is the API (application programming interface) provided by Twitter. It comes in two main types, the REST API and the Streaming API. Requests mainly use two HTTP methods, POST and GET, which let a program post tweets, retrieve tweets, and retrieve the IDs of the accounts a user follows. With the Twitter API it is possible, for example, to retrieve only the text of tweets that contain a certain keyword.

Marketing of Starbucks

The Starbucks Twitter account is known for its large number of followers. The Starbucks account in Japan started in November 2010 and had about 4.65 million followers as of July 2019. This section aggregates posting times and visualizes the words used in tweets from the Starbucks account.

Aggregation by posting time

This histogram counts the number of posts by hour of day. Posts from the Starbucks account are concentrated between 10:00 and 19:00. Starbucks stores are generally open from 9:00 to 23:00, so the account tweets during business hours.
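
Counting posts by hour can be sketched with the standard library alone. The timestamps and their format below are hypothetical stand-ins for tweet creation times, and the function name `count_by_hour` is introduced for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical posting timestamps, not actual tweet data.
posted_at = [
    "2019-07-01 10:05:00",
    "2019-07-01 12:30:00",
    "2019-07-02 10:45:00",
    "2019-07-03 18:10:00",
]


def count_by_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a Counter mapping hour of day (0-23) to the number of posts."""
    return Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)


hour_counts = count_by_hour(posted_at)
# Print a simple text histogram, one row per hour that has posts.
for hour in sorted(hour_counts):
    print(f"{hour:02d}:00 {'#' * hour_counts[hour]}")
```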

f:id:hira03:20190813230042p:plain — Aggregation by posting time

Visualization of nouns

This image visualizes the nouns in tweets posted from the Starbucks account. Many of the words relate to stores and products, such as Starbucks, coffee, Frappuccino, and hot drink. There are also many positive words, such as smile, debut, and looking forward. Other frequent words relate to announcements, such as today and tomorrow, or describe products, such as taste, fermentation, and flavor.

f:id:hira03:20190813230113p:plain — Visualization of nouns

Visualization of adjectives and adverbs

This image visualizes adjectives and adverbs. Positive words such as fun, new, delicious, and bright are common. There are also many words about taste, such as sweet, sweet-and-sour, bittersweet, and savory. Words that express texture or appearance are also used often, such as exciting, sparkling, soft, moist, refreshing, chewy, crispy, and fluffy.

f:id:hira03:20190813230143p:plain — Visualization of adjectives and adverbs

Scatter Plot of “fav” and “RT”

In addition to tweets and IDs, the Twitter API can retrieve the number of likes (“fav”) and retweets (“RT”). This section analyzes the relationship between “fav” and “RT” for a personal account and a corporate account, each with a relatively large number of followers. Scatter plots visualize the relationship for the personal account of Rino Sashihara and the corporate account of Tokyo Disneyland.

About the relationship between “fav” and “RT” by Rino Sashihara

The following graph is a scatter plot of the most recent 500 tweets by Rino Sashihara, restricted to those with at most 50,000 “fav” and at most 5,000 “RT”. The scatter plot shows that the two are closely related. The correlation coefficient is 0.9548, which indicates a very strong correlation. In other words, “RT” tends to increase as “fav” increases.

f:id:hira03:20191106112523p:plain — About the relationship between “fav” and “RT” by Rino Sashihara

About the relationship between “fav” and “RT” by Tokyo Disneyland

The following graph is a scatter plot of the most recent 500 tweets by Tokyo Disneyland, restricted to those with at most 30,000 “fav” and at most 5,000 “RT”. The scatter plot shows that the two are closely related. The correlation coefficient is 0.9021, which indicates a very strong correlation. In other words, “RT” tends to increase as “fav” increases.

f:id:hira03:20191106112610p:plain — About the relationship between “fav” and “RT” by Tokyo Disneyland

Location Information on Twitter

Twitter has a feature that attaches location information to a tweet. It applies when Twitter is used from a smartphone: if you post with the feature turned on, your location at the time of the tweet is attached. In this section, the location information of tweets collected around a certain place is plotted on a map.

Usage situation at daytime (from 7:00 to 14:00)

This image plots on a map the location information attached to tweets posted in Shinjuku during the daytime (7:00 to 14:00) over a certain period. Tweets are spread fairly evenly around Shinjuku Station, with especially many coming from the East Exit area. The South Exit area and the Tokyo Metropolitan Government area also have tweets, although fewer.

f:id:hira03:20191106112637p:plain — Usage situation at daytime (from 7:00 to 14:00)

Usage situation at night (from 18:00 to 1:00)

Looking at Twitter usage at night (18:00 to 1:00) over the same period, people as a whole have moved toward the East Exit area compared with the daytime. In particular, the South Exit area and the Tokyo Metropolitan Government area, which had tweets during the daytime, show fewer tweets at night.

f:id:hira03:20191106112701p:plain — Usage situation at night (from 18:00 to 1:00)

Web Scraping

Web scraping is a technique for extracting only the information you need from a website. In Python, web scraping can be done with libraries such as Requests and Beautiful Soup. For example, you can extract data marked up in HTML table tags and write it out in CSV format. This section analyzes data extracted by web scraping.
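
The table-to-CSV step can be sketched as follows. To stay self-contained, this example parses an inline HTML string with the standard library's `html.parser` rather than fetching a page with Requests and parsing it with Beautiful Soup, so it is a stand-in for the approach described above and not the code used for this article. The table contents and the class name `TableParser` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical HTML; a real script would download this from a website.
html = """
<table>
  <tr><th>name</th><th>birth_year</th></tr>
  <tr><td>Member A</td><td>1999</td></tr>
  <tr><td>Member B</td><td>2001</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Write the extracted rows as CSV into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To save the result, the same `csv.writer` call can target a file opened with `open(path, "w", newline="")` instead of the in-memory buffer.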

Generation Change in AKB48

AKB48 is a Japanese idol group with many members. The members selected to perform change with each single, and this system is thought to drive generational change. Here, the birth dates of members are scraped from the official AKB48 website to analyze the distribution of generations, among other things.

Generational changes in Selection General Election 2015 and 2018

The histogram below groups the members who ranked in the 2015 and 2018 selection general elections by birth year. In the 2015 election, the ranked members are concentrated in birth years 1994 to 1998. In the 2018 election, they are concentrated in birth years 1996 to 2002. Compared with the 2015 graph, the increase in members born from 1999 to 2002 is especially noticeable in 2018. This shows that generational change progressed between 2015 and 2018.

f:id:hira03:20191106112750p:plain — Generational changes in Selection General Election 2015 and 2018

Generation distribution of all 48group members (as of July 2019)

Next, the distribution by generation across the whole 48group is graphed. As of July 2019, the 48group (AKB48, SKE48, NMB48, HKT48, and STU48 combined) had 317 members. The standard deviation of birth year is about 3, and the histogram below shows a distribution that is roughly normal. The mode is birth year 1999, the median is 2000, the oldest member was born in 1991, and the youngest in 2006. In other words, members around the age of 20 are currently the most numerous, and the generations span about 15 years between the oldest and the youngest. The graph shows that members are concentrated in birth years 1997 to 2003.
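
Summary statistics like the ones above can be computed with the `statistics` module in the standard library. The `birth_years` list below is a small hypothetical sample, not the scraped member data.

```python
import statistics

# Hypothetical birth years, not the actual member data.
birth_years = [1996, 1998, 1999, 1999, 1999, 2000, 2001, 2001, 2003]

mode_year = statistics.mode(birth_years)      # most frequent birth year
median_year = statistics.median(birth_years)  # middle value when sorted
spread = statistics.pstdev(birth_years)       # population standard deviation
age_range = max(birth_years) - min(birth_years)

print(mode_year, median_year, round(spread, 2), age_range)
```

`pstdev` treats the list as the whole population, which fits a complete member roster. For a sample drawn from a larger group, `statistics.stdev` would be the usual choice.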

f:id:hira03:20191106112824p:plain — Generation distribution of all 48group members (as of July 2019)

Generation distribution of AKB48

AKB48 has 102 members. The mean birth year is about the same as for the whole 48group, and the spread is moderate. The mode is birth year 2001, with members concentrated between 1997 and 2003. Because AKB48 accounts for about 30% of the 48group, its distribution follows the same trend as the 48group as a whole.

f:id:hira03:20191106112846p:plain — Generation distribution of AKB48

Satisfaction Level of Tabelog Curry

Minato Ward has many curry shops. In this section, evaluation points, comment counts, and price ranges are scraped from the Tabelog website. Curry shops in two areas are analyzed: the Roppongi / Azabu / Hiroo area and the Meguro / Shirokane / Gotanda area. Satisfaction with the shops is examined from three viewpoints: evaluation points, number of comments, and price range.

Compare by evaluation points

This graph is a histogram of Tabelog evaluation points. In both areas, the evaluation points fall between 3.0 and 3.8. The Meguro / Shirokane / Gotanda area has more shops in the 3.0 to 3.1 and 3.2 to 3.3 ranges than the Roppongi / Azabu / Hiroo area. The Roppongi / Azabu / Hiroo area, on the other hand, has many shops in the 3.4 to 3.5 range. The mean evaluation point is about the same in both areas.

f:id:hira03:20191106112935p:plain — Compare by evaluation points

Compare by count of comments

This graph is a histogram of comment counts. Each area has two shops with 300 or more comments. In the Meguro / Shirokane / Gotanda area, Hot Spoon and the curry shop Udon have 410 and 334 comments, respectively. In the Roppongi / Azabu / Hiroo area, Nirvana New York Tokyo Midtown and Nirvanam Kamiyacho have 345 and 578 comments, respectively.

f:id:hira03:20191106113005p:plain — Compare by count of comments

Compare by price range

The Meguro / Shirokane / Gotanda area has many shops priced at 999 yen or less, while the Roppongi / Azabu / Hiroo area has more shops in the 1,000 to 1,999 yen and 2,000 to 2,999 yen ranges.

f:id:hira03:20191106113026p:plain — Compare by price range

Machine Learning

Machine learning builds statistical models that make predictions and classifications by learning from large amounts of data. It includes supervised learning and unsupervised learning. In Python, machine learning can be done with libraries such as scikit-learn. This section covers clustering as an example of unsupervised learning and classification as an example of supervised learning.

Clustering by Tabelog Comment

Here, ramen shops are classified by clustering their Tabelog comments. Tabelog lets users write comments as reviews of a shop, and these comments are scraped for use as training data. Morphological analysis splits the scraped comments into individual words, and the result is converted into feature vectors on which clustering is performed. The results are visualized as a scatter plot with the mean cluster label on the vertical axis and the number of comments on the horizontal axis. As the figure shows, ramen shops with many comments appear on the right, and shops are separated along the vertical axis according to the content of their comments.
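
The step of converting tokenized comments into feature vectors can be sketched with a simple bag-of-words count. The article does not state which vectorization was used, so the word-count approach here is an assumption, and the tokenized comments are hypothetical English stand-ins for the output of a morphological analyzer.

```python
from collections import Counter

# Hypothetical comments, already split into words by a morphological analyzer.
tokenized_comments = [
    ["rich", "pork", "broth", "thick", "noodles"],
    ["light", "soy", "broth", "thin", "noodles"],
    ["rich", "miso", "broth"],
]


def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc})


def to_count_vector(document, vocabulary):
    """Return how often each vocabulary word occurs in the document."""
    counts = Counter(document)
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(tokenized_comments)
vectors = [to_count_vector(doc, vocabulary) for doc in tokenized_comments]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each comment becomes a fixed-length list of numbers, which is the form a clustering algorithm such as k-means expects as input.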

f:id:hira03:20200101150811p:plain — Clustering by Tabelog Comment

Predictive Model with Titanic Dataset

Here, supervised learning is performed on the Titanic training data, a well-known Kaggle dataset, using xgboost. Supervised learning generally starts with preprocessing that makes the data suitable for training, for example converting categorical data such as strings into dummy variables or standardizing numerical data. The data is then split into training data and test data, and a model is trained. The trained model can be evaluated for accuracy and used to make predictions on the test data. Training also makes it possible to evaluate which features contribute to the model. This graph shows which features of the Titanic dataset contribute to the trained model. Fare contributes the most, followed by Age.
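
The preprocessing steps named above (dummy variables, standardization, and the train/test split) can be sketched with the standard library alone. This is an illustration of the ideas, not the xgboost pipeline used for this article. The rows are hypothetical values shaped like a few Titanic columns, and the helper names `one_hot`, `standardize`, and `train_test_split` are introduced here for illustration.

```python
import random
import statistics

# Hypothetical rows with Titanic-like columns; not records from the dataset.
rows = [
    {"Sex": "male", "Fare": 10.0, "Survived": 0},
    {"Sex": "female", "Fare": 80.0, "Survived": 1},
    {"Sex": "female", "Fare": 12.0, "Survived": 1},
    {"Sex": "male", "Fare": 9.0, "Survived": 0},
    {"Sex": "male", "Fare": 50.0, "Survived": 0},
]


def one_hot(values):
    """Convert categorical values into dummy (0/1) columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


sex_dummies, sex_categories = one_hot([r["Sex"] for r in rows])
fare_scaled = standardize([r["Fare"] for r in rows])
features = [d + [f] for d, f in zip(sex_dummies, fare_scaled)]
labels = [r["Survived"] for r in rows]
train, test = train_test_split(list(zip(features, labels)))

print(sex_categories, len(train), len(test))
```

In practice these steps are usually done with library helpers, and the resulting training features and labels are then passed to the model for fitting.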

f:id:hira03:20200101150905p:plain — Predictive Model with Titanic Dataset