
[R] How to fix a garbled first character when reading a CSV file in R

by hs_seo 2019. 9. 16.

When you read a file in R, the name of the first column or the first character of the first data value sometimes comes out garbled.

 

This happens because the byte order mark (BOM) at the start of a UTF-8 document is read as part of the data instead of being skipped. To fix it, set fileEncoding so that R recognizes the BOM, as in read.csv(fileEncoding="UTF-8-BOM").

 

# The UTF-8 BOM garbles the first column name
> mlbstat = read.csv(file = "mlb-player-stats-Batters.csv", header = T)
> summary(mlbstat)
              癤풮layer        Team     Pos            G                AB              R                H               X2B       
 Adeiny Hechavarria:  3   BAL    : 28   1B: 76   Min.   :  1.00   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   : 0.00  
 Jose Bautista     :  3   TOR    : 28   2B:102   1st Qu.: 25.00   1st Qu.: 56.0   1st Qu.:  7.00   1st Qu.: 12.00   1st Qu.: 2.00  
 Adam Duvall       :  2   LAA    : 27   3B: 71   Median : 67.00   Median :189.0   Median : 21.00   Median : 43.00   Median : 9.00  
 Andrew McCutchen  :  2   NYM    : 27   C :125   Mean   : 72.99   Mean   :233.5   Mean   : 31.08   Mean   : 58.77   Mean   :11.88  
 Asdrubal Cabrera  :  2   TB     : 27   DH:  9   3rd Qu.:123.00   3rd Qu.:402.0   3rd Qu.: 51.00   3rd Qu.:100.00   3rd Qu.:19.00  
 Austin Jackson    :  2   CIN    : 25   OF:229   Max.   :162.00   Max.   :664.0   Max.   :129.00   Max.   :192.00   Max.   :51.00  
 (Other)           :675   (Other):527   SS: 77                                                                                     
      X3B               HR              RBI               SB               CS              BB               SO               SH        
 Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   : 0.000   Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:  4.00   1st Qu.: 16.00   1st Qu.: 0.000  
 Median : 0.000   Median : 4.000   Median : 20.00   Median : 1.000   Median : 0.00   Median : 16.00   Median : 44.00   Median : 0.000  
 Mean   : 1.225   Mean   : 8.073   Mean   : 29.59   Mean   : 3.581   Mean   : 1.39   Mean   : 22.55   Mean   : 56.65   Mean   : 0.598  
 3rd Qu.: 2.000   3rd Qu.:13.000   3rd Qu.: 49.00   3rd Qu.: 4.000   3rd Qu.: 2.00   3rd Qu.: 33.00   3rd Qu.: 88.00   3rd Qu.: 1.000  
 Max.   :12.000   Max.   :48.000   Max.   :130.00   Max.   :45.000   Max.   :14.00   Max.   :130.00   Max.   :217.00   Max.   :12.000  

       SF              HBP              AVG              OBP             SLG              OPS        
 Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.2020   1st Qu.:0.267   1st Qu.:0.3050   1st Qu.:0.5810  
 Median : 1.000   Median : 2.000   Median :0.2390   Median :0.307   Median :0.3880   Median :0.6930  
 Mean   : 1.772   Mean   : 2.775   Mean   :0.2301   Mean   :0.299   Mean   :0.3713   Mean   :0.6703  
 3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.:0.2670   3rd Qu.:0.337   3rd Qu.:0.4440   3rd Qu.:0.7770  
 Max.   :11.000   Max.   :22.000   Max.   :1.0000   Max.   :1.000   Max.   :1.1250   Max.   :2.0000  



# Declare the encoding so that the UTF-8 BOM is recognized
> mlbstat = read.csv(file = "mlb-player-stats-Batters.csv", header = T, fileEncoding="UTF-8-BOM")
> summary(mlbstat)
                Player         Team     Pos            G                AB              R                H               X2B       
 Adeiny Hechavarria:  3   BAL    : 28   1B: 76   Min.   :  1.00   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   : 0.00  
 Jose Bautista     :  3   TOR    : 28   2B:102   1st Qu.: 25.00   1st Qu.: 56.0   1st Qu.:  7.00   1st Qu.: 12.00   1st Qu.: 2.00  
 Adam Duvall       :  2   LAA    : 27   3B: 71   Median : 67.00   Median :189.0   Median : 21.00   Median : 43.00   Median : 9.00  
 Andrew McCutchen  :  2   NYM    : 27   C :125   Mean   : 72.99   Mean   :233.5   Mean   : 31.08   Mean   : 58.77   Mean   :11.88  
 Asdrubal Cabrera  :  2   TB     : 27   DH:  9   3rd Qu.:123.00   3rd Qu.:402.0   3rd Qu.: 51.00   3rd Qu.:100.00   3rd Qu.:19.00  
 Austin Jackson    :  2   CIN    : 25   OF:229   Max.   :162.00   Max.   :664.0   Max.   :129.00   Max.   :192.00   Max.   :51.00  
 (Other)           :675   (Other):527   SS: 77                                                                                     
      X3B               HR              RBI               SB               CS              BB               SO               SH        
 Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   : 0.000   Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:  4.00   1st Qu.: 16.00   1st Qu.: 0.000  
 Median : 0.000   Median : 4.000   Median : 20.00   Median : 1.000   Median : 0.00   Median : 16.00   Median : 44.00   Median : 0.000  
 Mean   : 1.225   Mean   : 8.073   Mean   : 29.59   Mean   : 3.581   Mean   : 1.39   Mean   : 22.55   Mean   : 56.65   Mean   : 0.598  
 3rd Qu.: 2.000   3rd Qu.:13.000   3rd Qu.: 49.00   3rd Qu.: 4.000   3rd Qu.: 2.00   3rd Qu.: 33.00   3rd Qu.: 88.00   3rd Qu.: 1.000  
 Max.   :12.000   Max.   :48.000   Max.   :130.00   Max.   :45.000   Max.   :14.00   Max.   :130.00   Max.   :217.00   Max.   :12.000  

       SF              HBP              AVG              OBP             SLG              OPS        
 Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.2020   1st Qu.:0.267   1st Qu.:0.3050   1st Qu.:0.5810  
 Median : 1.000   Median : 2.000   Median :0.2390   Median :0.307   Median :0.3880   Median :0.6930  
 Mean   : 1.772   Mean   : 2.775   Mean   :0.2301   Mean   :0.299   Mean   :0.3713   Mean   :0.6703  
 3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.:0.2670   3rd Qu.:0.337   3rd Qu.:0.4440   3rd Qu.:0.7770  
 Max.   :11.000   Max.   :22.000   Max.   :1.0000   Max.   :1.000   Max.   :1.1250   Max.   :2.0000  
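
The transcript above depends on a downloaded file. As a minimal, self-contained sketch of the same fix, the following writes a small hypothetical CSV with a UTF-8 BOM to a temporary file and reads it back with fileEncoding = "UTF-8-BOM". The sample row and the variable names (path, with_bom) are illustrative assumptions, not part of the original data set.

```r
# Hypothetical sample file: three BOM bytes followed by a tiny CSV.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # UTF-8 BOM
writeBin(charToRaw("Player,Team,Pos\nJose Bautista,TOR,OF\n"), con)
close(con)

# Reading with fileEncoding = "UTF-8-BOM" drops the BOM before parsing,
# so the first column name comes back clean.
with_bom <- read.csv(path, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(with_bom))
```

Reading the same file without fileEncoding leaves the BOM attached to the first column name. How that name is displayed depends on the platform and locale.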

Checking the hex dump

Opening the file in a hex viewer shows that its first 3 bytes (ef bb bf) are the UTF-8 BOM.

00000000:  efbb bf50 6c61 7965 722c 5465 616d 2c50 6f73 2c47 2c41 422c  :...Player,Team,Pos,G,AB,
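
The same check can be done from R without a hex viewer. The sketch below reads the first three bytes with readBin() and compares them to ef bb bf. The helper name has_utf8_bom and the temporary test files are assumptions made for illustration.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Illustrative files: one written with a BOM, one without.
bom_file <- tempfile(fileext = ".csv")
plain_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("Player,Team\n")), bom_file)
writeBin(charToRaw("Player,Team\n"), plain_file)

print(has_utf8_bom(bom_file))
print(has_utf8_bom(plain_file))
```

A check like this can decide whether fileEncoding = "UTF-8-BOM" is needed before calling read.csv().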

 

The MLB player statistics can be downloaded from the following site.
https://www.rotowire.com/baseball/stats.php

 
