python - Scrapy selector string won't accept international characters
I'm trying to get a Scrapy spider to crawl a website, but the elements I need are written in Spanish and use an accented character (í).
titulo = title.select(u'.//["Título original:"]/text()').extract()
I have found similar questions here, but the accepted answers have not worked for me.
Adding u at the beginning of the string takes care of some problems, but gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 21: ordinal not in range(128)
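The error message can be reproduced outside Scrapy. This is only a minimal sketch, assuming the selector string gets converted with the ASCII codec somewhere along the way (where that happens inside Scrapy is not shown here); the variable names are illustrative and not taken from the spider:

```python
# -*- coding: utf-8 -*-
# The label from the page, written with a \u escape so this file stays pure ASCII.
label = u'T\u00edtulo original:'

try:
    # The ASCII codec only covers code points 0-127, so U+00ED cannot be encoded.
    label.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# UTF-8 can represent the character; it becomes the two bytes 0xC3 0xAD.
utf8_bytes = label.encode('utf-8')

print(error_message)
print(repr(utf8_bytes))
```

The reported position depends on where the accented character sits in the string, so it will differ from the position 21 in the traceback above.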
I have seen suggestions here to use .decode('utf-8'), but doing this, or using .encode('utf-8'), gives me this error:
exceptions.ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
Am I missing something, or is there some other way to do this? Or would it be better to use a regex to match every part of my string except that letter?
Here is the code I have so far:
def parse(self, response):
    # Change to an HtmlResponse to allow for a UTF-8 encoded body.
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    print '\n\nResponse encoding', response.encoding  ## Page encoded in UTF-8
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//div[@class="datosespectaculo"]')
    items = []
    for title in titles:
        item = CarteleraItem()
        titulo = title.select(u'.//["Título original:"]/text()'.encode('utf-8')).extract()
        ano = title
<div id="contgeneral">
  <div class="contyrasca">
    <div id="contfix">
      <div class="contespectaculo">
        <div class="callyzack">
          <div itemscope itemtype="http://schema.org/Movie">
            <h1 class="titulo" itemprop="name">15.361</h1>
            <img class="fef" src="http://www.cartelera.com.uy/imagenes_espectaculos/musicdetail13/14770.jpg" />
            <div class="datosespectaculo">
              <strong>Título original:</strong> <em>15.361</em><br />
              <strong>Año:</strong> <span itemprop="copyrightYear">2014</span><br />
              <strong>Género:</strong> <span itemprop="genre">Comedy / Drama</span><br />
              <strong>Duración:</strong> <span itemprop="duration">60&nbsp;</span><br />
              <strong>Calificación:</strong> +18 años<br />
If adding
# -*- coding: utf-8 -*-
is not working, you can use a Unicode string where non-ASCII characters use the \u escape sequence.
Then your XPath selector becomes:
titulo = title.select(u'.//["T\u00edtulo original:"]/text()'.encode('utf-8')).extract()
I usually use a simple Python shell session to check the escape sequence:
paul@wheezy:~$ python
Python 2.7.3 (default, Jan  2 2013, 13:56:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'.//["Título original:"]/text()'
u'.//["T\xedtulo original:"]/text()'
>>> u'.//["T\u00edtulo original:"]/text()'
u'.//["T\xedtulo original:"]/text()'
>>>
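To check that an escaped label really matches the accented text in a parsed document, something like the following can help. It is only a sketch: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, the SAMPLE markup is a simplified, well-formed stand-in for the datosespectaculo block rather than the real page, and value_after_label is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Simplified stand-in for the "datosespectaculo" block (an assumption, not the real page).
SAMPLE = (
    u'<div class="datosespectaculo">'
    u'<strong>T\u00edtulo original:</strong><em>15.361</em><br/>'
    u'<strong>A\u00f1o:</strong><span>2014</span><br/>'
    u'</div>'
)


def value_after_label(root, label):
    """Return the text of the element directly following <strong>label</strong>."""
    children = list(root)
    for index, child in enumerate(children[:-1]):
        if child.tag == 'strong' and child.text == label:
            return children[index + 1].text
    return None


# Parse UTF-8 bytes, the same encoding the page reports.
root = ET.fromstring(SAMPLE.encode('utf-8'))

# The label is written with a \u escape, so no non-ASCII character appears in the source.
titulo = value_after_label(root, u'T\u00edtulo original:')
print(titulo)
```

Looking up u'Titulo original:' without the accent returns None here, which is a quick way to confirm that the comparison depends on the exact character.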