Cesar Ortiz: libxml2 leak in Python (HTML SAX parsing)?

Tuesday, January 09, 2007

I'm a bit frustrated with libxml2's HTML parser. We are leaking memory badly and I have no idea how to avoid it.
I've asked on the mailing list, but it hasn't helped much. Let's see if they can clarify things further... I'll update this post when I know more.

After finding that the biggest memory problem in our code was in libxml2's HTML SAX parser, I modified the test that ships with the distribution (pushSAXhtml.py) so that instead of processing 1 file it processes N, with N large enough to make the memory losses visible...

Well, after a little over 10 minutes of processing, the process dies with a segmentation fault.
The specific code in the loop that feeds the file to the parser is:

ctxt = libxml2.htmlCreatePushParser(handler, initdata, 0, inputFilePath)
ctxt.htmlParseChunk(data, len(data), 1)
ctxt = None

The callback does nothing:

class callback:
    """SAX handler: every callback is a no-op apart from the progress dot."""

    def startDocument(self):
        print(".")  # one dot per parsed document

    def endDocument(self):
        pass

    def startElement(self, tag, attrs):
        pass

    def endElement(self, tag):
        pass

    def characters(self, data):
        pass

    def warning(self, msg):
        pass

    def error(self, msg):
        pass

    def fatalError(self, msg):
        pass

Note: we see the memory leak at the operating-system level (with top, or by looking at /proc, for example with mem-monitor), because libxml2's own memory-checking routines report that everything is OK.
In short, with this problem we cannot take our code to production. Either we fix this or we have to look at another parser (BeautifulSoup is a possible candidate), which I would rather not do.
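
As a rough illustration of watching the leak from the operating-system side, here is a minimal sketch that reads the resident set size from /proc and reports how much it grows over repeated calls. It assumes Linux; the names rss_kib and measure_growth are hypothetical helpers, not part of libxml2 or of the test script.

```python
def rss_kib():
    """Return this process's resident set size in KiB (Linux /proc only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:     12345 kB".
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")


def measure_growth(work, iterations):
    """Call work() `iterations` times and return the RSS growth in KiB."""
    before = rss_kib()
    for _ in range(iterations):
        work()
    return rss_kib() - before


if __name__ == "__main__":
    # Placeholder workload that leaks nothing on purpose; replace it with
    # a function that creates the push parser, feeds it one file and
    # drops the context.
    print(measure_growth(lambda: None, 1000))
```

If the number keeps climbing as the iteration count goes up while the workload holds on to nothing, the growth is coming from somewhere below the Python code.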

Another thing that has frustrated me a lot is not finding anything on the web. Is nobody out there parsing HTML with libxml2 using SAX? I can't believe it...

Oh, one detail I forgot. In the code we assign None to the context. That is correct: it triggers the call to xmlFreeParserCtxt (which is equivalent to htmlFreeParserCtxt, since the latter calls the former). So there is nothing else to free, because we don't use the document. In fact, if we try to get the document, the Python bindings raise an exception.

To close the matter and make sure there is nothing odd in my environment, I wrote the equivalent example in C, and it ran without problems. Here is the code:


/* libxml2 C HTML Parser Example 
 * gcc  -I -L -lxml2
 */

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <libxml/HTMLparser.h>

#define BUFFER_SIZE 100000
#define NUM_ITERS  10000
#define FILE_PATH   ""

/*****************************************************************************/

/*
 * Foo context structure
 */
typedef struct
{  
 int foo;
} Context;

/*
 *  libxml start element callback function
 */
void startElement(void *voidContext,
                         const xmlChar *name,
                         const xmlChar **attributes)
{
 return;
}


/*
 *  libxml end element callback function
 */
void endElement(void *voidContext,
                       const xmlChar *name)
{ 
 return;
}



/*
 *  Text handling helper function
 */
void handleCharacters(Context *context,
                             const xmlChar *chars,
                             int length)
{
 return;
}



/*
 *  libxml PCDATA callback function
 */

void characters(void *voidContext,
                       const xmlChar *chars,
                       int length)
{ 
 return;
}



/*
 *  libxml CDATA callback function
*/

void cdata(void *voidContext,
                  const xmlChar *chars,
                  int length)
{  
 return;
}




htmlSAXHandler saxHandler =
{
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  startElement,
  endElement,
  NULL,
  characters,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  cdata,
  NULL
};


/*****************************************************************************/

char buffer[BUFFER_SIZE];

int main(void)
{
  htmlParserCtxtPtr ctxt;
  Context context;
  char *filepath=FILE_PATH;
  FILE *f = NULL;
  long fileLen = 0;
  long retRead = 0;
  int i=0;

  
  /*---- Reading the data */
  /*
  struct stat myStat;
  stat(filepath,&myStat);
  if (myStat.st_size >= BUFFER_SIZE)
  {
   printf("File does not fit in the buffer\n");
   return 1;
  }
  */
  
  f = fopen(filepath, "r"); 
  
  if ( f == NULL )             /* Could not open file */
  {
    printf("Error opening %s: %s (%d)\n", filepath, strerror(errno), errno);
    return 1;
  }
  
  fseek(f, 0L, SEEK_END); /* Position to end of file */
  fileLen = ftell(f);    /* Get file length */
  rewind(f);              /* Back to start of file */
  
  if (fileLen >= BUFFER_SIZE)
  {
   printf("File does not fit in the buffer: %ld\n", fileLen);
   return 1;
  }
  
  
  retRead = fread(buffer,fileLen,1,f);  
  if (retRead != 1)
  {
   printf("Error reading the file\n");
   return 1;
  }
  fclose(f);
  
  /*---- Parse the file NUM_ITERS times, with a fresh context each time */
  for (i = 0; i < NUM_ITERS; ++i)
  {
    ctxt = htmlCreatePushParserCtxt(&saxHandler, &context, "", 0, "",
                                    XML_CHAR_ENCODING_NONE);
    htmlParseChunk(ctxt, buffer, fileLen, 0);  /* feed the whole file */
    htmlParseChunk(ctxt, "", 0, 1);            /* terminate parsing */
    htmlFreeParserCtxt(ctxt);
    printf(".\n");
  }
   
  
  return 0;
}

To get it running, adjust the constants BUFFER_SIZE, NUM_ITERS, and FILE_PATH to taste.

4 comments:

cesarob said...: Well, it looks like there was a leak after all. According to Daniel Veillard:
"We investigated the problem with William, and I think we have it nailed
down in SVN: Committed revision 3573 , it was basically reference counting of
sttribute strings passed in a dictionnary which were not decremented leading
to the leak.".
The patch applies to python/libxml.c.; 11:29 a. m.
cesarob said...: The website doesn't say where the SVN server is. This is the URL: http://svn.gnome.org/viewcvs/libxml2.; 11:35 a. m.
cesarob said...: Well, the previous URL is for browsing the repository over the web. Let's see if I can find the right URL for svn...; 12:05 p. m.
cesarob said...: OK, I've got the command: svn co http://svn.gnome.org/svn/libxml2/trunk libxml2.
More info about svn at GNOME here.; 2:42 p. m.
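
To illustrate the kind of bug described in Daniel Veillard's reply (a reference count that is incremented and never decremented), here is a small Python sketch. It is not the libxml2 patch and does not use libxml2 at all: the AttrValue class and the function name are hypothetical, and it assumes CPython, whose Py_IncRef and Py_DecRef functions are reached through ctypes.

```python
import ctypes
import weakref


class AttrValue(object):
    """Hypothetical stand-in for an attribute string handed to a callback."""


# CPython's function forms of the reference-counting macros.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None
_decref = ctypes.pythonapi.Py_DecRef
_decref.argtypes = [ctypes.py_object]
_decref.restype = None


def unbalanced_incref_demo():
    """Return (leaked, freed) for an object given one extra reference."""
    value = AttrValue()
    probe = weakref.ref(value)   # lets us check whether the object is alive

    _incref(value)               # extra reference, as C glue code might take
    del value                    # drop the last Python-level reference

    # Without a matching decrement the object can never be freed.
    leaked = probe() is not None

    survivor = probe()
    _decref(survivor)            # the decrement that was missing
    del survivor
    freed = probe() is None
    return leaked, freed


if __name__ == "__main__":
    print(unbalanced_incref_demo())
```

Every object stuck in the "leaked" state stays allocated for the life of the process, which fits a leak that shows up in top but not in libxml2's own accounting.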
